Bohemian Accuracy

Insider’s view

December 7, 2023

In his last article, War of the Words, Recordsure’s CTO Kit Ruparel explored the battleground between the makers of Large Language Models (LLMs) and the capabilities of GPTs to accept and generate human language.

Much has evolved in the intervening five months. The most significant development is that some of these LLMs have been promoted by the major cloud vendors into their accessible and robust operating infrastructures, opening the floodgates for secure enterprise use.

In his latest article, Kit explores the strengths and weaknesses of Gen-AI – in comparison to more traditional machine learning solutions – for reliably finding and presenting information from internal documents and other written or transcribed data sources.

The linguistic artistry of Generative AI is not suitable for applications that demand certainty, as current state-of-the-art techniques for integrating Generative AI with enterprise data sources leave plenty of room for interpretive error.

“Is this the real truth?

Is this just fantasy?

Facts should be verified –

no escape from ChatGPT.

Open your eyes,

look to A.I. with veracity…”

Generative Pre-trained Transformers (GPTs) have a place in the enterprise AI ensemble – just don’t let them sing solo.

I believe in syllables

In the beginning, computers taught us to trust them.

Our spreadsheets always summed our columns of data correctly; our secure messaging systems ensured that what was received was always the same as what was sent; and our GPS always got us to our destination (well, nearly always).

Then came the Web with its Internet search tools – and our trust relationship had to change.

Suddenly came the need for a ‘human in the loop’.

We could no longer assume that the top-ranked ‘answer’ to our research question was correct. We found ourselves rephrasing our query multiple ways and reading through numerous result links in order to aggregate a consensus answer from the various sources that we deemed authoritative based on our own education and biases.

Chat-based Gen-AI, however, tempts us to regress: to believe that the words we get back are accurate simply because they respond to the words we supplied. In the War of the Words post, I presented a key reason for this: for the first time we are using human language to interact with our information sources, and in doing so we assign trust to the suddenly human persona of these systems. It’s not our fault. It’s human nature.

Additionally, these systems give the impression that they are doing all of the research for us: aggregating an answer from every available information source, so that we no longer have to rephrase our search multiple times and crawl through innumerable potential links. That makes us believe all the more that their answers are authoritative.

We shouldn’t… and I shall endeavour to explain why.

It started with a miss

The first record to set straight is any belief that when you use Gen-AI to search your enterprise documents or data sets, it’s considering all of your data when generating an answer.

Unless you’ve retrained the GPT on all of your data, it simply isn’t.

Instead, you (or your engineers) will be using Retrieval Augmented Generation (RAG) or similar techniques to pass a tiny subset of your data into the Gen-AI for consideration, for each query made.

Where does this subset come from?

Well: search is the answer, usually – and more specifically, something called ‘vector search’.

Effectively, the first step in the chain is that the system will be doing the same as you would if you were doing an Internet search – or in this enterprise use-case, an Intranet search.

Note for technical readers:

This article assumes the common method of using RAG techniques over enterprise document chunks stored in a vector database. It is also pertinent to the emerging X-of-Thought methods, which all, at their heart, use vector-embedding similarity searches to supply grounding context to the generative LLM.

I’ve used Microsoft Azure’s OpenAI GPT LLMs and vector embeddings on text block lengths of between 150 and 2,000 words for the experiments that led to this article. Further research suggests that my findings would be similar for any of the other language-focused GPT models available.

The system is taking your question, performing a search to find snippets of your documents that have ‘semantic similarity’ (jargon for ‘similar meaning’) and then passing the first page of snippets of search results (not the whole of the documents found, just the search result snippets) into the Gen-AI LLM – and is effectively saying “just base your answer on these result snippets.”
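As a minimal sketch of the retrieval step just described, the code below ranks document chunks against a question and builds a prompt from only the top few. Everything in it is an assumption made for illustration: the `embed` function is a toy bag-of-words stand-in for a real embedding model (which would compare meaning rather than shared words), and the chunk texts, function names and prompt wording are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, chunks, k):
    """Return the k chunks most similar to the question (the 'first page')."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(question, snippets):
    """Assemble the grounding context and the question into one prompt."""
    context = "\n".join(f"- {s}" for s in snippets)
    return f"Base your answer only on these snippets:\n{context}\n\nQuestion: {question}"

# Hypothetical document chunks
chunks = [
    "Staff voted on the best song by the band Hot Chocolate.",
    "You Sexy Thing opened the office playlist.",
    "It Started with a Kiss closed the night.",
    "The canteen menu changes every Monday.",
]
question = "Which is the best song by the band Hot Chocolate?"

top = retrieve(question, chunks, k=2)
prompt = build_prompt(question, top)
print(prompt)
```

Only the snippets that survive the cut-off at `k` ever reach the LLM. Whatever ranks lower, however relevant, is invisible to it.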

That’s a very hit-and-miss approach to educating an AI on the facts it’s allowed to use in answering your question.

You inherently know this from the number of times you have to rephrase your Internet search query, until you eventually find the results you believe are most relevant.

The answer to the question “Which is the best song by the band Hot Chocolate?” is subjective. But whether you believe the answer is ‘It Started with a Kiss’ or ‘You Sexy Thing’, none of the words in those alternative answers appear in the question. The vector search will therefore only find candidate answers if your documents contain words similar to your question in close proximity to each of the candidate answers.

Therefore, the Gen-AI only stands a chance of getting the correct answer if:

All of the candidate information it needs to consider is in close proximity in your documents to words that are similar to your question.

All of the information that needs to be considered can be returned in the first ‘page’ of search result snippets that your question matches against.

Professor Noah Giansiracusa eloquently covered AI bias in his recent article, and this is effectively what we’re considering here – a biased answer is given because the Gen-AI is ‘trained’ (in realtime, so OK, ‘grounded’) with a small sample of information, skewed somewhat randomly to the words in your question – but it’s not trained upon a set of correct-versus-incorrect answers.

The technical reason for all of the above is that each Gen-AI LLM has a constraint on the amount of information it can work with per query. These ‘token limit’ constraints are getting higher by the month as new models are released, but for as long as we’re limited to techniques such as RAG, the Gen-AI simply cannot consider all of the information available to it in a typical enterprise scenario.

You say potato, I say tomato. Let’s call the whole thing off

Another challenge for Gen-AI applications, perhaps surprisingly, is how good vector embeddings and LLMs now are at performing semantic similarity matches.

While researching this article, some of my experiments involved trying to get factual answers from documents and conversations about an individual’s current tax position.

Using OpenAI (securely, via Azure’s Cognitive Services implementation), I found that, when used in discrete statements, phrases containing the term ‘Individual Savings Account’ or its abbreviation ‘ISA’ had about a 90% semantic similarity in the embedding model to sentences containing the words ‘Capital Gains Tax Allowance’.

This is hardly surprising, given the fact that they’re both recommended UK methods for reducing your tax position.

However, when these discrete terms sat among other words in the source documents, in context blocks of 200 words, the surrounding words significantly diluted the distinctiveness of the shorter phrases. For a given query, the blocks then often returned similarity scores within 0.1% of each other.

Therefore, asking a question such as “Has the customer used their Capital Gains Tax Allowance for the last year?” stood absolutely no chance of getting the correct answer if the customer had used their ISA allowance.
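The dilution effect described in this section can be illustrated with a toy model. This is a sketch under loud assumptions, not a description of how any real embedding model works: it pretends a chunk’s embedding is the simple average of one key-term vector and `n_context` copies of a shared ‘surrounding words’ vector, and the three-dimensional vectors are invented for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def chunk_vector(term, context, n_context):
    """Average one key-term vector with n_context copies of a shared context vector."""
    total = n_context + 1
    return [(t + n_context * c) / total for t, c in zip(term, context)]

# Hypothetical vectors, chosen only to make the arithmetic easy to follow
ISA = [1.0, 0.0, 0.0]
CGT = [0.0, 1.0, 0.0]
CONTEXT = [0.0, 0.0, 1.0]  # the words both chunks have in common

def chunk_similarity(n_context):
    """Similarity between an 'ISA' chunk and a 'CGT' chunk sharing the same context."""
    return cosine(chunk_vector(ISA, CONTEXT, n_context),
                  chunk_vector(CGT, CONTEXT, n_context))

for n in (0, 1, 5, 20):
    print(n, round(chunk_similarity(n), 4))
```

In this toy the similarity works out to n² / (1 + n²), so two chunks that are entirely distinct on their own become almost indistinguishable once enough shared context surrounds them.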

I just want to make words for you

Let’s jump forward in the process now to your Gen-AI actually formulating an answer.

If I ask you a question, your brain’s cognition interprets my words conceptually then draws on its undoubtedly vast sum of knowledge and experience to conceptualise an answer. Finally, you turn that conceptual answer into words, to convey back to me.

That’s not how LLMs work. They have no method of cognition, nor any concept of a concept, nor any concept of an ‘answer’ – at least in the way that you or I understand it.

I’m trivialising their methods somewhat in saying this, but they are essentially going straight to words. Given your question, they decide what the best first word to present back to you is. Then they decide what the next most logical word should be in the sequence and give you that. Then the third word, and so forth…
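The word-by-word loop can be sketched as follows. The probability table is hypothetical and stands in for a trained model; a real LLM works on tokens rather than whole words and conditions each choice on the entire preceding sequence, not just the previous word.

```python
# Hypothetical next-word probabilities, standing in for a trained LLM
NEXT_WORD = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"customer": 0.7, "allowance": 0.3},
    "a": {"customer": 1.0},
    "customer": {"used": 0.8, "<end>": 0.2},
    "used": {"the": 0.1, "their": 0.9},
    "their": {"allowance": 1.0},
    "allowance": {"<end>": 1.0},
}

def generate(max_words=10):
    """Build a sentence by repeatedly picking the most likely next word."""
    words, current = [], "<start>"
    for _ in range(max_words):
        options = NEXT_WORD[current]
        current = max(options, key=options.get)  # greedy choice
        if current == "<end>":
            break
        words.append(current)
    return " ".join(words)

print(generate())  # prints "the customer used their allowance"
```

Nothing in the loop represents an ‘answer’ as a whole. The output is simply wherever the chain of individually likely words ends up.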

Because GPTs are not actually providing an ‘answer’, merely a sequence of likely words, they have no way of measuring their own confidence in the accuracy of the answer they give you.

If you’ve worked with machine learning at all, you’ll know that neural network ‘confidence’ is absolutely the key to determining the ‘threshold’ – the level of trust – you’re going to put in an AI model.

Data scientists will thoroughly test a model, and based on the business needs will prescribe a confidence threshold, essentially saying: “We will only trust the output of the AI if it is accompanied by a confidence score of 92.36% or above.”

You can’t do that with Gen-AI as there is no ‘answer’, there is no ‘confidence’ and there is therefore nothing to test against in order to set a routine threshold for the accuracy of the outputs.
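For contrast, the thresholding that a traditional classifier allows can be sketched as below. The labels and raw scores are hypothetical, and the 0.9236 threshold simply echoes the illustrative figure quoted above. Treating a softmax output as the model’s confidence is itself a simplifying assumption.

```python
import math

def softmax(scores):
    """Turn raw model scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores, labels, threshold):
    """Return (label, confidence), or (None, confidence) if below the threshold."""
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probs[i])
    if probs[best] >= threshold:
        return labels[best], probs[best]
    return None, probs[best]  # not trusted: route to a human reviewer

LABELS = ["allowance used", "allowance not used"]  # hypothetical classes

print(classify([4.0, 0.5], LABELS, threshold=0.9236))  # confident enough to accept
print(classify([1.0, 0.5], LABELS, threshold=0.9236))  # too uncertain, returns None
```

The threshold only works because the model emits a single decision with a single score attached. That is the property the preceding paragraphs argue a generated sequence of words lacks.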

Whatever will be, won’t be

Another major problem with most Gen-AI LLMs is their inconsistency.

Traditional machine learning models are deterministic: if you give them the same inputs, they will give you the same outputs. Most Gen-AI LLMs are not.

If you’ve experimented with the freely available online chat products, you’ll have experienced this. Ask the same question twice and you will get a different answer each time.

There are two known reasons for this.

The first is technical, and is down to the structure of the LLMs. It is widely believed, although not confirmed by OpenAI, that their latest GPT models use ‘Sparse Mixture of Experts (MoE)’ techniques.

In plainer English, this means that they’re not one big model but actually a composite of smaller models that compete to give the ‘best’ answer, each having been given only a subset of the input data. Due to the parallel and competing manner in which that works, sometimes one of the mini-models wins. Sometimes another wins.

The second reason for inconsistency is that it is deliberate. Please remember that these generative models have been designed to be creative. Whether you want to generate song lyrics, marketing copy, presentation bullets or pictures of mermaids, the creators of the GPT models need them to demonstrate their free-spirited variety, their eccentric (bohemian, if you will) artistry, the many-sidedness to their nature. After all, we can’t have every student turning in the same essay, can we?


However, what’s good for creativity is bad for determinism. Solutions looking to establish facts, robustly and consistently, using the current breed of Gen-AI will fail.
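The deliberate, creativity-driven source of variety is commonly implemented by sampling the next word at random in proportion to its score, controlled by a ‘temperature’ setting. The sketch below is a toy illustration with hypothetical scores. It models only this second cause of inconsistency, not the architectural one, and it makes no claim about how any particular hosted model behaves.

```python
import math
import random

def sample_next(scores, temperature, rng):
    """Pick a next word from raw scores; temperature 0 always takes the top score."""
    words = list(scores)
    if temperature == 0:
        return max(words, key=scores.get)
    scaled = [scores[w] / temperature for w in words]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # softmax weights, unnormalised
    return rng.choices(words, weights=weights, k=1)[0]

SCORES = {"kiss": 2.0, "miss": 1.5, "wish": 1.0}  # hypothetical model scores
rng = random.Random(42)

greedy = {sample_next(SCORES, 0, rng) for _ in range(200)}
creative = {sample_next(SCORES, 1.0, rng) for _ in range(200)}

print("temperature 0 :", sorted(greedy))
print("temperature 1 :", sorted(creative))
```

At temperature 0 the toy always returns the same word. At temperature 1 the same input yields different words on different calls, which is exactly what creative generation wants and fact-finding does not.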

It ain’t what you say, it’s the way that you say it

For completeness I have to mention ‘Prompt Engineering’ – although I won’t go into any detail, given it’s probably the most widely written-about topic when it comes to Gen-AI.

Suffice it to say: just as you get out what you put in with a standard Internet search query, this is even more true of Gen-AI queries and prompts.

To get good results from Gen-AI, you end up with some very long prompts behind the scenes, often running into many hundreds of words. Again, this is easily understood if you search ‘prompt engineering’ in your favourite Internet trawler. What is less widely written about for enterprise uses is that the longer your prompt, the fewer words (~tokens) you then have available to feed the Gen-AI your grounding context, i.e. your input data.

Therefore, the more accurate you try to make your answer, the less data you can give your Gen-AI to consider!
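The trade-off is simple arithmetic, sketched below. All of the numbers are illustrative assumptions rather than the limits of any particular model: a 4,096-token window, 500 tokens reserved for the generated answer and 300 tokens per grounding snippet.

```python
def context_budget(context_window, prompt_tokens, reserved_for_answer):
    """Tokens left for grounding snippets once the prompt and answer are allowed for."""
    remaining = context_window - prompt_tokens - reserved_for_answer
    return max(remaining, 0)

def snippets_that_fit(budget, snippet_tokens):
    """How many whole snippets of a given size fit into the budget."""
    return budget // snippet_tokens

WINDOW = 4096  # illustrative token limit

for prompt_tokens in (200, 800, 1600):
    budget = context_budget(WINDOW, prompt_tokens, reserved_for_answer=500)
    print(prompt_tokens, budget, snippets_that_fit(budget, snippet_tokens=300))
```

Under these assumptions, growing the engineered prompt from 200 to 1,600 tokens cuts the number of snippets the Gen-AI can consider from 11 to 6.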

You ain't seen nothin' yet

Above, I’ve demonstrated that Gen-AI is non-deterministic (inconsistent), biased (discriminative), incapable of measuring the quality of its responses and generally not suited to any task where you have one chance to get a factually correct (authoritative) result from unseen enterprise data, documents or transcripts.

Does that mean it doesn’t have a role in the enterprise AI stack? Of course not – it mostly means that it must be constrained to the ‘human in the loop’ use cases for which it was intended: query/chat-based research and generative creativity.

Why ‘mostly’? …Gen-AI’s ability with natural human language does make a convenient and pleasant computer-to-human user interface – either way around.

And this is the area where Recordsure is making the most of GPTs in our ReviewAI product suite.

But do we trust Gen-AI with our core product tasks of ‘finding the right stuff’? For all the reasons described above: no, we don’t. It’s not what the current cohort of Gen-AI is designed for and it simply cannot be made to work reliably.

Our clients require proven accuracy and demonstrable consistency, and our expertise with ‘traditional’ AI remains the way to continue delivering that; no company working on Responsible AI should tell you otherwise.

However, now that Gen-AI is secure, safe and affordable for use by enterprises (thanks largely to the major cloud vendors), Recordsure is blending traditional machine learning and Gen-AI to improve the user experience, as Gen-AI has a good ability to ‘understand’ and generate human language.

We’d be foolish not to – but rather than as a solo artist, GPTs will likely remain a harmony singer in our AI troupe for a little while yet.

And fingers crossed, with the rate of industry evolution at present, it may not be too long until the mega-AI companies crack the three big generative issues: determinism, discrimination and authority.

Images courtesy of Generative AI. Words wholly not so – although, it would probably have done a better job at song title puns than this author.
