Homepage for Stephen Temple

Making AI Platforms More Trustworthy

LOW-COST ERROR SUPPRESSION PROTOCOL WHEN USING AI LLM PLATFORMS FOR IMPORTANT WORK

Summary

One cannot trust the results of a single AI Platform for work where accuracy is vital. But that does not mean AI Platforms cannot be used for important work.

The solution is to add a quality control system that detects and suppresses errors made by the LLMs. Commercial systems exist to do this but are expensive. The one we describe here is free and can be used by anybody.

It works on the principle of using more than one AI Platform. We recommend using three. The three independent AI Platforms are used in three very different functional modes.

  • In the first functional mode the AI Platforms are tasked to do the same analysis (or whatever the job is) independently.
  • In the second functional mode the AI Platforms are redirected to being forensic detectives. They are given all the outputs produced in the first mode and instructed to ruthlessly hunt for any errors. They do this independently.
  • In the third functional mode they are each given the reports of all the errors that have been identified in their outputs. They are told to either accept they have made an error and correct it, or reject the claim in the error reports and justify why.

The approach results in three error-suppressed independent outputs to triangulate between. It was used for the research phase of the book "The Graveyard of Good Intentions", with a fourth mode added, termed "the shoot-out", invoked where there was significant divergence between the three outputs.
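The three functional modes can be sketched as a single data flow. The Python below is purely illustrative: run_protocol and the stub platform functions are our own naming, standing in for the manually issued prompts, and the toy numbers simply show one deliberate error being caught and corrected.

```python
def run_protocol(task, analyse, detect, reconcile):
    """Data-flow sketch of the three functional modes.

    analyse, detect and reconcile map a platform name to a stub
    function standing in for a manually issued prompt.
    """
    # Mode 1: each AI Platform performs the same analysis independently.
    outputs = {name: fn(task) for name, fn in analyse.items()}

    # Mode 2: each platform, acting as a forensic detective, reports the
    # errors it finds across ALL of the mode-1 outputs.
    reports = {name: fn(outputs) for name, fn in detect.items()}

    # Mode 3: each platform receives every error attributed to it and
    # either corrects its output or rejects the claim.
    own_errors = {name: [e for r in reports.values() for e in r.get(name, [])]
                  for name in outputs}
    return {name: fn(outputs[name], own_errors[name])
            for name, fn in reconcile.items()}

# Toy stubs: "Chat GPT" makes a deliberate arithmetic slip; all three
# detectives flag it; reconciliation adopts the corrected score.
analyse = {
    "Claude": lambda task: {"score": 24},
    "Chat GPT": lambda task: {"score": 30},   # the deliberate slip
    "Gemini": lambda task: {"score": 24},
}
detect = {name: (lambda outs: {"Chat GPT": ["arithmetic error in total"]})
          for name in analyse}
reconcile = {name: (lambda out, errs: {"score": 24} if errs else out)
             for name in analyse}

result = run_protocol("case study text", analyse, detect, reconcile)
```

In this toy run the mode-3 outputs all agree, leaving three error-suppressed results to triangulate between.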

1. Introduction

The reasons why we used an AI Platform for evaluating case studies against the Ten Golden Rules in the first place were:

  1. To Take Human Bias Completely Out of the Loop - the only operational interaction had to be solely between the Case Study being evaluated and the Ten Golden Rules
  2. To Ensure Consistency of the Evaluation of Different Case Studies - so the outcomes from different case studies could be compared
  3. To Be Able to Reproduce the Results of a Case Study Evaluation
  4. To Do Complex Evaluations Speedily

But the current generation of LLMs is prone to making errors, drifting and hallucinating. That was unacceptable. The whole purpose of our book was to stimulate debate of the research results, not discussion of our methodology. There are already widely available commercial quality control solutions for LLMs being used for critical industrial applications. They orchestrate running tasks across multiple AI Platforms, apply harnesses to enforce processes, apply consistency checks and provide an audit log. But they were beyond our means. We developed our own.

It is based on some earlier modelling research using two AI Platforms at the University of Surrey 6GIC. The research there showed that asking the two independent AI Platforms to work cooperatively together to agree what the right answer was failed to eliminate the errors. They were more likely to arrive at “split the difference” compromises. The only thing that worked effectively was a “confrontation” approach - where the function of the independent LLMs was changed from their analysis function to challenging the results from the other AI Platforms.

We applied this principle to our own desktop error suppression protocol. In order to reach industrial-grade error suppression we used three independent Large Language Models. The magic of using "three" is that the results of each LLM are attacked by two LLMs, both acting as independent forensic detectives. Three results also allowed an even more dependable outcome to be determined through triangulation.

The only disadvantage of this low-cost error suppression protocol is that data has to be transferred manually between AI Platforms by copying and pasting results. Later we show how this was done efficiently, so the end-to-end process only took twenty minutes.

2. Multiple-AI Platforms

The three AI Platforms we used were Anthropic Claude, Chat GPT and Gemini. All three turned out to be diligent forensic detectives. As with all work with LLMs, the phrasing of the instructions is critical - not only did errors of fact need to be detected but also arithmetic errors, errors of logic and errors of judgement.

We then turned this approach into a repeatable four-step quality control error suppression protocol.

3. The Four-Step Error Correction Protocol

The same task is run independently on each AI Platform. Hopefully the conclusions converge, but how each got there will differ, as their training data and operational settings are different.

  • Step 1 – Attack phase. The character of all the LLMs is changed from deep analysts to forensic detectives. They are given instructions to rigorously seek out any errors of fact, logic, arithmetic and hallucinations in the results from all the AI Platforms.
  • Step 2 – Confrontation phase. The error reports from all three LLMs are consolidated and given back to all three LLMs. This time the instruction to each LLM is to review all the errors attributed to it and, in every case, either accept or reject the claim that it has made an error. If it rejects a claim, it has to justify why.
  • Step 3 – Correction phase. Each LLM is instructed to put right the errors it has owned up to, re-assess its evaluation, and produce a revised output. It must also set out why it rejects any error the other AI Platforms claim to have identified.
  • Step 4 – Shoot-out between the extreme scorers. Our particular evaluation approach generated an overall score. This was not the principal output but was a helpful measure of output divergence. When the divergence is significant, a shoot-out takes place between the two AI Platforms at the extreme ends of the divergence. It takes the form of instructing each AI Platform to make the best case for why the other Platform's output is the correct one and then compare this with its original reasoning.

This last step is optional and is possibly the most distinctive feature of our error suppression protocol compared with standard commercial approaches. What our "shoot-out" step could expose was interpretative differences leading to different outcomes.

The traditional approach, in comparing the outputs of three evaluations, is to assume the minority LLM scorer has got it wrong. In one of our shoot-outs the minority scorer stood its ground, and both majority high scorers conceded their judgements had been wrong.
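The trigger for the shoot-out step can be made concrete with a small sketch. shootout_pair and its threshold of four points are illustrative assumptions of ours - the protocol itself leaves "significant divergence" to the judgement of the person running it.

```python
def shootout_pair(scores, threshold=4):
    """Pick the two platforms at the extreme ends of the score spread,
    or return None when the divergence is too small for a shoot-out.

    threshold is an illustrative value, not part of the protocol.
    """
    low = min(scores, key=scores.get)
    high = max(scores, key=scores.get)
    if scores[high] - scores[low] < threshold:
        return None   # conclusions close enough: skip the optional step
    return low, high
```

For example, scores of 21, 27 and 22 out of thirty would pit the lowest and highest scorers against each other, while a spread of only two points would skip the step.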

4. Evidence of the effectiveness of the error suppression protocol

Over this one evaluation run of the three case studies there were a total of thirty-one errors found and corrected. The number per LLM over the three case study evaluations ranged from seven to twelve. These numbers are given in the book to quantify the success of the error suppression protocol.

Examples of LLM mistakes included:

  • An arithmetic error by one LLM. The excuse given was that, in heavily loaded conditions, the LLM was required to prioritise reasoning over simpler functions, like arithmetic.
  • A hallucination, where the LLM picked up an example appearing in the prompt interpretation guidance and used it as if it existed in the case study to justify a score.
  • An error of fact, where the LLM concluded there was no enacted commitment. It had missed the statement of a £21m AI Diagnostics fund.
  • Misclassifying a commitment that had been implemented as only being a future intention.

This got us to the point of having all the power of AI with its errors and misjudgments suppressed to the extent possible with the forensic abilities of the three LLMs we used.

Further, if any remained, they would be revealed when triangulating the corrected outputs of the three independent evaluations from the three AI Platforms.

5. Manual Workflow

It was easiest when working with the three AI Platforms to use a single blank Word document into which to paste the output from all three. As each output was long, each was pasted in a different text colour, e.g. orange for Claude, blue for Chat GPT and green for Gemini. This made it very easy to find the start and finish of each output when adding the AI Platform titles. It also allowed speed scrolling to find a specific AI Platform result if that was required.

Then the new Prompt instructions were added at the top of the composite document. Below are the Prompt instructions used for each protocol step. The entire Word document was then pasted (or uploaded) back to all three platforms.
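The composite document can be described in code. build_composite is a hypothetical helper of ours: in practice the assembly was done by hand in Word, where text colour did the job that the heading lines do here.

```python
def build_composite(prompt, outputs):
    """Assemble the single document pasted back to every platform:
    the step's Prompt Instruction on top, then each platform's
    output under a clearly marked title."""
    parts = [prompt]
    for name, text in outputs.items():
        parts.append(f"===== {name} =====")
        parts.append(text)
    return "\n\n".join(parts)

doc = build_composite(
    "Find and identify all errors of fact, logic and arithmetic.",
    {"Claude": "evaluation A", "Chat GPT": "evaluation B",
     "Gemini": "evaluation C"},
)
```

The same helper serves every protocol step; only the prompt on top and the pasted outputs change.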

The first three steps would typically take twenty minutes. If the fourth step was required it would take longer, but that was time well spent, as it gave insights into the judgmental differences at work.

5.1 Protocol Step 1 Error Detection

Prompt Instruction - The following are the analysis results of three AI Platforms on a case study to evaluate their alignment with the ten golden rules for global leadership. It is vitally important that the evaluations contain no errors. Acting as a ruthless forensic detective, examine all three evaluation results to find and identify all errors of fact, logic, arithmetic and hallucination. List the errors you have found and their nature under the title of the AI Platform that made them. Then give the number of errors you found in the evaluation report of each AI Platform.

5.2 Protocol Step 2 Reconciliation of Errors Found

Prompt Instruction - Inspect the document uploaded. It is a consolidation of all the errors you have found and the errors the other two AI Platforms found. Look at all the errors you and the other two AI Platforms have found in your evaluation report and carefully examine them. If you accept that you have made these errors, then correct your evaluation report, including revising your overall score out of thirty if that was affected. Then list the errors that you have rejected, with a one-line explanation as to why. Finally, summarise the number of errors you have accepted and corrected for in your revised evaluation and the number you have rejected.

5.3 Protocol Step 3 - Inspection of Errors Alleged but Rejected

The action is to look at the alleged errors each AI Platform has rejected and the reasons given. If any look consequential, the action is to go back to the AI Platform or Platforms making the allegation with the reason given for the rejection and see if the Platform or Platforms alleging the error sustain their position. When this arises it is usually not an error but a difference of judgement. That can be illuminating.

5.4 Protocol Step 4 - The Shoot-Out

The shoot-out step was only taken when there was a large spread between the scores or one score was significantly out of line with the others.

Prompt Instruction - The objective of the shoot-out is to ascertain whether the score differences are the result of an error not so far detected or of a difference in judgement. The shoot-out requires you to look at your opponent's score and reasoning and make a reasoned case why your opponent has the right score and yours is wrong. This is to test out the difference in judgement and whether one or other AI Platform has the better reasoned judgement. Evaluate your judgement against the best reasoned case you have made for your opponent's score. Then say if you stand by your score or you concede the other AI Platform has either flushed out an error or has a superior reasoned judgement for its case study score. If you concede, then give your amended score and whether you are doing that because you have accepted you made an error or because the other AI Platform has the better reasoned judgement for this case study.

Your opponent for the shootout is...

5.5 Triangulation

The application of the error suppression protocol results in three independent outputs. Residual differences are likely to come down to differences in the training data and the judgmental reasoning of each AI Platform. How to proceed with them depends on the specific task. That is for the human brain directing the task to determine.
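One minimal way to triangulate the three corrected scores is to take the median as the headline figure and the spread as a confidence signal. triangulate and its spread_limit are our own illustrative choices, not part of the published protocol.

```python
import statistics

def triangulate(scores, spread_limit=3):
    """Summarise three corrected scores: median as the headline figure,
    spread as a signal of residual divergence.

    spread_limit is an illustrative threshold of ours.
    """
    values = sorted(scores.values())
    spread = values[-1] - values[0]
    return {"median": statistics.median(values),
            "spread": spread,
            "converged": spread <= spread_limit}

summary = triangulate({"Claude": 22, "Chat GPT": 24, "Gemini": 23})
```

A small spread suggests the three independent outputs have converged; a large one would have sent us back to the shoot-out step.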

5.6 Full Traceability

As this was a research project, a fifth step was added during the phase of fine-tuning the Prompt: all the errors found and accepted from all three Platforms in Step 2 were collected and pasted into a Word document as an error log record.
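An error log like this could equally be kept as a machine-readable audit trail. The CSV schema below is a hypothetical sketch of ours; the actual log was a Word document.

```python
import csv
import io

# Hypothetical schema for one accepted-error record.
FIELDS = ["case_study", "platform_in_error", "found_by", "nature", "accepted"]

def error_log_csv(records):
    """Render consolidated Step 2 error records as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

log = error_log_csv([
    {"case_study": "Case Study 1", "platform_in_error": "Claude",
     "found_by": "Gemini", "nature": "arithmetic", "accepted": "yes"},
])
```

One row per accepted error would make the per-LLM error counts quoted in the book directly recomputable.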

6. Conclusion

The importance of our error suppression protocol is that it removes the barrier to more widespread day-to-day use of AI for complex evaluations in both the public and private sectors. In this way everyone can be empowered to reach higher performance standards, in both quality and speed.

If time is of the essence then step 4 would be dropped.

The protocol could work with only two AI Platforms and still deliver a more certain result than a single AI Platform. In that case the two platforms best coupled would be Chat GPT (more creative) and Claude (more forensic).




All content here (c) copyright of Stephen Temple