LOW-COST ERROR SUPPRESSION PROTOCOL WHEN USING AI LLM PLATFORMS FOR IMPORTANT WORK
Summary
One cannot trust the results of a single AI Platform for work where accuracy is vital. But that does not mean AI Platforms cannot be used for important work.
The solution is to add a quality control system that detects and suppresses errors made by the LLMs. Commercial systems exist to do this but are expensive. The one we describe here is free and can be used by anybody.
It works on the principle of using more than one AI Platform. We recommend using three. The three independent AI Platforms are used in three very different functional modes.
The approach results in three error-suppressed independent outputs to triangulate between. It was used for the research phase of the book "The Graveyard of Good Intentions", with a fourth mode added, termed "the shoot-out", for cases where there was significant divergence between the three outputs.
1. Introduction
There were several reasons why we used an AI Platform for evaluating case studies against the Ten Golden Rules in the first place.
But the current generation of LLMs is prone to making errors, drifting and hallucinating. That was unacceptable: the whole purpose of our book was to stimulate debate about the research results, not discussion of our methodology. There are already widely available commercial quality-control solutions for LLMs used in critical industrial applications. They orchestrate the running of tasks across multiple AI Platforms, apply harnesses to enforce processes, apply consistency checks and provide an audit log. But they were beyond our means, so we developed our own.
It is based on earlier modelling research using two AI Platforms at the University of Surrey 6GIC. That research showed that asking two independent AI Platforms to work cooperatively to agree the right answer failed to eliminate the errors: they were more likely to arrive at "split the difference" compromises. The only thing that worked effectively was a "confrontation" approach, where the function of each independent LLM was changed from analysis to challenging the results of the other AI Platform.
We applied this principle to our own desktop error suppression protocol. To reach industrial-grade error suppression we used three independent Large Language Models. The magic of using three is that the results of each LLM are attacked by two LLMs, both acting as independent forensic detectives. Three results also allowed an even more dependable outcome to be determined through triangulation.
The only disadvantage of this low-cost error suppression protocol is that data has to be transferred manually between AI Platforms by copying and pasting results. Later we show how this was done efficiently, so the end-to-end process only took twenty minutes.
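The cross-challenge pattern at the heart of the protocol can be sketched as follows. This is an illustrative sketch only: the function name and the use of ordered pairs are our own framing, not part of the protocol itself.

```python
from itertools import permutations

def cross_review_assignments(platforms):
    """Return (reviewer, author) pairs so that every platform's output
    is reviewed by each of the other platforms. With three platforms,
    each result is attacked by two independent forensic detectives."""
    return list(permutations(platforms, 2))

platforms = ["Claude", "Chat GPT", "Gemini"]
assignments = cross_review_assignments(platforms)
for reviewer, author in assignments:
    print(f"{reviewer} reviews the output of {author}")
```

With three platforms this yields six review passes, which is why three is the sweet spot: two detectives per result, at a manageable copy-and-paste cost.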
2. Multiple-AI Platforms
The three AI Platforms we used were Anthropic Claude, Chat GPT and Gemini. All three turned out to be diligent forensic detectives. As with all work with LLMs, the phrasing of the instructions is critical: not only did errors of fact need to be detected, but also errors of arithmetic, logic and judgement.
We then turned this approach into a repeatable four-step quality control error suppression protocol.
3. The Four-Step Error Correction Protocol
The same task is run independently on each AI Platform. Hopefully the conclusions converge, but how each platform got there will be different, as their training data and operational settings differ.
The last step is optional and is possibly the most distinctive feature of our error suppression protocol compared with standard commercial approaches. What our "shoot-out" step could expose was interpretative differences leading to different outcomes.
The traditional approach, when comparing the output of three evaluations, is to assume the minority LLM scorer got it wrong. In one of our shoot-outs the minority scorer stood its ground, and both majority high scorers conceded their judgements had been wrong.
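For readers who prefer a schematic view, the four steps can be set out as a simple checklist. The wording below paraphrases the step titles from section 5; it is a summary aid, not an executable part of the protocol.

```python
# Illustrative checklist of the four-step protocol, paraphrasing the
# step titles in section 5. Step 4 is optional.
PROTOCOL_STEPS = [
    "Step 1: Error detection - each platform forensically reviews all three outputs",
    "Step 2: Reconciliation - each platform accepts or rejects the alleged errors and revises its report",
    "Step 3: Inspection of errors alleged but rejected",
    "Step 4 (optional): The shoot-out between divergent scorers",
]

for step in PROTOCOL_STEPS:
    print(step)
```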
4. Evidence of the effectiveness of the error suppression protocol
Over this one evaluation run of the three case studies, a total of thirty-one errors were found and corrected. The number per LLM over the three case study evaluations ranged from seven to twelve. These numbers are given in the book to quantify the success of the error suppression protocol.
Examples of LLM mistakes included errors of fact, arithmetic, logic and hallucinations.
This got us to the point of having all the power of AI with its errors and misjudgements suppressed to the extent possible by the forensic abilities of the three LLMs we used.
Further, if any errors remained, they would be revealed when triangulating the corrected outputs of the three independent evaluations from the three AI Platforms.
5. Running the Protocol
It was easiest when working with the three AI Platforms to use a single blank Word document into which the output from all three AI Platforms was pasted. As each output was long, each was pasted in a different text colour, e.g. orange for Claude, blue for Chat GPT and green for Gemini. This made it very easy to find the start and finish of each output when adding the AI Platform titles. It also allowed speed-scrolling to find a specific AI Platform result if that was required.
Then the new Prompt instructions were added at the top of the composite document. Below are the Prompt instructions that were added to the top of the Word document for each protocol step. Then the entire Word document was pasted (or uploaded) back to all three platforms.
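The assembly of the composite document can be sketched in a few lines. The function below is a hypothetical stand-in for the manual Word-document workflow, not a tool we actually used; the section titles are illustrative.

```python
def build_composite(prompt_instruction, outputs):
    """Assemble the single document pasted back to all three platforms:
    the step's prompt instruction on top, then each platform's output
    under its own title. 'outputs' maps platform name to output text."""
    sections = [prompt_instruction]
    for platform, text in outputs.items():
        sections.append(f"--- {platform} evaluation ---\n{text}")
    return "\n\n".join(sections)

doc = build_composite(
    "Find and identify all errors of fact, logic and arithmetic.",
    {"Claude": "...", "Chat GPT": "...", "Gemini": "..."},
)
print(doc)
```

The key property is the same as in the manual process: the instruction always sits on top, and each platform's output is clearly delimited so the reviewing LLMs can attribute errors correctly.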
The first three steps would typically take twenty minutes. If the fourth step was required, that would take longer, but it would be time well spent as it gave insights into the judgemental differences at work.
5.1 Protocol Step 1 - Error Detection
Prompt Instruction - The following are the analysis results of three AI Platforms on a case study to evaluate its alignment with the ten golden rules for global leadership. It is vitally important that the evaluations contain no errors. Examine all three evaluation results, acting as a ruthless forensic detective, to find and identify all errors of fact, logic and arithmetic, and any hallucinations. List the errors you have found and their nature under the title of the AI Platform that made them. Then give the number of errors you found in the evaluation reports of each AI Platform.
5.2 Protocol Step 2 - Reconciliation of Errors Found
Prompt Instruction - Inspect the uploaded document. It is a consolidation of all the errors you have found and the errors the other two AI Platforms found. Look at all the errors you and the other two AI Platforms have found in your evaluation report and examine them carefully. If you accept that you have made these errors, then correct your evaluation report, including revising your overall score out of thirty if that was affected. Then list the errors that you have rejected, with a one-line explanation as to why. Then simply summarise the number of errors you accepted and corrected in giving your revised evaluation, and the number you rejected.
5.3 Protocol Step 3 - Inspection of Errors Alleged but Rejected
The action is to look at the alleged errors each AI Platform has rejected and the reasons given. If any look consequential, the next action is to go back to the AI Platform or Platforms that alleged the error, give them the reason for the rejection, and see if they sustain their position. When this arises it is usually not an error but a difference of judgement. That can be illuminating.
5.4 Protocol Step 4 - The Shoot-Out
The shoot-out step was only taken when there was a large spread between the scores or one score was significantly out of line with the others.
Prompt Instruction - The objective of the shoot-out is to ascertain whether the score differences are the result of an error not so far detected or of a difference in judgement. The shoot-out requires you to look at your opponent's score and reasoning and make a reasoned case for why your opponent has the right score and yours is wrong. This is to test differences in judgement and whether one or other AI Platform has the better-reasoned judgement. Evaluate your judgement against the best-reasoned case you have made for your opponent's score. Then say whether you stand by your score or concede that the other AI Platform has either flushed out an error or has a superior reasoned judgement for its case study score. If you concede, give your amended score and say whether you are doing so because you accept you made an error or because the other AI Platform has the better-reasoned judgement for this case study.
Your opponent for the shoot-out is...
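The trigger condition for this step can be expressed numerically. The sketch below assumes scores out of thirty as used in the protocol; the five-point threshold is our own illustrative choice, not a figure fixed by the protocol.

```python
def needs_shootout(scores, spread_threshold=5):
    """Flag a shoot-out when the scores (out of thirty) diverge widely.
    The five-point default threshold is an illustrative assumption."""
    return max(scores) - min(scores) >= spread_threshold

print(needs_shootout([20, 26, 27]))  # wide spread: shoot-out needed
print(needs_shootout([25, 26, 27]))  # tight spread: no shoot-out
```

An "out of line" check (one score far from the other two) could be added in the same spirit; in practice, eyeballing the three scores was sufficient.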
5.5 Triangulation
The application of the error suppression protocol results in three independent outputs. Residual differences are likely to come down to differences in the training data and the judgemental reasoning of each AI Platform. How to proceed depends on the specific task; that is for the human brain directing the task to determine.
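One possible way to triangulate the three corrected scores is to take the median as the consensus and report the residual spread. This is a sketch of one plausible synthesis, not the protocol's prescription: as noted above, the final judgement remains with the human director.

```python
import statistics

def triangulate(scores):
    """One illustrative synthesis of the three corrected scores:
    the median as the consensus, plus the residual spread for the
    human director to weigh up."""
    return {"consensus": statistics.median(scores),
            "spread": max(scores) - min(scores)}

print(triangulate([24, 26, 27]))
```

The median is attractive here because it is robust to a single outlying scorer, which is exactly the situation the shoot-out step is designed to probe.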
5.6 Full Traceability - As this was a research project, a fifth step was added during the phase of fine-tuning the Prompt: collecting all the errors found and accepted by all three Platforms in Step 2 and pasting them into a Word document as an error log record.
6. Conclusion
The importance of our error suppression protocol is that it removes a barrier to more widespread day-to-day use of AI for complex evaluations in both the public and private sectors. In this way everyone can be empowered to reach higher performance standards, in both quality and speed.
If time is of the essence, step 4 can be dropped.
The protocol could work with only two AI Platforms and would still deliver a more certain result than a single AI Platform. In that case the two platforms best to couple would be Chat GPT (more creative) and Claude (more forensic).
All content here (c) copyright of Stephen Temple