AI and the Gell-Mann amnesia effect
Practical quality assurance strategies for evaluators working with AI tools
Ever read an article or watched a news segment about a field you’re deeply familiar with, and spotted glaring mistakes or off-base interpretations? It’s irritating when reporters get their wires crossed, and it really shows when they haven’t done enough research.
Then moments later, you turn the page and relax into a subject outside your sphere of expertise, devouring it as if every word were true.
That’s called the Gell-Mann amnesia effect
The “Gell-Mann amnesia effect” is a colloquial reference to our tendency to critique reporting where we have expertise, but then trust reporting in domains we’re less familiar with even though it could be just as flawed.
Michael Crichton, the novelist and physician (author of Jurassic Park and many other blockbuster books), coined the term after watching his friend, physicist Murray Gell-Mann, experience the phenomenon.
This can also happen when we use AI tools
I often get my AI assistant1 to tackle a first draft of something (a conference abstract, a short bio, this post) on a topic I know well. The prompts are based on my own framing of the narrative, and the tool’s output helps string my thoughts together. Its writing may be technically good - but it’s never, ever quite right. I always have to correct nuances and put things in my own voice before I’m satisfied. It’s easy to spot and correct what’s wrong, because the topic is my topic.
But what about the rest of the time? A few days ago at the supermarket, I asked Perplexity to recommend the best gluten-free flour for a soda water fish batter (it recommended white rice flour). I had no idea if this was a good recommendation, but I accepted it at face value and found out later (it was OK, crispy but a little bland; I’ll try besan next time).
Now, let’s up the stakes. Imagine you’re getting your favourite AI tool2 to help you search and summarise literature on a topic that you’re not an expert in, during the inception phase of a new evaluation project (you were engaged for your expertise in evaluation, not subject matter knowledge). The literature review will lay out important context and evidence underpinning the evaluation. If it contains misconceptions, misinterpretations, missing details or outright falsehoods, the whole evaluation will be built on shaky foundations. How will you know if the literature review is sound?
How it can go wrong
AI mustn’t carry out any task unsupervised. If you’re delegating something to AI, you are responsible for the quality of the work it produces. Ethically, we need to have oversight of everything we ask AI to do. A bit like working with a colleague who’s brilliant at some aspects of the job and also has weak spots, we need to understand what AI brings to the team and what it doesn’t.
Evaluators face significant risks when they hand over tasks to AI tools without assessing the quality of the outputs. While large language models (LLMs) can churn out text faster than you can read it, their default setting seems to be outputs that are slick but shallow. For example, when an AI tool prepares a thematic summary of interview transcripts, it may over-summarise the findings, producing glib overviews that miss important nuances, outliers, and context-sensitive cues that a human evaluator would include.3
When evaluators rely on AI to automate tasks like literature review, or interview transcription and coding, errors (such as misused quotes, misclassified themes, or missed insights) can propagate through conclusions, creating systemic inaccuracies that may be hard to spot or undo. These risks are multiplied when evaluators use AI for rapid research in unfamiliar domains, where they may lack the subject expertise to apply the sniff test. Banal or hallucinated outputs can find their way into evaluation reports, contaminating findings and perpetuating biases.
So it’s on us to stay vigilant and own the critical oversight of what AI tools produce. The challenge goes beyond catching obvious howlers, and includes developing systems and processes for deep review, testing for plausibility, completeness, values-alignment, and more. Without diligent supervision, there’s a risk that the world’s quickest and cheapest assistant becomes the quickest, most expensive, and most embarrassing source of error.

Practical strategies to stay ahead of AI in unfamiliar territory
When working with AI tools for research, analysis, and writing, a healthy dose of scepticism is essential. You almost have to assume it’s trying to trip you up. Here are some strategies I use to stay ahead of the AI tool.
1. Iterate
There may be roles for AI to assist at every stage of an evaluation process - e.g., it can help you prepare or respond to terms of reference, navigate contract negotiations, conduct background research, design stakeholder engagement processes, develop a theory of change, value proposition, criteria and standards, identify evidence sources, design data collection tools, clean and code data, analyse and synthesise evidence, compare evidence with criteria and standards to suggest performance ratings, prepare reports, and more.
For each task, break it into logical steps that can be tackled sequentially. Keep the prompts focused. Approach each step as a series of cycles, with multiple iterations - e.g., ideating, fact-finding, drafting, fact-checking, gap-finding, exploring, critiquing, fine-tuning.
You’re in charge. AI can help, under your guidance, supervision, and accountability.
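If you work with an AI tool through a script or an API rather than a chat window, the same cycle can be made explicit. Here’s a minimal Python sketch of one way to structure it; ask_llm is a hypothetical placeholder for whichever tool you actually use, and the human checkpoint between steps is the point of the exercise.
```python
# A minimal sketch of an iterate-and-review cycle. `ask_llm` is a hypothetical
# placeholder: swap in a call to whichever AI tool or API you actually use.

def ask_llm(prompt: str) -> str:
    # Placeholder: replace with a call to your preferred tool.
    return f"[model output for: {prompt[:60]}...]"

def human_review(step: str, text: str) -> str:
    """Pause so a person can read, correct, and approve each output."""
    print(f"\n--- {step} ---\n{text}\n")
    edited = input("Type a correction, or press Enter to accept: ")
    return edited or text

task = "Summarise the evidence on topic X for an evaluation inception report"

# One focused prompt per step, with a human checkpoint after each.
steps = {
    "ideate": f"List the key questions an evaluator should ask about: {task}",
    "fact_find": "For each approved question, list the evidence needed to answer it.",
    "draft": "Draft a short summary addressing the approved questions and evidence.",
    "critique": "Identify gaps, overstatements, or missing perspectives in the draft.",
}

context = ""
for step, prompt in steps.items():
    output = ask_llm(f"{context}\n\n{prompt}".strip())
    context = human_review(step, output)  # the human decides what carries forward
```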
2. You first, then AI
Each cycle should start with a human brain. Begin by framing your own thoughts, questions, hypotheses, etc., before prompting the AI tool. This helps ground the work in your (and your colleagues’ and stakeholders’) contextual knowledge and intuition. It also guards against anchoring bias: if AI goes first, there’s a greater risk of accepting its outputs with insufficient critique. When you go first, AI can review your ideas and suggest gaps.
So: you first, then AI, then back to you. AI’s outputs should serve as inputs for further scrutiny and enrichment, not as definitive answers.
3. Involve stakeholders
Participatory development and review are essential anyway - for ethics, validity, credibility, contextual grounding, ownership, and evaluation use - and once AI is in the loop, there are additional compelling reasons to sense-check each step along the way.
Invite colleagues, stakeholders, and/or external experts to review the work. Share your specific uncertainties. Collective review surfaces blind spots that any one person might miss.
4. Triangulate sources
Check the AI’s work by clicking through to the documents it cited (Perplexity Pro’s in-line linked citations are one reason I favour this tool) and decide whether you agree with its summary. For starters, make sure the references actually exist, because LLMs have been known to make some up. Give yourself permission to slow down and double-check the evidence. Where possible, consult sources beyond those it cited, and prompt the AI tool to conduct a wider search to satisfy yourself that the point holds.
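If the tool can give you its citations as a list of links, even a small script can do the “do these references exist?” pass before you sit down to read them. Here’s a minimal Python sketch; the URLs are hypothetical placeholders, and a link that resolves is necessary but not sufficient - you still have to read the source to confirm it supports the claim.
```python
# A sketch of a quick "do the cited references resolve?" check.
# The URLs below are hypothetical placeholders for the AI's citations.
import requests

cited_urls = [
    "https://example.org/some-cited-report",
    "https://doi.org/10.1000/xyz123",
]

for url in cited_urls:
    try:
        resp = requests.get(url, timeout=10, allow_redirects=True)
        status = "resolves" if resp.ok else f"HTTP {resp.status_code}"
    except requests.RequestException as exc:
        status = f"request failed ({exc.__class__.__name__})"
    # A link that resolves still needs reading to confirm it says what the AI claims.
    print(f"{url}: {status}")
```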
5. Use a checklist
Use explicit quality criteria to check drafts. The criteria will depend on the task, but here’s my mental checklist for writing (there’s a small sketch after the list showing one way to keep it as a reusable record). I welcome your additions.
Logic: is the argument coherent and does it avoid logical fallacies?
Sourcing: are reputable sources cited?
Factuality: am I satisfied that statements are accurate and correct?
Nuance: do the words convey the intended meaning?
Balance: are counterarguments and alternative perspectives considered?
Bias: is there a risk of bias in the text?
Controversy: are any statements likely to be contested? (this isn’t a reason to omit them, but to check we’re satisfied with our claims)
Transparency: are values, assumptions, and limitations explicit?
Actionability: is it useful and actionable?
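If you apply the same criteria to every draft, it can help to hold the checklist as structured data rather than in your head, so each review leaves a record you can revisit. A minimal Python sketch; the wording and structure are mine, so adapt them to your task.
```python
# A sketch of the checklist as a reusable review record. Wording is illustrative;
# adapt the criteria and questions to the task at hand.
from dataclasses import dataclass, field

CRITERIA = {
    "Logic": "Is the argument coherent and free of logical fallacies?",
    "Sourcing": "Are reputable sources cited?",
    "Factuality": "Are statements accurate and correct?",
    "Nuance": "Do the words convey the intended meaning?",
    "Balance": "Are counterarguments and alternative perspectives considered?",
    "Bias": "Is there a risk of bias in the text?",
    "Controversy": "Are any statements likely to be contested?",
    "Transparency": "Are values, assumptions, and limitations explicit?",
    "Actionability": "Is it useful and actionable?",
}

@dataclass
class ChecklistReview:
    draft_name: str
    notes: dict = field(default_factory=dict)

    def record(self, criterion: str, satisfied: bool, comment: str = "") -> None:
        """Record a human judgement against one criterion."""
        self.notes[criterion] = {"satisfied": satisfied, "comment": comment}

    def outstanding(self) -> list:
        """Criteria not yet reviewed, or reviewed and found wanting."""
        return [c for c in CRITERIA
                if c not in self.notes or not self.notes[c]["satisfied"]]

review = ChecklistReview("Inception literature review, draft 2")
review.record("Sourcing", True, "All citations checked against the originals")
review.record("Bias", False, "Framing favours the program; needs rebalancing")
print("Still to address:", review.outstanding())
```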
6. Interrogate and refine
Write stuff the way you’d say it. Then get AI to review it. Use guiding prompts to audit and refine the draft, treating the system more like a critical colleague than an auto-complete tool.
Good prompts are specific about the task, provide enough context to anchor the model in your evaluative purpose, and spell out the kind of output structure you want. The following prompts are examples, designed to reduce ambiguity, surface rich critique, and make the AI’s reasoning easy to inspect and challenge as you cycle between your own edits and fresh AI input.
“Act as a critical reviewer. Identify any logical problems in this draft, such as unclear premises, non-sequiturs, unjustified leaps, and common informal fallacies (for example, overgeneralisation or false dichotomy). For each issue, quote the relevant text, explain the problem in plain language, and suggest a clearer or more logically sound alternative.”
“Review each factual claim (e.g., numbers, dates, classifications, causal statements). Flag items that are: (a) inaccurate, (b) speculative or uncertain, (c) overstated relative to typical evidence, or (d) lacking a clear source. Label each flagged item with the applicable categories and briefly state why it is flagged, noting when you are unsure.”
“Identify sections where the discussion seems one-sided, oversimplified, or incomplete given what different stakeholder groups or disciplines might reasonably think. For each section, explain which additional perspectives or nuances are missing and briefly sketch how the text could acknowledge them without losing clarity.”
“Identify points where a conclusion is stated without sufficient support. For each, specify whether what is missing is: (a) empirical evidence, (b) explanation of causal mechanisms or theory, (c) evaluative reasoning linking evidence to explicit criteria and standards, or (d) explanation of why alternative interpretations were rejected. Suggest what kind of material would be needed to make the argument adequately supported.”
“Identify potential framing or language biases in how findings and stakeholders are described (for example, loaded terms, one-sided portrayal of program actors, implicit value judgements). For each, suggest more neutral or balanced wording.”
“Identify statements that are likely to be contested by reasonable stakeholders. Distinguish between: (a) empirical contestation (where other credible evidence or interpretations exist) and (b) values-based contestation (where people with different values or interests would disagree). For each, indicate which type(s) of contestation apply and briefly describe possible alternative positions.”
“Identify claims that rely on broad generalisations or ‘common sense’ about how programs, systems, or people behave (for example, ‘X always leads to Y’ or ‘stakeholders inevitably…’). For each, explain why it might be relying on assumed consensus rather than explicit evidence, and suggest language that makes uncertainty or evidence limits more transparent.”
“Assess how clearly the text states: (a) key value judgements (for example, what counts as ‘success’ or ‘value for money’); (b) major assumptions (for example, causal, contextual, methodological); and (c) limitations (for example, data quality, generalisability). Point out where each of these is missing, implicit, or underdeveloped, and propose concrete sentences or brief sections that could improve transparency.”
“Identify statements, findings, or recommendations that are too vague, abstract, or high-level to guide decisions or action by commissioners, implementers, or other users. For each, explain why it is not actionable and suggest how it could be made more specific, feasible, and decision-relevant.”
None of these prompts is foolproof, but iteratively asking the AI tool to review your drafts in this way can help to incrementally tighten the work and reduce risks.
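If you find yourself re-running several of these audits on every draft, a small script can keep each prompt focused and keep the critiques separate for your own review. Another minimal sketch, with ask_llm as a hypothetical placeholder for whichever tool or API you use, and the audit prompts abbreviated from the examples above.
```python
# A sketch of running a set of audit prompts over a draft one at a time, so each
# critique stays focused and easy to inspect. `ask_llm` is a hypothetical
# placeholder, and the prompts are abbreviated versions of the examples above.

AUDIT_PROMPTS = {
    "logic": "Act as a critical reviewer. Identify any logical problems in this draft...",
    "factual_claims": "Review each factual claim and flag anything inaccurate, speculative, overstated, or unsourced...",
    "bias": "Identify potential framing or language biases and suggest more neutral wording...",
}

def ask_llm(prompt: str) -> str:
    # Placeholder: replace with a call to your preferred tool.
    return f"[model critique for: {prompt[:50]}...]"

def audit(draft: str) -> dict:
    """Run each audit prompt separately and return the critiques for human review."""
    return {name: ask_llm(f"{prompt}\n\nDRAFT:\n{draft}")
            for name, prompt in AUDIT_PROMPTS.items()}

critiques = audit("<paste your draft here>")
for name, critique in critiques.items():
    print(f"\n== {name} ==\n{critique}")
```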
7. Own it
After enough iterations, finalise the work with your own expertise, judgement, and prose, especially for high-stakes outputs. Get a real person to peer review the final draft for you.
Bottom line
AI is here. It’s already in use. We need to learn when to use it (and when not to) and how to use it effectively. Perhaps it can save you a bit of time, but not as much as you might think. It’s there to assist, not replace you, and it needs quite a lot of supervision.
The Gell-Mann amnesia effect is a reminder that AI’s confident tone is no substitute for reliability. Our comparative advantage as human evaluators lies in our human traits: values, curiosity, sociability, contextual understanding, intuition, creativity, adaptability, collective sense-making, and judgement. Those are ours to use, and evaluation would be diminished without them.
Ethically, we remain accountable for the evaluations we conduct. “But the AI said…” isn’t a valid defence. Evaluation quality remains in human hands, and the value we add lies in the questions we ask, the standards we hold, and the responsibilities we own.
Thanks for reading!
Also see
New UK Evaluation Society AI Guidelines
Launched 18 November, these guidelines are built around four key principles: 1) transparency, accountability and competence; 2) human control and proportionate AI use; 3) active risk management and harm prevention; and 4) quality assurance and verification. They include explicit “do not proceed” scenarios and a checklist for evaluators. Check it out.
AI - a problem wrapped in a solution
A thought-provoking read from Saville Kushner. “Many people are experimenting with AI platforms, often as an inquiry aid. This paper reports on direct interactions with ChatGPT and takes a critical look at its troubling interference with ‘trust’ and ‘validity’. We will see ChatGPT interacting directly with data and confronting complex ethical dilemmas.” Check it out.
Evaluator-free evaluation
You can prompt an AI tool to conduct an evaluation in as little as a minute. But is it any good? I took the idea for a test drive and found it’s not terrible, but it’s not good enough. Without careful human oversight, stakeholder input, and iterative sense-making, rapid AI-generated findings aren’t fit for consequential public investment decisions.
AI-enhanced evaluation
Keeping humans in the mix is essential for good evaluation - but what’s the right mix? I think an economic principle called comparative advantage is key. If we’re clear about what humans and AI each do best, and if such a mix can meet program evaluation standards while delivering more quickly at lower cost, everybody wins (including evaluators, if it boosts demand for evaluations).
Throughout this post, I use the term “AI” for simplicity. But these systems aren’t “intelligent” in the way we apply that term to humans. The AI tools most of us are using are more accurately referred to as large language models (LLMs), and using LLMs as research, analysis and writing assistants is what I’m talking about in this post. LLMs work by analysing huge volumes of text and predicting the most likely next word or phrase based on patterns in their training data. They don’t possess understanding, consciousness, or intent. While some newer AI models incorporate forms of “reasoning”, it is still fundamentally different from human reasoning: they follow statistical and pattern-based processes, and their reasoning is limited to structured tasks and information present in their training data. They can’t truly comprehend, reflect, or apply judgement as people do. This can be a strength or a weakness, depending on how you choose to use the technology.
I’m not going to tell you which AI tool you should use, but I will tell you that my current favourite is Perplexity Pro because it nails several things that matter to me. First, its responses include clearly cited sources - inline, expandable, and easy to verify. This helps me trace back to the original material for fact-checking and transparency. Second, I can choose between multiple leading AI models (GPT-4o, Claude Sonnet, Gemini, Sonar, and more), giving flexibility depending on the nature of the question or task - whether I want deep “reasoning”, quick synthesis, or creative “brainstorming”. Third, the Pro version provides unlimited secure uploads for documents, images, and datasets. Fourth, privacy-wise, Perplexity Pro has taken clear steps to assure confidentiality over sensitive material. Opt-out from model training is the default on paid plans, and chat histories are deleted after a set period. That said, absolute privacy is never possible with cloud-based LLMs, so I don’t trust it with identifiable, confidential, or commercially sensitive data.
Hrdlickova, Z., & Wagstaff, T. (2025). Utilising AI in qualitative evaluations: Experimental innovation in OPM. Paper presented at the UK Evaluation Society Annual Conference, Glasgow.





