Evaluator-free evaluation
What happens when an AI tool does the evaluation work - completely on its own?
I make a point of not missing Gerard Atkinson’s conference presentations - always a highlight. Especially his ongoing exploration of artificial intelligence (AI) developments and implications for evaluation.1
At the 2025 Australian Evaluation Society Conference in Canberra, Gerard mentioned the potential for AI to enable:
Evaluation without evaluators: This possibility already exists in the form of “self-service” monitoring and evaluation, where the project or program manager takes responsibility for these functions without engaging specialist skills. The use of AI tools could enable more of this.
Evaluation without a real-world evaluand (e.g., policy or program): The emergence of AI simulation tools, such as agentic AI, digital twins, and predictive analytics, introduces the possibility of modelling and evaluating potential interventions using synthetic data, market research-style personas, or simulated environments, without needing to pilot them in the real world.
This raises a provocation: if there are no evaluators and no real evaluand, is it “evaluation-free evaluation”? Gerard noted we shouldn’t dismiss the possibility as necessarily a bad thing. Simulations that can be run quickly and cheaply, and that potentially generate insights for decision-makers as useful as (or more useful than) those from traditional methods, deserve serious consideration. Even if it isn’t really evaluation.2
I am certainly not advocating for evaluator-free evaluation as a routine replacement for human involvement. See my recent post, envisioning a future in which we might achieve more, better, quicker, lower-cost evaluation through human-AI teamwork based on the principle of comparative advantage - with humans remaining morally and ethically accountable.
However, I simply had to put evaluator-free evaluation to the test, just to see what would happen.
I conducted a one-minute evaluation and a one-hour evaluation of the same thing.
But first I had to choose something to evaluate. I chose Te Pae, the new Christchurch Convention Centre. Te Pae is a recent, large-scale public infrastructure development in New Zealand. I love the design, but does it provide good value for the resources invested? I thought there should be a decent volume of published information about Te Pae on the web that my AI tool, Perplexity Pro, could use to scope, design and conduct an evaluation. For both evaluations, I prompted the tool to follow the 8 steps of the Value for Investment (VfI) approach.
Disclaimer: This exercise explores the limits of an AI tool as a replacement for human evaluators and is not a formal assessment of Te Pae. Results are illustrative only and should not be taken as professional findings. See full disclaimer at the end of this post.
The one-minute evaluation set out to test what Perplexity Pro would do with just one prompt:
Using the attached process [diagram below], evaluate the new Christchurch Convention Centre. Use information available on the internet. Follow the 8 steps. Show your working. Develop context-specific value proposition, detailed criteria, sub-criteria, and standards. Criteria will be cost-effectiveness (creating enough value to justify the investment), effectiveness (real changes in people, groups, places, or things, caused by the investment), efficiency (productivity and ways of working), economy (stewardship of resources) and equity (fair allocation of resources, delivery, outcomes, and value). Standards will be excellent (exceeding expectations), good (meeting expectations), adequate (meeting minimum requirements and showing acceptable progress) and poor (not meeting minimum requirements or not showing acceptable progress). Identify evidence sources. Analyse and synthesise the evidence. Make judgements using the evidence, criteria, and standards. Present the findings clearly, justifying the judgements you make.
The one-hour evaluation used a sequence of 13 prompts, prepared in advance, cumulatively developing and applying a VfI framework (a rough sketch of how such a prompt chain could be scripted follows the list):
What do you know about Te Pae?
To whom is Te Pae valuable, and how is it valuable to them?
Based on the potential value described above, create a rubric to evaluate Cost-effectiveness of Te Pae. The rubric must capture economic and financial value as well as intangible social and cultural value to stakeholders. It will have four levels - excellent (exceeding expectations), good (meeting expectations), adequate (meeting minimum requirements and showing acceptable progress) and poor (not meeting minimum requirements or not showing acceptable progress).
What inequities could Te Pae address, and how? What would fair and equitable allocation of resources, delivery, outcomes, and value look like?
Based on the information above, create a rubric for evaluating the performance of Te Pae on Equity.
What real changes in people, groups, places or things will Te Pae bring about?
Create a rubric for evaluating Effectiveness of Te Pae - key outcomes and impacts - from those described above.
What ways of working will maximise value from the investment in Te Pae?
Create a rubric for evaluating Efficiency of Te Pae, focusing on key ways of working described above.
What resources are invested in Te Pae, and by whom? Consider not only financial resources but also intangible resources. What does good stewardship of those resources look like?
Create a rubric for evaluating Economy of Te Pae, focusing on good stewardship of resources as described above.
Using all of the rubrics you created above for cost-effectiveness, equity, effectiveness, efficiency, and economy, gather evidence, analyse the evidence, synthesise the evidence through the lens of the rubrics, and make evaluative judgements about the performance and value of Te Pae. Present your judgements systematically. Be transparent about the availability and quality of evidence to address each of the criteria and standards. If there is insufficient evidence to make a judgement, say so - don’t guess.
Write a report. Spoilers at the front, working in the middle, detail at the back.
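For readers curious how such a prompt chain might be automated rather than typed by hand, here is a minimal sketch in Python. It is purely illustrative: the endpoint, model name, and response structure are placeholder assumptions in the style of a generic chat-completions API, and the actual experiment was run interactively in Perplexity Pro’s web interface, not via code.

```python
# Minimal sketch of chaining pre-written prompts through a chat API.
# Endpoint, model name, and response shape are placeholder assumptions;
# the real experiment used Perplexity Pro's web interface, not a script.
import requests

API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"                                  # placeholder credential

PROMPTS = [
    "What do you know about Te Pae?",
    "To whom is Te Pae valuable, and how is it valuable to them?",
    # ... the remaining prompts from the list above, in order ...
    "Write a report. Spoilers at the front, working in the middle, detail at the back.",
]

def run_prompt_chain(prompts):
    """Send each prompt in turn, carrying the whole conversation forward so that
    later prompts (e.g. 'create a rubric...') can build on earlier answers."""
    messages = []
    for prompt in prompts:
        messages.append({"role": "user", "content": prompt})
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"model": "placeholder-model", "messages": messages},
            timeout=300,
        )
        response.raise_for_status()
        answer = response.json()["choices"][0]["message"]["content"]
        messages.append({"role": "assistant", "content": answer})
        yield prompt, answer

if __name__ == "__main__":
    for prompt, answer in run_prompt_chain(PROMPTS):
        print(f"\n=== {prompt}\n{answer[:500]}...")
```

The design point the sketch tries to capture is cumulative context: each rubric prompt builds on the value propositions surfaced by the prompts before it, which is what the step-by-step Deep Research session did interactively.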
The two evaluations also used different search settings.
The one-minute evaluation was all about speed. It used the standard search function, which returned results almost instantly. The one-hour evaluation used Perplexity Pro’s Deep Research function, which typically took 3-5 minutes to complete each step.
The two evaluations came up with different criteria.
The one-minute and one-hour evaluations defined criteria (aspects of performance and value to focus on) differently. This isn’t a limitation of AI-generated evaluation; human evaluators would do that too if briefed and resourced differently for the same job. Nonetheless, it’s interesting to see how they differed.

The two evaluations reached different conclusions.
No surprises there, considering they defined their criteria differently and searched the internet to differing levels of depth, basing their conclusions on different (but overlapping) evidence.

Full transcripts of both evaluations are available.
You can click through to them here. I also got Perplexity Pro to compare and contrast the two evaluations (because of course I did) and the comparison is available over there too.
Some reflections
Overall, the one-minute and one-hour evaluator-free evaluations demonstrated potential, but ultimately served to highlight the importance of human evaluators remaining in charge. Despite performing impressively in some respects, AI can’t do all the work for you.
There are some obvious limitations that come with rapid, AI-heavy evaluations. A major limitation is the lack of stakeholder participation. Perplexity can suggest perspectives, but can’t meaningfully speak for actual stakeholders or communities about what matters to them, their lived experiences, or unanticipated consequences.
A less-obvious, but equally important limitation is the potential circularity of defining criteria entirely by what’s available on the internet (availability bias) and then using those criteria to evaluate performance based on the same data.
I was reasonably satisfied with the depth and breadth of the evaluations, bearing in mind the time it would take a human team to complete a comparable set of tasks. I would have scoped and framed some things differently, through iterative prompting, if I had opted to exercise more control. However, for the sake of this experiment, I took a hands-off approach aside from the pre-written prompts. Perhaps the results could be improved by fine-tuning the prompts over multiple successive evaluations.
A big flaw was that the one-hour evaluation conflated criteria and indicators, limiting some sub-criteria to measurable features of performance and suggesting numerical cut-points between different levels in the rubrics, whereas I insist on keeping a clear distinction between criteria and indicators. In the 8-step Value for Investment process - used in both evaluations - criteria (aspects of value) and standards (levels of value) are defined at steps 2 and 3 respectively, to support transparent evaluative reasoning. Specific, measurable indicators (if used) are supposed to be identified as part of step 4 - determining the mix of necessary and credible evidence. This problem is fixable; if this were a real evaluation rather than an AI experiment, I would have clarified this requirement and prompted Perplexity to try again.
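To make that distinction concrete, here is a minimal sketch of one way to keep criteria and standards separate from indicators in a simple data model. This is my own illustration, not part of the VfI materials or either AI transcript, and the example indicator is entirely hypothetical: standards remain qualitative descriptors of levels of value (steps 2 and 3), while measurable indicators with cut-points belong to the evidence mix (step 4), linked back to the criteria they inform.

```python
# Illustrative sketch only: keeping the criteria/standards vs indicators
# distinction explicit. The example indicator text is hypothetical.
from dataclasses import dataclass

@dataclass
class Criterion:
    """An aspect of value (step 2) with qualitative standards (step 3)."""
    name: str
    standards: dict  # level -> qualitative descriptor, not a numerical cut-point

@dataclass
class Indicator:
    """A measurable piece of evidence (step 4), linked to a criterion.
    Numerical cut-points, if any, live here - not in the standards."""
    criterion: str
    measure: str
    source: str

equity = Criterion(
    name="Equity",
    standards={
        "excellent": "exceeding expectations",
        "good": "meeting expectations",
        "adequate": "meeting minimum requirements and showing acceptable progress",
        "poor": "not meeting minimum requirements or not showing acceptable progress",
    },
)

# Hypothetical indicator feeding evidence into the Equity criterion.
community_use = Indicator(
    criterion="Equity",
    measure="Share of community and local events hosted per year",
    source="Venue annual reporting (if available)",
)
```

Structured this way, evaluative judgements are still made against the standards, with indicators contributing evidence rather than substituting for the reasoning.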
Nonetheless, I thought the AI tool’s performance in scoping the value proposition was pretty amazing. If this had been a real-world evaluation, I could imagine using Perplexity’s outputs to help identify and define appropriate criteria with stakeholder input.
It appears that one reason the one-hour evaluation rated some dimensions of performance lower than the one-minute evaluation (e.g., equity, efficiency) was that limited data was available on the internet to address some of the sub-criteria. As such, the conclusions may conflate limitations in evidence with limitations in performance. I advocate for keeping these two considerations separate: program quality and evidence quality are both important, and absence of evidence is not evidence of absence. This too would be fixable with more prompting.
For reasons I don’t understand, the one-minute evaluation included a benefit-cost ratio (BCR) as a dimension of cost-effectiveness (which is where I would have placed it myself) whereas the one-hour evaluation included it in economy. In a real evaluation, I would have corrected this.
It’s a no from me
As I’ve written previously, involving AI in evaluation could push out the production possibility frontier, enabling more-better-quicker-cheaper evaluation - but only if we get the balance of human and AI inputs right.
Evaluator-free evaluations are not the right balance. They drastically reduce time and costs, but involve unacceptable trade-offs in depth, transparency, inclusion, validity, credibility, ethics, and risk.
Evaluator-free evaluation may be viable in limited circumstances - e.g., where human expertise is scarce, available resources are minimal, speed is paramount, stakes are low - for example, helping me choose a decent value-for-money air fryer. But in policy and program evaluation, evaluator-free evaluation isn’t a substitute for things like human-led context-sensitivity, ethics, and stakeholder input for consequential judgements.
Extra comparisons are needed
This experiment compared a one-minute and a one-hour evaluation, neither of which was good enough. A more comprehensive trial would really benefit from expanding the comparison base in at least two ways:
One-day (to one-week) evaluation
An additional, and very interesting, point of comparison would be a one-day evaluation, where the AI platform could be guided step-by-step through each stage of the process with significantly more evaluator and stakeholder input. In this scenario, AI tools would be prompted to iteratively design the framework, collate evidence, make judgements and report findings, with humans guiding the process throughout, asking follow-up questions as needed, performing quality checks, and taking corrective action if the LLM strays from intended principles, processes or content. Humans would have the final say on evaluation design decisions and evaluative judgements.
A full-day approach would allow for a more thorough, reflective evaluation, exploring how repeated review and evidence collection might enhance the depth and reliability of findings - still within a much shorter timeframe than a humans-only evaluation. A single day wouldn’t be enough for some evaluations - it does depend, for example, on what evidence is already available and what extra evidence is needed. Even if it turned into a one-week evaluation, that still represents a significant time and cost advantage. Perhaps there are circumstances in which this general approach could provide a high return on effort - and make it affordable to conduct more evaluations. It ought to be tested.
50-day evaluation
Another important comparator is, of course, the baseline of an ‘old-school’ human-conducted evaluation. This would enable a direct comparison with reference to program evaluation standards, examining utility, feasibility, propriety, accuracy, accountability, and cost, for human versus AI-intensive approaches, revealing trade-offs.
Off the top of my head, I would expect to budget something in the order of 50 days (give or take) for a humans-only process spanning stakeholder engagement, design, mixed methods (including economic, quantitative and qualitative data gathering and analysis), synthesis, judgement-making, and reporting.
There may be a case for AI-free evaluation - and if so, it’ll be important to justify it. Before long, someone will ask whether the advantages of purely human evaluation are proportionate to the extra cost (maybe 10-50x the level of effort), and whether the traditional approach still represents defensible value for the additional resources required. If program evaluation standards can be met through hybrid human-AI approaches, at a fraction of the cost, at what point does it become unethical not to use AI?
As I argued in my earlier post, humans remain better at some critical aspects of evaluation while AI may have a comparative advantage in others. There may be an optimal mix of human and AI involvement that achieves both improved quality and reduced time and costs. Empirical comparisons are needed to determine whether this is the case. A great research-on-evaluation PhD topic.
It also gets me thinking…
My explorations were conducted using Perplexity Pro, my current favourite LLM tool because it can access the live internet, cite its sources, provide a menu of AI models to choose from (including GPT-5, Claude 4.5 Sonnet, Sonar Large, Gemini 2.5 Pro), and keep our conversations confidential.3
Beyond LLMs, what else might be possible with more specialised simulation-based AI tools? For example, platforms that don’t just look up answers, but actually simulate real-world policy interventions, test “what-if” scenarios in virtual environments, and predict the likely impacts of different decisions, all before anything happens in reality. These tools can create digital models of complex systems, allowing evaluators and policymakers to explore, iterate, and refine ideas, even estimate future outcomes. We already evaluate policies and programs ex-ante (before they exist). How might AI tools extend these evaluations, testing ideas quickly and cheaply before (or even without) piloting?
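As a toy illustration of the “what-if” idea (a deliberately simple model with entirely hypothetical parameters, nothing to do with Te Pae or any real platform), an ex-ante simulation might look like this: define a few uncertain inputs, run many scenarios, and compare the spread of outcomes under alternative intervention designs.

```python
# Toy Monte Carlo "what-if" sketch for ex-ante exploration of an intervention.
# All parameters are hypothetical placeholders, purely for illustration.
import random

def simulate_outcome(uptake_rate, benefit_per_user, fixed_cost, population=10_000):
    """One simulated run: noisy uptake, net benefit = total benefits minus cost."""
    users = sum(random.random() < uptake_rate for _ in range(population))
    return users * benefit_per_user - fixed_cost

def run_scenario(uptake_rate, benefit_per_user, fixed_cost, runs=1_000):
    """Repeat the simulation and summarise the spread of net-benefit outcomes."""
    results = sorted(
        simulate_outcome(uptake_rate, benefit_per_user, fixed_cost) for _ in range(runs)
    )
    return {
        "p10": results[int(0.10 * runs)],
        "median": results[runs // 2],
        "p90": results[int(0.90 * runs)],
    }

# Compare two hypothetical intervention designs before piloting either one.
print("Design A:", run_scenario(uptake_rate=0.15, benefit_per_user=40, fixed_cost=50_000))
print("Design B:", run_scenario(uptake_rate=0.25, benefit_per_user=30, fixed_cost=80_000))
```

Real simulation platforms - agent-based models, digital twins, system dynamics - are far richer than this, but the underlying logic is similar: explore the plausible range of outcomes before committing real resources.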
What else do you notice?
In what ways were the one-minute and one-hour evaluations true to core VfI principles (interdisciplinary, mixed methods, evaluative reasoning, participatory)? In what ways were they not up to scratch?
What might they have missed that a real, human-led evaluation could pick up on?
Is there a place for this kind of evaluation? For whom and in what circumstances might evaluator-free evaluation have a viable value proposition?
Bottom line
Neither the one-minute nor the one-hour AI evaluation produced findings that I would consider valid or actionable for a consequential public investment. Initial promise is there, but only with significant human oversight, stakeholder input, and iterative design can AI assistance approach the rigour needed for high-stakes evaluation. However, based on the potential shown, I am now curious about a one-day or one-week evaluation involving evaluators, stakeholders and AI.
Disclaimer
The results are not formal evaluations of Te Pae Christchurch Convention Centre. The one-minute and one-hour evaluations, performed by Perplexity Pro with basic prompting and no human edits, were generated as part of an experimental trial of rapid, heavily AI-supported evaluation methods, using publicly available information. The purpose is to explore the capabilities and limitations of the AI tool, not to assess or critique Te Pae. Findings should not be interpreted as professional conclusions or recommendations.
Thanks for reading!
I’m grateful to Gerard Atkinson for helpful peer review of this post. Errors and omissions are mine. All opinions are mine and are held lightly, especially in this rapidly evolving space.
One last job for Perplexity Pro…
Create an image for me: an art form that represents “evaluator-free evaluation”. Write an explanatory caption to go with it.

Throughout this post, I use the term “AI” for simplicity. But these systems aren’t “intelligent” in the way we apply this term to humans. The AI tools most of us are using are more accurately referred to as Large Language Models (LLMs), and for the most part, that’s what I’m talking about in this post. However, AI also includes computer vision, speech and audio processing, robotics and more. I’m not excluding any of it, but I’m writing mainly with LLMs in mind. LLMs work by analysing loads of text and predicting the most likely next word or phrase based on patterns in their training data. They don’t possess understanding, consciousness, or intent. While some newer AI models incorporate forms of “reasoning”, it’s still fundamentally different from human reasoning. They follow statistical and pattern-based processes, and their reasoning is limited to structured tasks and information present in their training data. They can’t truly comprehend, reflect, or apply judgement as people do. This can be a strength and a weakness, depending on how you choose to use the technology.
Instant coffee isn’t a bad drink. Its only crime is the use of the name “coffee”. It should have a different name to avoid being misleading. Similarly, “evaluator-free evaluation” may at times provide useful and valid analysis - it just shouldn’t use the name “evaluation”. Personal opinion.
A note about privacy and confidentiality in Perplexity Pro: My queries and uploads are protected through encryption at rest and in transit. Perplexity maintains agreements to ensure third-party model providers (OpenAI, Anthropic, Google) can’t use my data for training. Files I upload to threads are automatically deleted after seven days. However, end-to-end encryption isn’t claimed, and standard privacy policies apply to non-enterprise use, which may vary in legal enforceability and practical guarantees. Although conversations aren’t published or discoverable by internet search engines, some privacy risk remains compared to platforms with zero-logging and anonymous search modes. For example, internal access by authorised Perplexity staff is still technically possible.



Enjoyed this - though I suspect resisting the urge to iteratively prompt must have been torture for someone who clearly knows that's how AI actually works!
Quick thought (and one I think you already know, but flagged anyway for your readers): you're tackling the wrong question. Rather than "can AI replace evaluators?" (spoiler: no - at least not yet!), the more useful one is "which evaluation tasks suit AI oversight, and which need human judgement?"
Your "one-minute program theory" post showed AI's strength at rapid evidence synthesis. But your Te Pae experiment confirms the limits: AI conflates criteria with indicators, can't speak for stakeholders, and suffers from circularity - defining criteria by what's on the internet, then evaluating using that same data.
The sequencing problem: you're testing full delegation before establishing what works under supervision. Better to start small with low-stakes tasks, measure effects on validity, document what works, *then* expand where justified. You don't learn to swim by diving into the rapids!
The real question isn't "AI versus humans" but "how do we use AI responsibly?" - which is what the UK Evaluation Society's new guidance on responsible and ethical use of AI in evaluation will address. We're releasing it in a few weeks with a launch webinar likely on the 18th November. Would be great to have your perspective there!
Your one-day evaluation proposal is spot on - but needs that groundwork first. We need systematic mapping: which tasks benefit from AI (evidence synthesis, pattern recognition) versus which require human judgement (navigating power dynamics, determining "value" when stakeholders disagree). But again, how will you assess its efficacy? By comparing it to the 50-day (traditional) evaluation? I've seen plenty of poor traditional evaluations too!
Looking forward to the one-day version!
Great one, Julian. Thinking of the potential use of the AI-generated evaluand (evaluations without a real-world evaluand) beyond the "dystopian" scenario. Maybe M&E capacity building (for organisations) and M&E professional development in general could benefit from it and use it as some kind of simulator, an M&E playground. Or does this idea fall under AI agents and self-service? Not sure.
Anyway, if we are moving towards evaluations without human evaluators, do we need such capacity building? 🤔