Discussion about this post

Jonathan Patrick

Enjoyed this - though I suspect resisting the urge to iteratively prompt must have been torture for someone who clearly knows that's how AI actually works!

Quick thought (and one I think you already know, but I'm flagging it anyway for your readers): you're tackling the wrong question. Rather than "can AI replace evaluators?" (spoiler: no - at least not yet!), the more useful question is "which evaluation tasks suit AI oversight, and which need human judgement?"

Your "one-minute program theory" post showed AI's strength at rapid evidence synthesis. But your Te Pae experiment confirms the limits: AI conflates criteria with indicators, can't speak for stakeholders, and suffers from circularity - defining criteria by what's on the internet, then evaluating using that same data.

The sequencing problem: you're testing full delegation before establishing what works under supervision. Better to start small with low-stakes tasks, measure effects on validity, document what works, *then* expand where justified. You don't learn to swim by diving into the rapids!

The real question isn't "AI versus humans" but "how do we use AI responsibly?" - which is what the UK Evaluation Society's new guidance on responsible and ethical use of AI in evaluation will address. We're releasing it in a few weeks, with a launch webinar likely on 18 November. Would be great to have your perspective there!

Your one-day evaluation proposal is spot on - but it needs that groundwork first. We need systematic mapping: which tasks benefit from AI (evidence synthesis, pattern recognition) versus which require human judgement (navigating power dynamics, determining "value" when stakeholders disagree). But again, how will you assess its efficacy? By comparing it to the 50-day (traditional) evaluation? I've seen plenty of poor traditional evaluations too!

Looking forward to the one-day version!

Ivan Tasic

Great one, Julian. I'm thinking about potential uses of the AI-generated evaluand (evaluations without a real-world evaluand) beyond the "dystopian" scenario. Maybe M&E capacity building (for organisations) and M&E professional development in general could benefit from it, using it as a kind of simulator, an M&E playground. Or does this idea fall under AI agents and self-service? Not sure.

Anyway, if we are moving towards evaluations without human evaluators, do we need such capacity building? 🤔

