The comparability challenge: using rubrics to compare programs
Balancing specificity and generality in rubric design
Comparing things is challenging no matter what methods and tools you use. Rubrics can help keep the reasoning transparent.
This is a follow-up post to one I wrote last year about using ratios in value-for-money (VfM) assessment. It’s often assumed that ratios are easy to produce, and that they’ll tell us a lot about VfM - but in reality, there are multiple hurdles to meet, one of which is the comparability challenge. A ratio is just a number unless we have something to compare it with. Benchmarking or comparison with similar programs requires outputs or outcomes that are relatively homogeneous, access to comparator data, ratios calculated on a consistent basis for all comparators, and contextual differences accounted for. As I said in that earlier post:
That’s a stringent set of conditions to meet. But without them, ratios risk becoming an exercise in wishful thinking. The numbers may look objective, but without careful attention to context, comparability, and the underlying assumptions, we could be fooling ourselves.
In this post, I look at another tool that can support comparisons between programs: rubrics. In previous posts, I’ve explored different kinds of rubrics we can use for evaluation, ranging from generic to highly context-specific. Importantly, these approaches aren’t mutually exclusive. My go-to approach for assessing value for money (VfM) blends aspects of generic, analytic and holistic rubrics, as explained here. Today, I want to build on that foundation and tackle a thorny issue: can we use rubrics to compare different programs - and if so, how?
The trade-off between specificity and generality
Rubrics support transparent evaluative judgements by setting out explicit criteria (aspects of performance) and standards (levels of performance) that help us to make sense of evidence and support the validity of evaluative claims. One of the ways to enhance validity is to involve stakeholders in rubric development. Another is to make rubrics context-specific so they relate to the unique features of performance and value that matter in a particular program, at a particular time and place.
Rubrics are powerful in part because they reflect the values and priorities of stakeholders in a particular setting. Deeply contextualised, program-specific rubrics are good for clarity and support meaningful evaluation, but the flipside is that this strength can become a limitation when it comes to comparing performance between programs.
For example, the 2026 Magenta Book states that while rubrics offer flexibility to draw on a range of quantitative and qualitative evidence, “they will generally not produce a numerical value summarising the value for money of the intervention. This can limit the comparability of findings across different interventions”. This post represents a Cubist Evaluation of that claim: in my view, the Magenta Book’s statement isn’t completely wrong, but it isn’t completely right either.
The first thing to acknowledge is that such comparisons are inherently challenging, with or without rubrics.
Different programs operate in varied contexts, including geographic, demographic, cultural, and socio-economic factors, population needs, and local responses. Whether we’re comparing numerical or narrative data (or both), we need to identify contextual differences and take them into account.
Quantitative indicators like cost-efficiency, cost-effectiveness, and benefit-cost ratios can also struggle with comparability, as I’ve discussed before. Each cost-benefit analysis (CBA) design, for example, involves many analyst decisions (e.g., scope, perspective, time horizon, discount rate) and contextual factors that influence results. There are many ways to estimate monetary values of different impacts, leading to differences between studies. Differences in methods, data, study populations, and the types of impacts included (or omitted) mean that direct comparisons between studies require careful scrutiny.
The upshot is that benefit-cost ratios of different programs are often not as comparable as they may appear. Comparisons are most reliable when the analyses are based on equivalent methodological decisions like those noted above, occurred at the same time and place, and examined programs of a similar nature. The more they depart from these conditions, the more careful we have to be in making comparisons.
So, the comparability challenge isn’t new - and it certainly doesn’t disappear when we use rubrics.
Can comparisons still be useful?
Despite these challenges, comparisons (numerical or qualitative) can be valuable if we approach them thoughtfully. Instead of relying solely on tables of indicators, we need to engage with mixed-methods evidence and treat context as a crucial factor. This isn’t some sort of optional add-on to economic analysis - it’s absolutely crucial to make meaningful comparisons.
This is where rubrics can shine. They provide structured frameworks for deliberation, guiding evaluators to consider evidence carefully without forcing simplistic or inappropriate conclusions.
Designing rubrics for comparison: finding the right balance
Rubrics are tailored to fit the purpose of an evaluation. For example, we can design generic criteria and standards that can be used for any program, bespoke rubrics tuned to one program, or we can meet in the middle and develop rubrics that support comparisons in a specific context. Let’s unpack these three options a bit.
Option A: Generic criteria and standards that can be used for any program.
A generic set of standards can be as simple as:
Excellent: exceeding expectations
Good: generally meeting reasonable expectations
Adequate: meeting minimum requirements and showing acceptable progress
Poor: falling short of minimum requirements or acceptable progress.
These standards can be used with any set of criteria. To support like-with-like comparisons, a common set of criteria could be defined and used for all programs, such as:
Economy: good stewardship of resources
Efficiency: productivity and ways of working to maximise value
Effectiveness: contributing to positive outcomes that meet the needs of the target group
Equity: addressing inequities through fair and equitable allocation of resources, actions, and outcomes
Cost-effectiveness: meeting its value proposition to key stakeholder groups.
These criteria can be applied to any program. They take relatively little effort to develop. However, they don’t offer a lot of guidance for making evaluative judgements, increasing the cognitive burden and the potential for wide variation at the synthesis stage, when deliberating on the evidence.
As I have argued previously, this approach may be appropriate and provide just-enough guidance when working in highly complex contexts where processes and outcomes are more emergent. Generic rubrics can also be suited to making comparisons across multiple programs. However, a strong moderation process is needed to ensure judgements are as consistent as they appear, and to ensure important contextual differences are properly considered.
Overall, any up-front specificity we can reasonably add to make these rubrics slightly less generic will be helpful when it comes to making well-formed judgements.
Option B: Bespoke rubrics for each program
If a rubric is designed for a single program at a specific place and time, it’s well-attuned to that context but not intended for any other context. However, if we know in advance that we want to compare programs or make judgements on a consistent basis, we can align each program’s rubrics with a set of generic criteria and standards like the ones above. This way, even when each program has its own bespoke rubrics, they still reflect the same underlying definitions for each aspect and level of performance.
This is the approach we described in Oxford Policy Management’s guide to assessing value for money: terms like excellent and good are used consistently across multiple VfM assessments.
Option C: Meeting in the middle
To facilitate comparisons across a portfolio of programs with shared objectives and features, we can strike a balance between the generic rubrics of Option A and the bespoke rubrics of Option B. Here, the rubric is tailored enough to be meaningful, and general enough to support valid comparisons across multiple investments within a portfolio or sector. Here’s an example developed for a set of market development programs.

The technocratic-to-deliberative continuum
As I’ve said before, there’s more to evaluative reasoning than rubrics. Rubrics don’t make evaluative judgements - people do. Technocratic approaches to synthesising evidence with explicit criteria and standards (such as multi-criteria decision analysis, if-then statements, rubrics, etc) can provide structure for making transparent evaluative judgements, but evaluation also involves a set of cognitive and social processes - including mixed reasoning approaches that combine deliberative reasoning for inclusive rigour, all-things-considered approaches to weigh alternative conclusions, and tacit judgement to surface considerations that may not have been made explicit. The balance of reasoning strategies will vary.
Similarly, we can conceptualise potential approaches to comparison along a continuum from technocratic to deliberative:
Cost-effectiveness, cost-utility, and cost-efficiency ratios: These are highly technocratic - comparison “on rails.” They provide a ratio expressed as the average cost per unit of output or outcome (e.g., cost per student or per graduate), giving us a number we can use to compare options. Careful though: the scope and definitions of costs and outputs/outcomes need to be consistent across every intervention being compared. These methods use measurement and independent observation but still require judgement in design and interpretation, including attention to contextual differences. They can be used to compare across different interventions and health conditions, but work best when programs are similar in their objectives, populations, and contexts. This reduces the risk that observed differences in ratios are due to contextual or methodological factors rather than true differences in program value.
Benefit-cost ratios: Not quite on rails to the extent of cost-effectiveness analysis, but still following a paved highway. Still fairly technocratic, but with extra room for analyst judgement because the benefit measure is more flexible - for example, in determining which benefits to value in monetary units, and how to place monetary values on them. Benefit-cost ratios may look objective, but they’re shaped by subjective choices. This can make comparisons problematic. Moreover, rather than aiming for a single ratio, cost-benefit analysis is often better viewed as a “voyage of discovery, promoting systematic exploration of the evidence on policy impacts” - i.e., scenario analysis, producing a range of results under different assumptions. Using CBA in this exploratory way is a key strength of the method, but does place further limits on comparability. It is not only rubrics that face such trade-offs!
Tight rubrics (like Options B and C): Comparison on a well-marked offroad trail. The logic is clear, the scaffolding for evaluative claims is detailed, and the approach is transparent. People are making judgements, but those judgements can be decently comparable when the process is well designed and contextual differences are explicit.
Loose rubrics (like Option A): This is more like bush-bashing - harder to find your way, but generic criteria and standards serve as landmarks. More deliberation and effort are needed to build a logical argument from evidence to judgement. Comparisons may still be possible, but as the scaffolding gets looser, the reasoning has to compensate with meticulous attention to how the argument is constructed and warranted.
Everywhere on this continuum, context is important, and a strong moderation process is needed to ensure comparisons are valid, judgements are made in a consistent way, and the logic underpinning the arguments is well documented. No method removes the necessity of careful deliberation and judgement.
Importantly, these options are not mutually exclusive. For example, rubrics can be combined with economic ratios, providing an explicit basis for taking wider contextual factors into account.
Conclusion: meeting in the middle
The primary challenge isn’t in using rubrics to compare programs, but in using any method or approach (whether numerical or qualitative) to make valid, meaningful comparisons, using a framework that supports like-with-like analysis while recognising contextual differences.
No matter how you slice it, comparisons are fraught. Numbers may help, but blind faith in indicators won’t. Ratios may look objective, but they require just as much care and attention to context as narrative evidence. The beauty of rubrics is that they help to make this deliberation intentional and transparent.
In the end, the best comparisons are those that are both rigorous and reflective, embracing complexity rather than glossing over it. That’s the value rubrics and collaborative deliberation can bring to the table. And that’s why rubrics don’t “limit the comparability of findings across different interventions” as the Magenta Book claims: on the contrary, they can support better comparisons when designed with this purpose in mind.
Thanks for reading!
Substack shifting into 5th gear
You might notice a small change in rhythm over the next while. For the past three-and-a-half years, I’ve shown up here almost every week with a longform post. I set myself that challenge as a way of writing a book, non-sequentially, one puzzle piece at a time. Now that the book is nearly complete, I’m shifting Substack to a slightly roomier cadence of roughly one substantial post every two weeks.
My existing and future Substack posts will continue to fill a niche in the VfI resource library (alongside my web resources). While the book will provide a stand‑alone foundation for VfI, Substack will still be the home for deep dives, case examples, ongoing developments and exploring new questions.
This change in output will give each new piece more space - more time for you to read, use and share it, and more time for me to make it worth landing in your inbox. There will still be the occasional extra note when something timely or curious pops up. But if you notice a little more breathing space between posts, that’s intentional - and, I hope, useful.


