Measuring what matters: a chicken-and-egg problem in evaluation
“What matters” should determine what gets “measured” - not the other way round. But the evidence actually available may constrain which criteria we can practically assess. How can we navigate this?
In program and policy evaluation, a sound evaluation design process should ensure that our work stays focused on what really counts, not just what’s easy to count. However, when evaluators come into a program mid-race (as so often happens), data collection decisions may have already been made. The available evidence may be constrained by legacy data systems, inherited indicators, and already-written reports. How can we stay principled and pragmatic?
An evaluation design process
I usually recommend the following sequence of steps, grounded in evaluative reasoning and practical experience:
1. Understand the program and its value proposition
2. Determine what aspects of performance matter enough to focus on (criteria)
3. Define what good performance looks like (standards)
4. And only then, decide what evidence is necessary, and what mix of methods will gather and analyse it effectively.
Each step builds on the last. The sequence matters because criteria and standards should drive evidence collection, not the other way around. Jumping to indicators or data sources too early lets the available evidence dictate the focus of the evaluation - with the risk that we focus on what’s easy to measure or count, not what’s most important to understand.1
Of course, this process is a guide, not a straitjacket. Evaluation rarely unfolds in tidy, linear steps. It’s iterative, adaptive, and as Forrest Gump’s mother said, “like a box of chocolates; you never know what you’re gonna get”. Every evaluation that follows these steps will implement them a bit differently, iterating between stages while still observing the underlying logic of the process.2
One common challenge is timing.
As evaluators, we’re often brought in when a program is several years old, the data architecture horse has already bolted, and our ability to influence what data has been collected is pretty limited.
For example, sometimes we’re tasked with making sense of value-for-money (VfM) reports from multiple implementation partners in large development programs. Superficially, these reports may have features in common, such as using the 5Es (economy, efficiency, effectiveness, cost-effectiveness and equity) as criteria. But on closer inspection, each follows its own indicator framework, with limited comparability in data definitions, scope, presentation of results and so on - and no explicit basis for determining what the results mean collectively, nor whether overall VfM should be judged as excellent, good, adequate or poor.
In these cases, following the recommended design sequence for evaluations - i.e., developing criteria and standards before considering evidence requirements - will likely reveal evidence gaps. That’s not a flaw in the process - it’s a feature. An evidence gap is an evaluation finding. Criteria and standards represent an agreed statement about ‘what matters’ and ‘what good looks like’. If something matters but isn’t being tracked, that’s a valuable evaluative insight. Our recommendation should be clear: start collecting evidence about what matters.
Yes, but we still have to deliver an evaluation using the available data
The real world is an imperfect place and we have to meet it where it is, providing approximate answers to important questions. When working under data constraints, we can adapt the evaluation design process slightly without abandoning its principles. Here’s a pragmatic adaptation of the design sequence from the process diagram above. It emphasises criteria guiding evidence needs (dark circles), while allowing the availability of evidence to inform a practical evaluation scope (light circles).
Step 1: Understand the program
In addition to the usual work we would do in any evaluation to learn about the program, its context and stakeholders, this step can include the following process to scope data availability and limitations:
Define the program’s value proposition: To whom and in what ways is it valuable? How does the program create this value? What critical factors affect whether it creates a lot of value or a little?
Take stock of existing data: e.g., indicators, monitoring systems, previous evaluations. What evidence do we have to work with?
Together, the value proposition and data stocktake give us the foundation for a gap analysis at the next stage.
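If it helps to keep these Step 1 outputs in a structured form, here’s a minimal sketch in Python. The program details, critical factors and data sources are entirely hypothetical, and this is just one of many ways a stocktake could be recorded:

```python
# A minimal, hypothetical record of the Step 1 outputs, so they can be
# compared against criteria later. All program details are invented.

value_proposition = {
    "valuable_to": ["rural households", "district health services"],
    "how_value_is_created": "community health workers improve access to basic care",
    "critical_factors": [
        "coverage of remote villages",
        "quality of referrals",
        "retention of trained health workers",
    ],
}

data_stocktake = [
    # What evidence already exists, what it covers, and how usable it is
    {"source": "monthly monitoring reports", "covers": "participation rates",
     "notes": "complete, quantitative"},
    {"source": "partner VfM reports", "covers": "unit costs",
     "notes": "definitions differ across partners"},
    {"source": "2023 mid-term review", "covers": "quality of referrals",
     "notes": "qualitative, one region only"},
]
```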
Steps 2-3: Develop criteria and standards
Criteria development becomes a pragmatic exercise: identifying criteria that connect what’s important with the data already available, while still leaving room for criteria that can’t yet be addressed from the existing evidence:
Identify what matters: Setting the data stocktake aside for a moment, identify which aspects of the value proposition make the greatest difference to whether the investment creates a lot of value or a little, and which aspects matter the most to stakeholders. These aspects are good prima facie candidates for criteria.
Identify commonalities and gaps between criteria and evidence: Compare the list of potential criteria with the evidence available (a simple sketch of this comparison follows this list). Where are the points of alignment? Where do the available data and potential criteria diverge? In principle, the data being collected should address aspects of performance that matter - but this mustn’t be assumed. You’d be amazed at the indicators that get chosen because they’re easy to count.3 Just because something is being measured doesn’t mean it matters - and conversely, just because we don’t have data on something doesn’t mean it’s not important.
Design a practical set of criteria and standards: The selected criteria are informed by the availability of data, but not constrained by it. We must resist the temptation to let the evidence ‘tail’ wag the evaluation design ‘dog’. We should be comfortable including criteria and standards that can’t be addressed with existing data, on the basis that they define something crucial to the validity of the evaluation framework.
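To make that comparison concrete, here’s a rough sketch of how candidate criteria might be checked against the data stocktake. The criteria and sources are invented for illustration; the point is the logic, not the code:

```python
# A hypothetical gap analysis: compare candidate criteria (what matters)
# with the evidence already held (the data stocktake). All names are invented.

candidate_criteria = [
    "equitable access for remote villages",
    "quality of referrals",
    "retention of trained health workers",
    "cost per person reached",
]

# Which existing sources can plausibly speak to each criterion
evidence_by_criterion = {
    "cost per person reached": ["partner VfM reports"],
    "quality of referrals": ["2023 mid-term review"],
    # nothing yet on equitable access or workforce retention
}

for criterion in candidate_criteria:
    sources = evidence_by_criterion.get(criterion, [])
    status = ("covered by: " + ", ".join(sources)) if sources else "GAP - no existing evidence"
    print(f"{criterion:<45} {status}")

# A criterion with no evidence stays on the list: the gap itself is an
# evaluation finding, not a reason to drop the criterion.
```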
Step 4: Determine evidence needed
By now we know what data we have. But what do we need? The process above may identify critical gaps in the existing data. Some gaps can be addressed within the scope of the evaluation (or perhaps at a later stage in the evaluation); others become recommendations for future data strategies.
For example, suppose we are evaluating the effectiveness of a community health program. After defining criteria and standards, we discover that while there is robust quantitative data on program participation rates, there is a lack of qualitative data capturing participant experiences and barriers to access. As we have explicit criteria and standards in place before selecting data sources and methods, we can systematically assess whether these gaps are critical to answering our evaluation questions. In this case, we might decide to incorporate focus groups or interviews to fill the immediate gap in qualitative insights. However, if we also find that demographic data on non-participants is missing and cannot be feasibly collected during the current evaluation, this gap would be documented as a limitation and form the basis for a recommendation to improve data collection processes in future program cycles.
By establishing clear criteria and standards up front, we ensure that our evaluation design is both responsive to immediate information needs and proactive in guiding longer-term data strategy improvements.
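As a rough illustration of that triage logic - with invented gaps and a deliberately simplified decision rule - the sorting might look something like this:

```python
# Hypothetical triage of evidence gaps identified against the criteria.
# 'critical' = needed to answer the evaluation questions;
# 'feasible_now' = can realistically be collected within this evaluation.

gaps = [
    {"gap": "participant experiences and barriers to access",
     "critical": True, "feasible_now": True},    # e.g. add focus groups or interviews
    {"gap": "demographic data on non-participants",
     "critical": True, "feasible_now": False},   # document as a limitation
    {"gap": "longitudinal health outcomes",
     "critical": False, "feasible_now": False},  # future data strategy
]

for g in gaps:
    if g["critical"] and g["feasible_now"]:
        action = "fill within this evaluation (new data collection)"
    elif g["critical"]:
        action = "report as a limitation and recommend future collection"
    else:
        action = "note for the longer-term data strategy"
    print(f"- {g['gap']}: {action}")
```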
Bottom line
Not everything that can be counted counts, and not everything that counts can be counted - William Bruce Cameron (1963).
Our job as evaluators is to illuminate what matters, not just what’s measurable. A principled, transparent, and adaptive design process helps us do exactly that - even when we come in halfway through the story.
Of course, the ultimate solution is to involve evaluators from the very beginning, so that program design and evaluation design are integrated, and the right data gets collected from the outset.
Acknowledgement
Many thanks to Zara Durrani for peer review. Errors and omissions are mine. This post represents my professional opinion alone, not that of any of the organisations I work with.
1. Measuring is more than just counting. Counting tells us “how many.” Measuring tells us “to what extent,” “with how much variation,” and so on - giving us a deeper, more useful understanding of the world. Measuring often requires more effort than counting but it pays back in better-informed evaluation. And so does qualitative evidence. There is sometimes an assumption that numerical data is inherently objective and qualitative data is inherently subjective. Not true. Facts are value-laden. Any form of evidence can be used objectively or subjectively. For example, the selection of indicators can be highly subjective, as can the interpretation of indicator data. Using multiple evidence sources together can strengthen the evaluation.
2. For example, in complex, emergent, fast-moving environments it can be challenging to define criteria and standards in advance. Nonetheless, it is still possible and desirable to predetermine criteria and standards to a sensible degree of detail and specificity. The more fluid the situation, the less detailed and less specific the criteria and standards - and the more likely we may revisit them later, to update them based on contextual shifts or learning. To avoid accusations of fudging, amendments to criteria and standards should be documented with supporting rationale.
3. For example, consider the indicator: “Number of new regulations introduced”. Say the actual result was 6 and the target was 7. Is the result poor, adequate, or good? Whatever you think it is, if you based it on the number, you’re wrong. The number tells us nothing about the nature of the regulations - their relevance, strategic importance, reach, enforceability, stakeholder buy-in, compliance costs, potential or actual impacts (positive and negative), or economic value. One impactful regulation is better than 7 ineffectual ones. No regulations at all is better than a regulation that does net harm to people and the economy. The indicator may be easy to count, but that doesn’t mean it matters.