STARTING my job at Development Pathways a year ago meant an important change in my career: I started conducting far more (‘raw’) data analysis than I had ever done before. Sure, I had worked with micro- or household-level data on and off before, but this was the ‘real thing’: working with data, all day, every day (ok, ok, most days). Over the course of this year, I have learned three major lessons regarding data, writes Heiner Salomon.
I have learned that:
- there is much more (raw) data out there than most people would think, including in low-income countries;
- data is still very hard to access; and
- data is always political.
Abundance of raw data
But let’s start from the top. When people talk about big data, they usually mean swathes of data gathered through people’s smartphones, their browsers, or the ‘internet of things’, often in the context of high-income countries.
But the advent of cheap data gathering and storage opportunities – as a consequence of ever faster processors and plummeting hardware costs – has improved data collection affordability also for statistics offices, including those in low-income countries.
If I want to know whether there is (micro-)data on a certain topic, I start by checking the International Household Survey Network’s micro-data catalogue – it provides a great overview of, and download options (or at least links) to, all kinds of data sets relevant to my work. That includes the more standardised, high-quality data sets that have been institutionalised in low-income countries over the last two decades or so. Good examples of that – for social statistics – are the World Bank’s Living Standards Measurement Study surveys, USAID’s Demographic and Health Surveys and UNICEF’s Multiple Indicator Cluster Surveys.
Of course, data gaps still remain for too many countries and yes, data quality is still a critical issue (Morten Jerven eloquently points that out in his book Poor Numbers).
Data often not (easily) accessible
However, the major problem is the accessibility of all that data. It starts with the fact that the raw data is often not available for analysis (especially for census data), frequently due to (important and necessary) data protection laws.
But even beyond this ‘physical’ inaccessibility, there is another, much bigger barrier: even with access to the raw data, most people could not make sense of it, as not everyone is trained in data analysis. So even if the raw data were available, if no one has been paid to extract the relevant information from it, that raw data might as well not exist.
That is an important reason why, despite the availability of much of the raw data, many of the over 200 Sustainable Development Goal (SDG) indicators are not easily available (as this CGDev blog showed last year). The problem is larger still once one tries to disaggregate the data to obtain information beyond national averages, for instance by sex, by rural and urban areas, or by sub-national region.
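To see why disaggregation matters, here is a minimal sketch with entirely hypothetical household records (the region names, the `improved_water` variable and the numbers are invented for illustration, not taken from any real survey):

```python
from collections import defaultdict

# Hypothetical household records: each dict is one surveyed household
# with a binary outcome (access to an improved water source).
households = [
    {"region": "North", "area": "urban", "improved_water": 1},
    {"region": "North", "area": "rural", "improved_water": 0},
    {"region": "North", "area": "rural", "improved_water": 1},
    {"region": "South", "area": "urban", "improved_water": 1},
    {"region": "South", "area": "rural", "improved_water": 0},
    {"region": "South", "area": "rural", "improved_water": 0},
]

def coverage_by(records, key):
    """Share of households with the outcome, disaggregated by `key`."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += r["improved_water"]
    return {group: hits[group] / totals[group] for group in totals}

national = sum(r["improved_water"] for r in households) / len(households)
print(round(national, 2))               # 0.5 -- the national average
print(coverage_by(households, "area"))  # urban 1.0 vs rural 0.25
```

The national average of 50 per cent hides a complete urban–rural split; a single headline indicator can look fine while one group is left entirely behind.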
And extracting the relevant indicators from the raw data is not always as easy as it sounds. While working on the SDG baseline report for children in Indonesia, we as analysts often had to make judgement calls despite the extensive metadata guidelines on each SDG indicator. Is a flushing pit latrine without a slab, where the faeces are disposed of in a hole in the ground, comparable to a ‘flush or pour flush toilet to […] [a] pit latrine’ – and hence an ‘improved sanitation facility’ – or not?
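A judgement call like that is, in code, just one line of a classification rule – and the headline number moves with it. The sketch below uses made-up facility categories (the labels and shares are illustrative, not the actual Indonesian data or the official JMP coding):

```python
# Two analysts, two defensible readings of the same indicator metadata:
# does 'flush/pour-flush to pit latrine' cover a flushing latrine
# without a slab? Strict reading says no; lenient reading says yes.
IMPROVED_STRICT = {"flush_to_sewer", "flush_to_septic", "flush_to_pit_with_slab"}
IMPROVED_LENIENT = IMPROVED_STRICT | {"flush_to_pit_no_slab"}

# Hypothetical facility codes for four surveyed households.
facilities = ["flush_to_sewer", "flush_to_pit_no_slab",
              "open_defecation", "flush_to_septic"]

def improved_share(facilities, improved):
    """Share of households classified as using an improved facility."""
    return sum(f in improved for f in facilities) / len(facilities)

print(improved_share(facilities, IMPROVED_STRICT))   # 0.5
print(improved_share(facilities, IMPROVED_LENIENT))  # 0.75
```

Same raw data, same metadata text, two defensible indicator values – 50 versus 75 per cent coverage.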
Also, even if a data analysis has been undertaken, is it easily accessible? Can one order the publication or download it? And even if one can download it, is it in a machine-readable format (such as a .pdf) or not (like a scan)? Is there a possibility to download the data tables separately? Or is one lucky enough to have an entire website dedicated to presenting the data (as is the case with the SDG baseline for children in Indonesia)?
Data is political
The level of availability has a lot to do with politics. Because, at the end of the day, it is vital to remember that data is always political. Whether data is collected in the first place is determined by whether it is deemed important. That is why we have information on GDP and GDP growth even in low-income countries, but hardly any on gender-based violence, for instance.
And even once some raw data has been collected, calculating the final indicator that you present always involves a lot of choices, and those choices can lead to widely varying results. Calculating poverty rates is an excellent example of that: poverty rates are an important political number. Poverty – especially absolute poverty – going down is usually read as a sign that the government is doing great, whereas poverty going up is a sign the government is doing something wrong.
However, the choices involved in setting (absolute) poverty lines are nearly uncountable: what goes into a bundle of minimum goods that suffices for a minimal but acceptable standard of living? And even if one starts, as is the case in most low-income countries, from some form of minimum calorie requirement, the choices are hardly less complex: which foods do you include in the basket to fulfil the calorie requirements? More rice or more bread, more maize or more cassava, and so on. Then there is the issue of price deflation: 1 kg of potatoes costs more now than it did 20 years ago, and it costs more in the capital than it might in the rural areas where it is produced. But what baskets of goods are the price deflators based on, and where has the data for the regional deflation corrections been collected (often only in the capital or a few of the largest cities)?
The above is obviously just an extremely condensed list. However, as all these choices influence the poverty line and consequently the poverty rate in a country, analysts can easily take a number of readily justifiable decisions that significantly affect the final poverty rate, if the political pressure is high enough. For more detail, have a look at this illuminating blog post, for instance, which looked at the poverty rate in Rwanda earlier this year.
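How sensitive the headline rate is to those choices can be shown in a few lines. The consumption figures and the two candidate poverty lines below are invented for illustration (they are not real survey data, and the lines are not any country's official thresholds):

```python
# Hypothetical per-capita daily consumption values (local currency units)
# for ten people, sorted from poorest to richest.
consumption = [0.9, 1.1, 1.4, 1.8, 2.1, 2.6, 3.0, 3.5, 4.2, 5.0]

def headcount_ratio(consumption, poverty_line):
    """Share of people below the line -- the poverty headcount ratio."""
    return sum(c < poverty_line for c in consumption) / len(consumption)

# A rice-heavy vs maize-heavy food basket, or a different regional price
# deflator, can shift the line by a modest amount -- and the headline
# poverty rate moves with it.
for line in (1.90, 2.15):
    print(line, headcount_ratio(consumption, line))  # 0.4 vs 0.5
```

Here a 13 per cent shift in the line moves the measured poverty rate from 40 to 50 per cent of the population – a very different political story from the same underlying data.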
Bottom line: We need more data literacy
I hear the battle cry for ‘more data’ all the time amongst everyone dealing with statistics and data analysis, and amongst ‘international development workers’ more generally. I support that sentiment wholeheartedly. But at the same time, I think it is equally important to educate everyone on how to interpret data, because simply having more data just adds more noise. As the letter of the UK Statistics Authority on Foreign Secretary Boris Johnson’s claim that leaving the EU would save £350m a week shows, collecting and aggregating data on its own does not make for better policies if no one understands or trusts the statistics.