Lowest Common Denominator in Data Curation

A few months back, while participating in a World Bank study of national statistics offices (NSOs) in the context of government-sponsored open data initiatives, and later in a research project providing guidance to NSOs on data dissemination platforms in the context of Open Data, I realized (not surprisingly, though) that 'microdata', as a statistical product, emerges as an outlier.

Not that microdata can't be 'Open Data': machine readability, open standards and free reuse, some of the essential prerequisites for Open Data, are not too difficult to put together. But specifically from the perspective of data dissemination policies and the platforms used for distribution, microdata, as a data type, certainly stands out in NSOs.

From a policy perspective, microdata demands (and rightly so) strong consideration of statistical disclosure control (e.g. anonymization), an essential prerequisite for dissemination. NSOs, therefore, can't simply apply one standard dissemination policy to determine microdata access as well; instead, it requires special and additional terms.

From a platform angle as well, microdata distribution requires unique treatment. Some of this (such as user authentication) follows from the access considerations, but some follows from the characteristics of the data itself (its indivisible granularity), requiring, for example, a relational database for storage rather than a simple file system. In practice, it is normal for NSOs to run a distinct platform for microdata (separate from one for indicators/time-series data, one for publications in PDF and one for geospatial data).
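To illustrate the point about granularity, here is a minimal sketch (purely hypothetical; it does not reflect NADA's or any NSO's actual storage layer). Respondent-level microdata fits naturally into a relational table, from which an indicator is derived by aggregation; publishing only the aggregate in a flat file would lose the record-level detail that makes microdata valuable, and sensitive.

```python
import sqlite3

# Hypothetical respondent-level microdata: each row is one household record.
# Schema and figures are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE household (
        hh_id INTEGER PRIMARY KEY,
        district TEXT,
        size INTEGER,
        monthly_income REAL
    )
""")
conn.executemany(
    "INSERT INTO household VALUES (?, ?, ?, ?)",
    [
        (1, "North", 4, 120.0),
        (2, "North", 6, 95.0),
        (3, "South", 3, 210.0),
    ],
)

# An indicator (the kind of aggregate a time-series platform publishes)
# is derived from the microdata; the reverse derivation is impossible.
rows = conn.execute(
    "SELECT district, AVG(monthly_income) FROM household "
    "GROUP BY district ORDER BY district"
).fetchall()
print(rows)  # [('North', 107.5), ('South', 210.0)]
```

The aggregate alone could indeed live in a simple file, which is why the other dissemination platforms get away with flat storage, but the record-level table, and the disclosure risk it carries, is what forces microdata onto a different platform and policy.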

Microdata handling, I think, invokes the lowest common denominator in data curation. Positioned strategically, it could play a significant role in harmonizing statistical data processing, leading up to dissemination, for all data types in NSOs.

At NISR, NADA (National Data Archive), an open-source web-based microdata cataloguing tool developed by the International Household Survey Network (IHSN), has been in use since July 2012 for distributing microdata sets along with detailed metadata. It also incorporates a workflow conforming to the rules and procedures required for their reuse, so it offers considerably more than a simple means of distributing microdata.


NADA treats a survey or a census (a 'study') as its cornerstone and offers a logical flow to data curation, conforming to the ethos of the Generic Statistical Business Process Model, with a direct impact on getting data used.

I'm wondering: can NADA (with appropriate additional tweaking) serve as a generic framework for distributing all kinds of statistical data comprehensively, or must NSOs continue struggling to make a constantly proliferating set of tools work together to maintain the consistency and timeliness required to take usable data to their intended users?

That is the question I’m pondering right now.
