Lowest Common Denominator in Data Curation

A few months back, while participating in a World Bank study of National Statistics Offices (NSOs) in the context of government-sponsored Open Data initiatives, and later in a research project to provide NSOs with guidance on data dissemination platforms for Open Data, I realized, not surprisingly, that ‘microdata’ (as a statistical product) was emerging as an outlier.

Not that microdata can’t be ‘Open Data’ (machine readability, open standards and free reuse – some of the key prerequisites for Open Data – are not too difficult to put together), but specifically from the perspective of data dissemination policies and the platforms used for distribution, ‘microdata’ as a data type certainly stands out in NSOs.

From a policy perspective, ‘microdata’ demands (and rightly so) strong consideration of statistical disclosure control (or anonymization) for each microdata set. NSOs therefore can’t apply one standard dissemination policy to determine microdata access as well; it requires special and additional terms.

From a platform angle as well, microdata distribution requires unique treatment. Some of this (such as user authentication) stems from access considerations, but some is due to the very characteristics of the data itself: its indivisible, unit-record granularity calls, for example, for a relational database rather than a simple file system. In practice, it is normal for NSOs to run a distinct platform for microdata, separate from one for indicators/time-series data, one for publications in PDF, and one for geospatial data.
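To illustrate why unit-record granularity pushes microdata toward relational storage, here is a minimal sketch using an in-memory SQLite database. The survey tables, column names and records are all hypothetical, invented for the example; the point is that questions asked of individual records (here, an employment rate over working-age persons) need row-level querying that a flat file of pre-aggregated indicators cannot serve.

```python
import sqlite3

# Hypothetical anonymized household-survey extract (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE household (
        hh_id    INTEGER PRIMARY KEY,
        district TEXT,
        hh_size  INTEGER
    )
""")
conn.execute("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        hh_id     INTEGER REFERENCES household(hh_id),
        age       INTEGER,
        employed  INTEGER  -- 1 = employed, 0 = not employed
    )
""")
conn.executemany("INSERT INTO household VALUES (?, ?, ?)",
                 [(1, "Gasabo", 3), (2, "Huye", 2)])
conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?)",
                 [(1, 1, 34, 1), (2, 1, 31, 1), (3, 1, 6, 0),
                  (4, 2, 58, 0), (5, 2, 55, 1)])

# A unit-record query: employment rate among persons aged 16-64.
# Answering this requires access to individual rows, not published aggregates.
rate = conn.execute("""
    SELECT AVG(employed) FROM person
    WHERE age BETWEEN 16 AND 64
""").fetchone()[0]
print(round(rate, 2))
```

The same granularity is what makes disclosure control necessary: each row describes one respondent, so access and anonymization have to be managed at the record level.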

Microdata handling, I think, invokes the lowest common denominator in data curation, and if positioned strategically it could play a significant role in harmonizing statistical data processing, right up to dissemination, for all data types in NSOs.

At NISR, NADA, or National Data Archive (an open-source, web-based microdata cataloguing tool developed by the International Household Survey Network (IHSN)), has been in use since July 2012 for distributing microdata sets along with detailed metadata. It also incorporates a workflow conforming to the rules and procedures governing their reuse, so it certainly offers more than just a simple means of distributing microdata.



NADA treats a survey or a census (a ‘study’) as its cornerstone and offers a logical flow to data curation, conforming to the ethos of the Generic Statistical Business Process Model, with a direct impact on getting data used.

I’m wondering: can NADA (with appropriate additional tweaking) serve as a generic framework for distributing all kinds of statistical data comprehensively? Or must NSOs continue struggling to make constantly proliferating tools work together in order to maintain the consistency and timeliness required to take usable data to their intended users?

That is the question I’m pondering right now.
