What will it take to improve statistical data dissemination in the digital realm?

[Summary: In this blog post, I argue that the three fundamental ways to improve data dissemination using digital platforms in National Statistics Offices (NSOs) could be: instilling comprehensive data curation processes; automating data transfers from aggregated tables to the online statistical data dissemination platforms; and integrating those platforms seamlessly with Open Data portals.]

Dissemination is considered the last leg of the entire process of statistical work in a typical National Statistics Office (NSO). Over the years, at the National Institute of Statistics of Rwanda (NISR), I’ve watched the dissemination process evolve and undergo significant changes. Yet I can’t help but think of the challenges that remain.

The paradigm shift had started long before, though, I think, marked by data stored on digital media being made available to data seekers. Before that, the only medium was paper-based publications. In other words, the significant change in dissemination was the shift from paper-based publications to digital media, underpinned by the prospects of new content (e.g. microdata) and, later, of Web-based online data delivery mechanisms shaped by the Internet, hitherto not possible.

In fact, I believe that it was only after statistical data started being accessed on the Web that the realization dawned on NSOs of the enormity of this change. [Before that, takeaway CD-ROMs, flash memory sticks (or other digital storage media) and even desktop versions of data dissemination software were in use, in addition to, or instead of, paper-based publications – digital, but still carrying some of the old pain points strongly associated with physical distribution, such as logistical complexity and higher costs.]

At NISR, providing data access on the Web started with making PDF copies of the paper-based publications available online (a format that often receives criticism from open data proponents because it constrains re-use of the data contained in it). However, the game-changing functionalities offered by the combination of digital media and the Web were properly realized only after the introduction of online statistical data dissemination platforms (such as DevInfo, Prognoz, NADA, Redatam (IMIS) and others), offering machine-readable data through online databases.

The platforms brought many accolades, but soon, challenges started to emerge.

In the case of ‘indicator’ type data (aggregated data, as opposed to microdata) made available through the online platforms – localized adaptations of DevInfo and Prognoz – the immediate challenge was maintaining harmony between the online data and the data contained in the paper-based publications!

Because data is entered manually (or, in some cases, semi-manually), voluminous data is challenging to enter. Consequently, these online platforms never held all the data that was produced, giving rise to a practice of selective data entry (based on the expressed or perceived needs of data seekers). Manual entry also introduced sporadic human errors. Furthermore, batched data entry adversely affected the timeliness of online data updates. And the simultaneous presence of two distinct platforms meant that data had to be entered twice – multiplying the human effort required to keep them in sync with each other.

Thankfully, in parallel with these online platforms, PDF copies of the paper-based publications containing the comprehensive data were also available online, serving data seekers, albeit rather inconveniently.

Developed over a long period, the traditional dissemination regime in NSOs, underpinned by paper-based publications, has had a robust mechanism in place to ensure required due diligence, enabling many eyes to view the results before final publication. In other words, these reports go through draft stages where errors are flagged and corrected before release and broader dissemination. However, with the new set-up of IT tools for data dissemination, the old processes are struggling to keep up with the challenges posed by the new environment in ensuring the required quality assurance and in building trust in data made available through the online platforms.

These processes need to be updated to reflect the changed scenario.

The situation, in fact, offers an opportunity to analyze the entire data management process; appropriate changes to it hold the potential to fully utilize the functionalities offered by modern tools and technologies in efficiently and effectively meeting the data dissemination obligations of NSOs.

In this context, I would argue that introducing data curation practice – as a distinct function – is an imminent step for NSOs that seek to manage data through its “lifecycle of interest and usefulness”: maintaining quality, adding value, and enabling re-use through online discovery and retrieval.

The use of digital means (through software such as SPSS, Stata and R) has long been around for ‘data analysis’ – [ultimately resulting in statistical tables, or the indicators contained therein, as ‘born digital’ data] – but with ‘data dissemination’ also becoming digital through distribution channels such as DevInfo and Prognoz, these two processes can and should now be integrated digitally.

The image below describes the current flow of the process (As-is), showing where the aggregated data or indicators (contained in statistical tables) that populate the online statistical data dissemination platforms (DevInfo and Prognoz) come from. It also illustrates the alternative way (To-be) of feeding data to the online platforms directly from the same ‘source’ as the printed reports.

[Figure: Scenarios of sourcing (‘born digital’) aggregated data/indicators]

In the current flow (As-is), for the indicators contained in statistical tables to appear in the online databases, data must be entered manually from the printed reports into DevInfo and Prognoz, as illustrated below by ‘manual data entry’ between stage 1 and stage 2.

[Figure: Breaks in data flow]

Therefore, an automated system that ingests indicators directly from the statistical tables would not only reduce the time taken to comprehensively populate the online statistical data dissemination platforms but would also eliminate (or minimize) the errors currently caused by manual data entry.
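To make this concrete, here is a minimal sketch (in Python) of what such an automated hand-off between stage 1 and stage 2 could look like. It assumes the analysis stage exports the aggregated indicators as a CSV file and that the dissemination platform exposes a bulk-upload REST endpoint; the file layout, URL and token are illustrative assumptions, not references to any actual DevInfo or Prognoz API.

```python
import csv
import json
import urllib.request

# Hypothetical endpoint of an online dissemination platform's bulk-upload API.
# Neither DevInfo nor Prognoz is implied to expose exactly this interface.
PLATFORM_URL = "https://statistics.example.org/api/indicators/bulk"
API_TOKEN = "REPLACE_WITH_REAL_TOKEN"

def load_indicators(path):
    """Read a 'born digital' aggregated table exported by the analysis stage.

    Expected columns (illustrative): indicator, area, period, unit, value.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return [row for row in csv.DictReader(f)]

def push_indicators(rows):
    """Send the rows to the dissemination platform in one automated call,
    replacing the manual data entry between stage 1 and stage 2."""
    payload = json.dumps({"observations": rows}).encode("utf-8")
    request = urllib.request.Request(
        PLATFORM_URL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status

if __name__ == "__main__":
    rows = load_indicators("aggregated_tables/poverty_indicators_2014.csv")
    status = push_indicators(rows)
    print(f"Uploaded {len(rows)} observations, HTTP status {status}")
```

Even a small script of this kind, run as part of the publication workflow, removes the re-keying step that introduces both delay and transcription errors.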

There is another leg in the dissemination chain that demands a mention here: the government-sponsored Open Data portals. Many indicators produced by NSOs are typically candidates for the Open Data portals as well. However, a gap remains here too, illustrated in the image above by ‘manual data entry’ between stage 2 and stage 3.

The data transfer from online statistical data dissemination platforms to Open Data platforms is currently not seamless in most instances. This is another leapfrogging opportunity, with the potential to address two key issues deeply associated with some Open Data portals: the lack of regular data updates (as in the case of Kenya’s Open Data portal) and missing metadata.
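Where the Open Data portal exposes a write API, the same kind of automation can close the gap between stage 2 and stage 3. The sketch below assumes a CKAN-based portal (CKAN’s action API does accept dataset metadata in this form), but the portal URL, API key, dataset fields and update frequency shown are purely illustrative:

```python
import json
import urllib.request

# Illustrative CKAN portal address and API key; replace with the real ones.
PORTAL_URL = "https://opendata.example.gov.rw/api/3/action/package_create"
API_KEY = "REPLACE_WITH_REAL_API_KEY"

def publish_dataset():
    """Create (or re-publish) an indicator dataset on the Open Data portal,
    carrying the metadata that manual uploads often leave behind."""
    dataset = {
        "name": "poverty-indicators-2014",   # portal-wide unique identifier
        "title": "Poverty indicators, 2014",
        "notes": "Aggregated indicators produced by the NSO; updated annually.",
        "license_id": "cc-by",
        "extras": [
            {"key": "source", "value": "National Institute of Statistics of Rwanda"},
            {"key": "update_frequency", "value": "annual"},
        ],
        "resources": [
            {
                "url": "https://statistics.example.org/exports/poverty_indicators_2014.csv",
                "format": "CSV",
                "description": "Machine-readable slice exported from the dissemination platform.",
            }
        ],
    }
    request = urllib.request.Request(
        PORTAL_URL,
        data=json.dumps(dataset).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": API_KEY},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

if __name__ == "__main__":
    result = publish_dataset()
    print("Created dataset:", result["result"]["id"])
```

Run on a schedule (a nightly or monthly job, for instance), the same call keeps the portal from going stale and delivers the accompanying metadata alongside the data itself.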

Complications remain, though, such as the differences in how indicators are organized in a multidimensional structure (a ‘data cube’ defined by a set of dimensions and observation values) across these online platforms (DevInfo/Prognoz), which leads to varying degrees of complexity in interacting with these tools. In contrast, a typical Open Data portal (in most cases built with one of CKAN, DKAN, OGPL, Junar, Socrata, etc.) houses only a ‘slice’ of the ‘data cube’, requiring independent and ad hoc updates.
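To illustrate what that ‘slicing’ means in practice, here is a small sketch that treats the cube as a flat list of observations, each fixing every dimension (indicator, sex, area, year); the dimension names and figures are invented for the example:

```python
# A data cube stored as flat observations: each row fixes every dimension
# (indicator, sex, area, year) and carries one value. All figures below are
# invented for illustration.
CUBE = [
    {"indicator": "Literacy rate", "sex": "Female", "area": "Kigali", "year": 2012, "value": 81.3},
    {"indicator": "Literacy rate", "sex": "Male",   "area": "Kigali", "year": 2012, "value": 84.6},
    {"indicator": "Literacy rate", "sex": "Female", "area": "Kigali", "year": 2014, "value": 83.9},
    {"indicator": "Literacy rate", "sex": "Male",   "area": "Kigali", "year": 2014, "value": 86.1},
]

def slice_cube(cube, **fixed):
    """Return the observations matching the fixed dimension values,
    dropping the dimensions that are no longer free to vary."""
    rows = [obs for obs in cube if all(obs[dim] == val for dim, val in fixed.items())]
    return [{k: v for k, v in obs.items() if k not in fixed} for obs in rows]

# The kind of flat table a typical Open Data portal publishes as one dataset:
# literacy rate by sex and year, with indicator and area fixed.
for row in slice_cube(CUBE, indicator="Literacy rate", area="Kigali"):
    print(row)
```

Each such slice is the kind of flat, two-dimensional table an Open Data portal typically publishes as a single dataset, which is why keeping many slices current by hand quickly becomes an ad hoc exercise.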

Bridging the data transfer gap between these two sets of tools is going to be challenging but hugely rewarding for sure!

As we prepare to meet the challenges of the ‘data revolution’ in the post-2015 development debate, these ‘solutions’, I think, could be more than just means: development imperatives in their own right under the broader theme of improving the availability of and access to data and statistics (see the ‘data, monitoring and accountability’ section of Goal 17 in the proposal of the Open Working Group for Sustainable Development Goals).

NSOs with comprehensive data curation processes, with online data dissemination platforms that obtain data ‘directly’ from aggregated tables, and with seamless data access provided through ‘Open Data’ interfaces will be well placed to improve statistical data dissemination in the digital realm.

4 comments on “What will it take to improve statistical data dissemination in the digital realm?”

  1. You are certainly right that data transfer gaps (often leading to stale data) are a major challenge for governments and organizations. As a result, we concentrate with partners on having non-manual, automated ways to get data out of their systems and onto our Open Data platform … and crucially, scheduling automated updating of that data — helping them avoid the “ad hoc updates” that too often lead to no updates.

    We do this with a tool called DataSynch – which we have open sourced and is available on GitHub: http://socrata.github.io/datasync

    • Many thanks for the comments, Jeff! DataSynch seems like a solution National Statistical Offices could use to tread the difficult path of ‘supplying’ data to government-sponsored Open Data portals built on the Socrata platform – automatically and at fixed intervals. In my attempt to document the solutions in this space, this will be a valuable resource. Thanks!

  2. I broadly agree.

    I am an old (experienced?) statistician who started producing statistics with a manual calculator, having to transfer results to paper and then produce with a pencil a paper copy of what a typist would type onto a wax stencil before it could be reproduced and delivered to users; I had to draw graphs directly on stencils with special drawing instruments. I went through mainframe calculation, office offset printing, microcomputers and laser printing, networks and the internet, wikis, … I have always seen colleagues in official statistics integrating the advanced technology that would bring efficiency to their mission; I bet it will continue. I learned computer programming in the mid-60s.

    What remains unchanged is that the final stage is still users reading graphs, texts with figures and tables of figures. The main difference now is that some users prefer to get the figures themselves and are able to do their own analysis, but in the end they still produce pictures, texts with figures and tables of quantitative data to communicate the results of their work, whether on a screen or on paper, but always to be read. A kind of outsourcing of the final stage, with a machine-to-machine transfer of the data for further processing. Most official statistics users are readers of the products. Access to micro-data is a demand from a very, very few users wanting, for their own needs, to do the whole processing of the source data (lack of confidence and/or specific needs); the issue is, as you said, “ensuring the required quality assurance and in building trust in data” when results are thereafter made public. Open data introduces a step further: getting micro-data before any further curation.

    Back to official statistics, I would suggest considering the whole national (official) statistical system and not only the NSO. There is also a need for a relevant segmentation of dissemination targets, so as to adjust dissemination channels and products using the most adequate technology, in an environment where everything will soon be digital and there will be less and less manual data entry into machines. The work done on the Modernisation of Statistical Production and Services is important (UNECE High-Level Group for the Modernisation of Statistical Production and Services – http://goo.gl/87hkiB), and I trust it should percolate to developing countries, especially the Generic Statistical Business Process Model when it comes “to analyse the entire data management process”; SDMX is also part of it when it comes to “provide seamless data access through ‘Open Data’ interfaces”.

    At country level, anything that has a strategic character and needs time and resources has to be examined during the NSDS (the whole NSS) decision making process.

    Finally, DevInfo, Prognoz, NADA, Redatam (IMIS) and others like CountryStat are provided (free) to countries with the essential purpose of serving the needs of the provider and much less (never?) of being a building block of national statistics data processing. Only countries can change that over the longer term, when politicians, IT specialists and statisticians are willing and in a position to drive the process. What about Rwanda?

    • Thank you Gérard for your detailed comments!

      Surely, these are the words of an experienced statistician. You have witnessed first-hand this ‘paradigm shift’ that I’m referring to in my post above.

      I absolutely agree with you that, gradually, NSOs have been coming under unprecedented pressure from data users, and in particular from ‘civic hackers’ – responsible for re-using the data in ways applicable to specific purposes – to release data openly and in machine-readable formats.

      However, ‘trust’ in the source data is where the buck stops. Until and unless the source data earns the trust, the resulting insights (even by the ‘civic hackers’) will have little usability except for being used in some fancy infographics/visualisations with little or no impact.

      It is also important to underscore that micro-data and aggregated data should not be seen through the same ‘Open Data’ lens. The ‘openness’ in each of these cases has to be defined distinctly, using the characteristics of the statistical data types rather than the absolute ‘open’ of Open Data.

      Only strategic approaches (NSDS and whole-of-NSS) towards high-fidelity and trustworthy data are sustainable. Ad hoc solutions (especially tool-driven ones) are not.

      You are right; we have many good examples to learn from, and insights from other pockets with which to be prepared to manage data seekers’ expectations. However, given the way ‘development’ is partitioned into regions and geographies, it is no easy task to avoid the silos.

      I believe we need a very strong knowledge management ‘culture’, not only in statistical development but in the development sphere in general.

      I think, to quote William Gibson, the future is already here; it’s just not evenly distributed!