The shape of data in digital systems: Modelling aggregated data for data portals

Data modelling on the whiteboard, Phnom Penh, March 2018
Data modelling on the whiteboard, Phnom Penh, March 2018

In many national statistical offices (NSOs), there is a strong desire to improve the dissemination of indicators or aggregated data using digital systems. This aspiration marks a paradigm shift for many NSOs, for whom, until now, print publications have been the dominant medium for disseminating such data.

Managing data digitally is fundamentally different to paper publications. In digital systems, one must think of back, and front ends. The back end is where you would store and arrange data, and the front end is where users interact with the data: including slicing, dicing and even combining related datasets. Data on paper publication is WYSIWYG (What You See Is What You Get). It is a system in which the content captured always appears the same when viewed.

I had the opportunity to explore many of the challenges faced by NSOs while working on improving indicator dissemination systems. In the process, it became increasingly clear that data modelling, i.e., arranging data in a form suitable for databases, was central to the process, as well as identifying more straightforward ways of explaining it.

Thus, in this blog post, I am sharing my experiences in data modelling and offering a few pointers that may be useful for NSOs embarking on this path.

In an attempt to unravel data modelling, I took a simple dataset from a print publication produced to disseminate data from a country’s socio-economic survey. The following dataset describes the country’s net attendance rate.


Net attendance rates in lower secondary school by geographic domain and sex (age 12-14 years), 2015. In percent.

DomainsWomenMenBoth sexes
National45.24243.5
Capital67.559.463.1
Other urban47.753.950.9
Other rural42.538.240.3
Table 1: Net attendance rate

Arranging data in print publications (like Table 1 above) vis-a-vis databases could be vastly different. The main objective of print publications is to optimise the presentation for human consumption. But when it comes to arranging data in a database, it is mainly for machines. Its presentation (e.g., with nicer visualisation) is on top of the data stored in a database. Therefore, the agility of the data presentation depends on the efficiency of the database.

A critical concern while designing data tables in a database is that they must contain all characteristics of the dataset in question and be arranged for a multitude of usages through the possibility of slicing, dicing, pivoting and allowing for potential combinations with datasets that share similar concepts. The first step in responding to this requirement is to know all the dataset’s features and incorporate or model the same in the database design.

To illustrate the process, let’s take Table 1 as a starting point. I first need to identify various underlying concepts in this table to model the dataset for a database.

Let’s start with the title – ‘Net attendance rates in lower secondary school by geographic domain and sex (age 12-14 years), 2015. In percent’. I note that the ‘net attendance rate’ is the indicator. Thelower secondary’ is the education level. The ‘12-14 years’ is the age group, and ‘2015’ is the time. And finally, ‘percent’, the unit of measure.

In the first column (Domains), noting the presence of ‘National’ and ‘Capital’, I identify the geographic domain. In the second and third columns, ‘Women’ and ‘Men’ respectively indicate sex as an existing concept. And then, I note the figures in the second, third and fourth columns as the observation (obs.) values.

On further investigation of the dataset, I see that the first column mentions the terms ‘urban’ and ‘rural’, which are not part of the geographic domain but of the concept of location.

Finally, I realise the need to clarify ‘other’ used with ‘urban’ and ‘rural’. It is to explain that the capital’s observation values do not include numbers from other urban and rural locations. A footnote, therefore, could be a valuable concept to explain the difference.

Now, I can design the data table in the database with the following concepts arranged as column titles like this.


Table 2: Modelled dataset

Indicator name Education level Geographic domain Location Sex Age Time Obs. value Unit of measure Footnote
Net attendance rate Lower secondary National Total Female 12-14 2015 45.2 Percent  
Net attendance rate Lower secondary National Total Male 12-14 2015 42 Percent  
Net attendance rate Lower secondary National Total Total 12-14 2015 43.5 Percent  
Net attendance rate Lower secondary Capital Total Female 12-14 2015 67.5 Percent  
Net attendance rate Lower secondary Capital Total Male 12-14 2015 59.4 Percent  
Net attendance rate Lower secondary Capital Total Total 12-14 2015 63.1 Percent  
Net attendance rate Lower secondary National Urban Female 12-14 2015 47.7 Percent Excluded in capital
Net attendance rate Lower secondary National Urban Male 12-14 2015 53.9 Percent Excluded in capital
Net attendance rate Lower secondary National Urban Total 12-14 2015 50.9 Percent Excluded in capital
Net attendance rate Lower secondary National Rural Female 12-14 2015 42.5 Percent Excluded in capital
Net attendance rate Lower secondary National Rural Male 12-14 2015 38.2 Percent Excluded in capital
Net attendance rate Lower secondary National Rural Total 12-14 2015 40.3 Percent Excluded in capital

Here you see the modelled dataset. This scalable structure could accommodate other observation values of the same indicator (net attendance rate) with different disaggregation, e.g., Table 3: ‘Net attendance rates in primary school (age 6-11)’ and Table 4: ‘Net attendance rates in upper secondary school (age 15-17)’. It would only require adding new rows.


Net attendance rates in primary school (national) by sex (age 6-11 years), 2015. In percent.

DomainsWomenMenBoth sexes
National90.584.587.3
Table 3: Net attendance rates in primary school (age 6-11)

Net attendance rates in upper secondary school (national) by sex (age 15-17 years), 2015. In percent.

DomainsWomenMenBoth sexes
National20.422.821.6
Table 4: Net attendance rates in upper secondary school (age 15-17)

The structure of Table 2 could also accommodate new but related indicators because it specifies (hopefully) all the relevant concepts. For example, it could accommodate data from Table 5, ‘Person aged 6-24 years attending school’ in existing columns.


Persons aged 6-24 years attending school by sex, 2015. In percent.

DomainsWomenMenBoth sexes
National55.556.656.1
Table 5: Persons aged 6-24 years attending school

In the original Table 2, the additional rows (representing data from Tables 3, 4 and 5) would look like the following.


Table 6: The additional rows in the modelled table

Indicator name Education level Geographic domain Location Sex Age Time Obs. value Unit of measure Footnote
Net attendance rate Primary National Total Female 6-11 2015 90.5 Percent  
Net attendance rate Primary National Total Male 6-11 2015 84.4 Percent  
Net attendance rate Primary National Total Total 6-11 2015 43.5 Percent  
Net attendance rate Upper secondary National Total Female 15-17 2015 20.4 Percent  
Net attendance rate Upper secondary National Total Male 15-17 2015 22.8 Percent  
Net attendance rate Upper secondary National Total Total 15-17 2015 21.6 Percent  
Persons attending school Total National Total Female 6-24 2015 55.5 Percent  
Persons attending school Total National Total Male 6-24 2015 56.6 Percent  
Persons attending school Total National Total Total 6-24 2015 56.1 Percent  

From the database point of view, this uniform format in Tables 2 and 6 above makes it efficient to perform operations like filtering, counting, calculating averages, etc., on the dataset. It lets users slice and dice the data and combine further related data (e.g., other age ranges and other education measures). Yes, it is long, but because databases are good at compressing large datasets, even a considerable number of rows will not be a problem for the machine.

When presenting the data to the end-users on dissemination platforms or printed publications, well-modelled data can be combined and generated in multiple ways and easy-to-read formats for human eyes. An example of that from Tables 2 and 6 is below.


Table 7: Net attendance rate

Net attendance rate
Geographic domain: National, Time: 2015, unit of measure: Percent
Sex
Education level Age Female Male Both
Primary 6-11 90.5 84.4 43.5
Lower-secondary 12-14 45.2 42 43.5
Upper secondary 15-17 20.4 22.8 21.6

Standards: SDMX

Data modelling could follow standards with specified technical specifications for consistency and comparability. For example, the SDMX (Statistical Data and Metadata eXchange) is one standard that facilitates the exchange of statistical data and metadata across organisations by arranging the data and metadata in standard formats.

How to model data in SDMX? Now, that is the subject of another blog post!

Comments are closed.