In many national statistical offices (NSOs), there is a strong desire to improve the dissemination of indicators or aggregated data using digital systems. This aspiration marks a paradigm shift for many NSOs, for whom, until now, print publications have been the dominant medium for disseminating such data.
Managing data digitally is fundamentally different to paper publications. In digital systems, one must think of back, and front ends. The back end is where you would store and arrange data, and the front end is where users interact with the data: including slicing, dicing and even combining related datasets. Data on paper publication is WYSIWYG (What You See Is What You Get). It is a system in which the content captured always appears the same when viewed.
I had the opportunity to explore many of the challenges faced by NSOs while working on improving indicator dissemination systems. In the process, it became increasingly clear that data modelling, i.e., arranging data in a form suitable for databases, was central to the process, as well as identifying more straightforward ways of explaining it.
Thus, in this blog post, I am sharing my experiences in data modelling and offering a few pointers that may be useful for NSOs embarking on this path.
In an attempt to unravel data modelling, I took a simple dataset from a print publication produced to disseminate data from a country’s socio-economic survey. The following dataset describes the country’s net attendance rate.
Net attendance rates in lower secondary school by geographic domain and sex (age 12-14 years), 2015. In percent.
Domains | Women | Men | Both sexes |
---|---|---|---|
National | 45.2 | 42 | 43.5 |
Capital | 67.5 | 59.4 | 63.1 |
Other urban | 47.7 | 53.9 | 50.9 |
Other rural | 42.5 | 38.2 | 40.3 |
Arranging data in print publications (like Table 1 above) vis-a-vis databases could be vastly different. The main objective of print publications is to optimise the presentation for human consumption. But when it comes to arranging data in a database, it is mainly for machines. Its presentation (e.g., with nicer visualisation) is on top of the data stored in a database. Therefore, the agility of the data presentation depends on the efficiency of the database.
A critical concern while designing data tables in a database is that they must contain all characteristics of the dataset in question and be arranged for a multitude of usages through the possibility of slicing, dicing, pivoting and allowing for potential combinations with datasets that share similar concepts. The first step in responding to this requirement is to know all the dataset’s features and incorporate or model the same in the database design.
To illustrate the process, let’s take Table 1 as a starting point. I first need to identify various underlying concepts in this table to model the dataset for a database.
Let’s start with the title – ‘Net attendance rates in lower secondary school by geographic domain and sex (age 12-14 years), 2015. In percent’. I note that the ‘net attendance rate’ is the indicator. The ‘lower secondary’ is the education level. The ‘12-14 years’ is the age group, and ‘2015’ is the time. And finally, ‘percent’, the unit of measure.
In the first column (Domains), noting the presence of ‘National’ and ‘Capital’, I identify the geographic domain. In the second and third columns, ‘Women’ and ‘Men’ respectively indicate sex as an existing concept. And then, I note the figures in the second, third and fourth columns as the observation (obs.) values.
On further investigation of the dataset, I see that the first column mentions the terms ‘urban’ and ‘rural’, which are not part of the geographic domain but of the concept of location.
Finally, I realise the need to clarify ‘other’ used with ‘urban’ and ‘rural’. It is to explain that the capital’s observation values do not include numbers from other urban and rural locations. A footnote, therefore, could be a valuable concept to explain the difference.
Now, I can design the data table in the database with the following concepts arranged as column titles like this.
Table 2: Modelled dataset
Indicator name | Education level | Geographic domain | Location | Sex | Age | Time | Obs. value | Unit of measure | Footnote |
Net attendance rate | Lower secondary | National | Total | Female | 12-14 | 2015 | 45.2 | Percent | |
Net attendance rate | Lower secondary | National | Total | Male | 12-14 | 2015 | 42 | Percent | |
Net attendance rate | Lower secondary | National | Total | Total | 12-14 | 2015 | 43.5 | Percent | |
Net attendance rate | Lower secondary | Capital | Total | Female | 12-14 | 2015 | 67.5 | Percent | |
Net attendance rate | Lower secondary | Capital | Total | Male | 12-14 | 2015 | 59.4 | Percent | |
Net attendance rate | Lower secondary | Capital | Total | Total | 12-14 | 2015 | 63.1 | Percent | |
Net attendance rate | Lower secondary | National | Urban | Female | 12-14 | 2015 | 47.7 | Percent | Excluded in capital |
Net attendance rate | Lower secondary | National | Urban | Male | 12-14 | 2015 | 53.9 | Percent | Excluded in capital |
Net attendance rate | Lower secondary | National | Urban | Total | 12-14 | 2015 | 50.9 | Percent | Excluded in capital |
Net attendance rate | Lower secondary | National | Rural | Female | 12-14 | 2015 | 42.5 | Percent | Excluded in capital |
Net attendance rate | Lower secondary | National | Rural | Male | 12-14 | 2015 | 38.2 | Percent | Excluded in capital |
Net attendance rate | Lower secondary | National | Rural | Total | 12-14 | 2015 | 40.3 | Percent | Excluded in capital |
Here you see the modelled dataset. This scalable structure could accommodate other observation values of the same indicator (net attendance rate) with different disaggregation, e.g., Table 3: ‘Net attendance rates in primary school (age 6-11)’ and Table 4: ‘Net attendance rates in upper secondary school (age 15-17)’. It would only require adding new rows.
Net attendance rates in primary school (national) by sex (age 6-11 years), 2015. In percent.
Domains | Women | Men | Both sexes |
---|---|---|---|
National | 90.5 | 84.5 | 87.3 |
Net attendance rates in upper secondary school (national) by sex (age 15-17 years), 2015. In percent.
Domains | Women | Men | Both sexes |
---|---|---|---|
National | 20.4 | 22.8 | 21.6 |
The structure of Table 2 could also accommodate new but related indicators because it specifies (hopefully) all the relevant concepts. For example, it could accommodate data from Table 5, ‘Person aged 6-24 years attending school’ in existing columns.
Persons aged 6-24 years attending school by sex, 2015. In percent.
Domains | Women | Men | Both sexes |
---|---|---|---|
National | 55.5 | 56.6 | 56.1 |
In the original Table 2, the additional rows (representing data from Tables 3, 4 and 5) would look like the following.
Table 6: The additional rows in the modelled table
Indicator name | Education level | Geographic domain | Location | Sex | Age | Time | Obs. value | Unit of measure | Footnote |
Net attendance rate | Primary | National | Total | Female | 6-11 | 2015 | 90.5 | Percent | |
Net attendance rate | Primary | National | Total | Male | 6-11 | 2015 | 84.4 | Percent | |
Net attendance rate | Primary | National | Total | Total | 6-11 | 2015 | 43.5 | Percent | |
Net attendance rate | Upper secondary | National | Total | Female | 15-17 | 2015 | 20.4 | Percent | |
Net attendance rate | Upper secondary | National | Total | Male | 15-17 | 2015 | 22.8 | Percent | |
Net attendance rate | Upper secondary | National | Total | Total | 15-17 | 2015 | 21.6 | Percent | |
Persons attending school | Total | National | Total | Female | 6-24 | 2015 | 55.5 | Percent | |
Persons attending school | Total | National | Total | Male | 6-24 | 2015 | 56.6 | Percent | |
Persons attending school | Total | National | Total | Total | 6-24 | 2015 | 56.1 | Percent |
From the database point of view, this uniform format in Tables 2 and 6 above makes it efficient to perform operations like filtering, counting, calculating averages, etc., on the dataset. It lets users slice and dice the data and combine further related data (e.g., other age ranges and other education measures). Yes, it is long, but because databases are good at compressing large datasets, even a considerable number of rows will not be a problem for the machine.
When presenting the data to the end-users on dissemination platforms or printed publications, well-modelled data can be combined and generated in multiple ways and easy-to-read formats for human eyes. An example of that from Tables 2 and 6 is below.
Table 7: Net attendance rate
Net attendance rate | ||||
Geographic domain: National, Time: 2015, unit of measure: Percent | ||||
Sex | ||||
Education level | Age | Female | Male | Both |
Primary | 6-11 | 90.5 | 84.4 | 43.5 |
Lower-secondary | 12-14 | 45.2 | 42 | 43.5 |
Upper secondary | 15-17 | 20.4 | 22.8 | 21.6 |
Standards: SDMX
Data modelling could follow standards with specified technical specifications for consistency and comparability. For example, the SDMX (Statistical Data and Metadata eXchange) is one standard that facilitates the exchange of statistical data and metadata across organisations by arranging the data and metadata in standard formats.
How to model data in SDMX? Now, that is the subject of another blog post!