Courrier des statistiques N9 - 2023
Presentation of the issue
Each issue of the Courrier des statistiques follows its own logic. Admittedly, the somewhat random nature of the papers submitted does not automatically produce thematic consistency, and the logic of an issue often becomes apparent only in retrospect. Issue N9 is characterised by a selection of papers notable for their technical nature and for covering subjects not normally addressed by this journal. We will therefore attempt to provide a clearer understanding of key topics that appear impenetrable at first glance.
Let’s start at the end for a change, with the final three papers. These are thematically linked and could wrongly be considered better suited to an information technology journal than an Official Statistics journal. In reality, they are essential in a “world of data” that will see statisticians increasingly draw upon third‑party data, such as administrative data, for their own use.
Let’s take paper no. 5, written by Alexis Dondon and Pierre Lamarche, which discusses data formats. At first glance, this may appear to be an ancillary subject, purely operational in nature and secondary to statistical use; however, this is not the case. Whether they are imposed on the user or deliberately chosen, formats have properties that present both limitations and opportunities. The authors explain that there is no ideal format, but rather that each format meets different needs and constraints. They discuss the more recent and lesser-known Parquet format, which is well suited to very large volumes of data.
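The columnar principle behind formats such as Parquet can be illustrated with a minimal, stdlib-only sketch (the variables and values below are invented for illustration): storing one contiguous array per variable means an analytical query reads only the columns it needs, instead of scanning every full record.

```python
# Row-oriented storage (the CSV idea): each record is a full row.
rows = [
    {"region": "11", "year": 2022, "income": 31200},
    {"region": "24", "year": 2022, "income": 28400},
    {"region": "11", "year": 2023, "income": 31900},
]

# Column-oriented storage (the idea behind Parquet): one contiguous
# array per variable.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# An analytical query such as a mean touches a single column,
# which is why columnar formats shine on very large volumes.
mean_income = sum(columns["income"]) / len(columns["income"])
```

Real columnar formats add compression, typing and chunking on top of this idea, but the access pattern is the same.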
These data come from “elsewhere”, notably from administrative sources. There is no doubt a lot to say about this concept of source, but let’s start with what we already have and see how statistics are compiled from it. The data must be integrated and then transformed to ensure that they are suitable for use in the statistical production process. It is this little‑known transformation phase that Franck Cotton and Olivier Haag describe in paper no. 6. They break it down into stages, including recoding, control, pseudonymisation, renaming, characterisation of statistical units and filtering. They also emphasise the critical importance of automated and replicable processing, a veritable pipeline driven by metadata. In this regard, format management is one of the key aspects of the transformation and control process.
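A metadata-driven transformation step of the kind described might be sketched as follows; the variable names, target names and validity rules are hypothetical, not INSEE’s actual metadata:

```python
# Hypothetical metadata: for each incoming variable, the target name
# and the set of admissible values (None = no domain control).
METADATA = {
    "sexe":    {"rename": "sex",        "valid": {"1", "2"}},
    "dep_nai": {"rename": "birth_dept", "valid": None},
}

def transform(record: dict) -> dict:
    """Control, recode and rename one raw record, driven by metadata."""
    out = {}
    for raw_name, spec in METADATA.items():
        value = record.get(raw_name)
        # Control step: reject values outside the documented domain.
        if spec["valid"] is not None and value not in spec["valid"]:
            raise ValueError(f"invalid value {value!r} for {raw_name}")
        # Renaming step: map the source name to the target name.
        out[spec["rename"]] = value
    return out
```

Because every step reads the metadata rather than hard-coding rules, the same pipeline can be replayed identically on each delivery, which is what makes the processing replicable.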
Did you say control, transformation? Bertrand Dubrulle, Olivier Rosec and Christian Sureau (CNAV) are also interested in this topic, but in a very different context: the mass exchange of data in the field of social security, for example to feed repositories such as the Single Career Management Directory (Répertoire de gestion des carrières uniques — RGCU), or for administrative declarations. In order to manage the data flows transmitted, and achieve automated processing, the expected data structure and the rules that the data follow must be very clearly defined: this is referred to as an exchange standard. Given that this structure is prone to frequent change due to regulatory developments, the CNAV has developed a tool (Saturne), which makes it possible to formally describe a standard that is used as a basis to automatically generate all associated documentation and control tools. Such an approach is particularly relevant when it comes to ensuring data quality, which is a crucial matter for statisticians.
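The approach described, deriving both documentation and control tools from a single formal description of the standard, can be sketched in a few lines. The field names and formats below are invented for illustration and are not the RGCU’s actual standard or Saturne’s implementation:

```python
import re

# A toy "exchange standard": each field's format and meaning,
# declared once (field names are illustrative).
STANDARD = {
    "nir":        {"pattern": r"\d{13}", "doc": "13-digit identifier"},
    "start_year": {"pattern": r"\d{4}",  "doc": "career period start"},
}

def make_validator(standard):
    """Generate a control tool from the formal description."""
    compiled = {f: re.compile(spec["pattern"] + r"\Z")
                for f, spec in standard.items()}
    def validate(record):
        # Return the list of fields that violate the standard.
        return [f for f, rx in compiled.items()
                if not rx.match(str(record.get(f, "")))]
    return validate

def make_doc(standard):
    """Generate documentation from the same description."""
    return "\n".join(f"{f}: {spec['doc']} (format {spec['pattern']})"
                     for f, spec in standard.items())

validate = make_validator(STANDard := STANDARD)
```

When the regulation changes, only the declared standard is edited; the validator and the documentation are regenerated automatically, which keeps the two in step.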
Let’s go back a step in the table of contents where there are two papers (no. 3 and 4) relating to the issue of data confidentiality and the way this can be efficiently managed.
Paper no. 3 by Patrick Redor provides an analytical framework, presenting data confidentiality as a key challenge for Official Statistics because of the risks associated with any breach of this requirement. However, statistical activities usually require identifying information about individuals, so it cannot simply be removed. Protection measures and a legal framework are therefore needed, a framework that continues to be enriched over time: the law of 1951, the Data Protection Act (Loi Informatique et libertés), the Law for a Digital Republic (Loi pour une République numérique) and the General Data Protection Regulation (GDPR). The rules of statistical confidentiality, where they are not prescribed by law, are subtle in their application and include “secondary confidentiality” in particular. All this takes place against an evolving backdrop in which the demand for data grows continuously, and one may wonder whether confidentiality and open data are at odds with one another. Promoting broad access to data can have the paradoxical consequence of… limiting access to certain statistics.
One option for protecting confidential data while taking advantage of the wealth of data available from different sources is to use a “Non‑Significant Statistical Code” (code statistique non signifiant — CSNS). The identifying elements are removed from each of the files to be matched, with only this code retained; it serves as a pivot for matching while ensuring that a file cannot be traced back to the individual. As Yves‑Laurent Bénichou, Lionel Espinasse and Séverine Gilles explain in their paper, the CSNS is more than a mere code: it is a true “service” provided by INSEE to the Official Statistical System as a whole. It can be applied to the NIR (national insurance number), in which case the operation is pure encryption, or to identity traits (last name, first name, date and place of birth). In the latter case, an identification algorithm must first be applied, that is, one determining the NIR from the identity traits. The paper explains the stages of this algorithm and measures the quality of the resulting identification. This measure is essential, as the CSNS procedure is fully automated: armed with the quality levels, users can decide which thresholds to apply.
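The pivot-code idea can be illustrated with a keyed hash, a simplified stand-in for the CSNS’s actual cryptographic design; the key, the identifier and the records below are all invented:

```python
import hashlib
import hmac

# Assumption: the key is held by the trusted party computing the
# codes and is never distributed with the data files.
SECRET_KEY = b"held-by-the-trusted-third-party"

def pseudonymise(nir: str) -> str:
    """Replace an identifier by a keyed, non-significant code."""
    return hmac.new(SECRET_KEY, nir.encode(), hashlib.sha256).hexdigest()

# Two files about the same (fictitious) person, each stripped of
# identifying elements and keyed only by the code.
file_a = {pseudonymise("1850775123456"): {"income": 30000}}
file_b = {pseudonymise("1850775123456"): {"diploma": "bac+5"}}

# Matching on the code links the records; the NIR itself never
# appears in either file.
matched = {code: {**file_a[code], **file_b.get(code, {})}
           for code in file_a}
```

The same identifier always yields the same code, which is what makes the code usable as a matching pivot, while recovering the identifier from the code alone is not feasible without the key.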
Working our way backwards through the table of contents, we arrive at paper no. 2, which tackles a sophisticated subject, both multifaceted and innovative: distributive national accounts. Mathias André, Jean‑Marc Germain and Michaël Sicsic provide a concise explanation of the ins and outs of this subject, which gives our brains a real workout. We first need to return to the “traditional” framework of the national accounts (admittedly a simplified version)… before immediately stepping outside it, asking new questions and devising a new framework centred on households. Whether directly or (very) indirectly, households are the final recipients of income and transfers from the other institutional sectors, for example through services provided by the public authorities (healthcare, education, etc.). This framework therefore makes it possible to establish “pre‑transfer” and “post‑transfer” income. We then move on to redistribution mechanisms (expanded redistribution, in addition to the usual monetary approach), by standard of living, age cohort and socio‑professional category, thereby reconciling the national accounts with social statistics. The authors explain how this approach came about, the sources used and the calculation method, and they detail the main assumptions. The paper draws some lessons from this work and traces the operational prospects for this internationally recognised method with a promising future.
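The “pre‑transfer” and “post‑transfer” notions boil down to simple arithmetic, sketched below with invented categories and amounts; these are not the authors’ actual sources, classifications or assumptions:

```python
# Illustrative household account (all amounts invented).
household = {
    "primary_income": 40000,          # labour and capital income
    "taxes_and_contributions": 12000, # levies paid out
    "cash_transfers": 3000,           # usual monetary redistribution
    "in_kind_transfers": 7000,        # health, education, etc.
}

# Pre-transfer income: what the household receives before any
# redistribution operates.
pre_transfer = household["primary_income"]

# Monetary view: subtract levies, add cash transfers.
post_transfer_monetary = (pre_transfer
                          - household["taxes_and_contributions"]
                          + household["cash_transfers"])

# Expanded view: also credit in-kind public services to households.
post_transfer_expanded = (post_transfer_monetary
                          + household["in_kind_transfers"])
```

The gap between the monetary and expanded figures is precisely what the expanded-redistribution approach makes visible at the scale of the whole economy.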
It is not the future, but the past that is discussed in paper no. 1. Gaël de Peretti and Béatrice Touchelay tell us a story: that of Official Statistics in the 40 years following the creation of INSEE, from the perspective of its integration into the social and political debate. During the initial “construction” period, the institute laid foundations that would shape how it functions: “household” surveys aimed at studying the living conditions of households, the framework of national accounts, the law of 1951 and statistical coordination, in a context in which interest in official statistics was far from a given. Controversies surrounding price indices, for example, materialised very early on. However, the audience remained limited, since the institute was assisting only a few decision‑makers. During the second, “consolidation” period, new audiences and new openings emerged in the 1960s: the creation of the Regional Economic Observatories (Observatoires économiques régionaux — OER) and the National Council for Statistics (conseil national de la statistique), which went on to become the National Council for Statistical Information (conseil national de l’information statistique — CNIS). This also marked the opening up of the institute to the general public, with the creation of a dissemination department and growing recognition following several editorial successes. Meanwhile, data processing moved from manual to digital methods.
The adventure will continue in a future issue of the journal, most likely in 2024!
Published on: 29/10/2024
See the paper entitled “Statisticians and Administrative Sources” in issue N1 published in December 2018.