This chapter presents an overview of different models for using administrative data to supplement data collected in statistical surveys. It shows how a mixed-source approach can be used to produce statistics at lower cost, with better quality, or both.
Many of the issues relating to using and linking statistical and administrative data have already been covered in Chapters 4 and 6, so they are not repeated here. Instead, this chapter focuses on the different models for using data from a mixture of administrative and statistical sources to produce statistical outputs.
8.2 Mixed-source Models
1) The Split Population Approach
In this model the statistical population is split into two or more parts for data collection purposes. This approach is very similar to that used for the maintenance of the Australian statistical business register, as described in Chapter 7.3. Data from administrative sources are used for units where these data are of sufficient quality, and statistical sources are used for the remainder of the units.
A typical scenario for a business survey is that data for relatively small businesses with simple structures are taken or derived from tax returns, whereas surveys are used to collect data from the key units (usually those that are largest and/or have the most complex structures). For the section of the population for which tax data are used, the statistical and administrative units are likely to be identical, or very similar, and the impact of the difference between statistical concepts and classifications and their administrative counterparts is likely to be minimal, or at least can be easily modelled.
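The routing rule behind this scenario can be sketched in Python. This is only an illustration under stated assumptions, not a description of any agency's actual system: the field names (`employees`, `n_legal_units`), the two thresholds and the sample records are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical thresholds; in practice these would come from quality studies.
MAX_EMPLOYEES_FOR_ADMIN = 50
MAX_LEGAL_UNITS_FOR_ADMIN = 1


@dataclass
class Business:
    unit_id: str
    employees: int
    n_legal_units: int  # crude, assumed proxy for structural complexity


def collection_source(b: Business) -> str:
    """Return 'admin' for small, simple units and 'survey' for key units."""
    is_small = b.employees <= MAX_EMPLOYEES_FOR_ADMIN
    is_simple = b.n_legal_units <= MAX_LEGAL_UNITS_FOR_ADMIN
    return "admin" if (is_small and is_simple) else "survey"


# Hypothetical population of three businesses.
population = [
    Business("A", employees=4, n_legal_units=1),
    Business("B", employees=30, n_legal_units=3),
    Business("C", employees=1200, n_legal_units=12),
]

split = {b.unit_id: collection_source(b) for b in population}
print(split)
```

Only unit A is both small and simple, so it is the only one routed to administrative data; B is small but complex, and C is large, so both remain in the survey.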
The remainder of the businesses are typically those that have the greatest individual impact on the quality of the statistics, and therefore are the ones for which it is most important to have accurate data. These units are also likely to be the ones with the most complex structures, often requiring profiling (as described in Chapter 4.5) in order to define the correct statistical units for which data are required. These statistical units are often combinations of administrative units, or parts thereof. Some variables, such as employment, can often simply be summed to give the correct total. Others, such as sales and certain other financial variables, cannot, because they include a certain amount of intra-unit trade, so a simple summation would result in over-counting.
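The difference between additive and non-additive variables can be shown with a small worked example. The figures and field names below are invented for illustration, and the sketch assumes that the value of trade between the administrative units is known, which in practice is often the difficult part.

```python
# Hypothetical administrative units that together form one statistical unit.
admin_units = [
    {"id": "A1", "employment": 120, "sales": 900.0},
    {"id": "A2", "employment": 80, "sales": 600.0},
]

# Hypothetical sales between administrative units of the same statistical unit.
intra_unit_sales = [
    {"seller": "A1", "buyer": "A2", "value": 250.0},
]

# Employment is additive, so a plain sum gives the statistical unit's total.
employment = sum(u["employment"] for u in admin_units)

# Sales are not additive: internal flows are removed to avoid over-counting.
gross_sales = sum(u["sales"] for u in admin_units)
internal_sales = sum(f["value"] for f in intra_unit_sales)
consolidated_sales = gross_sales - internal_sales

print(employment, gross_sales, consolidated_sales)
```

Here the simple sum of sales (1500) overstates the consolidated figure (1250) by exactly the amount of internal trade.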
A practical example of the split population approach in business surveys is the Unified Enterprise Survey conducted by Statistics Canada. This brings together annual business data requirements, combining several previous surveys. Administrative data are used instead of data collected through statistical questionnaires for over half of the enterprises in the survey that have a simple structure, resulting in reductions in the statistical response burden of almost 40%.
Where the statistical population is people or households, it may be the case that surveys are needed for special groups such as students, migrant workers or those with two or more residences. These are all potential examples of units for which administrative data may not be sufficiently up to date or accurate, particularly concerning location.
As mentioned several times in previous chapters, consideration must also be given to units not covered by administrative registers, such as illegal immigrants or businesses operating in the informal economy. Statistical surveys are likely to be of only limited use for such groups, so an element of estimation may be needed, thus introducing a third source to be used in the production of the required statistics. This model is illustrated in Figure 8.1 below.
Figure 8.1 – The Split Population Model
2) The Split Data Approach
In this approach, a population of statistical units and a data requirement are identified. For example, the population could be all persons living in a particular country, and the data requirement could be the usual set of variables required for a population census. Instead of providing all of the variables for part of the population, as in the split population model above, administrative sources under the split data approach provide some of the variables for all of the population. (A third approach is also possible, in which administrative sources provide some of the variables for some of the population.)
The split data approach does not, therefore, reduce the number of questionnaires or interviews required to collect the data, but it does reduce the volume of data to be collected in each questionnaire or interview. It is usually most relevant for large and complex data collections where many variables are required, hence the example of the population census. Administrative and survey data need to be integrated for each individual unit in order to produce the data set used for statistical outputs.
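The unit-level integration step can be sketched as a simple merge on a shared identifier. The sketch assumes that both sources already carry a common, reliable unit identifier (the linking issues themselves are covered in earlier chapters); the variable names and records are hypothetical.

```python
# Variables taken from a hypothetical administrative register, keyed by person ID.
register = {
    "P1": {"age": 34, "region": "North"},
    "P2": {"age": 61, "region": "South"},
    "P3": {"age": 19, "region": "North"},
}

# Variables still collected by questionnaire for the same population.
questionnaire = {
    "P1": {"occupation": "nurse"},
    "P2": {"occupation": "teacher"},
}


def integrate(register, questionnaire):
    """Build one record per unit and flag units missing from either source."""
    combined = {}
    for unit_id in sorted(set(register) | set(questionnaire)):
        record = {}
        record.update(register.get(unit_id, {}))
        record.update(questionnaire.get(unit_id, {}))
        record["matched"] = unit_id in register and unit_id in questionnaire
        combined[unit_id] = record
    return combined


census_file = integrate(register, questionnaire)
for unit_id, record in census_file.items():
    print(unit_id, record)
```

Keeping unmatched units in the file with a `matched` flag, rather than dropping them, makes gaps in either source visible for later follow-up or imputation.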
The split data approach is often used during the transition to the sort of register-based statistical system described in the next chapter. Typically, the variables in the statistical data collection are replaced by their equivalents from administrative sources over a number of survey periods. Table 8.1 below illustrates this process, showing the data sources for the Finnish population and housing census.
Table 8.1 – The Split Data Approach in the Finnish Population and Housing Census 1960-2000
Key: Q = Statistical questionnaire
Q/R = Statistical questionnaire supplemented by administrative register
R/Q = Administrative register supplemented by statistical questionnaire
R = Administrative register
n/c = Not collected
Source: This table is a condensed version of Appendix 2 of the paper “Use of Registers and Administrative Data Sources for Statistical Purposes – Best Practices of Statistics Finland”: http://unstats.un.org/unsd/EconStatKB/KnowledgebaseArticle10169.aspx
3) Pre-filled Questionnaires
This approach is really a special case of the split data model: statistical questionnaires are still used to collect data about statistical units, but those questionnaires are pre-filled using administrative data as far as possible, and respondents are asked merely to check these data and correct them where necessary. Figure 8.2 shows an excerpt from a pre-filled questionnaire; this example is from a British statistical survey of enterprises.
Figure 8.2 – An Excerpt from a Pre-filled Survey
Pre-filled questionnaires have three main benefits:
The main disadvantage is the risk of bias introduced because some respondents may simply accept the pre-filled data without checking them, or may choose not to spend time correcting errors.
4) Using Administrative Data for Non-responders
This approach can be seen as a variant of both the split population and the split data models. In this case, the statistical survey remains the primary means of data collection. However, statistical surveys tend to suffer from varying degrees of non-response, which affects the efficiency of the sampling process and the quality of the resulting statistics. Non-response typically takes one of two forms: “unit non-response”, in which no data are supplied for the unit concerned, or “item non-response”, in which a partial return is provided but some data items are blank.
Dealing with non-response can be very costly for a statistical agency, as this typically involves repeat contacts by post or telephone to try to collect the missing data. This process is usually known as “response chasing”, and tends to be very resource intensive.
A cheaper alternative may be to decide that if data are not provided by a particular date, particularly for units that are not vital to the survey results (e.g. smaller businesses in a business survey), they are instead taken or derived from administrative sources. This allows any response chasing resources to be focused on the units that are considered most important, which should mean that any bias from using administrative data rather than survey data is minimized. This can also help to improve the timeliness of the survey results. As with any quality-related issue, a compromise between cost and the different dimensions of quality (see Chapter 5) is inevitable.
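The decision rule described above can be expressed as a short function applied once the response deadline has passed. This is a sketch for unit non-response only; the function name, the source labels and the example units are hypothetical, and a real system would also need rules for item non-response.

```python
def resolve_after_cutoff(survey_value, admin_value, is_key_unit):
    """Decide which value to use once the response deadline has passed.

    Returns a (value, source) pair; source records where the value came from.
    """
    if survey_value is not None:
        return survey_value, "survey"
    if is_key_unit:
        # Key units are still chased, so no substitute value is used yet.
        return None, "chase"
    if admin_value is not None:
        return admin_value, "admin"
    return None, "unresolved"


# Hypothetical units: (survey value, administrative value, key unit?)
units = {
    "small_responder": (52.0, 50.0, False),
    "small_non_responder": (None, 47.0, False),
    "large_non_responder": (None, 9800.0, True),
}

outcome = {name: resolve_after_cutoff(*args) for name, args in units.items()}
for name, (value, source) in outcome.items():
    print(name, value, source)
```

Only the large non-responder is left for response chasing; the small non-responder is filled from the administrative source, and the source label is kept with the value so that later processes can treat it accordingly.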
Administrative data can also sometimes be used as a basis for imputing missing survey data for linked data files.
5) Using Administrative Data for Estimation
When a sample survey is used to collect statistical data, it is often necessary to use estimation techniques, particularly if population totals (rather than proportions) are required. Some basis to estimate the values for the non-sampled part of the population is therefore needed. Sometimes this process can use variables from the survey frame used to draw the sample, but in some cases it may be possible to improve accuracy by using data from administrative sources as auxiliary variables in the estimation process. In practice many examples of this approach concern using administrative data to improve estimates for small areas.
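One common way of using an auxiliary variable is a ratio estimator, sketched below. This is one technique among several, not necessarily the one used in the examples cited, and it assumes simple random sampling and a roughly proportional relationship between the survey variable and the administrative variable. All identifiers and figures are invented.

```python
# Hypothetical population: administrative turnover is known for every unit.
admin_turnover = {"U1": 100.0, "U2": 250.0, "U3": 80.0, "U4": 400.0, "U5": 170.0}

# Survey variable, observed only for the sampled units.
survey_value = {"U2": 275.0, "U4": 440.0}


def ratio_estimate(sample_y, aux_x):
    """Estimate the population total of y using auxiliary variable x.

    The ratio of y to x observed in the sample is applied to the known
    population total of x.
    """
    sample_x_total = sum(aux_x[unit] for unit in sample_y)
    ratio = sum(sample_y.values()) / sample_x_total
    return ratio * sum(aux_x.values())


estimated_total = ratio_estimate(survey_value, admin_turnover)
print(round(estimated_total, 2))
```

In this example the sampled units report survey values 10% above their administrative turnover, so the estimate is the administrative population total of 1000 scaled up by the same ratio.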
8.3 Further Considerations
In any complex statistical processing system using multiple sources, it is vital to consider the role of metadata, particularly those metadata relating to the source of a particular data item. This allows for data items to be treated in different ways throughout the various processes (including unforeseen future processes), according to the way in which they were obtained. Information on the data source is also often a powerful quality indicator, and can help with decisions on the level of quality of statistical outputs.
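As an illustration of source metadata acting as a quality indicator, the sketch below attaches a source flag to each data item and reports the share of an output total that comes from each source. The record layout and the three source labels are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical micro-data: each value carries a flag recording how it was obtained.
records = [
    {"unit": "U1", "turnover": 500.0, "source": "survey"},
    {"unit": "U2", "turnover": 120.0, "source": "admin"},
    {"unit": "U3", "turnover": 80.0, "source": "admin"},
    {"unit": "U4", "turnover": 300.0, "source": "estimated"},
]


def share_by_source(records, variable):
    """Return the fraction of the total of `variable` contributed by each source."""
    totals = defaultdict(float)
    for record in records:
        totals[record["source"]] += record[variable]
    grand_total = sum(totals.values())
    return {source: total / grand_total for source, total in totals.items()}


shares = share_by_source(records, "turnover")
for source, share in shares.items():
    print(f"{source}: {share:.0%}")
```

A published total could then be accompanied by a statement of how much of it rests on surveyed, administrative and estimated values.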
Using a mixture of statistical and administrative data can be seen as an end in itself, particularly where the coverage or quality of the administrative data is not seen as sufficiently high to allow statistical data collection to be stopped altogether. It can also be seen as a step in a gradual transition towards a register-based statistical system, as demonstrated in Table 8.1.
Either way, it allows at least some of the benefits of using administrative data to be realised (including cost savings), whilst avoiding some of the disadvantages, such as total reliance on an external supplier and loss of contact with the general public. It makes it possible to compare statistical and administrative data quality, and allows statisticians to become familiar with using administrative data and to develop new techniques to improve process quality.
For these reasons, mixed-source approaches are currently much more common than purely register-based statistical systems. Over time, however, confidence in administrative data is likely to increase, allowing their use to be expanded and further benefits to be realised. As the balance swings further towards administrative data, it will eventually become necessary to consider whether to switch to the sort of register-based model described in the next chapter.
 For more information, see the paper “Use of Tax Data in the Unified Enterprise Survey (UES)” by Marie Brodeur of Statistics Canada. http://unstats.un.org/unsd/economic_stat/Moscow_workshop/Canada%20-%20Use%20of%20tax%20data%20in%20the%20UES-E.pdf
For example, see the paper “The Use of Administrative Data Sources for Lithuanian Annual Data of Earnings”. http://home.lu.lv/~pm90015/workshop2006/papers/Workshop2006_22_Slickute_Sestokiene.pdf