37. A National Statistical Office (NSO) can support research work in various ways. These are summarised below and discussed more fully in the following sections, with case studies used to illustrate the different methods.
38. There is one important point which is not always understood. Microdata files can be anonymised by removing names and addresses and taking other steps (e.g. collapsing geographical detail) to ensure that the identification of individuals is highly unlikely when these files are looked at in isolation. This could be referred to as eliminating spontaneous identification. But other microdata files exist in the public and private sector, sometimes with individuals identified. Studies have shown that by statistically matching the NSO microdata files with existing files, unique records can be identified. The number can be quite significant depending on the amount of detail available in the NSO microdata file. Also in relative terms, the number of unique cases will be greater for smaller countries. These risks are not always well understood by NSOs. Of course, they will be reduced if techniques such as data perturbation or data swapping are used in the NSO microdata file.
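The link between the amount of detail and the number of unique records can be illustrated with a minimal sketch. The records, variable names and region codes below are hypothetical, and counting uniqueness within the file is only a rough proxy for matching risk against external files; it is not a method prescribed by any NSO.

```python
from collections import Counter

# Hypothetical anonymised records: (region, age, occupation).
# Names and addresses have already been removed.
records = [
    ("North-01", 34, "nurse"),
    ("North-01", 34, "nurse"),
    ("North-02", 34, "nurse"),
    ("North-02", 51, "farmer"),
    ("South-01", 51, "farmer"),
    ("South-02", 51, "farmer"),
    ("South-02", 29, "teacher"),
]


def count_unique(rows):
    """Count rows whose combination of values is shared by no other row."""
    freq = Counter(rows)
    return sum(1 for row in rows if freq[row] == 1)


def collapse_region(row):
    """Coarsen geography by keeping only the broad area before the hyphen."""
    region, age, occupation = row
    return (region.split("-")[0], age, occupation)


unique_before = count_unique(records)
unique_after = count_unique([collapse_region(r) for r in records])
print(unique_before, unique_after)
```

In this toy file, collapsing the geographical detail lowers the number of unique combinations, which is the effect described above. A record that remains unique on variables also held in another file is the kind of record that statistical matching could single out.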
(i) Statistical products for use outside the NSO
|Statistical Tables and Data Cubes||This can include both standard tables and special tables (or special analyses for that matter) generated at the request of the researcher. Some offices now release very detailed matrices, known as data cubes, which researchers can manipulate to support their own needs. However, if these are very detailed, the level of confidentiality risk can be similar to microdata.|
|Anonymised Microdata Files (AMFs) - Public Use Files||These are microdata files that are disseminated for general public use outside the NSO. They have been anonymised and are often released on a medium such as CD-ROM, sometimes through a data archive. (Note: The term anonymised implies that not only are names and addresses removed, but other steps are taken (e.g. collapsing of geographic details) to ensure that identification of individuals is highly unlikely.) The level of confidentiality protection in Public Use Files should be such that identification is not possible even when matched with other data files. Public Use Files are a way of providing access to researchers in some countries.|
|Anonymised Microdata Files - Licensed Files||Licensed files are also anonymised but are distinct from Public Use Files in that their use is restricted to approved researchers and an undertaking or contract is signed before files are provided to the researchers. Even if advertised as generally available to the public, they are not released before an undertaking or contract is provided by the researcher. Even though they are anonymised and other steps are taken to ensure that identification of individuals is highly unlikely when the files are used in isolation, they may contain potentially identifiable data if linked with other data files; this is one reason why a preventive undertaking or contract is required. The NSO may impose other conditions of use on researchers.|
(ii) A service window through which researchers can submit data requests
|Remote Access Facilities (RAFs)||Arrangements are now being made in many countries that allow researchers to produce statistical outputs from microdata files through computer networks, without the researchers actually 'seeing' the microdata. Because of the additional controls that are available through RAF, and the fact that microdata do not actually leave the NSO, access to more detailed microdata can be provided this way.|
(iii) Arrangements for allowing researchers to work on the premises of the National Statistical Office
|Data Laboratories (DL)||On-site access to more identifiable microdata, typically with stringent audit trails and NSO supervision. The access to more detailed data creates some inconvenience to the researcher, because of the requirement of working at the NSO, or at an NSO enclave.|
Statistical tables and data cubes
39. Statistical tables remain the most economical way of satisfying many research needs. Their importance should not be underestimated. The advent of data cubes (very detailed multi-dimensional tables) has increased the usefulness of statistical tables for research purposes as they allow researchers to manipulate the data cubes to suit their own needs.
40. Statistics Netherlands was one of the first organizations to embrace data cubes. Case Study 3 illustrates how it uses data cubes as a key part of its dissemination strategy.
41. Confidentiality issues still exist for statistical tables and data cubes. For example, most statistical legislation requires that identifiable data not be released through statistical tables. The 'confidentialisation' is, however, done prior to release. Software systems exist for confidentialising statistical tables, and improved methods continue to be developed; these are often referred to as disclosure avoidance methods.
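As a rough illustration of confidentialising a table before release, the sketch below suppresses cells with small counts. The threshold value, the function names and the example data are assumptions made for illustration; actual rules and software differ between offices. When row or column totals are published as well, further (complementary) suppression is normally needed, which this sketch does not attempt.

```python
from collections import Counter

MIN_CELL_COUNT = 3  # assumed threshold, for illustration only


def build_table(rows):
    """Tabulate records into a frequency table keyed by cell."""
    return Counter(rows)


def suppress_small_cells(table, threshold=MIN_CELL_COUNT):
    """Replace counts below the threshold with None (suppressed)."""
    return {cell: (n if n >= threshold else None) for cell, n in table.items()}


# Hypothetical unit records: (region, industry).
rows = (
    [("North", "retail")] * 5
    + [("North", "mining")] * 1
    + [("South", "retail")] * 4
    + [("South", "mining")] * 2
)

released = suppress_small_cells(build_table(rows))
for cell in sorted(released):
    print(cell, released[cell] if released[cell] is not None else "suppressed")
```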
Anonymised microdata files - Public Use Files
42. Researchers see public use files as a very valuable service. However, in light of the increased possibilities for data matching, the trend might be to reduce the amount of data available in public use files and to place more reliance on licensed anonymised microdata files, RAFs and data laboratories for researcher access. In addition to the steps taken to reduce identification, licensed files rely on researchers honouring their undertakings or contracts that they will make no attempt to identify individuals. Such undertakings are often a key part of a release of licensed AMFs (see next section).
43. Although NSOs generally provide equality of access to users of their statistics, this may not be appropriate for microdata. A different attitude may be taken to users who do not have strong bona fide research credentials or who have access to databases against which AMFs could easily be matched.
44. The exception is Public Use Files (PUFs) where access is deliberately intended to be broad. Researchers have emphasised the importance of PUFs. They are greatly appreciated in those countries where they exist and they are used extensively. Yet it may not be difficult for someone who is so inclined to publicly identify some individuals through statistical matching with other databases, particularly for countries with smaller populations and those with population registers. Prior to the release of PUFs, there should be a close examination of the conditions under which they are released to better manage the risks of a confidentiality violation. For example, a legally enforceable agreement may be one of the requirements of access. It should be possible to set up an arrangement where a prior agreement needs to be signed even where access to PUFs is through the Internet. Generally, the level of risk will be much greater for countries with smaller populations. Consequently, researchers should not expect that all countries will release PUFs.
45. The risk of identification can be reduced by the use of techniques such as data swapping and data perturbation. These techniques are frequently used in the United States for example. The downside is that these techniques may reduce the usefulness of the underlying microdata.
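The idea behind data swapping can be sketched as follows. This is a deliberately simplified random swap under stated assumptions: the function name, the `fraction` and `seed` parameters and the example records are hypothetical, and swapping procedures used in practice are usually more targeted (for example, exchanging values only between records that are similar on other characteristics).

```python
import random


def swap_attribute(records, key, fraction, seed=0):
    """Return a copy of records in which the values of `key` are permuted
    among a randomly chosen fraction of the records."""
    rng = random.Random(seed)
    swapped = [dict(r) for r in records]  # leave the input untouched
    n = int(len(swapped) * fraction)
    chosen = rng.sample(range(len(swapped)), n)
    values = [swapped[i][key] for i in chosen]
    rng.shuffle(values)
    for i, value in zip(chosen, values):
        swapped[i][key] = value
    return swapped


# Hypothetical microdata records.
original = [
    {"id": i, "region": region, "income": income}
    for i, (region, income) in enumerate(
        [("North", 310), ("North", 295), ("South", 480),
         ("South", 150), ("East", 620), ("West", 205)]
    )
]

protected = swap_attribute(original, key="region", fraction=0.5, seed=1)
```

Because values are only exchanged between records, the overall distribution of the swapped variable is unchanged, while its relationship with the other variables is weakened. That loss of joint information is the reduction in usefulness mentioned above.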
46. Case Study 4 describes the arrangements for the release of PUFs in the United States. It is interesting to note the role that Social Data Archives play in managing access to PUFs to individual researchers.
47. There is extensive literature available on the methods for anonymising microdata files. A good summary is available in Willenborg, L. & de Waal, T. (2001), Elements of Statistical Disclosure Control. The software package μ-ARGUS is concerned with the protection of microdata against disclosure, and several techniques are available in it.
Anonymised microdata files - licensed files
48. This is an arrangement where specific users are authorised or licensed to use anonymised microdata files after making a relevant undertaking or contract. Although these files have been anonymised and individuals cannot be identified from these microdata files in isolation, it may be possible to do so by (statistical) matching with other files, hence the need for a licence. There will be conditions associated with the licence, which can be specified in the undertaking or contract signed by the researcher or their institution. The conditions may vary from country to country or even from one researcher to another depending on the research proposal and possibly the affiliation of the researcher.
49. The conditions may include some or all of the following:
- the researcher will abide by the conditions of release;
- no attempt will be made to identify particular persons or organizations;
- the information will only be used for statistical or research purposes;
- the microdata will not be provided to other persons;
- the microdata will be returned to the NSO when the research project is completed; and
- no attempt will be made to statistically match with other databases without permission.
50. It is good practice for such an undertaking to have some legal standing, for example by providing for such undertakings within enabling legislation. This would allow legal actions to be taken in respect of breaches of the conditions of the undertaking. This does not preclude other actions that might be taken in respect of breaches such as not providing any further services to the researcher and/or possibly the researcher's institution. These are discussed in Chapter 7.
51. It should be possible to release more data through licensed files than through public use files if reliance can also be placed on the undertaking to protect the confidentiality of the data. This applies in cases where some of the data are potentially identifiable when linked with other files.
52. Case Studies 5, 6 and 7 describe the arrangements for the release of licensed microdata files in Australia, the Netherlands and Sweden respectively.
Microdata where identification may be possible
53. Some countries release microdata files containing potentially identifiable data externally for statistical or research purposes, albeit under strict licensing agreements. The licensing agreements should include the conditions under which the data can be used, and the procedure should be specifically covered by legislation. A strict procedure is necessary in order to maintain respondents' confidence and the general public's trust. Remote access facilities and data laboratories are other ways of dealing with this type of situation.
Remote Access Facilities (RAFs)
54. These facilities are becoming increasingly important but the way RAFs are implemented varies considerably from country to country. The key characteristic is that researchers do not have access to the microdata itself but tasks using that microdata can be submitted remotely over the Internet. Often there is a contractual arrangement between the NSO and the researcher or the institution of the researcher.
55. By way of illustration, Statistics Canada provides researchers with dummy microdata files and allows them to submit runs against the full file via computer networks. Statistics Canada runs the requests offline and, after checking the results for confidentiality, sends them back via computer networks. Similar arrangements exist at the Australian Bureau of Statistics, but with some important differences. The microdata files are confidentialised to prevent spontaneous identification before becoming accessible through a RAF. However, trial runs are permitted against the RAF files, and small numbers of unidentified unit records may be downloaded to explore outliers and the like. Output is checked before being sent to the researcher. The system currently operates in batch mode but an interactive version is being considered. The arrangements at Statistics Denmark are different again. It is an on-line system where researchers can run analyses against the full microdata file, but the arrangements are such that they cannot download the microdata itself. To further manage risks, Statistics Denmark relies on the agreements made by institutions and on the sanctions (particularly denial of future access) that follow if the agreements are breached.
56. There are two basic types of RAF.
- (a) Remote execution, where a researcher submits a program and receives the output later over the Internet.
- (b) Remote facilities, where the researcher performs the analysis and can immediately see the answer on the screen.
Many countries have facilities along the lines of (a) but, apart from the Danish system, facilities along the lines of (b) are still being developed. The acceptability of different arrangements is likely to vary country by country.
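The remote execution model in (a) can be sketched as a job that runs inside the NSO, followed by a confidentiality check on the output before anything is returned. Everything named here is hypothetical: the function names, the `MIN_CONTRIBUTORS` threshold and the example file are illustrative assumptions, not a description of any country's system.

```python
from collections import defaultdict

MIN_CONTRIBUTORS = 5  # assumed minimum number of records behind a released figure


def run_remote_job(microdata, group_by, measure):
    """Run the researcher's aggregation against the full file inside the NSO.
    Only aggregates are produced; unit records are never returned."""
    groups = defaultdict(list)
    for record in microdata:
        groups[record[group_by]].append(record[measure])
    return {
        group: {"n": len(values), "mean": sum(values) / len(values)}
        for group, values in groups.items()
    }


def check_output(result, min_contributors=MIN_CONTRIBUTORS):
    """Split the output into cells that can be released and cells withheld."""
    released = {g: s for g, s in result.items() if s["n"] >= min_contributors}
    withheld = sorted(g for g in result if g not in released)
    return released, withheld


# Hypothetical full microdata file held by the NSO.
full_file = [{"region": "A", "income": 100 + i} for i in range(6)] + [
    {"region": "B", "income": 500},
    {"region": "B", "income": 700},
]

released, withheld = check_output(run_remote_job(full_file, "region", "income"))
print(released, withheld)
```

In a facility of type (b) the same kind of check would need to run automatically and immediately, since the researcher sees results on screen rather than receiving them later.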
57. Although only available so far in a few countries, and though the models and approaches vary as illustrated above, the experience to date with remote access facilities has generally been positive.
58. From the cost perspective, RAFs are preferable to data laboratories (see below) as the supervised access in a RAF is less labour intensive than the supervised use involved in data laboratories.
59. If these facilities do not remove identification risk entirely, there should still be some agreement made by researchers to ensure they are fully aware of their obligations. It is good practice to only provide access to those researchers who have signed some form of agreement outlining the conditions of access. Education of the RAF users is also important, together with regular monitoring and checking of the use of these facilities.
60. Case Studies 8, 9 and 10 outline the remote access facilities in place in Canada, Australia and Denmark respectively.
Data Laboratories (DLs)
61. Data laboratories have been in use for many years in some NSOs and have been effective in controlling identification risk whilst enabling researcher access, particularly for data sets where release of a confidentialised microdata file is not possible. They still require conditions of access to provide an adequate level of protection. The main criticism of DLs has been the lack of convenience to the researcher, including sometimes being forced to use unfamiliar data analysis software. They are also expensive for the NSO to manage compared with other options.
62. Some NSOs (e.g. Statistics Canada) have established new premises for data laboratories in locations that are more convenient to researchers (sometimes known as Research Data Centres), but this can also be an expensive option unless specific funding is provided to the NSO.
63. What are the key conditions of access to microdata through data laboratories? These might include (a) documentation of the public good that the research will provide, (b) an outline of how the results will be made accessible to the public, (c) evidence of the bona fides of researchers, (d) a legally binding undertaking, and (e) requirements for supervision by NSO staff.
64. Case Studies 11, 12, 13, 14, 15 and 16 outline the data laboratory arrangements in Canada, the USA, the Netherlands, New Zealand, Italy and Brazil respectively.
Engaging a researcher as a temporary NSO staff member
65. Another way for researchers to access microdata is to be engaged as temporary NSO staff members, which makes them subject to the same secrecy provisions as the staff of the NSO. This should not be done unless the researcher is assisting with the work of the NSO; otherwise the arrangement could be seen as a sham. If this type of pretence were occurring and became public, confidence in the NSO would diminish.
66. The involvement of the researcher may be at the initiation of the NSO, if the researcher is seen as someone who can bring special skills to the work of the NSO and extend the usefulness of the data set. On the other hand, the proposal may come at the initiation of the researcher. It is easier to demonstrate that researchers are assisting the NSO if a published NSO output will result from the work (even if branded somewhat differently from normal published outputs). Of course, there will be benefits for the researchers from such arrangements and it may be agreed that they might publish the outputs of their research in other ways (perhaps after clearance by the NSO).
Special issues associated with business data
67. There are some special issues associated with business data, including data on agricultural businesses. Businesses, and in particular large businesses, are more easily identifiable than households or persons, especially on a spontaneous basis, because the distribution of their characteristics is much more skewed. Also, in many business surveys, the largest businesses are selected with certainty. In some countries, databases of business data are more accessible, thereby enabling matching. In addition, many academic researchers also serve as consultants to business, and even bona fide access to business microdata by these researchers might be incompatible with such consultancy roles, since knowledge acquired in the course of their research cannot simply be erased. Moreover, countries may have concerns about economic competitiveness (and possibly even security) when identifiable business data are shared with researchers in other countries.
68. From the point of view of researcher access, the main difference between household or personal data and business data is that the dissemination streams that provide the greatest protection are the most relevant to business data.
69. In terms of the dissemination streams:
- statistical tables remain relevant, although the higher level of identification risk means that more detailed tables will generally not be available in respect of businesses;
- anonymised microdata files may only be relevant for the smallest businesses, although small businesses may be a group of particular interest for some research. Even then there will need to be 'distortion' of some data (e.g. financial data) to avoid matching with other databases (e.g. taxation data). An alternative is to present the data in ranges. Thus, anonymised microdata files are likely to be of limited use;
- for similar reasons, RAFs may only be relevant for microdata files of the smallest businesses. At least, use of these facilities will enable NSOs to control the matching risk, so it may not be necessary to 'distort' the data to protect confidentiality. But, if large businesses are included, it may be difficult to confidentialise outputs even if the researchers cannot directly access the microdata;
- data laboratory arrangements are likely to be most pertinent for access to microdata files of businesses. Such arrangements exist in Statistics Netherlands, for example (see Case Study 13).
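Presenting financial data in ranges, as mentioned for anonymised business microdata above, can be sketched as follows. The band boundaries, labels and variable names are assumptions chosen for illustration only; suitable bands would depend on the actual distribution of the data.

```python
import bisect

# Assumed band boundaries for annual turnover, in currency units.
BAND_EDGES = [50_000, 250_000, 1_000_000]
BAND_LABELS = [
    "under 50,000",
    "50,000 to 249,999",
    "250,000 to 999,999",
    "1,000,000 or more",
]


def to_band(value):
    """Map an exact value to the label of the range that contains it."""
    return BAND_LABELS[bisect.bisect_right(BAND_EDGES, value)]


# Hypothetical small-business records with exact turnover.
businesses = [
    {"industry": "retail", "turnover": 48_200},
    {"industry": "retail", "turnover": 251_730},
    {"industry": "farming", "turnover": 1_340_000},
]

banded = [
    {"industry": b["industry"], "turnover_band": to_band(b["turnover"])}
    for b in businesses
]
print(banded)
```

Replacing exact amounts with ranges makes an exact-value match against another database (such as taxation data) impossible, at the cost of precision. The open-ended top band also hides the largest values, which are the most identifying in a skewed distribution.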
70. Some research studies may be able to be supported with the consent of the businesses involved.