Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...


Box 4.3 Data From Different Sources

 

Source 1: Education Register

 

Source 2: Population Register

 

Name

Steve Vale

 

Stephen Vale

 

Address 1

5 St Peter’s St

 

5 Saint Peters Street

 

Address 2

Machen

 

Machen

 

Address 3

Newport

 

Caerphilly

 

Address 4

Gwent

 

South Wales

 

Postcode

NP1 8QB

 

CF83 8QB

 

Date of birth

28/12/1967

 

28/12/1997

 

Occupation

Statistician

 

Civil Servant

 

Employer

CSO

 

Office for National Statistics

 

Workplace postcode

NP10 9XX

 

NP10 8XG

 

 

This example shows two records containing fictional data about the author (education and population registers do not yet exist in the UK). It is designed to illustrate several common issues when trying to reconcile data from different sources:

  • Errors – a simple plausibility check would find the error in the population register, people born in 1997 would still be at school, so could not have an occupation or an employer. As the education register gives the year of birth as 1967, this looks like a simple keying error. Automatic checks can usually find such obvious errors, though have to be used with care, for example, a few genuine cases of children being older than parents have been found in Finland, due to adoptions!
  • Timing – the addresses and postcodes given may actually refer to the same building, but at different points in time. The differences could be due to boundary changes between postal areas. This could be determined by consulting historic address files, or by mapping current and historical addresses using geographic information systems.
  • Abbreviations – “St” at the end of address line 1 in the education register is a common abbreviation for “Street” so these text strings should be treated as synonyms when they appear at the end of a text line. Note, however, that “St” at the start of address line 1 is used as an abbreviation for “Saint” so again, some care is needed. Similar examples can be found in other languages.
  • Timing and abbreviations – in the UK, “CSO” is an abbreviation for “Central Statistical Office” a former name of the “Office for National Statistics”, and one that might still be used by those unfamiliar with the change.
  • Different spellings – “Steve” and “Stephen” are different variants of the same name, and should be treated as such.
  • Classification issues – the occupations “Statistician” and “Civil Servant” are not mutually exclusive. “Statistician” could be said to refer to the profession, whereas “Civil Servant” relates more to the nature of the employment.
  • Default values – sometimes when a value is missing, or only partially present, some sort of default value is used. Typical defaults are “Z” or “9999999”. In the UK, when the second part of a postcode was not known, the default “9XX” was often used, as can be seen in the “Workplace postcode” field. Unfortunately the use of this default had to be abandoned when the Post Office started allocating real postcodes ending with “9XX”!
  • Errors – a simple plausibility check would find the error in the population register, people born in 1997 would still be at school, so could not have an occupation or an employer. As the education register gives the year of birth as 1967, this looks like a simple keying error. Automatic checks can usually find such obvious errors, though have to be used with care, for example, a few genuine cases of children being older than parents have been found in Finland, due to adoptions!
  • Timing – the addresses and postcodes given may actually refer to the same building, but at different points in time. The differences could be due to boundary changes between postal areas. This could be determined by consulting historic address files, or by mapping current and historical addresses using geographic information systems.
  • Abbreviations – “St” at the end of address line 1 in the education register is a common abbreviation for “Street” so these text strings should be treated as synonyms when they appear at the end of a text line. Note, however, that “St” at the start of address line 1 is used as an abbreviation for “Saint” so again, some care is needed. Similar examples can be found in other languages.
  • Timing and abbreviations – in the UK, “CSO” is an abbreviation for “Central Statistical Office” a former name of the “Office for National Statistics”, and one that might still be used by those unfamiliar with the change.
  • Different spellings – “Steve” and “Stephen” are different variants of the same name, and should be treated as such.
  • Classification issues – the occupations “Statistician” and “Civil Servant” are not mutually exclusive. “Statistician” could be said to refer to the profession, whereas “Civil Servant” relates more to the nature of the employment.
  • Default values – sometimes when a value is missing, or only partially present, some sort of default value is used. Typical defaults are “Z” or “9999999”. In the UK, when the second part of a postcode was not known, the default “9XX” was often used, as can be seen in the “Workplace postcode” field. Unfortunately the use of this default had to be abandoned when the Post Office started allocating real postcodes ending with “9XX”!

 

4.10    Missing Data

The problem of missing data is not unique to administrative sources. It can also be due to full or partial non-response to statistical surveys, or even to the removal of data values during the editing process. However, with administrative sources, the issues can sometimes be different, particularly as the problem of missing data can often be more systematic.

...

Box 4.4 Case Study: Dealing With Missing Administrative Data – Turnover per Head Ratios

 

The two variables most commonly available to measure the size of a business are the number of employees, and the total sales (turnover). However it is common for one or both of these variables to be missing or unreliable for new businesses, particularly smaller ones.

To help resolve this problem, turnover per head ratios can be used to estimate the missing variables. These ratios are constructed using information for similar businesses for which both variables are present and considered reliable, then calculating average turnover per head ratios for different categories based on economic activity and institutional sector.

For example, the following are dummy turnover per head (TPH) values calculated for different classes of the International Standard Industrial Classification (ISIC):

  

ISIC class

 

TPH

 

 

.........

 

 

 

 

45.11

 

95

 

 

45.12

 

68

 

 

45.21

 

149

 

 

.........

 

 

 

If a business has ISIC class 45.12, and its turnover is 200, but employment is missing, the imputed employment value is:

            200 / 68  = 2.94 (rounded to 3)

When calculating turnover per head values, problems with outliers are often encountered, so methods such as trimming (removing the top and or bottom x% of values), and calculating the mean of the inter-quartile range are often used.

Ratios of this type can also be of more general use for validating updates, matching records from different sources, and detecting errors. For example, by graphing and studying the distribution of turnover per head values, it is also often possible to get useful information about the population of units in question. The following charts are examples of what has been observed from such an exercise:

1. Normal distribution

Image Modified

 

 

In this case the turnover per head values are distributed evenly around the mean indicating a relative degree of homogeneity amongst the population of units, and very limited impact of outliers.

2. Skewed distribution

Image Modified

 

 

In this case there is a clear grouping of units around a relatively low value, but the outliers towards the outer end of the right-hand tail would clearly affect the mean of the distribution. This is a relatively common distribution for turnover per head data, and highlights the need to take measures to reduce the impact of these outliers.

3. Bi-modal distribution

Image Modified

 

 

This case illustrates that the population in question is rather heterogeneous, and that it might be worth splitting it into two sub-populations to get more meaningful turnover per head ratios.

 

4.11    Resistance to Change

...