3.1 Metadata Classification

The essence of Stats SA's meta-information system is captured by how the organisation uses metadata. Metadata is used internally to enable statistical production processes: it is an essential input at various stages of statistical production. The production processes, in turn, produce metadata. This metadata documents the trail of activities during the statistical production process, and that documentation informs related metadata issues such as the assessment of data quality and its interpretation.

Categories of Metadata


Because of this diversity of metadata usage, it was decided that contents of the meta-information system should be aligned with these usage activities. The natural progression of this decision was to undertake a project to classify all of the organisation's metadata. The following is a list of the categories of metadata adopted by Stats SA: 

  • Survey Metadata


Often referred to as dataset metadata, survey metadata is used to describe, access and update datasets and data structures. Stats SA chose to call this type of metadata "survey" rather than "dataset" because some of the metadata, such as information about "the population which the data describe", refer to the broader aspects of the survey and not only to the dataset.

  • Definitional Metadata


This is metadata describing the concepts used in producing statistical data. These concepts are often encapsulated into measurement variables used to collect statistical data. Descriptive text is used to define individual concepts; the concepts are further grouped into logical topics. These main topics are effectively classifications of data. Hence, Stats SA's package of definitional metadata includes classifications drawn from different study domains.

  • Methodological Metadata


These metadata relate to the procedures by which data are collected and processed. These may include sampling, collection methods, editing processes, etc.

  • System Metadata


System metadata refers to active metadata used to drive automated operations. Some of the examples of system metadata are: 

  • Publication or dataset identifiers
  • Date of last update
  • File size
  • Mapping between logical names and physical names of files
  • Dataset input flows
  • Access methods to databases
  • Coordinates as kept in metadata store
  • Table and column definitions schema and mappings of data

 

  • Operational Metadata


This is metadata arising from and summarising the results of implementing the procedures. Examples include respondent burden, response rates, edit failure rates, costs, and other quality and performance indicators.

The different components of Stats SA's meta-information system are logically grouped according to these categories of metadata. This means that the database for the meta-information system has different data structures corresponding to these metadata categories. We have recently (June 2007) finished developing the first metadata component, the survey metadata capturing tool, which is the subject of this case study.
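
The grouping of metadata structures by category can be pictured with a minimal sketch. Only the five category names come from the text above; the class names MetadataCategory and MetadataStore, the item names and the values are illustrative assumptions, not Stats SA's actual database design.

```python
from collections import defaultdict
from enum import Enum


class MetadataCategory(Enum):
    """The five metadata categories named in the text."""
    SURVEY = "Survey Metadata"
    DEFINITIONAL = "Definitional Metadata"
    METHODOLOGICAL = "Methodological Metadata"
    SYSTEM = "System Metadata"
    OPERATIONAL = "Operational Metadata"


class MetadataStore:
    """Hypothetical in-memory store with one structure per category."""

    def __init__(self):
        self._items = defaultdict(dict)

    def put(self, category, name, value):
        self._items[category][name] = value

    def get(self, category, name):
        return self._items[category][name]

    def names(self, category):
        return sorted(self._items[category])


store = MetadataStore()
store.put(MetadataCategory.SYSTEM, "file_size_bytes", 2048)      # made-up value
store.put(MetadataCategory.OPERATIONAL, "response_rate", 0.87)   # made-up value
```

Keeping the category as an explicit key means each component of the system can read or write only the structure it is responsible for.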

 

How Metadata Fit into Other Organisational Systems


As already stated, the development of Statistics South Africa's metadata management system (Meta-Information system) is part of a larger system, the ESDMF. The central components of the ESDMF will follow the completion of the meta-information system, because the ESDMF is driven by the metadata. Although the ESDMF is a new system, it is merely a means to centralize the organisation's disparate statistical information systems. Figure 6 below shows the conceptual ESDMF subsystems and how they are placed relative to other organizational subsystems. The metadata subsystem supports the entire statistical cycle. 


 
Figure 6: Conceptual components for the ESDMF in relation to other subsystems 

 

3.2 Metadata used/created at each phase

Metadata are used and/or produced in each phase of the statistical value chain (SVC). This strong link between the SVC and metadata informs all the development of the metadata subsystem.

Stats SA's Statistical Value Chain

 

Statistics South Africa's core areas, i.e., those divisions in the organization responsible for the production of statistics, have up to now operated using different approaches. Although it is generally understood in the organization that there are many commonalities in the way different divisions conduct their work, no attempt has been made to formalize a standard statistical production process for the entire organization. The development of the SVC for the organization is a move to correct this situation. The SVC is a generalisation of the activities that need to take place from the beginning to the end of a statistical production process.


Stats SA envisions its statistical cycle along the lines of Michael Porter's Value Chain Model. Michael Porter explained this model in his 1985 book, "Competitive Advantage: Creating and Sustaining Superior Performance". Hence we refer to our statistical cycle as the SVC. The value chain categorizes the value-adding activities of an organization. Figure 7 below is a schematic diagram of the main phases of Stats SA's SVC.


   
Figure 7: High level phases of Stats SA's Statistical Value Chain 


The SVC was designed to be general, catering for most scenarios of statistical production. Consequently, not all the phases of the value chain will be used by all surveys. Figure 8 below shows a flowchart of statistical production within the context of the SVC. It can be seen that old frequent surveys might not follow the same path as new frequent or once-off surveys.

  

Figure 8: Flowchart of a statistical production using phases of the SVC 
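
The idea that a survey may use only some phases of the SVC can be sketched as follows. The phase names are those detailed in this section. The rule that a path is a forward-ordered subset of the phases, and the example path itself, are assumptions made for illustration; they are not read off Figure 8.

```python
from enum import IntEnum


class Phase(IntEnum):
    """SVC phases in the order in which this section describes them."""
    NEED = 1
    DESIGN = 2
    BUILD = 3
    COLLECT = 4
    PROCESS = 5
    ANALYSE = 6
    DISSEMINATE = 7


def is_valid_path(path):
    """Return True if phases appear in SVC order with no repeats."""
    return all(a < b for a, b in zip(path, path[1:]))


# Hypothetical path for a survey that reuses an existing design and tooling.
repeat_survey = [Phase.COLLECT, Phase.PROCESS, Phase.ANALYSE, Phase.DISSEMINATE]
full_path = list(Phase)
```

Here is_valid_path(repeat_survey) and is_valid_path(full_path) both hold, while a path that runs Process before Collect is rejected.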

A high level description of the main phases of the SVC was given in section 2.4 above. In this section we give a detailed view of the activities involved in each phase. 

Need Phase


The Need phase consists of the following activities: 

Determine the need

The objectives and purpose for doing the particular survey or research must be defined. This starts with conducting interviews with the organisation or individual(s) requesting the new survey. This is an iterative process that concludes with a definition of a statement of need.

Determine Information Requirements

A need for a survey or study is triggered by requirements for information that solves a given problem. A clear determination of the nature and extent of this information or data is needed. This is done through consultations with domain experts from the community in need of the information. 

Develop Budget and Plan

Similar to any project that requires resources, a statistical production project has to have a cost-benefit analysis as a foundation of its business case. During this phase, only a high level plan is produced.

 

Obtain Financial Support

Generally, Stats SA's projects are big and critical; thus they need huge financial investments. Because the government pays for them, an intensive process of budget approval has to be undertaken in order to ensure accountability. 

Ministerial Approval

Stats SA projects are funded by the National Treasury under the Ministry of Finance. For large projects to go ahead, ministerial approval is required. 

 

Design Phase


The following activities are contained in the Design phase: 

Develop Detailed Project Plan

The output of the Need phase consists of high level aspects of the proposed survey. All Stats SA's surveys must go through detailed planning. For new priority projects, the responsibility for such planning lies with the organisation's Programme Office. The Programme Office has the overall responsibility for running the project to completion, after which, the future running of the project (in the case of frequent surveys) is handed over to the survey area. 

Develop Survey Methodology

The goal of the survey methodology is to ensure that the statistics collected during the survey are reliable and representative of the survey's target population. For existing surveys, the survey methodology is often already in place. For new and re-engineered surveys, new survey methodologies are developed. 

Design and Test Questionnaires

Questionnaire design is aimed at ensuring that the required information from a survey is realized. It consists of getting both the content and the layout of the questionnaire correct. The process iterates between constructing survey questions and testing whether the responses to the questions asked address the problem the survey is intended to solve. Questionnaire testing is initially done "behind-the-glass", during which employees of the organisation are randomly selected for participation. Thereafter, pilot tests are conducted in the field on small population groups in the same way the actual survey will be conducted.

Design Operational Requirements

Survey operations are concerned with the tasks of getting data from respondents or other data sources. Operational requirements must detail all the technical and logistical issues that need to be sorted out in order to have a successful survey. These vary from resource issues to technologies needed to conduct the survey. 

Design Computer System

The system to be used during the statistical production process consists of many related sub-systems that may be implemented through computer technology. Data collected during a statistical survey are captured in a computer system for processing. A number of technologies are required to ensure that data are moved from their sources of collection to the computer.



Build Phase


Activities contained in the Build phase are as follows: 

Build a Collection Vehicle

Stats SA collects statistical data through one of the following survey methods:

 

  • Sample survey using questionnaires
  • Administrative surveys, using IT communications methods to access data stored in other organisations' databases.

Building a collection vehicle consists of ensuring, through custom building or procurement, that all the necessary infrastructure and items for conducting a survey are in place.

Build a Technology Solution

A technology solution should include all the technological components required to support the entire SVC. These may include hardware such as scanners and Optical Character Recognition (OCR) tools for capturing questionnaire-based data, database management systems, data analysis tools and information dissemination tools.

 

Test Technology Solution

Before a technology solution is put into production, it must be tested by the prospective users. This is to ensure that the functionality required by the users is included in the system. Issues concerning ease of use and integration of systems are also addressed. At a technical level, the testing of the system may lead to the identification of system bugs that were missed during the technical tests done by the developers.

Implement Solution

The implementation of a solution means that it is deemed ready to be used to perform productive work. Users are therefore trained on how to use the system, and thereafter designated people are granted access rights to the system.

 

Collect Phase


Contained in the Collect phase are the following activities: 

Manage Respondents

Enumerators must be highly trained so that they are able to explain the following to respondents: the reasons for collecting the data, how the respondents were chosen to be part of the survey, and how the information is planned to be used to improve the functions of the agency and standards of living; whether responses to the collection of information are voluntary or mandatory (citing authority: Statistics Act); the nature and extent of confidentiality to be provided (citing authority: Statistics Act); and an estimate of the average respondent burden, together with a request that the public direct to the agency any comments concerning the accuracy of this burden estimate and any suggestions for reducing the burden. Respondent management must be done in ways that reduce the burden of the survey on the respondent. Burden reduction includes keeping re-visits to respondents to a minimum and keeping the questionnaire to a reasonable length.

Post Out

When a survey is conducted by enumerators visiting respondents, the respondents must be notified by Stats SA about the pending survey. This notification must include information such as the objective of the survey and the date(s) when the enumerators will be visiting. Post Out refers to the process of notifying respondents by sending letters detailing this information via the post. Administrative data does not have this requirement, though legal arrangements (e.g. Memorandum of Agreement, Service Level Agreements) are put in place in advance for the other party to be able to provide the data.

Acquire Data

Data acquisition at Stats SA includes both direct methods (e.g. sample surveys and censuses) and administrative methods. In most direct acquisitions, data are captured on paper-based questionnaires. In a few other cases, electronic media may be used. Figure 9 below shows a flowchart of how Stats SA acquires its data.




   



Close off Collection

The collection period is usually specified at the design stage of the survey. Field collection of data closes automatically at the end of the last day of the defined collection period.



Process Phase


The Process phase consists of performing the following activities: 

Capturing Data into Electronic Form

This applies only to questionnaire-based collection methods. Questionnaires are either scanned or manually entered by data capturers into computer databases. Data collected from other electronic systems might only need to be transformed into Stats SA's data formats.

Perform Macro Edits

Macro edits detect individual errors by: (1) checks on aggregated data, and (2) checks applied to the whole body of records. The checks are typically based on models, either graphical or numerical formulae, that determine the impact of specific fields in individual records on the aggregate estimates.
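
One simple way to picture a macro edit is a check of each record's share of an aggregate. The sketch below is illustrative only: the function name, the 50% tolerance and the figures are assumptions, and it does not represent the specific models used at Stats SA.

```python
def flag_influential_records(values, max_share=0.5):
    """Return indices of records contributing more than max_share of the total.

    max_share is a hypothetical tolerance chosen for illustration.
    """
    total = sum(values)
    if total == 0:
        return []
    return [i for i, v in enumerate(values) if v / total > max_share]


# Made-up turnover figures; the fourth value looks like a unit error.
turnover = [120, 95, 130, 110000, 105]
suspects = flag_influential_records(turnover)
```

A record flagged in this way is a candidate for follow-up, not automatically an error: a genuinely dominant unit would be flagged as well.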

Run Imputation/Estimation

Item non-response may result in missing values in a survey dataset. Statistical organizations use imputation methods to calculate estimated values to fill in the missing values. Imputation is implemented using mathematical algorithms through computer programs.

Estimation of missing values should not be confused with the overall statistical estimates which form the main goal of a survey. Statistical estimates are calculated by aggregating all of the collected data. These are often called macro data, and are contrasted with micro data, which are the detailed data collected from the respondents.
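
The distinction between imputing a missing micro-data value and computing a macro-level estimate can be illustrated with mean imputation. This is one of the simplest imputation methods and is used here only as an example; the text does not say which methods Stats SA applies, and the data are made up.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Made-up micro data with item non-response (None marks a missing answer).
household_size = [4, None, 2, 6, None]
completed = impute_mean(household_size)

# A macro-level aggregate computed from the completed micro data.
estimated_total = sum(completed)
```

The imputed values fill gaps in individual records, whereas estimated_total is an aggregate of the kind the survey sets out to produce.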

Produce Datasets

The primary outputs of processing are "clean" datasets that are ready to be analysed. Analysis tools can only process data whose formats and structure they understand. Part of producing datasets is to package them into structures and formats that conform to Stats SA's analysis packages.

 

Analyse Phase


Statistical data analysis consists of the following activities: 

Produce Statistical results

This is the process where results are produced based on the processing that was done on the data. The ultimate goal of any survey is to produce statistical estimates of the characteristics of the statistical unit of interest. 

Validate Statistical Results

This is where estimates are assessed against expectations by comparing the data with those from the previous period and by assessing quality measures to ensure good quality data.
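
A comparison against the previous period can be sketched as a relative-change check. The function names and the 10% tolerance are assumptions for illustration; the text does not specify the thresholds or rules actually applied.

```python
def relative_change(current, previous):
    """Relative change of the current estimate against the previous period."""
    if previous == 0:
        raise ValueError("previous estimate must be non-zero")
    return (current - previous) / abs(previous)


def needs_review(current, previous, tolerance=0.10):
    """Flag an estimate whose movement exceeds a hypothetical tolerance."""
    return abs(relative_change(current, previous)) > tolerance
```

For example, needs_review(130, 100) flags a 30% movement for closer inspection, while needs_review(105, 100) does not.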

Interpret Statistical Results

Numbers are meaningless if they are presented without any accompanying explanation. This is one quality dimension that Stats SA caters for: all data that are released should be accompanied by the corresponding metadata.

Prepare Content for Dissemination

This is the process where measures are taken to ensure that content from the survey does not disclose information concerning any identifiable respondent. These include: a) for micro data: respondent removal, content reduction and content modification; b) for tabular data: sensitive-cell correction methods such as cell collapsing or suppression by data providers.
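
Cell suppression for tabular data can be sketched as below. The threshold of 5, the use of None as the suppression marker and the example counts are all assumptions; the text does not state the rules Stats SA uses to identify sensitive cells.

```python
def suppress_small_cells(table, min_count=5, marker=None):
    """Return a copy of a frequency table with small non-zero cells suppressed.

    min_count is a hypothetical threshold; marker stands in for a suppressed cell.
    """
    return {
        cell: (marker if 0 < count < min_count else count)
        for cell, count in table.items()
    }


# Made-up counts of respondents by (region, industry).
counts = {("North", "Mining"): 2, ("North", "Retail"): 48, ("South", "Mining"): 0}
safe = suppress_small_cells(counts)
```

This shows only the first step. In practice further steps may be needed so that suppressed values cannot be recovered from row or column totals; those are not shown here.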

Perform Quality Control

This process entails making sure that all quality measures in SASQAF have been implemented correctly and the results thereof are known. 


Disseminate Phase

 

  • Receive and Validate Content


During this process, the dissemination team goes through a checklist of what was supposed to be accomplished and whether it was done accordingly and correctly. The content received by the team consists of macro and micro data, and other products such as published reports. 

  • Manage Dissemination Repositories


Data to be disseminated are kept in databases (dissemination repositories), from which they are extracted when disseminated. These repositories store datasets (including both micro and macro data), reports and other documents. 

  • Pre-release for Publishing


This process entails the preparations made before a release with regard to tables, corporate formatting standards, electronic distribution and hard copy outputs.

  • Manage First Release


This is where distribution media are managed and controlled in order to ensure that different categories of users of statistical information get access to relevant information. Release timelines are handled within this process. 

  • Handle Customers


Handling customers is part of customer relationship and stakeholder management. A system to handle customer enquiries exists. Stats SA's Support and Informatics Services unit handles customer enquiries, categorises main users and other users, consults users to determine their needs, and makes sure data are distributed to users in a timely manner.

 

Metadata Description Matrix


The implemented Survey Metadata Capture Tool of the ESDMF captures the following metadata:
Descriptions are provided for section headings. 

1. Active Metadata Set

The file identifier and status of the current/active metadata set is displayed immediately under this section. In other words, the metadata set that the user is currently capturing, editing or viewing. 

2. Overview

The elements accessible from this section collectively provide a brief description of the survey.
The Overview section comprises the following items:
Survey/Series Status
Objective
Abstract
History
Target Population
Main Topic
Main Users 

3. Generic Information

The elements accessible from this section collectively provide generic information about the survey time frames.
The Generic Information section comprises the following items:
Survey Frequency
Series Time Frames 

4. Primary Data Source

The elements accessible from this section describe external inputs to the survey.
The Primary Data Source section comprises the following item:
External and Internal Data Sources 

5. Methodology

The elements accessible from this section collectively describe the activities conducted and the methods and processes used which are specific to the survey.

The Methodology section comprises the following items:
Survey Population
Instrument Design
Sample Design
Collection
Error Detection/Editing
Imputation
Estimation
Quality Evaluation
Disclosure/Confidentiality Control
Seasonal and Working Day Adjustment
Revisions
Data Item/Variables
Dissemination 

6. Data Quality Report

The element accessible from this section provides a hyperlink to the data quality report for the data release.

The Data Quality Report section comprises the following items:
Relevance
Accuracy
Accessibility
Interpretability
Coherence
Methodological Soundness
Timeliness
Integrity 

7. Documentation

The elements accessible from this section provide hyperlinks to additional documentation related to the survey.
The Documentation section comprises the following item: Documentation 

8. Contact

The elements accessible from this section provide information concerning the contact person who will manage enquiries related to the data or information produced by the survey.
The Contact section comprises the following item: Contact Person 

9. Loaded Metadata Sets

This section lists the file identifiers and statuses of metadata sets created by the current user. It enables the current user to switch between his/her metadata sets.
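
The sections and items listed above lend themselves to a simple completeness check on a metadata set. The sketch below is hypothetical: the dictionary layout, the function name and the draft values are assumptions, not the Survey Metadata Capture Tool's actual data model. The Methodology and Data Quality Report sections are left out for brevity.

```python
# Sections and items as listed above; the dictionary layout is an assumption.
SECTIONS = {
    "Overview": ["Survey/Series Status", "Objective", "Abstract", "History",
                 "Target Population", "Main Topic", "Main Users"],
    "Generic Information": ["Survey Frequency", "Series Time Frames"],
    "Primary Data Source": ["External and Internal Data Sources"],
    "Documentation": ["Documentation"],
    "Contact": ["Contact Person"],
}


def missing_items(metadata_set):
    """List (section, item) pairs that are absent or empty in a metadata set."""
    return [
        (section, item)
        for section, items in SECTIONS.items()
        for item in items
        if not metadata_set.get(section, {}).get(item)
    ]


draft = {  # hypothetical, partially captured metadata set
    "Overview": {"Objective": "Measure household income", "Abstract": ""},
    "Contact": {"Contact Person": "A. N. Other"},
}
gaps = missing_items(draft)
```

A check of this kind could be run before a metadata set changes status, so that empty items are caught while the set is still being edited.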

Table 2 below shows the metadata captured with the Metadata Capture Tool against the Statistical Value Chain, with examples for each stage of the SVC.


 

| Group | Description | Statistical Value Chain | Examples | Quality Dimensions |
|---|---|---|---|---|
| Survey Overview | Brief overview about the survey that highlights the background, purpose, history and usage | Need | Title of survey, Series status, Objective of survey, Keywords, Main users and usage | Accessibility |
| | | Build | Metadata file identifier, Metadata version | |
| | | Design | Target population, Main topics | |
| Survey Time Frames | Information about time frames that the life cycle of the survey will be managed | Need | Frequency of series, start date of survey, end date of survey | Timeliness |
| | | Design | Reference period, collection period, product release date | |
| Type of Survey | Classification of a survey according to its statistical activity that involves collection, compilation and publication of statistical data measuring characteristics of a population | Design | Derived, Direct (e.g. Sample or Census) and Administrative | Methodological soundness |
| Primary Data Source | Information that gives a description about or identifies the administrative data source | Design | Administrative data information (i.e. title of survey from primary data source, primary data source description, contact person from primary data source) | Pre-requisite |
| Methodology | Information about processes that are put in place and methods used to collect, process, analyse and publish statistical release | Design | Survey population, instrument design, Collection, Editing/Error detection, Imputation, Estimation, Disclosure/Confidentiality control, seasonal adjustments, revisions, Data variables, Dissemination | Methodological soundness, Integrity and Accessibility |
| Data Quality Report | Information about quality measures used and the errors obtained as a result of executing the statistical processes | Design | Sampling errors and Non-sampling errors | Accuracy |
| Documentation | Attach any documents with extra information related to specific section of the template | | | Interpretability |
| Contact | Any additional documents that describe the concepts and definitions, methods and data quality applying to the specific survey | | | Accessibility |


Table 2: Relationships between various categories of metadata inputs and different phases of the SVC 

The following table shows the stage of the SVC at which metadata is used: 

| Group | Statistical Value Chain | Examples | Quality Dimensions |
|---|---|---|---|
| Survey Overview | Build | Metadata File Identifier, Metadata version | Accessibility |
| | Collection | Objective, Main topics | |
| | Dissemination | Title of survey, Series number, Series status, Abstract, History of survey, Keywords, Users and usage | |
| Survey Time Frame | Collection | Collection period, reference period | Timeliness |
| | Dissemination | Frequency of series, Start date of survey, Product release date, End date of survey | |
| Type Of Survey | Collection | Derived, Direct (e.g. Sample or Census) and Administrative | Methodological soundness, Integrity, Accessibility |
| Primary Data Source | Collection | Administrative data information (e.g. title of survey from primary data source, primary data source description, contact person from primary data source) | |
| Methodology | Collection | Survey population, Instrument design, Sample design, collection, Quality evaluation, Data variables | Methodological soundness, Integrity and Accessibility |
| | Process | Quality evaluation, Data Editing, Imputation, Seasonal adjustment, Revisions, Data variables | |
| | Analysis | Quality evaluation, Estimation, Data variables | |
| | Dissemination | Quality evaluation, Disclosure/Confidentiality Control, Dissemination methods | |
| Data Quality Report | Process | Sampling errors and Non-sampling errors | Accuracy |
| Documentation | Dissemination | Documentation | Interpretability |
| Contact | Dissemination | Contacts | Accessibility |



Table 3: Metadata produced with groups of metadata with examples for each group
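
The pairings of metadata group and SVC stage in Table 3 can be held as a small lookup, which makes it easy to ask which groups are relevant at a given stage. The pairs below are transcribed from Table 3; the variable and function names are illustrative assumptions.

```python
# (group, SVC stage) pairs transcribed from Table 3.
USED_AT = [
    ("Survey Overview", "Build"),
    ("Survey Overview", "Collection"),
    ("Survey Overview", "Dissemination"),
    ("Survey Time Frame", "Collection"),
    ("Survey Time Frame", "Dissemination"),
    ("Type Of Survey", "Collection"),
    ("Primary Data Source", "Collection"),
    ("Methodology", "Collection"),
    ("Methodology", "Process"),
    ("Methodology", "Analysis"),
    ("Methodology", "Dissemination"),
    ("Data Quality Report", "Process"),
    ("Documentation", "Dissemination"),
    ("Contact", "Dissemination"),
]


def groups_for_stage(stage):
    """Return the metadata groups that Table 3 lists against an SVC stage."""
    return [group for group, s in USED_AT if s == stage]
```

For instance, groups_for_stage("Process") returns the Methodology and Data Quality Report groups, and a stage that Table 3 does not list returns an empty list.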