




Generic Statistical Information Model (GSIM):



(Version 1.0, December 2012)


















About this document

This document is aimed at metadata specialists, information architects and solutions architects. It includes descriptions of information in a statistical organization. There are also a number of annexes, which include information about the GSIM extension methodology, links to and influences of existing standards, a glossary and UML class diagrams.



Table of Contents


I. Introduction

II. Information in a statistical organization

A. Introduction

B. Business Group

C. Production Group

D. Concepts Group

E. Structures Group

F. Base Group

Annex A. Extending the model

A. GSIM Extension Methodology

B. Administrative Attributes

Annex B. Influence of existing standards

A. Introduction

B. Generic Statistical Business Process Model (GSBPM)

C. Data Documentation Initiative (DDI)

D. Statistical Data and Metadata eXchange (SDMX)

E. ISO/IEC 11179

F. ISO 704

G. Neuchâtel Terminology for Classifications

H. Business Process Model and Notation (BPMN)

I. COmmon Reference Environment (CORE)

J. The Open Group Architectural Framework (TOGAF)

Annex C. Glossary

Annex D. UML class diagrams and object descriptions

Base Group

Business Group

Concepts Group

Production Group

Structures Group


Figure 1. Statistical Need

Figure 2. Evaluation

Figure 3. Statistical Programs

Figure 4. Statistical Activity

Figure 5. Acquisition Activity

Figure 6. Process Steps can be as large or small as needed

Figure 7. Simplified view of Production Group objects

Figure 8. Process Step Design

Figure 9. Process Step Execution

Figure 10. Conceptual and Structural information objects can be Process Inputs and Outputs

Figure 11. Populations and Units

Figure 12. Variable

Figure 13. Represented Variable

Figure 14. Instance Variable

Figure 15. Overview of Classification

Figure 16. Concept Systems

Figure 17. Data Resource

Figure 18. Data Set

Figure 19. Dimensional and Unit Data Structures

Figure 20. Dissemination Activity

Figure 21. Base Artefacts

Figure 22. Organization

Figure 23. Extension of Administrative Details

Figure 24. GSIM and its relationship to other relevant standards and models

Figure 25. GSIM and GSBPM

Figure 26. CORE and GSIM

Figure 27. Base Artefacts Class Diagram

Figure 28. Organization Class Diagram

Figure 29. Information Request Class Diagram

Figure 30. Statistical Program Class Diagram

Figure 31. Data-Channel Class Diagram

Figure 32. Instrument Control Class Diagram

Figure 33. Question Group Class Diagram

Figure 34. Concept-Population Inheritance Class Diagram

Figure 35. Classification Class Diagram

Figure 36. Category-Code Class Diagram

Figure 37. Variable Class Diagram

Figure 38. Node-Inheritance Class Diagram

Figure 39. Node-Relationship Class Diagram

Figure 40. Production-Overall Class Diagram

Figure 41. Process Overview Class Diagram

Figure 42. Process Design Class Diagram

Figure 43. Process Execution Class Diagram

Figure 44. DataSet Class Diagram

Figure 45. UnitDataStructure Class Diagram

Figure 46. UnitDataSet Class Diagram

Figure 47. DimensionalDataStructure Class Diagram

Figure 48. DimensionalDataSet Class Diagram

Figure 49. Data-Resource Class Diagram

Figure 50. DisseminationActivities Class Diagram

Figure 51. Service Class Diagram




Table 1. Examples of Data Channel, Instrument, Instrument Implementation and Mode

Table 2. Recommended Attributes

Table 3. Similar Constructs in 11179 and GSIM

Table 4. Mapping between Neuchâtel Terminology for Classifications and GSIM

Table 5. Similar constructs in BPMN and GSIM

I.  Introduction


  1.               The GSIM Specification is the most detailed level of the Generic Statistical Information Model (GSIM). It provides a set of standardized, consistently described information objects, which are the inputs and outputs in the design and production of statistics. Each information object has been defined and its attributes and relationships have been specified. For contextual information, an introduction to GSIM and information on using GSIM, please refer to the GSIM Communication and User Guide documents.


2.                 This document provides a description of GSIM in the context of a statistical organization. It has a number of annexes which provide further details for the reader. These annexes are:

      Annex A: Extending the model - This annex provides information for implementers on how to extend the GSIM for organization specific purposes. It also contains the set of recommended attributes for the administration of the GSIM objects.

      Annex B: Influence of existing models and standards - This annex reviews a number of relevant models and standards. It discusses the relationship to and influence of these models and standards on the GSIM.

      Annex C: Glossary - The annex gives readers definitions and explanatory descriptions for the GSIM information objects.

      Annex D: UML class diagrams


3.                 The GSIM is the result of a collaboration among statistical organizations across the world to develop and maintain a generic model that is suitable for all organizations and meets the strategic goals (in particular the modernization effort) of the official statistics community.



II.  Information in a statistical organization


A. Introduction


4.               There is a widespread interest across statistical organizations in being able to trace how statistical information (for example, data and metadata) "flow" through statistical business processes (into processes and out of processes). Interested parties include broad statistical systems (like the European Statistical System), National Statistical Systems (both centralized and decentralized) and smaller task teams working inside National Statistical Offices.


5.               The description of the GSIM Business Group shows that GSIM covers the whole statistical process and is designed to support both current and new ways of producing statistics.


6.               Achieving standards-based modernization of the production of official statistics places an emphasis on being able to share and reuse processes, methods, components and data repositories. Achieving reuse of processes, methods and components will require that process designers are readily able to discover what is available for reuse and whether it may be relevant to their particular purposes and needs. The case for reuse will be challenged if, in practice, discovering potentially reusable business resources, and assessing whether those resources are actually suitable for the designer's specific purpose, takes more time than creating new design elements.


7.               GSIM was designed to enable an explicit separation between the design and execution of statistical processes. The description of the GSIM Production group shows how this has been modelled.


8.               There is an increasing business need to record reliable, structured information about the processes used to produce specific statistical outputs. In order to maximize transparency and reproducibility of results, it is important for a statistical organization to understand the process and its inputs and outputs. The GSIM Concepts and Structures Groups contain the conceptual and structural metadata objects that are used as inputs and outputs in a statistical business process.


9.               The GSIM Base Group consists of several objects that can be seen as the fundamental building blocks that support many of the other objects and relationships in the model. These objects form the nucleus for the application of GSIM objects. They provide features which are reusable by other objects to support horizontal functionality such as identity, versioning etc. For these reasons, many of these objects are rather abstract in nature.


10.               Note: GSIM information objects have been given in italics in the descriptions that follow. The diagrams included in this section are stylized representations of the model. The colours of the boxes in diagrams represent which group the information object belongs to (Blue for Business Group, Red for Production Group, Green for Concepts Group, Yellow for Structures Group and Orange for the Base Group). In many cases there is more detail to be found in the UML. Detailed information on each information object in the model, including a glossary and UML class diagrams can be found in Annexes C and D of this document.


B. Business Group


11.               The Business Group is used to capture the designs and plans of Statistical Programs. This includes the identification of a Statistical Need, the Acquisition, Production and Dissemination Activities that comprise the Statistical Programs, and the evaluations of them.


12.               An organization will react and change due to a variety of needs. In simple terms, these may be divided into at least two types of Statistical Need: an Information Request and an Environment Change.


13.               Where an organization receives an Information Request, this will identify the information that a person or organization in the user community [1] requires for a particular purpose. This request will commonly be defined in terms of a Concept or Subject Field that defines what the user wants to measure and the Population that the user wants data about.


14.               When an Information Request is received, it will be discussed and clarified with the user. This will be described by a Process Step. Once clarified, a search will be done to check if the data already exist. Discovering these Data Sets may be enabled by searching for Concepts and Classifications.


15.               Where an organization identifies an Environment Change, this indicates that there needs to be an externally motivated change. This may be specific to the organization, in the form of a reduced budget or new demands from stakeholders, or may be a broader change such as the availability of new methodology or technology. A Statistical Need can be both internally and externally driven. For example, a statistical organization may realize that its existing Products and services must be improved. This may be in response to an Assessment of those Products and services.



Figure 1. Statistical Need


16.               As shown in Figure 1, once an organization has identified a Statistical Need, it will be further specified in the form of a Change Definition. This identifies the specific nature of the change in terms of its impacts on the organization or specific Statistical Programs. This Change Definition is used as an input into a Business Case. A successful outcome will either initiate a new Statistical Program or create a new Statistical Program Design that redefines the way an existing Statistical Program is carried out.


Figure 2. Evaluation


17.               At any point in the statistical business process, an organization may undertake an evaluation to determine utility or effectiveness of the business process or its inputs and outputs. An Assessment will be undertaken to evaluate any resources, processes or outputs and may refer to any object described in the model.


18.               An Assessment may be of several types depending on the purpose. A Gap Analysis may be undertaken, often in the context of a Business Case. An Evaluation Assessment is undertaken to determine whether a statistical output meets the need for which it was first created. It does so by analysing any information object that can be considered a Process Output in light of the original Statistical Need.

Statistical Program


19.               A Statistical Program is the overarching, ongoing activity that an organization undertakes to produce statistics (for example, a retail trade survey). Each Statistical Program includes one or more Statistical Program Cycles. The Statistical Program Cycle is a repeating activity to produce statistics at a particular point in time (for example, the retail trade survey for March 2012).



Figure 3. Statistical Programs


20.               A Statistical Program (Figure 3) has an associated set of Statistical Program Designs that identify the methodology (the methods used to acquire, process and disseminate the data) used for the Statistical Program. Only one Statistical Program Design is valid for, and is identified as being used by, a particular Statistical Program Cycle. Changes to the methodology result in new Statistical Program Designs, so over time each Statistical Program will have a series of designs that provide a history of changes to the Statistical Program. The Statistical Program Design identifies the set of processes that are intended to be used to undertake the activity (Process Step Design), the resources required for the processes and a description of the methodology and context.


21.               Each Statistical Program Cycle consists of one or more Statistical Activities. A Statistical Activity is the set of executed processes and the actual resources required as inputs and produced as outputs. It is analogous to the Statistical Program Design but represents the execution rather than the design. The same information that is identified in the Statistical Program Design as intended to be used to undertake an activity is identified here as the actual information used. For example, in the design a dataset of a particular type may be identified as an input, whereas in the Statistical Activity the filename and location of the actual input dataset would be identified.
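The relationship between a Statistical Program, its designs and its cycles can be illustrated with a minimal Python sketch. The class, method and field names below (for example `revise_design`, `start_cycle` and `reference_period`) are illustrative assumptions and are not part of GSIM. The sketch only shows that each cycle refers to exactly one design and that a methodology change adds a design rather than overwriting an earlier one.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class StatisticalProgramDesign:
    """Methodology in force for part of a program's life (hypothetical fields)."""
    version: int
    methodology: str


@dataclass
class StatisticalProgramCycle:
    """One iteration of the program; exactly one design applies to it."""
    reference_period: str
    design: StatisticalProgramDesign
    activities: List[str] = field(default_factory=list)


@dataclass
class StatisticalProgram:
    name: str
    designs: List[StatisticalProgramDesign] = field(default_factory=list)
    cycles: List[StatisticalProgramCycle] = field(default_factory=list)

    def revise_design(self, methodology: str) -> StatisticalProgramDesign:
        """A methodology change yields a new design; older designs stay as history."""
        design = StatisticalProgramDesign(len(self.designs) + 1, methodology)
        self.designs.append(design)
        return design

    def start_cycle(self, reference_period: str) -> StatisticalProgramCycle:
        """Start a cycle under the most recent design."""
        if not self.designs:
            raise ValueError("a Statistical Program Design is needed before a cycle can run")
        cycle = StatisticalProgramCycle(reference_period, self.designs[-1])
        self.cycles.append(cycle)
        return cycle


# Invented example data based on the retail trade survey mentioned in the text.
program = StatisticalProgram("Retail trade survey")
program.revise_design("mail questionnaire")
march = program.start_cycle("2012-03")
program.revise_design("web questionnaire")
april = program.start_cycle("2012-04")
```

In this sketch the March 2012 cycle keeps its link to the first design after the methodology changes, which is one way to retain the history of changes to the program.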


22.               The model identifies different types of activities that represent the major steps in the statistical production process (Figure 4). Three types have been specifically identified in the model but other types could be defined. The distinction between different types of activities and distinction of a Statistical Activity from a Statistical Program Cycle means that each iteration can be made up of multiple activities of the same or different types and these may or may not represent the sequence of collection through to dissemination. This model supports both the traditional approach of collecting data for a particular need, and the emerging and future approach of collecting data and producing new outputs based on existing data sources that are maintained and added to over time.



Figure 4. Statistical Activity


23.               A possible future approach relates to a continuous collection process. In the age of 'big data', the cost of collecting and storing data (for example, a statistical register) is low. An organization can collect data on a continuous basis without a particular Dissemination Activity, Product or Dissemination Service in mind. In this case the organization has a Statistical Program with a Statistical Program Cycle that consists of an Acquisition Activity that gathers data and adds to a Data Resource. Any Statistical Program (consisting of only Production or Dissemination Activities) may then use this Data Resource in the future.


Acquisition Activity


24.               For an activity where the purpose is to acquire data, a Collection Description (Figure 5) provides a description of the activity and the associated contextual information. The Acquisition Activity identifies the means by which the data is collected and where it is collected from by identifying a Data Channel.



Figure 5. Acquisition Activity


25.               A Data Channel identifies the Instrument used to collect data. An Instrument is the description of the tool that will be used to collect data. Examples include a questionnaire or a set of requirements to develop software for gathering data. The Instrument includes an Instrument Control and may have Question Blocks, Questions, Statements and Interviewer Instructions.


26.               Once the Instrument has been designed, it must be implemented in the form of one or more Instrument Implementations. These could be printed forms, software programs, etc. The Data Channel uses the Instrument Implementation to request data and describes the technique used to do it by means of a Mode. Once the Data Channel receives the data, it sends the data to an identified Data Resource (thus populating it with Data Sets).


27.               The Mode represents the way the information collection process is going to be conducted and, in this way, 'how' the Data Channel is going to be used. Table 1 gives some examples of Data Channel, Instrument, Instrument Implementation and Mode.











Table 1. Examples of Data Channel, Instrument, Instrument Implementation and Mode

Data Channel


Instrument Implementation


Physical presence


Paper Form

Traditional interview

Traditional mail


Direct deposit


Software Program

CAPI interview


CATI interview



Data scanner device

Set of Requirements

Data Scanner Program

Data collector


Web Scraping Robot

Web queries



Web Service Consumer Program

Applications interconnection

Secondary transfer of data

Data Transfer

Data Medium, File Transfer, Web Sphere Application


Production Activity


28.               GSIM includes the notion of a Production Activity. More information about how GSIM expands on this activity can be found in the Production Group section.


Dissemination Activity


29.               GSIM includes the notion of a Dissemination Activity. More information about how GSIM expands on this activity can be found in the Structures Group section.


C. Production Group


30.               The Production group is used to describe each step in the statistical process, with a particular focus on describing the inputs and outputs of these steps. A business process can be specified in terms of:


      The Process Steps which need to be undertaken during that process, and

      The sequence in which Process Steps need to be undertaken during that process.


31.               A Statistical Activity puts into effect a statistical business process which has been designed previously (it has a Statistical Program Design) and which spans one or more phases of the business process (for example, the Collect, Process, Analyze, and/or Disseminate phases of the GSBPM).


32.               At the heart of the Production Group is the description of the Process Steps within the statistical business process and the use of statistical information as inputs to, and outputs from, each Process Step. Each Process Step can be as "large scale" or "small scale" as the designer of a particular business process chooses (see Figure 6). Steps can contain "sub-steps", those "sub-steps" can contain "sub-steps" within them and so on indefinitely.



Figure 6. Process Steps can be as large or small as needed
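The nesting of Process Steps described in paragraph 32 can be sketched as a recursive structure. The Python below is a minimal illustration under stated assumptions: the `ProcessStep` class, its `leaves` and `depth` helpers and the step names are all hypothetical and are not defined by GSIM.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class ProcessStep:
    """A step that may contain sub-steps to any depth (hypothetical class)."""
    name: str
    sub_steps: List["ProcessStep"] = field(default_factory=list)

    def leaves(self) -> Iterator["ProcessStep"]:
        """Yield the finest-grained steps, depth first, in design order."""
        if not self.sub_steps:
            yield self
            return
        for step in self.sub_steps:
            yield from step.leaves()

    def depth(self) -> int:
        """Number of nesting levels, counting this step as level 1."""
        return 1 + max((s.depth() for s in self.sub_steps), default=0)


# Invented example: one "large scale" step broken into smaller ones.
process_data = ProcessStep(
    "Process data",
    [
        ProcessStep(
            "Edit",
            [ProcessStep("Detect outliers"), ProcessStep("Impute missing values")],
        ),
        ProcessStep("Aggregate"),
    ],
)
leaf_names = [step.name for step in process_data.leaves()]
```

The same class represents the large step and its smallest sub-steps, so a designer can choose the level of granularity without changing the model.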


33.               In line with the GSIM design principle of separating design and production, the Production Group (see Figure 7) assumes that each Process Step will be designed during a design phase. Having divided a planned statistical business process into Process Steps , the next requirement is to specify a Process Step Design for each step. The Process Step Design identifies how each Process Step will be performed.

34.               The sequencing of Process Steps within a business process is addressed through the concept of Process Control.   When creating a Process Step Design, a Process Control that provides information on "what should happen next" is specified. Sometimes one Process Step will be followed by the same step under all circumstances. In such cases the Process Control simply records what Process Step comes next. However, sometimes there will be a choice of which Process Step will be executed next. In this case, the design of the Process Control will detail the set of possible "next steps" and the criteria to be applied in order to identify which Process Step(s) should be performed next.


35.               During the production phase, as part of a Statistical Activity, Process Steps are executed in accordance with their design. An agent (person or system) initiates execution of the relevant Process Steps based on the following information:


      Process Step Design to determine how the current Process Step should be executed.

      Process Control to determine which Process Step to execute next.


36.               A Process Step Execution Record should be recorded for each Process Step which is executed. The Process Step Execution Record is the information object which records the action. The action itself is a real world event; the Process Step Execution Record records that event.


Figure 7. Simplified view of Production Group objects

37.               As shown in Figure 7, a Statistical Program Design is associated with a top level Process Step whose Process Step Design contains all the sub-steps and process flows required to put that statistical program into effect. Each Process Step in a statistical business process has been included to serve some purpose. This is captured as the Business Function associated with the Process Step. The Business Function, for example, might be 'impute missing values in the data'.


38.               The Process Step Design associated with that Process Step will then identify the Process Method that will be used to perform the Business Function associated with the Process Step. For example, if the Business Function is 'impute missing values in the data', the Process Method might be 'nearest neighbour imputation'.


39.               A Process Method specifies the method to be used, and is associated with a set of Rules to be applied. For example, any use of the Process Method 'nearest neighbour imputation' will be associated with a (parameterized) Rule for determining the 'nearest neighbour'. In that example the Rule will be mathematical (for example, based on a formula). Rules can also be logical (for example, if Condition 1 is 'false' and Condition 2 is 'false' then set the 'requires imputation' flag to 'true', else set the 'requires imputation' flag to 'false').
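The two kinds of Rule mentioned in paragraph 39 can be sketched in Python. This is a minimal illustration, not a GSIM definition: the function names, the record layout and the choice of absolute difference on a single auxiliary variable as the distance formula are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[float]]


def requires_imputation(condition_1: bool, condition_2: bool) -> bool:
    """Logical Rule from the text: the flag is true only when both conditions are false."""
    return (not condition_1) and (not condition_2)


def absolute_distance(a: float, b: float) -> float:
    """Mathematical Rule (assumed formula): distance on one auxiliary variable."""
    return abs(a - b)


def nearest_neighbour_impute(
    records: List[Record],
    target: str,
    auxiliary: str,
    distance: Callable[[float, float], float] = absolute_distance,
) -> List[Record]:
    """Fill missing `target` values from the donor record closest on `auxiliary`."""
    donors = [r for r in records if r[target] is not None]
    result = []
    for record in records:
        if record[target] is None and donors:
            donor = min(donors, key=lambda d: distance(d[auxiliary], record[auxiliary]))
            record = {**record, target: donor[target]}  # copy, leave the input untouched
        result.append(record)
    return result


# Invented example data: the third unit has no reported turnover.
units = [
    {"employees": 10.0, "turnover": 100.0},
    {"employees": 50.0, "turnover": 500.0},
    {"employees": 12.0, "turnover": None},
]
imputed = nearest_neighbour_impute(units, target="turnover", auxiliary="employees")
```

Passing the distance function as a parameter mirrors the idea of a parameterized Rule: the Process Method stays the same while the Rule that defines 'nearest' can be swapped.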


40.               At the time the Process Step Design is executed someone or something needs to apply the designated method and rules. The Process Step Design can designate the Business Service that will implement the Process Method at the time of execution. A Business Service represents a service delivered by a person or a piece of software. Putting a publication on the statistical institute's website or putting collected response forms in a shared data source for further processing are both examples of Business Services.


41.               A Process consists of a set of Process Steps, including their associated process flow information. This enables the particular set of Process Steps to be named, and potentially catalogued and reused, as a Process. Process Steps need not be grouped into named Processes unless business benefits (for example, opportunities for reuse) are likely to result from doing so.


42.               A Statistical Activity initiates the execution of a top level Process Step which will result in all sub-steps being executed which are relevant to that instance of the Statistical Activity. Executing the top level Process Step should start populating a Process Step Execution Record associated with that Statistical Activity.


43.               The Process Step Execution Record (see Figure 9) will record the inputs provided when executing the top level Process Step. It will then record information which allows the actual flow of execution for that instance of the Statistical Activity to be traced. This includes recording the actual inputs to, and outputs from, each sub-step as well as the evaluation of each Process Control (which, in turn, determines the specific sequence of Process Steps performed during execution).



Figure 8. Process Step Design



44.               A Process Step Design (Figure 8) has a Process Input Specification that identifies the types of the Process Inputs required at the time of execution. An example might be a Process Input Specification that requires a Dimensional Dataset to be provided at the time of execution.


45.               A Process Step Design may also identify Process Inputs. These refer to specific instances of inputs, rather than specifying a type of input. For example, a Process Step Design may specify that a particular Code Set will be used to provide a list of valid values.


46.               Process Input Specifications and Process Inputs are often determined by the input requirements of the Business Service, Process Method and Rules associated with the Process Step Design.


47.               Process Output Specifications play an analogous role to Process Input Specifications but describe the types of Process Outputs to be produced at the time of execution of the Process Step.


48.               Process Control specifies what process flow should occur from one Process Step to the next at the time of execution. In some cases it may simply record the next Process Step to be executed on a fixed/constant basis. Alternatively, a Process Control may set out conditions to be evaluated at the time of execution to determine which Process Step(s) to execute next.

49.               An example of the latter might be testing a Process Output against a quality criterion and initiating one course of action if the output meets the standard and another if it does not. It is not until the time of execution of the Process Step that it is possible to determine whether the standard has been met or not.


50.               The specification and evaluation of conditional Process Controls refer to Rules. In the case of Process Controls, the Rules guide the process flow. (In the case of Process Step Designs, Rules guide the work done by the Process Step to produce Process Outputs.)
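The fixed and conditional forms of Process Control described in paragraphs 48 to 50 can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `ProcessControl` class, the step names and the 0.05 error-rate threshold are invented for the example and do not come from GSIM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Outputs = Dict[str, float]
Rule = Callable[[Outputs], bool]


@dataclass
class ProcessControl:
    """Ordered (Rule, next step) pairs; the first Rule that holds selects the step."""
    branches: List[Tuple[Rule, str]]
    default_next: Optional[str] = None  # None means the process ends here

    def next_step(self, outputs: Outputs) -> Optional[str]:
        """Evaluate the Rules against Process Outputs at the time of execution."""
        for rule, step_name in self.branches:
            if rule(outputs):
                return step_name
        return self.default_next


# Fixed flow: no Rules, always the same successor.
after_collect = ProcessControl(branches=[], default_next="edit")

# Conditional flow: the 0.05 threshold is an invented quality criterion.
after_edit = ProcessControl(
    branches=[(lambda out: out["error_rate"] <= 0.05, "aggregate")],
    default_next="re-edit",
)

good_path = after_edit.next_step({"error_rate": 0.01})
bad_path = after_edit.next_step({"error_rate": 0.20})
```

The design lists the possible next steps in advance, but the choice between them is only made when the Process Outputs exist, which matches the point made in paragraph 49.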


Figure 9. Process Step Execution




51.               A Process Step Execution Record (Figure 9) records the execution of activities according to a Process Step Design.


52.               Execution of a Process Step uses Process Inputs in accordance with the Process Input Specification specified in the Process Step Design (Figure 8).


53.               When execution takes place a particular instance of a Dimensional Dataset (for example, "Turnover of retail trade establishments by employment size, industry class and state, for November 2012") will be provided as a Process Input. The identity (instance) of the particular Dimensional Dataset may be different for each instance of execution. The specific Process Inputs associated with an instance of executing a Process Step are recorded in the Process Step Execution Record.
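The distinction between a Process Input Specification (a type of input, fixed at design time) and a Process Input (a specific instance, supplied at execution time) can be sketched in Python. The classes and the `execute` function below are hypothetical simplifications that reuse GSIM object names for readability; they are not a GSIM-defined interface.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Type


class DimensionalDataSet:
    """Stand-in for a Dimensional Dataset; only a title is modelled here."""

    def __init__(self, title: str) -> None:
        self.title = title


@dataclass
class ProcessStepDesign:
    name: str
    input_specification: Dict[str, Type]  # role name -> required type of input


@dataclass
class ProcessStepExecutionRecord:
    design_name: str
    inputs: Dict[str, Any] = field(default_factory=dict)


def execute(design: ProcessStepDesign, inputs: Dict[str, Any]) -> ProcessStepExecutionRecord:
    """Check the supplied instances against the specification, then record them."""
    for role, required_type in design.input_specification.items():
        if role not in inputs:
            raise ValueError(f"missing Process Input for role {role!r}")
        if not isinstance(inputs[role], required_type):
            raise TypeError(f"{role!r} must be a {required_type.__name__}")
    return ProcessStepExecutionRecord(design.name, dict(inputs))


# Design time: only the type of input is known.
design = ProcessStepDesign("Aggregate turnover", {"turnover": DimensionalDataSet})

# Execution time: a specific instance is supplied and recorded.
november = DimensionalDataSet("Turnover of retail trade establishments, November 2012")
record = execute(design, {"turnover": november})
```

A later execution of the same design could pass a different data set, for example the December instance, and would produce a separate record.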


54.               Parameter Inputs are a form of Process Input used to specify which configuration should be used for a specific execution of a Process Step. Examples include the statistical period concerned or a sample size.


55.               A Process Input may be provided to a Process Step in order for the Process Step to 'add value' to that input by producing an output which represents a transformed version of the input. Such a Process Input is classed as a Transformable Input. Usually this represents the main dataflow within the statistical process (like microdata, aggregated data, and disseminated data). It is, in short, the data transformed by the statistical process.


56.               A Process Support Input influences the work performed by the Process Step, and therefore influences its outcome, but does not correspond to a Parameter Input or a Transformable Input. Examples could include:


      A Code List which will be used to check whether the codes recorded in one dimension of a dataset are valid.

      An auxiliary Data Set which will influence imputation for, or editing of, a primary dataset which has been submitted to the Process Step as the Transformable Input.
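The three kinds of Process Input described in paragraphs 54 to 56 can be sketched as subclasses of a common type. The Python below is illustrative only: the field names, the `check_codes` function and the choice of the dimension name as the Parameter Input are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass
class ProcessInput:
    name: str


@dataclass
class ParameterInput(ProcessInput):
    value: str  # configuration for one execution


@dataclass
class TransformableInput(ProcessInput):
    rows: List[Dict[str, str]]  # the data the step transforms


@dataclass
class ProcessSupportInput(ProcessInput):
    valid_codes: FrozenSet[str]  # influences the work but is not itself transformed


def check_codes(
    data: TransformableInput,
    code_list: ProcessSupportInput,
    dimension: ParameterInput,
) -> List[Dict[str, object]]:
    """Return a transformed copy of the rows with a validity flag added."""
    return [
        {**row, "valid": row[dimension.value] in code_list.valid_codes}
        for row in data.rows
    ]


# Invented example data: one valid and one invalid state code.
data = TransformableInput("turnover by state", [{"state": "NSW"}, {"state": "XX"}])
codes = ProcessSupportInput("state code list", frozenset({"NSW", "VIC"}))
dimension = ParameterInput("dimension to check", "state")
checked = check_codes(data, codes, dimension)
```

Only the Transformable Input reappears in changed form in the result; the code list and the parameter shape the work without being part of the output.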


57.               A Process Output is any instance of an information object which is produced by a Process Step as a result of its execution. Process Outputs are subtyped as part of the Process Output Specification.


58.               A Transformed Output is the result which provides the 'reason for existence' of the Process Step. If that output were no longer required then there would be no need for the Process Step in its current form. Typically, a Transformed Output produced by a particular Process Step will either be provided as a Process Input to a subsequent Process Step or it represents the final product from a statistical business process.

59.               A Process Metric records information about the execution of a Process Step. Examples include how long it took to complete execution of the Process Step, or what percentage of records in the Transformable Input were updated by the Process Step to produce the Transformed Output.
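The two example Process Metrics in paragraph 59 can be computed as in the following Python sketch. The function `run_with_metrics`, the metric names and the `floor_at_zero` step are hypothetical and chosen only to illustrate the idea.

```python
import time
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]


def run_with_metrics(
    step: Callable[[Row], Row], rows: List[Row]
) -> Tuple[List[Row], Dict[str, float]]:
    """Apply `step` to a copy of each row and report two example metrics."""
    started = time.perf_counter()
    transformed = [step(dict(row)) for row in rows]  # copies keep the input intact
    elapsed = time.perf_counter() - started
    updated = sum(1 for before, after in zip(rows, transformed) if before != after)
    metrics = {
        "elapsed_seconds": elapsed,
        "percent_updated": 100.0 * updated / len(rows) if rows else 0.0,
    }
    return transformed, metrics


def floor_at_zero(row: Row) -> Row:
    """Invented editing step: negative turnover is replaced by zero."""
    row["turnover"] = max(row["turnover"], 0.0)
    return row


rows = [{"turnover": 5.0}, {"turnover": -1.0}, {"turnover": 3.0}, {"turnover": -2.0}]
transformed, metrics = run_with_metrics(floor_at_zero, rows)
```

In this example two of the four records change, so the share of updated records is 50 percent; the elapsed time depends on the machine running the step.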


60.               Process Outputs associated with execution of the current Process Step may be evaluated as part of Process Control in determining which process step to execute next.



Figure 10. Conceptual and Structural information objects can be Process Inputs and Outputs


61.               The execution of a Process Step will supply Process Inputs and result in Process Outputs. The specific Process Inputs and Process Outputs associated with the particular execution will be recorded in the Process Step Execution Record. Through Process Input Specification and Process Output Specification the Process Step Design defines the types of Process Inputs to be supplied, and the types of Process Outputs to be produced at the time of execution (see Figure 10). In many cases, these Process Inputs and Outputs are the conceptual and structural information objects that are described in the GSIM Concepts and Structures Groups (see Sections D and E). The same instance of an information object may perform different roles in different Process Steps.


D. Concepts Group


62.               The GSIM Concepts Group contains sets of information objects that describe and define the terms used when talking about real-world phenomena that the statistics measure in their practical implementation.


63.               The information objects in this group are used as Process Inputs and are often referred to in Products and Representations to provide information that helps users understand results.


64.               At an abstract level, a Concept is defined in GSIM as 'unit of thought differentiated by characteristics'. Concepts are used in these situations:

(a) As a Population. To describe the set of objects about which information is sought in a statistical survey. For example, the Population of adults in the Netherlands.

(b) As a characteristic. A particular Concept about a Population is described by a Variable. The data are linked to a Concept via a Variable. For example, the Concept of gender in the Population of adults in the Netherlands is collected by a Variable. At the representation level, there are data with Codes.

(c) As a Category to further define details about a Concept. For example, Male and Female for the Concept of Gender. Codes are linked to a Category via a Classification Scheme, for use within a Classification.





Figure 11. Populations and Units


65.               As part of a Statistical Activity there is a Population (see Figure 11). There are several kinds of Populations: Target, Survey, Frame, and Analysis. The objects of interest are Units (for example, persons or businesses). Data are collected about Units. There are two kinds of Unit specified in the model: Observation Unit and Analysis Unit. A Unit is associated with a Population.




66.               When used as part of a Statistical Activity, a Population is associated with a characteristic. The association of a Population and a Concept playing the role of a characteristic is called a Variable (see Figure 11). For example, if the Population is adults in the Netherlands, then a relevant Variable might be educational attainment.


67.               A Variable (for example, educational attainment of adults in the Netherlands) does not include any information on how the resulting value may be represented. This information is in the Represented Variable. This distinction prevents the duplication of Variable information when what is being measured is the same but is represented in a different manner, and so promotes the reuse of a Variable definition.


68.               A derived Variable is created by a Process Step that applies a Process Method to one or more Transformable Inputs (Variables). The Transformed Output of the Process Step is the derived Variable. In GSIM, this is modelled in the Production Group (see Section C).





Figure 12. Variable


69.               A Conceptual Domain is associated with a Variable. It has two subtypes: Described Conceptual Domain and Enumerated Conceptual Domain. An Enumerated Conceptual Domain, in combination with a Category Set, contains information on the semantics of the Categories used by the Variable.

Represented Variables


70.               GSIM assists users in understanding both the meaning of the object and the concrete data-representation of the object. Accordingly, GSIM distinguishes between conceptual and representation levels in the model, to differentiate between the objects used to conceptually describe information, and those that are representational.



Figure 13. Represented Variable


71.               The Represented Variable (see Figure 13) adds information that describes how the resulting values may be represented through association with a Value Domain. While Conceptual Domains are associated with a Variable, Value Domains are associated with a Represented Variable. These two domains are distinguished so that the semantic aspect (Conceptual Domain) can be described separately from the representational aspect (Value Domain).


72.               Both the Enumerated Value Domain and the Described Value Domain give information on how the Represented Variable is represented. The Enumerated Value Domain does this in combination with a Code List, while the Described Value Domain provides a definition of how to form the values, rather than explicitly listing them.
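As an informal illustration of why the Variable and its representation are kept apart, the following Python sketch reuses one Variable in two Represented Variables. The class layout, the codes and the wording of the described domain are hypothetical and are not defined by GSIM.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Variable:
    population: str
    concept: str  # the Concept playing the role of a characteristic


@dataclass
class EnumeratedValueDomain:
    code_list: Dict[str, str]  # Code -> Category label, listed explicitly


@dataclass
class DescribedValueDomain:
    description: str  # a rule for forming values instead of a listing


@dataclass
class RepresentedVariable:
    variable: Variable
    value_domain: Union[EnumeratedValueDomain, DescribedValueDomain]


attainment = Variable("adults in the Netherlands", "educational attainment")

# Hypothetical codes and wording, for illustration only.
as_levels = RepresentedVariable(
    attainment,
    EnumeratedValueDomain({"1": "primary", "2": "secondary", "3": "tertiary"}),
)
as_years = RepresentedVariable(
    attainment,
    DescribedValueDomain("whole number of completed years of schooling"),
)

# Both representations share one Variable definition.
print(as_levels.variable is as_years.variable)
```

The Variable definition is written once; only the Value Domain differs between the two Represented Variables.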


73.               The Value Domain is defined by a Data Type. Data Types contain information on the allowed computations one may perform on the Datum (see Figure 15). For example, it is possible to distinguish between nominal-, ordinal-, interval-, and ratio-data as Data Types. Gender Codes lead to nominal statistical data, whereas age values lead to interval data.


74.               A Unit of Measure refines the Value Domain. It is the entity by which some quantity is measured. Examples are Tonnes, Count of_, and Dollars.


Instance Variable



Figure 14. Instance Variable


75.               An Instance Variable (see Figure 14) is a particular Represented Variable associated with a collection of data (Datum). This corresponds to a column of data in a database. For example, the age of all the US presidents, either now (if they are alive) or at their deaths, is a column of data described by an Instance Variable, which is a combination of the Represented Variable "Age" and the Value Domain of "decimal natural numbers (in years)".


76.               A Datum is defined by the measure of a Value Domain combined with the link to a Unit (for example, persons or businesses). A Datum is also associated with a Data Type and a Unit of Measure through the Value Domain.






Figure 15. Overview of Classification


77.               Figure 15 provides an overview of the objects relating to Classifications. Classifications describe the Category role of a Concept.


78.               A Classification is a categorization of real world objects so that they may be grouped, by like characteristics, for the purposes of measurement, for example ISIC (International Standard Industrial Classification of All Economic Activities). Classifications can be grouped into a Classification Family, such as industrial activity.


79.               A Classification such as ISIC is a set of related Classification Schemes. It relates Classification Schemes that differ as Classification Versions or Classification Variants. A Classification Variant is based on a Classification Version. In a Classification Variant, the Categories of the Classification Version are split, aggregated or regrouped to provide additions or alternatives to the standard order and structure of the base Classification Version. A Classification Scheme has Categories organized into Levels determined by the hierarchy. A Level is a set of Concepts that are mutually exclusive and exhaustive, for example, section, division, group and class in ISIC Rev.4.


80.               A Classification Item combines the meaning, representation and additional information in order to meet the Classification criteria, for example "A - agriculture, forestry and fishing" and accompanying explanatory text such as information about what is included and excluded.


81.               A Correspondence Table can be created by a Map that links a Classification Item in a Classification Scheme with a corresponding Classification Item in another Classification Scheme via the Category corresponding to both Classification Items. For example, in a table displaying the relationship between ISIC Rev.4 and the North American Industry Classification System (NAICS 2007 (US)), 0111 in ISIC Rev.4 is related to 111110 in NAICS.
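A Correspondence Table of this kind could be sketched in Python as follows. This is only an illustration: the class and method names are assumptions, the shared Category is omitted for brevity, and the table holds just the single pair quoted above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ClassificationItem:
    scheme: str  # the Classification Scheme the item belongs to
    code: str


@dataclass(frozen=True)
class Map:
    source: ClassificationItem
    target: ClassificationItem


class CorrespondenceTable:
    def __init__(self, maps: Iterable[Map]) -> None:
        self.maps = list(maps)

    def targets(self, scheme: str, code: str) -> List[str]:
        """Return the codes that the given source item is mapped to."""
        wanted = ClassificationItem(scheme, code)
        return [m.target.code for m in self.maps if m.source == wanted]


table = CorrespondenceTable([
    Map(ClassificationItem("ISIC Rev.4", "0111"),
        ClassificationItem("NAICS 2007 (US)", "111110")),
])
print(table.targets("ISIC Rev.4", "0111"))
```

Returning a list rather than a single code allows for one-to-many correspondences, which arise whenever the two schemes draw their boundaries differently.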



Figure 16. Concept Systems


82.               A Category is typically part of a Category Set, which is a subtype of Concept System. A Category Set contains one or more Category Items. A Category can be represented in a Category Set, a Code List or a Classification Scheme. A Category provides meaning to these information objects, for example "agriculture, forestry and fishing" or "female".


83.               A Code List is also a type of Concept System. It is used for creating a group of Codes and their associated Categories. It can consist of one or more Code Items. A Code designates a Category, providing representation to the meaning from the Category. For example, in "F - female", the Code is F and the Category is Female.
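The relationship between Code and Category can be shown with a short Python sketch. The class names follow the terms above, but the structure is an assumption, and the Code "M" for Male is hypothetical (only "F - female" is given in the text).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Category:
    name: str  # carries the meaning


@dataclass(frozen=True)
class CodeItem:
    code: str           # carries the representation
    category: Category  # the Category that the Code designates


class CodeList:
    def __init__(self, items: Iterable[CodeItem]) -> None:
        self._by_code = {item.code: item.category for item in items}

    def designate(self, code: str) -> Category:
        """Look up the Category that a Code stands for."""
        return self._by_code[code]


gender = CodeList([
    CodeItem("F", Category("Female")),
    CodeItem("M", Category("Male")),  # "M" is an assumed Code
])
print(gender.designate("F").name)
```

Because the Category is a separate object, a second Code List (for example, one using numeric Codes) could point at the same Categories without redefining their meaning.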


E.   Structures Group


84.               The GSIM Structures Group contains sets of information objects that describe and define the terms used in relation to data and their structure.   Like the information objects in the Concepts Group, the information objects in this group are used as Process Inputs and are often referred to in Products and Representations to provide information that helps users understand the structure of the data.



Figure 17. Data Resource


85.               An Acquisition Program (see Figure 4) conducted by a statistical organization produces or supplies an Information Resource (see Figure 17). In GSIM, one subtype of an Information Resource has been specified. This is the Data Resource.


86.               A Data Resource comprises Data Sets. These Data Sets are made available as part of:

(a) an Acquisition Activity (that is, made available by the data providers for data acquisition or resulting from the Acquisition Activity); or

(b) a Dissemination Activity.


87.               For a Data Resource, the Data Set is discovered and provided by means of the Data Location. The Data Location specifies from where the data can be retrieved. This can be a link either to a specific file containing the data or to a Dissemination Service (see Figure 20) that will consume a query for the data and return a Data Set. If the link is to a Dissemination Service then it is probable that the Dissemination Service can be queried for many types of data and so can provide many Data Sets. Each Data Set must be structured according to a known Data Structure (for example, a known structure for Balance of Payments, Demography, Tourism, Education etc.).


88.               The Data Location is associated with a specific Provision Agreement which identifies the Data Provider and the Data Flow. Only one Data Structure can structure data relating to a Data Flow. A Data Flow can be grouped by Subject Fields (for example, National Accounts, Balance of Payments, Demography) which support data discovery.


89.               It is mandatory that the Data Set is linked to the Provision Agreement to which it relates (that is, the union of the Data Provider and the Data Flow).


Figure 18. Data Set

90.               A Data Set has Data Points. A Data Point is a placeholder (for example, an empty cell in a table) in a Data Set for a Datum. The Datum is the value that populates that placeholder (for example, an item of factual information obtained by measurement or created by a production process). A Data Structure describes the structure of a Data Set by means of Data Structure Components (Identifier Components, Measure Components and Attribute Components). These are all Represented Variables with specific roles.


91.               Data Sets come in different forms, for example as Administrative Registers, Time Series, Panel Data, or Survival Data, just to name a few. The type of a Data Set determines the set of specific attributes to be defined, the type of Data Structure required (Unit Data Structure or Dimensional Data Structure), and the methods applicable to the data.


92.               For instance, an administrative register is characterized by a Unit Data Structure, with attributes such as its original purpose or the last update date of each record. It contains a record identifying variable, and can be used to define a Frame Population, to replace or complement existing surveys, or as an auxiliary input to imputation. Record matching is an example of a method specifically relevant for registers.


93.               An example of a type of Data Set defined by a Dimensional Data Structure is a time series. It has specific attributes such as frequency and type of temporal aggregation, and specific methods, for example, seasonal adjustment, and must contain a temporal variable.


94.               Unit data and dimensional data are perspectives on data. Although not typically the case, the same set of data could be described both ways. Sometimes what is considered dimensional data by one organization (for example, a national statistical office) might be considered unit data by another (for example, Eurostat where the unit is the member state). A particular collection of data need not be considered to be intrinsically one or the other. This matter of perspective is conceptual.




Figure 19. Dimensional and Unit Data Structures


95.               A Dimensional Data Structure describes the structure of a Dimensional Data Set by means of Dimensional Identifier Components, Dimensional Measure Components and Dimensional Attribute Components. These are all Represented Variables with specific roles.


96.               The combination of dimensions contained in a Dimensional Data Structure creates a key or identifier of the measured values. For instance, country, indicator, measurement unit, frequency, and time dimensions together identify the cells in a cross-country time series with multiple indicators (for example, gross domestic product, gross domestic debt) measured in different units (for example, various currencies, percent changes) and at different frequencies (for example, annual, quarterly). The cells in such a multi-dimensional table contain the observation values.


97.               A measure is the variable that provides a container for these observation values. It takes its semantics from a subset of the dimensions of the Dimensional Data Structure. In the previous example, indicator and measurement unit can be considered as those semantics-providing dimensions, whereas frequency and time are the temporal dimensions and country the geographic dimension. An example of a measure in addition to the plain 'observation value' could be 'pre-break observation value' in the case of a time series. Dimensions typically refer to Variables with coded Value Domains, measures to Variables with uncoded Value Domains.
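The way a combination of dimensions identifies a cell can be sketched in Python as a dictionary keyed by a tuple of dimension values. The dimension names follow the example above; the function name, the codes and the observation values are placeholders invented for illustration and are not real statistics.

```python
from typing import Dict, Tuple

# Order of the dimensions that together form the key.
DIMENSIONS = ("country", "indicator", "measurement_unit", "frequency", "time")


def make_key(**values: str) -> Tuple[str, ...]:
    """Build a cell key; every dimension must be given a value."""
    missing = [d for d in DIMENSIONS if d not in values]
    if missing:
        raise KeyError("missing dimensions: " + ", ".join(missing))
    return tuple(values[d] for d in DIMENSIONS)


# The measure ('observation value') is what the key points to.
cells: Dict[Tuple[str, ...], float] = {}
cells[make_key(country="NL", indicator="GDP", measurement_unit="EUR",
               frequency="A", time="2015")] = 1.0   # placeholder value
cells[make_key(country="NL", indicator="GDP", measurement_unit="EUR",
               frequency="A", time="2016")] = 2.0   # placeholder value

lookup = make_key(time="2016", country="NL", indicator="GDP",
                  measurement_unit="EUR", frequency="A")
print(cells[lookup])
```

A key that omits any dimension is rejected, which mirrors the point that all dimensions together are needed to identify a single observation value.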


98.               A Unit Data Structure describes the structure of a Unit Data Set by means of Unit Identifier Components, Unit Measure Components and Unit Attribute Components. These are all Represented Variables with specific roles.


99.               A Unit Data Structure specifies the structure of unit data. It distinguishes between the logical and physical structure of a Data Set. A Unit Data Set may contain data on more than one type of Unit, each represented by its own record type.


100.               Logical Records describe the structure of such record types, independent of physical features, by referring to Represented Variables that may include a unit identification (for example, household number). A Record Relationship defines source-target relations between Logical Records.



Figure 20. Dissemination Activity


101.               A Dissemination Service exposes the Data Sets and other metadata contained in the Information Resource. It is the mechanism to create and disseminate Representations to consumers. These Representations are created dynamically in response to a specific request and according to the specific needs of the consumer (the Output Specification). Representations may contain any type of information, for instance statistical data (as a Data Set or visualization) or structural or conceptual metadata like a Data Structure, a Code Set or a description of a Concept.


102.               A Product is the result of a Publication Activity. Products are stored for later dissemination through Dissemination Services. Examples of Products are publications, press releases, etc. Representations may be used as input to, and as components of, a Product.

F. Base Group


103.               The GSIM Base Group consists of several information objects that can be seen as the fundamental building blocks that support many of the other information objects and relationships in the model. These information objects form the nucleus for the application of GSIM information objects. They provide features which are reusable by other information objects to support horizontal functionality such as identity, versioning etc. For these reasons, many of these information objects are rather abstract in nature.



Figure 21. Base Artefacts


104.               The only base artefact in GSIM that gives underlying identity and naming is the Identifiable Artefact. It can be inherited by any class in GSIM for which identity, name, description, and additional documentation are required.


105.               The Identifiable Artefact has three associations to Contextual String: one for each of name, description, and documentation. The value in the Contextual String is given a context by the Context Key, which can be Type or Language.
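A rough Python sketch of the Identifiable Artefact and its three associations to Contextual String is given here. The attribute names, the `id` field, the language codes and the example strings are assumptions for illustration; GSIM itself specifies these objects only at the conceptual level.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ContextualString:
    value: str
    # Context Key -> context value; the keys named in the text are
    # Type and Language.
    context: Dict[str, str] = field(default_factory=dict)


@dataclass
class IdentifiableArtefact:
    id: str  # hypothetical identity attribute
    names: List[ContextualString] = field(default_factory=list)
    descriptions: List[ContextualString] = field(default_factory=list)
    documentation: List[ContextualString] = field(default_factory=list)

    def name_for(self, language: str) -> Optional[str]:
        """Return the name whose Language context matches, if any."""
        for candidate in self.names:
            if candidate.context.get("Language") == language:
                return candidate.value
        return None


# Hypothetical identifier and strings.
artefact = IdentifiableArtefact(
    id="example-001",
    names=[
        ContextualString("Industrial classifications", {"Language": "en"}),
        ContextualString("Classifications industrielles", {"Language": "fr"}),
    ],
)
print(artefact.name_for("fr"))
```

Any class that needs identity and multilingual naming could inherit from such a base rather than redefining these fields.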


106.               There is no attempt in GSIM to model the administration of items in repositories, such as the maintenance agency, versioning, or repository functions. However, the Identifiable Artefact does have a link to Administrative Details where such details can be added using the GSIM extension methodology.






Figure 22. Organization


107.               An Organization Scheme comprises Organization Items, each of which can be an Organization Unit or an Individual. The Organization Unit can be in a hierarchic scheme of Organization Items. An Individual or Organization Unit can have a number of different Contact Details.


108.               The Individual or Organization Unit can play zero or more recognized roles (Organization Item Role) in the maintenance (Maintenance Agency), data collection (Data Provider) and dissemination (Data Consumer) processes.


Annex A.     Extending the model

109.         One of the GSIM design principles is that GSIM can easily be adapted and extended to meet users' needs. It is expected that some implementers may wish to extend GSIM, by adding detail and indicating which information objects are used, and exactly how.


110.         Examples of when this could be needed are:

(a)       A statistical organization wants to specify types of Rules (for example, Methodological Rules and Process Control Rules)

(b)     A statistical organization wants to add another specialization of Instrument


111.         Note that there are many points in GSIM where additional detail is expected to be added. These extensions can be done using the modelling techniques which GSIM itself uses. The following guidelines are intended to help modellers employ a common technique when extending and implementing the conceptual model, so that the use of GSIM itself within specific organizations is done in a common and understandable fashion.


112.         For people who have experience in modelling with the standard UML tools, the recommended technique should be straightforward. However, not all staff have this experience. For those with less familiarity, a 'metamodel template' is also provided which allows non-modellers to capture the same information in a form that relies on plain text.


A.   GSIM Extension Methodology




113.               As part of the GSIM v1.0 release, the Enterprise Architect file which contains the UML models will be released. In this file there are five 'namespaces' (or 'packages') – one for each of the GSIM Groups.


114.         Any organization extending GSIM should establish one or more namespaces which are specific to and owned/maintained by that organization. This provides a clean separation between GSIM itself, and the extensions that have been made to it.


115.         In many cases, the extensions might provide useful input to future development of GSIM itself, so should be made available to the maintenance agency (UNECE Standards Steering Group). In other cases, they may be too organization-specific for this purpose.


116.         The classes native to GSIM would be imported into the organization-specific namespace(s), and extensions made from them. Any new information objects would also be modelled in this namespace. In the same way that GSIM itself is organized into namespaces, it is recommended that if more than one organization-specific namespace is created by the extender, these should be organized along similar lines.


New Classes


117.         New classes may be created using the same style of modelling as is found in GSIM itself. GSIM uses a fairly standard but restricted set of the features of UML. The best guide to this style is to study the GSIM UML models. Such things as multiple inheritance have been avoided, and there is a distinct style in terms of how relationship roles are named.

Extensions/restrictions to existing classes


118.         Any class within GSIM can be imported and then extended/restricted. Classes can be extended with new properties and relationships, and the existing properties and relationships can be over-ridden.


119.         The extended classes inherit all properties and relationships from their parents, so these do not need to be explicitly modelled unless:


(a)       they are required for clearer understanding (they will appear preceded by a slash ["/"]); or
(b)     they have been changed - that is, over-ridden.


120.         Extension and restriction in the UML models are shown with an open-headed arrow pointing from the extending/restricting class to the class that it inherits from, and of which it is a sub-type. The details of what is allowed are provided below:

Extension of existing classes:

121.         Create a new sub-type, with its own name, a definition, explanatory text, and examples, and then specify any additional type-specific additions to the set of properties or relationships which that information object possesses.


122.         Note: There are some common attributes, which exist for all GSIM information objects, and these will be present by inheritance. The same is true for administrative attributes added to the GSIM Base Administrative Details information object.

Restriction of classes:

123.         The information object to be restricted is imported into the organization-specific namespace and then sub-classed. Any existing relationships or properties may be over-ridden, unless they are required by the inherited cardinalities. This is done by simply re-stating the property or relationship, and changing its details.   Even within required cardinalities, so long as a restriction still produces a valid instance of its parent, the change is allowed. For example, a property with a cardinality of 1..* may be restricted to having a cardinality of 1, but not less than that, since at least one instance of that property is required.


124.         Note: If a class in GSIM is to be both extended and restricted, the same sub-type is used, with over-rides and additions made as desired.


125.         It is possible, using this mechanism, to express exactly what information objects within an organization are used and not used. If there is no relationship to an information object, or if its cardinality has been reduced to 0 for all properties and relationships, it is simply not used.
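The cardinality rule for restrictions described above can be expressed as a small check: a restricted cardinality is acceptable only if every instance it allows is also allowed by the parent. The following Python sketch is one possible reading of that rule; the tuple encoding and the function name are assumptions, not part of GSIM or UML tooling.

```python
from typing import Optional, Tuple

# (lower bound, upper bound); an upper bound of None stands for "*".
Cardinality = Tuple[int, Optional[int]]


def is_valid_restriction(parent: Cardinality, child: Cardinality) -> bool:
    """True if the child range lies entirely within the parent range."""
    parent_low, parent_high = parent
    child_low, child_high = child
    if child_high is not None and child_low > child_high:
        return False  # the child range itself is ill-formed
    if child_low < parent_low:
        return False  # would allow fewer instances than the parent requires
    if parent_high is None:
        return True   # parent is unbounded above
    return child_high is not None and child_high <= parent_high


print(is_valid_restriction((1, None), (1, 1)))   # 1..* restricted to 1
print(is_valid_restriction((1, None), (0, 1)))   # 1..* cannot become 0..1
print(is_valid_restriction((0, None), (0, 0)))   # 0..* reduced to 0: unused
```

Under this reading, reducing an optional relationship to a cardinality of 0 is the way an organization records that an information object is not used.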



126.         GSIM itself should be used as an example of how to document extensions and restrictions. This means providing the information in the metamodel template (see below) and providing the definitions and descriptions/examples in tabular form, as well as providing an overall narrative of each UML diagram produced.


Box 1. Metamodel Template


Information Object Name

Explanatory Text:

Value Type

Relationships (repeat as needed)

Target Object:
Relationship Type:
Source Role:
Source Cardinality:
Target Role:
Target Cardinality:


Box 2. Example of completed template

Classification Family

Version: 1.0
Package: Concepts
Definition: A set of related Classifications. The Classification Family includes Classifications devoted to describing the same subject matter, such as industries.
Explanatory Text:
Synonyms:
Constraints: None

Attributes:

The unique identifier of the object. Value Type: Unique value within the owner agency.
A human-readable identifier for the object.
The version of the object assigned by the owning agency. Value Type: Version designator (defaults to "1.0").
The organization or legal entity which owns and maintains the object. Value Type: Entity designator.
A human-readable description of the object.
A human-readable internal note intended for the developers/maintainers of GSIM.
Valid From: The effective date on which the object is published.
Valid To: The effective date on which the object is withdrawn from publication.

Name: Subject
Target Object: Classification
Relationship Type: Aggregation
Description: Classification Family is a grouping of related Classifications, which is for relating Classifications covering the same subject matter. An example is industrial classifications, for which NAICS and ISIC are related Classifications.
Source Role: Contained in
Source Cardinality: 0..N
Target Role: Contains
Target Cardinality: 0..N




B.   Administrative Attributes


127.         GSIM does not model the information used by statistical organizations to administer and maintain their metadata - there are too many potential differences. Such administrative attributes are also very dependent on implementation, and GSIM is a conceptual model.

128.         To support the use of administrative attributes, GSIM provides an information object - Administrative Details - which can be extended to include whatever set of administrative attributes are needed by an implementer of the GSIM.


129.         To encourage commonality of practice, GSIM recommends a set of administrative attributes based on the ISO/IEC 11179 standard. The following table shows the set of recommended attributes for the administration of GSIM information objects.


Table 2. Recommended Attributes

Identification attributes

A term which designates a concept, in this case an information object. The identifying name will be the preferred designation. There will be many terms to designate the same information object, such as synonyms and terms in other languages.
The unique identifier of the information object; assigned by the owner agency.

Governance attributes

The version designator of the information object assigned by the owner agency.
Owner Agency: The organization or legal entity that owns and maintains the information object.
Organization Unit: The organization unit, within an agency, which owns (has rights to create, update, delete) the information object. Value Domain: Controlled vocabulary.
Valid From: The date on which the information object is effective or valid.
Valid Until: The date on which the information object is no longer effective or valid.
Created Date: The date on which the information object was created.
Created User Id: The person who created the information object. Value Domain: Controlled vocabulary.
Last Update Date: The date on which the information object was last changed.
Last Update User Id: The person who last changed the information object. Value Domain: Controlled vocabulary.
Administrative status [2]: Indicator for access to an item: under review, open for use, or removed. Value Domain: Controlled vocabulary.
Life cycle status: Indicator for the quality of an item: incomplete, valid, superseded, or retired. Value Domain: Controlled vocabulary.

Content attributes

A statement which describes an information object. It also delineates the information object's scope.
A comment or instruction which provides additional explanations about the information object and how to use it.
The subject or theme the information object is related to. This is included to support search. Value Domain: Controlled vocabulary.
Terms related to the information object. These are included to support search. Value Domain: Controlled vocabulary.

Technical implementation attribute

Identifies if the description can be executed by a machine.



130.         Implementers can use the GSIM extension methodology to include the recommended set of administrative attributes. The Administrative Details information object in GSIM has been purposefully left blank as a stub to be extended.


131.         In this case, all that is needed is to create a namespace and to import the Administrative Details information object into it. The Administrative Details information object is then sub-classed, and the attributes listed above are added. Figure 23 shows what would appear in a UML diagram if this is done.



Figure 23. Extension of Administrative Details


Note: The fields containing controlled vocabularies are shown in the diagram as text. These text strings would agree with a maintained list appropriate to the field which uses them.
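In code, the same extension could look roughly like the Python sketch below: an organization-specific sub-class of the empty Administrative Details stub, carrying a few of the recommended attributes and checking the text fields against maintained lists. The sub-class name, the selection of attributes and the default values are assumptions for illustration only.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Maintained lists for the controlled-vocabulary fields (values from Table 2).
ADMINISTRATIVE_STATUS = ("under review", "open for use", "removed")
LIFE_CYCLE_STATUS = ("incomplete", "valid", "superseded", "retired")


class AdministrativeDetails:
    """Stand-in for the GSIM stub, deliberately left empty."""


@dataclass
class ExtendedAdministrativeDetails(AdministrativeDetails):
    """Hypothetical organization-specific sub-class."""
    owner_agency: str
    valid_from: date
    valid_until: Optional[date] = None
    administrative_status: str = "under review"  # assumed default
    life_cycle_status: str = "incomplete"        # assumed default

    def __post_init__(self) -> None:
        # Text fields must agree with the maintained lists.
        if self.administrative_status not in ADMINISTRATIVE_STATUS:
            raise ValueError("unknown administrative status")
        if self.life_cycle_status not in LIFE_CYCLE_STATUS:
            raise ValueError("unknown life cycle status")


details = ExtendedAdministrativeDetails(
    owner_agency="Example Statistical Office",  # hypothetical agency
    valid_from=date(2016, 1, 1),
    life_cycle_status="valid",
)
print(isinstance(details, AdministrativeDetails))
```

Keeping the sub-class in the organization's own namespace (here, its own module) leaves the GSIM stub itself untouched.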


Annex B.     Influence of existing standards


A.   Introduction


132.             GSIM must be implementable: In order to support the implementation of the GSIM reference framework, many known standards and tools have also been examined, to ensure that the reference framework is complete and useful in this respect. This section describes the influences of and relationships to a number of relevant standards.


133.               Figure 24 illustrates how different relevant standards, models, and implementation syntaxes and tools relate to GSIM. Standards and models that have provided significant input to GSIM are presented on the left hand side of the figure. Implementation syntaxes and tools that are currently of relevance to an implementation of GSIM are presented on the right hand side of the figure. This list will become outdated as more and more implementation syntaxes and tools are developed. The particular software packages listed are widely used in statistical organizations, but are intended to be illustrative examples, and are not a complete list.



Figure 24. GSIM and its relationship to other relevant standards and models


B. Generic Statistical Business Process Model (GSBPM)


134.               GSBPM provides descriptions of business processes that can occur throughout the statistical production process. It is a framework for categorizing processes. In order to describe a process in a level of actionable detail, more information is needed.


135.               GSBPM explicitly excludes descriptions of flows within processes. This additional information is necessary if you wish to have reusable processes that talk about "flow" rather than just the specific functions which need to be performed during the flow (with no description of how they fit together).


136.               Information needs to:


      Flow between GSBPM processes. For example, data are processed or transformed between the Collect and Disseminate phases.

      Govern the behaviour of GSBPM sub-processes. There are business rules and derivation formulas that are applied during processes (for example Impute, Derive New Variables). There are also rules or plans that determine which process should be performed next. An example of this is whether the quality of the data is sufficient to proceed to the next step or whether some form of remedial processing is required.

      Report on the outcome of GSBPM processes. For example, process related statistical quality metrics such as response rates or imputation variance.
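The second kind of information, a rule that decides which process is performed next, can be illustrated with a minimal Python sketch. The class name, the metric name, the threshold and the step names are all hypothetical; the sketch only shows the idea of routing on a reported quality metric.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ProcessControlRule:
    metric: str       # name of the reported metric to inspect
    threshold: float  # minimum acceptable value (hypothetical)
    on_pass: str      # step to run when quality is sufficient
    on_fail: str      # remedial step to run otherwise

    def next_step(self, metrics: Dict[str, float]) -> str:
        """Choose the next step from the metrics of the step just run."""
        if metrics[self.metric] >= self.threshold:
            return self.on_pass
        return self.on_fail


rule = ProcessControlRule(
    metric="response_rate",
    threshold=0.8,
    on_pass="Derive New Variables",
    on_fail="Follow up non-respondents",
)
print(rule.next_step({"response_rate": 0.85}))
print(rule.next_step({"response_rate": 0.60}))
```

Expressing the decision as data rather than as hard-coded logic is what allows such rules to be shared and reused between processes.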


137.               The GSIM Production Group seeks to provide a standard way to capture this information about processes. It includes information objects such as Process Step, Process Step Design, Process Step Execution Record, Rule, Process Input and Process Output.


138.               GSIM is designed to support current production processes and facilitate the modernization of statistical production. Implementation of GSIM, in combination with GSBPM, will bring further important advantages. GSIM will:


      create an environment prepared for reuse and sharing of methods, components and processes;

      provide the opportunity to implement rule based process control, thus minimizing human intervention in the production process;

      generate economies of scale through development of common tools by the community of statistical organizations.


Figure 25. GSIM and GSBPM

C. Data Documentation Initiative (DDI)


139.         The DDI Alliance supports the development of the GSIM information model and finds many parallels between the model and the DDI Lifecycle specification. The DDI Alliance is interested in working closely with the GSIM group to extend the modeling effort to encompass the definition of lower-level elements and attributes to provide actionable metadata that can be used to drive production and data collection processes.


Relationship with GSIM Business Group


140.         The DDI standard is not designed to describe all aspects of a statistical program. However, there is a solid alignment with this portion of GSIM, especially as it relates to describing the data and metadata which are used by a particular activity. The primary link between GSIM and DDI, in this regard, is the DDI ‘Study Unit’ and the GSIM Statistical Program Cycle information object. All of the data and metadata associated with a particular cycle of a Statistical Program can be described using DDI XML, and the relationship of the different information objects can be described.


141.               As DDI has a ‘lifecycle’ orientation, it is useful for describing many different aspects of the data, from collection through to dissemination. DDI provides a very rich description of a survey instrument, and this can be used to implement the GSIM Survey Instrument information object. It is easy to see that the DDI ‘ControlConstruct’ elements can be implementations of the GSIM Instrument Control information object, although these are more detailed in the DDI implementation model.


142.         GSIM and DDI both model, in the same way, the existence of such information objects as Questions and Interviewer Instructions separately from the use of these resources in a survey instrument. This is important when it comes to re-use, as a question which is bound to a specific survey (for example) becomes non-reusable. Both DDI and GSIM see a similar set of objects when describing a Survey Instrument: Questions, Statements, Question Blocks, and Interviewer Instructions. These information objects are shared by both standards.


Relationship with GSIM Production Group


143.         In the current versions of DDI, there is very little content related to the management of statistical production. However, there are plenty of metadata to describe some specific types of data processing. In addition, DDI provides a way of recording ‘Lifecycle Events’, which can record any kind of processing or production event and associate it with other identifiable metadata information objects. DDI is very useful when it comes to describing many of the data and metadata inputs and outputs for these processes. (It should be noted that Process Metrics are often themselves data sets, and can be described as such in DDI.)


144.         Another strong feature of DDI relative to GSIM is the ability to describe data collection activities. DDI has elements for describing ‘CollectionEvents’, which can be associated with the specific variables populated (although this feature is not required). While it would be intuitive to associate ‘CollectionEvents’ in DDI with data acquisition activities in GSIM, there is also a relationship with data processing activities.


145.         In future versions of DDI, it is likely that the ability to associate processing and production information with metadata information objects will be enhanced. There were several extensions to this capability found in DDI version 3.2, and this feature will be revisited and enhanced in future versions of DDI.


146.           ‘Processing Events’ in DDI can be used to describe some types of processes as well: ‘Control Operations’, ‘Cleaning Operations’, ‘Codings’, ‘Data Appraisal Information’, and ‘Weighting’. While GSIM does not provide a breakdown of these types of processes, it is easy to see where these might fit into a process model such as GSBPM. What is captured by DDI, however, is the same type of information content as is found in GSIM.


147.         The types of processing described by DDI include different types of ‘Codings’: ‘Generation Instructions’ and ‘General Instructions’. For each of these it is possible to provide a textual description of the process; to link to or insert the actual program code used to execute the process; and, in the case of generation instructions, it is possible to link to the variables manipulated by a derivation process. This model is in some ways similar to what is found in GSIM – it lacks tie-backs to the methodology used, and also to the explicit business function, but in some ways (inputs, code and controls applied, outputs) is fairly similar to GSIM.


148.           ‘Generation Instructions’ describe processes used to create new variables from existing ones. Tabulation of data often requires the creation of new variables. This DDI structure is similar to a GSIM Process Step Execution Record, and includes some additional information (such as the processing code). The DDI structure is perhaps not optimal, because there will potentially be a lot of detail in a Process Step Execution Record placed into a text field in the DDI XML – the description field of the ‘Generation Instruction’.


149.           ‘General Instructions’ are used in DDI to describe other types of processing, in a manner similar to ‘Generation Instructions’. This can cover the Process Step Execution Record portion of the GSIM model. ‘Data Appraisal’ includes information such as sampling error and response rate, which may be useful for some processes as Process Metrics. Other ‘Processing Events’ are simple descriptions.


150.           ‘Lifecycle Events’ can be used to associate any process with the relevant inputs and outputs. Typically a process model (such as GSBPM) is used to distinguish types of ‘Lifecycle Events’, but there is no rule in DDI to prevent the process references being at a more detailed level.


151.         DDI does not provide a mechanism for describing Process Step Design, but was designed to work with a separate description of this information, expressed in BPMN or BPEL, for example.




Relationship with GSIM Concepts Group


152.         DDI as a standard describes many of the foundational metadata objects which are modelled in the GSIM Concepts Group. Concepts, Categories, Codes, Variables, and Populations (in DDI, ‘Universes’) are all present in both DDI and GSIM. There is no dedicated way of representing a Classification in DDI – it is simply a pairing of a ‘category scheme’ and a ‘code scheme’ – but otherwise the two models are very similar. One major difference in this area is that DDI (and, indeed, all other models) lacks the concept of what is described in GSIM as a Node. This is a key improvement in GSIM for managing this type of metadata, and it is expected that it will be reflected in future versions of DDI.


153.         One feature of GSIM which is more nuanced than in DDI is the set of Variable information objects. In GSIM there is a separation between Variable, Represented Variable, and Instance Variable. In DDI 3.1, only the instance variable is included (called ‘Variable’ in DDI terminology). In DDI 3.2, the standard has added what it terms a ‘Data Element’, corresponding to the GSIM Represented Variable information object.


154.         GSIM also has a richer set of concept links between various information objects than the current versions of DDI-Lifecycle. It is anticipated that the DDI model will be adjusted to include such linkages in future, as a response to GSIM. In DDI, there are links between concepts, questions, variables, and levels within classifications. In DDI 3.2, links to categories have been added. The model is more consistent in GSIM, where concept links are also applied to populations.
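
The three-way separation can be pictured with a small Python sketch. GSIM defines these objects conceptually, not as code, so the attribute names and the example values (the "sex" variable, the two value domains and the data set name) are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Variable:
    """The characteristic being measured, independent of any representation."""
    name: str
    population: str


@dataclass(frozen=True)
class RepresentedVariable:
    """A Variable combined with the set of values used to represent it."""
    variable: Variable
    value_domain: Tuple[str, ...]


@dataclass(frozen=True)
class InstanceVariable:
    """A Represented Variable as it is used in one particular data set."""
    represented: RepresentedVariable
    data_set: str
    column: str


# One conceptual Variable, reused by two different representations.
sex = Variable(name="sex", population="persons")
sex_alpha = RepresentedVariable(sex, ("M", "F"))
sex_numeric = RepresentedVariable(sex, ("1", "2"))

# The representation as it appears in a specific (hypothetical) data set.
sex_in_survey = InstanceVariable(sex_alpha, data_set="survey_2012", column="SEX")

print(sex_in_survey.represented.variable.name)
```

In this sketch the DDI 3.1 ‘Variable’ would sit at the level of InstanceVariable, and the DDI 3.2 ‘Data Element’ at the level of RepresentedVariable.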


Relationship with GSIM Structures Group


155.         The GSIM Structures Group maps well to DDI, especially with regard to the description of unit data. As GSIM is a conceptual model, it does not go into all of the implementation detail found in DDI for describing the storage of data. However, at the logical level, the two models are very compatible. Variables in DDI play a few specialized roles, including the identification of unit records, observations about those unit records, and additional supporting information such as weights. This maps very cleanly onto the GSIM model.


156.         Further, DDI also has a concept of ‘NCubes’, multi-dimensional data sets. These also exist in GSIM in the form of Dimensional Data Sets. DDI here has variables playing the roles of identifiers (dimensions), measures (observations), and attributes, and as such is very much like the model found in GSIM. Both models also tie the values back to the variables in which they are stored.


157.               What DDI is largely lacking are the constructs used to manage the data, such as Data Flows, Data Channels, Provision Agreements, etc. The mismatch here largely results from the fact that DDI is fundamentally organized around a lifecycle model, rather than a model of exchange like SDMX. It remains to be seen how much of this type of metadata will be introduced into the DDI model in future – it is likely that GSIM itself may dictate that this type of information be better supported.
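
The roles that variables play in a dimensional data set can be sketched as follows. This is a minimal illustration, not the DDI or GSIM structure itself: the Role and Component names, and the example variables, are assumptions made for the sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class Role(Enum):
    IDENTIFIER = "identifier"  # a dimension that locates a cell
    MEASURE = "measure"        # the observed value
    ATTRIBUTE = "attribute"    # supporting information about the observation


@dataclass(frozen=True)
class Component:
    """A variable together with the role it plays in the data structure."""
    variable: str
    role: Role


def cell_key(structure: List[Component], row: Dict[str, object]) -> Tuple[object, ...]:
    """Build the key of a cell from the values of its identifying variables."""
    return tuple(row[c.variable] for c in structure if c.role is Role.IDENTIFIER)


# Hypothetical structure: two dimensions, one measure, one attribute.
structure = [
    Component("region", Role.IDENTIFIER),
    Component("year", Role.IDENTIFIER),
    Component("population_count", Role.MEASURE),
    Component("status", Role.ATTRIBUTE),
]

row = {"region": "North", "year": 2012,
       "population_count": 1500, "status": "provisional"}

print(cell_key(structure, row))
```

Every value in the row stays tied to a named variable, which is the property both models share.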


D. Statistical Data and Metadata eXchange (SDMX)


Relationship with GSIM Business Group


158.               In general, SDMX does not explicitly cover the constructs in the Business Group. However, the SDMX ‘Metadata Structure Definition’ and related ‘Metadata Set’ are used to describe and to provide quality, methodological and other reference metadata. These metadata are not modelled explicitly in GSIM but are rather embedded in other GSIM constructs. These same SDMX constructs would also be used to map the metadata of the GSIM Statistical Need, Assessment, and Business Case.


159.               GSIM has additional information about the Data Channel, Instrument, Instrument Control, Question Scheme, Information Request, and Statistical Program which are not found in SDMX.


Relationship with GSIM Production Group


160.               The SDMX standard is primarily focused on the description of aggregate data sets and related metadata of various types. These various types of data and metadata are used as inputs and outputs by statistical processes. However, SDMX also contains some structures which are relevant to an implementation of the GSIM Production Group. Key among these is the ability SDMX provides to describe processes and process steps.


161.               There is quite a good fit between the SDMX process model – which is made up of a set of nested, hierarchical sub-steps – and the GSIM approach, which is more detailed, but essentially similar. A Process, its constituent Process Steps and its associated Process Control and Rule information objects describe essentially the same information as the SDMX ‘Process’, ‘Process Step’, ‘Transition’, and ‘Computation’ description: the flow of a process and the data and metadata inputs and outputs.
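
A nested, hierarchical set of sub-steps can be pictured with the short sketch below. It shows only the nesting, not the transitions or computations, and it is neither the SDMX nor the GSIM schema: the ProcessStep class and the walk helper are hypothetical, and the step names are borrowed from GSBPM examples used elsewhere in this document.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class ProcessStep:
    """A step that may contain nested sub-steps (hypothetical shape)."""
    name: str
    sub_steps: List["ProcessStep"] = field(default_factory=list)


def walk(step: ProcessStep, depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, name) for a step and all of its nested sub-steps."""
    yield depth, step.name
    for sub in step.sub_steps:
        yield from walk(sub, depth + 1)


process = ProcessStep("Process", [
    ProcessStep("Review, validate and edit",
                [ProcessStep("Identify and address outliers")]),
    ProcessStep("Impute"),
    ProcessStep("Derive new variables"),
])

# Print the hierarchy with one level of indentation per nesting depth.
for depth, name in walk(process):
    print("  " * depth + name)
```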


162.               Whilst SDMX supports ‘Process Artefact’ for inputs and outputs, there is no link in SDMX to what provides the inputs. GSIM Process Input is provided by the Statistical Program Design (static design input) and the Statistical Activity (dynamic “run time” input).


163.               Other parts of the GSIM production model could be implemented with SDMX as reference metadata, but the utility of this will depend very much on what the GSIM implementation is being built to do.


Relationship with GSIM Concepts Group


164.               The SDMX standard contains many of the foundational metadata objects used in this part of GSIM. The level of detail is somewhat different, because SDMX does not make a distinction between the meaning of a code (a Category) and the Code itself – both are bundled together into a ‘Codelist’ in SDMX. However, the same information about hierarchies (in SDMX, ‘Hierarchical Codelists’) can be expressed. There is no distinct classification information object in SDMX – classifications are described using a combination of SDMX ‘Codelist’ and ‘Hierarchical Codelist’ information objects. The ‘Hierarchical Codelist’ model in SDMX was developed with classification support as one use case.
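
The difference in level of detail can be sketched in Python. The Category and Code classes below are a simplified reading of the GSIM separation between meaning and representation, and the dictionary returned by to_codelist is only a stand-in for an SDMX ‘Codelist’; none of these names come from either standard's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Category:
    """The meaning, independent of how it is coded."""
    meaning: str


@dataclass(frozen=True)
class Code:
    """A designation that points at the Category it represents."""
    designation: str
    category: Category


def to_codelist(codes: List[Code]) -> Dict[str, str]:
    """Bundle each code with its meaning, as a flat code-to-label mapping."""
    return {c.designation: c.category.meaning for c in codes}


male, female = Category("Male"), Category("Female")

# The same two Categories are reused by two different sets of Codes.
alpha_codes = [Code("M", male), Code("F", female)]
numeric_codes = [Code("1", male), Code("2", female)]

print(to_codelist(alpha_codes))
print(to_codelist(numeric_codes))
```

Once bundled, the two mappings no longer record that "M" and "1" share one Category; that link is what the separate Category object preserves.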


165.               Both the GSIM and the SDMX models have Concepts as an important construct, although the linkages to concepts are richer in GSIM than in SDMX.


166.               As SDMX focuses on aggregate data, it has no information object representing a variable, unlike GSIM. When used to describe data, however, ‘Concepts’ in SDMX can be mapped to Variables as they appear in GSIM, such that the SDMX ‘Concept’ represents a collapsed GSIM Concept and Variable.


SDMX has no explicit support for the GSIM Population.


Relationship with GSIM Structures Group


167.               GSIM has a number of constructs in the Information area which will be familiar to people using SDMX. The Dimensional Data Set corresponds to an SDMX ‘Data Set’, and an SDMX ‘Data Structure Definition’ corresponds to a GSIM Dimensional Data Structure.


168.         The Data Resource model contains information objects from SDMX such as Data Flows, Data Providers, and Provision Agreements, in a very similar form. The way GSIM groups Data Flow by Data Resource and Subject Field would be supported in SDMX by ‘Category’ (this is not the same as a GSIM Category) and ‘Categorisation’.


E.   ISO/IEC 11179


169.         ISO/IEC 11179 is a standard for describing and managing the meaning and representation of data. It specifies the kind and quality of metadata necessary to describe data. The GSIM Concepts Group contains a terminological description of data. This is similar in many respects to 11179.

170.         However, 11179 also specifies the management and administration of metadata in a metadata registry. Registration is the process of managing the content and quality of descriptions, and this is supported explicitly in 11179. GSIM does not seek to replicate this work.

171.         There are a number of constructs which are similar in 11179 and GSIM. Table 3 shows the pairs of constructs that are equivalent in the two specifications:










Table 3. Similar Constructs in 11179 and GSIM

ISO/IEC 11179                  | GSIM
Object Class                   |
Value Domain                   | Value Domain
Enumerated Value Domain        | Enumerated Value Domain
Described Value Domain         | Described Value Domain
Conceptual Domain              | Conceptual Domain
Enumerated Conceptual Domain   | Enumerated Conceptual Domain
Described Conceptual Domain    | Described Conceptual Domain
Concept System                 | Concept System
Unit of Measure                | Unit of Measure
                               | Data Type


172.         Dimensionality is specified as well in 11179. It identifies those units of measure that are equivalent. For example, miles per hour, meters per second, and furlongs per fortnight all measure speed; and they are equivalent measures. Data measured in any one of those units can be converted without loss of information to any of the others. This is only lightly supported in GSIM.

173.         The notion of classifications is more explicitly defined in GSIM than in 11179. The following objects related to classifications are defined in GSIM and not in 11179: Category Set, Code List, Datum, Nodes and Node Sets.
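
The equivalence of units that share a dimensionality can be illustrated with a short worked conversion. The sketch below is not taken from 11179 or GSIM; the unit labels and the convert helper are assumptions. The factors rest on standard definitions: one international mile is 1609.344 metres, one furlong is one eighth of a mile (201.168 metres), and a fortnight is 14 days.

```python
# Length and time factors used to express each unit in metres per second.
METRES_PER_MILE = 1609.344
METRES_PER_FURLONG = 201.168          # one eighth of a mile
SECONDS_PER_HOUR = 3600
SECONDS_PER_FORTNIGHT = 14 * 24 * 3600

# All three units have the dimensionality of speed (length / time).
TO_METRES_PER_SECOND = {
    "mile/hour": METRES_PER_MILE / SECONDS_PER_HOUR,
    "metre/second": 1.0,
    "furlong/fortnight": METRES_PER_FURLONG / SECONDS_PER_FORTNIGHT,
}


def convert(value: float, source: str, target: str) -> float:
    """Convert a speed between two units via metres per second."""
    return value * TO_METRES_PER_SECOND[source] / TO_METRES_PER_SECOND[target]


speed_si = convert(60, "mile/hour", "metre/second")
speed_ff = convert(60, "mile/hour", "furlong/fortnight")
round_trip = convert(speed_ff, "furlong/fortnight", "mile/hour")

print(speed_si, speed_ff, round_trip)
```

Converting to another unit and back returns the original value up to floating-point rounding, which is the sense in which no information is lost.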


F. ISO 704


174.         Both GSIM and 11179 base their description of data on the principles laid out in ISO 704. However, GSIM does a more careful job of making sure these principles are followed precisely. In GSIM, Populations, Variables, and Categories (called a property in 704) are all laid out as roles for Concepts, and these have parallels to the principles defined in 704.


175.               In contrast to 704, GSIM explains more clearly the relationships between:
(a)       concepts (Populations in GSIM) and characteristics (Variables in GSIM) and
(b)       objects (not explicit in GSIM but the individual units from which measurements are taken) and properties (Categories in GSIM)




G. Neuchâtel Terminology for Classifications


176.               A statistical classification is often described as a tool that is used to handle and structure objects systematically into categories in the production of statistics [3]. Neuchâtel Terminology for Classifications [4] is one of the most used standards for classification management.


177.               The Neuchâtel terminology definition of classification:


"A classification version is a list of mutually exclusive categories representing the version-specific values of the classification variable. If the version is hierarchical, each level in the hierarchy is a set of mutually exclusive categories. A classification version has a certain normative status and is valid for a given period of time. A new version of a classification differs in essential ways from the previous version. Essential changes are changes that alter the borders between categories, that is, a statistical object/unit may belong to different categories in the new and the older version. Border changes may be caused by creating or deleting categories, or moving a part of a category to another. The addition of case law, changes in explanatory notes or in the titles do not lead to a new version."


178.               One important difference between GSIM and the Neuchâtel terminology for classifications is that GSIM separates meaning and representation. Table 4 below shows how GSIM maps to the Neuchâtel Terminology for Classifications:




Table 4. Mapping between Neuchâtel Terminology for Classifications and GSIM

Neuchâtel terminology




Classification family

Classification Family

Activity (Industry) classifications, Educational classifications

Group of Classifications




Group of Classifications Schemes

Classification version

Classification Version

NACE rev 2, ISIC rev 4, ISCO 08


Classification variant

Classification Variant

High-level SNA/ISIC aggregation A*10/11 grouping


Classification level


Section, division, group and class in ISIC rev 4


Classification item

Classification Item

0111 - Growing of cereals (except rice), leguminous crops and oil seeds


Correspondence table

Correspondence Table

ISIC rev 4 - NAICS


Classification index



List of aliases

Classification index entry




Item change




Case law




Classification index entry

Alias on Node



Correspondence item


0111 in ISIC - 111110 NAICS


Classification item - code

Attribute on Classification Item

0111 (in ISIC)

Not an information object in itself in GSIM

Classification item - title

Attribute on Classification Item

Growing of cereals (except rice), leguminous crops and oil seeds

Not an information object in itself in GSIM

Classification item - explanatory notes

Attribute on Classification Item

"This class includes:

- growing of temporary and permanent crops
- cereal grains: rice, hard and soft wheat, rye, barley, oats, maize, corn (except sweetcorn) etc.
- growing of potatoes, yams, sweet potatoes or cassava
- growing of sugar beet, sugar cane or grain sorghum
- growing of tobacco, including its preliminary processing: harvesting and drying of tobacco leaves
- growing of oilseeds or oleaginous fruit and nuts: peanuts, soya, colza etc.
- production of sugar beet seeds and forage plant seeds (including grasses)
- growing of hop cones, roots and tubers with a high starch or inulin content
- growing of cotton or other vegetal textile materials
- retting of plants bearing vegetable fibres (jute, flax, coir)
- growing of rubber trees, harvesting of latex
- growing of leguminous vegetables such as field peas and beans
- growing of plants used chiefly in pharmacy or for insecticidal, fungicidal or similar purposes
- growing of crops n.e.c.

This class excludes:

- growing of melons, see 0112
- growing of sweet corn, see 0112
- growing of other vegetables, see 0112
- growing of flowers, see 0112
- production of flower and vegetable seeds, see 0112
- growing of horticultural specialties, see 0112
- growing of olives, see 0113
- growing of beverage crops, see 0113
- growing of spice crops, see 0113
- growing of edible nuts, see 0113
- gathering of forest products and other wild growing material (cork, resins, balsam etc.), see 0200"



H.   Business Process Model and Notation (BPMN)


179.           BPMN provides a standard means to document business processes, including representing them graphically. GSIM does not try to duplicate the richness of modelling in BPMN. It simply aims to establish a high level connection.


180.         There are two main objects in GSIM that have a direct relationship with BPMN. These are Process Step Designs and Process Control (shown in Table 5).

Table 5. Similar constructs in BPMN and GSIM

BPMN            | GSIM
                | A high level Process Step
                | An intermediate level Process Step
                | A low level (atomic) Process Step
Sequence Flow   | A Process Control (in cases where the flow between process steps is invariable)
                | A Process Control (in cases where the flow between process steps is evaluated at the time of execution)


181.         The BPMN V2.0 specification explicitly notes that BPMN is not a 'data flow language'. BPMN can represent 'data objects' but does not explicitly model them in detail. GSIM does model these objects explicitly (Process Input Specifications, Process Inputs, Process Output Specifications and Process Outputs).


182.               The BPMN V2.0 specification also explicitly excludes:

      modelling of functional breakdowns (GSIM Business Functions)

      business rule models (GSIM Process Methods)


I. COmmon Reference Environment (CORE)


183.               The CORE model is a communication protocol for the exchange of information between a CORE service (a service designed with the help of CORE information objects) and its environment (an implementation of CORE on any specific platform). The CORE model knows of the existence of statistical information objects, but knows nothing else about them.


184.               In CORE, a ‘channel’ is a communication line between a service and its environment. A ‘channel’ is specialized in the transportation of specific objects by referring to their ‘kind definition’ (for example, Data set kind – constraining Data set definitions; Column kind – constraining Column definitions; Rule kind – constraining rules; etc.). There is a channel kind labeled ‘GSIM Object Description’, which will accept a GSIM object without understanding its contents, structure or meaning.


185.         Figure 26 shows the constructs which are similar in CORE and GSIM.


Figure 26 . CORE and GSIM


J.   The Open Group Architectural Framework (TOGAF)

186.         TOGAF is widely recognized and used within statistical organizations as well as many other organizations around the world.   Most other architectural frameworks are basically consistent with TOGAF although the terms and precise concepts used within other architectural frameworks may vary.


187.         Most information objects defined within GSIM (for example, Data Sets and Classifications ) would be considered ‘business objects’ within TOGAF.   Within TOGAF, such ‘business objects’ would typically be modeled as Data Entities within the Data Architecture.


188.         Three information objects within the Production Group, however, are included in the metamodel for Business Architecture within TOGAF: Business (Process), Business Function and Business Service. Within TOGAF, Business Functions and Business Services interact with (for example, produce and consume) ‘business objects’.


189.         Process Method is not directly referred to by TOGAF. Statistical organizations place particular emphasis on design and selection (and evaluation) of statistical methods (in the context of statistical methodology more generally) when producing official statistics. For many other industries, the method to be selected and used to perform a particular Business Function might not need to be separately identified (for example, it will not be subject to specific evaluation or reuse). In these cases, the concept of "method" could be subsumed in the definition of the Business Process in the TOGAF metamodel.


190.         GSIM does not model in as much detail as TOGAF the way that Organizational Units interact with Business Functions and Business Services .   A lot of the detail about how Organizational Units interact will be specific to a particular organization.   Nevertheless, Process Steps and Business Services need to have owners designated in GSIM.


191.         The TOGAF metamodel sets out a very flexible (rather than strictly hierarchical) relationship between Business Functions, business processes and Business Services. For example, the business process used to fulfill a particular Business Function (for example, GSBPM 6.2 Validate Outputs) might require another Business Function (for example, GSBPM 5.3 Review, validate and edit) to be performed. GSIM inherits this flexibility.


192.               This allows an individual to apply GSIM to describe the relationship between statistical information and statistical business processes for those aspects of the statistical production processes that are of interest to that person. They do not need to model the workflows required to deliver services they consume; they merely need to document (via a single Process Step) the inputs and outputs associated with their use of the service.





Annex C.     Glossary





Explanatory Text


Acquisition Activity


The set of executed processes and the actual resources required as inputs and produced as outputs to acquire data about a given Population for a particular reference period. It includes the process and resources required to acquire data in a Statistical Program consisting of gathering data via one or more Data Channels in order to create or feed one or more Data Resources .

This object holds Statistical Activity information that relates specifically to data collection or acquisition. It inherits the relationships and attributes from the Statistical Activity type.


Acquisition Design


The specification of the resources required and processes used and description of relevant methodological information for a set of activities to collect data about a given Population .

This object holds Statistical Program Design information that relates specifically to data collection or acquisition. It inherits the relationships and attributes from the Statistical Program Design type. Related to Acquisition Design is Acquisition Activity, which holds the detailed information about the conduct of the Acquisition Activity for a single reference period. The Acquisition Design describes the methodology and design elements that are intended to apply across all Acquisition Activities until such time as a decision is made to alter the design.


Administrative Details


A placeholder for extensions to the GSIM model.

GSIM does not seek to replicate or embed constructs from the administration of objects held in metadata registries, but includes this placeholder to allow for future extensions.


Analysis Population


A Population used for the analysis, processing, or dissemination of statistical data.

Population determined by parameters of an analysis

object class, analytical population

Analysis Unit


A Unit that is defined for the analysis, processing, or dissemination of statistical data.

Object corresponding to an Analysis Population

analytical unit, unit of analysis



Assessment


An activity to analyze quality or effectiveness and consider available options.

The Assessment is a generic class that groups different types of more specific assessments. An example of an Assessment is a SWOT assessment that identifies the Strengths, Weaknesses, Opportunities and Threats of a specified proposal. Another example is a Gap Analysis that formalizes the difference between the current situation and the state to be reached to meet certain requirements. An Assessment can use various objects as inputs, whether they are the main objects that the Assessment is about or auxiliary information objects that help the accomplishment of the assessment.


Attribute Component


  The role given to a Represented Variable in the context of a Data Structure . The role is to hold the pertinent information in addition to the identifiers and measures for a particular unit in a Data Set.  


For example the publication status of an observation (e.g. provisional, final, revised), or information specific to the use of an Identifier in the context of a Data Set.  


Business Case


A proposal for a body of work that will deliver outputs designed to achieve outcomes. A Business Case will provide the reasoning for initiating a new Statistical Program Design for a Statistical Program, as well as the details of the change proposed.

A Business Case is produced as a result of a detailed consideration of a Change Definition . It sets out a plan for how the change described by the Change Definition can be achieved. A Business Case usually comprises various evaluations, for example a SWOT assessment, or Gap Analyses for the different solutions that are considered for satisfying the Statistical Need . The Business Case will also specify the stakeholders that are impacted by the Statistical Need or by the different solutions that are required to implement it.


Business Function


Something an enterprise does, or needs to do, in order to achieve its objectives.

A Business Function delivers added value from a business point of view. It is delivered by bringing together people, processes and technology (resources), for a specific business purpose.


Business Functions answer in a generic sense "What business purpose does this Process Step Design serve?" Through identifying the Business Function associated with each Process Step Design it becomes easier for someone in future with an equivalent business need to identify Process Step Designs that they might reuse (in whole or in part).

A Business Function may be defined directly with descriptive text and/or through reference to an existing catalogue of Business Functions. The phases and sub processes defined within GSBPM can be used as an internationally agreed basis for cataloguing high level Business Functions. A catalogue might also include Business Functions defined at a lower level than "sub process". For example, "Identify and address outliers" might be catalogued as a lower level Business Function with the "Review, validate and edit" function (5.3) defined within GSBPM.


Business Service


A defined interface for accessing business capabilities (an ability that an organization possesses, typically expressed in general and high level terms and requiring a combination of organization, people, processes and technology to achieve).

A Business Service may provide one means of accessing a particular Business Function . Requesting a particular service through the defined interface may result in a business process (workflow) being executed.


The explicitly defined interface of a Business Service can be seen as representing a "service contract". If particular inputs are provided then the service will deliver particular outputs in compliance with specific parameters (for example, within a particular period of time).


In the case of GSIM, a Business Service typically implements a particular Process Method to perform a particular Business Function .      


Note: The interface of a Business Service is not necessarily IT based. For example, a typical postal service will have a number of service interfaces:                                            - Public letter box for posting letters                                                                                    - Counter at post office for interacting with postal workers




Category


A Concept whose role is to extensionally define and measure a characteristic.

Categories for the Concept of sex include: Male, Female                                   


Note: An extensional definition is a description of a Concept by enumerating all of its subordinate Concepts under one criterion of subdivision.


For example, the Noble Gases (in the periodic table) are extensionally defined by the set of elements including Helium, Neon, Argon, Krypton, Xenon and Radon. (ISO 1087-1)


Category Item


An element of a Category Set.

A type of Node


Category Set


A list of Categories.

A kind of Node Set for which the Categories have no assigned Designations.


For example:




Change Definition


A structured, well-defined specification for a proposed change.

A related object - the Statistical Need - is a change expression as it has been received by an organization. A Statistical Need is a raw expression of a proposed change, and is not necessarily well-defined. A Change Definition is created when a Statistical Need is analyzed by an organization, and expresses the raw need in well-defined, structured terms.


A Change Definition does not assess the feasibility of the change or propose solutions to deliver the change - this role is satisfied by the Business Case object. The precise structure or organization of a Change Definition can be further specified by rules or standards local to a given organization.      


Once a Statistical Need has been received, the first step is to do the conceptual work to establish what it is we are trying to measure. The final output of this conceptual work is the Change Definition.


The next step is to assess how we are going to make the measurements - to design a solution and put forward a proposal for a body of work that will deliver on the requirements of the original Statistical Need. The Change Definition is an input to this Process Step and the final Business Case is an output. Depending on the needs of individual agencies, a Change Definition may be created before or after a Business Case has been created, or even created to a basic extent before the Business Case development and further developed after a Business Case has been approved and a decision made to proceed with the change.


Channel Activity Specification


The description of the Data Channel made at run time.

This object is a specialization of a Data Channel and is used to describe the behaviour of a Data Channel at execution time.


Channel Design Specification


The description of the Data Channel made at design time.

This object is a specialization of a Data Channel and is used to design the characteristics of a Data Channel before it is used.




Classification


A set of related Classification Schemes. The Classification relates Classification Schemes which differ as versions or variants of each other.

For example, NAICS (North American Industry Classification System) is a Classification, but NAICS 2002 and NAICS 2007 are Classification Schemes, as they are different versions of NAICS.


Classification Family


A set of Classifications that are related from a certain point of view.

  The Classification Family includes Classifications devoted to describing the same subject matter, such as industries.


Classification Item


A Category at a certain Level within a Classification Scheme.



Classification Scheme


A structured list of mutually exclusive Categories. Such a structured list may be linear or hierarchically structured.

Classification Scheme has two subtypes - Classification Version and Classification Variant. In a hierarchical Classification Scheme, Categories are organized into Levels determined by the hierarchy. The Categories in each Level are mutually exclusive and exhaustive.


Classification Variant


A Classification Variant is based on a Classification Version. In a variant, the Categories of the Classification Version are split, aggregated or regrouped to provide additions or alternatives to the standard order and structure of the base version.



Classification Version


  A Classification Version is a list of mutually exclusive Categories representing the version-specific values of the classification variable.

  A Classification Version has a certain normative status and is valid for a given period of time.




Code


A Designation for a Category.

Codes are unique within their Code List. Example: M (Male) F (Female)


Code Item


An element of a Code List.

A type of Node.


Code List


A list of Categories where each Category has a predefined Code assigned to it. 

A kind of Node Set for which the Category contained in each Node has a Code assigned as a Designation.


For example:

1 - Male

2 - Female
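
As an illustration only, the relationship between a Code List, its Code Items and the underlying Categories could be sketched in Python as follows. The class and method names (CodeItem, CodeList, category_for, categories) are hypothetical and are not defined by GSIM; the sketch simply assumes that Codes must be unique within their Code List, as stated in the Code entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeItem:
    """A Node holding a Category together with the Code that designates it."""
    code: str
    category: str


class CodeList:
    """A Node Set in which every Category has a Code; Codes must be unique."""

    def __init__(self, items):
        self._by_code = {}
        for item in items:
            if item.code in self._by_code:
                raise ValueError(f"duplicate code: {item.code}")
            self._by_code[item.code] = item

    def category_for(self, code):
        """Look up the Category designated by a Code."""
        return self._by_code[code].category

    def categories(self):
        """Dropping the Codes leaves a plain list of Categories (a Category Set)."""
        return [item.category for item in self._by_code.values()]


sex_codes = CodeList([CodeItem("1", "Male"), CodeItem("2", "Female")])
print(sex_codes.category_for("2"))
```

Keying the items by Code makes the uniqueness rule explicit: a second item with an existing Code is rejected rather than silently replacing the first.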


Code Value


An alpha-numeric string used to represent a Code .

This is a kind of Sign used for Codes


Collection Description


The set of information that provides a textual description of the processes and methods used to undertake an Acquisition Activity. It provides a set of contextual and reference metadata about the acquisition process.





Concept


Unit of thought differentiated by characteristics.

ISO 1087-1 defines Concept as a "unit of knowledge created by a unique combination of characteristics". First, the term knowledge is poorly defined, and the word thought seems to capture the idea more cleanly. Second, different systems may try to capture the same thought but depend on different characteristics (i.e., attributes). For instance, typical demographic surveys care about age, sex, income, ethnicity, and education of persons. However, persons in a justice survey are either criminals or victims.


Concept System


Set of Concepts structured by the relations among them.

Here are 2 examples:

1) Concept of Sex: Male, Female, Other

2) ISIC (the list is too long to write down)


Conceptual Domain


Set of Categories, irrespective of any relations among them.

Here are 3 examples:

1) Sex categories (enumerated CD): male, female, other

2) Non-negative whole number (described CD)

3) Endowment categories (enumerated CD): $0-$99,999; $100,000-$999,999; $1,000,000 and above


Contact Details


A collection of modes and strings by which an Organization Item can be contacted.

Contact modes can include (but are not limited to) telephone, e-mail or fax. In these cases, the relevant strings would be the telephone number, e-mail address and fax number.


Context Key


Gives semantic or structural meaning to the value of a Contextual String.

Context Key has two subclasses - Type and Language. For example: Type = Short Name, or Language = French


Contextual String


A textual value, which is given context by one or more Context Keys.

A Contextual String can be given context by one or more Context Keys. For example: Type = Short Name, or Language = French


Control Transition


Governs how to determine the next Instrument Control based on factors such as the current location in the Instrument, the response to the previous questions, etc.



Correspondence Table


A tool for the linking of Classifications. A Correspondence Table systematically explains where, and to what extent, the Categories in one Classification Scheme may be found in different Classification Schemes of the same Classification or in Classification Schemes of different Classifications.

Given 2 Category Sets:

1) Marital Status A: Married, Single

2) Marital Status B: Married, Single, Widowed, Divorced

A Correspondence Table harmonizing the 2 Category Sets will contain Maps that link Categories from each set:

Married (A) -> Married (B)

Single (A) <- Single (B), Widowed (B), Divorced (B)

where the arrow points to the Category which is more generic.
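
The marital status example above can be pictured as a small data structure. The following Python sketch is only one possible reading: the names Map, correspondence_table, targets_of and source_of are hypothetical, and the sketch assumes that each Category in set B corresponds to exactly one Category in set A.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Map:
    """One link between a Category in set A and a Category in set B."""
    source: str  # Category in Marital Status A
    target: str  # Category in Marital Status B


correspondence_table = [
    Map("Married", "Married"),
    Map("Single", "Single"),
    Map("Single", "Widowed"),
    Map("Single", "Divorced"),
]


def targets_of(category_a):
    """Categories of set B covered by one (more generic) Category of set A."""
    return sorted(m.target for m in correspondence_table if m.source == category_a)


def source_of(category_b):
    """The single Category of set A that a Category of set B falls under."""
    matches = [m.source for m in correspondence_table if m.target == category_b]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one source for {category_b!r}")
    return matches[0]


print(targets_of("Single"))
print(source_of("Widowed"))
```

Storing one Map per pair keeps one-to-many correspondences explicit, so data coded in the finer set B can be re-coded to set A, while the reverse direction is visibly ambiguous for "Single".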


Data Channel


A means of exchanging data.

A Data Channel is an abstract object that describes the means for communicating with Data Resource(s). The Data Channel identifies the Instrument Implementation, Mode, and Data Resource that are to be used in a process. In some cases the Data Channel used by the Data Provider to send its responses could be different from the one used by the statistical office or organization to request information; for example, the statistical office may publish electronic forms that the Data Provider downloads and, once completed, returns by traditional mail. Two specialized objects are used to implement this abstract object: Channel Design Specification, used at design time, and Channel Activity Specification, used at run time.


Data Consumer


An organization that uses data or metadata as input for further processing.



Data Flow


The Data Flow represents both the availability of data over time and the availability of subsets of the possible data that could be made available according to a Data Structure.

There may be many data sets structured according to a Data Structure, perhaps made available at a pre-defined frequency (for example, monthly).


There can be many Data Flows that share the same Data Structure: for instance, data for National Accounts may be compartmentalized into a number of Data Flows for organizational purposes or for data discovery purposes (there can be different Data Flows for different subsets of National Accounts where each subset is structured by the same Data Structure).


Data Location


Identifies where a Data Set can be retrieved from.

This could be a Data Set structured in a known format and retrievable via a URL, or the URL of a service that can be queried to return such a Data Set. It could also be the location of a publication.


Data Point


A placeholder in a Data Set for an item of factual information obtained by measurement or created by a production process

Example for Unit Data: (1212123, 43) could be the age in years on the 1st of January 2012 of a person (Unit) with the social security number 1212123. The social security number is an identifying variable for the person whereas the age, in this example, is a variable measured on the 1st of January 2012.


Data Provider


An organization, association, group or person who delivers information for a Statistical Activity.

A Data Provider is an organization, association, group or person that possesses statistical information (that it has collected, produced, bought or otherwise acquired) and that is willing to supply those data and metadata to a statistical organization.

data supplier

Data Resource


An organized collection of stored information made of one or more Data Sets which may be sourced from multiple Acquisition or Statistical Activities.

Data Resources are collections of structured or unstructured information that are used by a statistical activity to produce information. This information object is a specialization of an Information Resource.

data source

Data Set


An organized collection of data.

Examples of Data Sets could be observation registers, time series, longitudinal data, survey data, rectangular data sets, event-history data, tables, data tables, cubes, registers, hypercubes, and matrices. A broader term for Data Set could be data. A narrower term for Data Set could be data element, data record, cell, or field.

database, data file, file, table

Data Structure


Defines the structure of an organized collection of data (Data Set).

The structure is described using Data Structure Components that can be either Attribute Components, Identifier Components or Measure Components. Examples for unit data include social security number, country of residence, age, citizenship, country of birth, where the social security number and the country of residence are both identifying components (Unit Identifier Component) and the others are measured variables obtained directly or indirectly from the person (Unit) and are Unit Measure Components.


Data Structure Component


The identification of the Represented Variable used in the context of a Data Structure.

A Data Structure Component can be an Attribute Component, Measure Component or an Identifier Component.


Example of Attribute Component: The publication status of an observation such as provisional, revised.


Example of Measure Component: age and height of a person in a Unit Data Set or number of citizens and number of households in a country in a Data Set for multiple countries (Dimensional Data Set).


Example of Identifier Component: The personal identification number of a Swedish citizen for unit data or the name of a country in the European Union for dimensional data.
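
To make the three roles concrete, the unit data example given under Data Structure can be written down as a list of components, each pairing a Represented Variable with its role. This is a minimal Python sketch under stated assumptions: the names Role, DataStructureComponent, person_structure and names_with_role are hypothetical, and "publication status" is added here only to show an Attribute Component alongside the others.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    IDENTIFIER = "identifier"
    MEASURE = "measure"
    ATTRIBUTE = "attribute"


@dataclass(frozen=True)
class DataStructureComponent:
    """A Represented Variable together with the role it plays in a Data Structure."""
    represented_variable: str
    role: Role


# Unit data structure for persons, following the example in the text.
person_structure = [
    DataStructureComponent("social security number", Role.IDENTIFIER),
    DataStructureComponent("country of residence", Role.IDENTIFIER),
    DataStructureComponent("age", Role.MEASURE),
    DataStructureComponent("citizenship", Role.MEASURE),
    DataStructureComponent("country of birth", Role.MEASURE),
    # Added for illustration only; not part of the example in the text.
    DataStructureComponent("publication status", Role.ATTRIBUTE),
]


def names_with_role(structure, role):
    """List the Represented Variables that play a given role."""
    return [c.represented_variable for c in structure if c.role is role]


print(names_with_role(person_structure, Role.IDENTIFIER))
```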


Data Type


The computational model for some data, characterized by axioms and operations, and containing a set of distinct values.

Here are 3 examples (with type families taken from ISO/IEC 11404)

1) State (nominal data): unordered, no arithmetic

2) Integer (interval data): ordered, subtraction, bounded below

3) Enumerated (ordinal data): ordered, no arithmetic




Datum


Association of a Unit with an element of a Value Domain.

A Datum is the actual instance of data that was collected. It is the value which populates a cell in a table.

Here are 2 examples:

1. <M, male> (for unit Dan Gillman with respect to sex of US persons)

2. <3, $1,000,000 and above> (for unit Johns Hopkins with respect to endowments for US universities)


Described Conceptual Domain


A Conceptual Domain, with each Concept defined by a Rule.

For example: All real numbers between 0 and 1 (where 'number' is a Concept, and 0 and 1 are possible designations.)

non-enumerated conceptual domain

Described Value Domain


A Value Domain, with each Designation defined by a Rule.

For example: All real decimal numbers between 0 and 1 (Where 'decimal number' is a Designation, such as the numeric string 0.5 for the number one half)

non-enumerated value domain

Design Context


Methodological metadata that provide the basis for the specification of the information objects required as input to and output from the Process Step Design including Process Method and Rules.





Designation


The name given to an object so it can be identified.

The association of a Concept with a Sign which denotes it.

term, code, appellation

Dimensional Attribute Component


A Represented Variable that is required to supply information in addition to the identification and measures of a Dimensional Data Set.

  Example: The publication status of an observation such as provisional, revised.


Dimensional Data Point


A placeholder or cell in a Dimensional Data Set determined by the crossing of (all) the values for the Identifier Components to contain the value (Datum) for an Instance Variable (defined by a Measure Component) with respect to a given Unit.

A Dimensional Data Point is uniquely identified by the combination of exactly one value for each of the dimensions (Dimensional Identifier Component) and one measure (Dimensional Measure Component).

There may be multiple values for the same Dimensional Data Point, that is, for the same combination of dimension values and the same measure. The different values represent different versions of the data in the Data Point. Values are only distinguished on the basis of quality, date/time of measurement or calculation, status, etc. This is handled through the mechanisms provided by the Datum information object.
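
One way to picture this addressing scheme is a lookup keyed by one value per dimension plus the measure, where each key can hold several versions of the value. The Python sketch below is illustrative only: the chosen dimensions (country, sex), the function names and all figures are invented, and GSIM does not prescribe any particular storage layout.

```python
from collections import defaultdict

# Each Dimensional Data Point is addressed by one value per identifier
# component plus the name of the measure. The identifier components chosen
# here (country, sex) and all figures are invented for illustration.
cube = defaultdict(list)


def record(country, sex, measure, value, status):
    """Append a new version of the value held at one Dimensional Data Point."""
    cube[(country, sex, measure)].append({"value": value, "status": status})


def versions(country, sex, measure):
    """Return every recorded version for one Dimensional Data Point."""
    return list(cube.get((country, sex, measure), []))


record("SE", "female", "average age", 42.1, "provisional")
record("SE", "female", "average age", 42.3, "revised")
record("SE", "male", "average age", 40.2, "provisional")

for version in versions("SE", "female", "average age"):
    print(version["status"], version["value"])
```

The key identifies the cell; the list behind it carries the versions, which differ only in descriptive properties such as status.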


Dimensional Data Set


A collection of aggregated data that conforms to a known structure.


hyper cube, macro data, n-cube, aggregated data, multi-dimensional data, dimensional data

Dimensional Data Structure


Defines the structure of a collection of aggregated data by Represented Variables (in their respective roles as Dimensional Measure Components, Dimensional Attribute Components or Dimensional Identifier Components) and their Value Domains.

This is similar to the SDMX Data Structure Definition: a set of structural metadata associated with a Data Set, which includes information about how Concepts are associated with the measures, dimensions, and attributes of a data cube, along with information about the representation of data and related descriptive metadata.

file description, data set description

Dimensional Identifier Component


A Represented Variable that is required to identify or classify each observation value in a Dimensional Data Set.

Example: The name of a country in the European Union, the type of dwelling, the gender of a person, age-category of person


Dimensional Measure Component


A Represented Variable that has been given a role in a collection of aggregated data to hold the summary values (means, mode, total, index, etc.) for a specific sub-population.

Examples: average age or total income in a sub-population


Dissemination Activity


The set of executed processes and the actual resources required as inputs and produced as outputs in the dissemination of data for a given Population for a particular reference period, or of metadata. It describes the process and resources required in the dissemination of data and metadata in a Statistical Program.

This object holds Statistical Activity information that relates specifically to data and metadata dissemination. It inherits the relationships and attributes from the Statistical Activity type. A special type of Dissemination Activity is Publication Activity.


Dissemination Design


The specification of the resources required and processes used and description of relevant methodological information for a set of activities to disseminate data about a given Population, or metadata.

This object holds Statistical Program Design information that relates specifically to dissemination. It inherits the relationships and attributes from the Statistical Program Design type.


Dissemination Service


The mechanism for delivering, and possibly creating, structured content dynamically in response to a consumer request and in accordance with defined parameters as provided by that consumer.

A Dissemination Service will deliver a Representation created by a process that it invokes. The inputs into the Dissemination Service determine and feed the process that is to be invoked.


A Dissemination Service retrieves the information to be structured and delivered through an Information Resource. As part of the service execution, the consumer may be given a chance to browse or search through the collection of information available from the Information Resource exposed by the Dissemination Service. Based on the results, the consumer can then refine the Output Specification as (further) input to the Dissemination Service to complete the process of creating and delivering the information required in the form of a Representation to the consumer.




Examples:

1. SDMX SOAP Data Web Services: The query XML message provides the Service with data selection and the specification of the preferred format (e.g. Generic format or Structured format, time series or cross-sectional). Based on this input the Service will retrieve a Data Set from the Data Resource and invoke a process that will format the data as an SDMX data message.

2. A manual service such as a response to a telephone request, where the person answering the call would, based on the caller's request, mail a PDF (which might either be a Product or dynamically created from another source).


Enumerated Conceptual Domain


A Conceptual Domain expressed as a list of Categories .

Example: The Sex categories of 'Male' and 'Female'.


Enumerated Value Domain


A Value Domain expressed as a list of Designations .

Example - Sex Codes <m, male>; <f, female>; <o, other>
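
The contrast between an Enumerated Value Domain and a Described Value Domain can be sketched as two ways of answering the same question: is this value a member of the domain? The Python below is a hedged illustration, not part of GSIM. The class and function names are hypothetical, and the rule reads "between 0 and 1" as including both endpoints, which is an assumption.

```python
from decimal import Decimal, InvalidOperation


class EnumeratedValueDomain:
    """A Value Domain expressed as an explicit list of Designations."""

    def __init__(self, designations):
        # Maps each Code (the Sign) to the Category it designates.
        self.designations = dict(designations)

    def contains(self, value):
        return value in self.designations


class DescribedValueDomain:
    """A Value Domain whose Designations are defined by a Rule."""

    def __init__(self, description, rule):
        self.description = description
        self.rule = rule

    def contains(self, value):
        return self.rule(value)


def is_decimal_between_0_and_1(text):
    # Assumption: "between 0 and 1" is read as inclusive of both endpoints.
    try:
        number = Decimal(text)
    except InvalidOperation:
        return False
    return number.is_finite() and Decimal(0) <= number <= Decimal(1)


sex_codes = EnumeratedValueDomain({"m": "male", "f": "female", "o": "other"})
unit_interval = DescribedValueDomain(
    "All real decimal numbers between 0 and 1", is_decimal_between_0_and_1
)

print(sex_codes.contains("f"), unit_interval.contains("0.5"))
```

The enumerated domain can list its members; the described domain can only test them, because the rule admits more values than could be written out.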


Environment Change


A requirement for change (a type of Statistical Need) that originates from a change in the operating environment of the statistical activity.

An Environment Change reflects variations in the context of execution of the Statistical Activity that create a need for a modification in the way that this activity is conducted. Environment Changes can be of different origins and also take different forms. They can result from a precise event (budget cut, new legislation enforced) or from a progressive process (technical or methodological progress, application or tool obsolescence). Other examples of Environment Changes include the availability of a new Data Resource, the opportunity for new collaboration between agencies, etc.


Environment Change objects may be structured in very diverse ways, but an object will usually group text material describing the type of change that has occurred and created the need for change. This allows the statistical organization to document precisely the (possibly multiple) changes in environment that have led to the Statistical Need.


Evaluation Assessment


A type of Assessment that evaluates the process outputs of a statistical activity based on a formalized methodological framework.

The evaluation can be done in regard to various characteristics of the output, for example its quality, the efficiency of the production process, its conformance to a set of requirements, etc. The result of an Evaluation Assessment can lead to the creation of a Statistical Need: in this case, the Statistical Need will reference the Evaluation Assessment for traceability and documentary purposes.


Frame Population


A Population represented by records in a frame, which is the observable part of a Target Population and provides a reasonable approximation to it.

Example: most recent population census frame

object class

Gap Analysis


An expression of the difference (the 'gap') between the current state and a desired future state.

A Gap Analysis is a type of Assessment that compares the actual state of the activity with a potential state that would correspond to the implementation of a change. An organization will list the factors that define its current state and what is needed to reach its target state. This will for example document a Business Case and help to take the decision to implement the change or not.

need assessment

Identifiable Artefact


An abstract class that comprises the basic attributes and associations needed for identification, naming and other documentation.



Identifier Component


The role given to a Represented Variable in the context of a Data Structure. The role is to identify the unit in an organized collection of data.

An Identifier Component is a sub-type of Data Structure Component. Example: the personal identification number of a Swedish citizen for unit data or the name of a country in the European Union for dimensional data.




Individual


A person who acts, or is designated to act, towards a specific purpose.



Information Request


An outline of a need for new data or metadata required for a particular purpose.

An Information Request is a special case of Statistical Need that comes in a more organized way, for example by specifying on which Subject Field the information is required, or what type of Concept is to be measured, or even the type of Units that are under consideration. The Information Request can for example be expressed internally, or by another statistical organization or authority.


Information Resource


An abstract notion that is any organized collection of information.

The only concrete subclass is Data Resource. The Information Resource allows the model to be extended to other types of resource.


Instance Interviewer Instruction


The use of an Interviewer Instruction in a particular Instrument.



Instance Question


The use of a Question in a particular Instrument.



Instance Question Block


The use of a Question Block in a particular Instrument.



Instance Statement


The use of a Statement in a particular Instrument.



Instance Variable


The use of a Represented Variable within a Data Set. It may include information about the source of the data.

The Instance Variable is used to describe actual instances of data that have been collected. Here are 3 examples:

1) Gender: Dan Gillman has gender <m, male>, Arofan Gregory has gender <m, male>, etc.

2) Number of employees: Microsoft has 90,000 employees; IBM has 433,000 employees, etc.

3) Endowment: Johns Hopkins has endowment of <3, $1,000,000 and above>, Yale has endowment of <3, $1,000,000 and above>, etc.




Instrument


A tool conceived to record the information that will be obtained from the Observation Units.

The Instrument describes the tool used to collect data. It could be a traditional survey, a set of requirements for a software collection program, a clinical procedure, etc.


Instrument is described from the perspective of the statistical organization collecting the data. It includes the special type of Instrument used for the explicit purpose of gathering data through a questionnaire (Survey Instrument). The behavior and characteristics of a concrete Instrument are determined by an Instrument Implementation. Several implementations can be based on the same Instrument, giving the possibility of using multiple channels and applying different collection techniques (Modes) to gather data.


An example of this is when a printed form used to collect information for a survey is replaced by a software program; in both cases the Instrument will collect the data from the Unit, but the behavior of the Instrument will differ according to its implementation.


Instrument Control


A record of the flow of an Instrument and its use of Questions, Interviewer Instructions and Statements.



Instrument Implementation


A concrete and usable tool for gathering information based on the rendering of the description made by an Instrument.

This represents an implementation of an Instrument. It describes the way in which an Instrument has been translated from a design to a concrete tool. It could represent a printed form, a software program made following a specific technological paradigm (web service, web scraping robot, etc.), the software used by a specialized device to collect data, etc. When it describes a Survey Instrument, it can contain descriptions of how each construct (e.g. Questions, Value Domains, validation Rules contained in the Instrument) is implemented.


Interviewer Instruction


Directions given to an interviewer to aid the completion of the Instrument.

  Example: “Show prompt card before reading question”




Language


The linguistic code used. This takes into account geographic variations, e.g. Canadian French or Australian English.





Level


Set of Concepts which are mutually exclusive and exhaustive.

For example, section, division, group and class in ISIC Rev. 4. A Level often is associated with a Concept, which defines it.


Logical Record


Describes a type of Unit Data Record for one Unit within a Unit Data Set.

A Logical Record describes the record using variables of which one or more can uniquely identify the record (Identifier Component). It represents characteristics of a real or artificially constructed Unit, which could be represented by a Concept. The relationships between Logical Records are given by Record Relationships.


Examples: household, person or dwelling record.


Maintenance Agency


The organization or expert body that maintains an artefact.





Map


An expression of the relation between a Category in a source Classification Scheme and a corresponding Category in the target Classification Scheme.

Given 2 Category Sets

1) Marital Status A

·            Married

·            Single


2) Marital Status B

·            Married

·            Single

·            Widowed

·            Divorced


The 2 Married Categories may be compared as follows

Married (A) -> Married (B)

where the arrow points to the Category which is more generic.


Measure Component


The role given to a Represented Variable in the context of a Data Structure. The role is to hold the observed/derived values for a particular Unit in an organized collection of data.

A Measure Component is a sub-type of Data Structure Component. For example, age and height of a person in a Unit Data Set or number of citizens and number of households in a country in a Data Set for multiple countries (Dimensional Data Set).




Mode


A set of characteristics that describe the technique (the "how") used for the data acquisition through a given Data Channel based on a specific Instrument Implementation.

While the Data Channel describes the means used for data acquisition, the Instrument describes the "what" (i.e. the content, for example, in terms of questions in a questionnaire or a list of agreed time series codes in a data exchange template) and an Instrument Implementation describes the tool used to apply the Instrument; the Mode describes "how" the Data Channel is going to be used. The Mode is relevant for all types of Data Channels, Instrument Implementations and Instruments and can change over time. The list of Modes will potentially grow in the future and vary from organization to organization.


Multiple Question Item


A construct that has all of the properties of a Question but additionally links to sub-questions.

A Multiple Question Item is a specific type of Question.




A combination of a Category and related attributes.

A Node is created as a Category, Code or Classification Item for the purpose of defining the situation in which the Category is being used.


Node Set


A set of Nodes.

Node Set is a kind of Concept System. Here are 2 examples:


1) Sex Categories

· Male

· Female

· Other


2) Sex Codes

· <m, male>

· <f, female>

· <o, other>
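
The two examples above can be sketched as plain collections. This is only an illustration with hypothetical variable names: the first set holds bare Categories, the second pairs each Category with a code value.

```python
# Example 1: a Node Set whose Nodes are plain Categories.
sex_categories = ["Male", "Female", "Other"]

# Example 2: a Node Set whose Nodes are Codes, i.e. <code value, category> pairs.
sex_codes = [("m", "male"), ("f", "female"), ("o", "other")]

# A lookup from code value to category, built from the pairs above.
code_lookup = dict(sex_codes)

print(code_lookup["f"])
```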


Non Structured Data Set


A Data Set whose structure is not described in a Data Structure.



Observation Unit


A Unit for which information can actually be obtained during data collection.

The sub-set of the Population of interest for which information can actually be obtained. For example, if the Population is the persons living in Ontario, the Observation Units might be persons currently residing in Ontario neither in an institution nor in a remote northern location nor temporarily out of the province.

collection unit, unit of observation, unit of collection

Organization Item


An abstract class which has two sub classes: Organization Unit and Individual.



Organization Item Role


The function or activities of an Organization Item in statistical processes such as collection, processing and dissemination.


organization role

Organization Scheme


A maintained collection of Organization Items.



Organization Unit


A unique framework of authority within which a person or persons act, or are designated to act, towards some purpose.



Output Specification


Contains the specifications for the dynamic creation and delivery of a Representation by a Dissemination Service.

An Output Specification is a specialization of Parameter Input. It is in fact a request for the dynamic creation and delivery of a Representation. It contains references to the information (e.g. a Data Set, a Data Structure, a Code List, a publication plan) desired with specifications concerning selections, (technical) form and/or method of delivery.


The references to the information come from the collection of information sources provided by the Information Resource that is exposed by the Dissemination Service. The consumer may select any (combination) of those information sources by including the references in the Output Specification.

Note that the Output Specification may be "soft" or "broad" in that it may identify groups of internal information objects rather than individual ones, for instance, all Data Sets within a certain (sub) category or theme. This may lead to multiple Representations being delivered.


As part of the Output Specification, the consumer may be given the option to select one of a number of possible formats for the Representation (e.g. SDMX, CSV, JSON or PDF) or to select one of a number of possible methods for delivery (web service response, email, FTP, mail delivery, etc.).

The Dissemination Service may be used to request future deliveries of Representations for information that is not yet available. This results in a subscription, where the specification of the Representations to be delivered in future is given in the Output Specification.
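
A minimal sketch of how such a request might be represented follows. It is not a GSIM-defined structure: the class, its field names, the default values and the reference string are all assumptions made for illustration; only the list of formats is taken from the text above.

```python
from dataclasses import dataclass
from typing import List

# Formats named in the text above; a real service may offer others.
ALLOWED_FORMATS = {"SDMX", "CSV", "JSON", "PDF"}


@dataclass
class OutputSpecification:
    # Hypothetical fields, not defined by GSIM.
    references: List[str]            # information sources requested
    output_format: str = "CSV"       # (technical) form of the Representation
    delivery_method: str = "web service response"
    subscription: bool = False       # True for future deliveries

    def __post_init__(self) -> None:
        if self.output_format not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {self.output_format}")


# A "broad" request: a whole (hypothetical) theme rather than one Data Set,
# delivered as JSON whenever new information becomes available.
spec = OutputSpecification(
    references=["theme:labour-market"],
    output_format="JSON",
    subscription=True,
)
print(spec)
```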


Parameter Input


Inputs used to specify which configuration should be used for a specific Process Step which has been designed to be configurable.

Parameter Inputs may be provided where Rules and/or Business Service interfaces associated with a particular Process Step have been designed to be configurable based on inputs passed in to the Process Step.




Population


The total membership of a defined class of people, objects or events.

Population has a number of subtypes. Here are 3 examples:

1. US adult persons

2. US computer companies

3. Universities in the US




Process


A nominated set of Process Step Designs, and associated Process Controls (flow), which have been highlighted for possible reuse.

In a particular statistical business process, some Process Steps may be unique to that business process while others may be applicable to other business processes. A Process can be seen as a reusable template. It is a means to accelerate design processes and to achieve sharing and reuse of design patterns which have proved effective. Reuse of process patterns can also lead to reuse of relevant Business Services and business Rules.

By deciding to reuse a Process, a designer is actually reusing the "pattern" of Process Step Designs and Process Controls associated with that Process. They will receive a new instance of the Process Step Designs and Process Controls. If they then tailor their "instance" of the Process Step Designs and Process Controls to better meet their needs, they will not change the definition of the reusable Process.
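
The reuse behaviour described above (tailoring an instance without altering the reusable definition) can be sketched with a deep copy. The class, its fields and the step names are hypothetical and serve only to illustrate the idea.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Process:
    # Hypothetical, simplified stand-in for the reusable pattern.
    name: str
    step_designs: List[str] = field(default_factory=list)
    controls: Dict[str, str] = field(default_factory=dict)  # step -> next step

    def instantiate(self) -> "Process":
        # Hand the designer an independent copy of the pattern.
        return copy.deepcopy(self)


template = Process(
    name="Edit and impute",
    step_designs=["validate", "impute"],
    controls={"validate": "impute"},
)

mine = template.instantiate()
mine.step_designs.append("review")      # tailor the instance
mine.controls["impute"] = "review"

print(template.step_designs)  # the reusable definition is unchanged
print(mine.step_designs)
```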


Process Control


A decision point which determines the flow between Process Steps.

The typical use of Process Control is to determine what happens next after a Process Step Design is executed. The possible paths, and the decision criteria, associated with a Process Control are specified as part of designing a production process. There is typically a very close relationship between the design of Process Steps and the design of Process Controls.


It is possible to define a Process Control where the next Process Step that will be executed is a fixed value rather than a "choice" between two or more possibilities. Where such a design would be appropriate, this feature allows, for example, initiation of a Process Step representing the GSBPM Process Phase (5) to always lead to initiation of GSBPM sub-process Integrate Data (5.1) as the next step.


This allows a process designer to divide a business process into logical steps (for example, where each step performs a specific Business Function) even if these Process Steps will always follow each other in the same order. In all cases, the Process Control defines and manages the flow between Process Steps, even where the flow is "trivial". Process Step Design is left to focus entirely on the design of the Process Step itself, not sequencing between steps.
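
The difference between a fixed ("trivial") flow and a real choice can be sketched as follows. The helper functions and the "validate", "finalize" and "edit" step names are assumptions for illustration; only the GSBPM phase 5 to sub-process 5.1 example comes from the text.

```python
from typing import Callable, Dict

# A control takes the result of the step just executed and names the next step.
Control = Callable[[dict], str]


def fixed(next_step: str) -> Control:
    # "Trivial" flow: the next step never depends on the result.
    return lambda result: next_step


def choice(test: Callable[[dict], bool], if_true: str, if_false: str) -> Control:
    # Decision point: pick one of two paths from the result.
    return lambda result: if_true if test(result) else if_false


controls: Dict[str, Control] = {
    "5 Process": fixed("5.1 Integrate data"),
    # Hypothetical step names, used only to show a branching control.
    "validate": choice(lambda r: r["errors"] == 0, "finalize", "edit"),
}

print(controls["5 Process"]({}))
print(controls["validate"]({"errors": 3}))
```

Keeping the flow in the controls table, rather than inside each step, mirrors the separation described above: the step logic does not need to know what runs next.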


Process Input


Any instance of an information object which is supplied to a process step at the time its execution is initiated.

Process Input has three subtypes: Process Support Input, Parameter Input and Transformable Input, to be able to identify the range of roles that the Process Inputs perform in the course of a Process Step. A Process Input may be provided to a Process Step to:

- "add value" to that input by producing an output which represents a "transformed" version of the input.

- control (for example, as a parameter) or influence the behavior of the Process Step.

- be used by the Process Step as either an input or a guide.


Note: The same instance of an information object may perform different roles in regard to different Process Steps.


Process Input Specification


A record of the types of inputs required for a Process Step Design.

The Process Input Specification enumerates the Process Inputs required at the time a Process Step Design is executed. For example, if five different Process Inputs are required at the time, the Process Input Specification will describe each of the five inputs. For each required Process Input the Process Input Specification will record:


1. the type of Process Input (Parameter Input, Process Support Input or Transformable Input); and

2. the type of information object (based on GSIM) which will be used as the Process Input (example types might be a Dimensional Data Set or a Classification).


The Process Input to be provided at the time of Process Step execution will then be a specific instance of the type of information object specified by the Process Input Specification. For example, if a Process Input Specification requires a Dimensional Data Set then the corresponding Process Input provided at the time of Process Step execution will be a particular Dimensional Data Set.
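
As a sketch of the idea, a specification can be held as a list of (role, type) requirements and checked against the instances supplied at execution time. All class and function names here are hypothetical, and the positional matching is a simplifying assumption rather than anything GSIM defines.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class InputRole(Enum):
    PARAMETER = "Parameter Input"
    PROCESS_SUPPORT = "Process Support Input"
    TRANSFORMABLE = "Transformable Input"


# Placeholder information object types (empty stand-ins for illustration).
class DimensionalDataSet:
    pass


class Classification:
    pass


@dataclass(frozen=True)
class InputRequirement:
    role: InputRole      # 1. the type of Process Input
    object_type: type    # 2. the type of information object


def satisfies(requirements: List[InputRequirement], inputs: list) -> bool:
    # Assumption: inputs are supplied in the same order as the requirements.
    return len(requirements) == len(inputs) and all(
        isinstance(obj, req.object_type)
        for req, obj in zip(requirements, inputs)
    )


specification = [
    InputRequirement(InputRole.TRANSFORMABLE, DimensionalDataSet),
    InputRequirement(InputRole.PROCESS_SUPPORT, Classification),
]

print(satisfies(specification, [DimensionalDataSet(), Classification()]))
```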


Process Method


A specification of the technique which will be used to perform the unit of work.

The technique specified by a Process Method is independent from any choice of technologies and/or other tools which will be used to apply that technique in a particular instance. The definition of the technique may, however, intrinsically require the application of specific Rules (for example, mathematical or logical formulas).


A Process Method describes a particular method for performing a Business Function. Similarly to the way in which Business Function documents the high level purpose of a process step ("what business purpose does this process step serve?"), Process Method documents the high level methodological "how" associated with the Process Step. Where a Process Step Design applies a method which is not specifically statistical in nature, however, this can still be recorded as the Process Method.


Process Metric


A Process Output whose purpose is to measure and report some aspect of how the Process Step performed during execution.

A Process Metric is a sub-type of Process Output which records information about the execution of a Process Step. For example, how long it took to complete execution of the Process Step and what percentage of records in the Transformable Input was updated by the Process Step to produce the Transformed Output.


One purpose for a Process Metric may be to provide a quality measure related to the Transformed Output. For example, a Process Step with the Business Function of imputing missing values is likely to result, as its Transformed Output, in a Data Set where values that were missing previously have been imputed. Statistical quality measures, captured as Process Metrics for that Process Step, may include a measure of how many records were imputed, and a measure of how much difference, statistically, the imputed values make to the dataset overall.

Another purpose for a Process Metric may be to measure an aspect of the Process Step which is not directly related to the Transformed Output it produced. For example, a Process Metric may record the time taken to complete the Process Step or other forms of resource utilization (for example, human and/or IT).


Often these two kinds of Process Metrics will be used in combination when seeking to, for example, monitor and tune a statistical business process so its statistical outputs achieve the highest level of quality possible based on the time, staff and/or IT resources that are available.
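
The two kinds of metric can be illustrated with a toy imputation step that returns its Transformed Output together with one quality-related and one resource-related measure. The function, the metric names and the constant-fill imputation are assumptions chosen for brevity, not a recommended method.

```python
import time
from typing import List, Optional, Tuple


def impute_missing(
    values: List[Optional[float]], fill: float
) -> Tuple[List[float], dict]:
    # Toy Process Step: replace missing values (None) with a constant.
    start = time.perf_counter()
    output = [fill if v is None else v for v in values]
    imputed = sum(v is None for v in values)
    metrics = {
        # Quality-related metric: share of records that were imputed.
        "percent_imputed": 100.0 * imputed / len(values) if values else 0.0,
        # Resource-related metric: time taken to run the step.
        "elapsed_seconds": time.perf_counter() - start,
    }
    return output, metrics


output, metrics = impute_missing([1.0, None, 3.0, None], fill=0.0)
print(output)
print(metrics["percent_imputed"])
```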


Process Output


Any instance of an information object which is produced by a Process Step as a result of its execution.

Process Outputs are subtyped.