Comments to GSIM – SDMX mapping
 

Version: 1.0

Date: 14 March 2013

Author: Vincenzo Del Vecchio – Bank of Italy

 

 

These comments concern the mapping between GSIM and SDMX as described in the document “GSIM-standards_mapping”, in the version available on 1 March 2013. The definitions of the GSIM artefacts are taken from the document “Generic Statistical Information Model (GSIM): Specification”, version 1.0, December 2012, p. 5-31. The definitions of the SDMX artefacts are taken from the SDMX Information Model, version 2.1, April 2011. The general approach for defining unit data using SDMX is described in the document “The SDMX 2.1 support to different data”, April 2011, presented at the SDMX 2011 Global Conference in Washington.

 

This review concerns the “structural” part of GSIM and SDMX. The summary table below, which covers the artefacts qualified as “structures”, shows the original mapping and the mapping resulting from the review (the suggested mapping).

The review of the “structural” artefacts is not yet complete. In the summary table of the review below, the artefacts not yet taken into consideration are highlighted in grey.

 

 

Summary Table of the review

 

Legend:

The character colour blue is for agreement

The character colour red is for disagreement or incomplete agreement

The grey background means "artifact not reviewed"

The symbol < means "the GSIM class is contained in the SDMX class …"

| | GSIM | SDMX: Original mapping | SDMX: Suggested mapping | SDMX Notes (from the original mapping) | SDMX Notes (from the review) |
|---|---|---|---|---|---|
| 116 | Attribute Component | | Data Attribute | | |
| 117 | Data Flow | Dataflow | < Dataflow | | SDMX Dataflows coincide with the union of (i) GSIM Data Flows that do not have Logical Records and (ii) the Logical Records of GSIM Data Flows that have Logical Records; the analysis should be completed with the Data Resource class |
| 118 | Data Location | Datasource | | | |
| 119 | Data Point | | | | |
| 120 | Data Resource | | | | |
| 121 | Data Set | Data Set | | | |
| 122 | Data Structure | | Data Structure Definition | SDMX: Does not maintain Variables as independent constructs: the Dimension, Measure, and Data Attribute are defined in a Data Structure and are not reusable (i.e. a Dimension in another Data Structure Definition that uses the same Concept and Representation must be defined explicitly and not by reference to another Dimension etc.). Each Dimension, Measure, and Data Attribute is defined in terms of the use of a Concept and its valid Representation (GSIM Value Domain) in terms of the data to be collected or disseminated. SDMX does allow a Concept to have a "default representation" (GSIM Value Domain) and so this is a direct mapping to the Represented Variable, which is a sub class of the GSIM Concept. | There is partial agreement with the original mapping (see also the description of the review below) and with the original note, which should be reformulated as follows: SDMX maintains Represented Variables as independent constructs, as GSIM does, but in SDMX the GSIM Represented Variables are named Concepts (this will be analyzed further in the next steps). |
| 123 | Data Structure Component | Component | Component | SDMX: see note for Process. SDMX also has a powerful Transformation and Expression model but with no syntactic implementation. | The original note does not seem to be related to this mapping |
| 124 | Dimensional Attribute Component | Data Attribute | < Attribute | | See the description of the review below |
| 125 | Dimensional Data Point | see Data Point | | | |
| 126 | Dimensional Data Set | Data Set | | | |
| 127 | Dimensional Data Structure | Data Structure Definition | < Data Structure Definition | | See the description of the review below |
| 128 | Dimensional Identifier Component | Dimension | < Dimension | | See the description of the review below |
| 129 | Dimensional Measure Component | Measure | < Measure | | See the description of the review below |
| 130 | Dissemination Service | | | | |
| 131 | Identifier Component | | Dimension | SDMX: see note 1 | See the description of the review below |
| 132 | Information Resource | | | | |
| 133 | Logical Record | | < Dataflow | | See the note on Dataflow and the description of the review below |
| 134 | Measure Component | | Measure | | See the description of the review below |
| 135 | Non Structured Data Set | | | SDMX: see note for Process Input | |
| 136 | Output Specification | | | SDMX: see note for Process Output | |
| 137 | Product | | | SDMX: see note for Contextual String | |
| 138 | Provision Agreement | | | | |
| 139 | Publication Activity | | | | |
| 140 | Record Relationship | | | | |
| 141 | Representation | | | | |
| 142 | Unit Attribute Component | | < Attribute | | See the description of the review below |
| 143 | Unit Data Point | | | | |
| 144 | Unit Data Record | | | | |
| 145 | Unit Data Set | | | | |
| 146 | Unit Data Structure | | < Data Structure Definition | | See the description of the review below |
| 147 | Unit Identifier Component | | < Dimension | | See the description of the review below |
| 148 | Unit Measure Component | | < Measure | | See the description of the review below |

Description of the review

 

The original mapping was reviewed by analyzing the correspondence between the GSIM and SDMX artefacts and the basic notions of mathematics and statistics. Although the analysis is conducted at a conceptual/abstract level, it is worth pointing out that the part concerning SDMX corresponds to concrete solutions applied in real use cases.

It is assumed that a statistical datum provides information on some groups of “statistical units” (e.g. groups of people, families, enterprises, banks, securities …) with reference to certain time values. The groups may contain as many elements as needed, even just one unit of a given population (one person, family, enterprise, bank, security …), so this definition of “statistical datum” is intended to be valid for dimensional data (also called aggregate data or macro-data, which typically refer to groups composed of many units), registers (which typically contain data about the single units that are registered), and unit data (data about single statistical units, for example collected through a questionnaire, also called “micro-data”).

 

A statistical datum is considered to be the law which associates, with each pair consisting of a group of units and a time value, the value(s) of the information we need, in other words the measure(s) we are interested in. It follows that a statistical datum may be considered a mathematical function, having as independent variables the ones that identify the groups of statistical units and as dependent variables the ones that express the measures.

 

As is well known, a mathematical function consists of an intension (the data definition, which includes the identification of the function and the specification of its structure) and an extension (the list of the observations, one for each group of statistical units and time). Obviously, the structure of each observation (the extension) must comply with the structure of the function (the intension). Note that the identification and the extension of a mathematical function are strictly associated (one to one): any function has just one structure, and the same structure may be common to many functions.
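The distinction between intension and extension can be illustrated with a minimal Python sketch. It is only an illustration of the reasoning above, not part of GSIM or SDMX: the class names `Structure` and `Datum`, the variable names and the observation values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Structure:
    """Structure of a function (hypothetical class, for illustration only)."""
    identifiers: tuple  # independent variables: identify the group of units and the time
    measures: tuple     # dependent variables: the measures of interest


@dataclass
class Datum:
    """A statistical datum seen as a mathematical function."""
    name: str             # identification of the function (part of the intension)
    structure: Structure  # structure of the function (part of the intension)
    extension: dict = field(default_factory=dict)  # observations: key -> measure values

    def add_observation(self, key: dict, values: dict) -> None:
        # Each observation must comply with the structure of the function.
        if set(key) != set(self.structure.identifiers):
            raise ValueError("key does not match the identifiers of the structure")
        if set(values) != set(self.structure.measures):
            raise ValueError("values do not match the measures of the structure")
        ordered_key = tuple(key[name] for name in self.structure.identifiers)
        # A function associates exactly one result with each argument.
        if ordered_key in self.extension:
            raise ValueError("an observation already exists for this key")
        self.extension[ordered_key] = values


# One structure shared by two distinct functions, each with its own extension.
shared = Structure(identifiers=("reference_area", "time"), measures=("population",))
residents = Datum("RESIDENT_POPULATION", shared)
households = Datum("HOUSEHOLD_POPULATION", shared)

# Made-up values, used only to show the mechanism.
residents.add_observation({"reference_area": "IT", "time": "2012"}, {"population": 1000})
```

Here `residents` and `households` are two different functions with the same structure, which is the situation in which keeping the structure in a separate class pays off.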

 

 

The Dataflow and the Data Resource

 

In the SDMX meta-model, the artefact that identifies the intension of a statistical datum (i.e. the identification of a mathematical function) is the Dataflow; each Dataflow therefore corresponds to a different mathematical function and has its own extension.

 

In GSIM, as a first approximation, the Data Flow seems to have the same role as the SDMX Dataflow: each GSIM Data Flow identifies a datum and has a structure, and so corresponds to a mathematical function. If this were the case, a GSIM Data Flow would match the SDMX Dataflow.

 

However, the correct mapping seems more complex, because in GSIM the Unit Data Structure differs from the Dimensional Data Structure in that it may have Logical Records (GSIM specs fig.19). The description of Logical Record is the following (GSIM specs p.29 points 99 and 100): “A Unit Data Set may contain data on more than one type of Unit, each represented by its own record type. Logical Records describe the structure of such record types …”. According to this definition, each Logical Record has its own data structure and therefore corresponds to a different mathematical function. As a consequence, each GSIM Logical Record would be defined in SDMX as a Dataflow.

 

Therefore, as far as statistical data are concerned, the SDMX Dataflow coincides with the union of (i) GSIM Data Flows which don’t have Logical Records and (ii) Logical Records of GSIM Data Flows having Logical Records.
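The union described above can be sketched in a few lines of Python. This is a hedged illustration of the suggested mapping only: the class `GsimDataFlow`, the function `sdmx_dataflows`, the example names and the dotted naming convention for Logical Records are assumptions made for the example and are not defined by GSIM or SDMX.

```python
from dataclasses import dataclass, field


@dataclass
class GsimDataFlow:
    """Hypothetical, simplified picture of a GSIM Data Flow."""
    name: str
    logical_records: list = field(default_factory=list)  # names of its Logical Records, if any


def sdmx_dataflows(gsim_flows):
    """Return the names of the SDMX Dataflows implied by the suggested mapping."""
    result = []
    for flow in gsim_flows:
        if flow.logical_records:
            # (ii) each Logical Record has its own structure, so it becomes a Dataflow
            result.extend(f"{flow.name}.{record}" for record in flow.logical_records)
        else:
            # (i) a Data Flow without Logical Records maps one to one
            result.append(flow.name)
    return result


flows = [
    GsimDataFlow("GDP_AGGREGATES"),
    GsimDataFlow("HOUSEHOLD_SURVEY", ["HOUSEHOLD", "PERSON"]),
]
print(sdmx_dataflows(flows))
# ['GDP_AGGREGATES', 'HOUSEHOLD_SURVEY.HOUSEHOLD', 'HOUSEHOLD_SURVEY.PERSON']
```

Note that a GSIM Data Flow with Logical Records does not itself appear in the result: only its Logical Records do, because only they correspond to mathematical functions with a single structure.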

 

Note that the analysis described so far still has some degree of uncertainty, deriving from the following elements:

 

First, the GSIM Data Resource is not yet considered. In particular, it needs to be understood whether GSIM makes it possible to use the Data Resource class for identifying mathematical functions. If the answer is “yes” (for example if the Data Flow is used to identify exchanged data and the Data Resource is used to identify stored data), the mapping should be adjusted accordingly.

 

Second, it is necessary to resolve some ambiguities about the Logical Record. According to the definition in the GSIM specs p.29 points 99 and 100, the Unit Data Structure would be composed of Logical Records and each Logical Record would have its own structure, whereas according to fig.19 it seems that the Unit Data Structure has a structure and the Logical Record does not (fig.18 also seems to be aligned with fig.19 and not with the description of Logical Record: it shows that a Data Set is structured by a Data Structure, and the Logical Record is not considered). In this analysis it was assumed that the definition in the GSIM specs p.29 points 99 and 100 is correct (and that fig.18 and fig.19 are not completely aligned with it); otherwise all the Logical Records of a Unit Data Structure would have the same structure, and the Logical Record class would no longer have a reason to exist.

 

 

Data Structure

 

Each mathematical function (representing statistical data) must have one and just one data structure. Both in GSIM and in SDMX the data structure is identified in a separate class, to allow different data to share the same structure.

 

The GSIM Data Structure has two sub-classes, namely the Unit Data Structure and the Dimensional Data Structure, whereas the SDMX Data Structure Definition has no sub-classes.

 

If we consider a given list of data including GSIM Unit Data and Dimensional Data, their structures would be defined separately in one of the two dedicated GSIM sub-classes, whereas in SDMX all of them would be defined in the same class (Data Structure Definition).

 

This implies that the SDMX Data Structure Definition class corresponds to the GSIM Data Structure class and therefore to the union of the GSIM Unit Data Structure class and the Dimensional Data Structure class.

 

The GSIM structures are defined through Data Structure Components, which are of two types corresponding to the two types of data structures, namely the Unit Data Structure Components and the Dimensional Data Structure Components. The SDMX structures are defined through Components having the same meaning as in GSIM but without distinguishing sub-types.

 

For the same reasons explained above, the SDMX Component class corresponds to the GSIM Data Structure Component class and to the union of the GSIM Unit Data Structure Component class and the Dimensional Data Structure Component class.

 

In GSIM, both types of structure, the Unit Data Structure and the Dimensional Data Structure, have Identifier Components, Measure Components and Attribute Components. The SDMX Data Structure Definition also includes equivalent components, which are named Dimensions, Measures and Attributes respectively, except that SDMX doesn’t distinguish them into sub-types. Therefore, for the same reasons explained above:

      The SDMX Dimension class corresponds to the union of the GSIM Unit Identifier Component class and the Dimensional Identifier Component class

      The SDMX Measure class corresponds to the union of the GSIM Unit Measure Component class and the Dimensional Measure Component class

      The SDMX Attribute class corresponds to the union of the GSIM Unit Attribute Component class and the Dimensional Attribute Component class
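The correspondences listed in this section can be written down as a simple lookup table. The following Python sketch is only a restatement of the suggested mapping under the class names used in this review; the names `SDMX_CLASS_FOR` and `gsim_classes_contained_in` are hypothetical helpers, not part of either standard.

```python
# Suggested mapping: each GSIM sub-class is contained in ("<") one SDMX class.
SDMX_CLASS_FOR = {
    "Unit Data Structure": "Data Structure Definition",
    "Dimensional Data Structure": "Data Structure Definition",
    "Unit Identifier Component": "Dimension",
    "Dimensional Identifier Component": "Dimension",
    "Unit Measure Component": "Measure",
    "Dimensional Measure Component": "Measure",
    "Unit Attribute Component": "Attribute",
    "Dimensional Attribute Component": "Attribute",
}


def gsim_classes_contained_in(sdmx_class):
    """Return the GSIM classes whose union corresponds to the given SDMX class."""
    return sorted(g for g, s in SDMX_CLASS_FOR.items() if s == sdmx_class)


print(gsim_classes_contained_in("Dimension"))
# ['Dimensional Identifier Component', 'Unit Identifier Component']
```

The mapping is many to one in the GSIM to SDMX direction, which is why the summary table uses the symbol < rather than an equivalence for these rows.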

 

 

Considerations on the GSIM and SDMX basic structures

 

Comparing the GSIM and the SDMX approach in modeling data and data structures, it is possible to make some additional considerations. Even if the mapping is good, the two models are not equivalent from some points of view.

 

The main difference lies in the fact that GSIM differentiates Unit and Dimensional data in terms of their structure, whereas SDMX doesn’t differentiate them.

 

The GSIM need to differentiate derives from the fact that Unit Data are allowed to have many Logical Records (one for each different structure) whereas Dimensional Data are not; in fact, all the classes connected to the Data Structure are equivalent in GSIM and SDMX except the Logical Record class (GSIM specs fig.19).

 

In SDMX there is no need to differentiate Unit and Dimensional structures, because of different implicit assumptions:

o         a datum should be always considered equivalent to a mathematical function

o         a mathematical function has just one structure (cannot have more structures)

o         different structures should be described through different mathematical functions (i.e. through different data)

 

This means that GSIM Logical Records, if modeled through SDMX, would correspond to different SDMX Dataflows, as already mentioned.

 

The additional GSIM purpose (in introducing the Logical Record class and in differentiating Unit and Dimensional data) seems to lie in the need to specify that several logical records belong to the same IT object (the traditional dataset having several record types).

 

Even if the purpose is valid, the GSIM solution has some drawbacks:

 

o         The need to map mathematical functions to the relevant IT containers is not a valid reason for differentiating Unit and Dimensional data; in fact, even Dimensional Data having different structures may be stored in the same IT container (meaning any possible IT artifact containing data, e.g. traditional dataset, relational table, XML file, spreadsheet and so on). Therefore, if logical records are allowed for Unit data, they should also be allowed for Dimensional data. Conversely, if logical records are not allowed for Dimensional Data, they shouldn’t be allowed for Unit data either;

o         It is obviously possible that the same IT object contains the extensions of many statistical data having different structures (meaning mathematical functions); however, this is true for any kind of IT container, so not only for traditional datasets having different data types but also for DBMS tables, XML files, spreadsheets and so on (this seems not to be considered in the GSIM solution); moreover, to consider real cases in all their variety, there are cases in which the extension of a mathematical function is contained in more than one IT container (the GSIM solution can’t model this case in a uniform way);

o         The “statistical” perspective (statistical artifacts corresponding to mathematical objects) should be clearly distinguished from the “IT” perspective (the IT artifacts and objects), also because they are aimed at different people (the former at users/statisticians, the latter at IT experts); therefore a separate part of the model should be devoted to the mapping between the former (e.g. Data Flows) and the latter (IT data containers), whereas in the case of the Logical Record class GSIM mixes up the two perspectives;

o         The resulting GSIM model is unnecessarily complex and this greater complexity is counter-productive; in fact, according to the previous points, it would have been possible to unify the representation of Unit and Dimensional data by assuming that each logical record is a distinct Data Flow, corresponding to a mathematical function and having its own data structure; this way, the Logical Record class would no longer have been needed, the differences between Unit and Dimensional data would have disappeared and the resulting model would have been simpler and more powerful at the same time:

      Simpler because there would be no need for different descriptions (and consequently different behaviors) for Unit and Dimensional data, which inevitably cause more difficulties for the users, more barriers to the integration of different types of data, more complex software and higher costs;

      More powerful both because the joint use of Unit and Dimensional data would have been facilitated and because the mapping between the mathematical structures and the IT structures could have been solved not only for Unit data but for Dimensional data too.

o         By adopting the current modeling choice, GSIM implicitly gives a negative message to the possible users of the model: it suggests that Unit and Dimensional data should be differentiated (this is also explicitly said in points 91 to 94 of the GSIM specs).
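The separation between the “statistical” and the “IT” perspective argued for in the list above can be illustrated with a small Python sketch, in which the link between Data Flows and IT containers is kept in a separate many-to-many mapping. This is only one possible way to picture the idea: the mapping `STORED_IN`, the helper functions and all the Data Flow and file names are hypothetical and do not come from GSIM or SDMX.

```python
# Hypothetical storage map, kept apart from the statistical model:
# each pair says that (part of) the extension of a Data Flow is stored in a container.
STORED_IN = [
    ("HOUSEHOLD", "survey_2012.dat"),      # two Data Flows with different structures ...
    ("PERSON", "survey_2012.dat"),         # ... stored in the same container
    ("GDP_AGGREGATES", "gdp_part1.xml"),   # one Data Flow whose extension ...
    ("GDP_AGGREGATES", "gdp_part2.xml"),   # ... is spread over two containers
]


def flows_in(container):
    """Data Flows whose extension is (partly) stored in the given container."""
    return sorted({flow for flow, c in STORED_IN if c == container})


def containers_of(flow):
    """Containers holding (part of) the extension of the given Data Flow."""
    return sorted({c for f, c in STORED_IN if f == flow})


print(flows_in("survey_2012.dat"))      # ['HOUSEHOLD', 'PERSON']
print(containers_of("GDP_AGGREGATES"))  # ['gdp_part1.xml', 'gdp_part2.xml']
```

With such a mapping the same treatment applies to Unit and Dimensional data and to any kind of container, and the case of one extension spread over several containers is handled in the same way as the case of several extensions in one container.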