Comments to GSIM – SDMX mapping
Version: 1.0
Date: 14 March 2013
Author: Vincenzo Del Vecchio – Bank of Italy
These comments are relevant to the mapping between GSIM and SDMX as described in the document “GSIMstandards_mapping”, in the version available on 1 March 2013. The definitions of the GSIM artefacts are derived from the document “Generic Statistical Information Model (GSIM): Specification”, version 1.0, December 2012, p. 531. The definitions of the SDMX artefacts are derived from the SDMX Information Model version 2.1, April 2011. The general approach for defining unit data using SDMX is described in the document “The SDMX 2.1 support to different data”, April 2011, presented at the SDMX 2011 Global Conference in Washington.
This review is relevant to the “structural” part of GSIM and SDMX. The summary table below, which covers the artefacts qualified as “structures”, shows the original mapping and the mapping resulting from the review (the suggested mapping).
The review of the “structural” artefacts is not yet complete. In the summary table of the review below, the artefacts not yet taken into consideration are highlighted in grey.
Summary Table of the review

Legend:
The character colour blue is for agreement
The character colour red is for disagreement or not complete agreement
The grey background means "artefact not reviewed"
The symbol < means "the GSIM class is contained in the SDMX class …"

| No. | GSIM | SDMX (original mapping) | SDMX (suggested mapping) | SDMX notes (from the original mapping) | SDMX notes (from the review) |
| --- | --- | --- | --- | --- | --- |
| 116 | Attribute Component | | Data Attribute | | |
| 117 | Data Flow | Dataflow | < Dataflow | | SDMX Dataflows coincide with the set union of (i) GSIM Data Flows that do not have Logical Records and (ii) the Logical Records of GSIM Data Flows that have Logical Records; the analysis should be completed with the Data Resource class |
| 118 | Data Location | Datasource | | | |
| 119 | Data Point | | | | |
| 120 | Data Resource | | | | |
| 121 | Data Set | Data Set | | | |
| 122 | Data Structure | | Data Structure Definition | SDMX: Does not maintain Variables as independent constructs: the Dimension, Measure, and Data Attribute are defined in a Data Structure and are not reusable (i.e. a Dimension in another Data Structure Definition that uses the same Concept and Representation must be defined explicitly and not by reference to another Dimension etc.). Each Dimension, Measure, and Data Attribute is defined in terms of the use of a Concept and its valid Representation (GSIM Value Domain) in terms of the data to be collected or disseminated. | There is partial agreement with the original mapping (see also the description of the review below) and with the original note, which should be reformulated as follows: SDMX maintains Represented Variables as independent constructs, as GSIM does, but in SDMX the GSIM Represented Variables are named Concepts (this will be better analyzed in the next steps). |
| 123 | Data Structure Component | Component | Component | SDMX: see note for Process. SDMX also has a powerful Transformation and Expression model but with no syntactic implementation. | The original note does not seem related to this mapping |
| 124 | Dimensional Attribute Component | Data Attribute | < Attribute | | See the description of the review below |
| 125 | Dimensional Data Point | see Data Point | | | |
| 126 | Dimensional Data Set | Data Set | | | |
| 127 | Dimensional Data Structure | Data Structure Definition | < Data Structure Definition | | See the description of the review below |
| 128 | Dimensional Identifier Component | Dimension | < Dimension | | See the description of the review below |
| 129 | Dimensional Measure Component | Measure | < Measure | | See the description of the review below |
| 130 | Dissemination Service | | | | |
| 131 | Identifier Component | | Dimension | SDMX: see note 1 | See the description of the review below |
| 132 | Information Resource | | | | |
| 133 | Logical Record | | < Dataflow | | See the note on Dataflow and the description of the review below |
| 134 | Measure Component | | Measure | | See the description of the review below |
| 135 | Non Structured Data Set | | | SDMX: see note for Process Input | |
| 136 | Output Specification | | | SDMX: see note for Process Output | |
| 137 | Product | | | SDMX: see note for Contextual String | |
| 138 | Provision Agreement | | | | |
| 139 | Publication Activity | | | | |
| 140 | Record Relationship | | | | |
| 141 | Representation | | | | |
| 142 | Unit Attribute Component | | < Attribute | | See the description of the review below |
| 143 | Unit Data Point | | | | |
| 144 | Unit Data Record | | | | |
| 145 | Unit Data Set | | | | |
| 146 | Unit Data Structure | | < Data Structure Definition | | See the description of the review below |
| 147 | Unit Identifier Component | | < Dimension | | See the description of the review below |
| 148 | Unit Measure Component | | < Measure | | See the description of the review below |

Description of the review
The original mapping was reviewed by analyzing the correspondence between the GSIM and SDMX artefacts and the basic notions of mathematics and statistics. Although the analysis is done at a conceptual/abstract level, it is worthwhile to point out that the part relevant to SDMX corresponds to concrete solutions applied in real use cases.
It is assumed that a statistical datum provides information on some groups of “statistical units” (e.g. groups of people, families, enterprises, banks, securities …) with reference to certain time values. The groups may contain as many elements as needed, even just one unit of a given population (one person, family, enterprise, bank, security …), so this definition of “statistical datum” is intended to be valid for dimensional data (also called aggregate data or macrodata, which typically refer to groups composed of many units), registers (which typically contain data relevant to the single units that are registered), and unit data (data relevant to single statistical units, for example collected through a questionnaire, also called “micro data”).
A statistical datum is considered to be the rule that, for each pair constituted by a group of units and a time value, associates the value(s) of the information we need, in other words the measure(s) we are interested in. It follows that a statistical datum may be considered as a mathematical function, having as independent variables the ones that identify the groups of statistical units and as dependent variables the ones that express the measures.
As is well known, a mathematical function is made of an intension (the data definition, which includes the identification of the function and the specification of its structure) and an extension (the list of the observations, one for each group of statistical units and time). Obviously, the structure of each observation (the extension) must comply with the structure of the function (the intension). Note that the identification and the extension of a mathematical function are strictly associated (one to one); any function has just one structure, and the same structure may be common to many functions.
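The notion of a statistical datum as a mathematical function can be illustrated with a minimal Python sketch. It is not taken from GSIM or SDMX: the class names Structure and StatisticalDatum, the variable names and the sample values are all hypothetical and are used only to show how the intension (identification plus structure) constrains the extension (the observations).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Structure:
    """Structure of a function; it may be shared by many functions."""
    identifiers: tuple[str, ...]  # independent variables (group of units, time)
    measures: tuple[str, ...]     # dependent variables


@dataclass
class StatisticalDatum:
    """A statistical datum seen as a mathematical function (hypothetical class)."""
    name: str              # identification of the function
    structure: Structure   # exactly one structure per function
    extension: dict = field(default_factory=dict)  # key tuple -> measure tuple

    def add_observation(self, key: dict, values: dict) -> None:
        # Each observation must comply with the structure of the function.
        if set(key) != set(self.structure.identifiers):
            raise ValueError("key does not match the identifiers")
        if set(values) != set(self.structure.measures):
            raise ValueError("values do not match the measures")
        k = tuple(key[i] for i in self.structure.identifiers)
        self.extension[k] = tuple(values[m] for m in self.structure.measures)

    def __call__(self, **key):
        # Apply the function: (group of units, time) -> measure value(s).
        k = tuple(key[i] for i in self.structure.identifiers)
        return self.extension[k]


# Illustrative values only.
pop_structure = Structure(identifiers=("region", "time"), measures=("population",))
population = StatisticalDatum("POPULATION", pop_structure)
population.add_observation({"region": "R1", "time": "2012"}, {"population": 100})
print(population(region="R1", time="2012"))
```

The same Structure instance could be passed to a second StatisticalDatum, which reflects the point that one structure may be common to many functions while each function has just one structure.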
The Dataflow and the Data Resource
In the SDMX metamodel, the artefact that identifies the intension of a statistical datum (i.e. the identification of a mathematical function) is the Dataflow; therefore each Dataflow corresponds to a different mathematical function and has its own extension.
In GSIM, as a first approximation, the Data Flow seems to have the same role as the SDMX Dataflow: each GSIM Data Flow identifies a datum and has a structure, so it corresponds to a mathematical function. If that were so, a GSIM Data Flow would match the SDMX Dataflow.
However, the correct mapping seems more complex, because in GSIM the Unit Data Structure differs from the Dimensional Data Structure in that it may have Logical Records (GSIM specs fig.19). The description of Logical Record is the following (GSIM specs p.29 points 99 and 100): “A Unit Data Set may contain data on more than one type of Unit, each represented by its own record type. Logical Records describe the structure of such record types …”. According to this definition, each Logical Record has its own data structure and therefore corresponds to a different mathematical function. As a consequence, each GSIM Logical Record would be defined in SDMX as a Dataflow.
Therefore, as far as statistical data are concerned, the SDMX Dataflow coincides with the union of (i) GSIM Data Flows which do not have Logical Records and (ii) the Logical Records of GSIM Data Flows which have Logical Records.
Note that the analysis described so far still has some degree of uncertainty, deriving from the following elements:
First, the GSIM Data Resource has not yet been considered. In particular, it needs to be understood whether or not GSIM makes it possible to use the Data Resource class for identifying mathematical functions. If the answer is “yes” (for example, if the Data Flow is used to identify exchanged data and the Data Resource is used to identify stored data), the mapping should be adjusted accordingly.
Second, some ambiguities about the Logical Record need to be resolved. According to the definition in the GSIM specs (p.29 points 99 and 100), the Unit Data Structure would be composed of Logical Records and each Logical Record would have its own structure, whereas according to fig.19 the Unit Data Structure has a structure and the Logical Record has none (fig.18 also seems aligned with fig.19 rather than with the description of Logical Record, since it shows that a Data Set is structured by a Data Structure and does not consider the Logical Record). This analysis assumes that the definition in the GSIM specs (p.29 points 99 and 100) holds and that fig.18 and fig.19 are not completely aligned with it; otherwise all the Logical Records of a Unit Data Structure would have the same structure, and the Logical Record class would have no reason to exist.
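The union described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class GsimDataFlow, the function sdmx_dataflows, the sample names and the "FLOW.RECORD" naming convention are hypothetical and do not come from either standard.

```python
from dataclasses import dataclass, field


@dataclass
class GsimDataFlow:
    """Hypothetical, simplified view of a GSIM Data Flow."""
    name: str
    logical_records: list[str] = field(default_factory=list)


def sdmx_dataflows(gsim_flows):
    """Return the SDMX Dataflow identifiers implied by the suggested mapping."""
    result = []
    for flow in gsim_flows:
        if flow.logical_records:
            # (ii) each Logical Record becomes a Dataflow of its own
            result.extend(f"{flow.name}.{rec}" for rec in flow.logical_records)
        else:
            # (i) a Data Flow without Logical Records maps one to one
            result.append(flow.name)
    return result


flows = [GsimDataFlow("GDP"), GsimDataFlow("CENSUS", ["HOUSEHOLD", "PERSON"])]
print(sdmx_dataflows(flows))  # ['GDP', 'CENSUS.HOUSEHOLD', 'CENSUS.PERSON']
```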
Data Structure
Each mathematical function (representing statistical data) must have one and only one data structure. In both GSIM and SDMX the data structure is identified in a separate class, to allow different data to share the same structure.
The GSIM Data Structure has two subclasses, namely the Unit Data Structure and the Dimensional Data Structure, whereas the SDMX Data Structure Definition has no subclasses.
If we consider a given list of data including GSIM Unit Data and Dimensional Data, their structures would be defined separately in one of the two dedicated GSIM subclasses, whereas in SDMX all of them would be defined in the same class (Data Structure Definition).
This implies that the SDMX Data Structure Definition class corresponds to the GSIM Data Structure class and therefore corresponds to the union of the GSIM Unit Data Structure class and Dimensional Data Structure class.
The GSIM structures are defined through Data Structure Components, which are of two types corresponding to the two types of data structures, namely the Unit Data Structure Components and the Dimensional Data Structure Components. The SDMX structures are defined through Components having the same meaning as in GSIM but without distinguishing subtypes.
For the same reasons explained above, the SDMX Component class corresponds to the GSIM Data Structure Component class and to the union of the GSIM Unit Data Structure Component class and Dimensional Data Structure Component class.
In GSIM, both types of structure, the Unit Data Structure and the Dimensional Data Structure, have Identifier Components, Measure Components and Attribute Components. The SDMX Data Structure Definition also includes equivalent components, which are named Dimensions, Measures and Attributes respectively; the only difference is that SDMX does not distinguish them into subtypes. Therefore, for the same reasons explained above:
The SDMX Dimension class corresponds to the union of the GSIM Unit Identifier Component class and Dimensional Identifier Component class
The SDMX Measure class corresponds to the union of the GSIM Unit Measure Component class and Dimensional Measure Component class
The SDMX Attribute class corresponds to the union of the GSIM Unit Attribute Component class and Dimensional Attribute Component class
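The correspondences listed in this section can be summarised as a small lookup table. The Python sketch below is only a restatement of the suggested mapping; the names SDMX_UNION_OF and sdmx_class_for are hypothetical. Looking up a GSIM subclass returns the SDMX class that contains it, which is the meaning of the symbol < in the summary table.

```python
# Suggested mapping: each SDMX class is the union of two GSIM subclasses.
SDMX_UNION_OF = {
    "Data Structure Definition": {
        "Unit Data Structure",
        "Dimensional Data Structure",
    },
    "Component": {
        "Unit Data Structure Component",
        "Dimensional Data Structure Component",
    },
    "Dimension": {
        "Unit Identifier Component",
        "Dimensional Identifier Component",
    },
    "Measure": {
        "Unit Measure Component",
        "Dimensional Measure Component",
    },
    "Attribute": {
        "Unit Attribute Component",
        "Dimensional Attribute Component",
    },
}


def sdmx_class_for(gsim_class: str) -> str:
    """Return the SDMX class that contains the given GSIM subclass."""
    for sdmx_class, gsim_classes in SDMX_UNION_OF.items():
        if gsim_class in gsim_classes:
            return sdmx_class
    raise KeyError(gsim_class)


print(sdmx_class_for("Unit Identifier Component"))  # prints "Dimension"
```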
Considerations on the GSIM and SDMX basic structures
Comparing the GSIM and the SDMX approaches to modeling data and data structures, it is possible to make some additional considerations. Even if the mapping is good, the two models are not equivalent from some points of view.
The main difference lies in the fact that GSIM differentiates Unit and Dimensional data in terms of their structure, whereas SDMX doesn’t differentiate them.
GSIM needs this differentiation because Unit Data are allowed to have many Logical Records (one for each different structure) whereas Dimensional Data are not; indeed, all the classes connected to the Data Structure are equivalent in GSIM and SDMX except the Logical Record class (GSIM specs fig.19).
In SDMX there is no need to differentiate Unit and Dimensional structures, because of the following implicit assumptions:
o a datum should always be considered equivalent to a mathematical function
o a mathematical function has just one structure (cannot have more structures)
o different structures should be described through different mathematical functions (i.e. through different data)
This means that GSIM Logical Records, if modeled through SDMX, would correspond to different SDMX Dataflows, as already mentioned.
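The consequence of these assumptions can be shown with a short Python sketch, in which a traditional dataset with several record types is split into one extension per record type, each with a single structure. The function names, the field name record_type and the sample records are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def split_by_record_type(records, type_field="record_type"):
    """Group the records of a mixed container into one extension per record type."""
    flows = defaultdict(list)
    for rec in records:
        payload = {k: v for k, v in rec.items() if k != type_field}
        flows[rec[type_field]].append(payload)
    return dict(flows)


def has_single_structure(extension):
    """True when every observation exposes the same set of variables."""
    return len({frozenset(obs) for obs in extension}) <= 1


# Illustrative records only: one container holding two record types.
mixed = [
    {"record_type": "HOUSEHOLD", "household_id": 1, "size": 2},
    {"record_type": "PERSON", "person_id": 10, "household_id": 1, "age": 40},
    {"record_type": "PERSON", "person_id": 11, "household_id": 1, "age": 38},
]

dataflows = split_by_record_type(mixed)
print(has_single_structure(mixed))  # False: the container mixes two structures
print({name: has_single_structure(ext) for name, ext in dataflows.items()})
```

Under these assumptions the mixed container is not a single function, whereas each group obtained by the split is, and would therefore be described by its own Dataflow.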
The additional purpose of GSIM (in introducing the Logical Record class and in differentiating Unit and Dimensional data) seems to lie in the need to specify that several logical records belong to the same IT object (the traditional dataset having several record types).
Even if the purpose is valid, the GSIM solution has some drawbacks:
o The need to map mathematical functions to the relevant IT containers is not a valid reason for differentiating Unit and Dimensional data; in fact, even Dimensional Data having different structures may be stored in the same IT container (meaning any possible IT artifact containing data, e.g. traditional dataset, relational table, XML file, spreadsheet and so on). Therefore, if logical records are allowed for Unit data, they should also be allowed for Dimensional data. Conversely, if logical records are not allowed for Dimensional data, they should not be allowed for Unit data either;
o It is obviously possible that the same IT object contains the extensions of many statistical data having different structures (meaning mathematical functions); however, this is true for any kind of IT container, so not only for traditional datasets having different record types but also for DBMS tables, XML files, spreadsheets and so on (this seems not to be considered in the GSIM solution); moreover, to take into account real cases in all their variety, there are cases in which the extension of a mathematical function is contained in more than one IT container (the GSIM solution cannot model this case in a uniform way);
o The “statistical” perspective (statistical artifacts corresponding to mathematical objects) should be clearly distinguished from the “IT” perspective (the IT artifacts and objects), also because they are oriented to different people (the former to users/statisticians, the latter to IT experts); therefore a separate part of the model should be devoted to the mapping between the former (e.g. Data Flows) and the latter (IT data containers), whereas in the case of the Logical Record class GSIM mixes up the two perspectives;
o The resulting GSIM model is unnecessarily complex and this greater complexity is counterproductive; in fact, according to the previous points, it would have been possible to unify the representation of Unit and Dimensional data by assuming that each logical record is a distinct Data Flow, corresponding to a mathematical function and having its own data structure; this way, the Logical Record class would no longer have been needed, the differences between Unit and Dimensional data would have disappeared and the resulting model would have been simpler and more powerful at the same time:
Simpler because there would have been no need for different descriptions (and consequently different behaviors) for Unit and Dimensional data, which inevitably cause more difficulties for users, more barriers to the integration of different types of data, more complex software and higher costs;
More powerful both because the joint use of Unit and Dimensional data would have been facilitated and because the mapping between the mathematical structures and the IT structures could have been solved not only for Unit data but for Dimensional data too.
o By adopting the current modeling choice, GSIM implicitly gives a negative message to the possible users of the model: it suggests that Unit and Dimensional data should be differentiated (this is also explicitly said in points 91 to 94 of the GSIM specs).
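The separate mapping between statistical artifacts (Data Flows) and IT data containers mentioned in the drawbacks above could be sketched as a plain many-to-many relation. The Python class below is a hypothetical illustration, not a GSIM or SDMX construct; the names ContainerMapping, link, containers_of and dataflows_in, as well as the sample identifiers, are assumptions made for the example. It covers both cases discussed: one container holding several functions, and one function spread over several containers.

```python
from collections import defaultdict


class ContainerMapping:
    """Many-to-many relation between Data Flows and IT containers (hypothetical)."""

    def __init__(self):
        self._by_flow = defaultdict(set)
        self._by_container = defaultdict(set)

    def link(self, dataflow: str, container: str) -> None:
        # Record that (part of) the extension of the dataflow is in the container.
        self._by_flow[dataflow].add(container)
        self._by_container[container].add(dataflow)

    def containers_of(self, dataflow: str) -> list[str]:
        return sorted(self._by_flow.get(dataflow, ()))

    def dataflows_in(self, container: str) -> list[str]:
        return sorted(self._by_container.get(container, ()))


mapping = ContainerMapping()
mapping.link("HOUSEHOLDS", "census.xml")     # one container, two functions
mapping.link("PERSONS", "census.xml")
mapping.link("PERSONS", "persons_table")     # one function, two containers

print(mapping.dataflows_in("census.xml"))    # ['HOUSEHOLDS', 'PERSONS']
print(mapping.containers_of("PERSONS"))      # ['census.xml', 'persons_table']
```

Because the relation is kept outside the definition of the data structures, it applies in the same way to Unit and Dimensional data.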