Unit vs Dimensional Data

Summary of Findings

The lines of thought and the explanations set out below are different in nature to those I expected to present when I first set out to explore in more detail

       the underlying differences between Unit and Dimensional perspectives on data, and

       whether or not these differences are significant enough, in conceptual terms, to warrant retention of current divisions into “Unit” and “Dimensional” concrete classes in GSIM (eg for DataStructures, DataSets, Components).

Having undertaken the analysis I am now more convinced, for conceptual and practical reasons, that the current division is – on balance – appropriate.

On the other hand, I will not assert there is an absolutely watertight and incontestable case for the division.

It is proposed within the analysis that, not surprisingly, Unit data relates to individual Units.

An early barrier to exploring whether Unit and Dimensional data should remain differentiated in GSIM, however, was arriving at a similarly clear and unambiguous definition - for the purposes of comparison and analysis - of what Dimensional data relates to.

My recollection is that during the GSIM V0.8 to GSIM V1.0 process there was a common view that the distinction between Unit data and Dimensional data was not about microdata vs aggregate data.  Nevertheless, 75% of definitions in GSIM V1.0 related to DimensionalDataSets and DimensionalDataStructures mention the word “Aggregate”.

In seeking to arrive at a distinction that could be used for comparison and analysis, I started wondering “What is Dimensional data ‘about’?”.  For the purposes of this analysis, in line with long standing sources quoted below, I propose that Dimensional data refers to (sub) Populations, rather than individual Units.  When selecting this working definition, it is recognised that it is possible to have subpopulations that consist of zero or one Unit – or even to engineer the design of a DimensionalDataSet such that every subpopulation identified within the DimensionalDataSet is guaranteed to correspond to an individual Unit.

While this line of thinking is not reflected consistently by definitions in GSIM V1.0, it is strongly supported by the definition of DimensionalMeasureComponent.

A Represented Variable that has been given a role in a collection of aggregated data to hold the summary values (means, mode, total, index, etc.) for a specific sub-population.

If this line of thinking is accepted then I believe it is reasonable to consider Dimensional data relating to “summary” (or aggregate) data for subpopulations.

A concept like the “population of Australia” could be considered as a summary (count) of the persons (or person records) for the subpopulation identified as “All Persons”, “All Ages”, “All Regions” from a dimensional dataset structured as Sex x Age x Region.  It could equally well be considered as the measure of a property of Australia as a top level administrative Unit (Country) within the world.  This means a number given for “Population of Australia” might be considered as a summary or might be considered as a Unit measure – depending on context/perspective.  Similarly my age can be considered a count of the number of years I’ve been alive (an aggregate/summary) or simply an attribute of me as a Unit (of UnitType Person).

We can then note that Units and Populations are different (but related) objects within GSIM.

From the difference between “about Units” and “about (sub) Populations”, as detailed in this report, there appear to flow a number of differences in what can be done (from a mathematical/statistical perspective) with Unit data compared with Dimensional data.  Examples include

       “Finest grain” (sub) Populations limit the operations that are possible with Dimensional data

       The ability to interrelate data from different records/datasets is different for Units compared with (sub) Populations

       Relationships between Units and relationships between (sub) Populations are different in nature.

It may be possible to arrive at an alternative conceptual definition and model of “Dimensional” data which side steps the above dichotomy between “about (sub) Populations” and “about Units” (for Unit data).  Even if that turns out to be the case, however, the differentiation appears likely to apply, and be significant, in the majority of cases.  In other words, the above distinction may be a reasonable and useful basis for differentiation on heuristic grounds even if it is considered not to be completely beyond doubt (or completely satisfying) from a pure conceptual perspective.

It remains essential to recognise that in many cases the same “physical” set of data (eg codes and numbers stored in a relational database) could be viewed from both a Unit and a Dimensional perspective.  Para 94 of the GSIM V1.0 specification, for example, notes that “unit data” and “dimensional data” are different perspectives on data and that although not typically the case, the same set of data could be described both ways.

In fact, as illustrated in the section of this document titled “Edge Cases”

       DimensionalDataStructures can be used to describe data (eg the contents of a business register) that would typically be considered as relating to Units

       UnitDataStructures can be used to describe data that would typically be considered as relating to a Population  

It is not proposed that it is “wrong” to make such choices – depending on the context – but it is proposed that such edge cases

       constitute different (and atypical) ways of “looking at” the data concerned from a conceptual perspective, and

       the choice of perspective – and data description based on perspective - can influence what it is possible to do with the data in practice

In addition, under “Edge Cases”, it is highlighted

       that there can be multiple Unit perspectives on the same set of data and multiple Dimensional perspectives on the same set of data

       that, once again, the choice of perspective can influence what it is possible to do with the data in practice

Overall, these “edge cases” further highlight that the Unit vs Dimensional question is primarily about how we choose to “think about” and characterise a particular set of data, and not so much about the physical representation and implementation of the underlying data.  For me this highlights that the differentiation belongs in a conceptual (or, at a minimum, logical) characterisation of data rather than at the physical implementation level.

If the “about Units” versus “about (sub) Populations” basis for differentiation is broadly accepted, this may open the way for “tightening” some definitions in GSIM (eg related to Identifier, Measure and Attribute Components for Unit and Dimensional data) and for providing additional guidance in the User Guide on applying these concepts to design of Data Structures and Data Sets.

If the basis for differentiation is broadly accepted, the GSIM Implementation Group may wish to consider whether the possible “tightening” of definitions for GSIM V1.1 can be pursued over coming weeks and during the Sprint.

Structure of this report

The report starts by setting out the basic thesis.  This includes referring back to discussions of microdata and macrodata in the UNECE Guidelines For The Modeling Of Statistical Data And Metadata from 1995 and suggesting that

       the definitions of “microdata” and Unit Data are close matches

       the definitions of “macrodata” and Dimensional Data are not necessarily as close a match, but the extent of difference is ambiguous based on definitions used in GSIM V1.0, and the majority of instances appear likely to match in practice.

While matching (with a possible degree of imprecision) the two sets of definitions appeared more questionable to me at first, I gained greater confidence in the practical validity when the resulting framework appeared to apply very naturally to the discussion and analysis of examples, including edge cases.

The report then considers further the thesis that Dimensional Data is “about” sub Populations rather than Units, including implications which explain some of the observed differences between what is typically modelled, and what is typically possible, with Unit/microdata compared with Dimensional/aggregate data.

Topics considered include

       Consideration of the way “finest grain” sub Populations limit the operations which are possible with Dimensional data

       Differences when combining data from multiple DimensionalDataSets compared with combining data from multiple UnitDataSets

       Differences in relationships between Units and relationships between sub Populations

       Edge cases

o         Using DimensionalDataStructures to describe data about Units

o         Using UnitDataStructures to describe “aggregate” data

o         Different Dimensional perspectives on the same data

o         Different Unit perspectives on the same data

Rather than documenting overall conclusions at the end, these have been presented at the start of the report within the Summary of Findings.

Modelling statistical data, including identifying “what the data is about”

Why start here?

After analysing a number of examples, it seems possible the clearest “top down” explanation of the distinction between Unit perspectives and Dimensional perspectives on data is rooted in the classics – namely the work of Bo Sundgren on microdata and macrodata.

Source

For the purposes of this section of the report, I have drawn on Section 1.2 of the Guidelines For The Modeling Of Statistical Data And Metadata, where 1.2.1 discusses microdata and 1.2.2 discusses macrodata.

These UNECE guidelines date from 1995 and are still referenced by Part B of the METIS Common Metadata Framework.  The preface records

Statistics Sweden has been responsible for preparing the material. The work was conducted under the direction of Professor Bo Sundgren.

Bo’s modelling

What led me back to Bo’s work was trying to find a way of saying that a key difference between unit and dimensional views of data (and varying unit views of the same data and varying dimensional views of the same data) is what the data is considered to be “about”.

Microdata

Microdata are the result of observations or measurements of a set of object characteristics.  An object characteristic can be formalized as an ordered pair

C_o = O(t).V(t)

where

(i) O is an object type;

(ii) V is a variable;

(iii) t is a time parameter.

Macrodata

Macrodata, in daily talk simply referred to as "statistics", are the result of estimations of a set of statistical characteristics (statistical concepts).

A statistical characteristic can be formalised as a triple

C_s = O(t).V(t).f

where

(i) O(t).V(t) is an object characteristic;

(ii) f is a statistical measure, that is, an aggregation function (count, sum, average, correlation, etc) summarizing the true values of V(t) for the objects in O(t).
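As a rough illustration only (it is not part of the UNECE Guidelines), the two formalisations can be sketched in Python.  The class names, the sample observations and the choice of aggregation functions are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Mapping, Sequence


@dataclass(frozen=True)
class ObjectCharacteristic:
    """O(t).V(t): a variable V observed for objects of type O at time t."""
    object_type: str
    variable: str
    time: str


@dataclass(frozen=True)
class StatisticalCharacteristic:
    """O(t).V(t).f: an object characteristic plus an aggregation function f."""
    characteristic: ObjectCharacteristic
    f: Callable[[Sequence[float]], float]

    def estimate(self, microdata: Mapping[str, float]) -> float:
        # Macrodata: one summary value for the whole population of objects.
        return self.f(list(microdata.values()))


income_2011 = ObjectCharacteristic("Person", "Income", "2011")

# Hypothetical microdata: one observed value per individual object (Unit).
observations = {"person-1": 40.0, "person-2": 60.0, "person-3": 80.0}

mean_income = StatisticalCharacteristic(income_2011, mean)
total_income = StatisticalCharacteristic(income_2011, sum)
print(mean_income.estimate(observations))   # 60.0
print(total_income.estimate(observations))  # 180.0
```

In this sketch the microdata keeps one value per object, while each statistical characteristic yields a single value for the population as a whole.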

Discussion

As an aside, an interesting semantic difference between the two definitions is that microdata is said to refer to “observations” (or “measurements”) where macrodata is said to refer to “statistics” (or “estimates”).

 

A key (and possibly related) difference is that the “object” [loosely O(t), but debatable] for macrodata is described in terms of “objects” or “a population of objects existing at/during time t1”, where microdata (eg a GSIM Unit Data Record) relates to a specific object (the Unit in the GSIM definition of Unit Data Record).

Is there a significant conceptual difference between “Microdata” and “Unit Data”?

An early question is whether what Sundgren means by “microdata” corresponds with what GSIM V1.0 means by “unit data”.

I would contend they are close enough.

For example, GSIM defines a Unit Data Point as a placeholder for the value of a particular Instance Variable with respect to a given Unit.

For microdata, Sundgren talks about “object characteristics” where this is an ordered pair of an object identifier and a variable at a particular point in time.  Elsewhere in the UNECE guidelines “statistical units” is used as a synonym for “objects”, so where Sundgren refers to “object” I think the fit with (statistical) Unit in GSIM is reasonable.

Is there a significant conceptual difference between “Macrodata” and “Dimensional Data”?

I find this a less straightforward question to answer than the previous one.

A primary reason is that definitions related to “Dimensional” data in GSIM V1.0 appear somewhat inconsistent with each other.

My recollection (which may be faulty) is that during the GSIM V0.8 to GSIM V1.0 process we agreed the distinction between Unit data and Dimensional data was not about microdata vs aggregate data.  We agreed, instead, that it was about particular “perspectives” on data.  After the analysis in this report, however, I now wonder whether the underlying reason for choosing one perspective or the other typically boils down to whether we wish to consider a particular set of data from a unit or aggregate/summary perspective.

In any case, in GSIM V1.0 we have

Object

Definition in Glossary

Definition in UML

Dimensional Data Set

A collection of aggregated data that conforms to a known structure

A collection of aggregated data that conforms to a known structure

Dimensional Data Structure

Defines the structure of a collection of aggregated data by Represented Variables (in their respective roles as Dimensional Measure Components, Dimensional Attribute Components or Dimensional Identifier Components) and their Value Domains

75% of the definitions refer to “aggregated”.  Even the definition that doesn’t is associated with “aggregated data” as a synonym.

Even if the GSIM Implementation Group wished to consider that Dimensional Data is not “by definition” aggregated, I expect that in a substantial majority of practical cases “Dimensional” data corresponds to aggregated data/macrodata.

In addition, it is impossible to assess the exact extent and significance of any difference between “macrodata” and any alternative definition of “Dimensional” data until a more detailed alternative definition of “Dimensional” is tabled.

Considering Dimensional Data as about sub Populations rather than Units

I agree with Professor Sundgren that macrodata – and all dimensional data depending on definition – can typically be considered to be “about” (sub) Populations rather than Units (recognising that some of the subpopulations may consist of 1 Unit, or 0 Units).

 

The fact GSIM does not consider Population and Unit to be the same thing may suggest a possible significant difference between microdata and macrodata (to use the 1995 terms).  It remains necessary, however, to demonstrate – as explored in the subsequent sections – that this makes a (significant enough) difference in practice.

 

Firstly, it is interesting that GSIM does not define a Dimensional Data Record as a counterpart for the Unit Data Record.  I don’t think this is because such an entity cannot be defined, just that it is typically “less interesting” than a Unit Data Record because a Dimensional Data Record does not refer to an individual Unit (in the common sense).  A Dimensional Data Record could, however, be visualised as something like a single row in a database table holding dimensional data.

 

The GSIM V1.0 definition of Unit Data Point is

A placeholder in a Unit Data Record to contain the value (Datum) for an Instance Variable with respect to a given Unit.

 

The GSIM V1.0 definition of Dimensional Data Point is

A placeholder or cell in a Dimensional Data Set determined by the crossing of (all) the values for the Identifier Components to contain the value (Datum) for an Instance Variable (defined by a Measure Component) with respect to a given Unit.

 

As an aside, I’d question what Unit is likely to be associated with most Dimensional Data Points.

 

More significantly, however, why does the Unit Data Point definition not refer to Identifier Components?

 

A Unit Identifier Component is

 

The role that has been given to a Represented Variable, in a Unit Data Structure, to identify the Unit

 

I suspect the reason Identifier Components are not mentioned in the definition of Unit Data Point is that for Unit Data, once we know the Identifier Components, we consider we have direct identification of the Unit and we then feel comfortable talking in terms of the Unit rather than Identifier Components.

 

In the case of Dimensional Data (at least if it is macrodata) the Identifier Components actually identify a specific sub Population rather than a Unit.  The Dimensional Data Point then refers to the value (Datum) for a Measure Component for the specific sub Population which is identified through the specific combination of the values of the Identifier Components.

 

While there are edge cases that are explored further, in general the concept of “about a unit” vs “about a sub population” seems to explain a lot of the difference in GSIM V1.0 between the modelling of Unit Data and Dimensional Data.

 

It is recognised, however, that the concept “about a sub population” is not explicit in most of the current definitions associated with Dimensional data.  One exception is the definition of DimensionalMeasureComponent.

 

A Represented Variable that has been given a role in a collection of aggregated data to hold the summary values (means, mode, total, index, etc.) for a specific sub-population.

Observed differences between what is typically modelled, and what is typically possible, with Unit compared with Dimensional data.

“Finest grain” subpopulations limit the operations that are possible with dimensional data

Fundamentally, once you are dealing with the “finest grain” subpopulation identified by a particular set of DimensionalIdentifierComponents then, unless you have an added ability to “drill down” to the Unit Data Records associated with that subpopulation, you cannot further differentiate or analyse the individual members of that subpopulation.

 

The idea of “finest grain” is important.  In a DimensionalDataSet where the IdentifierComponents are Occupation and Sex, it may be possible to further differentiate the subpopulation of “Medical Practitioners” to consider “Male Medical Practitioners” or “Surgeons” (or “Male Surgeons”) by “drilling down” on various dimensions.  At some point, however, a “finest grain” will be reached.

 

If the measures available are a count of the subpopulation and a total income then it will be possible to work out the mean income of “Male Surgeons” but not the median income (which would require the associated Unit Data Records).
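The point about mean versus median can be illustrated with a small sketch.  The income figures below are invented purely for illustration, and `to_cell` is a hypothetical helper standing in for whatever process aggregates Unit Data Records into a finest grain cell.

```python
from statistics import median

# Two hypothetical sets of Unit Data Records (incomes of "Male Surgeons").
units_a = [100, 200, 300, 400, 1000]
units_b = [100, 100, 100, 700, 1000]


def to_cell(incomes):
    """Aggregate unit records into the measures held for one finest grain cell."""
    return {"count": len(incomes), "total_income": sum(incomes)}


cell_a, cell_b = to_cell(units_a), to_cell(units_b)
assert cell_a == cell_b  # both sets of units produce identical dimensional data

# The mean is recoverable from the cell alone.
mean_income = cell_a["total_income"] / cell_a["count"]
print(mean_income)  # 400.0

# The median is not: it differs between the two underlying sets of units.
print(median(units_a), median(units_b))  # 300 100
```

Because two different sets of unit records collapse to the same cell, no function of the cell alone can return the median.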

 

Similarly, even where Age is an IdentifierComponent, it will not be possible to identify the most prevalent star sign for Male Surgeons – although this may be possible from Unit Data Records that contain dates of birth.

 

It is not always the case, either, that measures for “coarser grain” populations can be derived from finest grain populations, depending on the type of measure.  For example, if I know the median income for Male Surgeons and for Female Surgeons, I cannot derive the median income for Surgeons as a whole.  (If I only know the mean income of Male Surgeons and of Female Surgeons then I can’t work out the mean income for Surgeons as a whole either, but if I have counts as well then I can – at least if I set aside consideration of statistical error in estimation).
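A minimal sketch of the roll-up case follows, again with invented figures.  `combined_mean` is a hypothetical helper; it ignores statistical error in estimation, as the text above does.

```python
from statistics import median


def combined_mean(groups):
    """Roll up (count, mean) pairs into the mean of the combined subpopulation."""
    total_count = sum(count for count, _ in groups)
    return sum(count * mean for count, mean in groups) / total_count


# Hypothetical cells: (count, mean income) for Male and Female Surgeons.
print(combined_mean([(30, 200.0), (10, 400.0)]))  # 250.0

# Two hypothetical scenarios with the same subgroup medians (200 and 400)...
males_1, females_1 = [100, 200, 300], [400, 400, 400]
males_2, females_2 = [200, 200, 200], [100, 400, 900]

# ...but different medians once the subgroups are combined.
print(median(males_1 + females_1))  # 350.0
print(median(males_2 + females_2))  # 200.0
```

Counts act as the weights that make means additive across subpopulations; no comparable weighting exists for medians.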

 

Combining data from multiple datasets

Linking (and then, typically, combining for the purposes of analysis) UnitData typically consists of concluding that two different UnitDataRecords, in two different UnitDataSets, are referring to the same Unit.  This may be because the IdentifierComponents are the same, or because they can be demonstrated to be equivalent (eg via a third source that correlates the IDs).  Alternatively, matching may be probabilistic, based (eg) on values of a number of MeasureComponents.
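The deterministic case, where a third source correlates the IDs, might be sketched as follows.  The datasets, identifier formats and the `link` function are all hypothetical; probabilistic matching is not attempted here.

```python
# Hypothetical UnitDataSets keyed by different identifier schemes.
tax_records = {"T-001": {"income": 52000}, "T-002": {"income": 61000}}
health_records = {"H-9": {"visits": 3}, "H-7": {"visits": 1}}

# Hypothetical third source correlating the two sets of IDs.
concordance = {"T-001": "H-7", "T-002": "H-9"}


def link(left, right, id_map):
    """Merge records that refer to the same Unit, using id_map to translate IDs."""
    linked = {}
    for left_id, left_record in left.items():
        right_id = id_map.get(left_id)
        if right_id in right:
            linked[left_id] = {**left_record, **right[right_id]}
    return linked


linked = link(tax_records, health_records, concordance)
print(linked["T-001"])  # {'income': 52000, 'visits': 1}
```

The result is a wider record for the same Unit, which is the sense in which Unit data linking is "about" identity of the Unit.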

 

Arguably, even longitudinal data linking tends to be about relating corresponding observations for “the same” Unit over time.  (The point can get philosophical; a “river of life” perspective would argue that the person you are surveying with one question at one moment is not the same person you are surveying with the next question the next moment – that’s within a single study, let alone longitudinally.)

 

Linking DimensionalData sometimes consists of ensuring the reference is to the same subpopulation.  In many of these cases, in practice, IdentifierComponents will not match.  For example, an Australian Population Census DataSet for 2011 may not include Time as a dimensional component at all, and will use the code “0” for Australia in the spatial dimension.  Data from an international time series dataset might refer to (“close enough to”) the same subpopulation but the combination of IdentifierComponents will almost certainly be different (eg an explicit time dimension and use of the code “au” for Australia).
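The census example could be sketched as below.  The counts are placeholders rather than real statistics, and the concordance between the codes “0” and “au” is an assumption that would need to be documented somewhere outside the two datasets.

```python
# Hypothetical cells; the counts are placeholders, not real statistics.
census_2011 = {("0",): 100}                                  # no Time dimension
international = {("au", "2011"): 101, ("de", "2011"): 200}   # explicit Time

AREA_CODE_MAP = {"0": "au"}      # assumed concordance between spatial code lists
CENSUS_REFERENCE_YEAR = "2011"   # implicit in the census dataset


def normalise_census_key(key):
    """Re-express a census key as (area, year) using the international codes."""
    (area,) = key
    return (AREA_CODE_MAP[area], CENSUS_REFERENCE_YEAR)


aligned = {normalise_census_key(key): value for key, value in census_2011.items()}
comparable = sorted(aligned.keys() & international.keys())
print(comparable)  # [('au', '2011')]
for key in comparable:
    print(key, aligned[key], international[key])
```

Nothing in either set of IdentifierComponents says the two cells describe the same subpopulation; that knowledge lives in the mapping supplied by the analyst.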

 

More typically, however, different DimensionalDataSets are combined to obtain information on “related” subpopulations.  A very common example, cited on Page 8 of the UNECE Guidelines, relates to analysing “corresponding” subpopulations over time.

 

If I seek to combine data from a 1991 Population Census DimensionalDataSet and a 1996 Population Census DimensionalDataSet, and I am interested in the characteristics of 15-19 Year Old Males in the ACT (Australian Capital Territory), the subpopulation in 1991 and the subpopulation in 1996 should consist entirely of different Units.  (Any person who was 15 years old on 6 August 1991 should have been 20 years old by 6 August 1996.)
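The arithmetic behind this can be checked in a couple of lines.  The helper name is hypothetical, and the sketch assumes age bands are inclusive whole years.

```python
def age_band_after(band, years):
    """Shift an inclusive (low, high) age band forward by a number of years."""
    low, high = band
    return (low + years, high + years)


band_1991 = (15, 19)
aged = age_band_after(band_1991, 1996 - 1991)

# Two inclusive bands overlap only if the larger lower bound is within the smaller upper bound.
overlap = max(band_1991[0], aged[0]) <= min(band_1991[1], aged[1])
print(aged, overlap)  # (20, 24) False
```

With a five year gap and a five year age band, nobody counted in the 1991 cell can be counted in the "same" cell in 1996.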

 

Nevertheless, it would not be unusual to wish to explore characteristics of the two subpopulations which might appear equivalent in terms of their IdentifierComponents.  (Time is not usually an explicit dimension in Australian Population Census Datasets – except for Time Series Datasets)

 

The above example includes an added complication, however, because the ACT changed its definition between 1991 (when Jervis Bay was included) and 1996 (when it wasn’t).  Similar things can happen in terms of other dimensions over time (eg different scopes for some “seemingly equivalent” Industry Divisions over different versions of the Industry Classification).  This indicates some of the particular risks and issues with “combining” based on similarity of subpopulations.

 

The other common example is analysing subpopulations which correspond to each other in all but (ideally) one regard, other than time.  An example would be comparing the characteristics of 15-19 Year Old Males in Germany with those in Australia.  If one DimensionalDataSet comes from Germany and the other from Australia then, unless there is harmonisation in advance such as reporting against an internationally agreed SDMX Data Structure Definition, gauging the exact extent of similarity (and difference) in regard to the subpopulations is likely to be particularly exacting.

Relationships

There are significant relationships between subpopulations for DimensionalDataSets based on dimensionality.  For example

       Superset:subset (eg Medical Practitioners:Male Medical Practitioners or Medical Practitioners:Surgeons)

       Differing by one Dimension (eg Males 15-19 living in Germany:Males 15-19 living in Australia)

As Units are individual “things” (people, businesses, events etc), however, they tend (as illustrated below) to have more specific relationships, including with Units of different Unit Types.  These relationships are typically described through RecordRelationship.  RecordRelationship is a construct which is not associated with DimensionalDataStructures in GSIM V1.0 (nor, as far as I am aware, in modelling outside GSIM).

Typically for DimensionalDataSets, Superset:Subset relationships work basically the same way regardless of which of the DimensionalIdentifierComponents you drill down on (or roll up on).  RecordRelationships for UnitDataSets, however, are not such a “generic” mechanism.
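The “generic” nature of relationships between subpopulations can be sketched by treating a subpopulation as a mapping from dimension to value, where an absent dimension means “all values”.  That representation and both function names are assumptions of the sketch; drilling down within a hierarchical dimension (eg Medical Practitioners to Surgeons) would additionally need the code list hierarchy and is not covered.

```python
def is_subset(sub, sup):
    """True if sub fixes every dimension that sup fixes, to the same value."""
    return all(sub.get(dim) == value for dim, value in sup.items())


def differs_by_one_dimension(a, b):
    """True if a and b fix the same dimensions and disagree on exactly one."""
    if a.keys() != b.keys():
        return False
    return sum(a[dim] != b[dim] for dim in a) == 1


practitioners = {"Occupation": "Medical Practitioner"}
male_practitioners = {"Occupation": "Medical Practitioner", "Sex": "Male"}
print(is_subset(male_practitioners, practitioners))  # True
print(is_subset(practitioners, male_practitioners))  # False

germany = {"Sex": "Male", "Age": "15-19", "Country": "Germany"}
australia = {"Sex": "Male", "Age": "15-19", "Country": "Australia"}
print(differs_by_one_dimension(germany, australia))  # True
```

Neither function needs to know which dimension is involved, which is the sense in which the mechanism is generic.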

In Population Census data, I can have a subpopulation of Males 40-44 living in ACT in 2006 and Males 15-19 living in ACT in 2006.  These two subpopulations can be seen as having a “differing by one Dimension” relationship with each other.  If I had access to Unit Records underpinning this dimensional data, however, I may discover additional relationships such as

       In some cases a member of the first subpopulation lives in the same dwelling as a member of the second subpopulation

       In a subset of these cases, the member of the first subpopulation is the father of the member of the second subpopulation

As per Pages 22-23 in

http://www.ausstats.abs.gov.au/Ausstats/subscriber.nsf/0/CACF387B87CE36F3CA2575B4001A2380/$File/20370_2006.pdf

Australian Population Census data has a comparatively simple structure of Dwelling Records, Family Records and Person Records.  Each record type is associated with a different set of Unit Measure Components (Represented Variables with specific roles) that are relevant to the UnitType associated with that record type.

To be able to relate an individual Unit of one Unit Type to the corresponding Unit of another Unit Type, the records record, for example, the Family to which Persons “belong” and the Dwelling to which Families “belong”.  (Based on current definitions in GSIM, it seems the Family Record Identifier on the Person Record might not be considered a UnitIdentifierComponent because it is not identifying the Unit associated with the Person Record.)
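A minimal sketch of this three level structure follows.  The record contents and field names are invented for illustration and are not taken from the Census record layout.

```python
# Hypothetical Family and Person records; each carries the ID of its "parent" record.
families = {
    "F1": {"dwelling_id": "D1"},
    "F2": {"dwelling_id": "D1"},
    "F3": {"dwelling_id": "D2"},
}
persons = {
    "P1": {"family_id": "F1", "age": 42},
    "P2": {"family_id": "F1", "age": 16},
    "P3": {"family_id": "F2", "age": 30},
    "P4": {"family_id": "F3", "age": 17},
}


def dwelling_of(person_id):
    """Follow Person -> Family -> Dwelling via the recorded parent IDs."""
    return families[persons[person_id]["family_id"]]["dwelling_id"]


def same_dwelling(person_a, person_b):
    return dwelling_of(person_a) == dwelling_of(person_b)


print(same_dwelling("P1", "P2"))  # True: same family
print(same_dwelling("P1", "P3"))  # True: different families, same dwelling
print(same_dwelling("P1", "P4"))  # False
```

The traversal depends on knowing which field on which record type points to which other record type, which is specific to this structure rather than generic.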

The record relationships for Population Census data can be seen as relatively straightforward.  By contrast, the Unit Data Structure for microdata from the Survey of Disability, Ageing and Carers in 2003 consisted of ten record types arranged in a complex hierarchy.

Edge Cases

Using DimensionalDataStructures to describe data about units

It is possible to use DimensionalDataStructure to describe (what would usually be considered) microdata.

 

This might be interpreted as structuring a DimensionalDataSet in a manner that ensures every combination of DimensionalIdentifierComponents identifies a subpopulation which consists of a single unit.

 

An example is that a business register could record information about Local Units, Global Enterprises and Global Enterprise Groups.  The DimensionalIdentifierComponent might become simply the ID of the Unit.

 

Everything else that was recorded about the unit (eg Main Industry, Number of Employees, Reported Turnover Last Year, Geographic Location of Headquarters) would be either a DimensionalMeasureComponent or a DimensionalAttributeComponent.

 

DimensionalIdentifierComponents are typically coded (“cross classifications”) although this is not a requirement within GSIM, or within common implementation standards such as SDMX.

 

If the hierarchy is simple enough (eg no unit at one level belongs to more than one unit at the next level up) the relationship between Local Units, Global Enterprises and Global Enterprise Groups could be represented by – and accessed systematically via – a CodeList.  This would have the disadvantage, however, of needing to maintain a large and complex CodeList which records each unit and its parent.  It could require updating (and, potentially, versioning) the CodeList each time a new unit is recognised (including “births” and mergers) and each time a relationship changes (eg acquisitions).  Maintaining unit information as a CodeList, therefore, while possible in some cases, may be a cumbersome option.
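As a sketch of the CodeList option, the hierarchy can be held as a mapping from each unit’s code to its parent’s code.  The codes, employee counts and function names are hypothetical, and the sketch assumes the simple single-parent case described above.

```python
# Hypothetical code list: each unit's code mapped to its parent's code.
code_list = {
    "GEG-1": None,     # Global Enterprise Group
    "GE-1": "GEG-1",   # Global Enterprises
    "GE-2": "GEG-1",
    "LU-1": "GE-1",    # Local Units
    "LU-2": "GE-1",
    "LU-3": "GE-2",
}
employees = {"LU-1": 10, "LU-2": 5, "LU-3": 20}


def ancestors(code):
    """Return the chain of parents above a code, nearest first."""
    chain = []
    parent = code_list[code]
    while parent is not None:
        chain.append(parent)
        parent = code_list[parent]
    return chain


def roll_up(measure):
    """Add each unit's value to every level above it in the hierarchy."""
    totals = dict(measure)
    for code, value in measure.items():
        for parent in ancestors(code):
            totals[parent] = totals.get(parent, 0) + value
    return totals


totals = roll_up(employees)
print(totals["GE-1"], totals["GE-2"], totals["GEG-1"])  # 15 20 35
```

Every birth, merger or acquisition would mean editing `code_list` itself, which illustrates why the approach becomes cumbersome for a live register.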

 

Another alternative might be to store the required information within one or more DimensionalAttributeComponents (eg “ID of parent unit”).  This potentially requires particular relationships between different components (including between the DimensionalIdentifierComponent and other components) to be

 

1.       documented when describing the data, and

2.       harnessed when analysing the data.

 

A generic application for working with dimensional data (eg as “data cubes”) would usually have significant issues being able to recognise, and correctly utilise, this additional information.  A generic application for working with Unit Data, however, should not face the same issues were the data described using a UnitDataStructure.

 

This can be seen as a separate consideration to the possibility of describing such data as dimensional for the purpose of exchange (eg using infrastructure based on SDMX) as opposed to describing it as dimensional to support more applied and operational uses.

 

It would be possible, for example, to exchange (and, eg, synchronise) data using DimensionalDataSets on a generic basis between two systems which, when operating on the data after exchange, apply the specialised concepts related to unit relationships, record types and linkages associated with UnitDataStructures.

Using UnitDataStructures to describe “aggregate” data

This is a scenario which has been raised several times during discussions.

 

An example might be to take a range of data (eg counts of population and selected sub-populations, measures of economic activity) and consider them as measures relating to a particular administrative unit (eg a country, state/province or local government area).  In these cases the measures would typically be thought of as “aggregate” (eg counts of people, estimates of turnover and other measures for industry sectors).  In this scenario, however, they would be seen as attributes/measures of a (larger scale) “unit”.

 

Such an example might be seen as not so different in concept to, eg, recording the number of cars associated with a dwelling or the number of employees associated with an enterprise.  In terms of cars/employees as units these are aggregate counts, but in terms of dwellings/enterprises as units these are a measure/attribute associated with a particular unit.

 

In this example, it would be quite possible there could be differences in the data recorded/available at, eg, country, state/province and local government area level.  In other words, even if held in a single set of records physically, there could be multiple record types in a logical sense depending on the UnitType associated with each record.

 

This forms the basis of quite a common technique used in practice for “bringing together” data with quite different dimensionality and, in fact, quite different underlying concepts.  For example, it is possible to bring together quite diverse social, economic and environmental data by relating it back to a common administrative unit – and potentially a common reference time period – where it would be inordinately complex – and of very limited practical use - to try to take the different DimensionalDataStructures associated with each of the sources of data that is being brought together and to seek to synthesise from those structures a coherent “hypercube”.

 

It is worth noting that although the data sources that are brought together this way might be quite readily interpretable from a dimensional perspective (without giving particular precedence to the dimension that identifies the administrative unit) once the data starts being interpreted from a Unit perspective some of the other constructs change.

 

For example, a sample of records from a simple dimensional dataset might be

 

Region          | Sex    | Age | Count
----------------|--------|-----|------
New South Wales | Male   | 0-4 | X
New South Wales | Female | 0-4 | X
New South Wales | Male   | 5-9 | X
New South Wales | Female | 5-9 | X
Victoria        | Male   | 0-4 | X
Victoria        | Female | 0-4 | X
Victoria        | Male   | 5-9 | X
Victoria        | Female | 5-9 | X

 

If this data starts being considered as relating to each Region as a Unit, however, there is a tendency (but not an absolute requirement in GSIM) to consider each Unit as having one record (at least for any one reference period).  Thus “Number of Males 0-4”, “Number of Females 0-4”, “Number of Males 5-9”, “Number of Females 5-9” (at a particular point in time) would all become attributes/measures associated with the Unit “New South Wales”.  In other words, the physical structure of “several short records associated with each Unit” would more often be considered logically as a “single wide record for each unit”.
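The shift from “several short records” to a “single wide record for each unit” can be sketched as a simple pivot.  The counts are placeholders (the sample records above show only “X”), and `to_wide` is a hypothetical helper.

```python
# Hypothetical dimensional records: (Region, Sex, Age, Count); counts are placeholders.
long_records = [
    ("New South Wales", "Male", "0-4", 11),
    ("New South Wales", "Female", "0-4", 12),
    ("New South Wales", "Male", "5-9", 13),
    ("New South Wales", "Female", "5-9", 14),
    ("Victoria", "Male", "0-4", 21),
    ("Victoria", "Female", "0-4", 22),
    ("Victoria", "Male", "5-9", 23),
    ("Victoria", "Female", "5-9", 24),
]


def to_wide(records):
    """Turn each Sex x Age cell into a named measure on its Region's record."""
    wide = {}
    for region, sex, age, count in records:
        wide.setdefault(region, {})[f"Number of {sex}s {age}"] = count
    return wide


wide = to_wide(long_records)
print(wide["New South Wales"]["Number of Males 0-4"])  # 11
print(sorted(wide))  # ['New South Wales', 'Victoria']
```

In the wide form, Sex and Age are no longer dimensions that can be drilled on; they have been absorbed into the names of the measures attached to each Region.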

 

When I subsequently looked for ABS data presenting population by age by sex by region, the first example I found does in fact present it on a “single wide record for each unit” basis.

 

Incidentally, this example may suggest that the complex, multi-parent / multi-path hierarchies commonly used to roll countries (as political units) up into various regional, economic and political groupings may be driven more by seeking to describe complex sets of unit relationships and less by seeking to describe the structure of a generic “classificatory” dimension within a DimensionalDataStructure.  This topic has received much discussion in regard to GSIM.  Perhaps it is appropriate that these relationships are described by additional metadata, useful if one wants to consider the details of how the different “units” relate to each other, rather than within the generic “dimensional” (and “classificatory”) definition of data.  This could be seen, more or less, as what Hierarchical Code Lists within SDMX do: they are not used in the direct dimensional definition of a DimensionalDataStructure.

 

Using UnitDataStructures to describe “aggregate” data in this manner appears reasonable, depending on how the designer of the DataStructure intends to present and use the data.  It seems arguable, however, that what would appear to be aggregate data from a different perspective is, in such cases, actually being presented as microdata associated with a “larger scale” unit.

Different Dimensional perspectives on the same data

The following example provides an illustration, and brief exploration, of two different dimensional characterisations of the same (or equivalent) data.

Dimensional DataSet with Structure 1 (DDS 1)

Month      State  Sex     Employed (‘000)  Unemployed (‘000)  In labour force (‘000)  Not in labour force (‘000)
July 2013  NSW    Male    x                x                  x                       x
July 2013  NSW    Female  x                x                  x                       x
July 2013  Vic    Male    x                x                  x                       x
July 2013  Vic    Female  x                x                  x                       x
...

The overall population for Labour Force statistics in Australia is the Civilian population aged 15 years and over.

The first row of data in DDS 1 relates to the subpopulation of males living in NSW.

Dimensional DataSet with Structure 2 (DDS 2)

Month      State  Sex   Labour Force Status  Estimate (‘000)
July 2013  NSW    Male  Employed             x
July 2013  NSW    Male  Unemployed           x
July 2013  NSW    Male  In labour force      x
July 2013  NSW    Male  Not in labour force  x
...

DDS 2 can be considered a simple “repackaging” of the data associated with DDS 1.  In DDS 2 the first row relates to the subpopulation of employed males living in NSW.

The difference in packaging/perspective can, however, be significant.

If we want to calculate “Unemployment Rate” then DDS 1 makes this straightforward: it is simply a new measure calculated by expressing “Unemployed” as a percentage of “In labour force”.

For DDS 2, however, adding “Unemployment Rate” is not straightforward because it is not a measure related to any of the subpopulations expressed in DDS 2 (it is actually based on a ratio between two of the subpopulations).

In other words, how dimensional data is characterised can impact – at least in some cases – the “functions” that can (readily) be applied to it.
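
The contrast can be made concrete with a minimal Python sketch.  It assumes each DataSet is held as a list of dictionaries; the key names, the function names and the figures are hypothetical, chosen only so the arithmetic is easy to follow.  Under DDS 1 the rate is computed within a single row, whereas under DDS 2 the rows for two different subpopulations must first be brought together.

```python
# Hypothetical estimates ('000); the source tables show only placeholders.
dds1 = [
    {"month": "July 2013", "state": "NSW", "sex": "Male",
     "employed": 1900.0, "unemployed": 100.0,
     "in_labour_force": 2000.0, "not_in_labour_force": 800.0},
]

dds2 = [
    {"month": "July 2013", "state": "NSW", "sex": "Male",
     "labour_force_status": "Employed", "estimate": 1900.0},
    {"month": "July 2013", "state": "NSW", "sex": "Male",
     "labour_force_status": "Unemployed", "estimate": 100.0},
    {"month": "July 2013", "state": "NSW", "sex": "Male",
     "labour_force_status": "In labour force", "estimate": 2000.0},
    {"month": "July 2013", "state": "NSW", "sex": "Male",
     "labour_force_status": "Not in labour force", "estimate": 800.0},
]


def rate_from_dds1(row):
    """DDS 1: both measures sit on the same row, so the rate is a row-wise calculation."""
    return 100.0 * row["unemployed"] / row["in_labour_force"]


def rates_from_dds2(rows):
    """DDS 2: regroup rows by (month, state, sex) before any ratio can be formed."""
    grouped = {}
    for row in rows:
        key = (row["month"], row["state"], row["sex"])
        grouped.setdefault(key, {})[row["labour_force_status"]] = row["estimate"]
    # The rate relates two subpopulations, so it belongs to none of the original rows.
    return {
        key: 100.0 * statuses["Unemployed"] / statuses["In labour force"]
        for key, statuses in grouped.items()
    }


dds1_rate = rate_from_dds1(dds1[0])
dds2_rates = rates_from_dds2(dds2)
```

Both routes give the same figure for the same data, but the DDS 2 route has to set “Labour Force Status” aside as an identifier before the ratio can be formed, and its result has no natural home in the DDS 2 structure.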

DDS 1 and DDS 2 would be characterised by different DimensionalDataStructures under GSIM even though the same physical instance of data (given appropriate conceptual to physical mappings at a level below GSIM) could be related to both structures.  The difference in structure in this instance depends on what is characterised as a Dimensional Identifier Component compared with a Dimensional Measure Component.

One use case for GSIM is to provide a common reference model at the conceptual level to which different physical/technical implementations can be related.  This allows GSIM to assist translations between different physical implementations based on whether the concepts (about the structure of data and metadata) associated with the different technical implementations are the same or not.  This may result in a translation that is simpler to design, and easier to assess the semantic quality of, than simply asking technical staff to try mapping (eg) two sets of XML schema elements to each other at a technical level. 

If we were using GSIM in this manner in this example, and if we were seeking to translate data from an in house physical instance to a standard “dimensional data” format such as SDMX, a DDI NCube, Google Dataset Publishing Language or the RDF Data Cube Vocabulary, it would make a practical difference in all of these cases whether DDS 1 or DDS 2 were selected as the way the in house physical instance of data was characterised in conceptual terms.

(In this scenario, there would also need to be a layer of metadata that related the conceptual characterisation based on GSIM to the in house physical instance of data.  Metadata that connects “logical to physical” is outside the scope of GSIM but is addressed in standards such as DDI which can be used for implementing GSIM.)

In conclusion, the conceptual/logical characterisation of data in GSIM is important to supporting a range of use cases for GSIM as a common reference model.  These characterisations are not matters that should be left undifferentiated at the GSIM level.  Doing so would require them to be worked out case by case, duplicatively and potentially inconsistently, at the “physical implementation” level each time.

Different Unit perspectives on the same data

The following is an actual case that arose in the ABS.

 

Two different subject matter areas were considering the same physical data records.  These data records included a Person ID, the Person’s name and the Person’s address, as well as a range of other data related to the person in question.

 

One subject matter area, possibly to facilitate record matching with other data, chose to identify records through the combination of Name and Address.  The other subject matter area chose to identify records using Person ID.

 

The conclusion reached by analysts within the ABS is that the two areas were seeking to describe, and work with, the data using two different UnitDataStructures at the conceptual level.

 

In the first case Person Name and Person Address were acting as UnitIdentifierComponents.  Person ID, arguably, was simply a UnitAttributeComponent which was not terribly relevant for the purposes of most analysis by that user of the data.

 

In the second case, Person ID was being used as the UnitIdentifierComponent and potentially Person Name and Person Address were UnitAttributeComponents.  (Are there any circumstances under which Person Address – or, eg, Person Sex - could be considered a UnitMeasureComponent or are UnitMeasureComponents always numeric measures, even though this does not appear to be stated explicitly in GSIM V1.0?)
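
The two characterisations can be sketched in Python as two different structures applied to the same physical records.  This is an illustration only: the class UnitDataStructureSketch, the field names and the sample records are hypothetical and are not GSIM definitions, and the sketch deliberately leaves open the question of which components might count as measures.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class UnitDataStructureSketch:
    """Hypothetical stand-in for a UnitDataStructure: names which fields play which role."""
    identifier_components: Tuple[str, ...]
    attribute_components: Tuple[str, ...]

    def key(self, record: Dict[str, str]) -> Tuple[str, ...]:
        """Build the identifying key for a record under this structure."""
        return tuple(record[name] for name in self.identifier_components)


# Hypothetical physical records shared by both subject matter areas.
records: List[Dict[str, str]] = [
    {"person_id": "P001", "name": "Alex Example", "address": "1 Sample St"},
    {"person_id": "P002", "name": "Sam Example", "address": "2 Sample St"},
]

# First case: Name and Address identify the record; Person ID is an attribute.
by_name_and_address = UnitDataStructureSketch(
    identifier_components=("name", "address"),
    attribute_components=("person_id",),
)

# Second case: Person ID identifies the record; Name and Address are attributes.
by_person_id = UnitDataStructureSketch(
    identifier_components=("person_id",),
    attribute_components=("name", "address"),
)


def index_records(structure, recs):
    """Index the same physical records by the identifier the structure declares."""
    return {structure.key(rec): rec for rec in recs}


index_a = index_records(by_name_and_address, records)
index_b = index_records(by_person_id, records)
```

The records themselves are untouched; only the declared role of each field changes, which is the sense in which the two areas were working with different UnitDataStructures at the conceptual level.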