Workshop on Implementing Standards for Statistical Modernisation, 21-23 Sept 2016 in Geneva.
Issue raised by GSIM / DDI mapping work:
What is a non structured data set? Why do we need it?
A nonstructured dataset is a Data Set whose structure is not described in a Data Structure. Basically UnitDataSet and DimensionalDataSet require that there is a known structure (Data Structure). You couldn't use those concrete classes in GSIM to model the fact you know a dataset exists but you don't (yet) know its structure. In theory you might know whether it is a Unit data set or a Dimensional data set but you can't describe it formally as one or the other because you don't know its precise structure. (Note that to really define the structure you need to know not just what columns/fields exist but also the variable being represented within each column/field.)
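As an illustration only (this is not part of GSIM), the distinction described above could be sketched in Python. The class names echo the GSIM object names, but the fields, the `note` attribute and the `describe` helper are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStructure:
    # Hypothetical simplification: a structure is a named tuple of variables.
    name: str
    variables: tuple


@dataclass(frozen=True)
class UnitDataSet:
    # In this sketch a Unit Data Set cannot be created without a Data Structure.
    name: str
    structure: DataStructure

    def __post_init__(self):
        if not isinstance(self.structure, DataStructure):
            raise TypeError("a UnitDataSet requires a known DataStructure")


@dataclass(frozen=True)
class NonStructuredDataSet:
    # Records that the data exists; nothing is claimed about its structure.
    name: str
    note: str = ""


def describe(name, structure=None):
    """Return the most specific object we are entitled to create."""
    if structure is None:
        return NonStructuredDataSet(name, note="structure not (yet) known")
    return UnitDataSet(name, structure)
```

Calling `describe("external_feed")` gives a placeholder object, while `describe("births", some_structure)` gives a `UnitDataSet`; trying to build a `UnitDataSet` with no structure fails.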
I agree the idea of a "nonstructured dataset" sounds odd.
One alternative would be to say that UnitDataSets and DimensionalDataSets can exist without having a (known) DataStructure. I'd be wary of that option: it means knowing there is a UnitDataSet or DimensionalDataSet does not imply it has a (knowable) DataStructure - it makes for a very "loose" model.
Another option might be to call such an object something different to a DataSet. I am not sure what the alternative name would be. That way you can say that by definition - as an abstract class - DataSets have a known structure associated with them.
I don't like the idea of simply saying such an object can't be described. You may wish to be able to say that a Data Resource exists with, eg, UnitDataSets associated with it (with known Structures), DimensionalDataSets associated with it (with known Structures) and some other "sets" of data for which the structure is not (yet) known. If there is no way to talk about other "sets" of data then you can't catalogue everything you know about the (eg external) data resource unless you know absolutely everything about every DataSet. For an external data source you may only wish to invest in investigating certain "sets" of data thoroughly, and understanding their structure, only at the time you have a potential use for them. If you can't even document that you know such sets of data exist, however, you won't know to come back and investigate them more fully later.
Whilst I remain to be convinced either way that we really need this object, perhaps a better term could be "DataPool". This would sort of mirror the approach in the Nordic Metamodel (used for some implementations of PC-Axis) where value sets are drawn from value pools, based on certain criteria. A DataPool could then be defined as a DataResource with an unknown structure. For me, this would logically refer just to unit data, as dimensional data must have at least one dimension, which means that structure is at least implicit.
Another option could be simply to say that a DataResource may or may not have a known structure. The object we are talking about would then be implicit (the UML would need tweaking!)
In both cases, I agree with Al that DataSets could be defined as having a known structure.
I am keen not to lose the idea entirely that we can know a (potentially "tappable") resource exists without knowing everything about that resource. If NSIs are going to make more use of existing data resources outside the NSI if/when a new need arises, we're going to need better records of what is out there. The record of what is out there, however, might not always be documented to the level of detail of defined data flows based on defined data structures at first.
After all, when it actually comes time to "tap" that resource we may only want to use a small number of the "fields" from the data which are available. Realistically, I think NSIs will invest effort in defining detailed structures based on what they want to use from the resource, not necessarily "chart in complete detail" the totality of the resource when they first learn it exists.
For me there are strong parallels with natural resource economics. We may not assess every resource in detail as soon as we discover it - we may judge that for the time being the resource is not going to be of economic interest - but we still want to record the resource exists and what we know about it. By analogy, it is when we've established the resource is likely to be an economic asset (eg because demand for that commodity has increased) that we will assess it to the nth degree of detail (and, as a producer of official statistics, formalise a data structure definition for the portion of the overall resource that is of interest to us).
Maybe this does lead to "scope creep" for Data Resource and Information Resource. More broadly than statistics, however, what is being described are "resources" that are not necessarily yet under management as "assets", but which may make that transition in future.
I agree that we may want to represent some data resource whose structure we don't yet know, at least not in detail, as an information object. Conceptually, it would still be a data resource. And its structure exists, we just haven't yet found out (probably not even tried to find out) what that structure looks like. This is not the same as unstructured data (typically texts) that can be structured for example through tagging, text mining, etc. Therefore I don't like the term Unstructured/Non-structured Dataset for the Data Resource with unknown Data Structure.
Do we really need a new object to represent this? Can we tweak the existing Data Resource to cover the "yet unknown Data Structure" case also?
Not sure I understand why "unknown structure" would imply "unit data". The fact that dimensional data requires at least one dimension doesn't yet tell us the structure of such a dataset, only the type of structure. Also I think that unit data requires at least one (instance) variable - so if we apply the same reasoning, the type of structure would be implicitly clear for unit data also. To summarize: I don't think "unknown structure" provides information about the type of the dataset / its structure.
There is a problem here that needs to be solved. Although it is not explicit in the model, a Non Structured Data Set cannot be linked to a Provision Agreement as the Provision Agreement must have a Data Flow which in turn must have a Data Structure. The introduction of a Non Structured Data Set was to cater for those data sets which have no formal structure (i.e. not linked to a Unit or Dimensional structure). As it is modeled the only thing a Non Structured Data Set has is a link to a Data Provider. It is not even a Data Resource as there is no link to it from Data Resource and it has no Data Location. So, in reality it should not be on this diagram.
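To make the chain of mandatory links concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the GSIM model itself: the class fields and the `make_agreement` and `structure_for` helpers are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStructure:
    name: str


@dataclass(frozen=True)
class DataFlow:
    name: str
    structure: DataStructure  # mandatory in this sketch


@dataclass(frozen=True)
class ProvisionAgreement:
    provider: str
    flow: DataFlow  # mandatory in this sketch


def make_agreement(provider, flow_name, structure):
    """Build a Provision Agreement; refuse if there is no Data Structure."""
    if structure is None:
        raise ValueError("no Data Structure, so no Data Flow and no Provision Agreement")
    return ProvisionAgreement(provider, DataFlow(flow_name, structure))


def structure_for(agreement):
    """Navigate Provision Agreement -> Data Flow -> Data Structure."""
    return agreement.flow.structure
```

Because every step of the navigation is mandatory, a data set with no Data Structure has nothing to hang a Data Flow on, and therefore nothing to attach a Provision Agreement to.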
However, if it is deemed to be a Data Resource (which it arguably is not as it has no formal structure documented) then we need to decide what sort of animal it is. It could be linked to a Data Location and the link between Provision Agreement and Data Provider is constrained to Data Set.
I notice that we talk here about 'Non-Structured' Data Sets. Is there a distinction here from 'Unstructured' information referred to in the explanatory text for DataResource object on page 264?
I see two cases that might require separate treatment. Structured data for which the structure has not been established yet, and inherently unstructured data. The conversion to a data source is different. In case of structured data the data are analysed and probably the owners contacted and the process of producing the data is inspected to establish the data structure. The data itself remain untouched. In case of inherently unstructured data on top of the previous actions, some kind of reduction/aggregation/summary of the data itself has to be performed. This will produce the structure necessary for further use in the statistical process (e.g. counting words, counting the order of words).
The unstructured data become a resource after preprocessing.
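As a rough illustration of the kind of reduction mentioned above (counting words, counting the order of words), the following Python sketch turns free text into two small structured tables. Reading "order of words" as counts of adjacent word pairs is an assumption made for the example, as are the function names and the simple tokenisation rule.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(text):
    """Structured summary 1: how often each word occurs."""
    return Counter(tokenize(text))


def word_pair_counts(text):
    """Structured summary 2: how often each ordered pair of adjacent words occurs."""
    words = tokenize(text)
    return Counter(zip(words, words[1:]))
```

The original text is untouched; what enters the statistical process is the derived, structured summary.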
Whichever type of unstructured data exists the Data Resource cannot, as it is currently modeled, support unstructured data. I suspect it is the former of the two cases identified by Wim that is the intent of the "unstructured" reference in Data Resource on page 264. In order to support this (and the fact that there could be a data location) there needs to be a link from Data Resource to Non Structured Data Set and from Non Structured Data Set to Data Location. Originally the "groups" relationship from Data Resource was attached to Data Location but this was changed in Geneva to link to DataFlow which has a better intuitive feel for how one might navigate the structure (type of data and then the locations of the data rather than locations and then the type). Whichever way this is modeled one gets to the same information.
The right way to model this will need some careful thought because if we just put in the two new relationships as indicated above the way to the Data Location for unstructured data will be via the Non Structured Data Set whereas for Data Set it will be via Data Flow/Provision Agreement. If we move the "groups" relationship from Data Resource back to Data Location then the "navigation" to the Data Set and Non Structured Data Set will be the same, but this will imply that all Non Structured Data Sets have a Data Location.
Note that one reason for having a Non Structured Data Set is to link it to a Data Provider even though it may not be a Data Resource (e.g. raw data supplied as a result of a survey or even data derived from a web scraping robot).
2/7/13 meeting: Alberto Sanchez to review
ISSUE #6 CONCLUSIONS/IDEAS
We can say that any dataset has a certain structure. If we don’t know anything about its structure then it is not a dataset (at least not a useful dataset for GSIM). What is the use of a dataset in which we only have, eg: numbers, but don’t know what they mean at all?
I will follow three steps:
Besides our study of GSIM here at the IMF (See Gareth’s GSIM level 2 model), I have used some very useful comments from GSIM Issue 6 discussion group colleagues to support my ideas:
Alistair (March 14):
“Another option might be to call such an object something different to a DataSet. […] That way you can say that by definition - as an abstract class - DataSets have a known structure associated with them.”
“For an external data source you may only wish to invest in investigating certain "sets" of data thoroughly, and understanding their structure, only at the time you have a potential use for them. If you can't even document that you know such sets of data exist, however, you won't know to come back and investigate them more fully later.”
Alistair (March 26):
“I am keen not to lose the idea entirely that we can know a (potentially "tappable") resource exists without knowing everything about that resource. […] we're going to need better records of what is out there. The record of what is out there, however, might not always be documented to the level of detail of defined data flows based on defined data structures at first.”
Michaela (April 9):
“Do we really need a new object to represent this? Can we tweak the existing Data Resource to cover the "yet unknown Data Structure" case also?”
“I don't think "unknown structure" provides information about the type of the dataset / its structure.”
Chris (April 9):
“The introduction of a Non Structured Data Set was to cater for those data sets which have no formal structure (i.e. not linked to a Unit or Dimensional structure). As it is modeled the only thing a Non Structured Data Set has is a link to a Data Provider. It is not even a Data Resource as there is no link to it from Data Resource and it has no Data Location. So, in reality it should not be on this diagram.”
Wilhelmus (April 9):
“In case of inherently unstructured data on top of the previous actions, some kind of reduction/aggregation/summary of the data itself has to be performed. This will produce the structure necessary for further use in the statistical process”
“The unstructured data become a resource after preprocessing.”
“Note that one reason for having a Non Structured Data Set is to link it to a Data Provider even though it may not be a Data Resource”
Action Item: There is something there that needs to be modelled somehow, but how? We need to come up with examples. If we can't find these examples, then we should remove it for the moment (at least until a use case arises). If no use case by end of August, we remove.
Three possible use cases:
1) I am asked to do a meta-analysis of 10 year old data. I am handed a password protected floppy disk. Nobody has the password. The data is useless. I do not need to model it.
2) I identify a potentially interesting data source. I contact them and ask if they can send me data. At this stage I do not know the structure of the data. They say a) yes you can have it, but only if you accept it in our format or maybe b) yes you can have it, what format would you like us to send it in? In either case after a discussion with the data source I then know the (planned) data structure. Of course this may need to be adjusted when I actually get the data.
3) I am a researcher ploughing through old data. I find a paper document with a table containing interesting data for my research. I make a data structure and enter all the data by hand.
Just to put this topic in the context of the model, it is important that we understand what we mean by “structured” data set. In the model the (structured) Data Set has a mandatory link to a Data Structure (via Provision Agreement and Dataflow). The Data Structure is a collection of Represented Variables described in a logical structure specifying which variables are “Identifiers”, which are “Attributes” and which are “Measures”. There is no support for the specification of physical structures in GSIM. So, if there is such a thing as an Unstructured Data Set then it must be a collection of Data Points where either or both of the Represented Variables and the role each plays in the data set (Identifier, Attribute, Measure) is unknown.
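The definition above can be expressed as a small check. The Python below is a sketch under stated assumptions: the `Component` class, its field names and the two helper functions are hypothetical and only illustrate the idea that a data set counts as structured when every column has both a known Represented Variable and a known role.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Role(Enum):
    IDENTIFIER = "Identifier"
    ATTRIBUTE = "Attribute"
    MEASURE = "Measure"


@dataclass(frozen=True)
class Component:
    column: str
    represented_variable: Optional[str] = None  # None means "not known"
    role: Optional[Role] = None                 # None means "not known"


def unknown_columns(components):
    """List the columns whose variable or role is still unknown."""
    return [c.column for c in components
            if c.represented_variable is None or c.role is None]


def is_structured(components):
    """Structured means at least one column and no unknowns at all."""
    return len(components) > 0 and not unknown_columns(components)
```

For example, a column whose Represented Variable is known but whose role is not would still appear in `unknown_columns`, so the data set as a whole would not count as structured.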
The Unstructured Data Set in the model is not a Data Resource (it does not even have to be associated to a Data Provider) and so, even if it is a real object, there should be a constraint on the association between Data Provider and Provision Agreement as shown below.
I think Jenny’s examples are quite challenging. However, I still see they all have something in common: you need to figure out those pieces of information before you can do anything with them. I mean, until you somehow identify their potential use they don’t enter the statistical industrial process, simply cannot be modeled.
Your examples show something, though, that we can drill down as much as we want and create objects that will explain certain particular behaviors. The problem is that if we don’t set the limit somewhere we will end up with a lot of fussy objects, which will not help communicating GSIM.
Maybe the whole problem is with Metadata. There is no clear way to model it and this leads to confusion and to include objects like unstructured dataset. From Chris’s diagram, we see:
“Abstract placeholder for… such as metadata”
“Non structured dataset”
The model would be simpler if we just merged these three entities into 1 object: “structured information” (call it “metadata” if you want).
For instance, from your use case #1: If you are not interested in the data, then anything you do with it will not lead you to a structured version of that data, you may use the information drawn from that dataset (through different processes) to populate structured information in whatever place you need, that is, you are creating metadata, which is not yet fully modeled in GSIM.
I see two cases: 1) structured data, but you do not have the information on the structure. For example archived 1971 Census data, but you lost the structure. In order to use the data, you have to reconstruct the structure by prior analysis. 2) unstructured data. I see this as a provision for all kinds of future developments in the area of big data. The information you have is on how and when the data set was produced. Before you can use the data, a structure should be imposed (for instance by defining separators and reducing the information by counting the positions between the separators).
My feeling is that we do not need to create a new object for the first case. An explanatory note might do, saying that the structure should be known. If not, preprocessing is required. I am not so sure whether the same goes for the unstructured data. My guess is that the structuring of unstructured data could become a significant business process step. In that case the object would be a placeholder for future developments in this domain (GSIM should facilitate future developments). On the other hand, we might also argue that future developments will probably require such an object, but that it cannot be fully specified yet.
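For the second case above, a minimal Python sketch of imposing a structure by defining separators might look like this. The choice of `;` as the default separator, the function names and the rule that every record must have the same number of fields are all assumptions made for the example.

```python
def field_lengths(record, separator=";"):
    """Count the character positions between consecutive separators."""
    return [len(field) for field in record.split(separator)]


def impose_structure(records, separator=";"):
    """Split every record on the separator and require a constant field count."""
    rows = [record.split(separator) for record in records]
    counts = {len(row) for row in rows}
    if len(counts) > 1:
        raise ValueError(f"inconsistent number of fields: {sorted(counts)}")
    return rows
```

This only yields columns; deciding which variable each column represents, and in what role, is still a separate step.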
Discussion 13/8:
The group discussed the use cases provided. The view is still leaning towards dropping this object from the model. Unstructured datasets will always be a process input - they have to go through a process of being structured before they can be used further by the statistical process.
It is recognised that dropping this object would be a missed opportunity for modernisation.
Action: Remove and make small text explanation that we may need this in the future.