Issue raised by GSIM / DDI mapping work:
What is a non structured data set? Why do we need it?
14 Mar, 2013
A nonstructured dataset is a Data Set whose structure is not described in a Data Structure. Basically UnitDataSet and DimensionalDataSet require that there is a known structure (Data Structure). You couldn't use those concrete classes in GSIM to model the fact you know a dataset exists but you don't (yet) know its structure. In theory you might know whether it is a Unit data set or a Dimensional data set but you can't describe it formally as one or the other because you don't know its precise structure. (Note that to really define the structure you need to know not just what columns/fields exist but also the variable being represented within each column/field.)
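To make the distinction concrete, here is a minimal Python sketch. The class names loosely mirror the GSIM objects mentioned above, but the attributes (and the idea of reducing a Data Structure to a list of variable names) are simplifying assumptions for illustration, not taken from the GSIM specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataStructure:
    # Assumption: a structure is reduced to the variable behind each column/field.
    variables: List[str]


@dataclass
class DataSet:
    # Something we know exists; says nothing yet about its structure.
    name: str


@dataclass
class UnitDataSet(DataSet):
    # Concrete class: cannot be created without a known structure.
    structure: DataStructure


@dataclass
class DimensionalDataSet(DataSet):
    # Concrete class: cannot be created without a known structure.
    structure: DataStructure


@dataclass
class NonStructuredDataSet(DataSet):
    # Known to exist, but no Data Structure has been described (yet).
    pass


known = UnitDataSet("household_survey", DataStructure(["household_id", "income"]))
unknown = NonStructuredDataSet("external_register_extract")
```

In this sketch `UnitDataSet("x")` on its own raises a `TypeError`, which is the "known structure is required" constraint; `NonStructuredDataSet` is the only way to record the set without one.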
I agree the idea of a "nonstructured dataset" sounds odd.
One alternative would be to say that UnitDataSets and DimensionalDataSets can exist without having a (known) DataStructure. I'd be wary of that option: it means knowing there is a UnitDataSet or DimensionalDataSet does not imply it has a (knowable) DataStructure, which makes for a very "loose" model.
Another option might be to call such an object something different to a DataSet. I am not sure what the alternative name would be. That way you can say that by definition - as an abstract class - DataSets have a known structure associated with them.
I don't like the idea of simply saying such an object can't be described. You may wish to be able to say that a Data Resource exists with, eg, UnitDataSets associated with it (with known Structures), DimensionalDataSets associated with it (with known Structures) and some other "sets" of data for which the structure is not (yet) known. If there is no way to talk about other "sets" of data then you can't catalogue everything you know about the (eg external) data resource unless you know absolutely everything about every DataSet. For an external data source you may only wish to invest in investigating certain "sets" of data thoroughly, and understanding their structure, only at the time you have a potential use for them. If you can't even document that you know such sets of data exist, however, you won't know to come back and investigate them more fully later.
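The cataloguing use case could be sketched roughly as follows. Everything here is hypothetical: `CataloguedSet`, the `kind` labels and `to_investigate` are illustrative names, not GSIM objects.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CataloguedSet:
    name: str
    kind: str  # hypothetical labels: "unit", "dimensional" or "unknown"
    variables: Optional[List[str]] = None  # None: structure not (yet) known


@dataclass
class DataResource:
    name: str
    sets: List[CataloguedSet] = field(default_factory=list)

    def to_investigate(self) -> List[str]:
        # Sets we know exist but have not charted; candidates for later work.
        return [s.name for s in self.sets if s.variables is None]


resource = DataResource(
    "external_tax_authority",
    [
        CataloguedSet("tax_returns", "unit", ["person_id", "taxable_income"]),
        CataloguedSet("regional_totals", "dimensional", ["region", "year", "revenue"]),
        CataloguedSet("audit_files", "unknown"),
    ],
)
```

Here `resource.to_investigate()` lists only `audit_files`: the set is on record, so it can be revisited when a use for it arises.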
15 Mar, 2013
Whilst I remain to be convinced either way that we really need this object, perhaps a better term could be "DataPool". This would sort of mirror the approach in the Nordic Metamodel (used for some implementations of PC-Axis) where value sets are drawn from value pools, based on certain criteria. A DataPool could then be defined as a DataResource with an unknown structure. For me, this would logically refer just to unit data, as dimensional data must have at least one dimension, which means that structure is at least implicit.
Another option could be simply to say that a DataResource may or may not have a known structure. The object we are talking about would then be implicit (the UML would need tweaking!)
In both cases, I agree with Al that DataSets could be defined as having a known structure.
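As a rough sketch of the second option (a DataResource that may or may not have a known structure), assuming a simplified `DataStructure` and a hypothetical `is_data_pool` helper named after the "DataPool" term suggested above:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataStructure:
    # Assumption: a structure is reduced to a list of variable names.
    variables: List[str]


@dataclass
class DataResource:
    name: str
    # None records that the structure has not been documented (yet).
    structure: Optional[DataStructure] = None

    @property
    def is_data_pool(self) -> bool:
        # Hypothetical helper: a "DataPool" is a resource with unknown structure.
        return self.structure is None


pool = DataResource("external_admin_register")
charted = DataResource(
    "trade_statistics", DataStructure(["partner", "commodity", "value"])
)
```

With this shape no separate object is needed; the "pool" case stays implicit in the optional attribute, which is the UML tweak mentioned above.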
26 Mar, 2013
I am keen not to lose the idea entirely that we can know a (potentially "tappable") resource exists without knowing everything about that resource. If NSIs are going to make more use of existing data resources outside the NSI if/when a new need arises, we're going to need better records of what is out there. The record of what is out there, however, might not always be documented to the level of detail of defined data flows based on defined data structures at first.
After all, when it actually comes time to "tap" that resource we may only want to use a small number of the "fields" from the data which are available. Realistically, I think NSIs will invest effort in defining detailed structures based on what they want to use from the resource, not necessarily "chart in complete detail" the totality of the resource when they first learn it exists.
For me there are strong parallels with natural resource economics. We may not assess every resource in detail as soon as we discover it - we may judge that for the time being the resource is not going to be of economic interest - but we still want to record the resource exists and what we know about it. By analogy, it is when we've established the resource is likely to be an economic asset (eg because demand for that commodity has increased) that we will assess it to the nth degree of detail (and, as a producer of official statistics, formalise a data structure definition for the portion of the overall resource that is of interest to us).
Maybe this does lead to "scope creep" for Data Resource and Information Resource. More broadly than statistics, however, what is being described are "resources" that are not necessarily yet under management as "assets", but which may make that transition in future.
09 Apr, 2013
I agree that we may want to represent some data resource whose structure we don't yet know, at least not in detail, as an information object. Conceptually, it would still be a data resource. And its structure exists, we just haven't yet found out (probably not even tried to find out) what that structure looks like. This is not the same as unstructured data (typically texts) that can be structured for example through tagging, text mining, etc. Therefore I don't like the term Unstructured/Non-structured Dataset for the Data Resource with unknown Data Structure.
Do we really need a new object to represent this? Can we tweak the existing Data Resource to cover the "yet unknown Data Structure" case also?
Not sure I understand why "unknown structure" would imply "unit data". The fact that dimensional data requires at least one dimension doesn't yet tell us the structure of such a dataset, only the type of structure. Also I think that unit data requires at least one (instance) variable - so if we apply the same reasoning, the type of structure would be implicitly clear for unit data also. To summarize: I don't think "unknown structure" provides information about the type of the dataset / its structure.
There is a problem here that needs to be solved. Although it is not explicit in the model, a Non Structured Data Set cannot be linked to a Provision Agreement, as the Provision Agreement must have a Data Flow, which in turn must have a Data Structure. The introduction of a Non Structured Data Set was to cater for those data sets which have no formal structure (i.e. not linked to a Unit or Dimensional structure). As it is modeled, the only thing a Non Structured Data Set has is a link to a Data Provider. It is not even a Data Resource, as there is no link to it from Data Resource and it has no Data Location. So, in reality it should not be on this diagram.
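The dependency chain described above can be sketched as follows. The classes are a simplified, assumed reading of the model (only the mandatory links mentioned in this comment are shown), and `make_agreement` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataStructure:
    name: str


@dataclass(frozen=True)
class DataFlow:
    name: str
    structure: DataStructure  # mandatory: a Data Flow must have a Data Structure


@dataclass(frozen=True)
class DataProvider:
    name: str


@dataclass(frozen=True)
class ProvisionAgreement:
    provider: DataProvider
    flow: DataFlow  # mandatory: a Provision Agreement must have a Data Flow


@dataclass(frozen=True)
class NonStructuredDataSet:
    name: str
    provider: DataProvider  # its only link as currently modeled


def make_agreement(
    provider: DataProvider, flow_name: str, structure: Optional[DataStructure]
) -> ProvisionAgreement:
    # Without a Data Structure there is no Data Flow, hence no agreement.
    if structure is None:
        raise ValueError("a Data Flow requires a Data Structure")
    return ProvisionAgreement(provider, DataFlow(flow_name, structure))


provider = DataProvider("tax_authority")
agreement = make_agreement(provider, "tax_returns", DataStructure("TaxReturnDSD"))
raw = NonStructuredDataSet("web_scrape_dump", provider)
```

Calling `make_agreement(provider, "web_scrape_dump", None)` raises `ValueError`, which is the gap being pointed out: the non-structured set can reach a Data Provider but nothing else.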
However, if it is deemed to be a Data Resource (which it arguably is not as it has no formal structure documented) then we need to decide what sort of animal it is. It could be linked to a Data Location and the link between Provision Agreement and Data Provider is constrained to Data Set.
I notice that we talk here about 'Non-Structured' Data Sets. Is there a distinction here from 'Unstructured' information referred to in the explanatory text for DataResource object on page 264?
I see two cases that might require separate treatment: structured data for which the structure has not been established yet, and inherently unstructured data. The conversion to a data source is different. In the case of structured data, the data are analysed, and probably the owners are contacted and the process of producing the data is inspected, to establish the data structure. The data themselves remain untouched. In the case of inherently unstructured data, on top of the previous actions, some kind of reduction/aggregation/summary of the data itself has to be performed. This will produce the structure necessary for further use in the statistical process (e.g. counting words, counting the order of words).
The unstructured data become a resource after preprocessing.
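As a small illustration of that preprocessing step, the sketch below turns free text into rows with two variables (word, count). The tokenisation rule and the function name `word_counts` are assumptions made for the example only.

```python
import re
from collections import Counter
from typing import List, Tuple


def word_counts(text: str) -> List[Tuple[str, int]]:
    # Assumed tokenisation: lower-case runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Each (word, count) pair is one row of a structured data set.
    return sorted(Counter(words).items())


rows = word_counts("Data about data")
```

The input text has no Data Structure, but `rows` does: every record has the same two fields, so it could be described formally.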
Whichever type of unstructured data exists, the Data Resource cannot, as it is currently modeled, support unstructured data. I suspect it is the former of the two cases identified by Wim that is the intent of the "unstructured" reference in Data Resource on page 264. In order to support this (and the fact that there could be a data location) there needs to be a link from Data Resource to Non Structured Data Set and from Non Structured Data Set to Data Location. Originally the "groups" relationship from Data Resource was attached to Data Location but this was changed in Geneva to link to DataFlow, which has a better intuitive feel for how one might navigate the structure (type of data and then the locations of the data rather than locations and then the type). Whichever way this is modeled one gets to the same information.
The right way to model this will need some careful thought because if we just put in the two new relationships as indicated above the way to the Data Location for unstructured data will be via the Non Structured Data Set whereas for Data Set it will be via Data Flow/Provision Agreement. If we move the "groups" relationship from Data Resource back to Data Location then the "navigation" to the Data Set and Non Structured Data Set will be the same, but this will imply that all Non Structured Data Sets have a Data Location.
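The two navigation options can be compared with a toy graph. The edges below are only one illustrative reading of this comment, not the actual GSIM UML, and `all_paths` is a hypothetical helper.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]


def all_paths(graph: Graph, start: str, goal: str) -> List[List[str]]:
    # Every path from start to goal; assumes the graph has no cycles.
    if start == goal:
        return [[start]]
    found = []
    for nxt in graph.get(start, []):
        for tail in all_paths(graph, nxt, goal):
            found.append([start] + tail)
    return found


# Option 1 (assumed edges): just add the two new relationships.
two_new_links: Graph = {
    "DataResource": ["DataFlow", "NonStructuredDataSet"],
    "DataFlow": ["ProvisionAgreement"],
    "ProvisionAgreement": ["DataLocation"],
    "NonStructuredDataSet": ["DataLocation"],
}

# Option 2 (assumed edges): move "groups" back to Data Location.
groups_on_location: Graph = {
    "DataResource": ["DataLocation"],
    "DataLocation": ["DataSet", "NonStructuredDataSet"],
}

routes_to_location = all_paths(two_new_links, "DataResource", "DataLocation")
route_to_data_set = all_paths(groups_on_location, "DataResource", "DataSet")
route_to_non_structured = all_paths(
    groups_on_location, "DataResource", "NonStructuredDataSet"
)
```

Under these assumptions option 1 gives two differently shaped routes to the Data Location, while option 2 gives one uniform route to either kind of set, at the cost that a Non Structured Data Set is only reachable through a Data Location.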
Note that one reason for having a Non Structured Data Set is to link it to a Data Provider even though it may not be a Data Resource (e.g. raw data supplied as a result of a survey or even data derived from a web scraping robot).