
 

Link to Simple Data Description Team page

Meeting 22 September 2014

Attendees: Steve McEachern, Dan Gillman, Larry Hoyle, Jay Greenfield, Ornulf Risnes, Justin Lynch

Meeting notes:

Meeting commenced 10.10pm (AEST)

Dan provided an overview of the updated variable cascade model distributed by Jay on Sept. 20 - see A Variable Cascade 20140922.docx. He highlighted the key approach now adopted in the model, which focuses on sentinel values, as well as the key remaining issue:

  1. Whether the sentinel values should be managed as a separate domain, similar to the substantive domain, OR

  2. whether they should be available to be selected from a broad (unmanaged) list - possibly along the lines of the category set (i.e. a “master sentinel category set”).

Approach 1 produces a (potential) exponential growth in value domains if we manage each domain in turn - consider the example in use case three in Jay’s document.

Approach 2 uses a simpler mechanism: (basically) a single “one big list”, with a map linking the sentinel values to the codes used in the instance variable. This is simpler, but does not really allow for management of the sentinel code list - which may be important to us.
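As a rough illustration of Approach 2 only (the names below are hypothetical, not DDI4 terms): a single unmanaged list of sentinel categories, with each instance variable carrying its own map from stored codes to those categories.

```python
# Illustrative sketch of Approach 2 (hypothetical names, not the DDI4 model):
# one unmanaged "big list" of sentinel categories, plus a per-instance-variable
# map from the codes used in the data to those categories.

MASTER_SENTINEL_LIST = {"dont_know", "refused", "not_applicable", "system_missing"}

# The map for one hypothetical instance variable.
income_sentinel_map = {
    99: "dont_know",    # code 99 in this file means "don't know"
    999: "refused",     # code 999 means "refused"
}

def sentinel_category(code):
    """Return the sentinel category a stored code maps to, or None if substantive."""
    category = income_sentinel_map.get(code)
    assert category is None or category in MASTER_SENTINEL_LIST
    return category

print(sentinel_category(99))   # -> 'dont_know'
print(sentinel_category(42))   # -> None (a substantive value)
```

The list itself stays unmanaged here, which is exactly the limitation noted above.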

Question raised - Is there value in managing the sentinel value domain in the same way that we would manage the substantive value domain (in the represented variable)?

Point raised by Larry - we may wish to manage common sentinel value domains (e.g. SAS missings, SPSS missings, etc.). In particular, this might be necessitated if different studies or software use different data types (e.g. SAS vs SPSS missing values, date formats). This tends towards Approach 1.


Next steps

Resolution: team members are to explore the cascade paper further in light of the discussion today and to (hopefully) identify their preferred option, to be brought to the next meeting.

Additional activities:

  • Dan and Jay to explore the problem of reconciling data types within the proposed cascade

  • Steve to review the Physical (PHDD) and Logical (Variable Cascade) models to assess the points of intersection of the two sides and highlight any outstanding issues for Dagstuhl, then review the overview status of the Simple Data Description package/library/view to determine its readiness for discussion at the Dagstuhl sprint.

Next Meeting: Monday October 6th, 2014, 1400 Central European Time

 

 

 

Meeting 8 September

Attendees: Steve McEachern, Dan Gillman, Larry Hoyle, Jay Greenfield

Meeting notes:

To open, Steve reviewed the activity from the last meeting

The meeting primarily consisted of discussion of the "straw man" conceptual/represented/instance variable model developed by Jay in collaboration with Dan. A visual representation of the model is shown below:

Jay's notes on the model are as follows:

Note that a “Conceptual Variable” here maps to the GSIM “Variable”. Also, note that an Instance Variable inherits its value domain from the Represented Variable that it takes its meaning from.

Dan Gillman has an example:

The conceptual variable marital status might be measured with two different sets of categories (in separate studies) as follows:

    1. Single, Married
    2. Single, Married, Widowed, Divorced

 

These 2 categorizations result in 2 represented variables [mstat_simple, mstat_ex] in my mind. I was saying some people (outside the DDI community) want to say that even the conceptual variable has to change in this case. I think that makes little sense, and I hope everyone in our group agrees the conceptual variable does not change in situations such as this.

Continuing along these lines, represented variables like mstat_simple and mstat_ex may get sentinel values “along the processing cascade”. In this case each new value set does NOT necessitate a new represented variable. Instead there may be multiple instance variables associated with one represented variable. In a process model we would reference one or more of these instance variables at different points along the processing cascade.
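A minimal sketch of the cascade described above, using Dan's marital status example (the class and attribute names are illustrative only, not the DDI4 model):

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch of the conceptual/represented/instance cascade.

@dataclass
class ConceptualVariable:
    concept: str                        # e.g. "marital status"

@dataclass
class RepresentedVariable:
    conceptual: ConceptualVariable
    categories: List[str]               # the substantive value domain

@dataclass
class InstanceVariable:
    represented: RepresentedVariable
    name: str                           # the variable as used in a particular file
    sentinel_codes: List[str] = field(default_factory=list)

marital_status = ConceptualVariable("marital status")

# Two category sets give two represented variables, but one conceptual variable.
mstat_simple = RepresentedVariable(marital_status, ["Single", "Married"])
mstat_ex = RepresentedVariable(marital_status,
                               ["Single", "Married", "Widowed", "Divorced"])

# Sentinel values picked up along the processing cascade create new instance
# variables, not new represented variables.
iv_collected = InstanceVariable(mstat_ex, "mstat_ex_collected",
                                sentinel_codes=["refused"])
iv_edited = InstanceVariable(mstat_ex, "mstat_ex_edited",
                             sentinel_codes=["refused", "imputed"])
```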

A Master Sentinel List (MSL) facilitates this arrangement. Again, quoting from Dan:

MSL should be structured so that categories are separated from designations (codes or other). The links are between the designations and the instance variables. It might go something like this:

IV -> MSL-codes <- MSL-categories, where the -> symbol indicates a one-to-many relationship, in the direction of the arrow. 

 

Thus, the MSL-codes structure resolves a many-to-many relationship between IVs and SVs, and the SVs are categories, not the designations. An IV uses possibly many SVs, and each SV may be used by possibly many IVs. 
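Read as a data structure, MSL-codes behaves like a link table: each entry pairs one instance variable with one sentinel category (SV) and records the designation that variable uses for it. A small sketch under that reading (all identifiers hypothetical):

```python
# MSL-categories: the sentinel categories themselves (the SVs).
msl_categories = {"DK": "don't know", "REF": "refused", "NA": "not applicable"}

# MSL-codes: resolves the many-to-many between IVs and SVs.
# Each entry is (instance variable, category, designation used by that IV).
msl_codes = [
    ("income_iv", "DK",  "99"),
    ("income_iv", "REF", "999"),
    ("mstat_iv",  "DK",  "8"),
    ("mstat_iv",  "REF", "9"),
]

def sentinels_for(iv):
    """All sentinel categories an instance variable uses, with its designations."""
    return {cat: code for var, cat, code in msl_codes if var == iv}

print(sentinels_for("income_iv"))   # {'DK': '99', 'REF': '999'}
```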

There probably needs to be more discussion around the Conceptual Variable Unit Type and the Instance Variable Population. It would be neat and I would like to argue that the difference between Unit Type and Population is a function of sentinel values.

 

Discussion of the Variable model:

The discussion of the model was largely supportive of the model as presented, with agreement among the attendees regarding the basic conceptual/represented/instance variable distinction.

There was some discussion over the role of the "Master Sentinel Category Set" and Extension Code List, particularly with regard to respondent-driven responses such as "Refused" or "Don't know". Additional use cases are to be considered to explore this set of objects in more fine-grained detail - Dan and Jay will consider this further before the next meeting.

There was general agreement on the distinction between population and unit type - where unit type is the general unit being observed, and the population is the set of those units within a given temporal and spatial context - e.g. voters is the Unit Type, while voters enrolled to vote in Australia as at 1 January 2014 would be the Population.
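A small sketch of that distinction (the attribute names are illustrative only): a Population is a Unit Type bounded in time and space.

```python
from datetime import date

# Illustrative only: a population qualifies a unit type with time and space.
unit_type = "voters"

population = {
    "unit_type": unit_type,
    "spatial_scope": "enrolled to vote in Australia",
    "reference_date": date(2014, 1, 1),
}
```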

There was some short discussion of the two data point and datum classes in the model. Dan identified that the two needed to be reversed - "Datum" is the appropriate class to link to the Instance Variable. "Data Point" was held to be a non-specific class that likely connects this package to others (potentially a Cell in a table or in a Physical Data Set). This class should be reconsidered at a later point when this work is integrated into the broader DDI4 model.
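A very rough sketch of that reading (the names are hypothetical and subject to the later DDI4 integration): the Datum carries the value and ties to the Instance Variable, while the Data Point is only the location a Datum occupies.

```python
from dataclasses import dataclass

# Rough, tentative reading of the discussion above; names are hypothetical.

@dataclass
class Datum:
    instance_variable: str   # the Instance Variable the value belongs to
    value: str               # the recorded value (substantive or sentinel)

@dataclass
class DataPoint:
    datum: Datum             # the datum occupying this location
    location: str            # e.g. a cell in a table or a physical data set

d = Datum("mstat_ex", "Married")
cell = DataPoint(d, "row 12, column 3")
```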

Next steps

At the end of the meeting, there were two further actions:

  • Jay and Dan will complete further work to finalise the "Master Sentinel Category Set" modelling
  • Steve will review the overview status of the Simple Data Description package/library/view to determine its readiness for discussion at the Dagstuhl sprint.

Next Meeting: Monday September 22nd, 1400 Central European Time

 

Meeting 25 August

Attendees: Steve McEachern, Dan Gillman, Larry Hoyle, Jay Greenfield, Ornulf Risnes

Action items from 11 August:

  • Everyone to review PHDD
  • Everyone to review Dan's document
  • Achim to provide some information on the issues around the complexity of data description in DDI 3

 

Meeting notes:

To open, Steve reviewed the activity from the last meeting

Continuation of discussion of the distinction between the represented variable and instance variable.

The following is a summary of the various lines of discussion that occurred.

  • Where do we draw the line between represented and instance? E.g. Larry’s case of sentinel values.
  • Do we need to split the GSIM “Instance Variable” into a Logical Instance Variable and a Physical Instance Variable?
  • What do we want to view in the Instance Variable? Physical - Quasi-physical - Logical

Examples/use cases for consideration:

  • How do we manage missing values?
  • What do we do when data is managed in different systems – e.g. 32-bit vs 64-bit systems – which may not allow certain data formats (e.g. double format)?

In the data management example – if the data type changes, both the instance and represented variables change. We may have a more complex case of conceptual/represented/instance than GSIM accounts for – characteristics may be changing at more than one level here, which makes reuse much more challenging.

What are alternative approaches here?

  • Ornulf pointed out that it may be possible to manage the variable by changing the represented and instance level, but maintaining the conceptual level. 
  • Dan’s concern was that the tying of categories to the codes representing them can be done poorly (e.g. he noted this was the case in 11179). 
  • Jay noted that some of the harmonisation of longitudinal content can be achieved by thinking of some categories as concepts (e.g. certain missing categories are the same over time), and then merging/harmonising on those concepts over time. This may be a reflection of the represented variable.

Discussion centred on the point that the core of the problem is ensuring that reuse is of concepts (e.g. conceptual variables or categories) rather than of codes; i.e. we need semantic interoperability at the conceptual level, rather than necessarily at the representation level. In other words, we need to clarify the relationship between the category and the code.
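A small sketch of that reuse idea (the code maps are invented for illustration): two studies with different codes can still be harmonised by mapping each code to a shared category concept and comparing at that level.

```python
# Illustrative: interoperate on shared category concepts, not on codes.

# Each study maps its own codes to the common category concepts.
study_a_codes = {1: "single", 2: "married"}
study_b_codes = {"S": "single", "M": "married", "W": "widowed", "D": "divorced"}

def to_concept(value, code_map):
    """Translate a study-specific code to the shared category concept."""
    return code_map[value]

# The two records agree at the conceptual level even though the codes differ.
assert to_concept(2, study_a_codes) == to_concept("M", study_b_codes)
```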

Representing cells or a "datum"

Continuing the discussion: Larry asked whether we may have a problem because we don’t have the notion of the representation of a cell (as opposed to a variable). There may be a need for representing an individual data point or datum (in the GSIM sense??) within a data file.

Next steps

At the end of the meeting, we noted that we now have two points of confusion to clarify:

  • Instance variable clarification
  • The need for datum as a class

The concern raised by Steve was that we have two important discussions, but need to find a way to “get out of the weeds”. To this end, two actions were proposed:

  • Jay will follow up with a “straw man” proposal around the instance/represented/conceptual framework to frame our next discussion. 
  • Larry will (time permitting) develop a similar idea for the “datum”.

Dan suggested increasing the frequency of meetings. For the next two weeks the other DDI meetings and US Labor Day make this difficult, but we will look at this possibility at our next meeting. We will also aim to continue to discuss out of session via email, with a summary to be posted to the wiki ahead of the next teleconference.

Next Meeting: Monday September 8th, 1400 Central European Time

 

Meeting 11 August

Attendees: Larry, Steve, Achim, Dan, Jay

Action items from 30 July:

  • Everyone to review PHDD
  • Everyone to review Dan's document
  • Achim to provide some information on the issues around the complexity of data description in DDI 3
  • Thérèse will provide a box and arrow diagram for Dan's work
  • Steve will take an initial look at how the two fit together

Discussion of PHDD and SCOPE (Dan Gillman) documents

Larry and Achim gave an overview of the PHDD framework, outlining the original intent and the basic elements of the model, which focus on physical data descriptions. Dan then followed with a similar overview of the SCOPE draft model, which focuses on the logical data description. Dan noted that the focus of the SCOPE group resulted from coordination among U.S. statistical agencies - the “Statistical Community of Practice and Engagement” (SCOPE) group - which was intended to coordinate on metadata for agency activities, including the data.gov initiatives. It was noted in particular that data dictionaries were undefined, and that this might form a useful starting point for the SCOPE group - hence the proposed model.

What is in scope?

The group then continued on to consider the question of what should be in scope for a data description. In particular, we wanted to consider whether the focus should be on the Physical or the Logical - or potentially both. Steve provided a short overview of how he saw the intersection of the two models - see "Notes in advance of team meeting" below - noting particularly that the point of interaction appears to be at the variable: the physical representation of the variable within the data file, and the logical characteristics of that variable within the data description. I.e. Logical = what it means; Physical = how it is laid out.
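One way to picture that intersection (the attribute names are illustrative, not PHDD or SCOPE terms): a logical description records what the variable means, a physical description records how its column is laid out, and the two meet at the variable.

```python
from dataclasses import dataclass
from typing import List

# Illustrative: logical and physical descriptions intersecting at the variable.

@dataclass
class LogicalVariable:              # "what it means"
    name: str
    label: str
    value_domain: List[str]

@dataclass
class PhysicalColumn:               # "how it is laid out"
    variable_name: str              # the point of intersection
    position: int
    data_type: str
    width: int

mstat_logical = LogicalVariable("mstat", "Marital status",
                                ["Single", "Married", "Widowed", "Divorced"])
mstat_physical = PhysicalColumn("mstat", position=4, data_type="integer", width=1)
```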

There was general acceptance that the group should continue to consider both physical and logical at this time - although the two may become separate packages/views at some appropriate point later in the Moving Forward process.

Which variable do we mean?

Given that the intersection of the physical and logical was seen to be the variable, there was then extended discussion regarding the characterisation of "variable" within the model. (This is also something that has been discussed without resolution in the Simple Instrument and Conceptual teams).

The focus was particularly on the logical variable representation:

  • Jay asked about which variable are we talking about: Represented? Instance? 
  • Jay felt that the emphasis should be on the Logical at the intensional level (with an s) 
  • Logical level might have to have two parts to it (represented and instance) 
  • Achim asked about where within the description we might represent the variable name – for example in different physical representations for the same study
  • A third level on the logical side is instance variable 
  • Other considerations were characteristics such as Unit type, sentinel values (name), and how the population may differ from the unit type (in time and space)

It was noted that there is some consideration needed of the equivalence within this discussion to the GSIM framework - which includes an instance variable.

Several use case examples were discussed.

Jay's use case: 

  • Recode of age collapsing values – the representation changes 
  • New represented variable (new value domain) and instance variable 
  • It was noted that the idea behind the instance variable is a "variable in use" – i.e. a variable in a file somewhere 
  • Question: when data is converted from one format to another, is it the same instance variable? 
  • Dan's position was that a copy of the data should be the same instance variable, including when the format changes. 
  • Jay argued for further specialisation of the instance variable to make it useful (e.g. by adding attributes to GSIM instance variable)

Continuation of instance variable discussion: Dan argued that the physical side should be purely a map to the logical

Larry's use case

  • Copy from SPSS to SAS – the sentinel values must change (see the sketch after this list)
  • e.g. keep .d = 99 -> don’t know, .r = 999 -> refused
  • SAS/Stata = SPSS
  • Missing at the represented level (categories) vs missing at the instance level with different codes
  • Managed missing representation at the instance level
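A rough sketch of the kind of remapping this use case implies (the 99/.d and 999/.r pairs come from the bullets above; the rest is illustrative): the sentinel categories stay stable while the designations change between packages.

```python
# Illustrative sketch: remapping sentinel designations between packages.
# The categories ("don't know", "refused") are stable; the codes are not.

spss_to_category = {99: "don't know", 999: "refused"}    # SPSS user-missing codes
category_to_sas = {"don't know": ".d", "refused": ".r"}  # SAS special missings

def spss_code_to_sas(code):
    """Map an SPSS user-missing code to the corresponding SAS special missing."""
    category = spss_to_category.get(code)
    return category_to_sas.get(category, code)  # substantive values pass through

print(spss_code_to_sas(99))    # -> '.d'
print(spss_code_to_sas(999))   # -> '.r'
print(spss_code_to_sas(1))     # -> 1 (substantive value unchanged)
```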

Other properties of variables:

  • Data type – physical (realized) or logical (envisioned)
  • Logical integer – physical number of bytes

Use case: Currency

  • real vs real with two digits of precision 
  • Truncating to cents gives different results than rounding reals (see the sketch after this list).
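A quick illustration of the point (the values are invented): truncating an amount to cents and rounding it to cents can disagree.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

amount = Decimal("19.996")

truncated = amount.quantize(Decimal("0.01"), rounding=ROUND_DOWN)    # 19.99
rounded = amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)   # 20.00

print(truncated, rounded)   # the two "currency" representations differ
```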

Given the number of examples and the increasing complexity of the discussion on instance variables, it was felt that some further articulation of a proposed approach would need to take place between the meetings. Dan Gillman agreed to provide a first cut of a possible bridge between the physical and logical. The group would then reconvene on August 25th to further the discussion.


Next meeting

The next regular meeting takes place on August 25th at 2pm CET.


(Steve McEachern: Notes in advance of team meeting 11/8/2014)

The following aims to represent the relationship between several of the core elements across the DDI4 packages/views.

The final two columns are the likely relationships that exist between the PHDD and SCOPE (and their equivalents in other packages/views)

"Unit"ConceptualQuestionnaireDataDictionary(Logical)DataFile (Physical)
Basic unitConceptQuestion(Capture??)Variable (SCOPE)Column (PHDD)
AggregateConceptSchemeQuestionnaire(Instrument)LogicalDataset (DISCO)Table (PHDD)
Value domainConceptScheme(which??)ValueDomainRepresentedValueDomain (source??)Not Applicable??

 

Package Relationships

Logical description picture (derived from Dan's doc)

First attempt at creating a box and arrow diagram from Dan's document. I have made some things properties where it seemed appropriate. I had Alistair check to make sure it was not crazy. Feel free to modify it as you please.


Meeting 30 July

Attendees: Larry, Steve, Achim, Dan, Thérèse

The group needed to nominate a new team leader. Steve agreed to do this.

After a break in meetings, the group needed to remind themselves of what was being achieved. The team is creating a view called Simple Data Description. A view is the subset of information that is important to a use case. Thérèse created the view during the meeting and added a random object to make the view appear (this object should be removed). See: http://lion.ddialliance.org/view/simpledatadescriptio

We expect that we will need to add some objects to the library, as not everything that is important for our view already exists. The new objects should be added to the package called New objects for Simple Data Description (http://lion.ddialliance.org/package/newobjectsforsimpledatadescription). Note: there are objects already existing in this package, presumably from previous work of this group. These objects should be reviewed!

The use case for Simple Data Description says:

Purpose: To develop a robust model that can describe all aspects of a simple, rectangular data file in our domain.
Description of view: The model must include bridges from the physical representation of a rectangular data file to high-level conceptual objects in the model.

There was some discussion about this. Are we only talking about physical (the layout of data in a file)? Is PHDD in scope? It was agreed that physical should be distinct from logical. The logical is often reused. PHDD has some links to high-level conceptual objects - a Column in PHDD corresponds to a Variable, rows to Data Records, and a Table to a data file.

How do we know where to draw the line for simple data description? The simple group should create something that caters for the simplest use cases. The complex team that follows will extend this. That way, we have something that everyone can understand and use quickly. It is important to have something for the simple use case. A criticism of DDI 3 is that it was too complex to use for those who just have a rectangular data file. The logical description in DDI 3 was just too complex to easily understand.

The group does not need to start from scratch. There is PHDD (http://www.ddialliance.org/Specification/RDF/PHDD), DDI 3... Dan told us about a specification for data dictionaries that he has recently created with other US statistical agencies. This specification gives less than 20 objects that describe a data file at a basic logical level. See: SCOPE - Metadata Element Set for Describing Variables - Updated.docx

The simple data description should include physical and logical. The group should use PHDD as a start for the physical and the work by Dan as a start for the logical. This would give someone a schema that would be fairly complete. We should then also look at DDI 3.

Action items:

  • Everyone to review PHDD
  • Everyone to review Dan's document
  • Achim to provide some information on the issues around the complexity of data description in DDI 3
  • Thérèse will provide a box and arrow diagram for Dan's work
  • Steve will take an initial look at how the two fit together

Next meeting

The next meeting takes place in the week starting 11 August. A poll will be circulated to find the best meeting time.

March 17 meeting

2014-03-17 Meeting Minutes

Time:

15:00 CET


 

Meeting URL:

https://www3.gotomeeting.com/join/685990342 


 

Agenda:

1) Status update. Where are we now with SimpleDataDescription? (ØR)

 

2) Clarify relationship between domain experts and modeler. Define role responsibilities, desired workflow in group (ØR, AW?)

 

Domain expert adds object descriptions and relationships

Modeler puts them into the overall model

Then iteration


What is the status of round trip?

Drupal to xmi to EA? Yes.

Is there machine-actionable feedback into Drupal? No. It is possible, but some work is required. It is not yet clear whether there are resources for this task. Furthermore, there are different positions on whether the roundtrip makes sense.

 

3) Identified issues with the current version (ØR/all)

a) Model is sparse on properties for InstanceVariable, RepresentedVariable, ConceptualVariable. Out of scope for this group?

Comments: These objects currently only exist in the SimpleDataDescription package. Discussion about GSIM/DDI 3.2 and who is responsible for the “core variable objects”. 

b) Do we need DataSerialisation (the physical counterpart of DataDescription)? DataDescription already relates to InstanceVariable, which relates to Field (column) in the RectangularDataFile. Because of this, a path exists from the Fields in the RectangularDataFile via InstanceVariable up to DataDescription and “TOFKAS” (a sketch of this path appears after item c below).

c) DataSerialisation has no relationship to RectangularDataFile. If we decide to keep DataSerialisation, the relationship to RectangularDataFile must surely be added.
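A rough sketch of the path described in (b), using hypothetical attribute names (not the actual model): if each Field already points to an InstanceVariable, and a DataDescription already references its InstanceVariables, then a physical-to-logical route exists without a separate DataSerialisation object.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical names; shows only the traversal path discussed in item (b).

@dataclass
class InstanceVariable:
    name: str

@dataclass
class Field:                             # a column in the RectangularDataFile
    position: int
    instance_variable: InstanceVariable

@dataclass
class RectangularDataFile:
    fields: List[Field] = field(default_factory=list)

@dataclass
class DataDescription:
    instance_variables: List[InstanceVariable] = field(default_factory=list)

iv = InstanceVariable("mstat")
data_file = RectangularDataFile([Field(1, iv)])
description = DataDescription([iv])

# Physical -> logical path without DataSerialisation:
assert data_file.fields[0].instance_variable in description.instance_variables
```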

 

4) TODO; Identify outstanding tasks (ØR/all)

  • Dan shares info on data.gov-Data dictionary

  • Dan shares a set of example data descriptions

  • Ørnulf pulls info from GSIM to produce candidate objects/properties for InstanceVariable, RepresentedVariables, ConceptualVariables

  • Larry shares findings/glossary for terms in extended attributes for SAS Enterprise Guide tool (below)

  • Ørnulf to suggest some “benchmark datasets” that can be used to document our work, and to “prove” that we are able to model a set of different data sets with our new model

  • Barry to flag potential issues from fieldwork with 3.2

    • Still a couple of months down the road

  • Ørnulf to harmonize minutes document and bring Larry’s notes in the right place

  • Ørnulf to try to arrange a meeting in April

  • Larry remembers to invite Ørnulf in case he’s needed for a virtual meeting during the NADDI sprint.


 
5) Assign responsibilities for outstanding tasks (ØR/all)

See above.

 

6) Plan milestones (based upon TODO-list, goals and availability) (ØR/all)

Overall milestone plan/timelines to be clarified during NADDI sprint. Thérèse Lalor (ABS) is currently the project manager for DDI4 - but only until July 2014.


 

Other notes: