GENESIS is a cube database used in the Verbund by many statistical offices. It is based on an extensive data and metadata model and handles its metadata internally. First drafts of the system date back to 1994. At that time, GENESIS was intended as a data warehousing solution, mainly to store macro data for internal purposes. Although it is still used in this way, its main purpose since 2002 has been to serve as a dissemination database for internet users.
In many ways GENESIS overturned existing habits of disseminating data at Destatis and in the Verbund when it was introduced. The cube model along with the standardised metadata entry forced a new way of thinking onto subject matter statisticians. Constrained by organisational issues - especially coordination in the Verbund - and legacy IT-systems, it often stretched the resources of the subject matter departments and the central coordination unit. Despite the age of the design, it is only now that its full potential is being realised. Especially in combination with the centralised micro data storage built by SteP, it is possible to populate the cubes faster and hence to build larger cubes and publish faster. GENESIS is integrated into the web pages of the offices in the Verbund. At Destatis it is linked to the press releases so that interested users can search for additional data.
GENESIS was implemented using a programming language called Natural and a pre-relational database technology named ADABAS. ADABAS was first introduced in 1971 and - with many updates - is still heavily used in legacy software at public institutions in Germany.
GENESIS has several clones, with each office having its own database. There is also a GENESIS clone with nationwide data at a regional level. The GENESIS model itself has over the years proven its worth as a data and metadata model for a dissemination database. One Land office (Rheinland-Pfalz) deployed a new dissemination database a few years ago which is based on the same GENESIS model while using relational technology.
With the establishment of research data centres (RDCs), statistical offices in the Verbund began to realise the need for a database holding metadata that could explain the content of the research data files to the researchers. The decision was made to expand the metadata part of the existing GENESIS-system. The result was a metadata system that contains information especially on the level of individual data files, on the level of statistical activities and on the level of variables.
Each variable ought to be entered only once and is then tied to the data file and thereby to the statistical activity it is used in. To avoid the duplicated entry of variables with different names but similar content, an editorial team reviews each variable individually. This follows the same idea that was employed in the GENESIS database.
It is interesting to compare the (meta-) data model of the RDC-Metadata system (essentially an expanded GENESIS model) with other models like the Neuchâtel model or ISO 11179. In some ways they are similar, but the idea of a conceptual variable or of an ISO 11179 data element scheme does not exist. Therefore, variable definitions have to be harmonised at a very low level. The variable is modelled as an object with a definition and a value domain. Categorical variables have their categories (called value domain items in Neuchâtel terminology) as objects of their own. The value domain is not modelled as an object on its own. Therefore, variations in the value domain of a variable necessitate the entry of a new variable. As a result, the number of variables rises and the system today stores 5,600 variables for the micro data files of 33 statistical activities.
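The consequence of this modelling choice can be illustrated with a minimal Python sketch. It is not the actual RDC-Metadata schema: the class names, the `register` helper and the example variable `SEX` are assumptions made purely for illustration. Categories are objects of their own, while the value domain is only an attribute of the variable, so a changed value domain forces a second variable entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    """A category is an object of its own and can be shared by variables."""
    code: str
    label: str


@dataclass(frozen=True)
class Variable:
    """A variable carries its definition and its value domain directly.

    The value domain is a plain attribute (a tuple of categories), not an
    object with its own identity, so it cannot vary independently of the
    variable it belongs to.
    """
    name: str
    definition: str
    value_domain: tuple[Category, ...]


def register(registry: dict[str, Variable], variable: Variable) -> None:
    """Enter a variable once; reject a second entry under the same name."""
    if variable.name in registry:
        raise ValueError(f"variable {variable.name!r} already entered")
    registry[variable.name] = variable


# Hypothetical example data, not taken from the RDC-Metadata system.
male = Category("1", "male")
female = Category("2", "female")
unknown = Category("9", "unknown")

registry: dict[str, Variable] = {}
register(registry, Variable("SEX", "Sex of the person", (male, female)))
# Same definition, one extra category: a second variable has to be entered.
register(
    registry,
    Variable("SEX_WITH_UNKNOWN", "Sex of the person", (male, female, unknown)),
)

print(len(registry))
```

Both entries share the same category objects and the same definition, yet they count as two variables. A model with a separate conceptual variable or value domain object would allow one concept with two representations instead.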
Nevertheless, the RDC-Metadata system has been successfully implemented and is popular with researchers using the data centres. Since it is not possible to access the research data files via the internet, any prior information about their content is welcome. The system is not yet fully populated as metadata exists only for 35 of the planned 60 statistical activities.
The success of the RDC-Metadata system quickly led to a decision to use the same system not only for the RDC-relevant statistical activities but to apply it across the board. The result is the idea of an "output oriented metadata system". Although a business case does not exist, it could be used to document the metadata of the finalized micro data files. A cost analysis of this project still has to be undertaken, but from what can be said today, it seems unlikely that the original idea of harmonising variables at such a detailed level can be realized by way of an editorial team. With 5,600 variables already stored for the metadata of 33 statistical activities in the RDC-system, the figure is likely to rise significantly when variables for up to 390 statistical activities have to be stored. In any case, the number of variables stored in the system will most likely be too high to harmonise the variables by comparing them one to one.
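A rough calculation shows why one-to-one comparison does not scale. The sketch below counts the unordered pairs among n variables, n(n-1)/2. The projection to 390 activities assumes that the number of variables grows linearly with the number of activities. That linear growth is an assumption of this sketch only, not a figure from the text, which merely says the number will rise significantly.

```python
from math import comb


def pairwise_comparisons(n_variables: int) -> int:
    """Number of unordered variable pairs an editorial team would compare."""
    return comb(n_variables, 2)


variables_now = 5_600       # figure given in the text
activities_now = 33         # figure given in the text
activities_planned = 390    # figure given in the text

# ASSUMPTION: variables scale linearly with the number of activities.
projected_variables = round(variables_now / activities_now * activities_planned)

print(pairwise_comparisons(variables_now))   # 15677200
print(projected_variables)
print(pairwise_comparisons(projected_variables))
```

Even at today's 5,600 variables, a full one-to-one comparison already involves more than 15 million pairs. Because the count grows quadratically, any substantial increase in variables makes the approach impractical.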
The Statistikdatenbank stores metadata for all statistical activities at a very high level. It currently exists in the form of two MS-Access databases. One is used to maintain the central catalogue of all statistical activities (called EVAS - Einheitliches Verzeichnis aller Statistiken) of the Verbund. The second one is used for management purposes, containing basic information on methodology, legal background, etc. The reengineering will combine this information in a single application that will allow accessing and querying the information via the internal web portal of the Verbund. As a result, general information on all statistical activities will be visible to all users in the Verbund.
In the course of its further development, the Statistikdatenbank will become a central hub for the management of statistical activities at Destatis and in the Verbund. Every new statistical activity will first have to be registered in the Statistikdatenbank and is then identifiable by its unique EVAS-code (registration meant as a business process, not necessarily in a strict IT-sense). The Statistikdatenbank can easily be amended and combined with other metadata storages at Destatis that use the same EVAS-catalogue. For example, it is planned to integrate the quality reports directly into this application. Quality reports contain partially overlapping information but are currently stored as single text files written according to a given template. It is conceivable that other EVAS-based systems - like the database used to compile Destatis' Strategy and Programme Plan or internal accounting databases - will also be loosely attached or linked to the Statistikdatenbank.
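The hub idea, register first and then link other systems via the shared EVAS-code, could be sketched as follows. This is a hypothetical illustration only: the class and method names, the code `00001` and the file path are placeholders and do not describe the real Statistikdatenbank or any real EVAS entry.

```python
from dataclasses import dataclass, field


@dataclass
class StatisticalActivity:
    evas_code: str   # unique key shared by all EVAS-based systems
    title: str
    # Loose links to records held elsewhere, keyed by system name.
    linked_resources: dict[str, str] = field(default_factory=dict)


class ActivityRegistry:
    """Hypothetical hub: activities are registered once, then linked."""

    def __init__(self) -> None:
        self._activities: dict[str, StatisticalActivity] = {}

    def register(self, evas_code: str, title: str) -> StatisticalActivity:
        if evas_code in self._activities:
            raise ValueError(f"EVAS code {evas_code!r} is already registered")
        activity = StatisticalActivity(evas_code, title)
        self._activities[evas_code] = activity
        return activity

    def attach(self, evas_code: str, system: str, reference: str) -> None:
        """Link a record from another system to a registered activity."""
        if evas_code not in self._activities:
            raise KeyError(f"EVAS code {evas_code!r} is not registered")
        self._activities[evas_code].linked_resources[system] = reference

    def lookup(self, evas_code: str) -> StatisticalActivity:
        return self._activities[evas_code]


hub = ActivityRegistry()
hub.register("00001", "Example statistical activity")  # placeholder code
hub.attach("00001", "quality_report", "reports/00001.txt")  # placeholder path

print(hub.lookup("00001").linked_resources)
```

Because every attached system refers to the activity only through its EVAS-code, systems can be linked or detached without changing the registry itself.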
KlassService is a tool developed by the Bavarian State Office for Statistics and Data Processing. It is used to classify and code answers entered in free text fields in questionnaires. It currently houses only two classifications (the German NACE and PRODCOM versions). Since the administration of standard classifications is under the responsibility of Destatis rather than the Länder, the classifications and the additional thesaurus are maintained by Destatis using a web interface. KlassService has also been declared a standard IT-tool under the SteP guidelines. As such, it is used to support the classifying and coding of responses in many offices of the Verbund.
KlassService was built using an ADABAS database and is now considered a legacy system. Because of rising maintenance costs, the Bavarian State Office expressed the wish to move to relational technology. At the same time, Destatis' classification department was making plans to build a comprehensive classification server. The classification department had previously advised the Turkish National Statistical Institute on the design of such a system.
As a result of these initiatives, a business case was drafted that involved a redesign of the old KlassService in three successive stages. The first stage basically consists of the database itself and basic import functionalities. The succeeding stages will focus amongst other things on the user interfaces. The first stage is being carried out by the Bavarian State Office for Statistics and Data Processing. The later stages will be put out to tender in the Verbund.
The new KlassService will bear little resemblance to the old system. It will be based on the Neuchâtel Terminology, Part I, which will only be slightly altered to fit the relational technology employed. Web service functionalities enable connections to other databases and IT-tools (namely to other metadata systems). The system will also be designed to support multiple language versions of the classifications.
According to a decision made by the heads of the offices of the Verbund, a separate metadata system has to be developed for the Census 2011. The system will be of modular design and so several drafts for individual business cases have to be written. Some of the applications could possibly remain in use after the census has been completed and - if applicable - be employed in other statistical activities as well. A decision on the implementation will be made in collaboration with the IT-working group for the Census.
Issues of census metadata management include:
It is standard practice among statisticians to deliver most of the documentation in written form. A sophisticated methodology, the need for coordination between many parties involved and a very intense preparation phase lead to an enormous number of text files being written for the Census 2011. However, there is currently no tool to store such documentation in a structured way in the Verbund. In order not to change existing work practices, the first measure to be taken for the Census 2011 will be a relatively simple document management system. It will be structured according to the Census process model (fig. 3), requiring statisticians to deliver documentation in the form of a text file for each applicable phase on level 2 of the model (see also 3.2.). The documents provided will largely be existing documentation, restructured according to the process model.
To move the documentation of variables for the Census 2011 from written text files to a more accessible and regularly updated form, a database for variables will have to be realised. The draft for this application is currently being written. It will be based on the Neuchâtel Terminology Model (Part II, Variables and related objects).
To fully document the statistical data collected, processed, analysed and disseminated in the 2011 census, the different data holdings ought to be documented. According to the current plan, this will either be an extension of the variable database or an independent system. A separate draft will be written for this part, but it will also be based on the Neuchâtel Terminology Model (Part II).
To realize the potential of metadata and reduce duplicated entries, the metadata system ought to be connected to production tools. A draft will be written to explain the connections between the systems and how a coupling of the different tools can be realized.
Several standard classifications will be used in the census. Since these classifications should be stored in the KlassService database, avoiding duplicated entry, these systems must be linked in some way. A draft will be prepared to explain the connections between the systems.
.BASE (Basis Anwendungen fuer Statistische Erhebungen) is the umbrella name for several IT-tools - developed for the Verbund - to support a standardised e-workflow and forms an important part of the SteP-project. Some of the .BASE tools - notably a data editing tool - are metadata driven and load their metadata from a central storage named "survey database".
The survey database registers every survey in the Verbund. A statistical activity may consist of one or more individual surveys. For every survey, several resources can be uploaded and accessed in the survey database. Apart from text files and other documentation, several XML-files containing metadata to drive production processes can be stored. These XML-files contain, for example, registered variables and executable code to drive data editing processes in different IT-environments.
The metadata in the survey database is clearly on a technical level. In the terminology of the Neuchâtel model the variables are on a level lower than the conceptual level. The survey database would therefore provide an almost perfect vehicle to transport conceptual metadata (stored in classification servers or variable databases) into the production process. This would link conceptual and production metadata and - together with a powerful data warehouse at the end of the statistical value chain (see GENESIS above) - would almost finalise a metadata driven production process (provided other IT-tools use the survey database as well).
To that end, however, several steps will have to be taken beforehand. The survey database was not designed with international metadata standards in mind. For obvious reasons, the focus of the designers was to connect production tools to the database. Given the myriad ways to design a statistical activity, it is unsurprising that the definition of the term "survey" remains somewhat ambiguous. This is currently not a problem for the existing .BASE tools but it will become more of a problem when the survey database is connected to more production tools and other metadata storages (like classification servers or variable databases). Therefore, an overarching metadata model and a standardised terminology are needed to integrate additional production and metadata systems, to facilitate the interoperability of the SteP-tools and thus to ensure the overall success of the SteP project.
Metadata management contributes directly to the realisation of major objectives in Destatis' corporate strategy. It enables the further standardization of processes, the harmonisation of statistics and the monitoring of data quality (see corporate strategy). Metadata systems help with the documentation of surveys. To ensure that the public trusts the data which Destatis and the Verbund produce and to be able to claim that the data has been compiled according to an appropriate methodology, good documentation is indispensable. With central metadata systems in place, duplicated entry of metadata becomes unnecessary, and it will be possible to share information easily, to drive production systems and to keep internal and external users informed about the statistical activities. A metadata model that allows for the correct representation of the metadata of all statistical activities can itself be a powerful tool in the standardisation of business processes and IT-systems because it represents a common structure for all statistical activities.
Since several IT-systems that run on metadata are already in place and given the complexity of the issue, we have decided in favour of a stepwise implementation strategy. The Statistikdatenbank and the new KlassService are the systems that will become operational first while a variable database and a tool for managing textual documentation (both part of the Census 2011-project) are next in line. Detailed project management is in place in the Census project. The development of KlassService is managed by the Bavarian State Office for Statistics and Data Processing. The progress of the Statistikdatenbank is dependent on the resources of Destatis' IT-department.
The major design work on the Census 2011-related systems will have to be finished in the first half of 2010. As the census will be conducted in May 2011, maintenance and helping users with the systems could have become a major task by then. After 2011 the attention might turn to generalising the lessons learned and broadening the metadata management, with involvement in SteP gaining in importance.