5.1 IT Architecture
Statistics Canada is moving towards a service-oriented architecture (SOA). A key enabler of SOA is the Enterprise Application Integration Platform (EAIP), which allows the delivery of solutions based on metadata-driven, reusable software components and standards. Most business segments will benefit from the common core business services, standard integration platform, workflow and process orchestration enabled by the EAIP. The platform also simplifies international sharing and co-development of applications and components.
Web services currently in use and under development by EAS are associated with information objects representing core business entities (e.g., questionnaires, classifications, tax data, business registry) that are classified into GSIM’s Concepts and Structures groups. This also aligns with GSBPM: services provide the inputs and outputs to GSBPM statistical processes. They satisfy a basic set of SOA principles, i.e., they are loosely coupled (consumer and service are insulated from each other), interoperable (consumers and services function across Java, .NET and SAS), and reusable (they are used in multiple higher-level orchestrations and compositions). Work continues to establish a complete framework, including discoverability (via a service registry and inventory) and governance.
At this point, Statistics Canada has a mix of services and silo-based/point-to-point integration that corresponds to maturity levels 3 and 4 of the Open Group Service Integration Maturity Model (OSIMM) maturity matrix (see Figure 1). During the transition years to a corporate-wide SOA, incremental changes are being made by applying SOA adoption and governance segment by segment, so that cross-silo services and consumers coexist with point-to-point integration of systems and data. Early adopters of SOA services include IBSP, SSPE and SNA.
Developing Data Service Centres (DSC) is a key initiative that fits into Statistics Canada’s emerging SOA. The objective of the DSC is to manage statistical information as an asset, maximizing its value by improving accessibility, utility, accuracy, security and transparency through the use of a centralized inventory of statistical data holdings, associated metadata and documentation. Key statistical files and associated standard metadata (e.g., file name, type, description, creators and owners) will be registered and integrated into statistical processes via SOA. This integration will rely on a data access layer with common interfaces to access statistical files without the user needing to know their location, format or technology.
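The idea of a data access layer over a central registry can be illustrated with a minimal Python sketch. It is not the DSC design: the class names, the metadata fields chosen and the two supported formats (CSV and JSON) are assumptions made for illustration only.

```python
import csv
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List


@dataclass(frozen=True)
class RegisteredFile:
    """Hypothetical registry entry: standard metadata for one statistical file."""
    name: str
    file_type: str      # "csv" or "json" in this sketch
    description: str
    owner: str
    location: Path      # known to the registry, hidden from consumers


def _read_csv(path: Path) -> List[dict]:
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def _read_json(path: Path) -> List[dict]:
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


class DataAccessLayer:
    """Resolves a registered name to records, whatever the storage format."""

    _readers: Dict[str, Callable[[Path], List[dict]]] = {
        "csv": _read_csv,
        "json": _read_json,
    }

    def __init__(self) -> None:
        self._registry: Dict[str, RegisteredFile] = {}

    def register(self, entry: RegisteredFile) -> None:
        self._registry[entry.name] = entry

    def describe(self, name: str) -> RegisteredFile:
        """Return the registered metadata for a file."""
        return self._registry[name]

    def read(self, name: str) -> List[dict]:
        """Return the file's records; the caller never sees path or format."""
        entry = self._registry[name]
        return self._readers[entry.file_type](entry.location)
```

A consumer would call `read("some_registered_name")` and receive a list of records, with the registry deciding which reader and which location apply.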
5.2 Metadata Management Tools
Information discovery (available for internal use only) is provided through a wiki-based solution. Each wiki page gives the context of the information and all the available links to it. The wiki offers a non-linear view of the information, as the user decides which path to take. The wiki engine selected for use at Statistics Canada is MediaWiki 1.8 (the same wiki engine used by Wikipedia). The wiki pages are generated programmatically. The information from the IMDB Oracle Phase 2 and Phase 3 database is extracted using a VB.NET application. Specific wiki templates were developed for the IMDB to provide a consistent presentation. Wiki tags and wiki templates are added to the extracted IMDB information, which is then loaded directly into the MySQL database of the MediaWiki engine. The IMDB wiki pages are refreshed daily.
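The step of wrapping extracted information in a wiki template can be sketched as follows. The production extractor is a VB.NET application; this Python fragment only illustrates the general technique of emitting a MediaWiki template call (`{{Name|key=value}}`). The template name and field names are hypothetical, not the actual IMDB templates.

```python
from typing import Mapping


def render_template_call(template: str, fields: Mapping[str, str]) -> str:
    """Render a MediaWiki template call, one named parameter per line."""
    lines = ["{{" + template]
    for key, value in fields.items():
        # A raw "|" would end the parameter early, so emit it as an entity.
        safe = str(value).replace("|", "&#124;")
        lines.append(f"|{key}={safe}")
    lines.append("}}")
    return "\n".join(lines)


# Hypothetical record as it might come out of an extraction query.
record = {"name": "Example Survey", "status": "Active|Monthly"}
page_text = render_template_call("IMDB Survey", record)
```

Because the display logic lives in the template, a change to the page layout needs only a template edit, not a change to the extraction code.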
5.3 Standards and formats
A development team carried out an initial investigation to determine whether software tools already existed, internally or externally, to support collection. No suitable tools were found, so the decision was made to develop in-house.
Oracle 8i was selected as the database and IBM VisualAge for Java as the development tool. The resulting system, referred to as MetaStat, was in development and production from 1999 to 2002. New development was frozen in 2002 because the vendor discontinued IBM VisualAge for Java and did not supply a migration path to another product. The data content collected and managed by the MetaStat system consists of Statistical Activity, Survey, Instance, Frame, Universe, Instrument, Data Files, Survey Methodology and Documentation. Supporting content also collected by MetaStat includes Organization, Contact, Keyword and Theme. The MetaStat system is still in use for the collection of Phase 2 information. MetaStat support currently consists of ensuring that the Oracle database drivers for Java and the Java classes (currently tested to support Java 1.3) continue to behave as expected as we migrate to newer versions of the Oracle database. The current production version is Oracle 10g. MetaStat is being retired and its functionality will be incorporated into the architecture of the Phase 3 system.
The development team decided to move towards open source development tools in order to reduce the risk of vendor lock-in, as was experienced during the development of the MetaStat system. The development of the Phase 3 system also provided an opportunity to enhance the data model to provide multilingual data support.
The system developed for Phase 3 is referred to as MetaWeb. It is a Java JSP- and Servlet-based solution. The data content collected by the MetaWeb system consists of Object Class, Property, Data Element Concept and Data Element. Conceptual Domain and Value Domain information is collected and loaded into the IMDB database via a Microsoft Excel IMDB Extraction/Loader and an Oracle PL/SQL IMDB Loader. The decision to use Excel as a collection tool for the Conceptual Domain and Value Domain information was based on the data manipulation functionality present in Excel (such as sorting), the ability to present complex multi-level information on individual worksheets, the organization's familiarity with Excel and the ease of sharing Excel data with other applications (such as data warehousing) within the organization.
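How these content types relate to one another can be sketched with a few Python data classes. The relationships shown follow the ISO/IEC 11179 pattern that the names suggest (a data element concept pairs an object class with a property; a data element pairs a concept with a value domain); this is an assumption for illustration, not the actual IMDB schema, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ObjectClass:
    name: str


@dataclass(frozen=True)
class Property:
    name: str


@dataclass(frozen=True)
class ConceptualDomain:
    name: str


@dataclass(frozen=True)
class ValueDomain:
    conceptual_domain: ConceptualDomain
    permitted_values: Mapping[str, str]  # code -> value meaning


@dataclass(frozen=True)
class DataElementConcept:
    object_class: ObjectClass
    prop: Property
    conceptual_domain: ConceptualDomain

    @property
    def name(self) -> str:
        return f"{self.object_class.name} {self.prop.name}"


@dataclass(frozen=True)
class DataElement:
    concept: DataElementConcept
    value_domain: ValueDomain

    def __post_init__(self) -> None:
        # The representation must belong to the concept's conceptual domain.
        if self.value_domain.conceptual_domain != self.concept.conceptual_domain:
            raise ValueError("value domain does not match the concept's conceptual domain")

    def meaning_of(self, code: str) -> str:
        return self.value_domain.permitted_values[code]


# Hypothetical sample content.
tenure_cd = ConceptualDomain("Tenure types")
concept = DataElementConcept(ObjectClass("Household"), Property("Tenure"), tenure_cd)
element = DataElement(concept, ValueDomain(tenure_cd, {"1": "Owned", "2": "Rented"}))
```

Keeping the value domain separate from the concept is what allows the same concept to be represented by different code lists.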
Initial preparations for migrating the Phase 2 content into the Phase 3 architecture have started with the creation of a bridge between the two systems, which consists of Phase 3 system identifiers mapped to the Phase 2 system identifiers. Additional collection interfaces on the development schedule for MetaWeb include Question, Question Response Choices, Question Block and a Value Meaning manager to support the survey planning and design phase of the statistical cycle.
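An identifier bridge of this kind can be sketched as a two-way lookup. The sketch below assumes a one-to-one mapping between Phase 2 and Phase 3 identifiers, which the text does not state; the class name and identifier formats are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


class IdentifierBridge:
    """Crosswalk between Phase 2 and Phase 3 identifiers (assumed one-to-one)."""

    def __init__(self, pairs: Iterable[Tuple[str, str]] = ()) -> None:
        self._to_phase3: Dict[str, str] = {}
        self._to_phase2: Dict[str, str] = {}
        for phase2_id, phase3_id in pairs:
            self.add(phase2_id, phase3_id)

    def add(self, phase2_id: str, phase3_id: str) -> None:
        # Reject duplicates on either side so lookups stay unambiguous.
        if phase2_id in self._to_phase3 or phase3_id in self._to_phase2:
            raise ValueError(f"identifier already mapped: {phase2_id} / {phase3_id}")
        self._to_phase3[phase2_id] = phase3_id
        self._to_phase2[phase3_id] = phase2_id

    def to_phase3(self, phase2_id: str) -> str:
        return self._to_phase3[phase2_id]

    def to_phase2(self, phase3_id: str) -> str:
        return self._to_phase2[phase3_id]

    def unmapped(self, phase2_ids: Iterable[str]) -> List[str]:
        """Phase 2 identifiers that still need a Phase 3 counterpart."""
        return [i for i in phase2_ids if i not in self._to_phase3]


# Hypothetical identifiers for illustration.
bridge = IdentifierBridge([("P2-100", "P3-0001"), ("P2-101", "P3-0002")])
```

The `unmapped` check gives a simple way to track how much Phase 2 content remains to be bridged before migration.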
5.4 Version control and revisions
The open source version control tool Concurrent Versions System (CVS), together with Windows clients for CVS, is used to manage all software source code. Separate environments are set up for development, acceptance testing and production. When software is promoted from the acceptance testing environment to the production environment, the software suite is tagged in CVS with a production release number.
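The tagging step can be sketched as below. CVS tag names must start with a letter and may contain only letters, digits, hyphens and underscores, so a dotted release number cannot be used as a tag directly. The `PROD` prefix and the dotted number format are assumptions; the actual naming scheme is not described here. The function only builds the `cvs tag` command and does not run it.

```python
import re
from typing import List

# CVS tag names: a letter first, then letters, digits, "-" or "_".
_TAG_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_-]*$")


def release_tag(release_number: str, prefix: str = "PROD") -> str:
    """Turn a release number such as "2.4.1" into a CVS-safe tag name."""
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", release_number).strip("_")
    if not cleaned:
        raise ValueError("release number is empty")
    tag = f"{prefix}_{cleaned}"
    if not _TAG_PATTERN.match(tag):
        raise ValueError(f"not a valid CVS tag name: {tag!r}")
    return tag


def tag_command(release_number: str) -> List[str]:
    """Command to tag the checked-out working copy (hypothetical naming scheme)."""
    return ["cvs", "tag", release_tag(release_number)]
```

Run from the working copy that was promoted, the resulting command marks exactly the file revisions that went to production, so a release can be checked out again later by tag.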
5.5 Outsourcing versus in-house development
Most development of the IMDB has been done in-house by systems developers from Statistics Canada's Systems Development Division, a centralized service responsible for developing the Agency's applications. During periods when systems developers were in short supply, contractors were hired but worked on-site with our systems developers.
5.6 Sharing software components of tools
The current IMDB system is ten years old and is currently being upgraded. Statistics Canada is willing to share its documentation on the IMDB model but there is limited scope to share any IMDB tools at this time.
5.7 Additional materials
Daniel W. Gillman, 1999: Corporate Metadata Repository (CMR) Model; U.S. Bureau of Labor Statistics.