
UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE                            

CONFERENCE OF EUROPEAN STATISTICIANS

Working paper 2013/28

27 November 2013

 

High-level group for the modernisation of statistical production and services (HLG)

Meeting, Geneva, 27 November 2013

 

 

Istat generalised software development and sharing: experiences and strategy [1]

 

 

1.     Background

 

By “generalised (or generic) software” for statistical production we mean the set of systems and IT tools specifically designed to support the different phases of a statistical production process (as defined by the Generic Statistical Business Process Model, GSBPM). Aside from the productivity benefits of using products that require little or no ad hoc code development, generalised systems usually implement methodological solutions that help maximise the quality of the statistical information produced.

The limited number of users of these systems (essentially National Statistical Institutes and, to a lesser extent, other bodies involved in statistical research) has meant that, with some exceptions, private companies have shown little interest in this development area, which has instead been covered by leading official statistical institutions such as Statistics Canada, Statistics Netherlands, the U.S. Bureau of the Census and a few others.

Istat, the Italian National Institute of Statistics, has also been developing its own generalised IT tools since the early 1990s (mainly generalised systems for edit and imputation and for sampling design and estimation: Concord, Diesis, Mauss, Genesees). Where necessary, generalised systems were acquired externally (namely Blaise, ACTR and GEIS/Banff), so as not to leave any suitable phase of the production process uncovered.

So far, however, no strategic decision has been taken within the community of National Statistical Institutes to coordinate their efforts and produce a complete, optimised suite of products covering the whole statistical production chain. Only recently has interest in such coordination arisen and initiatives been undertaken.

A first initiative was carried out by UNECE with its Sharing Advisory Board, aiming at collecting information on generalised software used by official statistical institutions [2].

A second set of initiatives was the launch of the ESSnet projects dedicated to defining a common architecture and environment (CORA and CORE, respectively) to facilitate the sharing of IT tools, and, more recently, the HLG initiative to define a Common Statistical Production Architecture (CSPA), based on a Service-Oriented Architecture approach.

None of these initiatives aims directly at the cooperative development of software systems; rather, they prepare the ground for the adoption of common, optimal IT solutions.

Within the VIP Programme, the “Shared Services” project will include an ESSnet (to be launched in 2014) dedicated to investigating the opportunities offered by Free and Open Source Software (FOSS) projects useful to NSIs and the ESS. Its main objectives include stocktaking, standard setting and benchmarking. The project will also investigate the case for a Centre of Expertise on FOSS projects for official statistics.

It is thus inappropriate to proceed in an uncoordinated and isolated manner in the field of generalised software production: NSIs have to define a clear strategy in order to develop and adopt a common set of standard or recommended IT tools.

 

 

2.     Generalised software in Istat

 

Since the early 1990s, ISTAT's goal has been to make available, for each step of the production process, one or more recommended software systems enabling that step to be performed optimally, in terms of both effectiveness (quality of results) and efficiency (reduction of related costs).

 

Table 1 reports the current situation.

 

It can be seen that, at least since 2007, whenever an IT tool has been developed by ISTAT, a clear choice has been made in favour of an open source approach.

This reflects ISTAT's policy on the development and/or acquisition of IT tools, which must meet two requirements: (1) implement the most advanced methods and techniques in terms of the quality and efficiency gains they offer, and (2) be fully interoperable, and therefore usable first and foremost by the whole official statistics community.

 

As for the first requirement, the open source system R is at the leading edge of research, thanks to the huge number of packages (about 5,000) produced by a vast community of developers: the availability of these packages is a great advantage when developing IT tools that can harness already available solutions.

 

A prerequisite for the second requirement is the adoption of development technologies that do not jeopardise interoperability: whatever tool we develop should not be tied to a particular commercial DBMS or system.

For instance, ISTAT's standard DBMS is Oracle, but the aim is to develop IT tools that, when a relational database is required, connect through ODBC rather than targeting a particular DBMS directly.
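As a sketch of what such DBMS-agnostic access looks like in R (using the RODBC package; the DSN name and table are hypothetical):

```r
# Minimal sketch of DBMS-agnostic data access via ODBC (RODBC package).
# "survey_dsn" and the table name are hypothetical: the DBMS behind the DSN
# (Oracle, MySQL, ...) is chosen in the ODBC layer, not in the R code.
library(RODBC)

ch <- odbcConnect(dsn = "survey_dsn")
micro <- sqlQuery(ch, "SELECT * FROM microdata")   # same code whatever the DBMS
odbcClose(ch)
```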

The use of SAS to develop generalised software is a blocking factor, because SAS must be installed in order to run SAS-based IT tools. To overcome this limitation, some ten years ago we asked SAS Institute either to allow the creation of executables that do not require a SAS installation or to expose the tools' functionalities as web services; as the answer was negative in both cases, we decided to abandon SAS as a development technology.

 

As a consequence, since 2007 the R environment and programming language have been adopted as the preferred instruments for developing generalised software. SAS was replaced by R not only because of a general policy aimed at decreasing ISTAT's dependence on proprietary software vendors, but also because of interoperability requirements. ISTAT has many relationships with other entities in the National Statistical System, as well as with Statistical Institutes in developing countries: when transferring ISTAT software to partners, one key requirement is that this must not oblige them to purchase commercial software in order to run it.

 

Currently, the software acquired on a commercial basis comes from other Statistical Institutes or international organisations: Statistics Canada (ACTR, Banff), Statistics Netherlands (Blaise) and the OECD (OECD.Stat). Some of these are planned to be replaced with FOSS solutions.

In particular, ACTR is being migrated to an R version, and Banff will be replaced by a set of alternative tools (a minimal sketch of the first of these follows the list):

o “editrules” for error localisation (developed by Statistics Netherlands as an R package);
o an R package for nearest-neighbour and Predictive Mean Matching imputation, being developed in ISTAT;
o “rspa” for minimal record adjustment after imputation (another R package developed by Statistics Netherlands).
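As an illustration of this kind of workflow, the sketch below uses the editrules package with hypothetical edit rules and data; the functions editmatrix, violatedEdits and localizeErrors are assumed to behave as documented on CRAN, and the actual ISTAT configuration may differ.

```r
# Minimal sketch of rule checking and Fellegi-Holt error localisation with
# the editrules package (hypothetical edits and data).
library(editrules)

# Linear edit rules on numeric variables
E <- editmatrix(c("turnover == costs + profit",
                  "costs >= 0",
                  "profit <= 0.6 * turnover"))

dat <- data.frame(turnover = c(100, 80),
                  costs    = c(60, 100),
                  profit   = c(40, 20))

violatedEdits(E, dat)          # which records violate which edits
err <- localizeErrors(E, dat)  # Fellegi-Holt error localisation
err$adapt                      # fields flagged for imputation, per record
```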

 

As for licensing, from the very beginning we decided to disseminate the software we produced completely free of charge. Our first generalised products (the SAS versions of Concord and Genesees) were offered, with no restrictions, to any institution or private body asking for them.

ISTAT's current policy is to release its software under FOSS licences, almost all of them the EUPL (“European Union Public Licence”). The EUPL grants the licensee the following rights:

1. Obtain the source code of the software from a freely accessible repository
2. Use the software in any circumstance and for any purpose
3. Reproduce (copy, duplicate) the software
4. Modify the original software and/or make derivative works based on it
5. Communicate the software to the public (e.g. use it through a public network or distribute services based on it via the Internet)
6. Distribute the software or copies thereof to other users (inside or outside the licensee's organisation)
7. Lend and rent the software or copies thereof
8. Sub-license rights in the software or copies thereof.

 

| GSBPM phases and sub-processes | Software | Functions | Developer | Characteristics | State | Licensing conditions |
| --- | --- | --- | --- | --- | --- | --- |
| 2. Design: 2.1 Design frame and samples | MAUSS-R (“Multivariate Allocation of Units in Sampling Surveys”) | Design single-stage stratified samples | ISTAT | R-based system (R core + Java interface) | Currently used in ISTAT and available for anyone | FOSS (EUPL) |
| 2. Design: 2.1 Design frame and samples | BEAT (“Bethel Allocation …”) | Design two-stage stratified samples | ISTAT | R package | Completing development and testing | FOSS |
| 2. Design: 2.1 Design frame and samples; 4. Collect: 4.1 Select samples | FS4 (“First Stage Sample Stratification and Selection”) |  | ISTAT | R packages (core + Tcl/Tk interface) | Beta version | FOSS |
| 2. Design: 2.1 Design frame and samples; 4. Collect: 4.1 Select samples | SamplingStrata | Frame optimisation for stratified sampling and selection of units from optimised strata | ISTAT | R package | Currently used in ISTAT and available for anyone | FOSS (GPL2, on CRAN) |
| 4. Collect: 4.3 Run collection; 5. Process: 5.2 Classify and code | Blaise | Develop applications for CAPI, CATI and CADI; develop applications for interactive coding | Statistics Netherlands |  | Currently used in ISTAT | Commercial |
| 5. Process: 5.1 Integrate | StatMatch | Statistical matching | ISTAT | R package | Currently used in ISTAT and available for anyone | FOSS (EUPL, on CRAN) |
| 5. Process: 5.1 Integrate | RELAIS (“REcord Linkage At IStat”) | Develop record linkage applications | ISTAT | R-based system (R core + Java interface) | Currently used in ISTAT and available for anyone | FOSS |
| 5. Process: 5.2 Classify and code | ACTR | Develop applications for batch coding | Statistics Canada |  | Currently used in ISTAT, but migrating to an R version | Commercial |
| 5. Process: 5.3 Review, validate and edit; 5.4 Impute | Concord-Java | Develop applications for Fellegi-Holt based edit and imputation procedures (categorical data) | ISTAT | Fortran (core) + Java (interface) | Currently used in ISTAT and available for anyone | FOSS (EUPL) |
| 5. Process: 5.3 Review, validate and edit; 5.4 Impute | Diesis | Develop applications for Fellegi-Holt and/or nearest-neighbour based edit and imputation procedures (categorical and continuous data) | ISTAT | C | Currently used in ISTAT but not yet available outside | Will be FOSS |
| 5. Process: 5.3 Review, validate and edit | editrules | Develop applications for Fellegi-Holt based edit and imputation procedures (categorical and continuous data) | Statistics Netherlands | R package | Testing in ISTAT | FOSS (GPL2, on CRAN) |
| 5. Process: 5.3 Review, validate and edit; 5.4 Impute | Banff | Develop applications for Fellegi-Holt based edit and imputation procedures (continuous data) | Statistics Canada | SAS procedures | Currently used in ISTAT; planned substitution with an open source alternative | Commercial |
| 5. Process: 5.3 Review, validate and edit | SeleMix (“SELective Editing via MIXture models”) | Develop applications for optimised selective editing | ISTAT | R package | Currently used in ISTAT and available for anyone | FOSS (EUPL, on CRAN) |
| 5. Process: 5.6 Calculate weights; 5.7 Calculate aggregates | ReGenesees (“R Evolved GENeralised Software for Sampling Estimates and ErrorS”) | Calibration and sampling error estimation (analytic methods) | ISTAT | R packages (core + Tcl/Tk interface) | Currently used in ISTAT and available for anyone | FOSS (EUPL) |
| 5. Process: 5.6 Calculate weights; 5.7 Calculate aggregates | EVER (“Estimation of Variance by Efficient Replication”) | Calibration and sampling error estimation (replication methods) | ISTAT | R package | Used in ISTAT and available for anyone | FOSS (EUPL, on CRAN) |
| 6. Analyse | Ranker | Calculation and analysis of composite indicators | ISTAT | Client version: Visual Basic; web version: Java | Used in ISTAT, not yet available outside | FOSS |
| 7. Disseminate | OECD.Stat | Statistical data warehousing | OECD |  | Used in ISTAT | Commercial |

Table 1 – Generalised software and IT tools for statistical production in ISTAT

(Annex A reports detailed information on the software developed by ISTAT.)

 

 

 

 

 

3. Future developments and collaboration strategy

 

ISTAT will continue its policy of developing and/or acquiring IT tools that are fully interoperable and shareable with other entities in the official statistics community.

 

ISTAT is a promoter of the “ESSnet on Free and Open Source Software for Statistical Production”, already approved by the ESSC and to be launched in 2014. The aim of this ESSnet is “to document, explore, and educate the ESS on the use of FOSS projects for statistical production and to evaluate the case for an ongoing Centre of Expertise to support the Official Statistics FOSS community. In doing so it will provide some investment to properly address the potential role of FOSS within a Generic Statistical Information Model and its applicability in the ESS. It will also spread knowledge of FOSS solutions and how they may benefit the business architecture, and provide NSIs and the ESS with informed research on which to base decisions on the future of statistical production. The importance of ‘plug and play’ solutions to statistical production has been noted by the High-Level Group for the Modernisation of Statistical Production and Services. To this end, while a one-off project in the form of an ESSnet to assess some of these plug-and-play solutions will be valuable, evaluating the case for a Centre of Expertise to support the Official Statistics FOSS community that has a strong methodological viewpoint is also critical to the success of moving away from the stovepipe model of statistical production and embracing new and improved software as it is developed.”

 

In this framework, ISTAT has a positive attitude towards collaborative development of new IT tools.

With reference to the Statistics Canada proposal [3], ISTAT agrees with the collaborative development model suggested and is willing to participate in a pilot experience.

 

A crucial point is the choice of the tool to be developed collaboratively:

o first, the area of interest (or, more precisely, the GSBPM sub-process) in which the tool will be used has to be identified. In choosing this area, the related methods and techniques must be well established, with no need for further research. Sub-processes that fulfil this requirement, such as “Review, validate and edit”, “Impute” or “Calculate weights”, are therefore natural candidates;

o on the basis of the best methods conceptually available for the chosen sub-process, a review of the existing tools (if any) implementing them should be carried out (updating the one on the Sharing Advisory Board website); this review can lead to different situations:

1) an existing tool is deemed completely satisfactory and is adopted as a standard for the official statistics community;

2) an existing tool is considered potentially adequate, but needs further development to add functionalities or improve the existing ones;

3) no tool is adequate.

 

In cases (2) and (3), a new development project can be launched, according to the collaborative development model proposed by Statistics Canada.

 

 

 


Annex A — ISTAT IT tools (R packages and R-based systems)

 

 

 

Package StatMatch
Author: Marcello D'Orazio (madorazi@istat.it)
Paper: (package vignette) Statistical Matching and Imputation of Survey Data with the Package StatMatch for the R Environment
http://rm.mirror.garr.it/mirrors/CRAN/web/packages/StatMatch/vignettes/Statistical_Matching_with_StatMatch.pdf
Phase: 5.4 Impute / 5.1 Integrate
Link: http://cran.r-project.org/web/packages/StatMatch/index.html
 

StatMatch provides R functions to perform statistical matching, i.e. the integration of two data sources that refer to the same target population and share a number of common variables. Some functions can also be used to impute missing values in data sets through hot-deck imputation methods. Methods for performing statistical matching with data from complex sample surveys (via weight calibration) are also available.
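A minimal sketch of a distance hot-deck matching workflow follows; it assumes the NND.hotdeck and create.fused functions described in the vignette cited above, while the data frames (rec, don) and variable names are hypothetical.

```r
# Minimal sketch of distance hot-deck statistical matching with StatMatch.
# rec (recipient) and don (donor) are hypothetical data frames sharing the
# X variables "age" and "sex"; "expenditure" is a Z variable observed only
# in the donor survey.
library(StatMatch)

out <- NND.hotdeck(data.rec = rec, data.don = don,
                   match.vars = c("age", "sex"),   # shared X variables
                   dist.fun = "Gower")             # distance for mixed-type data

# Attach the donor Z variable to each recipient record via the matched ids
fused <- create.fused(data.rec = rec, data.don = don,
                      mtc.ids = out$mtc.ids, z.vars = "expenditure")
```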

 

 

Package SeleMix
Authors: Ugo Guarnera, M. Teresa Buglielli (buglielli@istat.it)
Paper: PAPER_Q2010.pdf
Phase: 5.3 Review, validate and edit
Link: http://cran.r-project.org/web/packages/SeleMix/index.html

 

SeleMix (Selective Editing via Mixture models) is an R package for selective editing. It includes functions for the identification of outliers and influential errors in numerical data. For each unit, it also provides anticipated values (predictions) for both observed and non-observed variables. The method is based on explicitly modelling both the true (error-free) data and the error mechanism. Specifically, the true data are assumed to follow a normal or log-normal distribution, only a subset of the data is assumed to be affected by error, and the error mechanism is specified through a Gaussian random variable with zero mean vector and covariance matrix proportional to the covariance matrix characterising the true data distribution.
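A minimal usage sketch follows, assuming the functions ml.est (model fitting) and sel.edit (score computation and selection) exposed by the package; the data matrix and the threshold value are hypothetical.

```r
# Minimal sketch of selective editing with SeleMix (hypothetical data and
# threshold), assuming the functions ml.est and sel.edit exposed by the package.
library(SeleMix)

# dat: numeric matrix / data frame of survey variables possibly affected by errors
fit <- ml.est(y = dat, model = "LN")    # contamination model, log-normal true data

# Score the units by their expected impact on the estimates and flag those
# to be reviewed interactively
sel <- sel.edit(y = dat, ypred = fit$ypred, t.sel = 0.01)
head(sel)                               # per-unit scores, ranks and selection flags
```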

 

 

 

Package SamplingStrata
Author: Giulio Barcaroli (barcarol@istat.it)
Paper: (package vignette) Optimization of sampling strata with the SamplingStrata package
http://cran.r-project.org/web/packages/SamplingStrata/vignettes/SamplingStrataVignette.pdf
Phase: 2.4 Design frame and sample methodology
Link: http://cran.r-project.org/web/packages/SamplingStrata/index.html

 

In the field of sampling design (in particular stratified sampling), this package offers an approach for determining the best stratification of a sampling frame: the one that ensures the minimum sample size while satisfying precision constraints in a multivariate and multi-domain case. The approach is based on a genetic algorithm: each solution (i.e. a particular partition of the sampling frame into strata) is considered as an individual in a population, and the fitness of each individual is evaluated by calculating (with the Bethel-Chromy algorithm) the sample size that satisfies the accuracy constraints on the target estimates. Functions in the package allow the user to: (a) analyse the results of the optimisation step; (b) assign the new stratum labels to the sampling frame; (c) select a sample from the new frame according to the optimal allocation. There is also a function to build the most important input to the optimisation step, i.e. the “strata” data frame, containing information (means and standard deviations) on the distributions of the target variables in the different strata, using either the sampling frame or data from previous rounds of the same survey.
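A minimal sketch of this workflow is shown below, assuming the functions buildStrataDF, optimizeStrata, updateStrata, updateFrame and selectSample described in the vignette cited above; the frame and the precision constraints are hypothetical.

```r
# Minimal sketch of the SamplingStrata workflow (hypothetical frame and
# constraints), assuming the functions described in the package vignette.
library(SamplingStrata)

# frame: sampling frame prepared as in the vignette (stratification variables
# X1, X2, ..., target variables Y1, Y2, ..., domain identifier)
strata <- buildStrataDF(frame)       # atomic strata with means/SDs of the Y's

# Precision constraints: expected coefficients of variation by domain
cv <- data.frame(DOM = "DOM1", CV1 = 0.05, CV2 = 0.10, domainvalue = 1)

# Genetic-algorithm search for the stratification minimising the sample size
solution <- optimizeStrata(errors = cv, strata = strata)

# Apply the optimised stratification to the frame and draw the sample
newstrata <- updateStrata(strata, solution)
framenew  <- updateFrame(frame, newstrata)
smp       <- selectSample(framenew, solution$aggr_strata)
```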

 

 

 

Software MAUSS-R (Multivariate Allocation of Units in Sampling Surveys)
Authors: Teresa Buglielli, Daniela Pagliuca
Papers:
1. User and methodological manual: http://www.istat.it/it/files/2011/02/user_and_methodological_manual.pdf
2. MAUSS reference manual: mauss.pdf
Phase: 2.4 Design frame and sample methodology
Link: http://www.istat.it/it/strumenti/metodi-e-software/software/mauss-r

 

MAUSS-R is a tool for defining the sampling design of sample surveys on finite populations. It guarantees optimality criteria, flexibility and easy management for those responsible for designing and conducting such surveys. Once the objectives and operational constraints of the survey have been defined, it enables the user to choose the best sampling design among those obtained by adopting different definitions of the key features of the survey, such as the type of stratification, the desired accuracy of the estimates, the sample size, the types of study domains and the variables of interest.

The use of this software also ensures transparency, standardisation and accuracy of the methods used.
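As a toy illustration of the allocation trade-off MAUSS-R manages (and not of its actual interface), the following base-R sketch compares proportional and Neyman allocation for a hypothetical three-stratum design.

```r
# Toy base-R illustration (not the MAUSS-R interface) of the allocation
# trade-off handled by MAUSS-R: proportional vs Neyman allocation of n units
# to strata, with the resulting variances of the estimated population mean.
N_h <- c(5000, 3000, 2000)    # hypothetical stratum sizes
S_h <- c(10, 40, 90)          # hypothetical stratum standard deviations
n   <- 500                    # overall sample size

n_prop   <- n * N_h / sum(N_h)               # proportional allocation
n_neyman <- n * N_h * S_h / sum(N_h * S_h)   # Neyman (optimal) allocation

var_mean <- function(n_h) sum((N_h / sum(N_h))^2 * (S_h^2 / n_h) * (1 - n_h / N_h))
c(proportional = var_mean(n_prop), neyman = var_mean(n_neyman))
```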

 

RELAIS

Authors: Nicoletta Cibella, Marco Fortini, Monica Scannapieco, Laura Tosco, Tiziana Tuoto, Luca Valentino

Link (download software and user guide):

http://www3.istat.it/strumenti/metodi/software/registrazione/datiuser.html?software=relais22
https://joinup.ec.europa.eu/software/relais/release/all

Selected papers:
- Cibella N., Fernandez G.L., Fortini M., Guigò M., Hernandez F., Scannapieco M., Tosco L., Tuoto T.- (2009) "Sharing Solutions for Record Linkage: the RELAIS Software and the Italian and Spanish Experiences" - NTTS 2009
- Cibella N., Fortini M., Scannapieco M., Tosco L., Tuoto T. "Theory and practice of developing a record linkage software" - "Combination of surveys and administrative data" Wien 29-30 May - 2008
- Tuoto T., Cibella N., Fortini M., Scannapieco M., Tosco L. "RELAIS: Don't Get Lost in a Record Linkage Project" - FCSM 2008
- Fortini M., Scannapieco M., Tosco L., Tuoto T. "Towards an Open Source Toolkit for Building Record Linkage Workflows" - IQIS 2006
 

- User’s Guide Version 2.2

http://www.istat.it/it/files/2011/03/Relais2.2UserGuide.pdf

 

Phase: 5.1 Integrate data

RELAIS (Record Linkage At Istat) is a toolkit for record linkage. RELAIS allows techniques to be combined for each of the record linkage phases, so that the resulting workflow is built on the basis of the requirements of the application at hand. More specifically, the RELAIS toolkit is composed of a collection of techniques for each record linkage phase that can be dynamically combined in order to build the best record linkage workflow. RELAIS has been implemented in Java and R and relies on a database architecture (MySQL). Specifically, the estimation phase (EM) for the probabilistic decision model has been implemented in R, as has the 1:1 reduction phase, which implements the LP-solve algorithm. The other techniques and the GUIs are implemented in Java.
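As an illustration of the estimation step mentioned above (and not of the RELAIS code itself), the following sketch shows an EM estimation of the Fellegi-Sunter probabilistic decision model from binary comparison vectors; the data and starting values are hypothetical.

```r
# Illustrative sketch (not RELAIS code) of the EM step implemented in R:
# estimation of the Fellegi-Sunter probabilistic decision model from a
# hypothetical n-pairs x k-variables matrix `gamma` of 0/1 comparison outcomes.
fs_em <- function(gamma, iter = 100) {
  k <- ncol(gamma)
  p <- 0.1                              # initial match prevalence
  m <- rep(0.9, k)                      # P(agreement | match)
  u <- rep(0.1, k)                      # P(agreement | non-match)
  for (it in seq_len(iter)) {
    # E-step: posterior probability that each pair is a match
    lm <- exp(gamma %*% log(m) + (1 - gamma) %*% log(1 - m))
    lu <- exp(gamma %*% log(u) + (1 - gamma) %*% log(1 - u))
    g  <- as.vector(p * lm / (p * lm + (1 - p) * lu))
    # M-step: update prevalence and agreement probabilities
    p <- mean(g)
    m <- colSums(g * gamma) / sum(g)
    u <- colSums((1 - g) * gamma) / sum(1 - g)
  }
  list(p = p, m = m, u = u, posterior = g)
}
```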

 

 

 

Software ReGenesees (R evolved Generalised software for sampling estimates and errors in surveys)

Authors: Raffaella Cianchetta, Diego Zardetto

Paper:

ReGenesees reference manual

ReGenesees.pdf

ReGenesees.GUI.pdf

Phase -  5.6 Calculate weights / 5.7 Calculate aggregates / 6.2 Validate outputs

Link:

https://joinup.ec.europa.eu/software/regenesees/home

 

 

ReGenesees is a full-fledged software system entirely developed in R, with a clear-cut two-layer architecture. The application layer of the system is embedded in an R package itself named ReGenesees; a second R package, called ReGenesees.GUI, implements the presentation layer. Both packages can be run under Windows as well as under most Unix-like operating systems. While the ReGenesees.GUI package requires the ReGenesees package, the latter can also be used without the GUI on top of it. This means that the statistical functions of the system are always accessible to users interacting with R through the traditional command-line interface, while less experienced R users can take advantage of the user-friendly, point-and-click graphical interface.
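A minimal calibration-and-estimation sketch is shown below, assuming the functions e.svydesign, pop.template, fill.template, e.calibrate and svystatTM documented in the reference manual cited above; the data frames (hh, universe) and variables are hypothetical.

```r
# Minimal calibration-and-estimation sketch with ReGenesees; hh (sample data)
# and universe (population frame) are hypothetical, and the functions are
# assumed to behave as documented in the package reference manual.
library(ReGenesees)

# Complex design: PSU ids, strata and initial weights taken from the sample data
des <- e.svydesign(data = hh, ids = ~psu, strata = ~stratum, weights = ~w)

# Known population totals for the calibration model (here: counts by sex)
pop <- pop.template(des, calmodel = ~sex - 1)
pop <- fill.template(universe, pop)

# Calibrate the weights, then estimate the total of `income` with its standard error
cal <- e.calibrate(des, df.population = pop, calmodel = ~sex - 1, calfun = "linear")
svystatTM(cal, y = ~income, estimator = "Total", vartype = "se")
```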

 

 

 

Software EVER (Estimation of Variance by Efficient Replication)

Author: Diego Zardetto

Paper:

EVER reference manual

http://cran.r-project.org/web/packages/EVER/EVER.pdf

Phase -  5.6 Calculate weights / 5.7 Calculate aggregates / 6.2 Validate outputs

Link:

http://cran.r-project.org/web/packages/EVER/index.html

 

EVER is mainly intended for calculating estimates and standard errors in complex surveys. Variance estimation is based on the extended DAGJK (Delete-A-Group Jackknife) technique proposed by Phillip S. Kott.

The advantage of the DAGJK method over the traditional jackknife is that, unlike the latter, it remains computationally manageable even when dealing with “complex and big” surveys (tens of thousands of PSUs arranged in a large number of strata with widely varying sizes). In fact, the DAGJK method is known to provide, for a broad range of sampling designs and estimators, (nearly) unbiased standard error estimates even with a “small” number (e.g. a few tens) of replicate weights. Besides its computational efficiency, the DAGJK method shares the strong points of the most common replication methods. As a remarkable example, EVER is designed to fully exploit DAGJK's versatility: the package provides a user-friendly tool for calculating estimates, standard errors and confidence intervals for estimators defined by the users themselves (even non-analytic ones). This functionality makes EVER especially appealing whenever variance estimation by Taylor linearisation could be applied only at the price of crude approximations (e.g. poverty estimates).
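As a toy illustration of the DAGJK principle described above (and not of the EVER interface), the following base-R sketch computes a DAGJK standard error for an estimated total under a simple single-stage design with random replicate groups.

```r
# Toy base-R illustration of the Delete-A-Group Jackknife (DAGJK) principle
# (not the EVER interface). Single-stage design with hypothetical weights w
# and study variable y; G random replicate groups.
set.seed(1)
n <- 1000
y <- rgamma(n, shape = 2, scale = 500)    # hypothetical survey variable
w <- runif(n, 10, 50)                     # hypothetical sampling weights
G <- 30                                   # number of replicate groups
grp <- sample(rep(1:G, length.out = n))   # random assignment of units to groups

theta_full <- sum(w * y)                  # full-sample estimate of the total

# g-th replicate: drop group g and rescale the remaining weights by G/(G-1)
theta_rep <- sapply(1:G, function(g) {
  keep <- grp != g
  sum(w[keep] * (G / (G - 1)) * y[keep])
})

# DAGJK variance and standard error of the estimated total
v_dagjk <- (G - 1) / G * sum((theta_rep - theta_full)^2)
c(total = theta_full, se = sqrt(v_dagjk))
```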

 

 


[1] Prepared by Giulio Barcaroli (“Methods, Tools and Methodological Support” Division, ISTAT, barcarol@istat.it)

[2] http://www1.unece.org/stat/platform/display/msis/Software+Sharing

[3] “Software Collaboration and Sharing at Statistics Canada” – HLG meeting, Geneva, June 2013, Working paper 2013/19