Skip to end of metadata
Go to start of metadata

Prepared by Valentin Todorov, UNIDO

I.                   INTRODUCTION

The term Record linkage defines the process of identifying records on two or more different data sets (stored as electronic files) that contain information about the same object. The object referred to by these records can be a person, organization or institution, enterprise or establishment. A similar process is the data de-duplication in which data are cleaned from duplicate records due to misspelling, field swap or any other mistake or data inconsistency. This process requires that we identify objects that are included in more than one list.

Record linkage can be performed for different purposes making it a powerful instrument to support decisions in business and government institutions. Record linkage is necessary at least in two basic cases: data collation and list construction (e.g. maintenance of a statistical business register). Data Collation is applied when a project might require data that are not all available from the same source.  List Construction is applied when a project requires a list of all the members of a population to serve as a sampling frame, the contact list for a census, or for collation with data from other sources. Usually there is no single list of the population, but a combination of several lists can be expected to include all or most of it. Usually, there is some overlap between these lists. A de-duplication is necessary to avoid biasing the sample or census so that each member of the population is included only once.

Figure 1: Schematic representation of the record linkage process – [4]

Performing record linkage is unthinkable without appropriate software and the development of software packages that implements computational models for record linkage and statistical matching has increased over the last decade. It is possible to find several software packages that perform similar tasks. Such packages may improve the process and its results but, however a lot of time and effort is necessary to decide which one better meets the Organization needs (see for example [7]).

The purpose of this study is to review and evaluate one major area of data integration tools: Record linkage and statistical data matching. Only free and open source software package are considered.  An ESSNET project [17, WP3] considers a number of software tools, including many commercial packages. Here we will focus on only open source tools and will discuss them in detail. The first tools to be evaluated are:

1)      Febrl (Freely Extensible Biomedical Record Linkage)

2)      RELAIS (REcord Linkage At IStat)

3)      The R package Record Linkage

The study is a work in progress and further details and experiences will be added.

 

II.                Febrl (Freely Extensible Biomedical Record Linkage)

A.                 About FEBRL

The Febrl (Freely Extensible Biomedical Record Linkage) system (Christen et al. [1, 2, 3]), developed in Python by the Data Mining Group of the Australian National University, is able to perform de-duplication (same database), linkage between two databases and data standardization. It  is available under an open source software licence. It contains many recently developed advanced techniques for data cleaning and standardisation, indexing (blocking), field comparison, and record pair classification, and encapsulates them into a graphical user interface. Febrl can be seen as a training tool suitable for users to learn and experiment with both traditional and new record linkage techniques, as well as for practitioners to conduct linkages with data sets containing up to several hundred thousand records. The authors of the package claim that Febrl is the only free linkage system with a graphical user interface.

Febrl is implemented in Python, a free, object oriented programming language that is available on all major computing platforms and operating systems. Python is an ideal platform for rapid prototype development as it provides data structures such as sets, lists and dictionaries (associative arrays) that allow efficient handling of very large data sets, and includes many modules offering a large variety of functionalities. For example, it has excellent built-in string handling capabilities, and the large number of extension modules facilitate, for example, database access and graphical user interface (GUI) development. For the Febrl user interface, the PyGTK4 library and the Glade5 toolkit were use, which, combined, allow rapid platform independent GUI development. Febrl is published for free under an open source software licence. The software is available from the Sourceforge repository. Due to the availability of its source code, Febrl is suitable for the rapid development, implementation, and testing of new and improved record linkage algorithms and techniques, as well as for both new and experienced users to learn about and experiment with various record linkage techniques.

B.                 Availability

Febrl is freely available under an open source software licence and can be installed on all major computing platforms and operating systems. It is based only on software components and libraries that are freely available themselves, thus no costs are involved for the user. Febrl is implemented in Python, a free, object oriented programming language that is available on all major computing platforms and operating systems. Python is an ideal platform for rapid prototype development as it provides data structures such as sets, lists and dictionaries (associative arrays) that allow efficient handling of very large data sets, and includes many modules offering a large variety of functionalities. For example, it has excellent built-in string handling capabilities, and the large number of extension modules facilitate, for example, database access and graphical user interface (GUI) development.

The software is available from the Sourceforge repository - http://sourceforge.net/projects/febrl/.

C.                 Input data Formats

Febrl supports several most popular text-based file formats, including the most widely used comma separated values (CSV).  The user can specify additional parameters defining the file format more concretely (whether there is a header file with field names, which field contains the unique identifiers (if any), how are missing values identified, what is the field separator for a CSV file, etc.).

Database access using SQL is currently not available and will be included in the future.

D.                 Functionality

Febrl contains many recently developed advanced techniques for data cleaning and standardisation, indexing (blocking), field comparison, and record pair classification, and encapsulates them into a graphical user interface. In the following paragraphs these functions will be briefly described. For more details see [3].

Exploratory analysis: the package allows the use to explore the data set in order to get better insight into its characteristics and to understand better the data content and quality. Thus obtained knowledge will be used in the further phases of standardization, de-duplication and/or linkage. In case of larger data sets it is possible to select sampling rate as a percentage of the number of records in the data set. This will speed up the performance.

Data cleaning and standardization: Cleaning and standardization functionality is supported by the package but is not fully integrated into the phases of a deduplication or linkage project. That is, when data cleaning and standardization is performed, a new data set is created and stored. This cleaned and standardized data set can be further used as an input for the next step of the process. Different standardisers are available, currently for names, addresses, dates and telephone numbers. Most of these standardisers are rule-based, for details see [1] and the references therein. Each standardiser accepts one or more input fields from the input data set and represents each component into segments stored into a number of output fields (e.g. six for names, 27 for addresses).

Indexing/Blocking: In order to perform the de-duplication or record linkage process it is necessary to specify the indexing method and the corresponding indexing keys. Seven different indexing methods are provided:

1)      Full index

2)      Blocking index

3)      Sorting index

4)      QGram index

5)      Canopy index

6)      String map index

7)      Suffix array index

While the Full index is the trivial method (all records are taken into account) and the Blocking index is implemented in most of the commercial and open source software packages, the rest five methods are recently developed.

After selecting the index method the user has to choose the actual index key definitions and to set their various parameters. Index keys consist of one field from the input data set or a concatenation of several input fields. These several fields are often phonetically encoded so that similarly sounding values will be grouped together and will fall into the same block. The encoding is performed by one of the encoding functions implemented Febrl.  Some of the encoding functions are Soundex, Fuzzy soundex, Double Methaphone, etc. For a complete list see Christen (2010) [4] and Christen (2006) [3].

Comparison: Field values of record pairs are compared using one of the many comparison functions implemented in Febrl. Most of these functions are variations of the approximate string comparison functions - see Christen (2006) [3] and Cohen et al. (2003)  - while others are used for comparison of numerical values, dates or times. The comparison functions return a raw similarity value between 0 (for total dissimilarity) and 1 (for an exact match). It is possible to adjust these values by setting agreement and disagreement weights.

Weighting vector classification: The record linkage problem could be understood as a classification problem and a number of methods can be utilized for this purpose. Febrl implements six classification methods:

1)      Felegi-Sunter: allows manual setting of two  thresholds

2)      Optimal Threshold: supervised classification

3)      Kmeans: unsupervised clustering approach

4)      Farthest First: unsupervised clustering approach

5)      Support Vector Machine: uses a supervised support vector machine (SVM); based on the libsvm library

6)      Two Step: unsupervised approach

Experimental results have shown that unsupervised approach in many cases achieves results not worse than the supervised one (Christen [4]).

E.                 Natural language support

Febrl supports only English.

F.                  Output, Evaluation and Clerical review

The results of the de-duplication or record linkage are stored in text files and later can be evaluated and further submitted to a clerical review. All user inputs are used to generate a Python file which later can be run outside of the GUI. Intermediate results (e.g. weight vectors)  are stored too. The main visualization tool for evaluation of the procedure is a histogram of the summed matching weights of all compared record pairs. Further, measurements for accuracy, precision, recall and F-measure (or F-score) can be generated.

Additional evaluation measures, such as the ROC, will be implemented in future versions. An interactive feature that allows manipulation of a classification threshold on the shown histogram will also be added. The ‘Review’ page, which will also be added in future releases, which will allow a user to view and manually designate record pairs as matches or non-matches (originally classified as possibly matches).

G.                Graphical User Interface (GUI)

The authors claim the Febrl is the only free record linkage software. The reason to implement GUI for Ferl was to make it available to people who are not experienced in programming, non-technical record-linkage practitioners. The Febrl GUI follows the structure provided by the Rattle open source data mining tool (Williams 2007). The basic idea of this GUI is to have a window that contains one per major step of the process (see Figure 1). For the Febrl user interface, the PyGTK4 library and the Glade5 toolkit were use, which, combined, allow rapid platform independent GUI development.

H.                Extendibility

Due to the availability of its source code, Febrl is suitable for the rapid development, implementation, and testing of new and improved record linkage algorithms and techniques, as well as for both new and experienced users to learn about and experiment with various record linkage techniques.

I.                   Privacy preserving

The privacy preserving issue is not considered in the current implementation of Febrl, but it is envisaged by the authors to implement privacy preserving record linkage techniques in subsequent releases.

J.                  Specific functions

None

 

III.             RELAIS (REcord Linkage At IStat)

A.                 About RELAIS

RELAIS (Tuoto et al. 2007 [13]) has been designed and developed to decompose the record linkage process into its constituent phases and to address each of these phases with the technique most appropriate to the nature of the data. The constituent phases are:

  • Pre-processing of the input files
  • Choice of the identifying attributes (matching variables)
  • Choice of the comparison function
  • Creation of the search space of link candidate pairs
  • Choice of the decision model
  • Selection of unique links
  • Record linkage evaluation

Several record linkage systems and tools have been proposed, in both the academic and private sectors. Some provide the user a certain degree of flexibility, e.g. allowing a choice of comparison functions. However, none of them allow multiple choices for each record linkage phase, or dynamically build a record linkage workflow by combining the most appropriate technique at each phase.

The RELAIS toolkit comprises a collection of techniques for each record linkage phase, which can be dynamically combined to build the best record linkage workflow, given a set of application constraints and data features. The current version (2.0) includes the following functionalities:

  • Creation of the search space of candidate pairs by means of the cross product of the input files;
  • Search space reduction using a "blocking" method or a "sorted neighbourhood" method;
  • Choice of the matching variables;
  • Choice of a distance function for approximate string comparisons;
  • Deterministic decision models (Equality or Rule based);
  • Probabilistic decision model (Fellegi and Sunter);
  • Reduction from N:M to 1:1 matching solution.

A data profiling phase that helps the user to choose the best blocking or matching variables. RELAIS is an open source project. There are at least two reasons for this choice. First, there are many possible techniques that can be implemented for each of the record linkage phases: relying on a community of developers such a set can be increased and maintained very rapidly. Second, in recent years there have been several independent efforts to develop record linkage projects for specific purposes, which have not led to the best generic solution. An open source approach gives the possibility to gather together work already done and make it available to the community for the most appropriate usage.

RELAIS is mainly implemented in Java, due to the well-known features of strongly typing and platform independence; some phases are implemented in R, (see previous article). It has been implemented using relational database architecture, based on a MySQL environment, which is also in line with the open source philosophy of the RELAIS project.

B.                 Availability

RELAIS has been developed as an open source project, so several solutions already available for record linkage in the scientific community can be easily re-used. It is released under the EUPL license (European Union Public License). RELAIS is mainly implemented in Java, due to the well-known features of strongly typing and platform independence; some phases are implemented in R. The choice of Java and R is also in line with the open source philosophy of the RELAIS project. Further it has been implemented using relational database architecture; in particular it is based on a MySql environment that is also in line with the open source philosophy of the RELAIS project.

The software is available either from the project link at the ISTAT web site http://www.istat.it/it/strumenti/metodi-e-software/software/relais or from the JOINUP portal https://joinup.ec.europa.eu/software/relais/home

C.                 Input data Formats

RELAIS has been implemented using relational database architecture, based on a MySQL environment, which is also in line with the open source philosophy of the RELAIS project.

D.                 Functionality

The idea of decomposing the record linkage process in its phases is the core of the RELAIS toolkit and makes the whole process easier to manage; each phase has its own windows. In the following paragraphs these functions will be briefly described. For more details see [13, 14].

Exploratory analysis: no exploratory tools available

Data cleaning and standardization: RELAIS provides two categories of functions related to data cleaning and standardisation: (i) checking functions and (ii) converting functions. The checking functions validate the compliance of the data with the format of the fields (if available) and create a list of inaccurate data records. The converting functions generate new variables which are added to the dataset (while also maintaining the old variable). The new variable must have a name that is different from the original variable name. The following features for data manipulation are available in the sub-menu ‘Conversion function’:

1)      Field Standardization: allows performing a series of simple, but often very useful, cleanup tasks of the values of a variable

2)      Fields Merge: allows creating a new variable by concatenation of filler and/or the existing variables of the dataset

3)      Field Parse: allows splitting a variable creating two new fields in the dataset

4)      Inaccuracy Repair: allows creating a new variable for a dataset as a copy of an existing variable where the inaccurate values are repaired by corresponding “correct” values. The conversion of an inaccurate value with the corresponding correct value must be provided as an input by an external text file named “Conversion Input File”.

5)      Schema reconciliation: RELAIS performs automatically a schema reconciliation combining the variables of the two input datasets according to their names (read in the headers of the input files). The same reconciliation is performed for the new variables, created from pre-processing functions.

Indexing/Blocking: In order to perform the de-duplication or record linkage process it is necessary to specify the indexing method and the corresponding indexing keys. The search space of the candidate pairs is naturally formed by the cross product of the records stored in each input data set. The following indexing methods are provided:

1)      Cross product

2)      Blocking

3)      Sorted neighbourhood method

4)      Nested blocking method.

Comparison: Field values of record pairs are compared using one of the many comparison functions implemented in RELAIS. Several comparison functions are available in RELAIS, namely: Equality, Numeric Comparison, 3Grams, Dice, Jaro, JaroWinkler, Levenshtein, Soundex.

Weighting vector classification: RELAIS has the objective to provide different approaches and techniques to deal with the various record linkage problems. RELAIS implements both a method for probabilistic record linkage, according to the Fellegi and Sunter theory [9] and two methods for deterministic record linkage, based on comparisons of matching variable values. The principal steps to apply the different methods are treated in the User guide and methodological details on the probabilistic implementation are given in its appendix A.

E.                 Natural language support

RELAIS supports most of the western European languages. No information is available for language support of other languages (Russian, Bulgarian, Arabic, etc.).

F.                  Output, Evaluation and Clerical review

The results of the de-duplication or record linkage are stored in text files and later can be evaluated and further submitted to a clerical review: Match file, Possible match file, Residual set –A and Residual set-B.

G.                Graphical User Interface (GUI)

The RELAIS was developed with the aim to provide record linkage techniques easily accessible to not-expert users. Indeed, the developed system has a GUI (Graphical User Interface) that permits to build record linkage strategy with a good flexibility and checks the execution order among the different provided techniques whereas precedence rules must be controlled.

H.                Extendibility

While very flexible to define the record linkage process, it is not easy for a user with programming skills to extend the package.

I.                   Privacy preserving

The privacy preserving issue is not considered in the current implementation of RELAIS.

J.                  Specific functions

Instead of using RELAIS GUI, in some applications it is needed to run RELAIS in a batch mode, by specifying all the input parameters once before the record linkage process execution. The first step is the specification of the input parameters file, named “batchparam.txt”.

 

IV.              THE R PACKAGE RecordLinkage

A.                 About RecordLinkage

The package RecordLinkage provides means to perform and evaluate different record linkage methods. A stochastic framework is implemented which calculates weights through an EM algorithm. Machine learning methods are utilized, including decision trees (rpart), bootstrap aggregating (bagging), ada boost (ada), neural nets (nnet) and support vector machines (svm). The generation of record pairs and comparison patterns from single data items are provided as well. Comparison patterns can be chosen to be binary or based on some string metrics. In order to reduce computation time and memory usage, blocking can be used.

B.                 Availability

RecordLinkage is available from our project home page on R-Forge http://r-forge.r-project.org/projects/recordlinkage/ as well as from CRAN.

C.                 Input data Formats

Data must reside in a data frame where each row holds one record and columns represent attributes. The package includes two example data sets, RLdata500 and RLdata10000, which differ in the number of records. The data were created randomly from German name statistics and have no relation to existing persons

D.                 Functionality

The package includes two functions for the creation of comparison patterns from data sets: compare.dedup for de-duplication of a single data set and compare.linkage for linking two data sets together. In the case of three or more data sets, iteratively two of them are linked and replaced by the data set which is the result of the linkage. This leads to n-1 linkages for n data sets. Both compare functions return an object of class "RecLinkData" which includes, among other components, the resulting comparison patterns as component pairs. In the following paragraphs these functions will be briefly described. For more details see [16] and the user manual of the package.

Exploratory analysis: no specific functionality is provided in the package.

Data cleaning and standardization: First steps in data pre-processing usually include standardization of data, for example conversion of accented characters or enforcing a well-defined date format. However, such methods are not included in RecordLinkage. In the package, the user is responsible for providing the data in the form the package expects it.

Indexing/Blocking: In order to perform the de-duplication or record linkage process it is necessary to specify the indexing method and the corresponding indexing keys. A blocking specification can be supplied to the compare functions via the argument blockfld. The most simple specification is a vector of column indices denoting the attributes on which two records must agree (possibly after applying a phonetic code, see below) to appear in the output. Combining several such specifications in a list leads to the union of the sets obtained by the individual application of the specifications.

Comparison: Package RecordLinkage includes the popular Soundex algorithm for English and a German language algorithm introduced by Michael (1999), implemented through functions soundex and pho_h respectively. The argument phonetic of the compare functions controls the application of the phonetic function, which defaults to pho_h and can be set by argument phonfun. Typically, an integer vector is supplied which specifies the indices of the data columns for which a phonetic code is to be computed before comparison with other records. The algorithms by Winkler (1990) (function jarowinkler) and one based on the edit distance by Levenshtein (function levenshteinSim) are included in the package. String comparison and phonetic encoding cannot be used simultaneously on one attribute but can be applied to different attributes within one set of comparison patterns.

Weighting vector classification: The record linkage problem could be understood as a classification problem and a number of methods can be utilized for this purpose. With this view, a vast range of machine learning procedures, such as clustering methods, decision trees or support vector machines, becomes available for de-duplicating or linking personal data. Currently, the   supported methods are:

1)      "rpart" Recursive partitioning trees, provided by package rpart (Therneau et al., 2009).

2)      "bagging" Bagging of decision trees, provided by package ipred (Peters and Hothorn, 2009).

3)      "ada" Stochastic boosting, provided by package ada (Culp et al., 2006).

4)      "svm" Support vector machines, provided by package e1071 (Dimitriadou et al., 2009).

5)      "nnet" Single-hidden-layer neural n etworks, provided by package e1071 (ibid.).

E.                 Natural language support

English and German are supported

F.                  Output, Evaluation and Clerical review

The output is and R data object which can be subjected to further processing using R functions or printed by the user.

Improved print facilities for clerical review, such as highlighting of agreement or disagreement in record pairs will be added in future versions.

G.                Graphical User Interface (GUI)

Not available

H.                Extendibility

The complete source code is available in R and a user skilful in R programming can easily extend the package by adding new functions.

I.                   Privacy preserving

The privacy preserving issue is not considered in the current implementation

J.                  Specific functions

None

 

V.                 SUMMARY

The present study, after discussing the basics of record linkage and referring to the appropriate literature for further details investigates several open source software packages for record linkage. These packages are Febrl developed by the Australian National University, RELAIS developed by ISTAT and the R package RecordLinkage, available at CRAN

The study is considered a work in progress which will be further extended to cover further  experience with the considered software packages for record linkage as well as to monitor the appearance of new versions of these and other tools. Of particular interest is the applicability of the software to different natural languages (e.g. Arabic) and their usefulness for maintenance of a statistical  Business register.


VI.              ACKNOWLEDGEMENTS

We would like to thank to all people that contributed to this study, especially the members of the Sharing Advisory Board (SAB) for provided information, comments and suggestions.


VII.           REFERENCES

[1] Christen, P., Zhu, J.X., Hegland, M., Roberts, S., Nielsen, O.M., Churches, T. & Lim, K. (2002), High-performance computing techniques for record linkage, in ‘Australian Health Outcomes Conference’  AHOC’02), Canberra.

[2] Christen, P., Churches, T. & Hegland, M. (2004), Febrl – A parallel open source data linkage system, in ‘Pacific-Asia Conference on Knowledge Discovery and Data Mining’ (PAKDD’04), Sydney, Springer LNAI 3056, pp. 638–647.

[3] Christen, P., Willmore, A. & Churches, T. (2006), A probabilistic geocoding system utilising a parcel based address file, in ‘Selected Papers from AusDM’, Springer LNCS 3755, pp. 130–145.

[4] Christen, P. (2010), Febrl –A Freely Available Record Linkage System with a Graphical User Interface.

[5] Cohen W.W., Ravikumar P. & Fienberg S.E. (2003), A comparison of string distance metrics for name-matching tasks, in ‘IJCAI-03 Workshop on Information Integration on the Web’ (IIWeb-03), Acapulco, pp. 73–78.

[6] Williams, G.J. (2007), Data Mining with Rattle and R, Togaware, Canberra. Software available at: http://datamining.togaware.com/survivor/

[7] Diniz da Silva, A. (2010), Study of Record Linkage Software for the 2010 Brazilian Census Post Enumeration Survey

[8] Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) Maximum Likelihood from Incomplete Data via the EM Algorithm. JRSS B 39:1-38.

[9] Fellegi, I.P.  and Sunter, A.B. (1969), A Theory for Record Linkage. JASA. 64:1183-1210.

[10] Froeschl, K.A (1997), Metadata Management in Statistical Information Processing. Springer, Wien, Berlin.

[11] Froeschl, K.A. (1999). Metadata Management in Official Statistics - An IT-based Methodology Approach. Austrian Journal of Statistics. 28(2):49-79.

[12] Froeschl, K.A. and Grossmann, W. (2000), The Role of Metadata in Using Administrative Sources. Research in Official Statistics. 3(1):65-82.

[13] Tuoto T., Cibella N., Fortini M., Scannapieco M., Tosco T. (2007) RELAIS: Don't Get Lost in a Record Linkage Project, Proc. of the Federal Committee on Statistical Methodologies (FCSM 2007) Research Conference, Arlington, VA, USA.

[14] ISTAT (2010), RELAIS 2.2 User guide

[15] R Development Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

[16] Sariyar, M and Borg, A. (2012) RecordLinkage: Detecting Errors in Data. URL: http://CRAN.R-project.org/package=RecordLinkage, R package version 1.0

[17] ESSnet DI (2010), WP3: Software tools for integration methodologies

 

  • No labels