Evaluated by J.-M. Fillion, Statistics Canada, 2002

SYSTEM INFORMATION

Full name:

LEO - Generalized error localization

Version:

1

Year:

2002

Developer:

Statistics Netherlands

DESCRIPTION

LEO is a prototype system developed by Statistics Netherlands to solve the error localization problem for mixtures of quantitative and qualitative variables. It was the basis for the development of CherryPie within the SLICE system (see the evaluation of CherryPie). It was developed in Delphi with an interface for the Windows environment.

Editing rules identify the conditions that a record must satisfy in order to be a good record. If one or more rules are violated, then fields to be changed must be identified so that the rules can be satisfied. The edit rules have the following general form:

     IF  v1 ∈ V1 , ... , vm ∈ Vm
     THEN  a1 x1 + ... + an xn + b >= 0   (or a1 x1 + ... + an xn + b = 0)

where

     vi :  the i-th categorical variable ( i = 1,..,m )

     xi :  the i-th numerical variable ( i = 1,..,n )

     Vi :  the set of values defining the IF-condition on vi

     a1,..,an , b :  the coefficients and constant term defining the THEN-condition

It is possible to define rules that are only qualitative, only quantitative, or a mixture of both. LEO tries to minimize the number of variables to be imputed. It also allows a weight to be assigned to each variable when the user wishes to influence which fields are identified for imputation.
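To make this rule structure concrete, here is a minimal Python sketch (LEO itself was written in Delphi) of how a mixed edit rule could be represented and checked against a record. The class, field names, and the example rule are hypothetical illustrations, not taken from LEO.

    from dataclasses import dataclass

    @dataclass
    class Edit:
        """One edit rule: IF the categorical conditions hold,
        THEN the linear constraint on the numerical variables must hold."""
        if_in: dict                # {categorical variable: set of admissible values}
        coeffs: dict               # {numerical variable: coefficient ai}
        constant: float = 0.0      # constant term b
        is_equality: bool = False  # True for '= 0', False for '>= 0'

    def satisfies(record: dict, edit: Edit, tol: float = 1e-9) -> bool:
        """Return True if the record satisfies this edit rule."""
        # If any IF-condition fails, the rule imposes nothing on this record.
        if any(record[v] not in values for v, values in edit.if_in.items()):
            return True
        lhs = sum(a * record[x] for x, a in edit.coeffs.items()) + edit.constant
        return abs(lhs) <= tol if edit.is_equality else lhs >= -tol

    # Hypothetical rule: for manufacturers, turnover must be at least total costs.
    rule = Edit(if_in={"type": {"manufacturer"}},
                coeffs={"turnover": 1.0, "costs": -1.0})
    print(satisfies({"type": "manufacturer", "turnover": 100.0, "costs": 80.0}, rule))  # True
    print(satisfies({"type": "manufacturer", "turnover": 50.0, "costs": 80.0}, rule))   # False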

The main idea behind the algorithm is to build a binary tree (branch-and-bound method) in which, at each step (or node), a variable is selected for analysis and the branching splits into two cases: (a) the variable is imputed or (b) it is not imputed. When a variable is not imputed, it is fixed at its original value in the set of edit rules, creating a new set of rules to be analyzed. When a variable must be imputed, it is removed from the rules using Fourier-Motzkin elimination.
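As an illustration of the elimination step only, the following Python sketch removes one numerical variable from a set of linear inequalities of the form a1 x1 + ... + an xn + b >= 0 by Fourier-Motzkin elimination. It is a simplified, hypothetical implementation, not LEO's code; it ignores the equality rules and the categorical conditions that LEO also handles.

    def fourier_motzkin(inequalities, var):
        """Eliminate `var` from a list of inequalities, each written as
        (coeffs, b) meaning sum(coeffs[v] * x[v]) + b >= 0.
        Returns the implied inequalities on the remaining variables."""
        zero, pos, neg = [], [], []
        for coeffs, b in inequalities:
            a = coeffs.get(var, 0.0)
            (zero if a == 0 else pos if a > 0 else neg).append((coeffs, b))

        result = list(zero)
        # Each pair (lower bound on `var` from a 'pos' row, upper bound from a
        # 'neg' row) yields one implied inequality that no longer mentions `var`.
        for cp, bp in pos:
            for cn, bn in neg:
                ap, an = cp[var], -cn[var]     # both strictly positive
                combined = {v: an * a for v, a in cp.items() if v != var}
                for v, a in cn.items():
                    if v != var:
                        combined[v] = combined.get(v, 0.0) + ap * a
                result.append((combined, an * bp + ap * bn))
        return result

    # Example: x - y >= 0 and 10 - x >= 0 imply 10 - y >= 0 once x is eliminated.
    print(fourier_motzkin([({"x": 1.0, "y": -1.0}, 0.0), ({"x": -1.0}, 10.0)], "x"))
    # [({'y': -1.0}, 10.0)]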

If the rules become inconsistent at a given step, the algorithm backtracks to a previous step and resumes the analysis there.
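The following toy Python sketch illustrates the tree search itself: each variable is branched on ("impute" versus "do not impute"), branches that cannot beat the best solution found so far are pruned, and the search backtracks when a branch is exhausted. To stay short and self-contained it handles only categorical variables and tests feasibility by brute force at the leaves, whereas LEO generates implied rules at each node and detects inconsistencies along the way; the record, domains, edit, and weights in the example are hypothetical.

    from itertools import product

    def localize_errors(record, domains, edits, weights):
        """Toy branch-and-bound error localization (categorical variables only).
        record  : {variable: observed value}
        domains : {variable: list of admissible values}
        edits   : list of predicates; each takes a candidate record and
                  returns True when the edit is satisfied
        weights : {variable: reliability weight}
        Returns the cheapest set of variables to impute, or None if none exists.
        """
        variables = list(record)
        best_set, best_cost = None, float("inf")

        def feasible(to_impute, fixed):
            # Brute force: can the imputed variables take values satisfying all edits?
            free = list(to_impute)
            for values in product(*(domains[v] for v in free)):
                candidate = {**fixed, **dict(zip(free, values))}
                if all(edit(candidate) for edit in edits):
                    return True
            return False

        def branch(i, to_impute, cost):
            nonlocal best_set, best_cost
            if cost >= best_cost:                 # bound: cannot improve, prune
                return
            if i == len(variables):               # leaf of the binary tree
                fixed = {v: record[v] for v in variables if v not in to_impute}
                if feasible(to_impute, fixed):
                    best_set, best_cost = set(to_impute), cost
                return                            # backtrack
            v = variables[i]
            branch(i + 1, to_impute, cost)                      # (b) do not impute v
            branch(i + 1, to_impute | {v}, cost + weights[v])   # (a) impute v

        branch(0, set(), 0.0)
        return best_set

    # Hypothetical example: a child cannot be recorded as married.
    record  = {"age": "child", "marital": "married"}
    domains = {"age": ["child", "adult"], "marital": ["single", "married"]}
    edits   = [lambda r: not (r["age"] == "child" and r["marital"] == "married")]
    weights = {"age": 1.0, "marital": 1.0}
    print(localize_errors(record, domains, edits, weights))  # {'marital'} ({'age'} is equally cheap)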

STRENGTHS

  • The algorithm is not complex and is therefore easy to program. Furthermore, the approach allows the problem to be split into several components (or modules).
  • It is possible to process categorical and numerical variables simultaneously. The edit rules can include both types together.
  • It can process a large number of variables (more than 100) in one run.
  • The system allows the processing of negative values.
  • Several parameters can be defined by the user: the weights associated with the variables, a flag or value indicating that data are missing, etc.
  • All the solutions with minimum change are identified, and one of them is randomly selected for imputation.
  • The algorithm can be modified easily to add new components, such as the processing of integer values (already included in a newer version of LEO).
  • The algorithm and the performance of LEO are well documented.

WEAKNESSES

(Note that some of these weaknesses were resolved in CherryPie.)

  • The maximum number of fields to be imputed must not be too high in order to keep the binary tree at a reasonable size. It is recommended not to impute more than five fields. This low limit is restrictive, especially since it also includes missing values.
  • No approximate solution is provided when no optimal solution is found.
  • The binary tree must be visited entirely to guarantee that the solutions are optimal, even when the optimal solutions are found early in the tree.
  • The standard verification rules are not very flexible: the IF-conditions can only include categorical variables, while the THEN-conditions can only include numerical variables.
  • The number of implicit rules kept at each step may become very large, which can bring LEO into an unstable state.
  • It is not possible to specify a time limit for the processing of each record.
  • In some cases, precision problems may occur, especially when rules with large coefficients are combined.

FUNCTIONAL EVALUATION

LEGEND

     ***  The implementation offers the sub-functions or options required by a wide range of survey applications.

     **   The implementation has a less complete set of options.

     *    The implementation offers partial functionality; options are too restrictive or not generalized enough.

     -    No stars are assigned when the functionality is not offered at all.


TYPE OF DATA

     Quantitative data          ***
     Qualitative data           **

EDITING FUNCTIONS

     Data verification          *
     On-line correction         -
     Error localization         ***
     Minimum changes            ***
     User-defined changes       -
     Outlier detection          -

IMPUTATION FUNCTIONS

     Deterministic imputation   -
     Donor imputation           -
     Imputation by estimators   -
     Multiple imputation        -

GENERAL FEATURES

     Graphical user interface   ***
     User-friendliness          **
     On-line help               *
     On-line tutorial           -
     Documentation              ***
     Diagnostic reports         -
     Integration                **
     Reusable code              ***
     Portability                ***
     Flexibility                **
     User support               -
     Acquisition cost           -

REFERENCES

Quere, R. and De Waal, T. (2000). "Error Localization in Mixed Data Sets". Statistics Netherlands Technical Report.

De Waal, T. (2000). "An Optimality Proof of Statistics Netherlands' New Algorithm for Automatic Editing of Mixed Data". Statistics Netherlands Technical Report.
