Skip to end of metadata
Go to start of metadata

Evaluated by T. de Waal, Statistics Netherlands, 2003

SYSTEM INFORMATION

Full name:

CherryPie - Generalized error localization

Version:

1

Year:

2003

Developer:

Statistics Netherlands

DESCRIPTION

CherryPie was developed by Statistics Netherlands to solve the error localization problem for mixtures of quantitative and qualitative variables. It was developed in a Visual C++ environment based on the LEO prototype, but with no interface (see the evaluation of LEO).  CherryPie is part of SLICE, which is a collection of editing and imputation modules.

Editing rules identify the conditions to be satisfied in order for a record to be a good record. If one or more rules are violated then fields to be changed must be identified in order to satisfy the rules. The edit rules are defined as general IF-THEN statements, as opposed to LEO which expects the IF-statements to include categorical variables and the THEN-statements to include numerical variables.

The main idea behind the algorithm is to build a binary tree (branch-and-bound method) where at each step (or nod), a variable is selected for analysis and then it is split in two categories: (a) to impute or (b) not to impute the variable. In the case where a variable is not imputed, it is fixed to its initial value in the set of edit rules to create a new set of rules to be analyzed. In the case where a variable must be imputed, it is removed from the rules by using a Fourier-Motzkin technique. If in a given step, we get inconsistent rules, then we go back to a previous step to resume the analysis there.

CherryPie generates a list of all optimal solutions in order to minimize the number of variables to be imputed. If the users want to, they can develop a module to select one of these solutions themselves. Such a component may be based on subject-matter knowledge. The alternative consists in selecting one of the solutions at random. CherryPie allows the use of weights for each variable when the user wishes to exert some influence on the identification of the fields to be imputed.

In CherryPie, the maximum number of fields to be imputed may be around 10, in comparison to 5 in LEO. The system processes the errors and the missing values independently. Although the limit on the number of errors is relatively low (about 10 as mentioned above), the limit on the number of missing values can be as high as 50, or even more.

SLICE provides an independent module for finding an approximated solution in case the CherryPie branch-and-bound algorithm fails. CherryPie usually terminates quickly after it has found an optimal solution. Many branches of the trees can then be cut off. Therefore, visiting the entire tree is not always necessary.  CherryPie offers a parser that reads general IF-THEN rules and parses them to rules that CherryPie can read. The general IF-THEN rules accept numerical or categorical variables in both the IF-conditions and the THEN-conditions. Finally, the user may specify a time limit for the processing of each record.

STRENGTHS

  • The algorithm is not complex and then it is easy to program. Furthermore, the approach allows splitting the problem in several components (or modules).
  • It is possible to process categorical and numerical variables simultaneously. The edit rules can include both types together.
  • It can process a high number of variables (more than 100 variables) in one run.
  • The system allows the processing of negative values.
  • Several parameters can be defined by the user: the weights associated to the variables, a flag or value that indicates the data is missing, etc.
  • All the solutions with minimum change are identified and only one is randomly selected for the imputation. In such a case, the user can also develop a module to select a specific solution.
  • The algorithm can be modified easily to add new components like the processing of integer values.
  • It processes errors and missing values independently.
  • The integrated parser makes edit rules more flexible than LEO.

WEAKNESSES

  • The maximum number of fields to be imputed must not be too high in order to keep the binary tree at a reasonable size. It is recommended not to imputed more than ten fields. Although this would satisfy most applications, it may represent a limitation for some large ones.
  • There is no approximated solution in the case where no optimal solution is found.
  • The binary tree must be entirely visited to make sure the solutions are optimal, even when the optimal solutions are found early in the tree.
  • The number of implicit rules that are kept at each step may become very large. This can bring CherryPie in an unstable state.

FUNCTIONAL EVALUATION

 LEGEND

***

The implementation offers sub-functions or options being required by a wide range of survey applications.  

**

The implementation has a less complete set of options.

*

The implementation offers a partial functionality. Options are too restrictive or not generalized enough.

-

No stars are assigned when the functionality is not offered at all.


TYPE OF DATA

 

 

Quantitative data

***

 

Qualitative data

***

 

EDITING FUNCTIONS

 

 

Data verification

*

 

On-line correction

-

 

Error localization

***

 

Minimum changes

***

 

User-defined changes

-

 

Outlier detection

-

 

IMPUTATION FUNCTIONS

 

 

Deterministic imputation

-

 

Donor imputation

-

 

Imputation by estimators

-

 

Multiple imputation

-

 

 GENERAL FEATURES

 

 

Graphical user interface

-

 

User-friendliness

**

 

On-line help

-

 

On-line tutorial

-

 

Documentation

***

 

Diagnostic reports

-

 

Integration

**

 

Reusable code

***

 

Portability

***

 

Flexibility

**

 

User support

-

 

Acquisition cost

-

 

REFERENCES

- Quere, R. and De Waal, T. (2000). "Error Localization in Mixed Data Sets". Statistics Netherlands Technical Report.

- De Waal, T. (2000). "An Optimality Proof of Statistics Netherlands' New Algorithm for Automatic Editing of Mixed Data".  Statistics Netherlands Technical Report.

  • No labels