Evaluated by T. de Waal, Statistics Netherlands, 2003
CherryPie was developed by Statistics Netherlands to solve the error localization problem for mixtures of quantitative and qualitative variables. It was developed in a Visual C++ environment based on the LEO prototype, but with no interface (see the evaluation of LEO). CherryPie is part of SLICE, which is a collection of editing and imputation modules.
Editing rules identify the conditions to be satisfied in order for a record to be a good record. If one or more rules are violated then fields to be changed must be identified in order to satisfy the rules. The edit rules are defined as general IF-THEN statements, as opposed to LEO which expects the IF-statements to include categorical variables and the THEN-statements to include numerical variables.
The main idea behind the algorithm is to build a binary tree (branch-and-bound method) where at each step (or nod), a variable is selected for analysis and then it is split in two categories: (a) to impute or (b) not to impute the variable. In the case where a variable is not imputed, it is fixed to its initial value in the set of edit rules to create a new set of rules to be analyzed. In the case where a variable must be imputed, it is removed from the rules by using a Fourier-Motzkin technique. If in a given step, we get inconsistent rules, then we go back to a previous step to resume the analysis there.
CherryPie generates a list of all optimal solutions in order to minimize the number of variables to be imputed. If the users want to, they can develop a module to select one of these solutions themselves. Such a component may be based on subject-matter knowledge. The alternative consists in selecting one of the solutions at random. CherryPie allows the use of weights for each variable when the user wishes to exert some influence on the identification of the fields to be imputed.
In CherryPie, the maximum number of fields to be imputed may be around 10, in comparison to 5 in LEO. The system processes the errors and the missing values independently. Although the limit on the number of errors is relatively low (about 10 as mentioned above), the limit on the number of missing values can be as high as 50, or even more.
SLICE provides an independent module for finding an approximated solution in case the CherryPie branch-and-bound algorithm fails. CherryPie usually terminates quickly after it has found an optimal solution. Many branches of the trees can then be cut off. Therefore, visiting the entire tree is not always necessary. CherryPie offers a parser that reads general IF-THEN rules and parses them to rules that CherryPie can read. The general IF-THEN rules accept numerical or categorical variables in both the IF-conditions and the THEN-conditions. Finally, the user may specify a time limit for the processing of each record.
- Quere, R. and De Waal, T. (2000). "Error Localization in Mixed Data Sets". Statistics Netherlands Technical Report.
- De Waal, T. (2000). "An Optimality Proof of Statistics Netherlands' New Algorithm for Automatic Editing of Mixed Data". Statistics Netherlands Technical Report.