A description is mandatory to explain the key functions and the context in which the system performs. A length varying from 500 to 800 words seems reasonable to describe a system. A description that goes over that limit is too complex for the goal of the knowledge base. The description should mention the person(s) or the institute(s) that developed the system, as well as any supporting material such as papers, web pages, etc.
RATING A PRODUCT
Each criterion is to be evaluated using the following scale:
Since the description has no specific structure nor content, there is a need for specific criteria on which every system should be evaluated. The systems would be scored according to each individual criterion using the rating scale above.
A) TYPE OF DATA:
Since there are very few systems that process mixtures of data, it is important to mention what type of data the system can process. The evaluation should inform users how well the system processes both quantitative and qualitative data.
Quantitative data: We say that a system processes quantitative data when it uses algebraic operators or functions to process many continuous numeric variables in the editing or imputation steps.
Qualitative data: A system processes qualitative data when its editing or imputation functions deal with many variables describing attributes, properties or unordered categories. Although some qualitative variables have ordered values (e.g., classes defining small/large, poor/rich, young/old, etc.) represented by numeric character sets, they usually should not be processed with algebraic functions.
B) EDITING FUNCTIONS:
The process of detecting and handling errors in data includes the definition of a consistent set of requirements, their verification on given data, and the elimination or substitution of data which is in contradiction with the defined requirements. The following functions are the most frequently used and should serve as evaluation criteria.
Data verification: The system can verify the validity of the data. This can be done at different levels (data item, questionnaire section, whole record) with various techniques (answer code, list of valid values, data format, range edit, relationship between variables, comparison with other records, historical comparisons, etc.). The process may suggest various actions like follow-ups, manual corrections, etc.
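As an illustration of record-level verification, the following minimal sketch expresses a range edit, a list of valid values and a relationship between variables as simple predicates. The field names, codes and thresholds are hypothetical and are not taken from any particular system.

```python
from typing import Callable, Dict, List

# Hypothetical edit rules: each predicate returns True when the record passes.
Edit = Callable[[dict], bool]

EDITS: Dict[str, Edit] = {
    # Range edit on a single data item.
    "age_in_range": lambda r: 0 <= r["age"] <= 120,
    # List of valid values for a coded answer.
    "valid_marital_code": lambda r: r["marital"] in {"single", "married", "widowed", "divorced"},
    # Relationship between two variables of the same record.
    "minor_not_married": lambda r: not (r["age"] < 15 and r["marital"] == "married"),
}

def failed_edits(record: dict) -> List[str]:
    """Return the names of the edits that the record fails."""
    return [name for name, passes in EDITS.items() if not passes(record)]

print(failed_edits({"age": 12, "marital": "married"}))  # ['minor_not_married']
```

The list of failed edits could then drive a follow-up, a manual correction or an automated treatment, depending on the survey design.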
On-line correction: The system allows direct access to a database in order to modify values with the help of follow-up results, historical data, editor judgement or any other source of information which needs interpretation. A good system would allow concurrent users and would keep track of the changes each user makes.
Error localization: For the records in error, the system automatically identifies the fields which need to be modified in order to create good records. For a record in error, the error localization can be performed independently of the imputation process, or as part of an imputation method. A good error localization function should work with almost no manual intervention.
Minimum changes: As part of the error localization process, the system identifies the minimum number of fields to be imputed, or it minimizes the overall magnitude of the changes, or else, it minimizes other metrics. A good implementation should be able to solve the optimization problem even with high numbers of edit rules and variables.
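The minimum-change idea can be sketched for a small categorical record as an exhaustive search: try field subsets of increasing size and stop at the first subset for which some combination of values makes the record pass every edit. The domains and edits below are hypothetical, and the brute-force search is only an illustration; it would not scale to the large numbers of rules and variables mentioned above, where a dedicated optimization method is needed.

```python
from itertools import combinations, product

# Hypothetical value domains for three categorical fields.
DOMAINS = {
    "age_group": ["child", "adult"],
    "marital": ["single", "married"],
    "employed": ["yes", "no"],
}

# Hypothetical edits: each returns True when the record is consistent.
EDITS = [
    lambda r: not (r["age_group"] == "child" and r["marital"] == "married"),
    lambda r: not (r["age_group"] == "child" and r["employed"] == "yes"),
]

def passes(record):
    return all(edit(record) for edit in EDITS)

def minimum_change_fields(record):
    """Return a smallest tuple of fields whose modification can fix the record."""
    fields = list(DOMAINS)
    for size in range(len(fields) + 1):
        for subset in combinations(fields, size):
            # Try every combination of values for the chosen fields.
            for values in product(*(DOMAINS[f] for f in subset)):
                candidate = {**record, **dict(zip(subset, values))}
                if passes(candidate):
                    return subset
    return None  # the edits cannot be satisfied at all

bad = {"age_group": "child", "marital": "married", "employed": "yes"}
print(minimum_change_fields(bad))  # ('age_group',)
```

Changing the single field age_group resolves both failures, whereas keeping it would require changing two fields.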
User-defined changes: The system allows the user to specify which fields have to be modified for specific combinations of edit failures. This approach is the opposite of the automated error localization explained above. Ideally, the system will be flexible enough to allow a wide variety of user-defined rules.
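A user-defined approach can be pictured as a lookup table from combinations of failed edits to the fields that must be changed. The edit names and field names in this sketch are hypothetical.

```python
# Hypothetical user-defined rules: combination of failed edits -> fields to impute.
CHANGE_RULES = {
    frozenset({"minor_not_married"}): ["marital"],
    frozenset({"minor_not_married", "minor_not_employed"}): ["age"],
}

def fields_to_change(failed_edits):
    """Return the fields the user asked to modify for this failure pattern."""
    # An unknown pattern yields an empty list, leaving the record for review.
    return CHANGE_RULES.get(frozenset(failed_edits), [])

print(fields_to_change(["minor_not_employed", "minor_not_married"]))  # ['age']
```

Using a frozenset as the key makes the lookup independent of the order in which the failures are reported.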
Outlier detection: The system identifies data values that lie in the tail of the statistical distribution of the variable. Outliers may have to be imputed, or at least must not be used in the process of imputing other units because they are too different from the other values. A good function would let the user define the outlier criteria depending on the characteristics of the variable.
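One common way to let the user define the outlying criterion is a quartile-based fence with an adjustable multiplier. The sketch below assumes this method and a hypothetical multiplier parameter; it is one possible criterion among many, not the one any given system implements.

```python
import statistics

def quartile_outliers(values, multiplier=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], where k is user-defined."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [v for v in values if v < lower or v > upper]

turnover = [10, 12, 11, 13, 12, 11, 95]
print(quartile_outliers(turnover))
```

A larger multiplier makes the criterion more tolerant, which may suit highly skewed variables.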
C) IMPUTATION FUNCTIONS:
The imputation process consists of replacing missing or unusable values with other values. Post-imputation rules may be required to make sure the process provides valid values. Following are some of the most popular imputation techniques the system should be evaluated against.
Deterministic imputation: The system can identify the situations where there is only one possible value that can be used to impute a given field. Ideally, such a function works not only with independent edit rules, but also with cross-edit rules.
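A simple case of deterministic imputation arises with a balance edit, where the components must add up to a total: if exactly one of these fields is missing, its value is fully determined by the others. The sketch below assumes such an edit and hypothetical field names.

```python
def impute_from_balance(record, components, total):
    """Fill the single missing field of the edit sum(components) == total."""
    fields = components + [total]
    missing = [f for f in fields if record[f] is None]
    if len(missing) != 1:
        return record  # no value is uniquely determined by this edit
    target = missing[0]
    filled = dict(record)
    if target == total:
        filled[total] = sum(record[c] for c in components)
    else:
        known = sum(record[c] for c in components if c != target)
        filled[target] = record[total] - known
    return filled

record = {"wages": 70, "other_income": None, "total_income": 100}
print(impute_from_balance(record, ["wages", "other_income"], "total_income"))
```

The imputed record should still be checked against the remaining edits, since a value determined by one rule may violate another.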
Donor imputation: Here, a unit (donor) is identified from the set of valid records, and its data are used for imputation purposes. The donor selection can be anywhere between the random process and the nearest neighbour approach. A good implementation would allow the use of constraints or specific matches to reduce the set of potential donors for each record to be imputed. It would also verify that a minimum number of donors are available to impute the units in error.
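The nearest neighbour end of that spectrum can be sketched as follows: donors are restricted to records matching the recipient on a class variable, a minimum pool size is enforced, and the closest donor on an auxiliary variable supplies the value. All field names and the min_donors parameter are hypothetical.

```python
def nearest_donor_impute(recipient, donors, match_on, distance_on, target, min_donors=2):
    """Impute `target` from the nearest donor within the same class."""
    # Constrain the donor pool to valid records of the same class.
    pool = [d for d in donors
            if d[match_on] == recipient[match_on] and d[target] is not None]
    if len(pool) < min_donors:
        return None  # too few donors: leave the unit for another method
    donor = min(pool, key=lambda d: abs(d[distance_on] - recipient[distance_on]))
    return {**recipient, target: donor[target]}

donors = [
    {"region": "north", "employees": 12, "revenue": 300},
    {"region": "north", "employees": 48, "revenue": 1500},
    {"region": "south", "employees": 20, "revenue": 450},
]
recipient = {"region": "north", "employees": 15, "revenue": None}
print(nearest_donor_impute(recipient, donors, "region", "employees", "revenue"))
```

Replacing the min call with a random choice from the pool would give the random end of the spectrum.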
Imputation by estimators: Various estimator functions, also called models, can be used to impute data: mean, ratio, trend, regression, etc. They can be applied to a mixture of current data, administrative data and historical data. As with the donor approach, the estimators should allow the definition of models within sub-populations, with constraints on the number of records required to estimate the model.
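As an example of an estimator, ratio imputation replaces a missing value y by R times x, where x is an auxiliary variable and R is the ratio of the sums of y and x over the responding records. The sketch below assumes this model, applied to one sub-population, with a hypothetical min_records constraint.

```python
def ratio_impute(records, x, y, min_records=3):
    """Impute missing y as (sum of y / sum of x over respondents) * x."""
    respondents = [r for r in records if r[y] is not None]
    if len(respondents) < min_records:
        raise ValueError("not enough records to estimate the ratio")
    ratio = sum(r[y] for r in respondents) / sum(r[x] for r in respondents)
    return [r if r[y] is not None else {**r, y: ratio * r[x]} for r in records]

stratum = [
    {"last_year": 10, "this_year": 20},
    {"last_year": 20, "this_year": 40},
    {"last_year": 30, "this_year": 60},
    {"last_year": 40, "this_year": None},
]
print(ratio_impute(stratum, "last_year", "this_year")[-1])
```

Calling the function once per sub-population gives a separate model for each imputation class.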
Multiple imputation: The idea of this approach is to generate several imputation runs based on a unique method, in order to produce variability and quality indicators. The implementation has to offer general models, with a user-defined number of replications of the process.
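The replication idea can be sketched with a random donor draw repeated a user-defined number of times, recording the estimate obtained from each completed data set. This is a simplified illustration under stated assumptions: the random draw from observed values stands in for the imputation model, and only the between-replicate variance is computed, whereas a full multiple imputation analysis also accounts for the within-replicate variance.

```python
import random
import statistics

def multiple_imputation_means(values, replications=5, seed=0):
    """Impute missing values several times and summarize the estimated means."""
    if replications < 2:
        raise ValueError("at least two replications are needed")
    rng = random.Random(seed)  # seeded for reproducible runs
    observed = [v for v in values if v is not None]
    estimates = []
    for _ in range(replications):
        # Each replicate fills every gap with a randomly drawn observed value.
        completed = [v if v is not None else rng.choice(observed) for v in values]
        estimates.append(statistics.mean(completed))
    point = statistics.mean(estimates)
    between_variance = statistics.variance(estimates)
    return point, between_variance, estimates

point, between_variance, estimates = multiple_imputation_means(
    [10, 12, None, 11, None, 13], replications=5)
print(point, between_variance)
```

The spread of the replicate estimates gives an indication of the uncertainty added by imputation.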
D) GENERAL FEATURES:
Other criteria have to be considered in the evaluation of systems. They are mostly related to the features and tools that make the system accessible to inexperienced users.
Graphical user interface: This refers to a menu structure which helps in the setting up of functions and the submission of computer jobs. The interface should preferably be mouse-driven and intuitive from the user's point of view. Functions that are similar should be implemented similarly in the interface.
User-friendliness: How easy and how quickly can the user learn the system? Is it interesting to use? Examples of good practices are the transfer of settings among versions of the system, among job submissions and among repetitive processes, the choice of non-aggressive colours that help differentiate concepts, functions or outputs, the parameter inputs specified in an intuitive order, the outputs being easy to read, etc.
On-line help: On-line help provides information on the spot. Ideally, it should provide a description and some instructions on the active module by default. It should also offer a navigation option to obtain information on any module.
On-line tutorial: A tutorial is an important component of the training material. It should offer an opportunity to play with live data as opposed to simply providing a static show.
Documentation: A well documented system should offer a detailed methodological description of its functions, system documentation which details the data flow and its processing, and complete user documentation.
Diagnostic reports: The diagnostics include information and statistics that help users to monitor how well the process is performing. Examples are: the number of records or data items which failed edits, the number of records in a donor pool, etc. Good diagnostic reports have to be easy to read, with enough statistics to understand what happened but not so many that the useful information is drowned in a mass of numbers.
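A compact diagnostic of the kind described above can be sketched as a count of failing records and of failures per edit. The edits and field names are hypothetical.

```python
from collections import Counter

def edit_failure_report(records, edits):
    """Summarize how many records fail and how often each edit fails."""
    per_edit = Counter()
    failing_records = 0
    for record in records:
        failures = [name for name, passes in edits.items() if not passes(record)]
        per_edit.update(failures)
        if failures:
            failing_records += 1
    return {
        "records": len(records),
        "records_failing": failing_records,
        "failures_by_edit": dict(per_edit),
    }

edits = {
    "age_in_range": lambda r: 0 <= r["age"] <= 120,
    "income_non_negative": lambda r: r["income"] >= 0,
}
data = [
    {"age": 34, "income": 2500},
    {"age": 150, "income": -10},
    {"age": 41, "income": -5},
]
print(edit_failure_report(data, edits))
```

Keeping the summary to a few counts per edit follows the advice above to avoid burying the useful information.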
Integration: This feature refers to the flow of data across various survey steps, including the editing and imputation processes. A well integrated suite of systems would allow the processing of data without requiring any reformat or pre/post processor. For instance, the input of an edit and imputation system usually comes from a collection/capture system, and its output goes to an estimation system. This creates a need for an integrated suite. A record linkage process may also be required to get auxiliary information for the data editing and imputation. A whole data stream developed on a unique platform would be desired, but the level of complexity of each step often justifies the choice of different platforms.
Reusable code: Reusability refers to the possibility of using the system to process data from various censuses, surveys or administrative sources. We also refer to this as the generalization of the function, which allows the development of specific applications simply by changing input parameters.
Portability: A portable system is relatively easy to install because it requires only simple foundation software, if any. For instance, a set of executable files which only need to be copied to a Windows environment is portable. If the product can be installed on other platforms as well (UNIX, MVS, etc.), it is a bonus.
Flexibility: A system built from individual and self-contained modules is flexible. These modules should be embeddable in another survey stream whenever needed. This allows users to replace some modules with their own customized functions, or to add modules easily. Open source code may be considered here to facilitate these additions.
User support: User support via telephone and/or e-mail is desired if it cannot be provided in person.
Acquisition cost: If applicable, the acquisition cost should be mentioned in an evaluation.