Boselli, R., Cesarini, M., Mercorio, F., Mezzanzanica, M. (2015). Accurate data cleansing through model checking and machine learning techniques. In M. Helfert, A. Holzinger, O. Belo, C. Francalanci (Eds.), Data Management Technologies and Applications. Third International Conference, DATA 2014, Vienna, Austria, August 29-31, 2014, Revised Selected Papers (pp. 62-80). Springer [10.1007/978-3-319-25936-9_5].
Accurate data cleansing through model checking and machine learning techniques
Boselli, Roberto; Cesarini, Mirko; Mercorio, Fabio; Mezzanzanica, Mario
2015
Abstract
Most researchers agree that real-life data archives are often of very poor quality, which makes the definition and implementation of automatic data cleansing techniques a relevant issue. In this scenario, the Universal Cleansing framework has recently been proposed to automatically identify the most accurate cleansing alternatives among those synthesised through model-checking techniques. However, some values of the cleansed instances are still identified through rules defined by domain experts and common practice, because they are difficult to derive automatically (e.g. the date of an event to be added). In this paper we extend the framework with well-known machine learning algorithms, trained on the data recognised as consistent, to identify the information that the model-based cleanser could not produce. The proposed framework has been implemented and successfully evaluated on a real dataset describing the working careers of a population.
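As a rough illustration of the idea in the abstract, the sketch below trains a predictor on records assumed to be consistent and uses it to estimate the date of an event that a cleanser has decided to add. The abstract does not name the algorithms used, so the k-nearest-neighbours regressor here is a stand-in. The feature layout, the sample rows, and the names `knn_predict` and `impute_event_date` are all hypothetical.

```python
from datetime import date, timedelta


def knn_predict(train, query, k=3):
    """Average the targets of the k training rows closest to `query`
    (squared Euclidean distance on the feature tuples)."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)),
    )[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical training rows taken from careers already recognised as consistent.
# features: (days between the two surrounding events, 1 if full-time else 0)
# target:   days from the earlier event to the event that lies in between
consistent = [
    ((30, 1), 15),
    ((32, 1), 17),
    ((28, 1), 13),
    ((365, 0), 300),
    ((360, 0), 290),
]


def impute_event_date(prev_date, next_date, full_time, k=3):
    """Estimate the date of an event to be inserted between two known events."""
    gap = (next_date - prev_date).days
    offset = knn_predict(consistent, (gap, int(full_time)), k)
    # Clamp so the imputed event never falls outside the surrounding events.
    offset = max(0, min(gap, round(offset)))
    return prev_date + timedelta(days=offset)


predicted = impute_event_date(date(2010, 1, 1), date(2010, 1, 31), full_time=True)
print(predicted)
```

In this sketch the cleanser is assumed to have already decided which event to add and where it sits in the sequence; the learned model only supplies the value (the date) that rules alone would otherwise have to fix by convention.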