Most researchers agree that the quality of real-life data archives is often very poor, which makes the definition and realisation of automatic data cleansing techniques a relevant issue. In this scenario, the Universal Cleansing framework has recently been proposed to automatically identify the most accurate cleansing alternatives among those synthesised through model-checking techniques. However, identifying some values of the cleansed instances still relies on rules defined by domain experts and on common practice, because these values are difficult to derive automatically (e.g. the date value of an event to be added). In this paper we extend the framework with well-known machine learning algorithms, trained on the data recognised as consistent, to identify the information that the model-based cleanser could not produce. The proposed framework has been implemented and successfully evaluated on a real dataset describing the working careers of a population.

Boselli, R., Cesarini, M., Mercorio, F., Mezzanzanica, M. (2015). Accurate data cleansing through model checking and machine learning techniques. In M. Helfert, A. Holzinger, O. Belo, C. Francalanci (Eds.), Data Management Technologies and Applications. Third International Conference, DATA 2014, Vienna, Austria, August 29-31, 2014, Revised Selected Papers (pp. 62-80). Springer [10.1007/978-3-319-25936-9_5].

Accurate data cleansing through model checking and machine learning techniques

BOSELLI, ROBERTO (first author);
CESARINI, MIRKO (second author);
MERCORIO, FABIO (penultimate author);
MEZZANZANICA, MARIO (last author)
2015

Book chapter or essay
Data cleansing, Model based data quality, Machine learning, Labour market data
English
Data Management Technologies and Applications. Third International Conference, DATA 2014, Vienna, Austria, August 29-31, 2014, Revised Selected papers
Helfert, M; Holzinger, A; Belo, O; Francalanci, C
2015
978-3-319-25935-2
178
Springer
62
80


Use this identifier to cite or link to this document: https://hdl.handle.net/10281/105629