Bicocca Open Archive

Improving Data Cleansing Accuracy: A model-based Approach

Mezzanzanica, Mario; Boselli, Roberto; Cesarini, Mirko; Mercorio, Fabio

2014

Abstract

Research on data quality is growing in importance in both industrial and academic communities, as it aims at deriving knowledge (and then value) from data. Information systems generate large amounts of data useful for studying the dynamics of subjects' behaviours or phenomena over time, which makes data quality crucial for guaranteeing the believability of the overall knowledge discovery process. In such a scenario, data cleansing techniques, i.e., automatic methods to cleanse a dirty dataset, are paramount. However, when multiple cleansing alternatives are available, a policy is required for choosing between them. Policy design still relies on the experience of domain experts, which makes the automatic identification of accurate policies a significant issue. This paper extends the Universal Cleaning Process to enable the automatic generation of an accurate cleansing policy derived from the dataset to be analysed. The proposed approach has been implemented and tested on an online benchmark dataset, a real-world instance of the Labour Market Domain. Our preliminary results suggest that the approach contributes towards the generation of data-driven policies, significantly reducing the intervention of domain experts in policy specification. Finally, the generated results have been made publicly available for download.

Contribution type: paper
Keywords: Data and Information Quality; Data Cleansing; Data Accuracy; Weakly-Structured Data
Content language: English
Conference name: 3rd International Conference on Data Technologies and Applications
Conference year: 2014
Authors: Mezzanzanica, M; Boselli, R; Cesarini, M; Mercorio, F
Proceedings title: Proceedings of the 3rd International Conference on Data Technologies and Applications (DATA)
ISBN of the proceedings volume: 978-989-758-035-2
Publication date: 2014
First page: 189
Last page: 201
DOI: https://dx.doi.org/10.5220/0005004901890201
Alternative URL: http://www.dataconference.org/PreviousAwards.aspx
Full text: open
Citation: Mezzanzanica, M., Boselli, R., Cesarini, M., Mercorio, F. (2014). Improving Data Cleansing Accuracy: A model-based Approach. In Proceedings of the 3rd International Conference on Data Technologies and Applications (DATA) (pp.189-201). Insticc [10.5220/0005004901890201].
Appears in types: 02 - Conference contribution

Files in this item:

File	Size	Format
DATA2014.pdf (open access)	3.02 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/52825
