Li, P. (2013). Linking records with value diversity. (Doctoral thesis, Università degli Studi di Milano-Bicocca, 2013).
Linking records with value diversity
LI, PEI
2013
Abstract
Most record linkage techniques assume that the information about the underlying entities does not change, and that it is merely provided in different representations and sometimes with errors. For example, mailing lists may contain multiple entries representing the same physical address, but each record may be slightly different, e.g., containing different spellings or missing some information. As a second example, consider a company that has different customer databases (e.g., one for each subsidiary). A given customer may appear in different ways in each database, and there is a fair amount of guesswork in determining which customers match. In practice, however, real-world data sets for linkage often exhibit value diversity. For example, many data sets contain temporal records over a long period of time; each record is associated with a time stamp and describes some aspects of a real-world entity at that particular time (e.g., author information in DBLP). In such cases, we often wish to identify records that describe the same entity over time, which enables interesting longitudinal data analysis. Value diversity also exists in group linkage: linking records that refer to entities in the same group. Applications of group linkage include finding businesses in the same chain, finding conference attendees from the same affiliation, finding players from the same team, etc. In such cases, although different members of the same group can share some similar global values, they represent different entities and so can also have distinct local values, requiring a high tolerance for value diversity. However, most existing record linkage techniques assume that records describing the same real-world entity are fairly consistent, and they often focus on different representations of the same value, such as “IBM” and “International Business Machines”. Thus, they can fall short when values may vary for the same entity.
This dissertation studies how to improve the linkage quality of integrated data while tolerating fairly high value diversity, covering both temporal linkage and group linkage. We address temporal record linkage in two ways. First, we apply time decay to capture the effect of elapsed time on entity value evolution. Second, instead of comparing each pair of records locally, we propose clustering methods that consider the time order of the records and make global decisions. Experimental results show that our algorithms significantly outperform traditional linkage methods on various temporal data sets. For group linkage, we present a two-stage algorithm: the first stage identifies cores containing records that are very likely to belong to the same group; the second stage collects strong evidence from the cores and leverages it to merge more records into the same group, while remaining tolerant to differences in other values. Our algorithm is designed to ensure efficiency and scalability. An experiment shows that it finished in 2.4 hours on a real-world data set containing 6.8 million records and obtained both precision and recall above 0.95. Finally, we build the CHRONOS system, which offers users a useful tool for finding real-world entities over time and understanding the history of entities in the bibliography domain. The core of CHRONOS is a temporal record-linkage algorithm that is tolerant to value evolution over time. Our algorithm can obtain an F-measure of over 0.9 in linking author records and can fix errors made by DBLP. We show how CHRONOS allows users to explore the history of authors, and how it helps users understand our linkage results by comparing our results with those of existing systems, highlighting differences in the results, explaining our decisions to users, and answering “what-if” questions.

File | Access | Attachment type | Size | Format
---|---|---|---|---
Phd_unimib_725129.pdf | open access | Doctoral thesis | 1.23 MB | Adobe PDF
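The abstract mentions applying time decay so that differing values count less against a match when the records are far apart in time. The following is only a rough illustration of that idea, not the dissertation's actual formulation: the exponential form, the `half_life` parameter, the function names, and the use of `difflib` as the string comparator are all assumptions introduced here.

```python
import math
from difflib import SequenceMatcher


def value_similarity(a: str, b: str) -> float:
    """Plain string similarity in [0, 1]; a stand-in for any comparator."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def decayed_similarity(a: str, b: str, years_apart: float,
                       half_life: float = 5.0) -> float:
    """Discount the penalty for disagreeing values as the time gap grows.

    ASSUMPTION: exponential decay with a hand-picked half-life (in years).
    The thesis may define and learn its decay differently.
    """
    sim = value_similarity(a, b)
    # Weight of the disagreement evidence: 1 at gap 0, 0.5 at one half-life.
    weight = math.exp(-math.log(2) * abs(years_apart) / half_life)
    # Only the disagreement part (1 - sim) is discounted.
    return 1.0 - weight * (1.0 - sim)


if __name__ == "__main__":
    # Hypothetical affiliation values for two author records.
    for gap in (0, 5, 15):
        score = decayed_similarity("IBM Research", "AT&T Labs", gap)
        print(f"gap={gap:>2} years -> score={score:.3f}")
```

With a gap of zero the score equals the plain similarity, and it rises toward 1 as the gap grows, so a changed affiliation after many years is weaker evidence against a match than the same change within a single year. This sketch covers only disagreement; it says nothing about how agreement over long gaps should be treated.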