This thesis belongs to the field of survey methodology and examines data quality when online panels are used as a sample source in the survey industry. Over time, sample surveys have adopted different techniques to overcome various challenges. In particular, high costs, the need for timely data delivery, and undercoverage (in the case of telephone surveys) have driven the development of online panels in survey research since the 1990s. Online panels are large pools of registered people who have agreed to take part in web-based research in exchange for some form of incentive. They offer two advantages: i) they reduce the cost and time of data collection and delivery, and ii) they provide a sampling frame of registered individuals who have consented to participate in surveys.

The main objective of this thesis is to empirically assess the quality of primary data collected from a survey of an Italian non-probability online panel, by comparing the estimates from this web survey with benchmark estimates from a probability-based survey conducted by the Italian National Institute of Statistics (ISTAT). In particular, I focused on i) undercoverage and self-selection bias in different samples (i.e., the Internet population, panel members, and web survey respondents), ii) nonresponse bias, comparing the characteristics of web survey respondents to those of nonrespondents, iii) data quality, comparing the estimates obtained from the web survey to those obtained from the probability-based reference survey, and iv) a weighting adjustment intended to remove bias.

I used three data sources. The first consists of data on all members of the Italian non-probability online panel, as recorded in the panel archive. This source is valuable because panel companies usually do not provide panellists’ data to their clients. Using it, I was able to study i) the representativeness of the panel members in comparison to both the Internet population and the general population, ii) the representativeness of the web survey respondents in comparison to the panel members, and iii) the differences between the characteristics of web survey respondents and those of nonrespondents. The second dataset comes from my primary data collection, conducted on a sample of Italian adults selected from the panel. The last source is the dataset from the probability-based reference survey.

My study is relevant to both the Italian and the international survey research fields. In particular, it is the first study in Italy that uses panel survey data and compares the panel survey estimates to those from a probability-based reference survey in order to assess the quality of data collected from a non-probability online panel. For international research on non-probability online panels, the study broadens the current findings on three under-researched topics. Firstly, to the best of my knowledge, only three publications have assessed the differences in the demographic and socio-economic characteristics of panel members, the Internet population and/or the general population. Secondly, only one European study has assessed the differences between study respondents and nonrespondents. Lastly, only a few studies use a gold standard to assess data quality in non-probability online panels.

The results of my study showed that i) the online panel survey estimates deviate substantially from the benchmarks, but the magnitude of the bias varies widely according to the estimate (e.g., estimates for behavioural variables are more biased than estimates for socio-demographic variables), and ii) weighting corrects some of the distortion in the estimates but does not remove the overall bias (which is higher for the behavioural variables than for the socio-demographic ones).

This thesis falls within the field of survey methodology and investigates the concept of data quality in sample surveys conducted on members of online panels. Over time, such surveys have implemented different techniques to overcome various challenges. In particular, high costs, the need to obtain data quickly, and problems of population coverage have led to the development of online panels since the 1990s. Online panels are groups of people who, by registering voluntarily, have agreed to take part in web surveys in exchange for a reward. These panels have the advantage of reducing the time and cost of data collection and of providing a list of members from which samples of respondents can be drawn. The main objective of this thesis is to assess the quality of data collected through an Italian non-probability online panel, comparing the estimates obtained from a web survey of panel members with those of a probability-based ISTAT survey, which I adopt as the “gold standard”. In particular, I focus on the biases generated by i) the incomplete Internet coverage of the Italian population, ii) the self-selection mechanism of panel members and of web survey participants, iii) nonresponse (comparing respondents and nonrespondents to the web survey), and iv) the measurement error that can arise while the questionnaire is being completed (comparing the sample estimates with the reference survey). In addition, I apply a set of weights to remove the biases. I use three data sources.
The first is the profiling data of the panel members, usually not made available to researchers, which I use to study i) the representativeness of the panel with respect to the Internet population and the general population, ii) the representativeness of the web survey respondents with respect to the panel members, and iii) the differences between the characteristics of respondents and nonrespondents. The second dataset contains the data collected with the web survey on a sample of panel members. The last source is the dataset of the reference survey. My study has a relevant impact on survey research both in Italy and abroad. It is the first study conducted in Italy that uses data collected from a survey of the members of a non-probability online panel and that checks the quality of the data collected with this method by comparing the resulting estimates with a benchmark. The relevance of my study for international research on non-probability online panels lies in extending the results obtained so far on three under-researched topics. First, to my knowledge, only three publications have studied the differences in the socio-demographic characteristics of the members of a panel, the Internet population and/or the general population. In addition, only one European study has assessed the differences between respondents and nonrespondents in a survey conducted on a panel. Finally, few studies rely on a comparison with a “gold standard” survey to study the quality of data collected through non-probability online panels. The results of my study showed that the estimates obtained from the panel survey are biased with respect to the benchmarks, but the size of the bias varies considerably depending on the estimate considered (e.g., estimates for behavioural variables are more biased than those for socio-demographic characteristics).
In addition, applying the weights is effective in correcting the biases in the estimates, but not in eliminating them completely.

Respi, C. (2019). Can we trust data collected using web surveys? Assessing the quality of an Italian non-probability online panel. (Doctoral thesis, Università degli Studi di Milano-Bicocca, 2019).

Can we trust data collected using web surveys? Assessing the quality of an Italian non-probability online panel

RESPI, CHIARA
2019

SALA, EMANUELA MARIA
LOZAR MANFREDA, KATJA
online panel; self-selection; nonresponse; measurement error; weighting
SPS/07 - SOCIOLOGIA GENERALE
English
5-feb-2019
SOCIOLOGIA APPLICATA E METODOLOGIA DELLA RICERCA SOCIALE - 92R
31
2017/2018
open
File in this record: phd_unimib_062190.pdf (doctoral thesis; open access; Adobe PDF, 2.23 MB)

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/241197