Impact of Data Augmentation on Hate Speech Detection in Roman Urdu

Fariha Maqbool; Blerina Spahiu; Andrea Maurino
2024

Abstract

The prevalence of hate speech contributes to an increase in hate crimes and online violence, and causes serious harm to social safety, physical security, and cyberspace. Several studies have addressed hate speech detection in European languages, but low-resource South Asian languages have received little attention, leaving social media unsafe for millions of users. Because datasets for these languages are scarce and contain few samples, strategies are needed to increase the number of data samples. In this paper, we improve the performance of an already fine-tuned m-BERT model by applying data augmentation techniques to a dataset of hate speech in Roman Urdu tweets. F1-score and accuracy are used as the metrics to compare the results. We also run experiments to determine the optimal percentage of augmented data to include and the optimal percentage of words to augment in each data instance. The new RUHSOLD++ dataset, which contains the augmented data, has been made publicly available. The improvement in the model's hate speech detection shows that applying data augmentation to a dataset with a limited number of instances can improve model performance.
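The two quantities tuned in the abstract (the share of augmented data added to the dataset and the share of words changed in each instance) can be illustrated with a minimal sketch. The abstract does not name the augmentation techniques used, so this example assumes a simple word-replacement scheme. The function names, the `SPELLING_VARIANTS` table, and the sample texts are all hypothetical and are not taken from the paper or from RUHSOLD++.

```python
import random

# Hypothetical substitution table (illustrative spelling variants only,
# not taken from the paper or its dataset).
SPELLING_VARIANTS = {
    "acha": ["achha"],
    "nahi": ["nahin", "nai"],
}


def augment_text(text, word_fraction, rng, replacements):
    """Replace up to `word_fraction` of the tokens with a known substitute."""
    tokens = text.split()
    # Only tokens that have an entry in the table can be replaced.
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in replacements]
    n_to_change = min(len(candidates), round(word_fraction * len(tokens)))
    for i in rng.sample(candidates, n_to_change):
        tokens[i] = rng.choice(replacements[tokens[i].lower()])
    return " ".join(tokens)


def augment_dataset(samples, data_fraction, word_fraction, replacements, seed=0):
    """Return the original samples plus augmented copies of a random subset.

    `data_fraction` sets how many augmented instances are added relative to
    the original dataset size; labels are copied unchanged.
    """
    rng = random.Random(seed)
    n_new = round(data_fraction * len(samples))
    chosen = rng.sample(samples, n_new)
    augmented = [
        (augment_text(text, word_fraction, rng, replacements), label)
        for text, label in chosen
    ]
    return samples + augmented


if __name__ == "__main__":
    data = [("yeh acha nahi hai", 0), ("wo acha hai", 1),
            ("nahi nahi", 0), ("theek hai", 1)]
    for text, label in augment_dataset(data, 0.5, 0.5, SPELLING_VARIANTS, seed=1):
        print(label, text)
```

Sweeping `data_fraction` and `word_fraction` over a grid and retraining the classifier for each setting is one way to search for the best combination; the paper's actual search procedure may differ.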
Type: poster + paper
Keywords: data augmentation, under-resourced languages, large language models
Language: English
Conference: 32nd Italian Symposium on Advanced Database Systems, SEBD 2024, 23-26 June 2024
Editors: Atzori, M; Ciaccia, P; Ceci, M; Mandreoli, F; Malerba, D; Sanguinetti, M; Pellicani, A; Motta, F
Proceedings: Proceedings of the 32nd Symposium on Advanced Database Systems
Year: 2024
Volume: 3741
Pages: 321-330
URL: https://ceur-ws.org/Vol-3741/
Access: open
Maqbool, F., Spahiu, B., Maurino, A. (2024). Impact of Data Augmentation on Hate Speech Detection in Roman Urdu. In Proceedings of the 32nd Symposium on Advanced Database Systems (pp.321-330). CEUR-WS.
Files in this record:
File: Maqbool-2024-SEBD 2024-VoR.pdf
Access: open access
Attachment type: Publisher’s Version (Version of Record, VoR)
License: Creative Commons
Size: 572.58 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/490399