Hierarchical Adversarial Correction to Mitigate Identity Term Bias in Toxicity Detection

Schäfer, Johannes; Heid, Ulrich; Klinger, Roman

doi:10.18653/v1/2024.wassa-1.4

Schäfer, Johannes; Heid, Ulrich; Klinger, Roman (2024): Hierarchical Adversarial Correction to Mitigate Identity Term Bias in Toxicity Detection, in: Orphee De Clercq, Valentin Barriere, Jeremy Barnes, et al. (eds.), Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis, Bangkok, Thailand: Association for Computational Linguistics, pp. 35–51, doi: 10.18653/v1/2024.wassa-1.4.

Faculty/Chair:

Fundamentals of Natural Language Processing

Author:

Schäfer, Johannes; Heid, Ulrich; Klinger, Roman

Title of the compilation:

Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

Editors:

Orphee De Clercq; Valentin Barriere; Jeremy Barnes; et al.

Conference:

14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis ; Bangkok, Thailand

Publisher Information:

Bangkok, Thailand : Association for Computational Linguistics

Year of publication:

2024

Pages:

35–51

Language:

English

DOI:

10.18653/v1/2024.wassa-1.4

URL:

https://aclanthology.org/2024.wassa-1.4

Abstract:

Corpora that form the basis for toxicity detection contain toxic expressions that are typically directed against a target individual or group, e.g., people of a specific gender or ethnicity. Prior work has shown that the mention of the target identity can act as a confounding variable. As an example, a model might learn that Christians are always mentioned in the context of hate speech. This misguided focus can limit generalization to newly emerging targets that are not found in the training data. In this paper, we hypothesize and subsequently show that this issue can be mitigated by considering targets at different levels of specificity. We distinguish the levels of (1) the existence of a target, (2) a target class (e.g., that the target is a religious group), and (3) a specific target group (e.g., Christians or Muslims). We define a target label hierarchy based on these three levels and then exploit this hierarchy in an adversarial correction for the lowest level, i.e., level (3), while maintaining some basic target features. This approach does not lower toxicity detection performance but improves generalization to targets that are not available at training time.
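
The three levels of specificity named in the abstract can be illustrated with a small sketch. It is one possible reading of the abstract, not the authors' implementation: the mapping `TARGET_CLASS`, the functions `hierarchy_labels` and `combined_loss`, the weight `adv_weight`, and the example group "women" are all hypothetical. The loss combination follows the common gradient-reversal style of adversarial training, and the assumption that levels (1) and (2) are kept as ordinary auxiliary objectives is ours.

```python
from typing import Optional

# Hypothetical mapping from specific target groups (level 3) to target
# classes (level 2). "Christians", "Muslims" and the religious class come
# from the abstract's examples; "women"/"gender" is purely illustrative.
TARGET_CLASS = {
    "Christians": "religion",
    "Muslims": "religion",
    "women": "gender",
}


def hierarchy_labels(target: Optional[str]) -> dict:
    """Derive the labels at all three levels from one annotated target."""
    if target is None:
        # No target annotated: only level (1) carries information.
        return {"has_target": False, "target_class": None, "target_group": None}
    return {
        "has_target": True,                    # level (1): a target exists
        "target_class": TARGET_CLASS[target],  # level (2): class of the target
        "target_group": target,                # level (3): specific group
    }


def combined_loss(toxicity_loss: float, existence_loss: float,
                  class_loss: float, group_loss: float,
                  adv_weight: float = 1.0) -> float:
    """Assumed encoder-side objective: keep levels (1) and (2) as ordinary
    objectives and subtract the level (3) loss, so that the shared
    representation becomes less predictive of the specific target group."""
    return toxicity_loss + existence_loss + class_loss - adv_weight * group_loss


print(hierarchy_labels("Muslims"))
print(combined_loss(1.0, 0.5, 0.5, 0.25, adv_weight=2.0))
```

In a neural model, subtracting the group loss in this way is typically realized with a gradient reversal layer between the shared encoder and the group classifier; whether the paper uses exactly that mechanism is not stated in this record.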

Keywords:

Hierarchical Adversarial Correction

Peer Reviewed:

Yes

International Distribution:

Yes

Open Access Journal:

Yes

Type:

Conference object

URI:

https://fis.uni-bamberg.de/handle/uniba/98197

Activation date:

September 20, 2024
