Are Humans as Brittle as Large Language Models?
Li, Jiahui; Papay, Sean; Klinger, Roman (2025): Are Humans as Brittle as Large Language Models?, in: Kentaro Inui, Sakriani Sakti, Haofen Wang, et al. (eds.), Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, Mumbai, India: The Asian Federation of Natural Language Processing and The Association for Computational Linguistics, pp. 2130–2155.
Faculty/Chair:
Author:
Li, Jiahui; Papay, Sean; Klinger, Roman
Title of the compilation:
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Editors:
Inui, Kentaro
Wang, Haofen
Wong, Derek F.
Bhattacharyya, Pushpak
Banerjee, Biplab
Ekbal, Asif
Chakraborty, Tanmoy
Singh, Dhirendra Pratap
Conference:
14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, December 2025; Mumbai, India
Publisher Information:
The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
Year of publication:
2025
Pages:
2130–2155
ISBN:
979-8-89176-298-5
Language:
English
Abstract:
The output of large language models (LLMs) is unstable, due both to the non-determinism of the decoding process and to prompt brittleness. While the intrinsic non-determinism of LLM generation may mimic existing uncertainty in human annotations through distributional shifts in outputs, it is largely assumed, yet unexplored, that the prompt brittleness effect is unique to LLMs. This raises the question: do human annotators show similar sensitivity to prompt changes? If so, should prompt brittleness in LLMs be considered problematic? One may alternatively hypothesize that prompt brittleness correctly reflects variance in human annotation. To fill this research gap, we systematically compare the effects of prompt modifications on LLMs and of identical instruction modifications on human annotators, focusing on the question of whether humans are similarly sensitive to prompt perturbations. To study this, we prompt both humans and LLMs on a set of text classification tasks under varying prompt conditions. Our findings indicate that both humans and LLMs exhibit increased brittleness in response to specific types of prompt modifications, particularly those involving the substitution of alternative label sets or label formats. However, the distribution of human judgments is less affected by typographical errors and reversed label order than that of LLMs.
Keywords:
Large Language Model
Peer Reviewed:
Yes
International Distribution:
Yes
Open Access Journal:
Yes
Type:
Conference object
Activation date:
January 21, 2026
Permalink
https://fis.uni-bamberg.de/handle/uniba/112676