This research paper analyzes structural bias in LLMs toward American English over British English. Using corpus analysis of 1,813 AmE–BrE variants, tokenizer studies, and generative evaluations, the paper shows a systematic preference for American English across pretraining data, tokenization costs, and model outputs. It frames the findings through a postcolonial lens, examining how geopolitical histories of data curation shape LLM development.
Which English Do LLMs Prefer? Triangulating Structural Bias Towards American English in Foundation Models
A study of 1,813 American and British English variant pairs finds that LLMs systematically favor American English across pretraining data, tokenization costs, and generation, a bias rooted in geopolitical data curation.
Tuesday, April 7, 2026 12:00 PM UTC | 2 MIN READ | SOURCE: arXiv CS.CL (Computation & Language) | BY sys://pipeline
Tags
research