
To Compare, or Not to Compare:
On Methodological Practices in Evaluating Social Bias

Figure: An overview of our method.

Motivation

As Large Language Models are increasingly deployed in critical applications, robustly evaluating their social bias is paramount. However, the current literature is highly fragmented and yields contradictory conclusions on whether techniques like Chain-of-Thought or neutral fallback options mitigate prejudice or merely mask it. We argue that these inconsistencies stem from a shared methodological blind spot: existing benchmarks frequently ignore the profound behavioral shift that occurs between evaluating a demographic group in isolation (ISO) and forcing a direct comparison between competing groups (CMP).

To resolve this, we introduce a unified framework that systematically isolates the impact of these evaluation paradigms. Using the Parity Gap, a metric that measures the model's propensity to favor one of the analyzed options, we conducted a rigorous audit across 19 models ranging from 14B to 120B parameters. By bridging the gap between what is tested and how it is framed, our framework enables reproducible and unambiguous bias evaluations.
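
The Parity Gap is described here only informally, as the model's propensity to favor one of the analyzed options. As a minimal sketch, assuming a two-option setting and taking the gap to be the absolute difference between the shares of answers each option receives, it could be computed as follows. The function name `parity_gap`, the option labels, and this exact formula are illustrative assumptions, not the paper's definition.

```python
from collections import Counter


def parity_gap(choices, option_a, option_b):
    """Assumed definition: |share(option_a) - share(option_b)| over all answers.

    Returns 0.0 for perfectly balanced answers and 1.0 when a single
    option is always chosen. The formula is an illustrative assumption.
    """
    if not choices:
        raise ValueError("choices must not be empty")
    counts = Counter(choices)
    return abs(counts[option_a] - counts[option_b]) / len(choices)


# Toy answers (made up for illustration): 70 picks of "A", 30 of "B".
toy_choices = ["A"] * 70 + ["B"] * 30
gap = parity_gap(toy_choices, "A", "B")
print(f"parity gap = {gap:.2f}")
```

Under this assumed definition, a gap near 0 means the two options are chosen about equally often, while a gap near 1 means one option is chosen almost every time.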

Contribution of this work

  • As a warning to researchers and practitioners: a context that compares multiple demographic categories or stereotypical options is more likely to elicit a strong model preference than an isolated interaction. We further show that this effect scales positively with model size: larger models display increased discrimination rather than mitigating it.
  • This heightened preference in comparative settings is not alleviated by neutral fallback options, and it is exacerbated by Chain-of-Thought reasoning, which stabilizes these skewed preferences across prompt variations. Moreover, even when the model claims to select a random answer to avoid bias, its supposedly random responses strongly reflect deterministic preferences.
  • A suite of 9 standardized benchmarks to systematically evaluate social bias in Large Language Models across multiple axes, including the ISO versus CMP setting, the presence or absence of CoT, prompt variations, and the inclusion of neutral options. This includes new sets of questions derived from existing benchmarks, specifically tailored for both isolated and comparative settings.

Finding 1: The Paradigm Gap (ISO vs. CMP)

Our systematic evaluation across 19 models reveals a stark paradigm gap between the two evaluation settings. While language models behave in an almost entirely unbiased way under the Isolated (ISO) format, the exact same models exhibit strong stereotypical preferences, discrimination, and hidden biases when evaluated under Comparative (CMP) constraints. This indicates that comparative settings act as aggressive catalysts for hidden prejudice.

Figure: Paradigm gap results.

Finding 2: The Role of Contextual Ambiguity

The massive behavioral shift between the ISO and CMP paradigms is fundamentally driven by the absence of contextual information. When prompts contain explicit, disambiguating facts that support a correct answer, the forced-choice paradigm no longer acts as a catalyst for social discrimination: models remain anchored to the factual data, which significantly reduces model preference in the CMP setting.

Figures: BBQ ambiguous split; BBQ disambiguated split.

Finding 3: Chain-of-Thought Amplifies Bias

Contrary to the common assumption in the literature that step-by-step reasoning acts as a mitigation strategy, Chain-of-Thought (CoT) prompting aggressively increases the parity gap in comparative settings. This suggests that, when forced to compare, CoT can lead the model to reason toward nuanced details of stereotypes and provide a biased answer. Furthermore, across 54 prompt variations, CoT stabilizes these skewed preferences, drastically reducing result variability.

Figures: CoT variance on the BBQ ambiguous split; CoT variance on RedditBias.
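
One way to read "stabilizes" in this finding is that the parity gap becomes both larger on average and less spread out across prompt variations. The sketch below summarizes a list of per-prompt gaps by mean and population standard deviation. The helper name `summarize_gaps`, the choice of standard deviation as the variability measure, and all numbers are illustrative assumptions, not results or code from the paper.

```python
from statistics import mean, pstdev


def summarize_gaps(gaps):
    """Return (mean, population standard deviation) of per-prompt parity gaps."""
    return mean(gaps), pstdev(gaps)


# Hypothetical per-prompt-variation parity gaps (not results from the paper).
gaps_without_cot = [0.10, 0.45, 0.30, 0.05]
gaps_with_cot = [0.60, 0.62, 0.58, 0.61]

for label, gaps in [("no CoT", gaps_without_cot), ("CoT", gaps_with_cot)]:
    avg, spread = summarize_gaps(gaps)
    print(f"{label}: mean gap = {avg:.3f}, spread = {spread:.3f}")
```

In this toy data the CoT gaps have a higher mean and a smaller spread, which is the pattern the finding describes: a preference that is both stronger and more consistent across prompts.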

Finding 4: "Random" Answers Hide Prejudice

Even when language models explicitly state they are choosing an option randomly (left), or when they are explicitly instructed to randomize their selection under comparative constraints (right), their final decisions are far from random. The outputs manifest as a deterministic reflection of systemic bias instead of showing true stochastic neutrality.

Random Claim: Percentage of times the model selects the same answer while claiming to answer at random.
Forced Random: Model parity gap when explicitly asked to answer randomly.
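
The "same answer" percentage can be sketched as the share of responses that match the most frequent answer among the runs in which the model claimed to choose at random. The function name `modal_answer_share` and the toy data are hypothetical, and how such runs are identified in practice is not specified here.

```python
from collections import Counter


def modal_answer_share(answers):
    """Fraction of answers equal to the most frequent answer."""
    if not answers:
        raise ValueError("answers must not be empty")
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)


# Toy answers from runs where the model claimed to pick at random (made up).
claimed_random = ["A"] * 9 + ["B"]
share = modal_answer_share(claimed_random)
# Rough reference point: 1 / (number of distinct options seen), 0.5 here.
uniform_baseline = 1 / len(set(claimed_random))
print(f"same-answer share = {share:.2f}, uniform baseline = {uniform_baseline:.2f}")
```

A share far above the rough uniform baseline suggests that the "random" choice is in fact a consistent preference.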

Finding 5: The Illusion of Neutral Fallbacks

Allowing models to choose a neutral fallback option creates a misleading sense of safety. Even when the model actively selects the neutral option (e.g., "Prefer not to answer.") where it is offered, its underlying preferences do not change: conditional on a non-neutral decision being made, its choices remain heavily biased. This indicates that neutral options mask latent prejudices rather than fix them.

Figures: Llama on the BBQ ambiguous split; Llama on DiscrimEvalGen.
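
Conditioning on a non-neutral decision can be sketched by computing the gap twice: once over all answers and once over only the answers that pick one of the two options. The function name, the option labels, and the gap formula (absolute difference in shares) are illustrative assumptions, and the toy counts are made up.

```python
from collections import Counter


def neutral_rate_and_gaps(choices, option_a, option_b, neutral):
    """Return (neutral rate, gap over all answers, gap over non-neutral answers).

    The gap is the assumed |share(option_a) - share(option_b)|.
    """
    if not choices:
        raise ValueError("choices must not be empty")
    counts = Counter(choices)
    diff = abs(counts[option_a] - counts[option_b])
    decisive = counts[option_a] + counts[option_b]
    neutral_rate = counts[neutral] / len(choices)
    overall_gap = diff / len(choices)
    # Condition on the model having picked one of the two options.
    conditional_gap = diff / decisive if decisive else 0.0
    return neutral_rate, overall_gap, conditional_gap


# Toy answers (made up): mostly neutral, but skewed whenever a side is picked.
toy = ["neutral"] * 80 + ["A"] * 18 + ["B"] * 2
rate, overall, conditional = neutral_rate_and_gaps(toy, "A", "B", "neutral")
print(f"neutral rate = {rate:.2f}, overall gap = {overall:.2f}, "
      f"conditional gap = {conditional:.2f}")
```

In this toy case the frequent neutral answers shrink the overall gap, while the conditional gap stays large, which is the masking effect described in this finding.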

Finding 6: Comparative Bias Scales with Model Size

The tendency to prefer certain demographic groups or to enforce stereotypical preferences in comparative settings is a systemic issue that scales positively with parameter count. Across various model families, larger models possess a higher capacity to internalize societal biases from training data, making them more prone to aggressive discrimination or to perpetuating stereotypes under forced choices.

Figure: Scaling effects on bias.

Citation

@misc{marcuzzi2026comparecomparemethodologicalpractices,
      title = {To Compare, or Not to Compare: On Methodological Practices in Evaluating Social Bias}, 
      author = {Federico Marcuzzi and Xuefei Ning and Roy Schwartz and Iryna Gurevych},
      year = {2026},
      eprint = {2606.24596},
      archivePrefix = {arXiv},
      primaryClass = {cs.CL},
      url = {https://arxiv.org/abs/2606.24596}, 
}