An overview of our method

We evaluate the impact of quantization on different dimensions of social bias (i.e., stereotypes, fairness, toxicity, and sentiment). We pay particular attention to bias at the level of demographic categories (i.e., gender, race, and religion) and demographic subgroups (e.g., male, female, non-binary). We further analyze the impact of quantization by testing both weight-only and weight-activation quantization across different model architectures and reasoning abilities.

Is quantization safe?

  • Quantization increases stereotype alignment and unfair discrimination, reduces the model's tendency to generate toxic content, and shifts sentiment toward neutral.
  • Quantization does not introduce new discrimination across demographic categories and subgroups.
  • Quantization has a similar effect across model architectures and reasoning capabilities.

Motivation

Large language models (LLMs) have been widely adopted in tasks such as machine translation, question answering, and dialogue systems. Their growing use is driven by the observation that scaling up parameters and data consistently improves performance and unlocks new capabilities. However, this comes with significant computational and memory costs, resulting in higher hardware requirements, increased storage needs, and longer inference times. To address these issues, compression strategies such as quantization have been proposed to reduce resource usage while largely preserving accuracy.

While several works have explored the impact of quantization on model capabilities, its effects on other critical aspects, such as social biases, have received little attention. In particular, a fine-grained analysis of the impact of quantization at the demographic category and subgroup level remains largely overlooked. To fill this gap, in this work, we provide an extensive analysis of how quantization affects social bias.

Evaluation Method


Quantization: We employ two types of quantization strategies: weight-only quantization, namely Generalized Post-Training Quantization (GPTQ) and Activation-aware Weight Quantization (AWQ), and weight-activation quantization, namely SmoothQuant (SQ). For each strategy, we evaluate different settings by varying the number of quantization bits (e.g., 3-, 4-, and 8-bit quantization for the weights, and 8-bit for the activations).
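
To make the setup concrete, here is a minimal sketch of weight-only quantization using the GPTQConfig interface from Hugging Face transformers. The model ID, calibration dataset, and bit width are illustrative placeholders, not necessarily the exact toolchain or settings used in the paper.

# Minimal sketch: 4-bit weight-only (GPTQ) quantization with Hugging Face
# transformers. The calibration dataset and bit width are illustrative; the
# paper also evaluates 3- and 8-bit weights as well as AWQ and SmoothQuant.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ needs a small calibration set to estimate per-layer quantization error.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Weights are quantized layer by layer at load time; activations remain in
# higher precision (weight-only quantization).
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)
quantized_model.save_pretrained("llama-3.1-8b-instruct-gptq-4bit")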

Models: To generalize our understanding of the impact of quantization, we use four models: LLaMA-3.1-8B-Instruct, Qwen2.5-14B-Instruct, DeepSeek-R1-Distill-LLaMA-8B, and DeepSeek-R1-Distill-Qwen-14B. This setup enables us to compare the effects of quantization across different architectures (LLaMA vs. Qwen) and across reasoning capabilities (DeepSeek-based vs. non-DeepSeek models).

Social Bias: To expose models' social biases, we employ eight benchmarks covering different social dimensions: stereotypes (StereoSet, RedditBias, and WinoBias), fairness (DiscrimEval, DiscrimEvalGen, and DecodingTrust-Fairness), toxicity (BOLD and DecodingTrust-Toxicity), and sentiment (BOLD). We analyze model bias using both probability-based metrics (first-token probability and sentence perplexity) and generation-based metrics (choice generation and sentence completion). Finally, we employ the MMLU benchmark to assess the impact of quantization on model capabilities.
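
As an illustration of the probability-based side of the evaluation, the sketch below scores a stereotypical and an anti-stereotypical sentence by perplexity. The sentence pair is a made-up example rather than an item from the benchmarks, and the scoring is a generic formulation, not the benchmarks' exact implementation.

# Sketch of a probability-based bias probe: compare the sentence perplexity of
# a stereotypical vs. an anti-stereotypical sentence. The sentences are
# illustrative, not taken from any benchmark.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # or a quantized checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
model.eval()

def sentence_perplexity(text: str) -> float:
    """Perplexity = exp(mean negative log-likelihood of the tokens)."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

pair = {
    "stereotypical": "The nurse said she would be late.",
    "anti-stereotypical": "The nurse said he would be late.",
}
scores = {label: sentence_perplexity(text) for label, text in pair.items()}
# Lower perplexity means the model finds that variant more likely; comparing
# the two reveals which association the model prefers.
print(scores)

Aggregating such pairwise preferences over a whole dataset yields a stereotype score (the fraction of pairs where the stereotypical variant is preferred), which is the kind of quantity tracked when comparing full-precision and quantized models.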

Demographic Categories and Subgroups: To avoid aggregated results masking nuanced effects of quantization on bias (e.g., decreasing bias for one subgroup while increasing it for another), we perform a fine-grained analysis at both the category and subgroup levels. In particular, we focus on the categories of Gender, Race, and Religion, along with their respective subgroups (e.g., male, female, and non-binary for the gender category). This approach allows us to obtain a three-level understanding of the effects of quantization: global, category-level, and subgroup-level bias.
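
The three levels can be thought of as successive aggregations of per-example bias scores. Below is a small, hypothetical illustration with pandas; the column names and numbers are made up and only show how one results table supports the global, category, and subgroup views.

# Sketch of the three-level analysis on a hypothetical per-example results
# table. Column names and scores are illustrative.
import pandas as pd

results = pd.DataFrame([
    {"category": "gender",   "subgroup": "female",     "bias_score": 0.52},
    {"category": "gender",   "subgroup": "male",       "bias_score": 0.47},
    {"category": "gender",   "subgroup": "non-binary", "bias_score": 0.55},
    {"category": "race",     "subgroup": "black",      "bias_score": 0.49},
    {"category": "religion", "subgroup": "muslim",     "bias_score": 0.58},
])

global_bias = results["bias_score"].mean()                        # global level
category_bias = results.groupby("category")["bias_score"].mean()  # category level
subgroup_bias = results.groupby(["category", "subgroup"])["bias_score"].mean()

# Comparing these views between the full-precision and the quantized model
# shows whether an unchanged global score hides category- or subgroup-level
# shifts.
print(global_bias, category_bias, subgroup_bias, sep="\n")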

Findings

RQ1: How do quantization and specific quantization strategies impact each bias type? In generation-based tasks, quantization substantially reduces the model's tendency to produce toxic outputs and slightly neutralizes its sentiment. However, it also tends to marginally increase alignment with stereotypes and to exacerbate unfair discriminatory behavior, particularly under aggressive compression.

In contrast, in probability-based tasks, we do not observe evidence of increased discrimination. Instead, quantization primarily increases model uncertainty: as quantization becomes more aggressive, the model assigns lower likelihoods to both stereotypical and anti-stereotypical prompts.

Comparing quantization strategies, weight-activation quantization (i.e., SQ) has the strongest impact across all bias dimensions, whereas the two weight-only methods, AWQ and GPTQ, have comparable effects across all experiments.

Figure a: Toxicity on BOLD.
Figure b: Equalized Odds Difference on DT-Fairness.

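For context, the equalized odds difference shown in Figure b is a standard group-fairness score. Here is a minimal sketch of how such a score can be computed with fairlearn on toy data; the labels, predictions, and subgroup assignments are made up, and DT-Fairness may implement the metric differently.

# Sketch: equalized odds difference as a group-fairness score. The toy labels,
# predictions, and subgroup assignments are illustrative only.
from fairlearn.metrics import equalized_odds_difference

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                   # ground-truth outcomes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]                   # model decisions
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]   # demographic subgroup per example

# 0.0 means true- and false-positive rates match across subgroups; larger
# values indicate more unequal treatment.
print(equalized_odds_difference(y_true, y_pred, sensitive_features=groups))
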
RQ2: How does quantization affect bias across categories and subgroups? Overall, quantization has a comparable impact across demographic categories and subgroups: it neither introduces new discrimination nor substantially alters the discrimination already present in the original models. However, in generation-based tasks, we observe a small increase in stereotype alignment and, in some cases, a heightened tendency to favor specific subgroups over others, which can lead to greater unfairness and more discriminatory outcomes.

Figure a: Toxicity across categories on BOLD.
Figure b: Stereotype Score across categories on StereoSet.

RQ3: How does quantization affect bias across model architectures and reasoning abilities? The impact of quantization remains largely consistent across different model architectures and reasoning abilities. Interestingly, unquantized reasoning models are generally less aligned with stereotypes, produce less toxic outputs, and exhibit greater fairness than their non-reasoning counterparts. These differences largely persist after quantization, indicating that the relative behavior of reasoning and non-reasoning models is robust to compression.

Citation

@misc{marcuzzi2025quantizationshapesbiaslarge,
    title={How Quantization Shapes Bias in Large Language Models},
    author={Federico Marcuzzi and Xuefei Ning and Roy Schwartz and Iryna Gurevych},
    year={2025},
    eprint={2508.18088},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2508.18088}
}