
Improbable Bigrams: Vulnerabilities in Byte-Level BPE Tokenizers

Research analysis of incomplete tokens in byte-level BPE tokenizers and their vulnerability to improbable bigrams causing hallucinatory behaviors in LLMs.
computationaltoken.com | PDF Size: 0.3 MB

1. Introduction

Tokenization serves as the critical bridge between human-readable text and model-processable discrete tokens in large language models (LLMs). Recent research has exposed significant vulnerabilities in this foundational component, particularly in byte-level byte-pair encoding (BPE) tokenizers. This paper investigates incomplete tokens—undecodable tokens with stray bytes that result from byte-level BPE tokenization—and their susceptibility to exploitation through improbable bigrams.

The core vulnerability stems from incomplete tokens' heavy reliance on adjacent tokens for proper decoding. When paired with unfamiliar tokens in out-of-distribution combinations, these incomplete tokens become fragile and prone to triggering hallucinatory behaviors in LLMs. Our research demonstrates that this vulnerability persists even when the constituent tokens are well-trained, distinguishing it from previously identified glitch token issues.

Key findings at a glance:

  • 90% reduction — drop in hallucination rate for Llama3.1 when the same phrases are given an alternative tokenization
  • 1.47M bigrams — maximum count of incomplete bigrams, found in the Command-R-v01 tokenizer
  • 6 models — tested across multiple LLM families

2. BPE Tokenization Fundamentals

2.1 Byte-Level BPE Implementation

Byte-level BPE extends the traditional BPE algorithm by operating directly on UTF-8 encoded bytes rather than Unicode characters. Starting from the 256 single-byte symbols, the algorithm iteratively merges the most frequent adjacent pair of symbols:

$$\text{merge} = \arg\max_{(x,y)} \text{count}(x,y)$$

where $\text{count}(x,y)$ denotes the frequency of the adjacent pair $(x,y)$ in the training corpus; each merged symbol $xy$ is added to the vocabulary $V$, and the process repeats until the target vocabulary size is reached.
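The merge loop above can be sketched as follows. This is a minimal illustration on character sequences (real byte-level BPE operates on raw bytes and maintains incremental pair counts for efficiency); the corpus and function name are illustrative, not from the paper.

```python
from collections import Counter

def bpe_merge_step(corpus_tokens):
    """One BPE merge step: find the most frequent adjacent pair of
    symbols across the corpus and merge every occurrence of it."""
    pair_counts = Counter()
    for seq in corpus_tokens:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
    if not pair_counts:
        return corpus_tokens, None
    best = max(pair_counts, key=pair_counts.get)
    merged_corpus = []
    for seq in corpus_tokens:
        out, i = [], 0
        while i < len(seq):
            # Merge the winning pair wherever it occurs adjacently
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(seq[i] + seq[i + 1])
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged_corpus.append(out)
    return merged_corpus, best

corpus = [list("low"), list("lower"), list("lowest")]
corpus, pair = bpe_merge_step(corpus)  # merges ('l', 'o') first
```

Repeating this step grows the vocabulary one merged symbol at a time, which is how multi-byte tokens (including the incomplete ones discussed next) come into existence.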

2.2 Incomplete Tokens Definition

Incomplete tokens are byte-level tokens that cannot be independently decoded into valid Unicode characters. These tokens contain stray bytes that require combination with specific adjacent tokens to form legal UTF-8 sequences. The vulnerability arises because:

  • Incomplete tokens lack independent semantic meaning
  • They exhibit strong contextual dependence on neighboring tokens
  • Their byte patterns create decoding ambiguities
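The defining property above (undecodable in isolation) can be checked mechanically: a token is incomplete exactly when its raw bytes fail strict UTF-8 decoding. A minimal sketch, assuming access to each token's raw byte sequence:

```python
def is_incomplete(token_bytes: bytes) -> bool:
    """A token is 'incomplete' if its raw bytes are not valid UTF-8 on
    their own, i.e. it contains stray continuation or truncated lead bytes."""
    try:
        token_bytes.decode("utf-8", errors="strict")
        return False
    except UnicodeDecodeError:
        return True

# "é" is 0xC3 0xA9 in UTF-8; either byte alone is a stray fragment.
assert not is_incomplete("é".encode("utf-8"))
assert is_incomplete(b"\xc3")                   # truncated lead byte
assert is_incomplete(b"\xa9")                   # stray continuation byte
assert is_incomplete("牛".encode("utf-8")[:2])  # 2 of the 3 bytes of a CJK char
```

Running this predicate over a byte-level vocabulary yields the incomplete-token counts reported later for each tokenizer.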

3. Improbable Bigrams Methodology

3.1 Construction Technique

Improbable bigrams are carefully constructed combinations of two incomplete tokens that form out-of-distribution pairs. The construction follows these principles:

  1. Select incomplete tokens from the tokenizer vocabulary
  2. Ensure the combination creates valid UTF-8 byte sequences
  3. Maximize the statistical improbability of the pairing
  4. Verify the bigram doesn't appear in training data
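Construction rule 2 is the interesting constraint: neither token decodes alone, yet their concatenation does. A hedged sketch of that check (the helper name is ours, not the paper's):

```python
def forms_valid_pair(left: bytes, right: bytes) -> bool:
    """True when neither fragment decodes as UTF-8 on its own, but the
    concatenation does — the shape of an improbable-bigram candidate."""
    def decodes(b: bytes) -> bool:
        try:
            b.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False
    return not decodes(left) and not decodes(right) and decodes(left + right)

# '語' is 0xE8 0xAA 0x9E in UTF-8; splitting after the first byte yields
# two stray fragments that only decode when adjacent.
raw = "語".encode("utf-8")
left, right = raw[:1], raw[1:]
assert forms_valid_pair(left, right)
assert (left + right).decode("utf-8") == "語"
```

Steps 3 and 4 (maximizing improbability and checking against training data) then filter these candidates down to pairings the model has essentially never seen.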

3.2 Vulnerability Analysis

The vulnerability mechanism operates through three primary channels:

Decoding Ambiguity: Incomplete tokens create parsing uncertainties that propagate through the model layers. Empirically, the embedding vectors $e_i$ of incomplete tokens exhibit higher variance than the embeddings $e_j$ of complete tokens:

$$\text{Var}(e_i | \text{incomplete}) > \text{Var}(e_j | \text{complete})$$

Contextual Fragility: The dependency structure makes these tokens brittle when removed from expected contexts, similar to the instability observed in adversarial examples from computer vision research.

4. Experimental Results

4.1 Hallucination Rates

Our experiments across multiple LLM families reveal dramatic differences in hallucination rates between standard and alternative tokenizations of the same phrases:

| Model        | Standard Tokenization | Alternative Tokenization | Reduction |
|--------------|----------------------|--------------------------|-----------|
| Llama3.1     | 45.2%                | 4.5%                     | 90.0%     |
| Qwen2.5      | 38.7%                | 6.2%                     | 84.0%     |
| Mistral-Nemo | 52.1%                | 8.9%                     | 82.9%     |

4.2 Cross-Model Comparison

The scale of vulnerability varies significantly across tokenizers, as shown in our comprehensive analysis:

| Tokenizer      | Vocab Size | Incomplete Tokens | Incomplete Bigrams |
|----------------|-----------|-------------------|--------------------|
| Meta-Llama-3.1 | 128k      | 1,224             | 71k                |
| Exaone-3.0     | 102k      | 1,222             | 36k                |
| Qwen2.5        | 151k      | 1,320             | 39k                |
| Command-R-v01  | 255k      | 2,956             | 1.47M              |
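The "incomplete bigrams" column can be reproduced in principle by enumerating pairs of incomplete tokens whose concatenation is valid UTF-8, which explains why the count grows roughly with the square of the incomplete-token count. A sketch on a toy vocabulary (the vocabulary and function name are illustrative; real tokenizer vocabularies require reading the raw token bytes):

```python
from itertools import product

def count_incomplete_bigrams(vocab_bytes):
    """Count ordered pairs of incomplete tokens whose concatenation
    decodes as valid UTF-8 — the quantity tabulated above."""
    def decodes(b: bytes) -> bool:
        try:
            b.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False
    incomplete = [t for t in vocab_bytes if not decodes(t)]
    return sum(1 for a, b in product(incomplete, repeat=2) if decodes(a + b))

# Toy vocab: fragments of 'é' (C3 A9) and '語' (E8 AA 9E), plus complete tokens.
vocab = [b"the", b"\xc3", b"\xa9", b"\xe8", b"\xaa\x9e", "é".encode("utf-8")]
n = count_incomplete_bigrams(vocab)  # → 2  (C3+A9 and E8+AA9E)
```

On this toy vocabulary only two of the sixteen ordered fragment pairs decode, mirroring how real tokenizers admit a large but sparse set of valid incomplete bigrams.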

5. Technical Analysis Framework

Core Insight

The byte-level BPE tokenization paradigm, while computationally efficient, introduces fundamental architectural weaknesses that create systematic blind spots in LLMs. This isn't merely an implementation bug—it's a structural flaw in how modern tokenizers handle Unicode complexity.

Logical Flow

The vulnerability cascade follows a predictable pattern: Byte-level segmentation → Incomplete token creation → Contextual dependency formation → Statistical improbability exploitation → Hallucination triggering. This chain reveals that tokenization isn't just preprocessing—it's a critical security layer.

Strengths & Flaws

Strengths: The research methodology is rigorous, with cross-model validation and quantitative metrics. The improbable bigram concept provides a concrete attack vector for testing tokenizer robustness.

Flaws: The paper underemphasizes the training data contamination angle. Many "improbable" combinations might actually reflect rare but legitimate multilingual text patterns rather than pure artifacts.

Actionable Insights

LLM developers must treat tokenizers as security-critical components, not mere preprocessing utilities. Implement runtime tokenization sanity checks, adopt hybrid tokenization approaches, and conduct adversarial testing specifically targeting incomplete token combinations.
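One of the suggested mitigations, a runtime tokenization sanity check, can be sketched as a round-trip test: the token sequence should decode back to the original text, and the decoded text should re-encode to the same sequence. This is a generic sketch taking encode/decode callables, not any specific library's API; the toy byte-level tokenizer below is an assumption for illustration.

```python
def roundtrip_check(text, encode, decode):
    """Flag inputs whose tokenization is not a clean round trip —
    a cheap runtime guard against anomalous token sequences."""
    ids = encode(text)
    if decode(ids) != text:
        return False, "decode mismatch"
    if encode(decode(ids)) != ids:
        return False, "non-canonical tokenization"
    return True, "ok"

# Toy byte-level tokenizer: each UTF-8 byte is its own token id.
encode = lambda s: list(s.encode("utf-8"))
decode = lambda ids: bytes(ids).decode("utf-8")

ok, reason = roundtrip_check("naïve", encode, decode)
assert ok and reason == "ok"
```

The same check applied to attacker-supplied token sequences (rather than text) would catch improbable bigrams directly, since re-encoding their decoded form typically yields a different, canonical token sequence.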

Original Analysis: The Tokenization Security Paradigm

This research fundamentally shifts how we should conceptualize tokenization in the LLM security landscape. The findings demonstrate that byte-level BPE tokenizers create systematic vulnerabilities that transcend individual model architectures, reminiscent of the fundamental flaws discovered in early cryptographic systems. Unlike the well-documented issues with glitch tokens—which primarily affect undertrained tokens—the incomplete token vulnerability persists even in well-trained models, suggesting a deeper architectural problem.

The 90% reduction in hallucination rates when using alternative tokenizations for the same input phrases is particularly damning. This magnitude of improvement indicates that current byte-level BPE implementations are introducing substantial noise into the model processing pipeline. When compared to the adversarial robustness literature in computer vision—where similar architectural vulnerabilities have been extensively studied—the tokenization layer emerges as the NLP equivalent of decision boundary fragility in image classifiers.

What makes this research particularly compelling is its connection to broader Unicode security concerns. The Unicode Consortium has long warned about confusables and normalization vulnerabilities, but this work extends those concerns into the neural architecture domain. The finding that Command-R-v01's larger vocabulary correlates with dramatically more incomplete bigrams (1.47M vs 71k in Llama3.1) suggests that scaling vocabulary size without addressing this fundamental issue may actually increase attack surface.

Looking forward, this research should catalyze a paradigm shift toward "security-first tokenization" similar to the cryptographic community's embrace of provably secure primitives. The alternative tokenization approaches that dramatically reduce hallucinations point toward hybrid methods that combine the efficiency of byte-level BPE with the robustness of character-level or word-piece approaches. As LLMs become increasingly deployed in safety-critical applications, addressing these tokenization-level vulnerabilities becomes not just an academic concern but a practical imperative.

6. Future Directions & Applications

Defensive Applications

  • Robust Tokenization Standards: Development of tokenization methods that minimize incomplete tokens while maintaining efficiency
  • Adversarial Testing Frameworks: Automated systems for detecting tokenization vulnerabilities during model development
  • Runtime Monitoring: Detection and mitigation of improbable bigram attacks in production systems

Research Opportunities

  • Cross-lingual analysis of incomplete token distributions
  • Integration with retrieval-augmented generation to mitigate context fragility
  • Development of formal verification methods for tokenizer security properties

Industry Impact

The findings have immediate implications for:

  • LLM safety evaluation benchmarks
  • Tokenizer design in next-generation models
  • Regulatory frameworks for AI system security

7. References

  1. Jang, E., Lee, K., Chung, J.-W., Park, K., & Shin, S. (2025). Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers. arXiv:2410.23684v2
  2. Rumbelow, J., & Watkins, M. (2023). SolidGoldMagikarp: An analysis of glitch tokens in large language models.
  3. Land, K., & Bartolo, A. (2024). Embedding layer heuristics for identifying glitch tokens.
  4. Wang, X., et al. (2024). Adversarial questions through tokenizer segmentation attacks.
  5. Petrov, A., et al. (2023). Tokenization fairness in multilingual models.
  6. Geiping, J., et al. (2024). Jailbreaking through token manipulation.
  7. Unicode Consortium. (2024). Unicode Security Considerations. Unicode Technical Report #36
  8. Vaswani, A., et al. (2017). Attention is All You Need. NeurIPS 2017