Most ASR systems treat each language as an isolated problem. That's a bad assumption when at least half the world's population is bilingual, and the speech recognition market serving them is projected to grow from $9.66 billion in 2025 to $23.11 billion by 2030, with improved multilingual capabilities as a primary driver.
When a speaker switches from Spanish to English mid-sentence, your transcription pipeline can lose far more than a few words. It can break entirely. Code switching exposes the mismatch between how real people talk and what most speech systems can handle. This guide covers the definition, failure modes, architecture options, and evaluation methods you need for production code-switching ASR.
Key Takeaways
Here's what you need to know about code switching in ASR:
- Code-switched audio produces 1.5x to 11x higher error rates than monolingual baselines on peer-reviewed benchmarks.
- Aggregate WER hides switch-point failures. A model can improve overall WER by 1.69% while worsening at switch points by 6.17%.
- Unified multilingual models outperform cascade LID-then-route architectures for streaming code-switching. Peer-reviewed comparisons often show materially lower latency.
- Mixed Error Rate (MER) and Point-of-Interest Error Rate (PIER) capture failures that standard WER misses.
- Deepgram supports multilingual transcription for production speech workflows.
What Code-Switching Means for Speech Systems
Code switching is common in production audio, and monolingual assumptions break quickly when speakers alternate languages mid-utterance. If your stack can't handle those transitions, you'll lose accuracy on the words that often matter most.
Linguistic Definition and Why It Matters for ASR
Code switching is the alternation between two or more languages within a single conversation or utterance. Linguists distinguish two patterns. Inter-sentential code switching happens between sentences: a speaker finishes one sentence in Hindi, then starts the next in English.
Intra-sentential switching happens within a single sentence: "No recuerdo mi bank password." Intra-sentential switching is the harder problem for ASR.
Most speech APIs treat mid-sentence language alternation as recognition errors rather than natural communication patterns. Your multilingual speech-to-text pipeline needs to handle both.
Where Code-Switching Shows Up in Production Audio
Code switching shows up across high-volume voice verticals. US contact centers handle heavy Spanish-English switching in daily call traffic. India's BPO sector runs on Hindi-English: 26% of India's population is multilingual according to census data, and Hindi-English code-mixing dominates urban and professional communication.
The Scale of the Problem in 2026
Research volume clusters around a few language pairs, but the production challenge is broader. A 2025 systematic review of 127 E2E ASR papers found Mandarin-English leading with roughly 70 papers. Hindi-English and Arabic-English follow, while Spanish-English has only 3, which likely reflects benchmark maturity rather than low real-world volume.
Published research volume doesn't directly measure production traffic, and the engineering problem persists regardless. The same review found that even the top-scoring 2024 model achieves 48.38% WER on the Miami Bangor Spanish-English benchmark.
Why Monolingual ASR Breaks at Language Boundaries
Monolingual ASR stacks fail systematically at switch points because every layer assumes a single-language distribution. When the speaker changes languages, tokenizer coverage, acoustic confidence, and downstream task quality all degrade together.
Tokenizer and Vocabulary Failures
A monolingual English tokenizer has no subword units for Hindi or Mandarin, and poor coverage of Spanish words. When the model encounters "No recuerdo mi bank password," it tries to force the Spanish segment into English vocabulary entries.
The result is garbled output, hallucinated words, or silent deletions. For character-based languages like Mandarin mixed with English, the mismatch is even worse. The model's vocabulary has no representation for the switched segments.
Acoustic Model Confusion at Switch Points
Acoustic models trained on single-language data learn language-specific phonetic distributions. English vowel spaces don't overlap cleanly with Spanish ones. When a speaker switches languages, the acoustic model's confidence drops sharply at the boundary.
An Interspeech 2025 paper measured this directly. OpenAI Whisper Turbo-V3 degrades from 22.04% monolingual Word Error Rate to 58.67% WER on Bahasa Malay-English code-switched audio. For structurally distant pairs like Chinese-Bahasa Malay, WER exceeded 114%, mostly from massive insertion errors.
The HiKE benchmark for Korean-English shows the same pattern. Whisper-Small degrades from 4.5–8.3% monolingual WER to 50.1% PIER at switch points. Errors spike at the exact moments where language boundaries occur.
Downstream Pipeline Damage
Switch-point errors don't stay in the transcript. They cascade into every downstream system. NLU intent classifiers receive garbled input. Named entity recognition fails on the highest-value tokens: product names, account numbers, and medical terms that often trigger a language switch.
Sentiment analysis misreads tone when half the utterance is missing. Semantic error rate reveals these failures. Aggregate WER dashboards won't.
A speaker might say "necesito cancelar my subscription" and get transcribed as "necessity can sell my subscription." The WER dashboard still looks green, but the intent is completely wrong. If you've debugged this kind of ghost failure before, you know how frustrating it is. Your dashboard says everything's fine. Your customer says otherwise.
Architecture Patterns That Handle Code-Switching in Production
For streaming code-switching, unified multilingual models are usually the safer production default. Cascade LID-then-route designs can work for cleaner boundaries, but peer-reviewed comparisons and streaming constraints favor unified approaches.
Cascade Architecture: LID Plus Monolingual Routing
Cascade designs run a Language Identification module first. They then route audio segments to language-specific ASR models. This works for inter-sentential switching where language boundaries align with utterance boundaries. It fails for intra-sentential switching.
The same systematic review confirms the production verdict across the surveyed literature: cascade LID-route designs don't fit streaming ASR setups. The LID module adds latency before transcription begins. Routing decisions made on partial audio aren't reliable. Maintaining separate models per language multiplies both architecture complexity and real-time compute cost.
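The routing failure described above can be sketched with a toy example. This is an illustration only, not any vendor's LID implementation: it assumes per-word language tags are already known, and the function name `route_utterance` is hypothetical. It picks one language per utterance by majority vote, the way an utterance-level LID step would, and reports which words end up in the wrong monolingual model.

```python
from collections import Counter

def route_utterance(tagged_words):
    """Pick one language for the whole utterance (majority vote) and
    list the words that would be sent to the wrong monolingual model."""
    counts = Counter(lang for _, lang in tagged_words)
    routed_lang, _ = counts.most_common(1)[0]
    misrouted = [word for word, lang in tagged_words if lang != routed_lang]
    return routed_lang, misrouted

# Intra-sentential example from this article, with assumed language tags.
intra = [("No", "es"), ("recuerdo", "es"), ("mi", "es"),
         ("bank", "en"), ("password", "en")]
# Inter-sentential case: each utterance is monolingual, so routing is clean.
inter_first = [("Gracias", "es"), ("por", "es"), ("llamar", "es")]

print(route_utterance(intra))        # the English words go to the Spanish model
print(route_utterance(inter_first))  # nothing is misrouted
```

In the intra-sentential case, the routed model never sees a vocabulary that covers "bank password," which is why cascade designs hold up only when language boundaries line up with utterance boundaries.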
Unified Multilingual Models
Unified models process all languages in a single encoder-decoder pass. They handle intra-sentential switching natively because the model's vocabulary and acoustic representations span multiple languages at once.
Google's production solution embeds a per-frame LID predictor directly inside an RNN-T model. The first-pass decoder operates with zero right-context, giving you true streaming with no lookahead.
The trade-off is clear. Cascade designs make sense when single-language accuracy for one dominant language matters more than latency or code-switching stability. For most streaming code-switching workloads, unified models are the better fit.
Production Configuration for Code-Switching
Deepgram's Flux Multilingual model supports conversational speech workflows. It reduces the need for separate language detection and routing components in production architectures.
For a bilingual support line handling Spanish-English calls:
wss://api.deepgram.com/v2/listen?model=flux-general-multi&language_hint=en&language_hint=es&encoding=linear16&sample_rate=16000
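If you assemble that URL in code, a small helper keeps the repeated `language_hint` keys readable. This is a minimal sketch using only the Python standard library. The helper name `build_listen_url` is hypothetical, and the parameter names and values are copied from the URL above rather than checked against the API reference.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.deepgram.com/v2/listen"

def build_listen_url(model, language_hints, encoding="linear16", sample_rate=16000):
    """Build the streaming URL; each language hint becomes its own
    language_hint=... query parameter, as in the example above."""
    params = [("model", model)]
    params += [("language_hint", lang) for lang in language_hints]
    params += [("encoding", encoding), ("sample_rate", sample_rate)]
    return f"{BASE_URL}?{urlencode(params)}"

url = build_listen_url("flux-general-multi", ["en", "es"])
print(url)
```

Passing a list of pairs to `urlencode` preserves parameter order and allows the same key to appear more than once.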
Deepgram also supports multilingual transcription for conversations that blend languages naturally, which is worth knowing if you're building streaming systems that need to stay stable when speakers switch languages mid-turn.
Measuring Code-Switching Accuracy Beyond Standard WER
WER alone misses the most expensive code-switching failures. You need metrics that isolate switch points and connect transcript quality to downstream task performance.
Why WER Alone Fails for CS Evaluation
WER counts substitutions, insertions, and deletions across the entire transcript. In code-switched audio, most of the transcript is usually in one dominant matrix language. Errors at switch points get diluted by correct transcription of the matrix-language segments. Switch-point metrics such as PIER expose this directly: a model can improve overall WER by 1.69% while worsening at switch points by 6.17%. If you're only tracking aggregate WER, you'll celebrate an improvement that made code-switching performance worse.
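The dilution effect is easy to see with arithmetic. The counts below are hypothetical and chosen only for illustration; they are not taken from any benchmark. Model B makes fewer errors in the matrix-language segments and more errors at switch points than model A.

```python
def error_rate(errors, words):
    """Error rate as a percentage of reference words."""
    return 100.0 * errors / words

def summarize(matrix_errors, switch_errors, matrix_words=900, switch_words=100):
    """Return (aggregate rate, switch-point rate) for one model.
    All counts are hypothetical illustration values."""
    aggregate = error_rate(matrix_errors + switch_errors, matrix_words + switch_words)
    at_switch = error_rate(switch_errors, switch_words)
    return aggregate, at_switch

model_a = summarize(matrix_errors=27, switch_errors=40)
model_b = summarize(matrix_errors=18, switch_errors=45)

print(f"A: aggregate {model_a[0]:.1f}%, switch points {model_a[1]:.1f}%")
print(f"B: aggregate {model_b[0]:.1f}%, switch points {model_b[1]:.1f}%")
```

Because switch-point words are a small share of the transcript, model B's aggregate rate improves even though its switch-point rate gets worse.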
Metrics That Capture Switch-Point Performance
Two metrics address WER's blind spots for code-switched evaluation:
Mixed Error Rate (MER) applies word-level error counting to English tokens and character-level error counting to character-based language tokens such as Mandarin, Korean, and Japanese.
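The formula is MER = (INS_m + DEL_m + SUB_m) / N_m × 100%, where INS_m, DEL_m, and SUB_m count insertions, deletions, and substitutions over the mixed word and character units, and N_m is the number of those units in the reference. Standard benchmarks like SEAME and ASCEND use MER as their primary metric. MER only applies to language pairs involving character-based scripts. For Spanish-English or French-English, standard WER is appropriate.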
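A minimal sketch of that calculation, assuming a simple tokenization rule: CJK ideographs, kana, and Hangul syllables each count as one unit, and any other run of non-space characters counts as one word. The Unicode ranges and the function names are assumptions for illustration; benchmark scoring scripts may normalize text differently.

```python
import re

# Assumption: these ranges cover the character-level scripts of interest.
_CJK = r"\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af"
_UNIT = re.compile(rf"[{_CJK}]|[^\s{_CJK}]+")

def mixed_units(text):
    """Split text into single characters (CJK) and whole words (everything else)."""
    return _UNIT.findall(text.lower())

def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions + deletions + substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def mer(reference, hypothesis):
    """Mixed Error Rate in percent, relative to the reference unit count."""
    ref_units = mixed_units(reference)
    if not ref_units:
        raise ValueError("reference has no units")
    return 100.0 * edit_distance(ref_units, mixed_units(hypothesis)) / len(ref_units)

# Reference has 5 units (3 characters + 2 words); the hypothesis splits "password".
print(mer("我想要 reset password", "我想要 reset pass word"))
```

The hypothesis here costs one substitution and one insertion against five reference units.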
Point-of-Interest Error Rate (PIER) measures error rates specifically at language switch points. ICASSP 2025 results show the disparity clearly: on the ASCEND benchmark, overall MER was 20.4% while PIER at switch points reached 34.27%. On SEAME, overall MER was 39.72% versus 58.67% PIER. Together, these figures show that aggregate MER consistently understates how much models degrade where language switches occur.
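The idea behind a switch-point metric can be sketched as follows. This is a simplified approximation, not the published PIER implementation: it aligns reference and hypothesis with edit distance, then counts only substitutions and deletions that land on reference words you have tagged as points of interest. Insertions are ignored, and the tagging scheme is an assumption.

```python
def reference_error_flags(ref, hyp):
    """Align ref and hyp by edit distance; return one flag per reference
    word that is True when the word was substituted or deleted."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    flags = [False] * n
    i, j = n, m
    while i > 0:  # backtrace, preferring match/substitution
        if j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            flags[i - 1] = ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            j -= 1  # inserted hypothesis word, not attributed to any reference word
        else:
            flags[i - 1] = True  # deleted reference word
            i -= 1
    return flags

def poi_error_rate(ref, hyp, poi_mask):
    """Percent of tagged reference words that were substituted or deleted."""
    flags = reference_error_flags(ref, hyp)
    tagged = [f for f, is_poi in zip(flags, poi_mask) if is_poi]
    if not tagged:
        raise ValueError("no points of interest tagged")
    return 100.0 * sum(tagged) / len(tagged)

ref = "i need to update my factura before the meeting tomorrow".split()
hyp = "i need to update my factor a before the meeting tomorrow".split()
poi = [word == "factura" for word in ref]  # assumed tag: the embedded Spanish word
print(poi_error_rate(ref, hyp, poi))
```

Nine of the ten reference words are transcribed correctly, so an aggregate metric looks healthy while the single tagged word is wrong.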
Building a CS Evaluation Pipeline
Track three layers in your evaluation pipeline. First, overall WER or MER for baseline performance. Second, PIER at switch points to catch failures that aggregate metrics hide. Third, a downstream measure on code-switched utterances, such as intent recognition accuracy or semantic error rate, to measure business impact.
The downstream layer is the one most teams skip. It's also the one that surfaces quality problems raw transcription metrics miss.
Build your test set from actual production audio, not synthetic data. Tag language boundaries manually in your reference transcripts. Segment your benchmarks by language pair, since performance varies dramatically. Mandarin-English benchmarks and Spanish-English benchmarks produce very different scores for the same model. You can't extrapolate one pair's performance to another.
How to Ship Multilingual Transcription That Handles Code-Switching
If you serve multilingual markets, code-switching support is a baseline requirement. Choose a unified model when you need streaming stability, and evaluate it with metrics that expose switch-point failures.
Choosing Your Architecture
If your traffic involves intra-sentential switching or real-time voice agents, use a unified multilingual model. Cascade designs add latency and break at mid-sentence switch points. For Deepgram users, Flux is built for conversational speech workflows, while Nova-3 supports real-time multilingual transcription across the broader model lineup.
Configure the Connection
The configuration is minimal: one model name, one WebSocket connection, and optional language hints. That is enough to keep the pipeline simple when conversations blend languages naturally.
Test on Production Audio
Pair that setup with PIER-based evaluation on your actual production audio, and you'll catch switch-point failures before they reach your users. Build the test set from real calls, tag language boundaries, and track both transcript and downstream task quality.
Get Started with Deepgram
Try it yourself with free credits. New accounts can test multilingual transcription against their own audio with $200 in credits.
FAQ
What's the difference between code-switching and code-mixing in speech recognition?
Linguists sometimes separate them. Code-switching happens between sentences, while code-mixing happens within one sentence. For ASR engineering, most systems and benchmarks group both under code-switching.
Can monolingual ASR models handle code-switching with post-processing?
Post-processing can fix minor errors, but it can't recover tokens the acoustic model never detected. If the tokenizer lacks vocabulary for the switched language, those segments usually become hallucinations or deletions.
How does code-switching affect real-time voice agent latency?
Cascade LID-then-route pipelines can add latency compared with unified models. For voice agents, that delay stacks with LLM inference and TTS generation. Unified models avoid the routing step entirely.
What language pairs are most commonly code-switched in production audio?
That depends on your customer base geography. Audit your real call recordings before choosing benchmarks. Academic coverage may center on Mandarin-English, while your traffic may lean Spanish-English or Hindi-English.
How do you build a test set to evaluate code-switching ASR accuracy?
Start with 200–500 production utterances that contain language switches. Use bilingual annotators to transcribe them with per-word language tags. Split the set into intra-sentential and inter-sentential subsets, then rerun it on every model update.
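The subset split can be automated once the per-word language tags exist. A minimal sketch, assuming each utterance is stored as a list of sentences and each sentence as a list of (word, language) pairs. That data layout and the function name are assumptions, not a standard annotation format.

```python
def classify_switching(sentences):
    """Label an utterance by where its language switches occur.

    sentences: list of sentences, each a list of (word, lang) pairs.
    """
    langs_per_sentence = [{lang for _, lang in sentence} for sentence in sentences]
    # Any sentence containing more than one language is an intra-sentential switch.
    if any(len(langs) > 1 for langs in langs_per_sentence):
        return "intra-sentential"
    # Otherwise, different languages across sentences is an inter-sentential switch.
    if len(set().union(*langs_per_sentence)) > 1:
        return "inter-sentential"
    return "monolingual"

intra = [[("No", "es"), ("recuerdo", "es"), ("mi", "es"),
          ("bank", "en"), ("password", "en")]]
inter = [[("Gracias", "es"), ("por", "es"), ("llamar", "es")],
         [("How", "en"), ("can", "en"), ("I", "en"), ("help", "en")]]

print(classify_switching(intra))
print(classify_switching(inter))
```

Utterances that come back as "monolingual" can be kept as a control subset, so regressions on single-language audio stay visible alongside the code-switched results.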









