Article·Oct 21, 2025

Part 2: What Developers Should Know About Model Selection, Adaptation, and Tuning for Enterprise Speech Data

8 min read
By Brad Nikkel

AI Content Fellow

Select Candidate STT Models

Once you've mapped your audio landscape, you can make informed decisions about which STT model to select. Minimally, consider whether your candidate models are closed or open-weights, whether they support streaming or batch processing (or both), whether any models trained for your domain exist (e.g., medical and call‑center models are quite common), and whether any adaptation options are available. There are other model characteristics you might consider, but these will carry you far. We'll walk through each one below.

Weight Access and Customization Potential

Closed‑weights models: Deepgram, Google Cloud Speech‑to‑Text, Azure Speech, AWS Transcribe, and AssemblyAI are proprietary, so you can't change their weights, but you can adapt their behavior via API parameters like keyword or phrase boosting and context options. Here are the options a few STT vendors offer:

  • Deepgram: offers adaptation via keyterm prompting or phrase boosting to bias up to 500 terms with Nova‑3; enterprise customers can also request a custom model and call it from Deepgram API endpoints.
  • Google Cloud STT: offers adaptation via phrase boosting to bias up to 5000 phrases per request. You can also fine-tune Google's model on your audio and call it from Google API endpoints.
  • Azure Speech: offers adaptation via phrase boosting to bias up to 500 phrases per request, or you can fine-tune Microsoft's model on your audio or text data and call it from Azure API endpoints.
  • AWS Transcribe: offers adaptation via phrase boosting to bias up to 100 phrases, or you can fine-tune Amazon's base model on your audio or text data and call it from AWS API endpoints.

These adaptation options reduce “deployment” to simple API calls, a trivial operational burden compared with hosting your own model.
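
To make that concrete, here's a minimal sketch of phrase boosting as a plain HTTP call. It assumes Deepgram's prerecorded `/v1/listen` endpoint, Nova‑3's repeatable `keyterm` query parameter, a local `call_recording.wav`, and a `DEEPGRAM_API_KEY` environment variable; check your vendor's current docs for exact parameter names and limits.

```python
import os
import requests

# Hedged sketch: bias Nova-3 toward domain terms via repeated
# `keyterm` query parameters on Deepgram's /v1/listen endpoint.
DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"

params = [
    ("model", "nova-3"),
    ("keyterm", "fibrillation"),  # domain terms you want boosted
    ("keyterm", "metoprolol"),
    ("keyterm", "troponin"),
]

with open("call_recording.wav", "rb") as audio:
    response = requests.post(
        DEEPGRAM_URL,
        params=params,
        headers={
            "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
            "Content-Type": "audio/wav",
        },
        data=audio,
    )

response.raise_for_status()
transcript = response.json()["results"]["channels"][0]["alternatives"][0]["transcript"]
print(transcript)
```

Swapping vendors mostly means changing the endpoint, auth header, and parameter names, which is why the operational burden stays low.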

Open‑weights models: You can fully fine-tune open-weights models like the Whisper or Wav2Vec2 families, or tune them parameter‑efficiently. You can also fine-tune many vendor-provided models, though vendors tend to constrain how you do it. Fine-tuning gives you the most potential to improve a model's performance on your domain's typical accents, ambient noise, and vocabulary, but you're on the hook for inference hosting, scaling, monitoring, and updates (unless you opt for a vendor-managed fine-tuning service).
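
As a sketch of the parameter-efficient route, here's one way to wrap an open-weights Whisper checkpoint with LoRA adapters using Hugging Face's `transformers` and `peft` libraries. The checkpoint, target modules, and rank below are illustrative defaults, not tuned recommendations.

```python
# Hedged sketch: parameter-efficient fine-tuning of an open-weights
# Whisper checkpoint with LoRA adapters (transformers + peft).
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Inject low-rank adapters into the attention projections; only these
# small matrices are trained, while the base weights stay frozen.
lora_config = LoraConfig(
    r=8,  # adapter rank (illustrative, not tuned)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)

# Typically only a tiny fraction of the parameter count is trainable.
model.print_trainable_parameters()

# From here, train with your usual seq2seq loop on paired
# (audio features, transcript) batches from your domain data.
```

Adapters trained this way are small files, so you can keep one per domain and swap them at load time rather than hosting multiple full models.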

Streaming versus Batch Processing

Streaming STT can constrain model adaptation via custom vocabulary for several reasons:

  1. Latency budgets limit how much the model can re-check or favor specific words without slowing down real-time transcription.
  2. Streaming can’t leverage an audio file's full context, whereas batch decoding can. That longer context helps STT models disambiguate rare terms more reliably, so opt for batch processing when latency isn't critical.
  3. Some vendors restrict keyword or phrase boosting to either streaming or batch processing, but not both. Azure Speech, for example, allows phrase lists for streaming but not in its batch transcription. By contrast, Deepgram's Nova‑3 allows keyterm prompting in both batch and streaming requests (see the streaming sketch after this list).
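
To illustrate the third point, here's a minimal sketch of attaching a phrase list to a streaming recognizer with Azure's Python Speech SDK (`azure-cognitiveservices-speech`); the credentials and phrases are placeholders, and phrase-list limits vary by service version.

```python
# Hedged sketch: phrase-list boosting on a streaming recognizer with
# the Azure Speech SDK (azure-cognitiveservices-speech).
import os
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],  # placeholder credentials
    region=os.environ["AZURE_SPEECH_REGION"],
)
audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config, audio_config=audio_config
)

# Bias recognition toward domain terms for this streaming session.
phrase_list = speechsdk.PhraseListGrammar.from_recognizer(recognizer)
for phrase in ["Deepgram", "Nova-3", "keyterm prompting"]:
    phrase_list.addPhrase(phrase)

# Continuous (streaming) recognition with the phrase list applied.
recognizer.recognized.connect(lambda evt: print(evt.result.text))
recognizer.start_continuous_recognition()
input("Listening... press Enter to stop.\n")
recognizer.stop_continuous_recognition()
```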

When evaluating candidate STT models for your enterprise, also check for vendors offering pre-trained domain-specific options.

Domain-Specific STT Models

If you find a model specifically trained on your enterprise's domain, test how well it works on your audio before attempting adaptation or fine-tuning (see the comparison sketch after this list). Here are some domain-tuned models along with a few of their notable features:

  • Medical:
    • Amazon Transcribe Medical: a medical variant of AWS Transcribe tuned for clinical dictation and clinician-patient conversations, with specialty options like primary care and cardiology.
  • Telephony or Call‑center:
    • AWS Transcribe: offers models tuned for the low-fidelity audio typical of telephone calls, plus dedicated customer-agent call‑analytics STT APIs with features like generative call summarization, PII redaction, and sentiment analysis.
    • Deepgram: Nova‑2's "phone call" mode is designed for low‑fidelity telephone audio.
  • Finance and Business:
    • Kensho's "Scribe" STT model is trained on financial and business jargon, useful for transcribing regulatory hearings, meeting minutes, and compliance monitoring.
  • Legal:
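
Whichever domain-tuned candidates you shortlist, a quick side-by-side on your own audio is cheap to run. Here's a hedged sketch comparing Deepgram's general and phone-call models on one file; the model identifiers are illustrative, so confirm current names in the vendor's docs.

```python
# Hedged sketch: A/B-testing a domain-tuned model against a general one
# on the same audio before committing to adaptation or fine-tuning.
import os
import requests

def transcribe(audio_path: str, model: str) -> str:
    """POST one audio file to Deepgram's prerecorded endpoint."""
    with open(audio_path, "rb") as audio:
        response = requests.post(
            "https://api.deepgram.com/v1/listen",
            params={"model": model},
            headers={
                "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
                "Content-Type": "audio/wav",
            },
            data=audio,
        )
    response.raise_for_status()
    return response.json()["results"]["channels"][0]["alternatives"][0]["transcript"]

# Model names are illustrative; confirm current identifiers in vendor docs.
for model in ("nova-2-general", "nova-2-phonecall"):
    print(f"--- {model} ---")
    print(transcribe("support_call.wav", model))
```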

If domain-trained models don’t yield the performance you’re after, you have a few more options. Learn more in Part 3!