
By Jack Kearney
Staff Research Scientist
End-of-Turn Based Finalization
With Flux, one of the more visible changes we made was to how our streaming models handle finalization. Finalization refers to the “locking in” of a word; once the word is finalized, the system will not go back and change its hypothesis for that word. Many streaming STT systems finalize words according to
- “wall clock time,” i.e., finalizing words some amount of time after they’ve been spoken (e.g., transducer-based models), or
- “pause time,” i.e., whenever a break or pause, generally as detected by a voice activity detector (VAD), occurs in speech (e.g., Nova-3 streaming, “streaming Whisper” variants).
By comparison, Flux uses “conversation time,” performing finalization upon the completion of a conversational turn.
The reason for doing so was to help unlock low-latency end-of-turn detection. Systems that inherently rely on some delay before finalizing, e.g., waiting for a fixed amount of time to pass or for a pause to occur, necessarily introduce additional latency between when a user stops speaking/ends their turn and when the transcript is ready. Instead, Flux constantly provides a view of what the most likely transcript would be if the turn were to end at that moment, such that when the turn does end, a high-quality transcript is ready immediately.
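To make the latency implication concrete, here is a minimal sketch, not the Flux API, of when a finalized transcript becomes available under pause-based versus conversation-time finalization; the 500ms VAD pause threshold is an assumed, illustrative value.

```python
# Minimal sketch: when is a finalized transcript ready under each policy?
# The 500ms VAD pause threshold is an illustrative assumption, not a Flux setting.

def transcript_ready_time(turn_end_s: float, policy: str, vad_pause_s: float = 0.5) -> float:
    """Return the wall-clock time (seconds) at which a finalized transcript is available."""
    if policy == "pause":
        # Pause-based systems must first observe a silence gap before finalizing.
        return turn_end_s + vad_pause_s
    if policy == "conversation":
        # Conversation-time finalization keeps a ready-to-go hypothesis at all times,
        # so the transcript is available as soon as the end of turn is detected.
        return turn_end_s
    raise ValueError(f"unknown policy: {policy}")

turn_end = 12.8  # the speaker stops 12.8s into the call
for policy in ("pause", "conversation"):
    ready = transcript_ready_time(turn_end, policy)
    print(f"{policy:>12}: ready at {ready:.1f}s (+{(ready - turn_end) * 1000:.0f}ms after turn end)")
```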
However, a downside of treating the turn as if it could end at any moment is that Flux would occasionally “take a stab” at transcribing words before the user had finished speaking them, resulting in hypotheses that would need to be revised later. In the worst case, Flux would end up guessing that non-word sounds would become common words such as “Hey” or “I,” resulting infrequently (but still more often than our target goal of “never”) in false positive speech detection.
In reality, though, a turn is not likely to end at any moment; if the user is mid-word, then their turn is obviously not ending then. In these situations, Flux could afford to wait without sacrificing the ultimate end-of-turn accuracy. Since the initial release of Flux, we have developed a new training approach that better reflects this reality, placing greater emphasis on the quality of the transcript at end-of-turn. The resulting model is slightly more conservative when it comes to transcription, which has three notable implications for developers. Relative to the initial version of Flux, this new version exhibits
- Improved transcription quality, by up to 10% for certain types of audio data,
- Reduced false positive rate for start-of-turn detection, by up to 70%, and
- Faster end-of-turn detection at eot_threshold >= 0.8, by 50-150ms.
To see these benefits, developers working with Flux will need to do…absolutely nothing! The new version already launched in early December; to start, we applied our new training recipe to achieve a modest fine-tuning of the existing Flux model (“Flux V0.1”) such that its behavior is almost identical to that of the original model, with the notable exceptions listed above.
Below, we describe the intuition behind the new training approach in more detail, as well as the impact on the model.
Flux V0.1: A More Conservative Transcriber
As astute observers of Flux outputs might have noticed, Flux transcripts are constantly being revised throughout the turn, analogous to how human understanding of speech evolves as we ingest more context. For those more familiar with LLMs, the way this works under the hood is loosely analogous to the concept of “test time compute;” Flux has some transcription “budget” that it spends over the course of the turn. The initial version of Flux (“Flux V0”) was trained to spend this budget aggressively, transcribing everything it had heard up to that point in time. This ensured that all audio would already have been transcribed as soon as a turn ended, but had the downside that Flux V0 might occasionally transcribe something, “thinking” that whatever occurred at the end of the audio might be part of a word.
However, from the standpoint of “high-quality transcription at end-of-turn,” this aggressiveness is not really warranted! Instead, our new training paradigm does the obvious thing, namely optimize for…correctness specifically towards end-of-turn. The result is that the model learns to be more conservative with its budget. For instance, if you were to painstakingly evaluate the “working transcripts” output by Flux V0.1 throughout the course of the turn (note: you should not actually do this since it’s annoying and, anyway, we have done it for you as shall imminently become apparent), you would see a reduction in cases where the model
- outputs a word and subsequently removes it, by 20%, and
- changes the last word it output, by 30%.
Notably, this is not the result of a hard-coded lookahead/delay, nor does the model learn to consistently delay transcription. Indeed, in this training paradigm, it could not, since it is still penalized for not having a maximally accurate transcript at end-of-turn. Instead, the model learns when to delay transcription and when not to, allowing improved usage of its budget without sacrificing end-of-turn latency.
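As a rough illustration of how these revisions are tallied, the sketch below (our internal tooling differs; the hypothesis lists are toy data) compares consecutive working transcripts within a turn and counts the two revision types described above.

```python
# Minimal sketch of the two revision metrics: a previously emitted word being
# removed, and the most recent word being changed, between consecutive
# working transcripts of a single turn. Not the actual evaluation code.

def revision_stats(working_transcripts: list[list[str]]) -> dict[str, int]:
    """Count revisions between consecutive working transcripts of one turn."""
    removals = 0           # a previously emitted word disappears from the hypothesis
    last_word_changes = 0  # the most recent word is replaced by a different word
    for prev, curr in zip(working_transcripts, working_transcripts[1:]):
        if len(curr) < len(prev):
            removals += 1
        elif prev and curr[:len(prev) - 1] == prev[:-1] and curr[len(prev) - 1] != prev[-1]:
            last_word_changes += 1
    return {"removals": removals, "last_word_changes": last_word_changes}

# Two toy turns: in the first, a word is emitted and then removed; in the second,
# the most recent word is revised as more audio arrives.
turn_a = [["hey"], [], ["they"], ["they", "were"]]
turn_b = [["i"], ["i", "can"], ["i", "can't"], ["i", "can't", "hear"]]
print(revision_stats(turn_a))  # {'removals': 1, 'last_word_changes': 0}
print(revision_stats(turn_b))  # {'removals': 0, 'last_word_changes': 1}
```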
Improved Turn Detection
The advantage of a more conservative transcriber is that it is less likely to be tricked into thinking it heard speech when it did not, or to have to delay end-of-turn while it fixes an incorrect transcript. Correspondingly, Flux V0.1 exhibits fewer false positive start-of-turn detections, and reduced end-of-turn detection latency, particularly at higher eot_threshold and in the tails.
To evaluate start-of-turn detection, we compare Flux’s StartOfTurn detection time with the start time of the first word in the turn (for more details on how we evaluate these turn-oriented STT models, see Evaluating End-of-Turn (Turn Detection) Models). In this paradigm, a detection latency of “zero” is not really to be expected, since that would correspond to detecting and outputting the word before it had been fully spoken. We define latency this way so that latency < 0 has a distinct interpretation: such cases correspond to likely false positives, since we detected a word before any had been spoken.
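For illustration, here is a minimal sketch of this latency definition; the field names (first_word_start_s, start_of_turn_detected_s) are hypothetical, not the actual Flux event schema.

```python
# Minimal sketch of start-of-turn detection latency: detection time minus the
# start time of the first spoken word. Field names are illustrative only.

def start_of_turn_latencies(turns: list[dict]) -> list[float]:
    """Per-turn latency in seconds; negative values indicate likely false positives."""
    return [t["start_of_turn_detected_s"] - t["first_word_start_s"] for t in turns]

turns = [
    {"first_word_start_s": 1.20, "start_of_turn_detected_s": 1.34},  # detected 140ms into the word
    {"first_word_start_s": 4.05, "start_of_turn_detected_s": 3.90},  # fired before speech: false positive
]
print(start_of_turn_latencies(turns))  # ~[0.14, -0.15], up to float rounding
```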
The plot below shows the full cumulative distribution function of StartOfTurn detection latency for Flux V0 and V0.1. Since the false positive region is hard to see (like we said, Flux V0 only occasionally falsely detected turn start in our benchmarking), we have included in the caption the density (i.e., frequency) of detections with latency < 0, i.e., false positives. Flux V0.1 achieves a 0.4% false positive rate, an over 70% reduction compared to the 1.5% rate observed for Flux V0.
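The quantities in that plot and caption can be computed directly from the per-turn latencies, as in this sketch (toy numbers, assuming latencies in seconds):

```python
import numpy as np

# Minimal sketch: empirical CDF of detection latency, plus the density of
# detections with latency < 0 (i.e., the false positive rate). Toy data only.

def empirical_cdf(latencies: np.ndarray, grid: np.ndarray) -> np.ndarray:
    """Fraction of turns whose detection latency is <= each grid point."""
    return np.searchsorted(np.sort(latencies), grid, side="right") / latencies.size

latencies = np.array([0.14, 0.09, 0.21, -0.05, 0.18, 0.11])
grid = np.linspace(-0.2, 0.4, 7)
for x, p in zip(grid, empirical_cdf(latencies, grid)):
    print(f"P(latency <= {x:+.1f}s) = {p:.2f}")
print("false positive rate:", float(np.mean(latencies < 0)))
```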
As with most things, this improvement is not entirely free (I was disappointed to find, upon moving into ML in industry, that there is “no free lunch,” especially given free lunch is one of the primary motivators for physics PhD students). Flux V0.1 typically detects start-of-turn ~40ms slower than Flux V0. However, as the CDF of first word durations indicates, both versions still detect turn start within 200ms of when the user actually starts speaking, well within the regime of “normal.”
Just like for start-of-turn, improvements for end-of-turn show up in the tails, i.e., those difficult cases where being a little bit more conservative might lead to a higher quality prediction. The plot below is analogous to the one above, but focused on end-of-turn detection latency at eot_threshold = 0.80. At the median, we see a modest latency reduction of 40ms, but speedups closer to 100-150ms at higher percentiles.
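A sketch of that tail comparison (the latency samples below are toy numbers in milliseconds, not our benchmark data):

```python
import numpy as np

# Minimal sketch: compare median and tail percentiles of end-of-turn detection
# latency between two model versions. Samples are illustrative only.

def latency_percentiles(latencies_ms: np.ndarray, qs=(50, 90, 99)) -> dict[int, float]:
    return {q: float(np.percentile(latencies_ms, q)) for q in qs}

v0 = np.array([260, 300, 340, 420, 560, 700, 820])
v0_1 = np.array([230, 260, 300, 380, 460, 570, 680])
p_v0, p_v0_1 = latency_percentiles(v0), latency_percentiles(v0_1)
for q in (50, 90, 99):
    print(f"p{q}: {p_v0[q]:.0f}ms -> {p_v0_1[q]:.0f}ms ({p_v0[q] - p_v0_1[q]:.0f}ms faster)")
```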
Below, we show the WER and end-of-turn detection F1 score as a function of median detection latency as controlled by eot_threshold (lower threshold = crossed earlier = faster detection, but more false positives). As we can see, the new model behaves almost identically, apart from being slightly faster (and more accurate!) at higher thresholds.
Note that we do observe a minor slowdown on the left side of the plot, which corresponds to eot_threshold = 0.6, lower than the values preferred by most developers. Even at this threshold, however, latency is actually reduced for percentiles above 80%.
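For intuition on how eot_threshold trades latency against false positives, here is a minimal sketch under assumed per-frame end-of-turn probabilities and an assumed 80ms frame hop (neither reflects Flux’s actual internals): lower thresholds are crossed earlier, giving faster detection but more risk of firing prematurely.

```python
import numpy as np

# Minimal sketch: a lower eot_threshold is crossed earlier (faster detection),
# at the cost of more premature (false positive) end-of-turn decisions.
# Frame hop and probabilities are illustrative assumptions, not Flux internals.

FRAME_MS = 80  # assumed frame hop

def detection_latency_ms(eot_probs: np.ndarray, true_eot_frame: int, threshold: float) -> float:
    """Latency between the true end of turn and the first threshold crossing."""
    crossings = np.flatnonzero(eot_probs >= threshold)
    detected_frame = int(crossings[0]) if crossings.size else len(eot_probs) - 1
    return (detected_frame - true_eot_frame) * FRAME_MS

eot_probs = np.array([0.02, 0.05, 0.10, 0.35, 0.62, 0.81, 0.93, 0.97])
true_eot_frame = 3  # the speaker actually finishes here
for threshold in (0.6, 0.7, 0.8, 0.9):
    latency = detection_latency_ms(eot_probs, true_eot_frame, threshold)
    print(f"eot_threshold={threshold}: detected {latency:.0f}ms after end of turn")
```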
Improved Transcription Quality
Since the changes in behavior predominantly impact working transcripts and the implications for turn detection, it is not immediately obvious that these changes would necessarily result in significant improvements to finalized accuracy. However, again leveraging the analogy to “test time compute,” it is not unreasonable that this is the case. If you have a finite time (or token) budget to answer a question and you devote some of that precious budget to a fruitless line of reasoning, that reduces the time you have available to find the right answer. And, indeed, we see that these changes in behavior do correspond to meaningful changes in transcription accuracy.
Some of this is apparent above; Flux V0.1 exhibits a lower WER across most of the parameter space than its predecessor. However, since “two-person conversation-oriented” data can be somewhat narrow in terms of acoustic conditions or topics covered, we also compared the “pure transcription” capabilities of the model on a broader data sample.
When evaluating STT models at Deepgram, we typically use internal test sets that are reflective of the wide range of real-world use cases and audio conditions encountered by our customers. We do not prefer open source evaluation sets such as Common Voice due to their more artificial nature, and in fact have found that achieving high performance on Common Voice can come at the expense of performance on customer data. Also, since the test splits are public, one might be tempted to over-fit on Common Voice in order to look more impressive on benchmarking platforms.
In this case, however, our internal evaluation on a truly held-out Common Voice test set revealed significant improvements from this methodology, potentially due to the model’s ability to “wait” when confronted with challenging or more stilted speech. Specifically, whereas we observed a modest 3-5% improvement on our internal test sets, the improvement on our Common Voice test set was closer to 10%! So, especially since we are focused on relative Flux improvements and not comparing to competitors (who might have their own view of what an appropriate held-out set is), here we also share the results of our internal Common Voice benchmarking.
The plots above show finalized transcription accuracy on two test sets, comparing the original version of Flux with this latest update. As you can see, the new version of Flux is modestly more accurate! This was a pleasant surprise in the land of English STT, where we find our models are very close to the accuracy ceiling imposed by ground truth noise (either due to inherent ambiguity in transcription or annotator mistakes).
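As a reference for how relative WER improvements like those quoted above are computed, here is a minimal sketch using the open-source jiwer library (not necessarily our internal tooling; the transcripts are toy examples):

```python
import jiwer  # common open-source WER library

# Toy reference transcripts and hypotheses from two hypothetical model versions.
references = ["so what time works best for you tomorrow", "i can do nine thirty"]
v0_hyps    = ["so what time works best for you to borrow", "i can do nine thirty"]
v0_1_hyps  = ["so what time works best for you tomorrow", "i can do a nine thirty"]

wer_v0 = jiwer.wer(references, v0_hyps)
wer_v0_1 = jiwer.wer(references, v0_1_hyps)
relative_improvement = (wer_v0 - wer_v0_1) / wer_v0
print(f"WER: {wer_v0:.3f} -> {wer_v0_1:.3f} ({relative_improvement:.0%} relative improvement)")
```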
Conclusions
Our new training approach results in a more economical version of Flux, resulting in better overall accuracy and, notably, fewer false positives when it comes to start-of-turn detection. Now, if only I could work out how to train my offspring to be more economical with their “TV budget,” we’d really be onto something…


