AIMinds #033 | Andrew Seagaves, VP of Research at Deepgram
About this episode
Andrew is a computational scientist and theoretician turned Deep Learning Researcher. He earned his PhD in Mechanical Engineering from MIT where he worked on simulating body armor fragmentation using giant supercomputers. After graduating, he worked for an energy company designing shaped charges and other explosive devices using AI. At Deepgram, he is a Research Scientist and Team Lead, where he does exploratory deep learning research and builds awesome new NLU products.
Listen to the episode on Spotify, Apple Podcasts, Podcast Addict, or Castbox. You can also watch this episode on YouTube.
In this episode of the AIMinds podcast, Demetrios is joined by our very own Andrew Seagaves, VP of Research at Deepgram, to discuss text-to-speech (TTS) technology and language modeling. Andrew shares insights from his diverse AI career, including his work on shaped charges and his role in advancing speech recognition and TTS systems. He explores the challenges of managing complex text and audio data, the importance of context conditioning in TTS, and how these technologies have evolved. Andrew also recounts his career trajectory from MIT to designing AI-driven explosive devices, highlighting key innovations and future prospects in AI speech technology.
Additionally, Andrew teases exciting new features and launches at Deepgram that are set to revolutionize the landscape of speech technology, inviting listeners to stay tuned for upcoming releases.
Fun Fact: Before diving into the world of AI and speech technology, Andrew Seagaves worked as an explosives engineer, where he designed explosive devices for oil and gas wells using shaped charges. This unique early career choice showcases the versatile applications of engineering skills across different industries.
Show Notes
00:00 Designing thin reactive membranes to generate power.
05:49 Scalable simulations require stable large computer processing.
08:39 Shaped charge creates high-speed collapsing jet.
12:06 Experimentation, learning, performing, and expanding through simulation.
12:56 Developing and optimizing designs using evolutionary algorithms.
19:17 Speech recognition research focused on expansion and innovation.
22:54 Revolution in architecture; high-quality data key.
26:41 Text-to-speech and language modeling complexity explained.
29:20 Text relies on one vector, lacks nuance.
31:10 TTS should sound emotionally and tonally appropriate.
Transcript:
Demetrios:
Welcome to the AI Minds podcast. This is a podcast where we explore the companies of tomorrow, built AI-first. I'm your host, Demetrios, and this episode is brought to you, just like every episode, by the number one voice API on the Internet today, Deepgram. It is trusted by the world's top startups, conversational AI leaders, and enterprises like Spotify (the one you can listen to music on), Twilio, NASA (the one that sends rockets into space), and Citibank. I'm joined today by the VP of Research of Deepgram itself, which is an enormous honor. Mister Andrew. How you doing, dude?
Andrew Seagaves:
I'm doing great. I'm very excited to be here. It's great to see you, Demetrios.
Demetrios:
Let's hope that I can live up to your expectations, because, well, I know.
Andrew Seagaves:
I have to live up to yours. Yeah.
Demetrios:
Such a rich history that I would love to get into right now. You did undergraduate, and you were a real engineer, as we were just saying before we hit record, not one of these software engineers before you transitioned into software engineering, I think, and you were developing theories for the design of fuel cells. Can you just break down what exactly that means? Because I have no idea if I even said that correctly and what that is.
Andrew Seagaves:
You did, you said lots of words correctly there, so nicely done.
Demetrios:
They are hard sometimes.
Andrew Seagaves:
Yeah, that's right. Yeah. When I was an undergrad, I was getting my training in mechanical engineering, and I was interested in developing theories of complex systems. And so one of the things I worked on, the main project I worked on, was designing polymer electrolyte membrane fuel cells, PEMFCs, which now maybe people forgot about them. They were a hot thing at the time. They are now in some infrastructure. They're in some buses. They were for a few years.
Andrew Seagaves:
Maybe they're not anymore, but, yeah. So these are, like, systems. They're thin layers of reactive material, and you flow a gas through them at very low rates, and the gas reacts with a membrane, and this reaction then establishes, like, an electrical current, and you can draw power from this. And so what I was working on was actually designing the pore structure of these very thin membranes in such a way that it was optimal to balance the processes of diffusion, the movement of the gas through the membrane by a diffusion process, and reaction, where the gas is actually interacting with the walls of the membrane, the pore walls.
Demetrios:
And the end goal, just so I'm not jumping ahead of myself, is to power vehicles.
Andrew Seagaves:
Yes. That was the end application that they had in mind at the time, and.
Demetrios:
It fell out of vogue.
Andrew Seagaves:
It fell out of vogue. I mean, there were some strong technological limitations, which, like the membranes themselves were not very mechanically robust, and they would wear down over time, and they would crack, and then when they cracked, they would lose their ability to function. And then the other big problem that they had was the storage and transportation of the gas. So, yeah, they were using hydrogen gas, and that was not a very economical option.
Demetrios:
It's not the easiest to get from point a to point b.
Andrew Seagaves:
That's right. That's right.
Demetrios:
So then you went on to MIT. Can you tell me a little bit about your stint there?
Andrew Seagaves:
Sure. Yeah, I worked. So I did mechanical engineering at MIT. I got my PhD in mechanical engineering there. From 2006 to 2013, I actually worked in an aero astro group. So a group that did aerospace engineering. I was the oddball mechanical engineer in that group. And a lot of the work that goes on in aerospace engineering is actually very, very high end simulation work.
Andrew Seagaves:
So, like, simulating complex systems, like, everything to do with airplanes and the aerospace industry, a lot of highly complex fluid mechanics simulations. And so what I worked on was modeling dynamic fracture and fragmentation, and the main application that we were interested in were protective plates, like body armor plates for personal human defense.
Demetrios:
So you were simulating how body armor would react, or how it would act, when hit with a lot of force.
Andrew Seagaves:
Yeah, basically when hit with a bullet traveling at ballistic speeds. And so that was a really unexplored area because you need a very big computer to model a fragmentation process. Fragmentation involves cracks forming in a material. It can be described as a hierarchical cascade of length scales. You see the big cracks, and then if you were to zoom in, you see smaller and smaller cracks. And then if you were to zoom in, you see smaller and smaller ones. So it's basically like a fractal, and so you have to. So it's, like, hard to model this.
Andrew Seagaves:
You need a very big computer. And so that was our objective, to come up with scalable ways of simulating things. And we ended up running some, like, very, very large calculations, and just getting the jobs, these calculations, to run at all, which, like, the analog of this these days is, like, getting your training job to remain stable over time. Right. Like, you know, dealing with training instabilities. It was the same thing: dealing with numerical instabilities and, like, putting in hacks to get the thing to remain stable so that it could run long enough so you could see the action and just observe.
Demetrios:
And were you using GPUs back then?
Andrew Seagaves:
We were not. This was all CPU work. Oh, yeah. GPUs were just emerging, an emerging technology at the time for use in computations.
Demetrios:
A twinkle in Jensen's eye back then, they were.
Andrew Seagaves:
They were, yeah. You know, people were, like, dipping their toes into GPU computing and, like, writing CUDA code and stuff like that. But that was a very.
Demetrios:
It was very early. So, I didn't quite catch how working with bulletproof vests relates to space.
Andrew Seagaves:
Well, the similar bodies that have lots of money will fund space research and also other kinds of defense related research.
Demetrios:
Yeah, yeah, that makes sense. I see. And both need a whole lot of compute.
Andrew Seagaves:
Yes. Yeah. So, like, I would say fragmentation is analogous to another problem in mechanical engineering, which is modeling turbulence in fluids. So those are two of, like, the hardest problems to model. They're actually similar in the fractal nature of phenomena happening at smaller and smaller length scales. Turbulence is basically kind of the same thing.
Demetrios:
Oh, that's fascinating. So then you got out of MIT and you did something that is so cool: you became an explosives engineer.
Andrew Seagaves:
That's right. Yep. I wasn't sure exactly what company would hire me coming out of the specialized work that I was doing. And it turns out that there was one particular group that I was very well suited for at the time, which is working in a group that makes shaped charges. I worked in an oil field service company for a number of years. I worked in a specialized group where we were basically designing little explosive devices that they deploy downhole in oil and gas wells to punch holes through.
Demetrios:
Okay. What is a shaped charge? Is it just a very directed bomb?
Andrew Seagaves:
That's exactly right. Yeah, it's exactly what it is. So, like, a shaped charge is like a steel cup, so some kind of receptacle, and then an explosive, and then a metallic cone that you press into it, and so you detonate the explosive, and that actually causes the cone to collapse in on itself. And this is like a highly unstable motion. And what it produces is a jet that forms at the apex of the cone from this cone just, like, collapsing on itself. And this jet travels at, like, extremely high speeds.
Andrew Seagaves:
If the cone is made out of copper, it's basically like a molten jet. Oh, wow. And it can penetrate almost anything. And so the metallic jet is the thing that does all the penetrating and the explosive.
Demetrios:
And what's the use case for that?
Andrew Seagaves:
So the reason that they use these things in the oil industry is that, like, you drill a hole, a cylindrical hole in the ground, like, very deep into the earth, and if you just did that, it would collapse in on itself, and so you have to put a pipe in the way. And so you put a pipe that holds things together, but now you have a pipe in the way and the oil can't flow in, so you have to punch a hole. And so the hole, like, the shaped charge is the thing that punches the.
Demetrios:
Hole at the end of the pipe.
Andrew Seagaves:
Yeah. So we pack so that we have these things called perforating guns, which are themselves long cylinders full of hundreds or thousands of charges directed radially outward. And you just lower these things down into the well and you detonate them and they all fire at the same time. And you punch like hundreds or thousands of holes.
Demetrios:
Oh, so now it's almost like a straw and it's more permeable. So this big pipe has a lot of holes in it and the oil can flow through.
Andrew Seagaves:
That's exactly right, yep. And so you're trying to design these charges to actually, like, produce, say, the deepest hole that they can and then maybe the widest entry hole. So this is kind of a multi-objective design problem, and you want it to do minimal damage to the rock. So you're, like, designing for the effects of the jet on the rock. So it's an incredible problem. I miss working on it. You could spend a whole career working on it, and there are a lot of folks who do. And it's a more interesting and hard problem than rocket design.
Andrew Seagaves:
I would say it's similar, but better. And so rocket scientists get a lot of, they have a lot of cachet. And I'm telling you, shaped charge scientists... Yeah, yeah.
Demetrios:
We do have that saying, which I tend to say a lot. It's not, you know, it's not so hard. It's not rocket science.
Andrew Seagaves:
Exactly.
Demetrios:
So I'm gonna change that. It's not shaped charging, it's not shaped charge design. It doesn't quite roll off of the tongue.
Andrew Seagaves:
No one has any idea what you're talking about. Yeah, yeah.
Demetrios:
I'll just reference this podcast and I'll say, you should hear what Andrew used to be doing. Now, how did AI come into your life?
Andrew Seagaves:
Yeah. So for a few years I was just doing experimental work, and I learned the black art of shaped charge design, build and test. And I learned all the secret rules, magic rules that the engineers had discovered over long periods of time. But then this build and test approach eventually reached a limit to what it could do. And we were asked to design more and more performant charges that did really amazing things with no increase in the explosive capacity or the fundamental driving forces. So we just had to do more with less. And in order to do that, we needed to leverage, you know, computation. And so we started simulating shaped charges, just something that I had done in my PhD.
Andrew Seagaves:
And then once we were able to simulate it with a reasonable amount of fidelity, we could optimize using something like evolutionary algorithms. And so that was something that I worked on for the last couple of years before I left, which was, like, coming up with an automated design optimization approach using evolutionary algorithms, which are incredible algorithms, genetic algorithms, and evolutionary approaches. And at that point, I feel like you're about to ask a follow up, but I'm going to say one more thing. Say it. I thought what we were doing, I started to get the feeling that what we were doing was ridiculous, because we were solving some set of equations, physics equations, and in order to do so, we were discretizing space and time and simulating the shaped charge by discretizing space and time, which is an incredibly complex and expensive way of obtaining the information that you want. And it just seemed like there's some lower dimensional representation of these shaped charge governing equations that we could leverage. And at the time, AI was really, you know, there was a lot of stuff coming out of AI research, like manifold learning and applying this to do order complexity reduction and governing equations.
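As a rough illustration of the simulate-then-optimize loop Andrew describes here, the following is a minimal sketch of an evolutionary algorithm in Python. Everything in it is an assumption made for illustration: `simulate` is a made-up stand-in for an expensive physics simulation, the two objectives (hole depth and entry width) are combined with arbitrary weights, and the selection and mutation scheme is a generic textbook one, not the approach used in the work discussed in this episode.

```python
import random


def simulate(design):
    """Hypothetical stand-in for an expensive physics simulation.

    Returns two made-up objectives (depth, entry_width) for a design vector.
    Higher is better for both.
    """
    depth = -sum((x - 0.7) ** 2 for x in design)
    entry_width = -sum((x - 0.3) ** 2 for x in design)
    return depth, entry_width


def fitness(design, w_depth=0.8, w_width=0.2):
    """Collapse the two objectives into one score with assumed weights."""
    depth, entry_width = simulate(design)
    return w_depth * depth + w_width * entry_width


def evolve(n_params=5, pop_size=20, n_parents=5, generations=40,
           sigma=0.1, seed=0):
    """Simple elitist evolutionary loop: select, recombine, mutate."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_params)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen at each generation
    for _ in range(generations):
        # Selection: keep the top designs as parents (elitism).
        population.sort(key=fitness, reverse=True)
        parents = population[:n_parents]
        history.append(fitness(parents[0]))
        # Variation: uniform crossover of two parents plus Gaussian mutation.
        children = []
        while len(children) < pop_size - n_parents:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) + rng.gauss(0.0, sigma)
                     for pair in zip(a, b)]
            children.append(child)
        population = parents + children
    population.sort(key=fitness, reverse=True)
    history.append(fitness(population[0]))
    return population[0], history


best_design, history = evolve()
print("best design:", [round(x, 3) for x in best_design])
print("best fitness:", round(history[-1], 5))
```

Because the parents survive into the next generation, the best score never gets worse from one generation to the next. In a real setting each call to `simulate` could take hours, which is part of why Andrew goes on to ask whether a cheaper, lower dimensional representation exists.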
Demetrios:
When was this?
Andrew Seagaves:
This was like 2000. What was this? 2019.
Demetrios:
Okay.
Andrew Seagaves:
In that time frame. So that's when I really got interested in AI. And that's when I started to learn it and get my hands dirty. And that's about the time that, uh, I ended up making the jump.
Demetrios:
But you started to learn with it. Was there a way that you could apply AI to what you had been working on with the shaped charges?
Andrew Seagaves:
Yes, there was. So, actually, one of the early things that I did... I'm not sure how much I can say about it. Well, it never went anywhere, so I guess I could talk about it. It was actually using GANs to generate designs. So that was an early... So I was looking at some work that had come out of University of Maryland, and it was really cool stuff where you could train a GAN to generate new samples from your design parameter space so it could learn how to generate. It could learn the shape of an aircraft wing, and then you could actually sample, using a GAN, aircraft wings that had never been synthesized before, something like that.
Andrew Seagaves:
That was the original impetus for their research. And then I applied it to design shaped charges.
Demetrios:
Wow. Okay. So this is really cool because it's basically like you're using the precursor of what we all know as text-to-photo right now, with the diffusion models. You were using that to get inspiration on your shaped charge designs with the GANs, which were... I'm not going to say it was the model before the diffusion models, but it was kind of like the precursor to the diffusion models.
Andrew Seagaves:
Totally. That's right.
Demetrios:
Yep.
Andrew Seagaves:
So our inputs, instead of, like, natural language inputs, they were just, like, 15 or 20 different float values that were corresponding to, like, some latents in a design space.
Demetrios:
So those were your prompts.
Andrew Seagaves:
Exactly. Yeah. But I couldn't tell, you know, I couldn't tell it what, uh, I couldn't express myself in that language, you know?
Demetrios:
Yeah.
Andrew Seagaves:
So we had a noise sampler that was really the one, uh, specifying the prompt. Yeah.
Demetrios:
And so then you got to play around with that. I imagine that piqued your interest, and that is what made you want to dive deeper. How did you end up going and taking the next step?
Andrew Seagaves:
Yeah, that's a great question. Yeah, it did. It made me want to dive deeper. So I wanted to get involved in AI research, and I was really open as to what that would look like. And I just happened to accidentally meet someone who was recruiting for Deepgram. But I was actually interviewing with another company, and they were recruiting for that company, but also for Deepgram. And when it didn't work out with the first company, they suggested, hey, why don't you have a call with Adam Sypniewski? He's CTO of Deepgram. And that was actually Lauren Sypniewski, who at the time, I didn't even realize the connection there.
Andrew Seagaves:
That was Adam's sister, who now works for Deepgram.
Demetrios:
Oh, nice.
Andrew Seagaves:
Head of data ops. And, yeah. So then I had a call with Adam, and then the rest is history.
Demetrios:
And what did you start doing? How has your job evolved over the years?
Andrew Seagaves:
Yep. So I came in as a researcher. I was a research scientist, and the first thing I worked on was the diarizer. I built... well, we did have a diarizer, but it was not very good.
Demetrios:
Now, that's something that people rave about. I know there's a ton of people that will always say, I specifically chose Deepgram because of the diarization feature.
Andrew Seagaves:
Yep. We have a really strong diarizer now. It's definitely not the one that I did. It's been props, but, yeah, I wish I could take credit for it, but that was all.
Demetrios:
Like, I came from explosives, and right off the bat, I created the best diarizer out there.
Andrew Seagaves:
It was at the time. The one that we shipped was probably the best out there, but it wasn't that good, you know, so, like, the technology has come a long way. But the one thing was, it was fast enough to run and service, like, real production loads. And that's often, like, a limiting factor in being able to ship something for Deepgram, because we do everything at scale and we have, like, customers who process tremendous, tremendous amounts of audio through our system. Yep.
Demetrios:
And fast.
Andrew Seagaves:
And fast. It's got to be fast. It's got to be scalable.
Demetrios:
So then you built that and you continued to just research?
Andrew Seagaves:
Yep, I worked on a bunch of stuff, but eventually I started working on speech recognition. And so I was doing like frontier research for speech recognition, trying to build like the next generation of models. So we would like ship a generation, we would expand it, training it in new languages, new use cases, etcetera. So we always had this like expansion phase. And then in parallel to that, I would be doing the research to build the next version that would replace this. And that would be there's like, you have the inputs to the model, so the features that you're going to use, the input space was always on the table. The architecture of the model was always on the table, and then the data you used to train it. So we were always scaling and improving the data and then the training approach, the strategy.
Andrew Seagaves:
So those are kind of like the four ingredients for building really any model, to be honest, and always iterating on those and trying new stuff. And then you make a discovery and you find a particular thing that works really well. And maybe you get... what's really interesting about architecture research up to this point is that we could find better architectures that would give us improvements across, like, all audio domains, which was really interesting. So that's like a powerful generalizing thing, and that's becoming increasingly difficult to do now.
Demetrios:
What were some other surprises that came out of nowhere in your eyes as you were working on some of this stuff?
Andrew Seagaves:
I would say one big surprise. Well, one thing that we had a hypothesis that scaling data would help. And so something that was really cool to see, maybe it wasn't that surprising, was that scaling the data really helped models massively to generalize well. And so showing the data examples like very, very broad corpus of different accents and different acoustic scenes, many examples of words being spoken, the models got incredibly strong. And so people have seen similar effects in LLMs, like scaling up the data and scaling up the model. We kind of observed the same stuff in speech recognition.
Demetrios:
And how about the labeling part? Can you talk a bit about that? Because it feels like that's probably the other piece that as you're scaling up the data, a, labeling then becomes you have to scale that up too. But b, it is just as important because it's part of the data.
Andrew Seagaves:
Yeah, I mean, it's actually like, I would say there's two effects that are data. So if you want to understand the effect of data and what's important for a speech recognition model, and maybe this extends to other types of models, there's a fundamental scale effect. So you need very, very large data scales to get a model to a certain minimum quality level for current architectures. So a model needs to see a word many, many times to learn it reliably and not to the point where it wouldn't benefit from seeing that word more times. And so there's this fundamental scale effect. And the scale effect right now is the dominant thing that explains why there's a very long tail of languages where speech recognition is not very good yet.
Demetrios:
Because we just, because there's not data of the size that you need with current architecture.
Andrew Seagaves:
So there's probably maybe a revolution in architecture where models can learn much more efficiently with fewer examples, but that is yet to be determined. So the first thing is scale. Once you have scale, you then really need domain coverage and specificity for all of the niche audio conditions and speaker characteristics and special vocabulary that you care about for different business use cases. We don't have that in many languages at all. We have that for English, for example. And that's why English models have gotten really, really good. It's primarily the data, and that's where all of your very high quality labeling becomes the most important thing. For scale, you train primarily on noisy labels that are not, like, gold, high quality, verified ground truth, and you use whatever tools are available to you to filter the data so that you try to train on what's good.
Andrew Seagaves:
And then for your domain coverage, your specialized data, that's where you really need good labels.
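To make the idea of filtering noisy labels concrete, here is a minimal sketch in Python. It assumes one possible filtering tool that is not described in the episode: compare each noisy label against the transcript produced by an existing model, and keep only the samples where the two roughly agree, measured by word error rate. The field names (`label`, `model_transcript`), the sample data, and the `max_wer` threshold are all hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def filter_noisy_labels(samples, max_wer=0.2):
    """Keep samples whose noisy label roughly agrees with a model transcript."""
    return [s for s in samples
            if word_error_rate(s["label"], s["model_transcript"]) <= max_wer]


# Hypothetical samples: "label" is the noisy label, "model_transcript" is
# what an existing speech recognition model produced for the same audio.
samples = [
    {"id": "a", "label": "punch holes through the pipe",
     "model_transcript": "punch holes through the pipe"},
    {"id": "b", "label": "the jet travels at high speed",
     "model_transcript": "the jet travels at high speeds"},
    {"id": "c", "label": "thanks for calling",
     "model_transcript": "designing fuel cell membranes"},
]

kept = filter_noisy_labels(samples)
print([s["id"] for s in kept])  # prints ['a', 'b']
```

Sample "c" is dropped because its label and the model transcript disagree completely, which suggests the label is attached to the wrong audio. A filter like this only suits the large, noisy portion of the data; the specialized domain data Andrew mentions would still need carefully verified labels.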
Demetrios:
Okay, I had no idea about that. Can you go into a bit of the idea on text to speech? Because this is all speech to text, right, that we're talking about. But recently we came out with the model Aura, which is all text to speech. How does it differ, if at all? What does that look like? And what are some things that you now have to keep in mind because you have both ends of the spectrum?
Andrew Seagaves:
Yeah, this is a great question. So we've been working a lot on text to speech recently. And it was very exciting to start getting our hands dirty with that and really building text to speech models from the ground up and developing our own perspective on them. One thing that you can say right off the bat, or how I would describe speech to text and text to speech and to juxtapose them, is that speech to text models effectively learn to discard information. That's what they're really good at. And to be effective, that's what they have to do, because they're taking in this audio signal, which has a tremendous amount of relevant information in it, incredibly dense in information. But really you only care about being able to write down the words that are present, what people call the content. That's like the term that folks are using these days.
Andrew Seagaves:
And so in order to do that, the model just has to learn to discard all the information that's not necessary to write down the content. And that turns out to be hard, but not that hard, I think, like, relatively easy to do. It's a many to one problem. Many to one, I think, is a lot easier to do than one to many. So, like, text to speech, on the other hand, TTS is like the opposite problem. So I think TTS is ridiculous. This is what I like. It is a terrible idea.
Andrew Seagaves:
Like, you expect a model to take in a random text string and then give you a realization of that in audio that you want, that you like. Right? Like that solves some business problem. Like the. The model just does not have enough information from text alone, from the words alone. So it's like so much nuance. Exactly. Incredible. There's an incredible amount of nuance.
Andrew Seagaves:
How do you want it to sound? You know, in, like, a hundred different dimensions. Right? So I think, like, text to speech is very similar, in how ridiculous it is, to language modeling. So I also think language modeling is absurd. What we expect an LLM to do is take a random text string from the Internet and predict the next word, given the wild complexity, expansive complexity of distributions of text on the Internet. And so the only way to get an LLM to do that is to massively scale the data. So basically use all the text that's ever been generated by humankind, which is a lot, and you can get it. Okay? And then equally, like, scale the model up to hundreds of billions of parameters. You reach the limit of what you can do with compute. So then you just keep building larger and larger compute to scale the model more.
Andrew Seagaves:
And the reason that you have to do that is just because the task is just too hard, you know, like, it is incredible that you can get a model to do it at all, I think. Okay, so text to speech is just like that. You could describe it as, like, the model is fundamentally under conditioned. It doesn't have enough additional conditioning information about telling it about the context in which it's supposed to answer the question or generate the response that it's supposed to generate. And so the issue with text to speech is there's just, like, not nearly as much audio available as there is text. Yeah. So we just don't have right now the ability to do the massive scaling. And so we have to do modeling.
Andrew Seagaves:
We have to do real modeling to add conditioning to the model. We have to give it more conditioning signals. And this is where I think the field is not at all in a state. It's not in a good state yet where it should be. People are still trying to train text to speech models with very bare bones conditioning, which is absurd. It's very hard to do. And this is why also, text to speech models don't really work that well yet. We think of this giving better conditioning to the model as one of the core things that's going to make text to speech better, giving the model context.
Andrew Seagaves:
Our CEO calls this representation hacking. That's what he calls it, because you can't just rely on scaling. So anyway, I think text to speech is extremely exciting.
Demetrios:
Well, it makes a lot of sense. When you read text, there's things that, even as a human, when I send a text to someone, sometimes the person who's reading it may not interpret it the same way that I wanted to send it, just because, as you said, you've condensed down all these vectors, so you only have this one vector that is the text. Whereas when you're talking, you can hear if I'm being sarcastic or if I'm joking or whatever it may be, because there's so many different vectors. And so I really like that visualization: on the front end, audio to text or voice to text, you have many to one, but then going from one to many is so hard because there are so many nuances that you have to make sure you're keeping in line. Otherwise, when that audio is generated, it's just not gonna sound right. And for humans, it's gonna be like, wait, what? Why? Imagine somebody's trying to say something sarcastic, but the text or the voice that comes out is not sarcastic at all. You instantly have this disconnect and a pattern interrupt in your mind, saying, like, something's wrong here. I don't know what it is, but I don't like it.
Andrew Seagaves:
It's so true. Yeah. Humans can tell right away. It's very hard to get a text to speech system to produce natural, human-like intonation patterns over long timescales. And so that's, like, the first thing that you have to try to solve to get it to sound natural at all, you know, because just, like, the slightest weird pause or unnatural speed variation, and a human can tell right away, and they don't like it. They don't like talking to something that sounds like a bot. Not even.
Andrew Seagaves:
It doesn't even sound like a bot. It sounds weird. Yeah, it just sounds weird. So you have to get your TTS not sounding weird, but then it has to sound, like, right. You know, like, it then has to sound emotionally or tonally appropriate for the context. And that's where, like, you know, utilizing context, I think, is really, like, kind of a game changer for TTS. Like, being able to condition on.
Andrew Seagaves:
On the content of a conversation up to this point is one thing that's going to enable TTS systems to sound much more natural, for sure.
Demetrios:
Yeah. Well, I'm really excited for everything that you're doing. I always love when there's new releases that come out, and I love to see the community rave about it. So congrats on all the great work.
Andrew Seagaves:
Thanks a lot. We really appreciate it. And, you know, we're going to be shipping, we're going to be shipping some new features quite soon, so keep an eye out.
Demetrios:
Excellent.