---
title: "Voice Is the New Interface: What I Saw at the AssemblyAI VoiceAI Meetup"
date: 2026-03-10
description: "I attended the AssemblyAI VoiceAI meetup and walked away convinced that voice has crossed the threshold from promising demo to production-ready magic."
tags: ["voice-ai","assemblyai","speech-to-text","agent-experience"]
readingTime: "12 min read"
url: https://alexmoening.com/dev-thoughts/voice-is-the-new-interface.html
markdownUrl: https://alexmoening.com/dev-thoughts/voice-is-the-new-interface.md
---

# Voice Is the New Interface: What I Saw at the AssemblyAI VoiceAI Meetup

[← Back to /dev/thoughts](/dev-thoughts/)

<p class="lead">I attended the AssemblyAI VoiceAI meetup last night and walked away convinced that voice has crossed the threshold from "promising demo" to "production-ready magic." Their latest model, Universal-3 Pro, together with its sibling models demoed back to back, nailed scientific notation, advanced medical terminology, and programming jargon in real time, without breaking a sweat.</p>

### The Company Behind the Magic

<p class="section-summary">AssemblyAI has been quietly building the most developer-friendly speech AI infrastructure on the planet since 2017.</p>

AssemblyAI isn't a newcomer chasing a trend. Founded in 2017 by **Dylan Fox** — a former Cisco ML engineer who applied to Y Combinator 30 days past the deadline with a video demo — the company has been on a quiet tear.

<table class="data-table">
    <thead>
        <tr>
            <th>Metric</th>
            <th>Value</th>
        </tr>
    </thead>
    <tbody>
        <tr><td>Founded</td><td>2017 (YC S17)</td></tr>
        <tr><td>Total Funding</td><td>$115M (Seed → Series C)</td></tr>
        <tr><td>Valuation</td><td>~$300M</td></tr>
        <tr><td>Revenue</td><td>$10.4M (2024), 2x YoY growth</td></tr>
        <tr><td>Developers</td><td>200,000+</td></tr>
        <tr><td>Monthly Inference Calls</td><td>600M+</td></tr>
        <tr><td>Customers</td><td>5,000+ (Spotify, Notion, NBC Universal, WSJ)</td></tr>
    </tbody>
</table>

Their investor roster reads like a who's-who of tech: **Accel** (led Series A and C), **Insight Partners** (led Series B), the **Collison brothers** (Stripe founders), **Nat Friedman** (ex-GitHub CEO), **Daniel Gross** (Fox's first investor from the YC days), and **Keith Block** (former Salesforce co-CEO).

Fox's origin story is worth noting — he recognized that incumbents in speech recognition had built products on aging technology and stopped innovating. He saw the same opening that Twilio saw in telecom and Stripe saw in payments: **make powerful infrastructure absurdly easy for developers.** It took three years to hit $1M in revenue. Now they're processing over 10 terabytes of audio per day.

---

### What I Saw: The Demo That Dropped Jaws

<p class="section-summary">One of the best live demos I've seen. Perfect real-time transcription of scientific, medical, and programming terminology.</p>

The presentation was, simply put, one of the best live demos I've seen. **Alex Kroman**, AssemblyAI's Chief Product and Technology Officer (formerly GM and SVP of Product and Engineering at New Relic), walked through several models and ran them all in real time against increasingly difficult content:

<table class="data-table">
    <thead>
        <tr><th>Domain</th><th>Examples Demonstrated</th></tr>
    </thead>
    <tbody>
        <tr><td>Scientific notation</td><td>Complex formulas, mathematical expressions, chemical nomenclature</td></tr>
        <tr><td>Advanced medical terminology</td><td>Pharmaceutical names, anatomical terms, diagnostic codes</td></tr>
        <tr><td>Programming jargon</td><td>Framework names, API references, code syntax, technical product names</td></tr>
    </tbody>
</table>

The system didn't hesitate. It didn't hallucinate. It didn't approximate. It just *banged it out* — perfect, real-time, every time. The audience reaction was visceral. Jaws dropped. People looked at each other like they'd just seen a card trick they couldn't explain.

I've been playing around with a number of speech-to-text solutions — Whisper, Deepgram, Google, Amazon Transcribe — and I haven't seen anything match what was demonstrated live last night. The difference isn't incremental. It felt like a generational leap.

---

### The Technology: Why This Is Different

<p class="section-summary">Speech Language Models fuse audio encoders with LLMs — the system understands words, not just sounds.</p>

What makes this possible is a fundamental architectural shift that AssemblyAI has been pioneering: **Speech Language Models (SLMs).**

Traditional speech-to-text works like a translator — it hears sounds and maps them to words. AssemblyAI's approach fuses an audio encoder with a large language model through an adapter layer. The result is a system that doesn't just *hear* words — it *understands* them. When it encounters "myocardial infarction" or "Kubernetes" or "3.14 times 10 to the negative 5," it has semantic context, not just phonetic matching.
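To make the adapter idea concrete, here is a toy sketch of the fusion step, assuming nothing about AssemblyAI's actual architecture: an adapter is, at its simplest, a learned projection that maps audio-encoder frame embeddings into the LLM's token-embedding space, so the language model can attend over sound the same way it attends over text. The dimensions and matrix below are invented for illustration.

```python
# Illustrative sketch of Speech Language Model fusion (NOT AssemblyAI's
# actual architecture): an adapter projects audio-encoder features into
# the LLM's token-embedding space so both live in one semantic space.

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def adapter(audio_features, projection):
    """Map each audio-frame embedding into the LLM embedding dimension."""
    return [matvec(projection, frame) for frame in audio_features]

# Toy dimensions: 2-dim audio features -> 3-dim LLM embeddings.
projection = [[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]]
frames = [[2.0, 4.0]]            # one encoded audio frame
llm_inputs = adapter(frames, projection)
print(llm_inputs)                # [[2.0, 4.0, 3.0]]
```

In a real SLM the projection is a trained neural network and the frames number in the thousands, but the shape of the idea is the same: once audio lands in the LLM's embedding space, "myocardial infarction" is a concept, not just a phoneme sequence.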

#### The Model Evolution

<table class="data-table">
    <thead>
        <tr>
            <th>Model</th>
            <th>Release</th>
            <th>Key Innovation</th>
        </tr>
    </thead>
    <tbody>
        <tr><td>Universal-2</td><td>2024</td><td>99 languages, code-switching, 24% better rare word recognition</td></tr>
        <tr><td>Slam-1</td><td>April 2025</td><td>First prompt-based speech model — 72% human preference over competitors</td></tr>
        <tr><td>Universal-3 Pro</td><td>February 2026</td><td>Promptable SLM, 1,500-word context, 50+ audio event tags</td></tr>
        <tr><td>Universal-3 Pro Streaming</td><td>March 3, 2026</td><td>Voice-agent optimized, sub-300ms latency, real-time keyterm updates</td></tr>
    </tbody>
</table>

The latest — **Universal-3 Pro** — is the first production-quality *promptable* speech model. You can give it natural language instructions before transcription: "This is a medical consultation about cardiac care" or "The speaker will reference React components and TypeScript interfaces." The model adapts its recognition in real time.
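As a sketch of what "promptable" means in practice, here is how a request that primes the model with domain context might be assembled. The field names (`prompt`, `keyterms`) and the helper function are hypothetical, for illustration only; they are not AssemblyAI's documented API.

```python
# Hypothetical sketch of priming a promptable speech model with domain
# context before transcription. Field names ("prompt", "keyterms") are
# illustrative, NOT AssemblyAI's documented request schema.
import json

def build_transcription_request(audio_url, prompt, keyterms):
    """Assemble a payload that frames the model before it hears audio."""
    return {
        "audio_url": audio_url,
        "prompt": prompt,        # natural-language framing of the domain
        "keyterms": keyterms,    # vocabulary to boost during recognition
    }

payload = build_transcription_request(
    "https://example.com/consult.wav",
    "This is a medical consultation about cardiac care.",
    ["myocardial infarction", "Wellbutrin XL", "OAuth 2.0 PKCE"],
)
print(json.dumps(payload, indent=2))
```

The point is the shape of the interaction: instead of retraining or uploading a custom vocabulary file, you describe the domain in plain language and the model adapts its recognition at inference time.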

**Key performance benchmarks:**

<table class="data-table">
    <thead>
        <tr><th>Metric</th><th>Value</th><th>Context</th></tr>
    </thead>
    <tbody>
        <tr><td>Word Error Rate</td><td>3.3%</td><td>Second only to OpenAI's batch-only GPT-4o-transcribe (2.46%)</td></tr>
        <tr><td>Median latency</td><td>307ms</td><td>41% faster than Deepgram Nova-3 (516ms)</td></tr>
        <tr><td>Keyterm prompting gain</td><td>Up to 45%</td><td>Accuracy improvement on domain-specific vocabulary</td></tr>
        <tr><td>Medical entity errors</td><td>-88%</td><td>Reduction with specialized prompting</td></tr>
        <tr><td>Transcript stability</td><td>Immutable</td><td>Characters never revise after emission — critical for AI agent pipelines</td></tr>
    </tbody>
</table>
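For readers comparing these numbers: Word Error Rate, the headline metric above, is just word-level edit distance (substitutions + insertions + deletions) divided by the number of words in the reference transcript. A minimal implementation:

```python
# Word Error Rate (WER): word-level Levenshtein distance over the
# reference length. This is the standard metric behind the 3.3% figure.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# One substituted word out of five -> 20% WER.
print(wer("the patient shows myocardial infarction",
          "the patient shows myocardial infraction"))  # 0.2
```

A 3.3% WER means roughly one word in thirty is wrong; on dense medical or code vocabulary, where each wrong word can change the meaning, that margin is what separates a demo from a product.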

---

### Voice Is the New Interface

<p class="section-summary">We're at an inflection point where voice is becoming the primary way humans interact with AI.</p>

What made last night feel like more than a product demo was the bigger picture it painted.

**The market data backs this up:**

<table class="data-table">
    <thead>
        <tr><th>Indicator</th><th>Data</th></tr>
    </thead>
    <tbody>
        <tr><td>Voice recognition market</td><td>$18.39B (2025) → $61.71B by 2031 (22.38% CAGR)</td></tr>
        <tr><td>Voice AI agent market</td><td>$2.4B (2024) → $47.5B by 2034 (34.8% CAGR)</td></tr>
        <tr><td>Builder adoption</td><td>87.5% actively building voice agents, not just experimenting</td></tr>
        <tr><td>Enterprise adoption</td><td>97% have adopted voice AI; 67% consider it foundational</td></tr>
        <tr><td>VC funding</td><td>$315M (2022) → $2.1B (2024) — nearly 7x in two years</td></tr>
        <tr><td>YC allocation</td><td>22% of the latest class is building voice-first products</td></tr>
    </tbody>
</table>

<blockquote class="pull-quote">Voice will become the wedge, not the product.<br><cite>— Andreessen Horowitz</cite></blockquote>

Dylan Fox himself puts it in perspective: *"We are at the start of a 100x curve, which means today's usage is the floor, not the peak."*

The thing that clicked for me last night is that we've crossed the reliability threshold. When a model can handle "Wellbutrin XL 150mg" and "OAuth 2.0 PKCE flow" and "6.022 times 10 to the 23rd" without flinching — in real time, in a noisy room — the "voice interface" stops being aspirational and starts being *the obvious choice*.

---

### The Voice AI Stack

<p class="section-summary">Four layers, best-of-breed at each. AssemblyAI has the best ears in the business.</p>

For those building in this space, here's how the stack layers:

<div class="flow-diagram flow-vertical" role="img" aria-label="Voice AI Stack: Orchestration, Text-to-Speech, Large Language Model, Speech-to-Text">
    <div class="flow-step">
        <span class="step-icon">🎭</span>
        <span class="step-text">ORCHESTRATION<br>LiveKit, Pipecat, Vapi, Daily</span>
    </div>
    <span class="flow-arrow" aria-hidden="true">↕</span>
    <div class="flow-step">
        <span class="step-icon">🗣️</span>
        <span class="step-text">TEXT-TO-SPEECH<br>ElevenLabs, Cartesia, Rime</span>
    </div>
    <span class="flow-arrow" aria-hidden="true">↕</span>
    <div class="flow-step">
        <span class="step-icon">🧠</span>
        <span class="step-text">LARGE LANGUAGE MODEL<br>Claude, GPT, Gemini, DeepSeek</span>
    </div>
    <span class="flow-arrow" aria-hidden="true">↕</span>
    <div class="flow-step">
        <span class="step-icon">👂</span>
        <span class="step-text">SPEECH-TO-TEXT<br>AssemblyAI, Deepgram, Whisper</span>
    </div>
</div>

AssemblyAI occupies the speech-to-text layer at the base of the stack: the ears. And after what I saw last night, they have the best ears in the business.
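The four layers above compose into a single conversational turn: hear, think, speak, with the orchestrator wiring it together. A minimal sketch, where each stub function stands in for a real vendor at that layer and none of this is any vendor's actual API:

```python
# Minimal sketch of one turn through the voice AI stack. Each function
# is a stub standing in for a real vendor at that layer (AssemblyAI,
# Claude/GPT, ElevenLabs, LiveKit); this is NOT any vendor's actual API.

def speech_to_text(audio: bytes) -> str:       # the ears
    return "what is my account balance"

def llm_respond(transcript: str) -> str:       # the brain
    return f"Let me look that up: '{transcript}'"

def text_to_speech(reply: str) -> bytes:       # the voice
    return reply.encode("utf-8")

def orchestrate(audio: bytes) -> bytes:        # the conductor
    """One conversational turn: hear -> think -> speak."""
    return text_to_speech(llm_respond(speech_to_text(audio)))

audio_out = orchestrate(b"\x00\x01")
print(audio_out.decode("utf-8"))
```

The modularity is the point: because each layer exposes text at its boundary, you can swap best-of-breed vendors at any layer without touching the others.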

<blockquote class="pull-quote">The core challenge is latency. Sequential processing creates cumulative delays that turn natural conversation into awkward exchanges.<br><cite>— Dylan Fox, CEO, AssemblyAI</cite></blockquote>

The 200ms human conversational pause is the hard constraint that defines every strategy in this market. AssemblyAI's 307ms median latency, with transcripts that never revise, puts them right at the edge of what feels natural.
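Fox's "cumulative delays" point is just arithmetic: in a sequential pipeline, each layer's latency adds to the total before the user hears anything. A back-of-envelope budget, using the STT figure from the table above and assumed (illustrative, not measured) numbers for the other stages:

```python
# Back-of-envelope latency budget for one sequential conversational
# turn. Only the STT number comes from the post; the LLM and TTS
# figures are assumptions for illustration.

def turn_latency_ms(stages: dict) -> int:
    """Sequential stages add up -- the cumulative delay Fox describes."""
    return sum(stages.values())

budget = {
    "speech_to_text": 307,    # AssemblyAI median, from the table above
    "llm_first_token": 250,   # assumed
    "tts_first_audio": 100,   # assumed
}
total = turn_latency_ms(budget)
print(total)  # 657 -- more than 3x the ~200ms pause humans expect
```

This is why every serious player streams and overlaps stages rather than running them strictly in sequence: the only way under the 200ms bar is to start speaking before you've finished thinking.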

---

### The Competitive Landscape (Honest Assessment)

<p class="section-summary">The market is competitive and evolving fast. AssemblyAI's edge is the bundled intelligence platform.</p>

I want to be balanced here. AssemblyAI is impressive, but the market is competitive and evolving fast:

<table class="data-table">
    <thead>
        <tr>
            <th>Player</th>
            <th>Strength</th>
            <th>Valuation / Funding</th>
        </tr>
    </thead>
    <tbody>
        <tr><td><strong>AssemblyAI</strong></td><td>Best accuracy + speech intelligence platform</td><td>$300M val, $115M raised</td></tr>
        <tr><td><strong>Deepgram</strong></td><td>Fastest latency, medical model, on-prem</td><td>$1.3B val, $130M raised (Jan 2026)</td></tr>
        <tr><td><strong>ElevenLabs</strong></td><td>TTS leader, diversifying into full stack</td><td>$11B val</td></tr>
        <tr><td><strong>OpenAI</strong></td><td>GPT-4o-transcribe (2.46% WER), ecosystem</td><td>—</td></tr>
        <tr><td><strong>Retell AI</strong></td><td>Strongest execution ($40M+ ARR, 300% QoQ)</td><td>~$750M val</td></tr>
        <tr><td><strong>Cartesia</strong></td><td>Novel SSM architecture (40ms TTFA)</td><td>Emerging</td></tr>
        <tr><td><strong>Vapi</strong></td><td>Orchestration layer, modular approach</td><td>$130M val</td></tr>
    </tbody>
</table>

Open-source models (Canary Qwen 2.5B at 5.63% WER, IBM Granite Speech at 5.85%) are closing the accuracy gap. Meta released an open-source ASR supporting 1,600+ languages. The base transcription layer is commoditizing — the defensible value is shifting to *what you do with the transcript*: sentiment analysis, PII redaction, speaker diarization, summarization. That's where AssemblyAI's bundled intelligence layer is unmatched.

---

### What I'm Going to Do Next

I'm going to build a demo. After seeing what Universal-3 Pro can do live, I want to get hands-on with the streaming API and put it through its paces with our own use cases. Their docs claim you can go from sign-up to production in 15 minutes. I intend to test that claim.

The Voice Agent Playground at `assemblyai.com/playground/voice-agent` has pre-built scenarios (customer support, medical appointment, drive-through) that let you test model switching between Universal-3 Pro, Universal-Streaming, and Whisper. That's my starting point.

If you're building anything that involves voice — agents, transcription, accessibility, search — this is the moment to pay attention. The technology just made a visible jump. What I saw last night was a little bit of magic.

---

### Key Takeaways

<table class="data-table">
    <thead>
        <tr>
            <th>Takeaway</th>
            <th>Why It Matters</th>
        </tr>
    </thead>
    <tbody>
        <tr><td><strong>Reliability threshold crossed</strong></td><td>Real-time transcription of medical, scientific, and code vocabulary is now production-ready</td></tr>
        <tr><td><strong>Speech Language Models</strong></td><td>Fusing audio encoders with LLMs enables semantic understanding, not just phonetic matching</td></tr>
        <tr><td><strong>Universal-3 Pro</strong></td><td>3.3% WER, industry's first promptable speech interface — second only to OpenAI's batch-only model</td></tr>
        <tr><td><strong>The 200ms latency wall</strong></td><td>The physics constraint defining every strategy. AssemblyAI's 307ms median is in the conversation zone</td></tr>
        <tr><td><strong>The market is real</strong></td><td>$47.5B projected by 2034, 87.5% of builders actively constructing voice agents</td></tr>
        <tr><td><strong>Modular stack</strong></td><td>Ears (STT), brain (LLM), voice (TTS), conductor (orchestration) — best-of-breed at each layer</td></tr>
        <tr><td><strong>Fierce competition</strong></td><td>Deepgram, OpenAI, and open-source are strong. AssemblyAI's edge is the bundled intelligence platform</td></tr>
    </tbody>
</table>

---

## Navigation

- [Home](/)
- [About](/about.html)
- [Projects](/projects.html)
- [Contact](/contact.html)
- [/dev/thoughts](/dev-thoughts/)

*Copyright 2026 Alex Moening. Opinions expressed are my own.*
