
We’re building a generalist speech model - one system that does speech-to-text, text-to-speech, speech-in/speech-out reasoning, and cross-modal tasks with LLM-level steerability and context engineering. See the demo of emergent capabilities.
Today’s speech stack is fragmented. You need different models and vendors for STT, TTS, voice design, conversational agents, dubbing, and even music. That “Curse of Specialization” creates brittle workflows, poor context carryover, and zero system-prompt steerability. Meanwhile, LLMs proved that one generalist model + in-context learning unlocks entirely new use cases.
What’s missing in current speech models
We’re disciples of Sutton’s Bitter Lesson: performance eventually comes from scaling compute, data, and simple, general methods. We believe speech is where text was in 2019 - constrained by small models, fixed task boundaries, and narrow post-training. The upside is to do for speech what GPT-3/ChatGPT did for text: one model, in-context learning, and steerability.
Audio tokenization (RVQ) and decoding stacks are ripe for redesign - big efficiency wins are still on the table.
Audio is token-hungry. With typical RVQ, 1s of audio ≈ 100–400 tokens. Flattened token streams like Orpheus cap useful context: generating just 90s of audio consumes ~8K tokens. Approaches like CSM-1B preserve more context but still decode 32+ audio tokens per step, throttling inference.
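The context squeeze is easy to see with a back-of-envelope calculation. This is an illustrative sketch using the token rates quoted above, not measurements of any specific model:

```python
# Back-of-envelope: seconds of audio that fit in a context window
# at a given RVQ token rate (illustrative figures from the text).

def audio_seconds_in_context(context_tokens: int, tokens_per_second: int) -> float:
    """Seconds of audio representable within a fixed token budget."""
    return context_tokens / tokens_per_second

# Flattened stream at ~100 tokens/s: an 8K window holds ~82s of audio.
print(audio_seconds_in_context(8_192, 100))  # → 81.92
# Deeper RVQ stacks at ~400 tokens/s: the same window holds only ~20s.
print(audio_seconds_in_context(8_192, 400))  # → 20.48
```

At text-like token rates, the same 8K window would hold tens of minutes of transcript, which is the gap an efficient audio tokenization has to close.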
We’ve removed the long-audio bottleneck, making audio roughly as cheap to train on as text while preserving long-range context. Practically, that means you’ll be able to generate hours of audio in one shot and use speech models with very long interleaved text-and-audio system prompts.
We’ve pretrained speech models from 800M to 4.8B parameters on 2M hours of mixed-domain audio.
Cost Efficiency: As a result of our efficient architecture, our 800M-parameter model took less than $1000 to train. For comparison, Kokoro-82M, despite being 10x smaller, took $1000 to train on 1000x less data.
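A rough sanity check of that comparison, using only the figures stated here (the Kokoro data volume is inferred from the “1000x less data” claim, not a published number):

```python
# Back-of-envelope training cost per hour of audio, from the figures above.
our_cost_usd, our_hours = 1_000, 2_000_000            # <$1000 on 2M hours
kokoro_cost_usd = 1_000
kokoro_hours = 2_000_000 // 1_000                     # inferred: 1000x less data

cost_per_hour_ours = our_cost_usd / our_hours         # $/audio-hour
cost_per_hour_kokoro = kokoro_cost_usd / kokoro_hours

print(cost_per_hour_ours)    # → 0.0005  ($0.0005 per audio-hour)
print(cost_per_hour_kokoro)  # → 0.5     ($0.50 per audio-hour)
```

On these assumptions, the per-hour training cost differs by roughly three orders of magnitude.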
Our larger base models already show visible signs of emergent behaviour. We describe some of these behaviours here, but the full extent of these emergent capabilities is still under investigation.
Please check out the audio samples and the comparison with ElevenLabs V3 on Notion.
Disfluency & repetition handling
Text: “This is a a sentence we want our speech model to, to speak.”
Contextual identity/accents
“I am a software engineer living in Bangalore.” → natural Indian English
“I just moved to Shanghai for a new role.” → adapts toward Chinese-English prosody
Note that there is no hardcoded voice switch; this behaviour emerges from context.
Prosodic context awareness without explicit tags
“I said we could try only once.”
“I said we could ONLY try once.”
“I said we could ONLY try ONCE?”
Stress and intonation match intent without explicit tags like <laugh>, <gasp>, or <surprised>.
Voice diversity
These base models can mimic a wide range of speakers/accents beyond a fixed dropdown of voices.
What we’re building
A single speech generalist you can steer like an LLM:
We’re solving this by
You can reach out to us on
Email: founders@kalpalabs.ai
X: Prashant | Gautam | KalpaLabs
LinkedIn: Prashant | Gautam | KalpaLabs