Miso TTS 8B
Model Introduction
Miso TTS 8B is a text-to-speech model based on the Sesame CSM architecture. It generates Mimi audio codes from text and optional audio context, using a large Llama 3.2-style backbone and a smaller autoregressive audio decoder.
The model is designed for high-quality conversational speech generation and voice continuation from prompt audio. This repository contains the inference code, model definition, and setup instructions for running Miso TTS locally.
Quickstart
To run the model, use the inference code at our public repository, or try our demo at misolabs.ai.
Model Summary
| Item | Value |
|---|---|
| Model | Miso TTS 8B |
| Organization | Miso Labs |
| Task | Text-to-speech |
| Architecture | Sesame-style CSM |
| Backbone | llama-8B |
| Audio decoder | llama-300M |
| Text vocabulary | 128,256 |
| Audio vocabulary | 2,051 |
| Audio codebooks | 32 |
| Audio tokenizer | Mimi |
| Max sequence length | 2,048 |
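As a rough illustration, the values in the table can be collected into a small configuration object. This is a minimal sketch: the class name `MisoTTSConfig` and all field names are hypothetical and are not taken from the repository. The duration estimate additionally assumes Mimi's published 12.5 Hz frame rate and that each sequence position holds at most one audio frame; neither is stated above.

```python
from dataclasses import dataclass

# Assumption: Mimi's published frame rate. The table above does not state
# which rate Miso TTS uses.
ASSUMED_MIMI_FRAME_RATE_HZ = 12.5


@dataclass(frozen=True)
class MisoTTSConfig:
    """Hypothetical container for the values listed in the model summary."""
    backbone: str = "llama-8B"
    audio_decoder: str = "llama-300M"
    text_vocab_size: int = 128_256
    audio_vocab_size: int = 2_051
    audio_num_codebooks: int = 32
    max_seq_len: int = 2_048

    def max_audio_seconds(self, frame_rate_hz: float = ASSUMED_MIMI_FRAME_RATE_HZ) -> float:
        # Upper bound only: assumes every position in the sequence is an
        # audio frame, which ignores any text tokens in the context.
        return self.max_seq_len / frame_rate_hz


config = MisoTTSConfig()
upper_bound_seconds = config.max_audio_seconds()
```

Under those assumptions, the bound is the maximum sequence length divided by the frame rate; text tokens and prompt audio in the context reduce the audio that can actually be generated.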
Architecture
Miso TTS 8B uses two transformer components:
- A large backbone transformer that consumes text and audio-frame embeddings.
- A smaller decoder transformer that autoregressively predicts higher-order audio codebooks within each frame.
For each audio frame, codebook 0 is predicted from the backbone hidden state. Codebooks 1 through 31 are then predicted by the audio decoder one at a time, in order of codebook depth, each conditioned on the codebooks already predicted for that frame.
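The per-frame split between the two components can be sketched in plain Python. This is an illustrative sketch only, not the repository's implementation: `toy_codebook0_head` and `toy_decoder_step` are hypothetical stand-ins for the real networks, and greedy argmax replaces whatever sampling strategy the model actually uses.

```python
NUM_CODEBOOKS = 32   # "Audio codebooks" in the model summary
AUDIO_VOCAB = 2051   # "Audio vocabulary" in the model summary


def argmax(logits):
    """Index of the largest logit (greedy decoding, chosen for simplicity)."""
    return max(range(len(logits)), key=logits.__getitem__)


def one_hot_logits(index):
    """Toy logits that peak at a single audio code."""
    logits = [0.0] * AUDIO_VOCAB
    logits[index % AUDIO_VOCAB] = 1.0
    return logits


def toy_codebook0_head(hidden):
    # Hypothetical stand-in for the head that maps the backbone hidden
    # state to logits over codebook 0.
    return one_hot_logits(int(sum(hidden)))


def toy_decoder_step(hidden, prev_codes, depth):
    # Hypothetical stand-in for one audio-decoder step: it sees the backbone
    # hidden state and the codes already chosen for this frame.
    return one_hot_logits(prev_codes[-1] + depth)


def generate_frame(hidden, codebook0_head, decoder_step):
    """Produce one frame of NUM_CODEBOOKS codes from one backbone hidden state."""
    # Codebook 0 comes directly from the backbone hidden state.
    codes = [argmax(codebook0_head(hidden))]
    # Codebooks 1..31 are predicted one at a time, in depth order,
    # each step conditioned on the codes predicted so far.
    for depth in range(1, NUM_CODEBOOKS):
        codes.append(argmax(decoder_step(hidden, codes, depth)))
    return codes


frame = generate_frame([1.0, 2.0, 4.0], toy_codebook0_head, toy_decoder_step)
```

The backbone runs once per frame while the decoder loop runs 31 times, so the smaller decoder carries most of the per-frame step count.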
Links
- Website: misolabs.ai
- Hugging Face: MisoLabs/MisoTTS
- GitHub: MisoLabsAI
- X: @MisoLabsAI