Miso TTS 8B

Model Introduction

Miso TTS 8B is a text-to-speech model based on the Sesame CSM architecture. It generates Mimi audio codes from text and optional audio context, using a large Llama 3.2-style backbone and a smaller autoregressive audio decoder.

The model is designed for high-quality conversational speech generation and voice continuation from prompt audio. This repository contains the inference code, model definition, and setup instructions for running Miso TTS locally.

Quickstart

To run the model, use the inference code in our public repository, or try the demo at misolabs.ai.

Model Summary

| Item | Value |
| --- | --- |
| Model | Miso TTS 8B |
| Organization | Miso Labs |
| Task | Text-to-speech |
| Architecture | Sesame-style CSM |
| Backbone | `llama-8B` |
| Audio decoder | `llama-300M` |
| Text vocabulary | `128,256` |
| Audio vocabulary | `2,051` |
| Audio codebooks | `32` |
| Audio tokenizer | Mimi |
| Max sequence length | `2,048` |
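
For readers who want these values in code, the summary can be captured in a small configuration object. This is only a sketch: the class name `MisoTTSConfig` and its field names are hypothetical and are not taken from the repository. Only the values come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MisoTTSConfig:
    """Hypothetical container for the values in the Model Summary table.

    The class and field names are illustrative assumptions, not the
    repository's actual API. Only the default values come from the table.
    """

    backbone_flavor: str = "llama-8B"
    decoder_flavor: str = "llama-300M"
    text_vocab_size: int = 128_256
    audio_vocab_size: int = 2_051
    audio_num_codebooks: int = 32
    max_seq_len: int = 2_048


config = MisoTTSConfig()

# Each audio frame carries one code per codebook.
codes_per_frame = config.audio_num_codebooks

# ASSUMPTION: if every codebook owned a separate block of rows in one shared
# audio embedding table, the table would need this many rows. The model card
# does not state how the embeddings are actually laid out.
audio_embedding_rows = config.audio_vocab_size * config.audio_num_codebooks
```

Freezing the dataclass keeps the values read-only, which suits constants that describe a released checkpoint.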

Architecture

Miso TTS 8B uses two transformer components:

- A large backbone transformer that consumes text/audio-frame embeddings.
- A smaller decoder transformer that autoregressively predicts higher-order audio codebooks within each frame.

Codebook 0 is predicted from the backbone's hidden state. Codebooks 1 through 31 are then predicted one at a time by the audio decoder, each conditioned on the codebooks already generated for that frame.
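
The control flow for generating a single frame can be sketched as follows. This is a toy illustration, not the repository's code: `backbone_step`, `sample_codebook_zero`, and `decoder_step` are hypothetical stand-ins that return random values in place of real transformer outputs, and only the loop structure reflects the description above.

```python
import random

NUM_CODEBOOKS = 32   # from the Model Summary table
AUDIO_VOCAB = 2051   # from the Model Summary table


def backbone_step(context, rng):
    """Hypothetical stand-in for the backbone: returns a fake hidden state."""
    return [rng.random() for _ in range(8)]


def sample_codebook_zero(hidden, rng):
    """Hypothetical stand-in for the head that predicts codebook 0."""
    return rng.randrange(AUDIO_VOCAB)


def decoder_step(hidden, previous_codes, rng):
    """Hypothetical stand-in for the audio decoder.

    A real decoder would condition on the backbone hidden state and on the
    codes already produced for this frame; here we only draw a random code.
    """
    return rng.randrange(AUDIO_VOCAB)


def generate_frame(context, rng):
    """Produce one frame: codebook 0 from the backbone, the rest from the decoder."""
    hidden = backbone_step(context, rng)
    codes = [sample_codebook_zero(hidden, rng)]
    # Codebooks 1..31 are generated one at a time along the depth axis.
    for _depth in range(1, NUM_CODEBOOKS):
        codes.append(decoder_step(hidden, codes, rng))
    return codes


frame = generate_frame(context=[], rng=random.Random(0))
```

In the real model, the resulting frame of 32 codes would go to the Mimi tokenizer for decoding into audio and be fed back to the backbone as context for the next frame.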