Miso TTS 8B

Model Introduction

Miso TTS 8B is a text-to-speech model based on the Sesame CSM architecture. It generates Mimi audio codes from text and optional audio context, using a large Llama 3.2-style backbone and a smaller autoregressive audio decoder.

The model is designed for high-quality conversational speech generation and voice continuation from prompt audio. This repository contains the inference code, model definition, and setup instructions for running Miso TTS locally.


Quickstart

To run the model, use the inference code in our public repository, or try the demo at misolabs.ai.

Model Summary

Item                  Value
Model                 Miso TTS 8B
Organization          Miso Labs
Task                  Text-to-speech
Architecture          Sesame-style CSM
Backbone              llama-8B
Audio decoder         llama-300M
Text vocabulary       128,256
Audio vocabulary      2,051
Audio codebooks       32
Audio tokenizer       Mimi
Max sequence length   2,048
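
As a rough illustration of what these numbers imply, the sketch below collects the table values in a small config object and derives two quantities from them. The class and function names are hypothetical and are not part of the Miso TTS codebase. The 12.5 Hz frame rate is Mimi's published rate and is assumed to apply unchanged here. The audio-length bound further assumes that each audio frame occupies exactly one backbone position, as in the Sesame CSM reference design.

```python
from dataclasses import dataclass

# Mimi's published frame rate; assumed (not confirmed) to apply to Miso TTS.
MIMI_FRAME_RATE_HZ = 12.5


@dataclass(frozen=True)
class MisoTTSConfig:
    """Hypothetical container for the values in the Model Summary table."""
    text_vocab_size: int = 128_256
    audio_vocab_size: int = 2_051
    audio_num_codebooks: int = 32
    max_seq_len: int = 2_048


def codes_per_second(cfg: MisoTTSConfig,
                     frame_rate_hz: float = MIMI_FRAME_RATE_HZ) -> float:
    """Number of audio codes produced per second of speech."""
    return cfg.audio_num_codebooks * frame_rate_hz


def max_audio_seconds(cfg: MisoTTSConfig, text_positions: int,
                      frame_rate_hz: float = MIMI_FRAME_RATE_HZ) -> float:
    """Upper bound on audio that fits in the context window.

    Assumes one backbone position per audio frame and one per text token.
    """
    audio_positions = cfg.max_seq_len - text_positions
    if audio_positions < 0:
        raise ValueError("text alone exceeds the maximum sequence length")
    return audio_positions / frame_rate_hz


cfg = MisoTTSConfig()
print(codes_per_second(cfg))                       # 400.0
print(max_audio_seconds(cfg, text_positions=48))   # 160.0
```

Under these assumptions, the window holds at most a few minutes of combined prompt and generated audio, so long prompt clips leave less room for generation.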

Architecture

Miso TTS 8B uses two transformer components:

  • A large backbone transformer that consumes text/audio-frame embeddings.
  • A smaller decoder transformer that autoregressively predicts higher-order audio codebooks within each frame.
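
One way to picture the text/audio-frame input is the layout used in the open-source Sesame CSM reference code, where every sequence position is a frame with one slot per audio codebook plus one text slot, and a mask marks which slots are filled. The sketch below follows that layout. It is an assumption that Miso TTS uses the same one, and the helper names are hypothetical.

```python
AUDIO_NUM_CODEBOOKS = 32
FRAME_WIDTH = AUDIO_NUM_CODEBOOKS + 1  # 32 audio slots + 1 trailing text slot


def text_frames(text_token_ids):
    """Build one frame per text token: only the last slot is populated."""
    frames, masks = [], []
    for token_id in text_token_ids:
        frame = [0] * FRAME_WIDTH
        mask = [False] * FRAME_WIDTH
        frame[-1] = token_id
        mask[-1] = True
        frames.append(frame)
        masks.append(mask)
    return frames, masks


def audio_frames(codes):
    """Build one frame per audio step: the 32 codebook slots are populated.

    `codes` is a list of frames, each holding 32 Mimi codebook indices.
    """
    frames, masks = [], []
    for frame_codes in codes:
        if len(frame_codes) != AUDIO_NUM_CODEBOOKS:
            raise ValueError("each audio frame needs 32 codebook indices")
        frames.append(list(frame_codes) + [0])
        masks.append([True] * AUDIO_NUM_CODEBOOKS + [False])
    return frames, masks


# A prompt is the concatenation of text frames and (optional) audio frames.
t_frames, t_masks = text_frames([101, 102, 103])
a_frames, a_masks = audio_frames([[7] * 32, [9] * 32])
sequence = t_frames + a_frames
print(len(sequence), len(sequence[0]))  # 5 33
```

In the reference design, the backbone input at each position is the sum of the embeddings of the masked-in slots, so a text frame contributes one embedding and an audio frame contributes 32.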

Codebook 0 is predicted from the backbone hidden state. Codebooks 1 through 31 are then predicted by the audio decoder, autoregressively along the codebook depth: each codebook is conditioned on the codebooks already predicted for the same frame.
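
The control flow of this two-stage prediction can be sketched as follows. The backbone, the codebook-0 head, and the audio decoder are replaced by random stand-ins, so the output is noise rather than speech. All function names are hypothetical, and only the loop structure is meant to reflect the description above.

```python
import random

AUDIO_VOCAB_SIZE = 2_051
AUDIO_NUM_CODEBOOKS = 32


def backbone_step(history, rng):
    """Stand-in for the backbone: returns a fake hidden state for the next frame."""
    return [rng.random() for _ in range(8)]


def codebook0_head(hidden, rng):
    """Stand-in for the head that predicts codebook 0 from the hidden state."""
    return rng.randrange(AUDIO_VOCAB_SIZE)


def decoder_step(hidden, codes_so_far, rng):
    """Stand-in for the audio decoder: predicts the next codebook in the frame,
    given the backbone hidden state and the codebooks predicted so far."""
    return rng.randrange(AUDIO_VOCAB_SIZE)


def generate_frame(history, rng):
    """Predict one frame: codebook 0 first, then codebooks 1..31 in depth order."""
    hidden = backbone_step(history, rng)
    codes = [codebook0_head(hidden, rng)]
    for _ in range(1, AUDIO_NUM_CODEBOOKS):
        codes.append(decoder_step(hidden, codes, rng))
    return codes


def generate(num_frames, seed=0):
    """Generate frames one after another; each new frame sees all earlier frames."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_frames):
        history.append(generate_frame(history, rng))
    return history


frames = generate(num_frames=3)
print(len(frames), len(frames[0]))  # 3 32
```

The outer loop runs once per audio frame and the inner loop once per codebook, which is why a smaller decoder is attractive: it executes 31 times for every single backbone step.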

