hubertsiuzdak/snac_24khz


SNAC šŸæ

Multi-Scale Neural Audio Codec (SNAC) compresses audio into discrete codes at a low bitrate.

šŸ‘‰ This model was primarily trained on speech data, and its recommended use case is speech synthesis. See below for other pretrained models.

šŸ”— GitHub repository: https://github.com/hubertsiuzdak/snac/

Overview

SNAC encodes audio into hierarchical tokens, similarly to SoundStream, EnCodec, and DAC. However, SNAC introduces a simple change: coarse tokens are sampled less frequently, so they cover a broader time span.

This model compresses 24 kHz audio into discrete codes at a 0.98 kbps bitrate. It uses 3 RVQ levels with token rates of 12, 23, and 47 Hz.
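The advertised bitrate follows directly from the token rates. The snippet below is a back-of-the-envelope sketch, assuming each RVQ level uses a 4096-entry codebook (12 bits per token), which is not stated here but is common in this codec family:

import math

# Rough bitrate arithmetic for the 24 kHz model (sketch, not part of the snac API).
token_rates_hz = [12, 23, 47]                # coarse -> fine RVQ levels
bits_per_token = int(math.log2(4096))        # assumed codebook size of 4096 entries
bitrate_bps = sum(token_rates_hz) * bits_per_token
print(bitrate_bps)                           # 984 bps, i.e. ~0.98 kbps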

Pretrained models

Currently, all models support only a single audio channel (mono).

| Model | Bitrate | Sample Rate | Params | Recommended use case |
|---|---|---|---|---|
| hubertsiuzdak/snac_24khz (this model) | 0.98 kbps | 24 kHz | 19.8 M | šŸ—£ļø Speech |
| hubertsiuzdak/snac_32khz | 1.9 kbps | 32 kHz | 54.5 M | šŸŽø Music / Sound Effects |
| hubertsiuzdak/snac_44khz | 2.6 kbps | 44 kHz | 54.5 M | šŸŽø Music / Sound Effects |

Usage

Install it using:

pip install snac

To encode (and decode) audio with SNAC in Python, use the following code:

import torch
from snac import SNAC

model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().cuda()
audio = torch.randn(1, 1, 24000).cuda()  # placeholder input with shape (batch, channels, time)

with torch.inference_mode():
    codes = model.encode(audio)
    audio_hat = model.decode(codes)

You can also encode and reconstruct in a single call:

with torch.inference_mode():
    audio_hat, codes = model(audio)

āš ļø Note that codes is a list of token sequences of variable lengths, each corresponding to a different temporal resolution.

>>> [code.shape[1] for code in codes]
[12, 24, 48]
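
For real audio rather than random noise, the input must be mono and sampled at 24 kHz. The sketch below uses torchaudio for loading and resampling; torchaudio and the file paths are assumptions for illustration, not part of the snac package:

import torch
import torchaudio
from snac import SNAC

model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().cuda()

# Load an arbitrary file (hypothetical path) and convert it to mono 24 kHz.
wav, sr = torchaudio.load("input.wav")               # (channels, T)
wav = wav.mean(dim=0, keepdim=True)                  # downmix to mono
wav = torchaudio.functional.resample(wav, sr, 24000)
audio = wav.unsqueeze(0).cuda()                      # (B, 1, T)

with torch.inference_mode():
    codes = model.encode(audio)
    audio_hat = model.decode(codes)

# Save the reconstruction for listening tests.
torchaudio.save("reconstruction.wav", audio_hat.squeeze(0).cpu(), 24000)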

Acknowledgements

Module definitions are adapted from the Descript Audio Codec.
