
Voice & Audio

Synthesize speech.
Transcribe everything.

Text-to-speech and speech-to-text models via the Runcrate inference API: real-time streaming, batch transcription, multilingual support, and per-token pricing, with no infrastructure to manage.

TTS: Voice synthesis
ASR: Speech-to-text
Streaming: Real-time audio

Capabilities

Full audio pipeline, one API.

Real-time voice synthesis

Generate natural, expressive speech from text with low latency. Stream audio token-by-token for conversational interfaces and voice assistants.
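When TTS audio is streamed, the client receives it as a sequence of small chunks and can start playback or saving before synthesis finishes. The sketch below collects streamed chunks into a WAV file using only Python's standard `wave` module. It assumes (hypothetically) that the endpoint streams raw 16-bit mono PCM at 24 kHz; the actual output format depends on the model and request settings.

```python
import io
import wave
from typing import Iterable

def chunks_to_wav(chunks: Iterable[bytes], sample_rate: int = 24_000) -> bytes:
    """Collect streamed 16-bit mono PCM chunks into an in-memory WAV file.

    Assumption (hypothetical): the TTS stream delivers raw little-endian
    16-bit mono PCM at `sample_rate` Hz, with no per-chunk headers.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        for chunk in chunks:         # write each chunk as soon as it arrives
            wav.writeframes(chunk)
    # wave does not close a file object it did not open, so buf is still readable
    return buf.getvalue()
```

In a real client, `chunks` would be the iterator over the streaming response body; the same loop could instead feed an audio output device for live playback.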

Speech-to-text transcription

Transcribe audio files or live streams with high accuracy. Long-form content such as meetings, podcasts, and call recordings is supported.

Streaming support

WebSocket and SSE endpoints for real-time audio streaming. Send audio in and get text out, or send text in and get audio out, with minimal latency.
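For streaming speech-to-text, the client typically cuts live audio into short, fixed-duration frames and sends each one as a separate message. The sketch below shows that framing step. The 16 kHz, 16-bit mono PCM format and the 20 ms frame size are assumptions for illustration, not documented Runcrate requirements; check the endpoint's expected input format before relying on them.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # assumed input sample rate (hypothetical)
BYTES_PER_SAMPLE = 2      # 16-bit PCM
FRAME_MS = 20             # assumed duration of audio per message (hypothetical)

def pcm_frames(pcm: bytes, frame_ms: int = FRAME_MS,
               sample_rate: int = SAMPLE_RATE_HZ) -> Iterator[bytes]:
    """Yield consecutive frames of `frame_ms` milliseconds of mono 16-bit PCM.

    The final frame may be shorter if the audio does not divide evenly.
    """
    frame_bytes = sample_rate * BYTES_PER_SAMPLE * frame_ms // 1000
    if frame_bytes <= 0:
        raise ValueError("frame_ms and sample_rate must produce a non-empty frame")
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]
```

At 16 kHz and 16 bits, 20 ms works out to 16,000 × 2 × 0.02 = 640 bytes per frame, so one second of audio becomes 50 frames. Each frame would then be sent as one binary WebSocket message.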

Batch processing

Transcribe thousands of audio files or generate hours of speech in bulk. Queue jobs via API and retrieve results asynchronously.
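Asynchronous batch work usually follows a submit-then-poll pattern: queue every job, then check each one until it completes. The sketch below implements that loop with the API calls passed in as functions. `submit`, `get_status`, and the `"completed"`/`"failed"` state names are hypothetical placeholders, not Runcrate's actual API.

```python
import time
from typing import Callable, Dict, List

def run_batch(files: List[str],
              submit: Callable[[str], str],
              get_status: Callable[[str], Dict],
              poll_interval_s: float = 2.0,
              timeout_s: float = 600.0) -> Dict[str, Dict]:
    """Submit every file, then poll until each job finishes or the timeout expires.

    `submit` returns a job ID; `get_status` returns a dict with a "state" key.
    Both are hypothetical stand-ins for real API calls.
    """
    job_to_file = {submit(path): path for path in files}  # job_id -> file
    results: Dict[str, Dict] = {}
    pending = set(job_to_file)
    deadline = time.monotonic() + timeout_s
    while pending:
        for job_id in list(pending):
            status = get_status(job_id)
            if status.get("state") in ("completed", "failed"):
                results[job_to_file[job_id]] = status  # keyed by original file
                pending.discard(job_id)
        if pending:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{len(pending)} job(s) still pending")
            time.sleep(poll_interval_s)
    return results
```

Passing the API calls in as functions keeps the polling logic independent of any particular HTTP client and makes it easy to test with fakes.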

Multilingual support

TTS and ASR models that handle dozens of languages natively. Build global products without separate pipelines per locale.

Audio processing pipelines

Chain TTS and ASR with language models to build voice agents, automated dubbing, and audio summarization workflows. All via API.
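A voice agent turn is a composition of three calls: ASR turns speech into text, a language model produces a reply, and TTS turns the reply back into speech. The sketch below expresses that chain with each model call passed in as a function. The function names and signatures are hypothetical; in practice each would wrap a request to the corresponding inference endpoint.

```python
from typing import Callable

Transcribe = Callable[[bytes], str]   # audio -> text (ASR)
Generate = Callable[[str], str]       # text -> text (language model)
Synthesize = Callable[[str], bytes]   # text -> audio (TTS)

def voice_agent_turn(audio_in: bytes, transcribe: Transcribe,
                     generate: Generate, synthesize: Synthesize) -> bytes:
    """Run one conversational turn: speech in, speech out."""
    user_text = transcribe(audio_in)   # ASR, e.g. a Whisper-class model
    reply_text = generate(user_text)   # language model writes the response
    return synthesize(reply_text)      # TTS renders the reply as audio
```

Swapping `generate` for a translation prompt turns the same chain into a simple dubbing flow; replacing `synthesize` with a summarizer gives audio summarization.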

Available Models

TTS and ASR via one API.

Voice synthesis, speech recognition, and audio processing, all available through the inference API with per-token pricing.

Qwen3-TTS · TTS · 10 languages · Voice cloning, 97 ms streaming
Orpheus 3B · TTS · Speech-LLM · Empathetic, human-level speech
Kokoro 82M · TTS · Ultra-efficient · High quality at minimal cost
Whisper Large V3 · ASR · Multilingual · Transcription and translation
Voxtral Small · ASR · Mistral · Audio understanding

How It Works

Three steps to voice AI.

01

Choose your task

Select text-to-speech for voice synthesis or speech-to-text for transcription. Pick the model that fits your language, quality, and latency requirements.

02

Call the API

Send text or audio to the inference endpoint. Stream results in real time over WebSocket, or retrieve batch results asynchronously.

03

Build your pipeline

Chain audio models with language models to build voice agents, dubbing systems, or transcription services. Pay per token, scale on demand.

Start building with voice on Runcrate.

Synthesize speech or transcribe audio in seconds. No GPU setup, no commitments, no credit card required to explore.

Per-token pricing: No upfront commitments
Streaming: Real-time audio in/out
Cancel anytime: No lock-in, no penalties