Voice & Audio
Text-to-speech and speech-to-text models via the Runcrate inference API. Real-time streaming, batch transcription, multilingual support, and per-token pricing -- no infrastructure to manage.
Capabilities
Generate natural, expressive speech from text with low latency. Stream audio token-by-token for conversational interfaces and voice assistants.
Transcribe audio files or live streams with high accuracy. Support for long-form content, meetings, podcasts, and call recordings.
WebSocket and SSE endpoints for real-time audio streaming. Send audio in, get text out -- or send text in, get audio out -- with minimal latency.
Transcribe thousands of audio files or generate hours of speech in bulk. Queue jobs via API and retrieve results asynchronously.
TTS and ASR models that handle dozens of languages natively. Build global products without separate pipelines per locale.
Chain TTS and ASR with language models to build voice agents, automated dubbing, and audio summarization workflows. All via API.
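The bulk workflow described above (queue jobs via API, retrieve results asynchronously) can be sketched as a submit-then-poll loop. This is a minimal illustration, not the Runcrate client: the `submit` and `fetch` callables, the job dictionary shape (`status`, `text`), and the status values `done` and `failed` are all assumptions made for the example, so swap in whatever the real API returns.

```python
import time
from typing import Callable, Dict, List


def run_batch(
    files: List[str],
    submit: Callable[[str], str],   # hypothetical: queues a file, returns a job id
    fetch: Callable[[str], Dict],   # hypothetical: returns {"status": ..., "text": ...}
    poll_interval: float = 1.0,
    max_polls: int = 60,
) -> Dict[str, str]:
    """Queue one transcription job per file, then poll until all finish."""
    # Queue every job up front so the service can work on them concurrently.
    pending = {submit(path): path for path in files}
    results: Dict[str, str] = {}

    for _ in range(max_polls):
        # Iterate over a copy because finished jobs are removed from `pending`.
        for job_id in list(pending):
            job = fetch(job_id)
            if job["status"] == "done":
                results[pending.pop(job_id)] = job["text"]
            elif job["status"] == "failed":
                raise RuntimeError(f"job {job_id} failed for {pending[job_id]}")
        if not pending:
            return results
        time.sleep(poll_interval)

    raise TimeoutError(f"{len(pending)} job(s) still pending")
```

Passing `submit` and `fetch` in as functions keeps the loop independent of any particular HTTP client, and the `max_polls` cap avoids waiting forever on a stuck job.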
Available Models
Voice synthesis, speech recognition, and audio processing — all available through the inference API with per-token pricing.
Qwen3-TTS: TTS · 10 languages. Voice cloning, 97ms streaming.
Orpheus 3B: TTS · Speech-LLM. Empathetic, human-level speech.
Kokoro 82M: TTS · Ultra-efficient. High quality at minimal cost.
Whisper Large V3: ASR · Multilingual. Transcription and translation.
Voxtral Small: ASR · Mistral. Audio understanding.
How It Works
Select text-to-speech for voice synthesis or speech-to-text for transcription. Pick the model that fits your language, quality, and latency requirements.
Send text or audio to the inference endpoint. Stream results in real-time via WebSocket or get batch results asynchronously.
Chain audio models with language models to build voice agents, dubbing systems, or transcription services. Pay per token, scale on demand.
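As a rough sketch of the chaining step, one turn of a voice agent is speech-to-text, then a language model, then text-to-speech. The class and its three callables below are hypothetical placeholders rather than part of any Runcrate SDK; each would wrap a call to whichever ASR, language, and TTS model you selected.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceAgent:
    """One conversational turn: audio in, audio out (illustrative only)."""

    transcribe: Callable[[bytes], str]   # ASR model: audio in, text out
    respond: Callable[[str], str]        # language model: text in, text out
    synthesize: Callable[[str], bytes]   # TTS model: text in, audio out

    def turn(self, audio_in: bytes) -> bytes:
        user_text = self.transcribe(audio_in)
        # Skip the language model and TTS calls when nothing was recognized.
        if not user_text.strip():
            return b""
        reply = self.respond(user_text)
        return self.synthesize(reply)
```

Because each stage is an ordinary function, the same structure covers dubbing (replace `respond` with a translation call) or audio summarization (replace it with a summarizer and drop the synthesis step).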