Solutions
·Vision AI
Vision-language models that read, analyze, and reason about visual content. Qwen3-VL, Llama Vision, and Nemotron available via inference API -- plus bare-metal instances for custom training.
Capabilities
Ask questions about images and get detailed, reasoned answers. Describe scenes, identify objects, interpret charts, and understand spatial relationships.
Extract structured data from invoices, receipts, forms, and contracts. OCR with semantic understanding -- not just text extraction, but comprehension.
Analyze video content frame-by-frame or holistically. Summarize meetings, extract key moments, and answer questions about video sequences.
Read text from images, screenshots, handwritten notes, and scanned documents. Multilingual OCR with context-aware formatting preservation.
Combine visual and textual inputs for complex tasks. Code from screenshots, math from diagrams, data extraction from charts -- all via the same API.
Need specialized detection or classification? Deploy bare-metal GPU instances for fine-tuning vision models on your own datasets with full root access.
Models
Frontier VLMs available through the inference API. For custom training, use bare-metal instances.
Qwen3-VL-235BVision-languageBest-in-class visual reasoning
Llama 3.2 90B VisionVision-languageComplex visual QA
Llama 3.2 11B VisionVision-languageFast, efficient vision tasks
Nemotron Nano 12B VLVision-languageLightweight multimodalHow It Works
Use the inference API for instant access to Qwen3-VL, Llama Vision, and Nemotron. Or deploy a bare-metal instance for custom model training.
Pass images, screenshots, documents, or video frames alongside text prompts. The model sees and reasons about your visual content.
Process documents in bulk, analyze video feeds, or integrate visual understanding into your product. Pay per token via the inference API.