FACE DETECTION API
Use vision-language models to detect faces, analyze facial attributes, estimate age, identify emotions, and extract facial landmarks from images. Send an image with a prompt describing what you need, get structured results back. No specialized face detection library required.
QUICK START
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runcrate.ai/v1",
    api_key="rc_live_YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-235B-A22B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
                {"type": "text", "text": "Detect all faces in this image. For each face, provide bounding box coordinates, estimated age, and expression."},
            ],
        }
    ],
)

print(response.choices[0].message.content)

AVAILABLE MODELS
| Model | Provider | Pricing | Notes |
|---|---|---|---|
| Qwen/Qwen3-VL-235B-A22B-Instruct | Alibaba | Per-token | 235B MoE, top-tier visual understanding |
| meta-llama/Llama-3.2-90B-Vision-Instruct | Meta | Per-token | 90B, strong image reasoning |
| Qwen/Qwen3-VL-30B-A3B-Instruct | Alibaba | Per-token | 30B MoE, efficient processing |
WHY RUNCRATE
Use the full power of vision-language models for face detection. Ask for bounding boxes, age estimates, expressions, or any custom attribute in natural language.
Request JSON-formatted results with bounding boxes, confidence scores, and attributes. Parse programmatically for integration into your pipeline.
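As a minimal sketch of that parsing step: models often wrap JSON replies in markdown code fences, so strip those before decoding. The reply shape below (a JSON array with "box", "age", and "expression" keys) is illustrative, matching what the prompt requests, not a guaranteed schema.

```python
import json

def parse_face_response(raw: str) -> list[dict]:
    """Parse a model reply into a list of face records.

    Assumes the prompt asked for a JSON array; strips an optional
    markdown code fence that some models wrap around JSON output.
    """
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence line (and any language tag) plus the closing fence.
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(text)

# Example reply in the shape the prompt requested (illustrative only).
reply = '```json\n[{"box": [120, 80, 260, 240], "age": 34, "expression": "smiling"}]\n```'
faces = parse_face_response(reply)
print(len(faces), faces[0]["expression"])  # 1 smiling
```

Validate the decoded records before use; if the model returns prose instead of JSON, retry with a stricter prompt.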
Skip OpenCV, dlib, and MediaPipe setup. Use the same chat completions API you already use for text, just add an image.
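For local files rather than public URLs, the image can be sent inline as a base64 data URL in the same image_url field. A minimal sketch; the JPEG bytes below are a stub standing in for a real file's contents:

```python
import base64

def image_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL for the image_url content part."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

# With a real file: image_data_url(open("photo.jpg", "rb").read())
url = image_data_url(b"\xff\xd8\xff\xe0")  # tiny JPEG header stub for illustration
message = {
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": url}},
        {"type": "text", "text": "Detect all faces and report bounding boxes as JSON."},
    ],
}
```

The resulting message drops into the messages list of the quick-start request unchanged.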
Beyond detection: ask the model to analyze facial expressions, estimate demographics, count people, or describe the scene context around each face.
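When working with returned bounding boxes, note that coordinate conventions vary by model (absolute pixels vs. a normalized space), so inspect a sample response before assuming one. A sketch of converting a normalized [x1, y1, x2, y2] box to clamped pixel coordinates, assuming a 0..1000 normalized range:

```python
def to_pixel_box(box, img_w, img_h, scale=1000):
    """Convert an [x1, y1, x2, y2] box from a normalized 0..scale space
    to integer pixel coordinates, clamped to the image bounds.

    The 0..1000 scale is an assumption; check your model's actual
    coordinate convention against a sample response first.
    """
    x1, y1, x2, y2 = box

    def px(v, dim):
        # Scale to pixels, then clamp into [0, dim].
        return max(0, min(dim, round(v * dim / scale)))

    return [px(x1, img_w), px(y1, img_h), px(x2, img_w), px(y2, img_h)]

print(to_pixel_box([250, 100, 750, 900], 1920, 1080))  # [480, 108, 1440, 972]
```

Pixel-space boxes can then be used directly for cropping, blurring, or overlay drawing in your imaging library of choice.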
FAQ