Video Understanding API
Most video APIs detect objects or transcribe speech. VidContext actually understands what is happening in your video — scene by scene, with full context, narrative awareness, and audio-visual synthesis. One API call. Structured output. Ready in about 50 seconds.
What is a video understanding API?
A video understanding API watches a video the way a human analyst would — then gives you a structured report of everything it observed. Not just a list of detected objects or a raw transcript, but a genuine comprehension of what the video communicates.
VidContext processes every frame at high resolution, listens to every second of audio, reads every piece of on-screen text, and synthesizes all of it into a single coherent analysis. The result is a JSON document that captures the full meaning of the video — usable by AI agents, automation workflows, databases, or human reviewers.
What the API captures
Scene-level comprehension
Every scene described with full context — setting, subjects, actions, mood, and visual composition. Not just 'outdoor scene' but what is actually happening.
Narrative flow tracking
Understands how scenes connect. Identifies topic transitions, story arcs, and thematic structure across the entire video timeline.
Audio-visual synthesis
Combines speech, music, sound effects, and visual content into a single coherent analysis. Knows when a voiceover describes what is on screen.
On-screen text extraction
Reads every piece of text that appears — titles, captions, prices, URLs, watermarks, lower thirds. Mapped to the scene where each appears.
Brand and entity detection
Identifies brands, logos, products, and named entities visible in the video. Tracks which brands appear in which scenes.
Structured JSON output
Everything returned as clean, machine-readable JSON. Feed it directly into databases, AI agents, dashboards, or automation workflows.
Single API call
Send a video. Get back structured understanding. No pipeline to build, no models to manage, no frames to extract yourself.
curl -X POST https://api.vidcontext.com/v1/analyze \
  -H "X-API-Key: vc_your_key_here" \
  -F "file=@video.mp4" \
  -F "output_format=context"
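The same call from Python, as a minimal sketch assuming the third-party `requests` library is installed; the endpoint, header, and form fields match the curl example above:

```python
# Sketch of the upload call from Python, mirroring the curl example.
# Assumes the third-party `requests` library (pip install requests).

API_URL = "https://api.vidcontext.com/v1/analyze"

def build_request(api_key: str, output_format: str = "context"):
    """Assemble the header and form fields the analyze endpoint expects."""
    headers = {"X-API-Key": api_key}
    data = {"output_format": output_format}
    return headers, data

def analyze_video(path: str, api_key: str) -> dict:
    """Upload a video file and return the parsed JSON analysis."""
    import requests  # deferred so the helper above stays dependency-free
    headers, data = build_request(api_key)
    with open(path, "rb") as f:
        resp = requests.post(API_URL, headers=headers,
                             files={"file": f}, data=data)
    resp.raise_for_status()
    return resp.json()
```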
{
"visual_scenes": [
{
"timestamp": "0:00-0:12",
"description": "Wide aerial shot of a coastal city at sunrise.
Camera slowly pushes in toward a glass office building.
Warm golden light reflects off the water. No people visible.
Establishes a calm, professional tone.",
"on_screen_text": ["ACME Corp — Building Tomorrow"],
"audio": "Ambient piano music, no speech"
},
{
"timestamp": "0:12-0:31",
"description": "Cut to interior office. A woman in business
attire presents at a whiteboard to a small team. She
gestures at a product roadmap diagram. The team is
engaged — nodding, taking notes. Professional but relaxed
atmosphere.",
"on_screen_text": ["Q3 Product Roadmap", "Launch: Sept 15"],
"audio": "Speaker: 'Our timeline puts us ahead of the
market by six weeks...'"
}
],
"transcript": "Our timeline puts us ahead of the market...",
"brands_detected": ["ACME Corp"],
"overall_mood": "confident, forward-looking, corporate"
}

Use cases
AI agent workflows
Give LLM-based agents the ability to watch and reason about video content. Feed the structured output directly into Claude, GPT, or any agent framework.
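As a sketch of that workflow, the structured response can be flattened into plain text for an agent's context window. The field names follow the sample response above; the helper name is illustrative:

```python
def scenes_to_briefing(analysis: dict) -> str:
    """Flatten VidContext's structured output into a plain-text briefing
    an LLM agent can reason over (field names as in the sample response)."""
    lines = []
    for scene in analysis.get("visual_scenes", []):
        lines.append(f"[{scene['timestamp']}] {scene['description']}")
        if scene.get("on_screen_text"):
            lines.append("  On-screen text: " + "; ".join(scene["on_screen_text"]))
        if scene.get("audio"):
            lines.append("  Audio: " + scene["audio"])
    if analysis.get("overall_mood"):
        lines.append("Overall mood: " + analysis["overall_mood"])
    return "\n".join(lines)
```

The resulting string can be dropped straight into a system prompt or tool result, so the agent reasons over the video the same way it reasons over any other text.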
Content moderation at scale
Understand what videos actually communicate — not just flag individual frames. Detect misleading claims, brand safety issues, or policy violations in context.
Marketing and ad analysis
Analyze competitor ads, review your own creative, and score video content across engagement, clarity, and brand alignment using VidContext's 8 analysis modes.
Accessibility and compliance
Generate detailed video descriptions for accessibility. Extract all on-screen text for compliance review. Create searchable archives of video content.
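For example, a small helper (illustrative, and assuming the response shape shown earlier) can map each scene's timestamp to the text that appears in it, ready for a compliance checklist or a search index:

```python
def extract_on_screen_text(analysis: dict) -> dict[str, list[str]]:
    """Map each scene's timestamp to the on-screen text found in it,
    using the visual_scenes structure from the sample response."""
    return {
        scene["timestamp"]: scene.get("on_screen_text", [])
        for scene in analysis.get("visual_scenes", [])
    }
```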
Frequently asked questions
What does 'video understanding' mean?
Video understanding goes beyond simple detection. Instead of labeling individual objects or transcribing speech in isolation, VidContext watches the entire video and describes what is happening — who is speaking, what the setting conveys, how scenes connect, and what the overall narrative communicates. The output reads like a detailed briefing, not a list of tags.
How is this different from object detection or video labeling?
Object detection tells you 'there is a car.' Video understanding tells you 'a red sedan pulls into a dealership lot while a voiceover explains the current lease offer, and a price graphic appears in the lower third.' VidContext captures the relationship between visual elements, audio, text, and narrative flow — not just isolated labels.
Can VidContext understand context and narrative across scenes?
Yes. The API processes video at 2 frames per second and tracks how scenes connect. It identifies narrative arcs, topic shifts, visual callbacks, and thematic through-lines. Product demos, tutorials, and story-driven ads are all analyzed with full awareness of how each scene builds on the last.
Does it work with non-English video?
VidContext transcribes and analyzes audio in any language supported by the underlying model. On-screen text extraction works across scripts and character sets. The structured output is returned in English regardless of the source language.
How does the AI model work under the hood?
VidContext uses Gemini 3.1 Pro at high resolution, sampling 2 frames per second. Every frame is analyzed for visual content, on-screen text, and scene composition. Audio is processed for speech, music, and sound effects. The model then synthesizes all of this into a coherent understanding of the full video. It's more than a transcript — it gives your AI agent eyes.
Ready to start?
5 free analyses without an account. 20 credits on signup. No credit card required.
Try VidContext free