Video Understanding API
Most video APIs detect objects or transcribe speech. VidContext actually understands what is happening in your video — scene by scene, with full context, narrative awareness, and audio-visual synthesis. One API call. Structured output. Ready in about 50 seconds.
What is a video understanding API?
A video understanding API watches a video the way a human analyst would — then gives you a structured report of everything it observed. Not just a list of detected objects or a raw transcript, but a genuine comprehension of what the video communicates.
VidContext processes every frame at high resolution, listens to every second of audio, reads every piece of on-screen text, and synthesizes all of it into a single coherent analysis. The result is a JSON document that captures the full meaning of the video — usable by AI agents, automation workflows, databases, or human reviewers.
What the API captures
Scene-level comprehension
Every scene described with full context — setting, subjects, actions, mood, and visual composition. Not just 'outdoor scene' but what is actually happening.
Narrative flow tracking
Understands how scenes connect. Identifies topic transitions, story arcs, and thematic structure across the entire video timeline.
Audio-visual synthesis
Combines speech, music, sound effects, and visual content into a single coherent analysis. Knows when a voiceover describes what is on screen.
On-screen text extraction
Reads every piece of text that appears — titles, captions, prices, URLs, watermarks, lower thirds. Mapped to the scene where each appears.
Brand and entity detection
Identifies brands, logos, products, and named entities visible in the video. Tracks which brands appear in which scenes.
Structured JSON output
Everything returned as clean, machine-readable JSON. Feed it directly into databases, AI agents, dashboards, or automation workflows.
Single API call
Send a video. Get back structured understanding. No pipeline to build, no models to manage, no frames to extract yourself.
curl -X POST https://api.vidcontext.com/v1/analyze \
  -H "X-API-Key: vc_your_key_here" \
  -F "file=@video.mp4" \
  -F "output_format=context"
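The same call from Python, as a minimal sketch assuming the third-party `requests` library is installed; the endpoint, header, and form fields match the curl example above:

```python
# Sketch of the upload call from Python, mirroring the curl example.
# Assumes the third-party `requests` library (pip install requests).

API_URL = "https://api.vidcontext.com/v1/analyze"

def build_request(api_key: str, output_format: str = "context"):
    """Assemble the header and form fields the analyze endpoint expects."""
    headers = {"X-API-Key": api_key}
    data = {"output_format": output_format}
    return headers, data

def analyze_video(path: str, api_key: str) -> dict:
    """Upload a video file and return the parsed JSON analysis."""
    import requests  # deferred so the helper above stays dependency-free
    headers, data = build_request(api_key)
    with open(path, "rb") as f:
        resp = requests.post(API_URL, headers=headers,
                             files={"file": f}, data=data)
    resp.raise_for_status()
    return resp.json()
```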
{
"visual_scenes": [
{
"timestamp": "0:00-0:12",
"description": "Wide aerial shot of a coastal city at sunrise.
Camera slowly pushes in toward a glass office building.
Warm golden light reflects off the water. No people visible.
Establishes a calm, professional tone.",
"on_screen_text": ["ACME Corp — Building Tomorrow"],
"audio": "Ambient piano music, no speech"
},
{
"timestamp": "0:12-0:31",
"description": "Cut to interior office. A woman in business
attire presents at a whiteboard to a small team. She
gestures at a product roadmap diagram. The team is
engaged — nodding, taking notes. Professional but relaxed
atmosphere.",
"on_screen_text": ["Q3 Product Roadmap", "Launch: Sept 15"],
"audio": "Speaker: 'Our timeline puts us ahead of the
market by six weeks...'"
}
],
"transcript": "Our timeline puts us ahead of the market...",
"brands_detected": ["ACME Corp"],
"overall_mood": "confident, forward-looking, corporate"
}

Use cases
AI agent workflows
Give LLM-based agents the ability to watch and reason about video content. Feed the structured output directly into Claude, GPT, or any agent framework.
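As a sketch of that workflow, the structured response can be flattened into plain text for an agent's context window. The field names follow the sample response above; the helper name is illustrative:

```python
def scenes_to_briefing(analysis: dict) -> str:
    """Flatten VidContext's structured output into a plain-text briefing
    an LLM agent can reason over (field names as in the sample response)."""
    lines = []
    for scene in analysis.get("visual_scenes", []):
        lines.append(f"[{scene['timestamp']}] {scene['description']}")
        if scene.get("on_screen_text"):
            lines.append("  On-screen text: " + "; ".join(scene["on_screen_text"]))
        if scene.get("audio"):
            lines.append("  Audio: " + scene["audio"])
    if analysis.get("overall_mood"):
        lines.append("Overall mood: " + analysis["overall_mood"])
    return "\n".join(lines)
```

The resulting string can be dropped straight into a system prompt or tool result, so the agent reasons over the video the same way it reasons over any other text.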
Content moderation at scale
Understand what videos actually communicate — not just flag individual frames. Detect misleading claims, brand safety issues, or policy violations in context.
Marketing and ad analysis
Analyze competitor ads, review your own creative, and score video content across engagement, clarity, and brand alignment using VidContext's 8 analysis modes.
Accessibility and compliance
Generate detailed video descriptions for accessibility. Extract all on-screen text for compliance review. Create searchable archives of video content.
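For example, a small helper (illustrative, and assuming the response shape shown earlier) can map each scene's timestamp to the text that appears in it, ready for a compliance checklist or a search index:

```python
def extract_on_screen_text(analysis: dict) -> dict[str, list[str]]:
    """Map each scene's timestamp to the on-screen text found in it,
    using the visual_scenes structure from the sample response."""
    return {
        scene["timestamp"]: scene.get("on_screen_text", [])
        for scene in analysis.get("visual_scenes", [])
    }
```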
Frequently asked questions
What does 'video understanding' mean?
Video understanding goes beyond simple detection. Instead of labeling individual objects or transcribing speech in isolation, VidContext watches the entire video and describes what is happening — who is speaking, what the setting conveys, how scenes connect, and what the overall narrative communicates. The output reads like a detailed briefing, not a list of tags.
How is this different from object detection or video labeling?
Object detection tells you 'there is a car.' Video understanding tells you 'a red sedan pulls into a dealership lot while a voiceover explains the current lease offer, and a price graphic appears in the lower third.' VidContext captures the relationship between visual elements, audio, text, and narrative flow — not just isolated labels.
Can VidContext understand context and narrative across scenes?
Yes. The API processes video at 2 frames per second and tracks how scenes connect. It identifies narrative arcs, topic shifts, visual callbacks, and thematic through-lines. Product demos, tutorials, and story-driven ads are all analyzed with full awareness of how each scene builds on the last.
Does it work with non-English video?
VidContext transcribes and analyzes audio in any language supported by the underlying model. On-screen text extraction works across scripts and character sets. The structured output is returned in English regardless of the source language.
How does the AI model work under the hood?
VidContext uses Gemini 3.1 Pro at high resolution, sampling 2 frames per second. Every frame is analyzed for visual content, on-screen text, and scene composition. Audio is processed for speech, music, and sound effects. The model then synthesizes all of this into a coherent understanding of the full video. It's more than a transcript — it gives your AI agent eyes.
Ready to start?
5 free analyses without an account. 20 credits on signup. No credit card required.
Try VidContext free