What happens to my video after processing?

Deleted immediately. We never store your video files. Only the text output is saved in your account.

What are the 8 analysis modes?

Context (extracts everything into structured text), Editor (frame-by-frame breakdown for AI video editors), Creator Analysis (content performance scoring), Ad Analysis (ad effectiveness for media buyers), E-commerce (product video conversion analysis), Training (pedagogical effectiveness for L&D), UGC Vetting (creator evaluation for brand partnerships), and Competitor Intelligence (competitive threat scoring).

How accurate is the extraction?

VidContext uses Gemini 3.1 Pro at 2 frames per second with high resolution. It captures on-screen text, brand logos, audio cues, and scene transitions that most humans miss on a first watch.

Is this just a wrapper around Gemini?

Gemini handles the vision layer. The 8 analysis modes, scoring frameworks, structured output format, and the full extraction pipeline are proprietary. Raw Gemini gives you a paragraph. VidContext gives you expert-scored analysis with actionable recommendations.

What about compliance and data privacy?

Videos are deleted immediately after processing. No video storage, no retention. All traffic over HTTPS. API key authentication on every request. We never log or store video content.

Video OCR API

Traditional OCR works on static images. VidContext reads every piece of text that appears across your entire video — titles, captions, prices, watermarks, URLs, phone numbers, and graphics. Timestamped, structured, and included with every analysis.


Why video needs its own OCR

Standard OCR tools process a single image. Video contains thousands of frames, each potentially showing different text — a title card at the start, a price at second 14, a disclaimer in the final frame. Running image OCR frame-by-frame is slow, expensive, and produces massive amounts of duplicate data.

VidContext handles the entire video in a single API call. It samples at 2 frames per second at high resolution, extracts all visible text, deduplicates it across frames, and maps each text element to the scene where it appears. The result is a clean, structured list of everything written on screen, without the noise of frame-by-frame extraction.
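The sketch below shows the general idea behind cross-frame deduplication. It is a generic technique, not VidContext's pipeline. The input format (one set of strings per sampled frame) and the function name `merge_frame_text` are hypothetical. Consecutive identical readings are merged into one span with a start and end time. At 2 frames per second, a title that stays up for ten seconds becomes one row instead of twenty.

```python
from typing import Dict, List, Set, Tuple


def merge_frame_text(frames: List[Set[str]], fps: float) -> List[Tuple[str, float, float]]:
    """Collapse per-frame readings into (text, start_s, end_s) spans.

    Hypothetical input: frames[i] holds the strings read from the i-th sampled
    frame, taken at time i / fps. end_s is the time of the first sample where
    the text is no longer visible.
    """
    interval = 1.0 / fps
    open_spans: Dict[str, float] = {}  # text -> start time of its current run
    spans: List[Tuple[str, float, float]] = []
    # An empty sentinel frame at the end closes every run that is still open.
    for i, texts in enumerate(frames + [set()]):
        t = i * interval
        # Close runs for text that disappeared in this frame.
        for text in list(open_spans):
            if text not in texts:
                spans.append((text, open_spans.pop(text), t))
        # Start runs for text that just appeared (existing runs keep their start).
        for text in texts:
            open_spans.setdefault(text, t)
    return sorted(spans, key=lambda s: (s[1], s[0]))
```

Text that disappears and later reappears produces two separate spans. This keeps the timeline faithful to what a viewer actually saw.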

What you get

Full-video text extraction

Every piece of text that appears anywhere in the video, from the first frame to the last. No manual frame selection needed.

Timestamped results

Each text element is mapped to the exact scene and time range where it appears. Know when a price flashes, when a URL shows, when a disclaimer appears.
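In the sample response on this page, the `timestamp` values are `M:SS-M:SS` strings. You may need numeric times, for example to measure how long a price stays on screen. A small parser like the one below handles that format. Accepting `H:MM:SS` for longer videos is an assumption, since the example does not show it.

```python
from typing import Tuple


def parse_time(value: str) -> int:
    """Convert 'M:SS' (or, by assumption, 'H:MM:SS') to whole seconds."""
    seconds = 0
    for part in value.strip().split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def parse_range(timestamp: str) -> Tuple[int, int]:
    """Split a 'start-end' range such as '0:05-0:18' into (start_s, end_s)."""
    start, end = timestamp.split("-")
    return parse_time(start), parse_time(end)
```

For example, `parse_range("0:05-0:18")` returns `(5, 18)`, so the text in that scene is on screen for 13 seconds.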

Beyond basic OCR

Not just raw text — VidContext understands the role each text element plays. It distinguishes titles from captions, prices from descriptions, watermarks from content.
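In the sample response on this page, `on_screen_text` is a plain list of strings. To bucket those strings yourself, one option is a rough regex heuristic like the one below. The category names and patterns are illustrative assumptions. They are not part of the API's output and will misfile formats they do not cover.

```python
import re
from typing import Dict, List

# Hypothetical categories, checked in order; the first matching pattern wins.
PATTERNS = [
    ("url", re.compile(r"\b(?:https?://|www\.)\S+|\b[\w-]+\.(?:com|net|org|io)\b", re.I)),
    ("promo_code", re.compile(r"\b(?i:code)\s+[A-Z0-9]{4,}\b")),
    ("price", re.compile(r"[$€£]\s?\d[\d,]*(?:\.\d{2})?")),
    ("phone", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]


def tag_text(items: List[str]) -> Dict[str, List[str]]:
    """Group on-screen strings by the first pattern they match, else 'other'."""
    tags: Dict[str, List[str]] = {}
    for text in items:
        label = next((name for name, rx in PATTERNS if rx.search(text)), "other")
        tags.setdefault(label, []).append(text)
    return tags
```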

Paired with scene context

On-screen text is returned alongside full scene descriptions. You see not just what the text says, but what was happening when it appeared.

All video types supported

Ads, tutorials, presentations, product demos, social media clips, broadcast content, webinars. Any video format, any content type.

Included in every analysis

On-screen text extraction is not a separate add-on. It is part of every VidContext API call at no additional cost.

Extract on-screen text with one call

curl -X POST https://api.vidcontext.com/v1/analyze \
  -H "X-API-Key: vc_your_key_here" \
  -F "file=@ad-video.mp4" \
  -F "output_format=context"

{
  "visual_scenes": [
    {
      "timestamp": "0:00-0:05",
      "on_screen_text": [
        "SUMMER SALE — UP TO 60% OFF",
        "www.brandname.com"
      ],
      "description": "Full-screen promotional graphic with bold white text on a gradient background."
    },
    {
      "timestamp": "0:05-0:18",
      "on_screen_text": [
        "Nike Air Max 90",
        "$89.99  (was $149.99)",
        "Free shipping on orders over $50"
      ],
      "description": "Product showcase. White sneaker rotating on a platform. Price tag and shipping offer displayed in the lower third."
    },
    {
      "timestamp": "0:18-0:24",
      "on_screen_text": [
        "Use code SUMMER60 at checkout",
        "Offer ends July 31",
        "Terms apply — see brandname.com/terms"
      ],
      "description": "End card with discount code prominently displayed. Fine print disclaimer at the bottom of the frame."
    }
  ],
  "brands_detected": ["Nike", "Air Max"],
  "transcript": "This summer, step into something fresh..."
}
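Once the response is parsed, the per-scene lists can be flattened into a single timeline. This minimal sketch relies only on the field names shown in the sample response above: `visual_scenes`, `timestamp`, and `on_screen_text`. It ignores any other fields. The helper names `text_timeline` and `find_text` are hypothetical.

```python
from typing import List, Tuple


def text_timeline(response: dict) -> List[Tuple[str, str]]:
    """Return (timestamp, text) pairs in scene order."""
    return [
        (scene["timestamp"], text)
        for scene in response.get("visual_scenes", [])
        for text in scene.get("on_screen_text", [])
    ]


def find_text(response: dict, needle: str) -> List[str]:
    """Timestamps of scenes whose on-screen text contains `needle` (case-insensitive)."""
    needle = needle.lower()
    # dict.fromkeys drops repeated timestamps while keeping scene order.
    return list(dict.fromkeys(
        ts for ts, text in text_timeline(response) if needle in text.lower()
    ))
```

To use it with the curl call above, save the response body and load it with `json.load`. Then pass the resulting dict to either function.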

Use cases for video OCR

Ad compliance and verification

Automatically extract disclaimers, pricing claims, offer terms, and fine print from video ads. Verify that required disclosures appear and are legible.
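One way to check whether required disclosures are present is a plain lookup over the returned text. In this minimal sketch, the required phrases are hypothetical examples that you would replace with your own rules. Matching is a case-insensitive substring test, so paraphrased disclosures are not caught. Legibility is not checked here and still needs separate review.

```python
from typing import Dict, List


def check_disclosures(response: dict, required: List[str]) -> Dict[str, List[str]]:
    """Map each required phrase to the timestamps where it appears (empty list = missing)."""
    found: Dict[str, List[str]] = {phrase: [] for phrase in required}
    for scene in response.get("visual_scenes", []):
        texts = [t.lower() for t in scene.get("on_screen_text", [])]
        for phrase in required:
            # Check each string separately so matches never span two text elements.
            if any(phrase.lower() in t for t in texts):
                found[phrase].append(scene["timestamp"])
    return found


def missing_disclosures(response: dict, required: List[str]) -> List[str]:
    """Required phrases that never appear on screen."""
    return [p for p, hits in check_disclosures(response, required).items() if not hits]
```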

Competitive ad monitoring

Pull pricing, promotional codes, product names, and CTAs from competitor video ads at scale. Track how their messaging changes over time.
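Tracking how competitor messaging changes over time largely comes down to diffing the text extracted on different dates. This sketch compares the prices and promo codes pulled from two runs over the same ad. The regular expressions are rough assumptions and will miss price or code formats they do not cover.

```python
import re
from typing import Dict, Iterable, Set

PRICE = re.compile(r"[$€£]\s?\d[\d,]*(?:\.\d{2})?")
PROMO = re.compile(r"\b(?i:code)\s+([A-Z0-9]{4,})\b")  # captures the code itself


def extract_offers(texts: Iterable[str]) -> Dict[str, Set[str]]:
    """Collect the distinct prices and promo codes found in a list of strings."""
    prices: Set[str] = set()
    codes: Set[str] = set()
    for text in texts:
        prices.update(PRICE.findall(text))
        codes.update(PROMO.findall(text))
    return {"prices": prices, "codes": codes}


def diff_offers(before: Iterable[str], after: Iterable[str]) -> Dict[str, Dict[str, Set[str]]]:
    """Report which prices and codes were added or removed between two snapshots."""
    old, new = extract_offers(before), extract_offers(after)
    return {
        key: {"added": new[key] - old[key], "removed": old[key] - new[key]}
        for key in old
    }
```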

Content indexing and search

Make video libraries searchable by on-screen text. Find every video where a specific product name, price point, or URL appears.
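To search across a library, one simple client-side approach is an inverted index. It maps each word to the (video, timestamp) pairs where the word appears, built from responses you have already stored. This in-memory sketch is an assumption about how you might do it yourself, not a VidContext feature. The video IDs are whatever keys you use, and a production system would more likely use a dedicated search engine.

```python
import re
from collections import defaultdict
from typing import DefaultDict, Dict, List, Set, Tuple

Hit = Tuple[str, str]  # (video_id, timestamp)


def build_index(responses: Dict[str, dict]) -> DefaultDict[str, Set[Hit]]:
    """Index every word of every on-screen string by video and scene."""
    index: DefaultDict[str, Set[Hit]] = defaultdict(set)
    for video_id, response in responses.items():
        for scene in response.get("visual_scenes", []):
            for text in scene.get("on_screen_text", []):
                for word in re.findall(r"\w+", text.lower()):
                    index[word].add((video_id, scene["timestamp"]))
    return index


def search(index: Dict[str, Set[Hit]], query: str) -> List[Hit]:
    """Scenes whose on-screen text contains every word of the query, sorted."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)
```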

Accessibility and captioning

Extract on-screen text to ensure it is also represented in audio descriptions. Verify that important visual text is accessible to all viewers.
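As a first pass, you can flag on-screen text whose words are mostly absent from the transcript. Those strings are candidates for audio description. This sketch assumes the `transcript` field shown in the sample response. It uses a crude word-overlap test with an arbitrary 50% threshold, so treat it as a triage aid rather than an accessibility audit.

```python
import re
from typing import List, Set


def _words(text: str) -> Set[str]:
    return set(re.findall(r"\w+", text.lower()))


def uncovered_text(response: dict, threshold: float = 0.5) -> List[str]:
    """On-screen strings with less than `threshold` of their words spoken in the transcript."""
    spoken = _words(response.get("transcript", ""))
    flagged: List[str] = []
    for scene in response.get("visual_scenes", []):
        for text in scene.get("on_screen_text", []):
            ws = _words(text)
            if ws and len(ws & spoken) / len(ws) < threshold:
                flagged.append(text)
    return flagged
```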

Frequently asked questions

What types of text can VidContext extract from video?

VidContext extracts any text visible on screen: title cards, lower thirds, captions, subtitles, price tags, phone numbers, URLs, watermarks, brand names, product labels, street signs, whiteboard writing, presentation slides, and graphic overlays. If a human can read it on screen, VidContext captures it.

Is it frame-by-frame OCR?

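Not in the traditional sense. Rather than running OCR on every frame, VidContext samples the video at 2 frames per second at high resolution. At that rate a frame is taken every half second. Text that stays on screen for at least half a second is therefore covered by at least one sampled frame. Shorter flashes and quick cuts may or may not be caught, depending on where they fall between samples. Each extracted text element is mapped to the scene and timestamp where it appears.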
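To see how the sampling rate relates to brief on-screen text, consider the rate stated here. At 2 frames per second, one sample is taken every 0.5 seconds. Whether a short graphic is seen therefore depends on how long it stays up and where it falls relative to the samples. This sketch assumes evenly spaced samples starting at t = 0, which is a simplification because the exact sample timing is not documented here.

```python
import math
from typing import List


def samples_in_window(start: float, end: float, fps: float = 2.0) -> List[float]:
    """Sample times (seconds) inside [start, end], assuming samples at k / fps."""
    first = math.ceil(start * fps)
    last = math.floor(end * fps)
    return [k / fps for k in range(first, last + 1)]
```

Any window at least 0.5 seconds long contains at least one sample. A 0.3 second flash from 1.1 s to 1.4 s falls between samples and yields an empty list. The same flash shifted to 1.4 s to 1.7 s would contain the 1.5 s sample.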

Can it handle animated or moving text?

Yes. Text that scrolls, fades, slides, or animates across the screen is captured as it becomes readable. The AI model processes each frame independently, so text in motion is detected at the point where it is most legible.

Does it work with handwritten text?

VidContext can read handwriting that is clearly visible on screen — whiteboard notes, handwritten signs, sketches with labels. Accuracy depends on legibility, just as it would for a human reader. Printed and digital text has the highest accuracy.

What languages and scripts are supported?

On-screen text extraction works across Latin, Cyrillic, CJK (Chinese, Japanese, Korean), Arabic, Devanagari, and other major scripts. The extracted text is returned as-is in its original language, and the scene descriptions are provided in English.
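Extracted text comes back in its original script, so you may want to route strings by script before translating or indexing them. This sketch guesses the dominant script from Python's `unicodedata.name`, whose character names begin with the script name (for example `LATIN`, `CYRILLIC`, `CJK`, `ARABIC`, `DEVANAGARI`). It is a rough heuristic, not something the API returns, and it folds Japanese kana and Korean Hangul into the CJK bucket.

```python
import unicodedata
from collections import Counter

# Leading word of the Unicode character name -> script label (assumed grouping).
PREFIXES = {
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillic",
    "CJK": "CJK",
    "HIRAGANA": "CJK",
    "KATAKANA": "CJK",
    "HANGUL": "CJK",
    "ARABIC": "Arabic",
    "DEVANAGARI": "Devanagari",
}


def dominant_script(text: str) -> str:
    """Most common script among the alphabetic characters of `text`."""
    counts: Counter = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation and combining vowel signs
        name = unicodedata.name(ch, "")
        counts[PREFIXES.get(name.split(" ")[0], "Other")] += 1
    return counts.most_common(1)[0][0] if counts else "Unknown"
```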

Ready to start?

5 free analyses without an account. 20 credits on signup. No credit card required.
