Video OCR API
Traditional OCR works on static images. VidContext reads every piece of text that appears across your entire video — titles, captions, prices, watermarks, URLs, phone numbers, and graphics. Timestamped, structured, and included with every analysis.
Why video needs its own OCR
Standard OCR tools process a single image. Video contains thousands of frames, each potentially showing different text — a title card at the start, a price at second 14, a disclaimer in the final frame. Running image OCR frame-by-frame is slow, expensive, and produces massive amounts of duplicate data.
VidContext handles the entire video in a single API call. It samples at 4 frames per second at high resolution, extracts all visible text, deduplicates across frames, and maps each text element to the scene where it appears. The result is a clean, structured list of everything written on screen — without the noise of frame-by-frame extraction.
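To make the deduplication step concrete, here is a minimal sketch of how repeated per-frame readings of the same text could be collapsed into time ranges. It is not VidContext's implementation. The input format (a list of `(time_in_seconds, text)` sightings), the function name `merge_sightings`, and the `max_gap` parameter are all assumptions made for illustration.

```python
from collections import defaultdict


def merge_sightings(sightings, max_gap=0.5):
    """Collapse repeated per-frame text sightings into time ranges.

    sightings: iterable of (time_seconds, text) pairs, one per sampled
               frame in which the text was read (hypothetical format).
    max_gap:   largest gap in seconds between two sightings of the same
               text that still counts as one continuous appearance.
    Returns a list of (text, start, end) tuples sorted by start time.
    """
    times_by_text = defaultdict(list)
    for t, text in sightings:
        times_by_text[text.strip()].append(t)

    ranges = []
    for text, times in times_by_text.items():
        times.sort()
        start = prev = times[0]
        for t in times[1:]:
            if t - prev > max_gap:
                # The text disappeared for a while: close the current range.
                ranges.append((text, start, prev))
                start = t
            prev = t
        ranges.append((text, start, prev))
    return sorted(ranges, key=lambda r: (r[1], r[0]))


# Simulated per-frame OCR output, one sample every 0.25 s: the same
# title is read in five consecutive samples, then a price is read twice.
frames = [(i * 0.25, "SUMMER SALE") for i in range(5)]
frames += [(14.0, "$89.99"), (14.25, "$89.99")]

merged = merge_sightings(frames)
for text, start, end in merged:
    print(f"{text!r}: {start}s to {end}s")
```

Seven raw sightings reduce to two entries here, which is the kind of noise reduction the paragraph above describes.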
What you get
Full-video text extraction
Every piece of text that appears anywhere in the video, from the first frame to the last. No manual frame selection needed.
Timestamped results
Each text element is mapped to the exact scene and time range where it appears. Know when a price flashes, when a URL shows, when a disclaimer appears.
Beyond basic OCR
Not just raw text — VidContext understands the role each text element plays. It distinguishes titles from captions, prices from descriptions, watermarks from content.
Paired with scene context
On-screen text is returned alongside full scene descriptions. You see not just what the text says, but what was happening when it appeared.
All video types supported
Ads, tutorials, presentations, product demos, social media clips, broadcast content, webinars. Any video format, any content type.
Included in every analysis
On-screen text extraction is not a separate add-on. It is part of every VidContext API call at no additional cost.
Extract on-screen text with one call
curl -X POST https://api.vidcontext.com/v1/analyze \
  -H "X-API-Key: vc_your_key_here" \
  -F "file=@ad-video.mp4" \
  -F "output_format=context"
{
  "visual_scenes": [
    {
      "timestamp": "0:00-0:05",
      "on_screen_text": [
        "SUMMER SALE — UP TO 60% OFF",
        "www.brandname.com"
      ],
      "description": "Full-screen promotional graphic with bold white text on a gradient background."
    },
    {
      "timestamp": "0:05-0:18",
      "on_screen_text": [
        "Nike Air Max 90",
        "$89.99 (was $149.99)",
        "Free shipping on orders over $50"
      ],
      "description": "Product showcase. White sneaker rotating on a platform. Price tag and shipping offer displayed in lower third."
    },
    {
      "timestamp": "0:18-0:24",
      "on_screen_text": [
        "Use code SUMMER60 at checkout",
        "Offer ends July 31",
        "Terms apply — see brandname.com/terms"
      ],
      "description": "End card with discount code prominently displayed. Fine print disclaimer at bottom of frame."
    }
  ],
  "brands_detected": ["Nike", "Air Max"],
  "transcript": "This summer, step into something fresh..."
}
Use cases for video OCR
Ad compliance and verification
Automatically extract disclaimers, pricing claims, offer terms, and fine print from video ads. Verify that required disclosures appear and are legible.
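A check along these lines could be scripted against the response. The sketch below assumes the response shape shown in the sample response above (`visual_scenes`, each with a `timestamp` such as "0:18-0:24" and an `on_screen_text` list). The helper names `to_seconds` and `find_disclosures` and the list of required phrases are hypothetical. It checks only presence and time on screen; it does not assess legibility.

```python
def to_seconds(clock):
    """Convert "M:SS" or "H:MM:SS" to a number of seconds."""
    total = 0
    for part in clock.split(":"):
        total = total * 60 + int(part)
    return total


def find_disclosures(response, required_phrases):
    """Map each required phrase to the scenes whose on-screen text contains it."""
    found = {phrase: [] for phrase in required_phrases}
    for scene in response.get("visual_scenes", []):
        start, end = (to_seconds(p) for p in scene["timestamp"].split("-"))
        for line in scene.get("on_screen_text", []):
            for phrase in required_phrases:
                # Case-insensitive substring match on each text element.
                if phrase.lower() in line.lower():
                    found[phrase].append({"start": start, "end": end, "text": line})
    return found


# Trimmed copy of the sample response, already parsed from JSON.
response = {
    "visual_scenes": [
        {"timestamp": "0:05-0:18", "on_screen_text": ["$89.99 (was $149.99)"]},
        {"timestamp": "0:18-0:24", "on_screen_text": [
            "Offer ends July 31",
            "Terms apply — see brandname.com/terms",
        ]},
    ]
}

required = ["terms apply", "offer ends", "while supplies last"]  # hypothetical policy
report = find_disclosures(response, required)
missing = [phrase for phrase, hits in report.items() if not hits]

for phrase, hits in report.items():
    for hit in hits:
        print(f"{phrase!r} on screen from {hit['start']}s to {hit['end']}s: {hit['text']}")
print("missing:", missing)
```

Substring matching is deliberately simple. A real compliance workflow would probably need fuzzier matching to tolerate line breaks and wording variants.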
Competitive ad monitoring
Pull pricing, promotional codes, product names, and CTAs from competitor video ads at scale. Track how their messaging changes over time.
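As a rough illustration, prices and promo codes can be pulled out of the `on_screen_text` strings with regular expressions. The patterns and the `extract_offers` helper below are assumptions for this sketch, tuned only to the US-dollar formats in the sample response; they are not part of the VidContext API.

```python
import re

# Dollar amounts such as "$50", "$89.99", or "$1,299.00".
PRICE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")
# An upper-case alphanumeric token following the word "code".
PROMO = re.compile(r"\bcode\s+([A-Z0-9]{4,})\b")


def extract_offers(on_screen_text):
    """Collect dollar amounts and promo codes from a list of text elements."""
    prices, codes = [], []
    for line in on_screen_text:
        prices.extend(PRICE.findall(line))
        codes.extend(PROMO.findall(line))
    return {"prices": prices, "codes": codes}


# Text elements taken from the sample response.
lines = [
    "Nike Air Max 90",
    "$89.99 (was $149.99)",
    "Free shipping on orders over $50",
    "Use code SUMMER60 at checkout",
]

offers = extract_offers(lines)
print(offers)
```

Note that the shipping threshold ("$50") is captured alongside the product prices. Telling them apart needs the surrounding words, which is why keeping each amount linked to its full text element is useful.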
Content indexing and search
Make video libraries searchable by on-screen text. Find every video where a specific product name, price point, or URL appears.
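One way to approach this is a small inverted index built from stored responses. The sketch below assumes each response has the shape of the sample response (`visual_scenes` with `timestamp` and `on_screen_text`). The function names, the tokenizer, and the second video are hypothetical, and a production system would more likely hand this to a search engine.

```python
import re
from collections import defaultdict

# Keeps prices ("$89.99") and URLs ("www.brandname.com") as single tokens.
TOKEN = re.compile(r"[a-z0-9$%]+(?:[./][a-z0-9]+)*")


def tokenize(text):
    return TOKEN.findall(text.lower())


def build_text_index(analyses):
    """analyses: dict mapping a video id to its parsed response."""
    index = defaultdict(set)
    for video_id, response in analyses.items():
        for scene in response.get("visual_scenes", []):
            for line in scene.get("on_screen_text", []):
                for token in tokenize(line):
                    index[token].add((video_id, scene["timestamp"]))
    return index


def search(index, query):
    """Return (video_id, timestamp) pairs for scenes containing every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    return set.intersection(*(index.get(t, set()) for t in tokens))


analyses = {
    "ad-video.mp4": {"visual_scenes": [
        {"timestamp": "0:00-0:05",
         "on_screen_text": ["SUMMER SALE — UP TO 60% OFF", "www.brandname.com"]},
        {"timestamp": "0:05-0:18",
         "on_screen_text": ["Nike Air Max 90", "$89.99 (was $149.99)"]},
    ]},
    "other-ad.mp4": {"visual_scenes": [  # hypothetical second video
        {"timestamp": "0:00-0:04", "on_screen_text": ["Winter sale starts now"]},
    ]},
}

index = build_text_index(analyses)
print(sorted(search(index, "air max")))
print(sorted(search(index, "$89.99")))
print(sorted(search(index, "sale")))
```

Because results carry the scene timestamp, a match can link straight to the moment in the video where the text appears.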
Accessibility and captioning
Extract on-screen text to ensure it is also represented in audio descriptions. Verify that important visual text is accessible to all viewers.
Frequently asked questions
What types of text can VidContext extract from video?
VidContext extracts any text visible on screen: title cards, lower thirds, captions, subtitles, price tags, phone numbers, URLs, watermarks, brand names, product labels, street signs, whiteboard writing, presentation slides, and graphic overlays. If a human can read it on screen, VidContext captures it.
Is it frame-by-frame OCR?
Not every frame. VidContext samples at 2 frames per second at high resolution. This captures text that appears only briefly, such as flash frames, quick cuts, and short-lived graphics. Each extracted text element is mapped to the scene and timestamp where it appears.
Can it handle animated or moving text?
Yes. Text that scrolls, fades, slides, or animates across the screen is captured as it becomes readable. The AI model processes each frame independently, so text in motion is detected at the point where it is most legible.
Does it work with handwritten text?
VidContext can read handwriting that is clearly visible on screen — whiteboard notes, handwritten signs, sketches with labels. Accuracy depends on legibility, just as it would for a human reader. Printed and digital text has the highest accuracy.
What languages and scripts are supported?
On-screen text extraction works across Latin, Cyrillic, CJK (Chinese, Japanese, Korean), Arabic, Devanagari, and other major scripts. The extracted text is returned as-is in its original language, and the scene descriptions are provided in English.
Ready to start?
5 free analyses without an account. 20 credits on signup. No credit card required.
Try VidContext free