9 min read

5 ways to extract structured data from video with one API call

How to extract transcripts, on-screen text, scenes, brands, and audio data from any video using a single API request. Includes code examples.

video extraction · structured data · tutorial · automation

Video is full of data. Spoken words. Text burned into the frame. Brand logos in the background. Scene changes. Music cues. All of it is useful, and all of it is locked behind pixels and audio waveforms.

If you wanted to extract structured data from video automatically a year ago, you'd be wiring together ffmpeg for frame extraction, Whisper for transcription, an OCR library for on-screen text, and maybe GPT-4V for visual understanding. Four tools, four output formats, four failure modes. It worked, but it was slow to build and annoying to maintain.

There's a simpler way now. One API call, one JSON response, five types of structured data.

This tutorial walks through each extraction method using the VidContext API, with code examples and real output samples. By the end, you'll know how to pull transcripts, on-screen text, scene breakdowns, brand detections, and audio analysis from any video file.

Method 1: Full transcript extraction

What it extracts: Timestamped spoken words from the video's audio track.

When you'd use this: Search indexing, content repurposing, subtitle generation, accessibility compliance, feeding video content into RAG pipelines.

The simplest way to get a transcript is a curl request with output_format=context:

curl -X POST https://api.vidcontext.com/v1/analyze \
  -H "X-API-Key: vc_your_key_here" \
  -F "url=https://example.com/product-demo.mp4" \
  -F "output_format=context"

The response includes a transcript array. Each entry has a timestamp and the spoken text:

{
  "metadata": {
    "duration_seconds": 184,
    "resolution": "1920x1080",
    "fps": 30
  },
  "transcript": [
    {
      "timestamp": "00:00:02",
      "text": "Welcome to the product walkthrough."
    },
    {
      "timestamp": "00:00:08",
      "text": "Today I'll show you three features we shipped last week."
    },
    {
      "timestamp": "00:00:14",
      "text": "Let's start with the new dashboard."
    }
  ]
}

The timestamps correspond to when each phrase starts in the video. This makes it straightforward to build a searchable index where clicking a result takes you to the exact moment in the video.
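The timestamps in the sample above convert cleanly to seconds, which is all a clickable index needs. Here's a minimal sketch; the `#t=` fragment is the standard media-fragment convention, but your player may expect a different parameter:

```python
# Build a clickable transcript index from the sample response above.
# The #t=<seconds> deep-link convention is an assumption; adapt to your player.

def ts_to_seconds(ts: str) -> int:
    """Convert an "HH:MM:SS" timestamp to whole seconds."""
    h, m, s = (int(part) for part in ts.split(":"))
    return h * 3600 + m * 60 + s

def build_index(transcript, video_url):
    """Pair each transcript entry with a link to its exact moment."""
    return [
        {
            "text": entry["text"],
            "seconds": ts_to_seconds(entry["timestamp"]),
            "link": f"{video_url}#t={ts_to_seconds(entry['timestamp'])}",
        }
        for entry in transcript
    ]

transcript = [
    {"timestamp": "00:00:02", "text": "Welcome to the product walkthrough."},
    {"timestamp": "00:00:14", "text": "Let's start with the new dashboard."},
]
index = build_index(transcript, "https://example.com/product-demo.mp4")
print(index[1]["link"])  # https://example.com/product-demo.mp4#t=14
```

Store these entries in any full-text search engine and each hit comes back with a jump-to link for free.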

One thing to note: the transcript captures all spoken audio, including background conversations. If you're processing videos with multiple speakers, you'll get everything in sequence. Speaker diarization (who said what) isn't broken out separately, but the timestamps make it possible to infer speaker changes from the visual scene data.

Method 2: On-screen text extraction (OCR)

What it extracts: Any text visible in the video frame. Titles, lower thirds, product labels, slide content, UI text, signs, captions baked into the video.

When you'd use this: Ad compliance checking ("did the disclaimer appear for the required duration?"), extracting slide content from recorded presentations, pulling product information from unboxing videos, reading code from tutorial screencasts.

Here's a Python example:

import requests

response = requests.post(
    "https://api.vidcontext.com/v1/analyze",
    headers={"X-API-Key": "vc_your_key_here"},
    data={"url": "https://example.com/presentation.mp4", "output_format": "context"}
)

data = response.json()

for text_item in data["on_screen_text"]:
    print(f"[{text_item['timestamp']}] {text_item['text']}")
    print(f"  Position: {text_item['position']}")
    print()

The on_screen_text section in the response looks like this:

{
  "on_screen_text": [
    {
      "timestamp": "00:00:05",
      "text": "Q4 Revenue Report",
      "position": "center",
      "type": "title"
    },
    {
      "timestamp": "00:00:05",
      "text": "Prepared by Finance Team - December 2025",
      "position": "bottom",
      "type": "subtitle"
    },
    {
      "timestamp": "00:01:12",
      "text": "Total Revenue: $4.2M (+18% YoY)",
      "position": "center",
      "type": "body"
    },
    {
      "timestamp": "00:02:30",
      "text": "John Smith, CFO",
      "position": "bottom-left",
      "type": "lower_third"
    }
  ]
}

The position field tells you where the text appeared in the frame. The type field classifies it as a title, subtitle, lower third, body text, or label. This classification happens automatically based on the text's size, position, and visual styling.

For presentation videos, this is essentially a slide-to-text converter. You get the content of every slide with timestamps, which means you can reconstruct an outline of the entire presentation without watching it.
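As a rough sketch of that outline reconstruction, this walks the on_screen_text sample above and indents each item by its type; the indentation scheme is my own choice, not part of the API:

```python
# Turn on_screen_text entries into an indented presentation outline.
# The per-type indentation below is illustrative.
INDENT = {"title": "", "subtitle": "  ", "body": "    ",
          "lower_third": "    ", "label": "    "}

def build_outline(on_screen_text):
    """One line per text item, indented by its classified type."""
    lines = []
    for item in on_screen_text:
        prefix = INDENT.get(item["type"], "    ")
        lines.append(f"{prefix}[{item['timestamp']}] {item['text']}")
    return "\n".join(lines)

on_screen_text = [
    {"timestamp": "00:00:05", "text": "Q4 Revenue Report",
     "position": "center", "type": "title"},
    {"timestamp": "00:01:12", "text": "Total Revenue: $4.2M (+18% YoY)",
     "position": "center", "type": "body"},
]
print(build_outline(on_screen_text))
```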

Method 3: Visual scene breakdown

What it extracts: Scene-by-scene descriptions with timestamps, visual elements, camera movements, and transitions.

When you'd use this: Video indexing and search, content moderation, auto-generating thumbnail candidates, building chapter markers, training data annotation.

The visual_scenes array breaks the video into logical segments:

{
  "visual_scenes": [
    {
      "start_time": "00:00:00",
      "end_time": "00:00:08",
      "description": "Close-up of a person at a desk with two monitors. Office environment with natural lighting from a window on the left. The person is looking directly at the camera.",
      "elements": ["person", "desk", "monitors", "office", "window"],
      "camera": "static",
      "transition_to_next": "cut"
    },
    {
      "start_time": "00:00:08",
      "end_time": "00:00:34",
      "description": "Screen recording of a web dashboard. Mouse cursor navigating through a sidebar menu, then clicking into an analytics panel showing bar charts and a date range selector.",
      "elements": ["screen_recording", "dashboard", "charts", "cursor"],
      "camera": "screen_capture",
      "transition_to_next": "cut"
    },
    {
      "start_time": "00:00:34",
      "end_time": "00:00:52",
      "description": "Split screen showing the person on the left speaking and the dashboard on the right. The person is gesturing toward the dashboard.",
      "elements": ["person", "split_screen", "dashboard", "gesturing"],
      "camera": "static",
      "transition_to_next": "fade"
    }
  ]
}

Each scene includes a plain-text description you can feed directly into a search index or an LLM. The elements array gives you machine-parseable tags. The camera field tells you whether it's a static shot, panning, zooming, a screen recording, or something else.

I find the transition_to_next field useful for detecting editing style. A video that's all hard cuts has a different feel from one with fades and dissolves, and this data lets you quantify that.
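One way to quantify it, sketched against the visual_scenes format above (the helper name and rounding are my own):

```python
from collections import Counter

def editing_style(visual_scenes):
    """Fraction of each transition type across the video's scenes."""
    counts = Counter(s["transition_to_next"] for s in visual_scenes)
    total = sum(counts.values())
    return {t: round(n / total, 2) for t, n in counts.items()}

scenes = [
    {"transition_to_next": "cut"},
    {"transition_to_next": "cut"},
    {"transition_to_next": "fade"},
]
print(editing_style(scenes))  # {'cut': 0.67, 'fade': 0.33}
```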

Method 4: Brand and logo detection

What it extracts: Brand names, logos, and product placements visible in the video.

When you'd use this: Sponsorship verification ("did the sponsor's logo actually appear for 10 seconds as agreed?"), competitive analysis, brand safety auditing, influencer marketing compliance.

{
  "brands_detected": [
    {
      "name": "Apple",
      "type": "logo",
      "first_seen": "00:00:12",
      "last_seen": "00:02:45",
      "occurrences": 8,
      "context": "MacBook Pro visible on desk throughout video"
    },
    {
      "name": "Slack",
      "type": "ui",
      "first_seen": "00:00:42",
      "last_seen": "00:00:58",
      "occurrences": 2,
      "context": "Slack desktop app shown during screen recording segment"
    },
    {
      "name": "Nike",
      "type": "product",
      "first_seen": "00:01:30",
      "last_seen": "00:01:30",
      "occurrences": 1,
      "context": "Nike logo on speaker's t-shirt"
    }
  ]
}

The type field distinguishes between logos (visual brand marks), UI appearances (software shown on screen), and physical products. The context field gives you a sentence explaining where and how the brand appeared, which saves you from having to cross-reference timestamps manually.

For anyone doing influencer marketing, this is the data you need. You can verify that sponsored products actually appeared in the video, how many times, and for how long. No more scrubbing through videos frame by frame.
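Here's a first-pass verification sketch against the brands_detected format above. The 10-second minimum is an example contract term, and note that first_seen/last_seen only bound the appearance window; they don't prove continuous screen time:

```python
def ts_to_seconds(ts):
    """Convert an "HH:MM:SS" timestamp to whole seconds."""
    h, m, s = (int(p) for p in ts.split(":"))
    return h * 3600 + m * 60 + s

def verify_sponsor(brands_detected, sponsor, min_span_seconds=10):
    """Return True if the sponsor's appearance window meets the agreed span.
    first_seen/last_seen bound the window; treat this as a first-pass filter,
    not proof of continuous on-screen time."""
    for brand in brands_detected:
        if brand["name"].lower() == sponsor.lower():
            span = ts_to_seconds(brand["last_seen"]) - ts_to_seconds(brand["first_seen"])
            return span >= min_span_seconds
    return False

brands = [{"name": "Slack", "first_seen": "00:00:42", "last_seen": "00:00:58"}]
print(verify_sponsor(brands, "Slack"))  # True (16-second window)
```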

Method 5: Audio analysis

What it extracts: Music detection, sound effects, speech patterns, ambient audio, and audio quality metrics.

When you'd use this: Content categorization, mood analysis, accessibility auditing ("is there background music drowning out speech?"), podcast post-production, detecting copyrighted music.

{
  "audio_analysis": {
    "speech": {
      "percentage": 72,
      "speakers_estimated": 1,
      "language": "en",
      "clarity": "high"
    },
    "music": {
      "detected": true,
      "segments": [
        {
          "start_time": "00:00:00",
          "end_time": "00:00:06",
          "type": "intro_jingle",
          "mood": "upbeat"
        },
        {
          "start_time": "00:02:50",
          "end_time": "00:03:04",
          "type": "outro_music",
          "mood": "calm"
        }
      ]
    },
    "ambient": {
      "background_noise": "low",
      "notable_sounds": ["keyboard_typing", "mouse_clicks"]
    },
    "quality": {
      "overall": "good",
      "issues": []
    }
  }
}

The speech.clarity field is particularly useful for accessibility work. If you're auditing video content for compliance, you can programmatically check whether speech is clearly audible or if background music is interfering.

The music.segments array tells you exactly when music plays and what the mood is. If you're building a content moderation pipeline, this helps classify videos without watching them.
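As an illustration, a small audit gate over the audio_analysis fields shown above; the thresholds and flag wording are my own:

```python
def audit_audio(audio_analysis):
    """Collect human-readable flags for an audio compliance pass."""
    flags = []
    if audio_analysis["speech"]["clarity"] != "high":
        flags.append("speech clarity below 'high'")
    if audio_analysis["quality"]["issues"]:
        flags.append(f"quality issues: {audio_analysis['quality']['issues']}")
    if audio_analysis["ambient"]["background_noise"] not in ("low", "none"):
        flags.append("elevated background noise")
    return flags

sample = {
    "speech": {"clarity": "high"},
    "quality": {"issues": []},
    "ambient": {"background_noise": "low"},
}
print(audit_audio(sample))  # [] means the video passes
```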

Combining everything: one call gets all five

You don't need to make five separate requests. All of this data comes back in a single API response when you use output_format=context. Here's the full request and the structure of what you get back:

import requests
import json

response = requests.post(
    "https://api.vidcontext.com/v1/analyze",
    headers={"X-API-Key": "vc_your_key_here"},
    files={"file": open("product-demo.mp4", "rb")},
    data={"output_format": "context"}
)

result = response.json()

# Everything is in one response
print(f"Duration: {result['metadata']['duration_seconds']}s")
print(f"Transcript entries: {len(result['transcript'])}")
print(f"Scenes detected: {len(result['visual_scenes'])}")
print(f"On-screen text items: {len(result['on_screen_text'])}")
print(f"Brands found: {len(result['brands_detected'])}")
print(f"Speech percentage: {result['audio_analysis']['speech']['percentage']}%")

You can also send a URL instead of uploading a file:

response = requests.post(
    "https://api.vidcontext.com/v1/analyze",
    headers={"X-API-Key": "vc_your_key_here"},
    data={
        "url": "https://example.com/video.mp4",
        "output_format": "context"
    }
)

Processing takes about 50 seconds for a 3-minute video. The response is a single JSON object with consistent field names every time. No parsing XML, no stitching together outputs from different tools.

Building a pipeline: webhook to database in 5 minutes

Once you can extract structured data from video, the next step is usually "do something with it automatically." Here's a workflow I've seen users build in n8n (works with Zapier, Make, or plain code too):

  1. Webhook receives a video URL from your app, a form submission, or a Slack bot
  2. VidContext analyzes the video and returns structured JSON
  3. Save the results to your database (Supabase, Postgres, Airtable, wherever)
  4. Send a notification with a summary to Slack or email

In Python, the core of that pipeline is about 30 lines:

import requests

def process_video(video_url, callback_url=None):
    # Step 1: Analyze
    analysis = requests.post(
        "https://api.vidcontext.com/v1/analyze",
        headers={"X-API-Key": "vc_your_key_here"},
        data={"url": video_url, "output_format": "context"}
    ).json()

    # Step 2: Extract what you need
    summary = {
        "url": video_url,
        "duration": analysis["metadata"]["duration_seconds"],
        "transcript_preview": analysis["transcript"][0]["text"] if analysis["transcript"] else "",
        "scene_count": len(analysis["visual_scenes"]),
        "brands": [b["name"] for b in analysis["brands_detected"]],
        "has_music": analysis["audio_analysis"]["music"]["detected"],
    }

    # Step 3: Save to your database (example with Supabase)
    requests.post(
        "https://your-project.supabase.co/rest/v1/video_analyses",
        headers={
            "apikey": "your_supabase_key",
            "Content-Type": "application/json"
        },
        json=summary
    )

    # Step 4: Notify
    if callback_url:
        requests.post(callback_url, json=summary)

    return summary

That's a working pipeline. Add a Flask or FastAPI route in front of it, and you have a video analysis microservice.

DIY vs. API: an honest comparison

If you're considering building this yourself with open-source tools, here's what that looks like next to the API approach:

|  | DIY (ffmpeg + Whisper + GPT-4V) | VidContext API |
|---|---|---|
| Time to build | 2-4 weeks | 15 minutes |
| Cost per 3-min video | ~$0.15-0.40 (varies by models used) | $0.84 ($0.28/min) |
| Maintenance | Ongoing (model updates, dependency conflicts, GPU management) | None |
| Output format | Different per tool; you write the glue code | Consistent JSON schema every time |
| On-screen text | Needs separate OCR pipeline | Included |
| Brand detection | Needs custom model or manual rules | Included |
| Audio analysis | Needs separate audio processing | Included |
| Scale | Limited by your GPU/CPU | Handles concurrent requests |

The DIY route costs less per video. That's real. If you're processing thousands of videos daily and have engineering time to spare, building your own pipeline makes sense.

For most teams, the math goes the other way. Two weeks of engineering time costs more than a year of API calls. And you don't have to worry about a new Whisper release breaking your transcription step, or ffmpeg flags changing between updates.

Supported formats and limits

Before you start building, here are the specs:

  • File formats: MP4, MOV, WebM
  • Max file size: 500 MB
  • Max duration: 15 minutes
  • Processing speed: Roughly 50 seconds for a 3-minute video
  • Pricing: $0.28 per minute of video. Free tier available for testing.

For videos longer than 15 minutes, split them into segments first. ffmpeg handles this well (note that with -c copy, splits land on the nearest keyframe, so segments come out approximately 900 seconds each):

ffmpeg -i long-video.mp4 -c copy -f segment -segment_time 900 -reset_timestamps 1 segment_%03d.mp4

The -reset_timestamps 1 flag makes each segment's timestamps start from zero, so every segment's analysis results use the same local clock.

Then process each segment and concatenate the results.
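Concatenating the results means shifting each segment's timestamps by its offset. A sketch, assuming 900-second segments and the Method 1 transcript format:

```python
def ts_to_seconds(ts):
    """Convert an "HH:MM:SS" timestamp to whole seconds."""
    h, m, s = (int(p) for p in ts.split(":"))
    return h * 3600 + m * 60 + s

def seconds_to_ts(total):
    """Convert whole seconds back to an "HH:MM:SS" timestamp."""
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

def merge_transcripts(segment_transcripts, segment_length=900):
    """Shift each segment's timestamps by its offset, then concatenate."""
    merged = []
    for i, transcript in enumerate(segment_transcripts):
        offset = i * segment_length
        for entry in transcript:
            merged.append({
                "timestamp": seconds_to_ts(ts_to_seconds(entry["timestamp"]) + offset),
                "text": entry["text"],
            })
    return merged

segments = [
    [{"timestamp": "00:14:50", "text": "...end of part one."}],
    [{"timestamp": "00:00:05", "text": "Continuing with part two."}],
]
print(merge_transcripts(segments)[1]["timestamp"])  # 00:15:05
```

The same offset trick applies to the visual_scenes start/end times if you need a unified scene list.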

FAQ

Can I extract data from a live stream?

Not directly. VidContext works with video files and URLs pointing to completed recordings. For live streams, you'd need to save segments as they complete and process each one. A common pattern is recording 5-minute chunks and sending each to the API as it finishes.

What languages does the transcript support?

The underlying model handles over 50 languages. You don't need to specify the language in advance. It detects the spoken language automatically and includes it in the audio_analysis.speech.language field.

How do I get just the transcript without the rest of the data?

You can use output_format=transcript instead of output_format=context to get only the spoken word content. This is faster since it skips visual and brand analysis. But if you think you might want the other data later, it's cheaper to request everything once than to process the same video twice.


All code examples in this article work with a free VidContext account. Sign up here to get 15 credits and start extracting data from your own videos.

Try VidContext free

5 analyses without an account. 15 credits on signup. No credit card.

Get started