
How to build an AI agent that understands video

A practical tutorial for adding video understanding to AI agents using Python and a video analysis API. Includes working code examples and MCP server setup.

AI agents · tutorial · MCP · Python

AI agents can read documents, search the web, write code, send emails, and query databases. But ask one to watch a video and tell you what happens in it? Nothing. It can't do that.

This is a real problem. Videos carry information that doesn't exist anywhere else. A product demo shows how software actually works, not just how the marketing copy describes it. A competitor's ad reveals their positioning, their target audience, their production quality. A user-generated review captures sentiment that a text transcript alone would miss entirely.

If you're building agents that handle marketing, content moderation, media monitoring, research, or e-commerce, you're probably running into this wall already.

This tutorial walks through how to build an agent that can actually understand video content. We'll use Python and the VidContext API, and by the end you'll have working code you can drop into your own agent.

Why can't agents just "watch" a video?

LLMs are text-in, text-out machines. Even the multimodal ones that accept images (GPT-4V, Claude, Gemini) handle video poorly or not at all. There's a reason for this.

A video is not just a sequence of images. It's multiple information streams running simultaneously: visuals that change over time, spoken audio, background music, on-screen text, brand logos, scene transitions, camera movements. A 3-minute product video might contain a spoken script, text overlays with pricing, brand logos from partners, and visual demonstrations of features, all layered on top of each other.

Processing all of that requires extracting each stream, analyzing them independently, then synthesizing everything into a coherent description. That's a pipeline, not a single model call.

So the answer isn't to wait for LLMs to get better at video. The answer is to build a bridge: take the video, extract all the meaningful information from it, convert it to structured text, and hand that text to your agent.

The architecture

Here's how it works at a high level:

Your Agent
  │
  ├─ receives task: "analyze this product video"
  │
  ├─ calls video analysis API with video URL or file
  │
  ├─ receives structured text output:
  │    - transcript of spoken audio
  │    - visual scene descriptions with timestamps
  │    - on-screen text extracted
  │    - brands and logos detected
  │    - metadata (duration, resolution, etc.)
  │
  └─ reasons over the structured text to complete the task

The agent never touches the video itself. It sends it to an API that does the heavy extraction, then works with the text output like it would any other document. From the agent's perspective, a video becomes just another piece of context to reason about.

Tutorial: adding video understanding to a Python agent

We'll use VidContext as the video analysis API. You can sign up for 15 free credits, or run 5 analyses without creating an account.

Step 1: Get an API key

Go to vidcontext.com/app and create an account. In the developer settings, generate an API key. It'll start with vc_.

Step 2: Analyze a video

The simplest call sends a video URL and gets back structured text. Here's the Python:

import requests

API_KEY = "vc_your_key_here"
VIDEO_URL = "https://example.com/product-demo.mp4"

response = requests.post(
    "https://api.vidcontext.com/v1/analyze",
    headers={"X-API-Key": API_KEY},
    data={
        "url": VIDEO_URL,
        "output_format": "context"
    }
)

result = response.json()
print(result)

You can also upload a file directly:

with open("video.mp4", "rb") as f:
    response = requests.post(
        "https://api.vidcontext.com/v1/analyze",
        headers={"X-API-Key": API_KEY},
        files={"file": ("video.mp4", f, "video/mp4")},
        data={"output_format": "context"}
    )

Or use curl if you prefer:

curl -X POST https://api.vidcontext.com/v1/analyze \
  -H "X-API-Key: vc_your_key_here" \
  -F "url=https://example.com/product-demo.mp4" \
  -F "output_format=context"

Processing takes around 50 seconds for a typical 3-minute video.

Step 3: Parse the structured output

The API returns a JSON response with sections that your agent can work with directly:

{
  "success": true,
  "data": {
    "metadata": {
      "duration": 184,
      "resolution": "1920x1080",
      "fps": 30
    },
    "transcript": "Welcome to our product walkthrough...",
    "visual_scenes": [
      {
        "timestamp": "0:00-0:15",
        "description": "Opening title card with company logo on dark background"
      },
      {
        "timestamp": "0:15-0:45",
        "description": "Screen recording of dashboard interface, cursor navigating to settings panel"
      }
    ],
    "on_screen_text": [
      "Acme Corp",
      "Starting at $29/month",
      "Try it free for 14 days"
    ],
    "brands_detected": ["Acme Corp"],
    "audio_analysis": {
      "has_speech": true,
      "has_music": true,
      "tone": "professional, upbeat"
    }
  }
}

Each section gives your agent something specific to work with. The transcript captures what was said. The visual scenes describe what was shown, in order. The on-screen text pulls out any text overlays, CTAs, or pricing. Brands detected catches logos and company names.
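As a quick sketch of how an agent might condense these fields into a prompt-ready block, here's a minimal example over sample data shaped like the response above (the values are illustrative):

```python
# Sample data shaped like the API response shown above.
data = {
    "metadata": {"duration": 184, "resolution": "1920x1080", "fps": 30},
    "transcript": "Welcome to our product walkthrough...",
    "visual_scenes": [
        {"timestamp": "0:00-0:15",
         "description": "Opening title card with company logo"},
    ],
    "on_screen_text": ["Acme Corp", "Starting at $29/month"],
    "brands_detected": ["Acme Corp"],
}

def summarize(data):
    """Condense the structured analysis into one text block for an LLM."""
    lines = [f"Duration: {data['metadata']['duration']}s"]
    lines.append(f"Transcript: {data['transcript']}")
    for scene in data["visual_scenes"]:
        lines.append(f"[{scene['timestamp']}] {scene['description']}")
    lines.append("On-screen text: " + "; ".join(data["on_screen_text"]))
    lines.append("Brands: " + ", ".join(data["brands_detected"]))
    return "\n".join(lines)

print(summarize(data))
```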

Step 4: Feed it to your agent

Here's a minimal agent loop that can handle video-related tasks. This uses OpenAI's chat completions API, but the same pattern works with any LLM.

import requests
import json

VIDCONTEXT_KEY = "vc_your_key_here"
OPENAI_KEY = "sk-your-key-here"

def analyze_video(video_url, mode="context"):
    """Call VidContext API and return structured analysis."""
    response = requests.post(
        "https://api.vidcontext.com/v1/analyze",
        headers={"X-API-Key": VIDCONTEXT_KEY},
        data={"url": video_url, "output_format": mode}
    )
    response.raise_for_status()
    return response.json()["data"]

def ask_agent(question, video_url):
    """Send a question to the agent along with video analysis."""
    video_data = analyze_video(video_url)

    context = f"""Video Analysis:
Duration: {video_data['metadata']['duration']}s
Transcript: {video_data['transcript']}
Scenes: {json.dumps(video_data['visual_scenes'], indent=2)}
On-screen text: {', '.join(video_data['on_screen_text'])}
Brands: {', '.join(video_data['brands_detected'])}
Audio: {json.dumps(video_data['audio_analysis'])}"""

    response = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {OPENAI_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "model": "gpt-4o",
            "messages": [
                {
                    "role": "system",
                    "content": "You are an assistant that analyzes videos. "
                               "You receive structured video analysis data and "
                               "answer questions based on it."
                },
                {
                    "role": "user",
                    "content": f"{context}\n\nQuestion: {question}"
                }
            ]
        }
    )

    return response.json()["choices"][0]["message"]["content"]

answer = ask_agent(
    "What product is being advertised and what's the pricing?",
    "https://example.com/product-ad.mp4"
)
print(answer)

That's it. Your agent can now answer questions about any video you point it at.

Using the MCP server

If you're working with Claude Desktop, Cursor, Windsurf, or any tool that supports Model Context Protocol, there's a faster path. VidContext has an MCP server that gives your AI tool direct access to video analysis as a callable tool.

Install

pip install vidcontext-mcp

Configure for Claude Desktop

Add this to your Claude Desktop config file (claude_desktop_config.json):

{
  "mcpServers": {
    "vidcontext": {
      "command": "vidcontext-mcp",
      "env": {
        "VIDCONTEXT_API_KEY": "vc_your_key_here"
      }
    }
  }
}

Configure for Cursor

In Cursor's MCP settings, add a new server:

{
  "mcpServers": {
    "vidcontext": {
      "command": "vidcontext-mcp",
      "env": {
        "VIDCONTEXT_API_KEY": "vc_your_key_here"
      }
    }
  }
}

Once configured, your AI assistant can call analyze_video as a tool. Ask it "analyze this video: [URL]" and it'll call the VidContext API, get the structured output, and reason over it directly. No code required on your part.

Choosing an analysis mode

VidContext supports 7 analysis modes. Each one tunes the output for a specific use case:

Mode               | Best for                   | What you get
-------------------|----------------------------|------------------------------------------------------------------
context            | General understanding      | Balanced analysis of all video elements
ad_analysis        | Marketing and advertising  | CTA effectiveness, messaging, target audience, production quality
creator_analysis   | YouTube and social content | Creator style, engagement hooks, content structure
ecommerce_product  | Product videos             | Feature mentions, pricing, competitor positioning
training_education | Learning content           | Key concepts, teaching quality, knowledge structure
ugc_influencer     | Influencer content         | Authenticity signals, brand integration, audience fit
competitor_intel   | Competitive research       | Positioning, claims, strengths and weaknesses

Pick the mode that matches your agent's job. A content moderation agent should use context. A marketing agent analyzing competitor ads should use ad_analysis or competitor_intel. The mode changes what the API focuses on during analysis, so you get more relevant output without extra prompting.
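One lightweight way to wire this into an agent is a small task-to-mode lookup. The mapping below is a hypothetical example using the mode names from the table; adjust it to your own task taxonomy:

```python
# Hypothetical mapping from an agent's job to a VidContext analysis mode.
# Mode names come from the table above; the task names are made up.
MODE_FOR_TASK = {
    "moderation": "context",
    "marketing": "ad_analysis",
    "competitor_research": "competitor_intel",
    "catalog_audit": "ecommerce_product",
}

def pick_mode(task):
    # Fall back to the general-purpose mode for anything unmapped.
    return MODE_FOR_TASK.get(task, "context")
```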

Real-world agent patterns

Here are four patterns we've seen people build.

Content moderation agent

The agent monitors a queue of user-uploaded videos. For each one, it calls VidContext with context mode, then checks the transcript and scene descriptions against your content policy. If anything flags, it routes the video to a human reviewer with a summary of what was found and where.

def moderate_video(video_url, policy_rules):
    analysis = analyze_video(video_url, mode="context")

    prompt = f"""Review this video analysis against our content policy.

Policy rules:
{policy_rules}

Video analysis:
Transcript: {analysis['transcript']}
Scenes: {json.dumps(analysis['visual_scenes'])}

List any policy violations found, with timestamps.
If the video is clean, say "No violations found."
"""
    return ask_llm(prompt)

Marketing analysis agent

This one takes a batch of competitor ad URLs and runs them through ad_analysis mode. It extracts messaging, CTAs, target audience signals, and production quality scores. Then it compares them against your own ads.
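A minimal sketch of that batch loop. The `analyze` callable is a stand-in for something like the analyze_video helper from Step 4 wrapped with mode="ad_analysis":

```python
def compare_competitor_ads(ad_urls, analyze):
    """Run each competitor ad URL through the analysis step and collect results.

    `analyze` should call the API in ad_analysis mode, e.g.
    lambda url: analyze_video(url, mode="ad_analysis").
    """
    results = {}
    for url in ad_urls:
        try:
            results[url] = analyze(url)
        except Exception as e:  # keep the batch going on individual failures
            results[url] = {"error": str(e)}
    return results
```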

Research agent

Point it at YouTube videos relevant to a topic. It pulls out key claims, data points, and expert opinions. Useful for market research, academic survey work, or just keeping tabs on what's being said in your industry.
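A sketch of the extraction step; `analyze` and `ask_llm` are stand-ins for the analysis and LLM helpers used elsewhere in this post:

```python
def extract_claims(video_url, analyze, ask_llm):
    """Pull key claims, data points, and opinions out of a topic video.

    `analyze` returns the structured analysis dict; `ask_llm` sends a
    prompt to whatever LLM your agent uses and returns its reply.
    """
    data = analyze(video_url)
    prompt = (
        "From this video analysis, list every factual claim, statistic, "
        "and expert opinion, each with its timestamp if available.\n\n"
        f"Transcript: {data['transcript']}\n"
        f"Scenes: {data['visual_scenes']}"
    )
    return ask_llm(prompt)
```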

E-commerce quality audit agent

Run your product catalog videos through ecommerce_product mode. The agent checks whether each video mentions key features, shows the product clearly, includes pricing, and has a call to action. Outputs a quality score and specific recommendations for each video.
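A hypothetical version of those checks over the structured fields shown earlier. The signal definitions below are illustrative, not part of the API:

```python
# Illustrative audit signals over the structured analysis fields.
REQUIRED_SIGNALS = {
    "mentions_pricing": lambda d: any("$" in t for t in d.get("on_screen_text", [])),
    "has_speech": lambda d: d.get("audio_analysis", {}).get("has_speech", False),
    "has_cta": lambda d: any("try" in t.lower() or "buy" in t.lower()
                             for t in d.get("on_screen_text", [])),
}

def audit(data):
    """Return a 0-1 quality score plus the names of the checks that failed."""
    passed = {name: check(data) for name, check in REQUIRED_SIGNALS.items()}
    score = sum(passed.values()) / len(passed)
    failed = [name for name, ok in passed.items() if not ok]
    return score, failed
```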

Tips for production

Cache your results. If you're analyzing the same video multiple times (say, different agents need different information from it), store the analysis output. The same video will produce the same analysis, so there's no reason to pay for it twice. A simple key-value store keyed on video URL or file hash works fine.
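A minimal in-memory version of that cache, with the analysis call injected so the wrapper itself has no network dependency. In production you'd pass the analyze_video helper from Step 4 and likely swap the dict for Redis or a database table:

```python
import hashlib

_cache = {}

def cached_analyze(video_url, analyze, mode="context"):
    """Return a cached analysis if we've seen this URL+mode before.

    `analyze(url, mode)` does the actual API call; the cache key covers
    the mode too, since different modes produce different output.
    """
    key = hashlib.sha256(f"{video_url}|{mode}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = analyze(video_url, mode)
    return _cache[key]
```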

Handle long videos. VidContext supports videos up to 15 minutes. For longer content, you have two options: split the video into segments before sending, or extract the portion you care about. Most agent tasks don't need an entire hour-long webinar analyzed. Just send the relevant clip.

Add proper error handling. Videos fail for real-world reasons: the URL is behind authentication, the file is corrupted, the format isn't supported, the video is too long. Handle these gracefully.

def safe_analyze(video_url, mode="context"):
    try:
        response = requests.post(
            "https://api.vidcontext.com/v1/analyze",
            headers={"X-API-Key": VIDCONTEXT_KEY},
            data={"url": video_url, "output_format": mode},
            timeout=120
        )
        response.raise_for_status()
        result = response.json()

        if not result.get("success"):
            return None, result.get("error", "Unknown error")

        return result["data"], None

    except requests.Timeout:
        return None, "Video analysis timed out. The video may be too long."
    except requests.HTTPError as e:
        return None, f"API returned {e.response.status_code}"
    except Exception as e:
        return None, str(e)

Set a timeout. A 3-minute video takes about 50 seconds to process. Set your HTTP timeout to at least 120 seconds to give longer videos room. If you're processing batches, consider running analyses in parallel with a thread pool, but keep concurrency reasonable. Three or four at a time is plenty.
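A sketch of that batch pattern with Python's standard ThreadPoolExecutor, capped at three workers as suggested. `analyze` would typically be the safe_analyze wrapper above; it's a parameter here so the batching logic stands alone:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_batch(video_urls, analyze, max_workers=3):
    """Run analyses in parallel with modest concurrency.

    Returns a dict mapping each URL to its analysis result, in the
    same shape as whatever `analyze` returns.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(analyze, video_urls))
    return dict(zip(video_urls, results))
```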

FAQ

How accurate is the video analysis?

VidContext uses Gemini 2.5 Pro at high resolution with 4 frames per second sampling. It catches spoken words, on-screen text, brand logos, scene changes, and visual details reliably. It's not perfect for every frame of fast-moving footage, but for the vast majority of business video content (ads, product demos, explainers, UGC), it captures what matters.

Can I use this with agents other than Python-based ones?

Yes. The API is a standard HTTP endpoint. Any language that can make HTTP POST requests can call it. The MCP server works with any MCP-compatible client, which includes Claude Desktop, Cursor, Windsurf, and the Claude Code CLI. If your agent framework supports tool calling, you can register the VidContext API as a tool.

What video formats and sources are supported?

The API accepts MP4, MOV, AVI, WebM, and MKV files. You can upload a file directly or pass a URL. YouTube URLs, direct file links, and most publicly accessible video URLs work. The video needs to be under 15 minutes and the URL needs to be reachable from the server (so no localhost links or files behind login walls).
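If you want to fail fast before spending a credit, a cheap client-side pre-check against the listed formats might look like this. Extensionless URLs (such as YouTube links) are passed through, since the API resolves those itself:

```python
from pathlib import Path

# Formats listed as supported in this post's FAQ.
SUPPORTED_EXTENSIONS = {".mp4", ".mov", ".avi", ".webm", ".mkv"}

def looks_supported(path_or_url):
    """Cheap pre-check; the API remains the source of truth."""
    suffix = Path(path_or_url.split("?")[0]).suffix.lower()
    if not suffix:
        return True  # No extension (e.g. a YouTube URL): let the API decide.
    return suffix in SUPPORTED_EXTENSIONS
```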

Wrapping up

Giving an AI agent the ability to understand video is mostly a plumbing problem. The hard part (actually extracting meaning from video) is handled by the analysis API. Your job is just connecting the pipes: agent receives task, calls the API, gets structured text back, reasons over it.

The code in this tutorial is ready to drop into your project. The analyze_video function, the error handling wrapper, the agent loop: copy them in, move the hardcoded API keys into environment variables, and start using them.

If you want to try it without writing any code first, install the MCP server and use it from Claude Desktop. Ask it to analyze a video and you'll see exactly what the output looks like before you commit to building anything.

Get an API key and start building. Five free analyses, no credit card required.
