Video Transcription API

Transcription captures what was said. VidContext captures what was said, what was shown, what appeared on screen, what brands were visible, and what the audio environment sounded like. One API call gives you the full picture — not just the script.

Why transcription alone is not enough

A product demo where the presenter says "click here" is meaningless without knowing what "here" refers to. A cooking video where someone says "add this" loses context without the visual. An ad where the brand name only appears as a logo on screen — and is never spoken — is invisible to transcription.

VidContext bridges this gap. The transcript is paired with timestamped visual scene descriptions, so you know exactly what was on screen when each word was spoken. On-screen text is extracted separately, catching everything from lower-third titles to URL watermarks. The result is a text output that preserves the full meaning of the video, not just the audio track.
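As an illustration of how the pairing works on the client side, here is a minimal Python sketch that joins transcript segments to visual scenes by overlapping time ranges. The "time", "text", and "description" field names mirror the example response later on this page; the joining logic itself is illustrative and runs entirely locally, not part of the API.

```python
# Sketch: joining transcript segments to visual scenes by overlapping
# time ranges. Field names mirror the example API response; the logic
# is illustrative and runs entirely client-side.

def to_seconds(ts: str) -> int:
    """Convert an 'M:SS' timestamp into seconds."""
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + int(seconds)

def parse_range(rng: str) -> tuple[int, int]:
    """Parse an 'M:SS-M:SS' range into (start, end) seconds."""
    start, end = rng.split("-")
    return to_seconds(start), to_seconds(end)

def scenes_for_segment(segment: dict, scenes: list[dict]) -> list[str]:
    """Return descriptions of every scene whose range overlaps the segment's."""
    seg_start, seg_end = parse_range(segment["time"])
    matched = []
    for scene in scenes:
        sc_start, sc_end = parse_range(scene["time"])
        if sc_start < seg_end and sc_end > seg_start:  # ranges overlap
            matched.append(scene["description"])
    return matched

segment = {"time": "0:00-0:18", "text": "Click on analytics..."}
scenes = [
    {"time": "0:00-0:10", "description": "App login page"},
    {"time": "0:10-0:30", "description": "Dashboard, cursor clicks Analytics tab"},
]
# Both scenes overlap 0:00-0:18, so both descriptions are printed.
print(scenes_for_segment(segment, scenes))
```

With this kind of join, "click here" in the transcript resolves to whatever the scene description says was on screen at that moment.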

Everything you get beyond the transcript

Timestamped transcript

Segmented speech-to-text with timing markers for every segment, at the same quality as dedicated transcription services.

Visual scene context

For every transcript segment, a parallel description of what was visually happening — connecting words to their on-screen meaning.

On-screen text (OCR)

Every piece of text visible in the video, extracted separately: titles, captions, URLs, phone numbers, and code on screen.

Brand and logo detection

Brands that appear visually but are never mentioned verbally — captured and listed with timestamps.

Audio beyond speech

Music genre, sound effects, ambient noise, and tonal shifts classified alongside the transcript.

Privacy by default

Your video is deleted immediately after processing. No storage, no retention, no copies.

Transcript with full video context

The response pairs every transcript segment with what was happening visually, so you never lose context.

Request:

curl -X POST https://api.vidcontext.com/v1/analyze \
  -H "X-API-Key: vc_your_key" \
  -F "source=https://example.com/webinar.mp4" \
  -F "output_format=context"

Response:

{
  "transcript": [
    {
      "time": "0:00-0:18",
      "text": "Let me show you how the dashboard works. Click on analytics..."
    },
    {
      "time": "0:18-0:35",
      "text": "You can see the traffic graph updating in real time..."
    }
  ],
  "visual_scenes": [
    {
      "time": "0:00-0:10",
      "description": "Presenter visible in webcam overlay, screen shows app login page"
    },
    {
      "time": "0:10-0:30",
      "description": "Screen recording: dashboard with sidebar, cursor clicks Analytics tab, line chart appears"
    }
  ],
  "on_screen_text": ["Analytics Dashboard", "Last 30 Days", "2,847 visits"],
  "brands_detected": ["Google Analytics"],
  "audio": { "music": "light ambient", "effects": ["mouse click"] }
}
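For Python clients, the same call can be sketched with the standard library alone. The endpoint, the X-API-Key header, and the two form fields come from the curl example above; note that curl's -F sends multipart form data, while this sketch URL-encodes the fields, which is an assumption about what the API accepts.

```python
# Sketch of the curl call above using only the Python standard library.
# The request is built but not sent, so there is no error handling here.
# Assumption: the API accepts URL-encoded form fields (curl -F would send
# multipart/form-data instead).
import urllib.parse
import urllib.request

def build_analyze_request(api_key: str, source_url: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for /v1/analyze."""
    data = urllib.parse.urlencode({
        "source": source_url,
        "output_format": "context",
    }).encode()
    return urllib.request.Request(
        "https://api.vidcontext.com/v1/analyze",
        data=data,
        headers={"X-API-Key": api_key},
        method="POST",
    )

req = build_analyze_request("vc_your_key", "https://example.com/webinar.mp4")
# To send for real: body = urllib.request.urlopen(req).read()
```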

When context-aware transcription matters

  • Tutorial and training documentation — Convert video tutorials into step-by-step guides where each instruction is paired with what was on screen, not just what was said.
  • Meeting and webinar archives — Capture not just the discussion, but every slide, chart, and screen share that was presented during the meeting.
  • Video SEO and indexing — Make video content searchable by every element: spoken words, displayed text, visual content, and brand references.
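For the indexing use case above, one simple approach is to flatten every field of the response into a single searchable string, so a query can match content that was never spoken. The searchable_text helper below is hypothetical (not an API feature), and the sample response is trimmed from the example earlier on this page.

```python
# Sketch: flattening a VidContext-style response into one searchable string,
# so queries match spoken words, scene descriptions, on-screen text, and
# brands alike. searchable_text is a hypothetical helper, not an API feature.

def searchable_text(response: dict) -> str:
    parts = [seg["text"] for seg in response.get("transcript", [])]
    parts += [scene["description"] for scene in response.get("visual_scenes", [])]
    parts += response.get("on_screen_text", [])
    parts += response.get("brands_detected", [])
    return " ".join(parts)

response = {
    "transcript": [{"time": "0:00-0:18", "text": "Let me show you the dashboard."}],
    "visual_scenes": [{"time": "0:00-0:10", "description": "App login page"}],
    "on_screen_text": ["Analytics Dashboard", "2,847 visits"],
    "brands_detected": ["Google Analytics"],
}

index = searchable_text(response)
print("Google Analytics" in index)  # True: the brand is findable though never spoken
```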

Frequently asked questions

How is this different from Whisper or other transcription APIs?

Whisper and similar services convert speech to text — that is all. VidContext does that, and also describes every visual scene, extracts all on-screen text, identifies brands and logos, classifies audio beyond speech, and provides structured metadata. You get a complete understanding of the video, not just what was said.

Do you support multiple languages?

VidContext processes the audio and visual content of the video using Gemini 3.1 Pro, which supports over 100 languages for speech recognition. The transcript is returned in the language spoken. Visual descriptions and on-screen text extraction work regardless of language.

What about videos with music but no speech?

VidContext handles these well. When there is no speech, the transcript section will be empty or minimal, but visual scene descriptions, on-screen text, brand detection, and audio classification (music genre, sound effects, ambient sounds) are all populated normally. The video is fully analyzed with or without dialogue.

Do you detect different speakers?

VidContext provides speaker context within transcript segments, noting when the speaker changes and describing who is speaking when visually identifiable (e.g., "male presenter" or "female narrator, off-camera"). It does not perform speaker diarization by name unless the speaker is identified on screen.

How accurate is the transcription?

VidContext uses Gemini 3.1 Pro for transcription, which matches or exceeds Whisper's accuracy on most content types. Accuracy is highest for clear speech in common languages. Background noise, heavy accents, and overlapping speakers can reduce accuracy, similar to any transcription service.

See how we compare
Full API documentation

Ready to start?

5 free analyses without an account. 20 credits on signup. No credit card required.

Try free