I was running meeting audio through two APIs. Gemini collapsed it into one.
For six months I ran every engineering meeting through the same two-step pipeline: upload audio to Whisper for transcription, pipe the text into GPT-4 for summarization. Two API keys, two billing accounts, a temp file sitting on disk between the steps, and retry logic wrapped around each stage. It worked. But when I started processing screen recordings and user testing sessions, the whole thing broke down — video became a third separate problem, and suddenly I was stitching together four different services. I switched to Gemini and the pipeline collapsed to a single API call.
What the old pipeline looked like
The two-step flow is the standard pattern anyone builds before they know better. Audio file comes in, hit Whisper for a transcript, stuff the transcript into GPT-4 with a summarization prompt, parse the output. On paper it is simple. In practice:
- Whisper has its own retry behavior and occasionally returns partial transcripts on long recordings
- The intermediate transcript file has to live somewhere — disk, S3, an in-memory buffer if you are being clever
- You are paying twice: Whisper bills per minute of audio, and GPT-4 then bills per token to read the transcript
- Debugging failures means figuring out which stage failed and whether the transcript was the cause
- Video was not handled at all — you had to extract audio first using ffmpeg as a third subprocess
When I added user testing session recordings to the pipeline, I had to bolt on a fourth service. The architecture diagram went from two boxes to five. That was the moment I looked for something better.
What Gemini does differently
Claude does not natively accept audio or video files. You transcribe first, then pass text. GPT-4 handles images but audio and video require the same pre-processing detour.
Gemini processes audio and video natively. You upload the file, pass the URI to the model, and it reasons directly over the media — no transcription step, no intermediate artifact. For meeting recordings this means one API call, one billing account, and structured output straight from the audio.
The File API is the key mechanism. You upload once, get back a URI, and reference that URI in as many prompts as you want for 48 hours. If you need to run three different analyses on the same recording — action items, decisions, risk flags — you upload one time and make three requests against the same URI. You are not re-uploading or paying upload costs on each analysis.
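To make the upload-once, prompt-many pattern concrete, here is a minimal sketch. The helper name run_analyses, the generate callable, and the three prompt strings are my own placeholders, not part of the SDK. In real use, generate would be a thin wrapper around model.generate_content, and file_ref would be the object returned by the upload.

```python
from typing import Any, Callable, Dict


def run_analyses(
    file_ref: Any,
    prompts: Dict[str, str],
    generate: Callable[[Any, str], str],
) -> Dict[str, str]:
    """Run every prompt against the same uploaded file reference.

    file_ref is passed through unchanged, so the media is uploaded once
    no matter how many prompts are run against it.
    """
    return {label: generate(file_ref, prompt) for label, prompt in prompts.items()}


# Hypothetical prompts for the three analyses mentioned above.
PROMPTS = {
    "action_items": "List the action items from this recording as JSON.",
    "decisions": "List the decisions made in this recording as JSON.",
    "risk_flags": "List any risks raised in this recording as JSON.",
}

# With the real SDK, `generate` could be something like:
#   lambda f, p: model.generate_content([f, p]).text
```

Passing the model call in as a function keeps the reuse logic separate from the SDK, which also makes it easy to exercise with a stub.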
Use case 1: meeting audio to structured summary
The pattern that replaced my Whisper + GPT-4 pipeline. One upload, one prompt, structured JSON out.
import google.generativeai as genai
import json
import time

genai.configure(api_key="YOUR_API_KEY")


def upload_audio(file_path: str):
    print(f"Uploading {file_path}...")
    audio_file = genai.upload_file(path=file_path, mime_type="audio/mp3")

    # Wait for processing to complete
    while audio_file.state.name == "PROCESSING":
        time.sleep(2)
        audio_file = genai.get_file(audio_file.name)

    if audio_file.state.name != "ACTIVE":
        raise RuntimeError(f"Upload failed: {audio_file.state.name}")

    return audio_file


def summarize_meeting(file_path: str) -> dict:
    audio_file = upload_audio(file_path)
    model = genai.GenerativeModel("gemini-3.5-flash")

    prompt = """Analyze this meeting recording and return a JSON object with:
- action_items: list of objects with {owner, task, deadline_mentioned}
- decisions: list of strings, each a concrete decision made
- open_questions: list of strings, each an unresolved question
- attendee_sentiment: overall tone (positive/neutral/tense)
Return only the JSON object, no markdown fencing."""

    response = model.generate_content([audio_file, prompt])

    # File persists for 48 hours; reuse the URI for other analyses
    print(f"File URI for reuse: {audio_file.uri}")

    return json.loads(response.text)


result = summarize_meeting("standup-2026-10-22.mp3")
print(json.dumps(result, indent=2))
The output is structured from the start. No parsing a transcript. No second prompt to extract action items from prose. The model reasons over the audio directly and returns the shape you asked for.
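One caveat on parsing: even when the prompt says "no markdown fencing", a model can occasionally wrap its JSON in a fenced block, and json.loads then fails on the first backtick. The sketch below is a defensive parser under that assumption. The name parse_json_reply is mine, and it only handles the simple case of one fence around the whole reply.

```python
import json

# Built at runtime so this snippet contains no literal fence marker.
FENCE = "`" * 3


def parse_json_reply(text: str) -> dict:
    """Parse a model reply as JSON, tolerating one surrounding code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE) and cleaned.endswith(FENCE):
        # Drop the opening and closing fence markers.
        body = cleaned[len(FENCE):-len(FENCE)]
        # The rest of the opening line is an optional language tag (e.g. "json").
        _, _, cleaned = body.partition("\n")
    # json.loads still raises json.JSONDecodeError on anything malformed.
    return json.loads(cleaned)
```

In summarize_meeting, the last line would then read return parse_json_reply(response.text).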
Use case 2: screen recording to bug report
When a QA engineer finds a bug, the usual workflow is: watch the recording, write down steps to reproduce, paste into Jira. That middle step is tedious and lossy — people skip details under time pressure. Gemini can watch the recording and generate the bug report.
import google.generativeai as genai
import json
import time

genai.configure(api_key="YOUR_API_KEY")


def generate_bug_report(recording_path: str, reported_issue: str) -> dict:
    print("Uploading screen recording...")
    video_file = genai.upload_file(
        path=recording_path,
        mime_type="video/mp4"
    )

    # Wait for processing to complete
    while video_file.state.name == "PROCESSING":
        time.sleep(3)
        video_file = genai.get_file(video_file.name)

    # Stop early if processing ended in any state other than ACTIVE
    if video_file.state.name != "ACTIVE":
        raise RuntimeError(f"Upload failed: {video_file.state.name}")

    model = genai.GenerativeModel("gemini-3.5-flash")

    prompt = f"""Watch this screen recording. The reporter said: "{reported_issue}"
Return a JSON bug report with:
- title: concise bug title (under 80 chars)
- severity: critical | high | medium | low
- steps_to_reproduce: ordered list of strings, each a concrete UI action visible in the recording
- expected_behavior: what should have happened
- actual_behavior: what happened instead
- affected_component: which part of the UI or system this touches
- notes: anything else visible in the recording that might help the engineer
Return only the JSON object."""

    response = model.generate_content([video_file, prompt])
    return json.loads(response.text)


report = generate_bug_report(
    recording_path="bug-checkout-flow.mp4",
    reported_issue="Payment form disappears after entering card number"
)
print(json.dumps(report, indent=2))
The steps to reproduce come directly from what the model watched. If the engineer clicked a dropdown, scrolled, then typed in a field, the model captures that sequence. The QA engineer reviews, edits if needed, and files the ticket. The annotation work drops from ten minutes to under two.
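Because the report shape is requested in a prompt rather than enforced, it can help to check the parsed result before it reaches the reviewer. The following is a rough sketch of such a check. The function name validate_bug_report and the specific rules are my own, derived from the fields and limits named in the prompt above.

```python
from typing import List

ALLOWED_SEVERITIES = {"critical", "high", "medium", "low"}
REQUIRED_FIELDS = (
    "title",
    "severity",
    "steps_to_reproduce",
    "expected_behavior",
    "actual_behavior",
    "affected_component",
    "notes",
)


def validate_bug_report(report: dict) -> List[str]:
    """Return a list of problems found; an empty list means the report looks usable."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in report]

    severity = report.get("severity")
    if severity is not None and severity not in ALLOWED_SEVERITIES:
        problems.append(f"unexpected severity: {severity!r}")

    title = report.get("title")
    if isinstance(title, str) and len(title) >= 80:
        # The prompt asks for a title under 80 characters.
        problems.append("title is 80 characters or longer")

    steps = report.get("steps_to_reproduce")
    if steps is not None and (not isinstance(steps, list) or not steps):
        problems.append("steps_to_reproduce should be a non-empty list")

    return problems
```

Returning a list of problems instead of raising lets the QA engineer see everything that needs fixing in one pass.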
Use case 3: user testing session to UX friction analysis
User research recordings sit unwatched for weeks because someone has to find the time to watch an hour of footage and write a report. The File API is particularly useful here because a UX researcher might want to run several different analyses — accessibility issues, navigation confusion, emotional signals — without re-uploading the same 800MB file.
import google.generativeai as genai
import json
import time

genai.configure(api_key="YOUR_API_KEY")


def analyze_user_session(video_path: str) -> dict:
    print("Uploading user session recording...")
    video_file = genai.upload_file(path=video_path, mime_type="video/mp4")

    # Wait for processing to complete
    while video_file.state.name == "PROCESSING":
        time.sleep(3)
        video_file = genai.get_file(video_file.name)

    # Stop early if processing ended in any state other than ACTIVE
    if video_file.state.name != "ACTIVE":
        raise RuntimeError(f"Upload failed: {video_file.state.name}")

    model = genai.GenerativeModel("gemini-3.5-flash")

    # First analysis: friction points
    friction_prompt = """Watch this user testing session and identify friction points.
Return JSON with:
- friction_points: list of objects, each with:
  - timestamp_approx: rough time in the video (e.g. "2:34")
  - description: what the user struggled with
  - severity: high | medium | low
  - signal: what behavior indicated friction (hesitation, backtracking, error click, etc.)
- overall_task_completion: completed | partial | abandoned
- time_on_task_approx: estimated total time from start to completion or abandonment"""

    friction_response = model.generate_content([video_file, friction_prompt])
    friction_data = json.loads(friction_response.text)

    # Second analysis using the same uploaded file (no re-upload, file persists 48h)
    emotion_prompt = """Watch this user testing session and describe the user's emotional state.
Return JSON with:
- emotional_arc: list of objects with {phase, emotion, indicator} covering key moments
- moments_of_delight: list of strings describing positive reactions
- moments_of_frustration: list of strings describing negative reactions
- overall_sentiment: positive | neutral | negative | mixed"""

    emotion_response = model.generate_content([video_file, emotion_prompt])
    emotion_data = json.loads(emotion_response.text)

    return {
        "friction_analysis": friction_data,
        "emotional_analysis": emotion_data,
        "file_name": video_file.name,  # pass to genai.get_file() for later analyses
        "file_uri": video_file.uri  # reuse for more analyses within 48h
    }


analysis = analyze_user_session("user-test-session-04.mp4")
print(json.dumps(analysis, indent=2))
Two analyses, one upload. The uploaded file stays available for 48 hours, so if the researcher wants a third pass, such as looking specifically at navigation patterns or where the user read versus skimmed, they call genai.get_file(name) with the file's name (video_file.name, which is the resource name and not the URI) and run another prompt without touching the original file.
The File API details worth knowing
Files uploaded via genai.upload_file() persist for 48 hours. After that they are deleted automatically. You cannot extend the TTL, so if you want to run analyses on the same file across multiple days, you need to re-upload. For most pipelines that process and discard, 48 hours is more than enough.
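If you keep file names around for later passes, it is worth knowing whether a stored upload is still inside the 48-hour window before you call the model. Here is a small sketch that works from an upload time you record yourself. The function name is_still_available and the 30-minute safety margin are my own choices, not anything the API prescribes.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

FILE_TTL = timedelta(hours=48)  # retention period described above


def is_still_available(
    uploaded_at: datetime,
    now: Optional[datetime] = None,
    margin: timedelta = timedelta(minutes=30),
) -> bool:
    """Return True if a file uploaded at uploaded_at should still exist.

    margin is an arbitrary buffer so a long request does not start
    just before the file is deleted.
    """
    current = now or datetime.now(timezone.utc)
    return current < uploaded_at + FILE_TTL - margin
```

When this returns False, the simplest recovery is to upload again and store the new name and timestamp.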
The file goes through a PROCESSING state before it becomes ACTIVE. For a 60-minute meeting recording in mp3 format, processing takes 10 to 30 seconds in my experience. The polling loop in the examples above is the correct pattern — check every few seconds until the state is ACTIVE.
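One thing the simple loop lacks is an upper bound: if a file never leaves PROCESSING, it spins forever. Below is a sketch of the same polling pattern with a timeout. The helper wait_until_active and its parameters are hypothetical. It takes a function that returns the current state name, so with the SDK you would pass something like lambda: genai.get_file(f.name).state.name.

```python
import time
from typing import Callable


def wait_until_active(
    fetch_state: Callable[[], str],
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll fetch_state() until it returns "ACTIVE".

    Raises RuntimeError if processing ends in any other state, and
    TimeoutError if the file is still PROCESSING after timeout_s seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch_state()
        if state == "ACTIVE":
            return
        if state != "PROCESSING":
            raise RuntimeError(f"File processing ended in state {state}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"File still PROCESSING after {timeout_s} seconds")
        sleep(interval_s)
```

The default of five minutes is a guess on my part; pick a limit that fits the largest files you expect to upload.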
Supported formats include mp3, mp4, mov, avi, wav, flac, and several others. The mime_type you pass on upload should match the actual file format. Getting this wrong causes silent failures where the model gets confused about what it is looking at.
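The examples in this post hardcode the MIME type, which is exactly how a mismatch creeps in when a .wav or .mov file shows up. A minimal sketch of picking it from the file extension follows. The helper mime_for and the mapping are my own; the table covers only the formats named above, and the exact MIME strings (especially for mov and avi) are my assumption and should be checked against the current API documentation.

```python
from pathlib import Path

# Assumed extension-to-MIME mapping for the formats named above.
MIME_BY_EXTENSION = {
    ".mp3": "audio/mp3",
    ".wav": "audio/wav",
    ".flac": "audio/flac",
    ".mp4": "video/mp4",
    ".mov": "video/mov",
    ".avi": "video/avi",
}


def mime_for(file_path: str) -> str:
    """Return the MIME type for a media file, or raise ValueError if unknown."""
    suffix = Path(file_path).suffix.lower()
    try:
        return MIME_BY_EXTENSION[suffix]
    except KeyError:
        label = suffix or "a file without an extension"
        raise ValueError(f"No MIME type configured for {label}") from None
```

Failing loudly on an unknown extension is deliberate: an error at upload time is easier to debug than a confused model response later.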
One thing that surprised me: the model timestamps its observations. When I asked for friction points with approximate timestamps, it gave me answers like "around 3:15 the user clicked the back button unexpectedly." It is not frame-accurate, but it is close enough to navigate to the right moment in the recording.
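To jump a video player to those moments, the approximate timestamps need to become a number of seconds. Here is a small sketch for that conversion. The function name timestamp_to_seconds is mine, and it assumes the model writes times as m:ss or h:mm:ss somewhere in the string.

```python
import re
from typing import Optional

# Matches "m:ss" or "h:mm:ss" anywhere in the text.
_TIMESTAMP = re.compile(r"\b(?:(\d+):)?(\d{1,2}):(\d{2})\b")


def timestamp_to_seconds(text: str) -> Optional[int]:
    """Return the first timestamp found in text as seconds, or None if absent."""
    match = _TIMESTAMP.search(text)
    if match is None:
        return None
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
```

Since the model's timestamps are approximate, seeking a few seconds before the returned value is a reasonable default.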
What this eliminated
The two-step Whisper + GPT-4 pipeline is gone. So is the ffmpeg subprocess for extracting audio from video. So are two API keys, two billing accounts, and the retry logic that wrapped each stage independently. The intermediate temp file that sat on disk between the transcription and summarization steps no longer exists.
The new pipeline for meeting audio is: upload file, call model with prompt, parse JSON response. Three lines of logic instead of forty. When something fails, there is one place to look.
This is not a post about Gemini being better than Claude or GPT-4 at text tasks. For pure text reasoning I still reach for different models depending on what I need. But for anything that involves audio or video as the primary input, the native multimodal path eliminates infrastructure that did not need to exist in the first place. If you are running media through a multi-step pipeline today, the question is not whether Gemini handles it better — it is whether your pipeline needs to exist at all.