Scaling and Cost Optimization for AI Video Pipelines

Key takeaways

Teams budget the wrong line item. The retrieval bill is visible and predictable; the ingestion bill is invisible until the full corpus runs through it — and lands 5-10× higher than the approved budget.
One hour of video produces 100-200 retrieval units. Transcription costs 10-30 cents per hour; embedding costs 1-8 cents per hour depending on model. At 10,000 hours of corpus, ingestion is the line item that decides whether the project survives.
Chunk size is a cost decision, not just a quality one. For the same hour of video, 300-token chunks generate 50% more embedding calls than 450-token chunks, even though both sizes sit inside the 200-500 token retrieval-quality sweet spot.
Hot/warm/cold tiering on the vector index can run a 10,000-hour corpus on a cluster sized for the hot tier plus a small warm overflow — often a third to half the cost of provisioning the full corpus at hot-tier latency.
Reprocessing is inevitable. Build versioned index aliases with blue-green swap from day one so embedding model changes, chunker changes, and schema changes happen without downtime.

The morning the bill came in

A team builds a video search feature for a client. The pilot runs on twenty hours of footage and the query path costs fractions of a cent per request. Someone multiplies that number by the expected query volume, adds a comfortable margin, and ships a number to finance. Everyone signs off. The launch goes well.

The corpus turns out not to be twenty hours. It is eight thousand hours of archived material the client wants searchable from day one, plus two hundred new hours rolling in every month. The retrieval bill lands roughly where the team predicted. The embedding bill, which nobody modelled carefully, arrives at five to ten times the entire approved budget. The finance team's email subject line is a single word: explain.

This is the most common cost mistake we see in video AI work, and it happens because the team optimised the wrong axis. Retrieval is a per-query cost, and per-query costs are visible — every search shows up in logs. Ingestion is a per-asset cost, and per-asset costs are invisible until someone runs the full corpus through them. It is the difference between the price on the menu and the kitchen renovation no one mentioned at the table. The architecture in From Transcript to Intelligence tells you how the pipeline works on one video. This piece is what changes when one becomes ten thousand.

The bill, written out

The math is simple once you put it on paper.

A one-hour video produces roughly 8,000 to 12,000 words of transcript, which chunks into 100 to 200 retrieval units of 200 to 500 tokens each. Embedding those chunks costs roughly one to three cents per hour of video on Titan, and three to eight cents per hour on Cohere Embed v3. Transcription is separate and depends on provider — call it ten to thirty cents per hour for production-grade accuracy.

Multiply by ten thousand hours. Transcription lands at $1,000 to $3,000. Embedding at $100 to $800. Storage stays small. Reasoning on a query is a fraction of a cent. Vector search on a query is a smaller fraction still.
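The multiplication above can be written out as a small sketch. The per-hour ranges are the ones quoted in this article, not live provider pricing, and the names `HourlyRate` and `ingestion_cost` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HourlyRate:
    """Cost range in dollars per hour of video."""
    low: float
    high: float


# Ranges quoted in this article; real provider pricing varies.
TRANSCRIPTION = HourlyRate(0.10, 0.30)
EMBEDDING = HourlyRate(0.01, 0.08)  # spans Titan (low end) to Cohere Embed v3 (high end)


def ingestion_cost(hours: float, rate: HourlyRate) -> tuple[float, float]:
    """Return the (low, high) one-time cost in dollars for a corpus of `hours`."""
    return (round(hours * rate.low, 2), round(hours * rate.high, 2))


if __name__ == "__main__":
    corpus_hours = 10_000
    print("transcription:", ingestion_cost(corpus_hours, TRANSCRIPTION))
    print("embedding:", ingestion_cost(corpus_hours, EMBEDDING))
```

Running the same function over the pilot's twenty hours shows why nobody noticed: the transcription line comes to between two and six dollars.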

The shape of the bill is now legible. Ingestion is a one-time spike in the thousands of dollars per ten thousand hours of corpus. Retrieval is a recurring cost in the cents per thousand queries. A team that budgeted for retrieval and got blindsided by ingestion had the right unit prices and multiplied them by the wrong quantity. The two unit costs even rhyme, since both are quoted in cents, but one is per query and one is per asset, and the asset side is the one that compounds invisibly until it lands all at once.

The optimisation work falls out of this naturally. Money lives, in order, in transcription provider choice, embedding model choice, chunk count, reasoning model on the query path, and — distantly — everything else.

A pipeline that survives ten thousand hours

The ingestion pipeline we deploy is event-driven end to end. The shape:

S3 upload → Transcription job → Chunking → Embedding → Vector write
     ↓             ↓                 ↓           ↓           ↓
   SQS           SQS               SQS         SQS         DLQ

Each stage has its own SQS queue. Each stage has its own dead-letter queue. The whole thing scales because Lambda handles orchestration and each stage scales independently of the next. Like a factory floor where each station has its own conveyor, its own reject bin, and its own foreman.

Three things go wrong at this volume that never went wrong on twenty hours of pilot data.

Individual videos fail, and the failure has to block nothing else. A 47-minute file is corrupted at minute 31. The transcription provider returns an error. Without per-stage queues, that single failure cascades: the worker holds the slot, the queue backs up, throughput collapses, and the whole conveyor stalls because one bad bottle jammed one station. With per-stage queues and DLQs, the failed item lands in a queue an operator can inspect, and the other nine hundred ninety-nine videos in the batch keep moving.
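A minimal sketch of the per-stage wiring, assuming standard SQS queues whose attributes are passed as a string-to-string map (as boto3's create_queue expects). The ARN, retry count, and visibility timeout are hypothetical placeholders, not values from this pipeline.

```python
import json


def stage_queue_attributes(
    dlq_arn: str,
    max_receive_count: int = 3,
    visibility_timeout_s: int = 900,
) -> dict[str, str]:
    """Build SQS queue attributes that route repeatedly failing messages to a DLQ.

    After `max_receive_count` failed receives, SQS moves the message to the
    queue identified by `dlq_arn` instead of redelivering it again.
    """
    return {
        # Should exceed the stage's worst-case processing time.
        "VisibilityTimeout": str(visibility_timeout_s),
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
        ),
    }


# Hypothetical ARN for the transcription stage's dead-letter queue.
attrs = stage_queue_attributes("arn:aws:sqs:us-east-1:123456789012:transcription-dlq")
```

The resulting map would be passed as the Attributes argument when creating each stage's work queue, one work queue and one DLQ per stage.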

Throttling becomes structural rather than exceptional. Bedrock has rate limits. Transcription providers have rate limits. At ten thousand hours of ingestion, you will hit them. The pipeline must back off with jitter (the pattern from Production AI Pipelines on AWS), and the embedding stage in particular must batch. Sending one chunk per API call burns through the requests-per-minute quota while most of the token throughput sits unused, the way driving a lorry with one parcel in the back leaves most of the engine's work unspent.
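One common form of backoff with jitter is the "full jitter" variant sketched below. It may differ in detail from the pattern in the referenced article; `ThrottledError` is a hypothetical stand-in for whatever rate-limit exception the provider SDK raises.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical marker for a provider rate-limit response."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform between 0 and min(cap, base * 2**attempt) seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_backoff(fn, max_attempts: int = 6, sleep=time.sleep):
    """Call fn(), retrying on ThrottledError with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                # Out of retries: re-raise so the message can fall through to the DLQ.
                raise
            sleep(backoff_delay(attempt))
```

The randomised delay matters more than the exponent: without it, every throttled worker retries at the same instant and hits the limit again together.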

Reruns happen. Something will change — the chunker, the embedding model, the metadata schema — and you will need to reprocess. The ingestion pipeline must be idempotent. Each video carries a content hash, and each chunk carries a deterministic ID derived from videoId + chunker version + chunk index. Re-running the pipeline over an already-processed video produces the same IDs, so writes upsert rather than duplicate.
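A minimal sketch of the two identifiers described above. SHA-256 and the `|` separator are assumptions made for illustration; any stable hash over the same three fields gives the same idempotency property.

```python
import hashlib


def content_hash(video_bytes: bytes) -> str:
    """Hash of the raw file, so a re-upload under a new filename is recognised."""
    return hashlib.sha256(video_bytes).hexdigest()


def chunk_id(video_id: str, chunker_version: str, chunk_index: int) -> str:
    """Deterministic chunk ID derived from videoId + chunker version + chunk index.

    The same inputs always yield the same ID, so re-running ingestion
    overwrites (upserts) existing vectors instead of duplicating them.
    """
    key = f"{video_id}|{chunker_version}|{chunk_index}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]
```

Because the chunker version is part of the key, a chunker change produces a fresh set of IDs rather than silently overwriting chunks whose boundaries have moved.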

Where embedding cost compounds

Embedding is where most of the ingestion bill lives, and where small choices compound the fastest. Four levers matter.

Chunk size first. A chunker that produces 300-token chunks generates roughly 50% more chunks than one that produces 450-token chunks. That is a 50% rise in embedding calls, embedding cost, vectors stored, and index size — for exactly the same hour of video. Chunking is not just a retrieval-quality decision. It is a cost decision wearing a quality-decision mask. The sweet spot for video transcripts sits between 200 and 500 tokens. Going smaller hurts the bill more than it helps precision. Going larger erodes retrieval quality because chunks start covering multiple topics, and the embedding starts pointing nowhere in particular.
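The 50% figure is plain division, sketched here for non-overlapping chunks. The token total is a hypothetical round number chosen for illustration, not a measurement of any real transcript.

```python
import math


def chunk_count(transcript_tokens: int, chunk_tokens: int) -> int:
    """Number of non-overlapping chunks needed to cover a transcript."""
    return math.ceil(transcript_tokens / chunk_tokens)


tokens = 90_000  # hypothetical transcript token total
small = chunk_count(tokens, 300)        # chunks at 300 tokens each
large = chunk_count(tokens, 450)        # chunks at 450 tokens each
increase = (small - large) / large      # relative rise in chunks, calls, and vectors
```

With overlapping chunks the ratio shifts slightly, but the direction holds: smaller chunks multiply every downstream count.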

Batching is the second lever. Where the embedding API accepts an array of inputs in a single call, use it: Cohere Embed on Bedrock takes up to 96 texts per request. Titan text embedding models on Bedrock take one input per InvokeModel request, so the equivalent lever there is bounded concurrency across requests. Sending one chunk per call wastes per-request overhead and hits request-rate limits faster. Sending 25 to 96 chunks per call is usually the practical range. The cost per chunk does not change, but throughput rises roughly an order of magnitude, which means a ten-thousand-hour reprocess completes overnight instead of over a week.
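A sketch of the batching step, assuming the embedding endpoint takes a list of texts per request. `embed_batch` is a hypothetical callable standing in for the real provider call, and the default batch size is an arbitrary value inside the range discussed above.

```python
from typing import Callable, Iterator, Sequence

Vector = list[float]


def batched(chunks: Sequence[str], batch_size: int = 50) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `batch_size` chunks."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(chunks), batch_size):
        yield list(chunks[start:start + batch_size])


def embed_all(
    chunks: Sequence[str],
    embed_batch: Callable[[list[str]], list[Vector]],
    batch_size: int = 50,
) -> list[Vector]:
    """Embed every chunk with one provider call per batch, preserving order."""
    vectors: list[Vector] = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
```

For 120 chunks at a batch size of 50 this makes three calls instead of 120, which is where the throughput gain comes from.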

Model choice is the third. Titan Embed is cheaper than Cohere Embed v3 by a factor of two to three. For English-only corpora where retrieval quality is good enough, Titan is the right default. Cohere wins on multilingual content, where its retrieval precision is meaningfully higher and worth the cost. The wrong choice here, multiplied across a large corpus, is the difference between a $200 ingestion job and a $700 one.

Caching is the fourth, and the dullest. If a video gets re-uploaded with the same content under a different filename, the content hash catches it before anything else runs. If a chunker version is bumped but the embedding model is unchanged, re-embedding is unnecessary for chunks whose text did not change. None of this is exotic. It is bookkeeping the pipeline does once and then never has to think about again — the cleaning closet that pays for itself every quarter.
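One way to get this behaviour is to key cached embeddings on the chunk text and the embedding model, not on the chunk ID, so a chunker bump only pays for chunks whose text actually changed. The sketch below assumes an in-memory dict as the cache and a hypothetical `embed_batch` callable; a production cache would live in a key-value store.

```python
import hashlib
from typing import Callable

Vector = list[float]


def embedding_cache_key(text: str, model_id: str) -> str:
    """Cache key that changes only when the text or the embedding model changes."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{model_id}:{digest}"


def embed_with_cache(
    texts: list[str],
    model_id: str,
    cache: dict[str, Vector],
    embed_batch: Callable[[list[str]], list[Vector]],
) -> list[Vector]:
    """Return one vector per text, calling the provider only for cache misses."""
    keys = [embedding_cache_key(t, model_id) for t in texts]
    missing: dict[str, str] = {}
    for key, text in zip(keys, texts):
        if key not in cache:
            missing.setdefault(key, text)  # de-duplicate repeated texts
    if missing:
        vectors = embed_batch(list(missing.values()))
        cache.update(zip(missing.keys(), vectors))
    return [cache[key] for key in keys]
```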

Sizing an index for ten million vectors

A pilot OpenSearch index with 50,000 vectors behaves nothing like a production index with 10 million. The pilot is a corner shop. The production index is a warehouse. The same shelf arrangement breaks under different load.

Shard count first. OpenSearch's default shard count is too low for a vector index that will grow into the millions. We size shards so each one holds 1 to 5 million vectors. Above that, query latency climbs because the HNSW graph per shard gets too large. Below that, you are paying for shard overhead that buys you nothing.

Replica strategy second. One replica per shard is the production minimum. Replicas serve reads, so they directly affect query throughput. For a query-heavy deployment, two replicas per shard is often the right answer. The cost is storage and memory. The gain is parallelism on the read path.

HNSW parameters third. The m parameter for graph connectivity and ef_construction for index build quality trade build cost against query quality. The defaults are conservative. For a large corpus where retrieval quality matters, increasing m from 16 to 32 and ef_construction from 100 to 200 measurably improves recall, at the cost of slower indexing and a larger memory footprint. Worth doing once, on a corpus that will be queried for years.
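The shard, replica, and HNSW choices above come together in the index definition. The sketch below builds an OpenSearch create-index body with a `knn_vector` field using the Faiss HNSW method; the field name `embedding`, the L2 space type, and the 3-million-per-shard target are assumptions chosen for illustration.

```python
import math


def primary_shards(total_vectors: int, target_per_shard: int = 3_000_000) -> int:
    """Primary shard count aiming for about `target_per_shard` vectors per shard."""
    return max(1, math.ceil(total_vectors / target_per_shard))


def build_index_body(
    total_vectors: int,
    dimension: int = 1024,
    replicas: int = 1,
    m: int = 32,
    ef_construction: int = 200,
) -> dict:
    """Create-index body for a k-NN index sized for the expected corpus."""
    return {
        "settings": {
            "index": {
                "knn": True,
                "number_of_shards": primary_shards(total_vectors),
                "number_of_replicas": replicas,
            }
        },
        "mappings": {
            "properties": {
                "embedding": {  # hypothetical field name
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "engine": "faiss",
                        "space_type": "l2",
                        "parameters": {"m": m, "ef_construction": ef_construction},
                    },
                }
            }
        },
    }


body = build_index_body(10_000_000)
```

Shard count and HNSW build parameters are fixed when the index is created, so changing them later means building a new index, which is one more reason to size for the corpus expected in eighteen months.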

Memory sizing fourth, and this is where teams quietly burn money. OpenSearch holds the HNSW graph in memory for low-latency search. If the index does not fit, query latency degrades sharply, the way a machine that runs out of RAM starts swapping and falls off a cliff. For uncompressed float32 vectors, budget roughly 4 bytes per dimension per vector plus graph overhead; the OpenSearch k-NN documentation estimates 1.1 × (4 × dimension + 8 × m) bytes per vector. Ten million 1024-dimension vectors is therefore about 45 to 50 GB of memory for the graph, before replicas. With 16-bit (fp16) scalar quantization, at roughly 2 bytes per dimension, the same corpus needs about 20 to 25 GB. Right-sizing the cluster up front is cheaper than discovering you under-provisioned three months in.
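As a worked version of this estimate, the sketch below uses the per-vector HNSW formula from the OpenSearch k-NN documentation, 1.1 × (4 × dimension + 8 × m) bytes for float32 vectors. The `bytes_per_dim` parameter is an assumption added here to model 16-bit quantization at 2 bytes per dimension; replicas multiply the total and are not included.

```python
def hnsw_memory_gb(
    num_vectors: int,
    dimension: int,
    m: int = 16,
    bytes_per_dim: float = 4.0,
) -> float:
    """Estimated native memory (in GB, 1e9 bytes) for one copy of an HNSW graph."""
    per_vector_bytes = 1.1 * (bytes_per_dim * dimension + 8 * m)
    return num_vectors * per_vector_bytes / 1e9


float32_gb = hnsw_memory_gb(10_000_000, 1024)                  # about 46.5 GB
fp16_gb = hnsw_memory_gb(10_000_000, 1024, bytes_per_dim=2.0)  # about 23.9 GB
```

Raising m from 16 to 32 adds only about 1.4 GB to the float32 estimate; the vector dimension dominates the footprint.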

The biggest latency win lives before the search

The single biggest performance win at scale is not in vector search. It is in what runs before vector search.

Metadata filtering before kNN is the first lever. A query that searches all chunks in the corpus is doing vector math against millions of candidates. A query that searches chunks from this user's accessible videos, in this date range, in this language might be doing vector math against ten thousand. The latency gap is roughly an order of magnitude — the difference between asking a librarian for a book and asking them to search the entire library. The architectural decision is to make filtering cheap by indexing the right metadata fields with the right types, and to make sure every query carries the most selective filter the use case allows.

This is also the security boundary. Permissions live on the chunk and are enforced as filters at query time, as From Transcript to Intelligence covers. At scale, that filter does double duty: enforcing access and dramatically narrowing the search space. A query whose filter reduces the candidate set by 100x is a query that returns in 50ms instead of 500ms.
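A sketch of a filtered k-NN request body, assuming an OpenSearch index whose engine supports a `filter` clause inside the `knn` query (Lucene or Faiss). The field names `embedding`, `video_id`, `language`, and `published_at` are hypothetical; the real names depend on the chunk schema.

```python
def filtered_knn_query(
    vector: list[float],
    allowed_video_ids: list[str],
    language: str,
    start_date: str,
    end_date: str,
    k: int = 10,
) -> dict:
    """Search body that applies access, language, and date filters during k-NN search."""
    return {
        "size": k,
        "query": {
            "knn": {
                "embedding": {
                    "vector": vector,
                    "k": k,
                    "filter": {
                        "bool": {
                            "must": [
                                # Access control: only chunks from videos this user may see.
                                {"terms": {"video_id": allowed_video_ids}},
                                {"term": {"language": language}},
                                {"range": {"published_at": {"gte": start_date, "lte": end_date}}},
                            ]
                        }
                    },
                }
            }
        },
    }
```

Putting the filter inside the `knn` clause, and not in a post-filter, is what lets the engine narrow the candidate set before scoring and still return k results.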

Reranking is the second lever. It adds 150 to 400ms of latency, which at scale becomes the single largest contributor after the model call. It is also the largest contributor to retrieval quality. Keep reranking on for queries that matter. Skip it for cheap traversal queries — autocomplete, related-items lookups — where the precision gain is not worth the latency.

Hybrid search is the third. Combining BM25 and kNN in one query adds modest latency, usually under 50ms, and routinely improves recall on edge cases that pure semantic search misses — proper nouns, identifiers, verbatim phrases. The overhead is small enough that hybrid is the right default once the corpus is large enough for pure semantic drift to become visible.

Hot, warm, cold

A production video corpus has a long tail. Some videos are queried daily. Most are queried once a quarter, if at all. Treating both alike is like keeping every book in your library on the front display shelf.

The pattern we deploy tiers the index. The hot tier lives in the primary OpenSearch cluster, fully replicated, fully in-memory — the front display shelf. The warm tier lives in the same cluster but on a lower-resource node group with fewer replicas, the back-of-the-shop bookcase; queries against warm chunks are slightly slower but cost less to host. The cold tier lives in S3 as serialised vector files, optionally with a small Postgres index for metadata lookup — boxes in the basement. A query that needs cold data triggers a rehydration job, which is fine because cold queries are rare.

The classifier for which tier a video belongs to is not exotic: last query time, query frequency over the last 30 days, age of the asset. Videos get demoted to warm after 90 days without a query, demoted to cold after 365. Promotions happen automatically when a cold video is queried more than twice in a week.
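The rules above fit in a few lines. This sketch makes two assumptions the text does not spell out: a promoted cold video goes straight to hot, and a warm video returns to hot as soon as it is queried again, because its days-without-a-query counter resets.

```python
def classify_tier(
    current_tier: str,
    days_since_last_query: int,
    queries_last_7_days: int,
) -> str:
    """Return the tier ("hot", "warm", or "cold") a video should live in next."""
    if current_tier == "cold":
        # Cold videos stay cold until they show repeated demand within a week.
        return "hot" if queries_last_7_days > 2 else "cold"
    if days_since_last_query >= 365:
        return "cold"
    if days_since_last_query >= 90:
        return "warm"
    return "hot"
```

Requiring more than two queries before promotion keeps a single curious search from triggering a full rehydration into the hot tier.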

The savings are real. A corpus of 10,000 hours where 90% of queries hit 10% of the videos can run on a cluster sized for the hot tier plus a small warm overflow — often a third to half the cost of provisioning the entire corpus at hot-tier latency.

Swapping the engine while the car is moving

The embedding model will change. The chunker will change. The metadata schema will change. At small scale, you take downtime, you reprocess, you move on. At ten thousand hours, downtime is not an option.

The pattern is a versioned index alias. The live system queries an alias — say, chunks-current. Behind that alias is an actual index, chunks-v3. When a new embedding model arrives, ingestion starts writing to chunks-v4 in parallel. New uploads go to v4. The backlog of existing videos gets reprocessed into v4 by a backfill job that runs as a background workload, throttled to leave headroom for live ingestion.

When chunks-v4 is fully populated and validated against a sample of canonical queries, the alias swap is atomic. chunks-current now points at v4. The old index gets kept for a rollback window and eventually deleted.
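The swap itself is a single request to the OpenSearch `_aliases` endpoint. Both actions in one request are applied atomically, so no query ever sees the alias pointing at nothing. The helper name is illustrative.

```python
def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Request body for POST /_aliases that repoints `alias` in one atomic step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


swap = alias_swap_actions("chunks-current", "chunks-v3", "chunks-v4")
```

Rolling back during the retention window is the same call with the two index names reversed.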

This is operationally identical to a blue-green deployment for vector indexes — the new bridge is built parallel to the old one, traffic switches in a moment, and the old bridge stays standing until you trust the new one. The cost is the period of double-indexing. The gain is that the user-facing system never goes down.

The same pattern handles chunker changes (write new chunks alongside old, swap when done) and schema changes (add the new field to incoming writes, backfill historical records, then update the read path).

Cost visibility at three resolutions

A scaled video pipeline needs cost visibility at three resolutions. Per video. Per tenant. Per query.

Per video, when ingested, the pipeline records transcription minutes, embedding calls and token counts, and storage footprint. The result is a unit-economics number (this video cost $0.18 to ingest) that lets the team flag anomalies, such as a video that cost ten times the median, probably because it triggered a retry loop, and that lets product and finance reason about pricing without guessing.
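A sketch of the anomaly check, assuming per-video ingestion costs have already been rolled up into a mapping from video ID to dollars. The function name and the sample figures are hypothetical.

```python
from statistics import median


def flag_cost_anomalies(cost_by_video: dict[str, float], factor: float = 10.0) -> list[str]:
    """Return IDs of videos whose ingestion cost is at least `factor` times the median."""
    if not cost_by_video:
        return []
    typical = median(cost_by_video.values())
    return sorted(vid for vid, cost in cost_by_video.items() if cost >= factor * typical)


# Hypothetical per-video costs in dollars; "d" looks like a retry loop.
costs = {"a": 0.18, "b": 0.20, "c": 0.16, "d": 2.40}
suspects = flag_cost_anomalies(costs)
```

The median is used in place of the mean so that a few runaway videos do not drag the baseline up and hide themselves.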

Per tenant, in multi-tenant deployments, every query, every ingestion job, every storage byte is tagged with a tenant ID. Cost rolls up by tenant for invoicing, for capacity planning, and for spotting tenants whose usage patterns are about to break a pricing model before the model breaks them.

Per query, every request records its model calls, token counts, retrieval candidate count, and total latency. The aggregate is the per-query cost over time. The interesting signal is variance — a query class whose cost suddenly doubles is either a retrieval regression returning too many candidates or an abuse pattern, a tenant running automated queries no one planned for.

CloudWatch alarms fire on three thresholds. Daily total cost as a budget guardrail. Per-tenant cost variance to catch abuse and runaway usage. Per-query P99 latency to catch retrieval degradation before users complain.

FAQs

Titan or Cohere Embed v3 for video transcript chunks?

Titan is cheaper by a factor of two to three. For English-only corpora where retrieval quality is good enough, Titan is the right default. Cohere wins on multilingual content where its retrieval precision is meaningfully higher and worth the cost. The wrong decision multiplied across a 10,000-hour corpus is the difference between a $200 ingestion job and a $700 one.

How do I size an OpenSearch vector index for 10 million vectors?

Shard so each one holds 1-5 million vectors. For memory, budget roughly 4 bytes per dimension per vector plus graph overhead for float32 vectors: ten million 1024-dimension vectors is roughly 45-50 GB just for the HNSW graph, or roughly 20-25 GB with 16-bit quantization at about 2 bytes per dimension. One replica per shard is the production minimum; two replicas for read-heavy workloads. Right-sizing the cluster up front is cheaper than discovering you under-provisioned three months in.

What is the single biggest latency win at scale?

Metadata filtering before kNN search. A query against "all chunks in the corpus" does vector math against millions of candidates; a query filtered to "this user's accessible videos, this date range, this language" might run against ten thousand. The latency difference is roughly an order of magnitude. The filter is also doing double duty — enforcing permissions and narrowing the search space.

How do you handle reprocessing without downtime?

Versioned index alias with blue-green swap. The live system queries chunks-current. Behind the alias is chunks-v3. When a new embedding model arrives, ingestion writes to chunks-v4 in parallel. New uploads go to v4. A backfill job reprocesses existing videos. When v4 is fully populated and validated against canonical queries, the alias swap is atomic. The cost is the period of double-indexing; the gain is no downtime.

What cost metrics should fire CloudWatch alarms?

Three thresholds. Daily total cost as a budget guardrail. Per-tenant cost variance to catch abuse and runaway usage before the invoicing cycle. Per-query latency P99 to catch retrieval degradation before users complain. Per-video, per-tenant, and per-query cost tagging makes anomalies (a video that cost ten times the median, probably from a retry loop) visible quickly.

The shape that holds at any scale

The same architecture that runs the retrieve-first pipeline on one video runs it on ten thousand hours. What changes is where the cost lives, where the latency lives, and where the failure modes hide. The model on the query path stays roughly the same price per request. Everything upstream — ingestion, embedding, indexing, reprocessing — becomes the engineering problem.

Teams that ship video AI features successfully at scale share four habits. They model the ingestion cost honestly before they ship the budget. They size the index for the corpus they will have in eighteen months, not the corpus they have today. They build reprocessing into the architecture from day one, because the embedding model will change. They monitor cost at a resolution that lets them catch the surprise before finance does.

The retrieval bill is the one you see. The ingestion bill is the one that decides whether the project survives the year.