Query-Driven Video Clip Extraction with Semantic Alignment
At Joyspace, we process thousands of hours of video content weekly. Our initial approach used a learned "interestingness" classifier—a binary model trained to identify highlight-worthy moments. This failed immediately in production. The same 90-minute sales call would yield completely different highlights depending on who was watching: sales teams wanted objection handling, product teams wanted feature requests, and marketing needed customer testimonials.
The fundamental insight: interestingness is not a property of the video; it's a function of viewer intent.
This led us to rebuild our extraction pipeline around explicit query specification and semantic matching. This post details our technical approach, focusing on the crisp attention mechanism that improved matching accuracy by 12% while reducing computational cost by 60%.
Problem Formulation
Given a video V of duration T seconds and a natural language query q, we need to extract n semantically coherent clips C = {c₁, c₂, ..., cₙ} where each clip cᵢ = (tₛ, tₑ) represents start/end timestamps.
Constraints:
- Clips must be ranked by semantic relevance to query intent
- Real-time processing: <10 minutes for 90-minute videos
- Deterministic: same inputs → same outputs
- Semantic coherence: clips must represent complete thoughts
The core technical challenge is computing a relevance function R(q, v) that accurately measures alignment between query intent and video segment semantics across modalities (visual, audio, transcript).
Architecture Overview
We use a dual-encoder architecture that maps queries and video segments into a shared embedding space where semantic similarity corresponds to cosine similarity. The key innovation is applying crisp attention—structured sparsity in the attention mechanism—to both query understanding and video-segment matching.
Query q → Intent Encoder → q̂ ∈ ℝᵈ
Video V → Segment Encoder → {v̂₁, v̂₂, ..., v̂ₘ} ∈ ℝᵈ
Relevance: R(q, vᵢ) = cos(q̂, v̂ᵢ) = q̂ᵀv̂ᵢ / (||q̂|| · ||v̂ᵢ||)
Where d = 512 (embedding dimension chosen to balance expressiveness and computational efficiency).
Crisp Attention for Query Intent
Traditional attention mechanisms in transformers compute weighted combinations across all input tokens. For query understanding, this creates a problem: ambiguous or hedge words ("maybe," "possibly," "sort of") dilute the signal from intent-carrying tokens.
The Sparsity Hypothesis
Recent work in transformer optimization has shown that structured attention sparsity can improve model accuracy, not just efficiency. The hypothesis: forcing the model to select a small subset of highly relevant features acts as implicit regularization, preventing overfitting to weak correlations in training data.
We applied this principle to query intent extraction. Instead of dense attention over all query tokens, we use top-k attention where only the k highest-scoring tokens contribute to the intent representation:
Traditional attention:
α = softmax(QKᵀ/√d)
output = αV
Crisp attention (k=5):
α = softmax(QKᵀ/√d)
α_sparse = top_k(α, k) # zero out all but top-k values
α_normalized = α_sparse / sum(α_sparse)
output = α_normalized V
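The pseudocode above can be sketched as a runnable NumPy function. This is a minimal single-head illustration of top-k attention, not the production model; shapes and the test inputs are assumptions:

```python
import numpy as np

def crisp_attention(Q, K, V, k=5):
    """Top-k sparse attention: keep only the k largest weights per query row."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # raw attention logits
    alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)       # dense softmax
    # Zero out all but the top-k weights in each row.
    idx = np.argsort(alpha, axis=-1)[:, :-k]         # indices of the smallest entries
    sparse = alpha.copy()
    np.put_along_axis(sparse, idx, 0.0, axis=-1)
    sparse /= sparse.sum(axis=-1, keepdims=True)     # renormalize surviving weights
    return sparse @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 8)), rng.normal(size=(10, 8)), rng.normal(size=(10, 8))
out = crisp_attention(Q, K, V, k=5)
print(out.shape)  # (2, 8)
```

When k equals the number of keys, the function reduces to ordinary dense attention, which makes the sparsity a strict generalization.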
For k=5, we force the model to identify the 5 most intent-relevant tokens. Empirically, this improved intent classification accuracy from 79% to 87% on our evaluation set of 2,000 manually labeled queries.
Intent Disambiguation
Users write queries with varying specificity:
- Specific: "show pricing objection at 23 minutes"
- Moderate: "find technical architecture discussion"
- Vague: "interesting moments"
For specific queries, crisp attention focuses on entity terms ("pricing," "objection," "23 minutes"). For vague queries, it focuses on context terms that disambiguate based on video metadata (speaker roles, video category, historical user preferences).
We train a small routing network (2-layer MLP, 256 hidden units) that predicts optimal k per query:
k = routing_network(query_embedding, video_metadata)
k ∈ {3, 5, 7, 10} # discrete choices
Specific queries → smaller k (more focused); vague queries → larger k (more context).
This adaptive sparsity improved end-to-end matching precision by 6%.
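A minimal sketch of such a routing head. The weights, dimensions, and metadata embedding here are illustrative placeholders, not the trained production network:

```python
import numpy as np

K_CHOICES = [3, 5, 7, 10]  # discrete sparsity levels, as in the post

def routing_network(query_emb, meta_emb, W1, b1, W2, b2):
    """2-layer MLP that picks a sparsity level k for a query."""
    x = np.concatenate([query_emb, meta_emb])
    h = np.maximum(0.0, W1 @ x + b1)     # ReLU hidden layer (256 units)
    logits = W2 @ h + b2                 # one logit per candidate k
    return K_CHOICES[int(np.argmax(logits))]

rng = np.random.default_rng(1)
q, m = rng.normal(size=512), rng.normal(size=64)   # query + metadata embeddings
W1, b1 = rng.normal(size=(256, 576)) * 0.01, np.zeros(256)
W2, b2 = rng.normal(size=(4, 256)) * 0.01, np.zeros(4)
k = routing_network(q, m, W1, b1, W2, b2)
print(k)  # one of {3, 5, 7, 10}
```

In training, the argmax would be replaced by a differentiable selection (e.g. softmax over the four logits), but the inference-time behavior is a hard choice of k.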
Multimodal Video Encoding
Videos contain three signal modalities: visual frames, audio waveforms, and transcript text. Naive concatenation fails because modalities have different temporal resolutions and information densities.
Per-Modality Encoding
Visual: Sample frames at 1 FPS. Encode with a vision transformer to produce frame embeddings Fᵥ ∈ ℝᵗˣᵈᵛ.
Audio: Convert to mel spectrograms (80 bins, 25ms windows, 10ms hop). Encode with 1D CNN to produce audio embeddings Fₐ ∈ ℝᵗˣᵈₐ.
Transcript: Run ASR (automatic speech recognition) with word-level timestamps. Segment into sentences. Encode with transformer to produce text embeddings Fₜ ∈ ℝˢˣᵈₜ.
Cross-Modal Fusion with Crisp Attention
Different queries require different modality weightings. "Show product demo" needs visual focus. "Find pricing discussion" needs text focus. We learn query-dependent modality attention.
First, project all modalities to common dimension d:
Hᵥ = Fᵥ Wᵥ
Hₐ = Fₐ Wₐ
Hₜ = Fₜ Wₜ
Then apply crisp cross-modal attention. For each modality, we compute attention over the other two modalities, but only retain top-k=3 attention weights:
For visual features Hᵥ:
Attend to [Hₐ, Hₜ] with crisp attention
α_va, α_vt = crisp_attention(Hᵥ, [Hₐ, Hₜ])
H'ᵥ = α_va·Hₐ + α_vt·Hₜ
This forces the model to decisively choose which modalities are relevant rather than hedging across all three.
Finally, we learn query-dependent fusion weights:
w = softmax(MLP(query_embedding))
w ∈ ℝ³ (weights for visual, audio, text)
H_fused = w₁·H'ᵥ + w₂·H'ₐ + w₃·H'ₜ
For text-heavy queries, w₃ dominates. For visual queries, w₁ dominates. The fusion is learned end-to-end during training.
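The fusion step can be sketched as follows; the single projection matrix `W_mlp` is a hypothetical stand-in for the small MLP, and all dimensions are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_modalities(query_emb, Hv, Ha, Ht, W_mlp):
    """Weight visual/audio/text features by a learned function of the query."""
    w = softmax(W_mlp @ query_emb)               # w ∈ ℝ³, sums to 1
    return w[0] * Hv + w[1] * Ha + w[2] * Ht, w

rng = np.random.default_rng(2)
d = 512
q = rng.normal(size=d)
Hv, Ha, Ht = (rng.normal(size=(20, d)) for _ in range(3))  # 20 time steps
H_fused, w = fuse_modalities(q, Hv, Ha, Ht, rng.normal(size=(3, d)) * 0.05)
print(H_fused.shape)  # (20, 512)
```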
Semantic Matching via Contrastive Learning
Training the dual encoder requires (query, video_segment, relevance) labels. We don't have these at scale. Instead, we use weak supervision from two sources:
- Transcript overlap: Segments where transcript has high TF-IDF similarity with query
- User interactions: Segments users selected after issuing queries (sparse but high-quality)
Hard Negative Mining
The key to effective contrastive learning is hard negatives—examples that are semantically similar but not correct matches. Random negatives are too easy; the model learns to separate them quickly but fails on subtle distinctions.
Our mining strategy:
For query q with positive segment v⁺:
Compute similarity s_i = cos(q̂, v̂_i) for all segments in batch
Select negatives where 0.3 < s_i < 0.6
(Too similar → might be false negative)
(Too dissimilar → uninformative for learning)
Example:
- Query: "customer testimonial"
- Hard negative: Salesperson describing typical customer results (similar keywords, wrong speaker)
- Easy negative: Technical architecture discussion (completely unrelated)
Hard negatives force the model to learn fine-grained distinctions (first-person vs. third-person, customer vs. employee).
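The in-batch mining rule above is simple to implement. A sketch, with synthetic embeddings constructed to land at known cosine similarities (the `mix` helper is purely for the demo):

```python
import numpy as np

def mine_hard_negatives(q_emb, seg_embs, lo=0.3, hi=0.6):
    """Indices of in-batch segments whose cosine similarity to the query
    falls inside the hard-negative band (lo, hi)."""
    q = q_emb / np.linalg.norm(q_emb)
    S = seg_embs / np.linalg.norm(seg_embs, axis=1, keepdims=True)
    sims = S @ q
    return np.where((sims > lo) & (sims < hi))[0], sims

rng = np.random.default_rng(4)
q = rng.normal(size=512)

def mix(c):
    """Build a unit vector whose cosine with q is exactly c (demo helper)."""
    n = rng.normal(size=512)
    n -= (n @ q) / (q @ q) * q                   # component orthogonal to q
    qn, nn = q / np.linalg.norm(q), n / np.linalg.norm(n)
    return c * qn + np.sqrt(1 - c**2) * nn

# A near-positive, two mid-similarity candidates, and an unrelated segment.
segs = np.stack([mix(0.95), mix(0.45), mix(0.5), mix(0.05)])
hard_idx, sims = mine_hard_negatives(q, segs)
print(hard_idx)  # [1 2]
```

Only the two mid-band segments survive; the near-positive (a likely false negative) and the unrelated segment are both excluded.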
Loss Function
We use InfoNCE with temperature τ = 0.07:
L = -log(exp(cos(q̂,v̂⁺)/τ) / (exp(cos(q̂,v̂⁺)/τ) + Σᵢ exp(cos(q̂,v̂ᵢ⁻)/τ)))
Lower temperature sharpens the distribution, requiring the model to strongly distinguish positives from hard negatives.
To handle false negatives from weak supervision, we down-weight high-similarity negatives:
For negative v⁻ with similarity s:
weight = max(0, 1 - 2·(s - 0.5)) if s > 0.5 else 1
This reduces penalty when the model ranks potential false negatives highly.
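Putting the InfoNCE loss and the down-weighting rule together, a minimal standalone sketch (scalar similarities rather than batched embeddings, for clarity):

```python
import numpy as np

def weighted_infonce(sim_pos, sim_negs, tau=0.07):
    """InfoNCE with high-similarity negatives down-weighted, so the model is
    penalized less for ranking potential false negatives highly."""
    sim_negs = np.asarray(sim_negs, dtype=float)
    # weight = max(0, 1 - 2*(s - 0.5)) when s > 0.5, else 1
    w = np.where(sim_negs > 0.5,
                 np.maximum(0.0, 1.0 - 2.0 * (sim_negs - 0.5)),
                 1.0)
    pos = np.exp(sim_pos / tau)
    return -np.log(pos / (pos + (w * np.exp(sim_negs / tau)).sum()))

loss = weighted_infonce(0.8, [0.55, 0.4, 0.2])
```

Because the 0.55-similarity negative gets weight 0.9 instead of 1.0, this loss is strictly smaller than the unweighted InfoNCE on the same similarities.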
Efficient Retrieval at Scale
At inference, we need to find top-n segments from potentially thousands of candidates. Brute-force comparison is O(m·d) where m is segment count—too slow for real-time use.
Two-Stage Coarse-to-Fine Retrieval
Stage 1: Coarse filtering
Segment video into 30-second windows with 50% overlap. For a 90-minute video, this yields ~360 segments. Encode all segments in parallel, producing embedding matrix V ∈ ℝ³⁶⁰ˣ⁵¹².
Compute similarities via single matrix multiplication:
scores = V q̂ # shape: (360,)
top_100 = argsort(scores)[-100:] # select top-100
This aggressive filtering (360 → 100) is safe: recall@100 > 95% in offline eval.
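The coarse stage is a single matrix-vector product over unit-normalized embeddings. A sketch with random stand-in embeddings:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 512
q_hat = rng.normal(size=d)
q_hat /= np.linalg.norm(q_hat)                   # unit-normalize the query
V = rng.normal(size=(360, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)    # unit-normalize each segment

scores = V @ q_hat                               # cosine similarities, shape (360,)
top_100 = np.argsort(scores)[-100:]              # indices of the 100 best segments
print(top_100.shape)  # (100,)
```

With normalized rows the dot product equals cosine similarity, so this one matmul scores all 360 windows at once.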
Stage 2: Fine-grained scoring
For the 100 candidates:
- Extend temporal context (±10 seconds)
- Re-encode at higher resolution
- Apply crisp attention at segment level
- Compute refined similarity scores
This two-stage approach reduces compute by 60% while maintaining quality.
Temporal Boundary Refinement
Initial segments are fixed-length (30s), which often cuts mid-sentence. We refine boundaries for semantic coherence:
- Transcript alignment: Find sentence boundaries within ±5s of segment edges
- Scene detection: Compute frame-to-frame similarity; break at discontinuities
- Silence detection: Prefer breaks during silence (amplitude < -40dB for >0.5s)
We learn a boundary scoring function:
boundary_score(t) = w₁·is_sentence_boundary(t) +
w₂·is_scene_boundary(t) +
w₃·is_silence(t)
Weights w trained on 500 human-annotated "good" vs "bad" clip boundaries.
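The boundary scorer is a plain weighted sum of three indicator signals. In this sketch the weights and indicator functions are illustrative toys, not the trained values:

```python
def boundary_score(t, is_sentence_boundary, is_scene_boundary, is_silence,
                   w=(0.5, 0.3, 0.2)):
    """Linear score for candidate boundary time t (weights are illustrative;
    the real weights are trained on annotated clip boundaries)."""
    return (w[0] * is_sentence_boundary(t)
            + w[1] * is_scene_boundary(t)
            + w[2] * is_silence(t))

# Toy indicators: a sentence boundary at ~30.2s inside a short silent gap.
sent = lambda t: 1.0 if abs(t - 30.2) < 0.1 else 0.0
scene = lambda t: 0.0
silence = lambda t: 1.0 if 29.9 < t < 30.4 else 0.0
print(boundary_score(30.2, sent, scene, silence))  # 0.7
```

At refinement time, the candidate within the ±5s search window with the highest score becomes the new clip edge.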
Non-Maximum Suppression
Due to overlapping windows, we get redundant high-scoring candidates. Apply temporal NMS:
def temporal_nms(segments, n, iou_threshold=0.3):
    """segments: list of (start, end, score). Greedily keep the best-scoring
    clips, skipping any that overlap a kept clip beyond the threshold."""
    def overlap(a, b):  # temporal IoU of two segments
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0
    result = []
    for c in sorted(segments, key=lambda s: s[2], reverse=True):
        if all(overlap(c, r) <= iou_threshold for r in result):
            result.append(c)
    return result[:n]
30% overlap threshold allows partial overlap (long discussions may produce multiple clips) while preventing near-duplicates.
Evaluation
We evaluated on a held-out test set of 500 videos with human-annotated relevant segments for 2,000 queries.
Crisp Attention Ablation
| Configuration | Precision@5 | Recall@10 | MAP@10 | Compute (relative) |
|---|---|---|---|---|
| Dense attention | 0.79 | 0.71 | 0.76 | 1.0× |
| Crisp attention (k=10) | 0.84 | 0.74 | 0.80 | 0.7× |
| Crisp attention (k=5) | 0.89 | 0.76 | 0.83 | 0.4× |
| Crisp attention (k=3) | 0.86 | 0.73 | 0.81 | 0.3× |
k=5 provides optimal balance: 12% precision improvement over dense attention with 60% compute reduction.
Query Type Performance
| Query Type | Count | Precision@5 | Notes |
|---|---|---|---|
| Specific (with timestamps) | 180 | 0.94 | "pricing at 23 min" |
| Moderate (topic only) | 1,100 | 0.89 | "technical architecture" |
| Vague (general intent) | 520 | 0.81 | "interesting moments" |
| Entity-based | 200 | 0.92 | "kubernetes discussion" |
Specific queries benefit most from crisp attention (forcing focus on timestamp entities). Vague queries see smaller gains (require more context).
Boundary Quality
91% of clips start at sentence boundaries (vs 45% with naive fixed-length segmentation). 87% end without cutting off speech. Average user rating: 4.2/5.0 for clip coherence.
Failure Modes
Query-Video Mismatch
When the query asks for content not present in the video, even the top-scoring segment may be irrelevant. We use confidence thresholding: if max(scores) < 0.6, return "No matching segments found."
This false-negative rate is 2% (legitimate matches scored below threshold), but prevents returning random clips just to meet requested count.
Repetitive Content
A speaker may repeat the same point several times, and every repetition scores highly. Mitigation: semantic deduplication using transcript embeddings. If cos(transcript(cᵢ), transcript(cⱼ)) > 0.85, keep the higher-scored clip.
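The deduplication pass can be sketched as a greedy filter over score-sorted clips; the 2-dimensional embeddings and clip labels below are toy inputs for illustration:

```python
import numpy as np

def dedup_clips(clips, embs, scores, threshold=0.85):
    """Drop a clip when its transcript embedding is a near-duplicate
    (cosine > threshold) of a higher-scored clip's embedding."""
    E = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    kept = []
    for i in np.argsort(scores)[::-1]:       # best score first
        if all(E[i] @ E[j] <= threshold for j in kept):
            kept.append(i)
    return [clips[i] for i in kept]

clips = ["intro pitch", "repeated pitch", "pricing Q&A"]
embs = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
scores = np.array([0.9, 0.8, 0.7])
print(dedup_clips(clips, embs, scores))  # ['intro pitch', 'pricing Q&A']
```

The second clip is suppressed because its embedding is nearly parallel to the first's, and the first scored higher.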
Context Dependency
A clip may reference prior context not included in the segment ("Yes, that's exactly right" without the question). We penalize clips containing unresolved references ("that," "it," "this") without antecedents.
Lessons Learned
1. Sparsity as regularization, not just optimization
We initially explored crisp attention for compute efficiency. The accuracy improvement was unexpected. Hypothesis: forcing the model to select a small feature set prevents overfitting to spurious correlations in training data. The model learns more robust, generalizable features.
2. Query intent matters more than query expansion
Traditional IR systems expand queries with synonyms. For intent-driven matching, this dilutes focus. Better to narrow the query to core intent (via crisp attention) than expand it.
3. Hard negative mining is critical
With only easy negatives (random segments), the model achieved 82% precision. Adding hard negatives (0.3-0.6 similarity range) improved to 89%. The model must learn fine-grained semantic distinctions.
4. Multimodal fusion requires query-dependent weighting
Equal weighting of visual/audio/text yielded 81% precision. Learning query-dependent weights (via small MLP) improved to 89%. Different queries need different modality emphasis.
5. Two-stage retrieval is necessary at scale
Brute-force scoring all segments is too slow. Coarse-to-fine (360 → 100 → 10) maintains quality while reducing latency by 60%.
Future Directions
Cross-video search: Extend to corpus-level retrieval. Challenge: indexing 100M+ segments while maintaining <1s query latency. Exploring approximate nearest neighbor methods (HNSW, ScaNN).
Compositional queries: Support boolean operations: "(pricing AND objections) NOT discounts." Requires careful score calibration to make set operations meaningful.
Zero-shot generalization: Current model trained on specific domain (business videos). Exploring meta-learning approaches to generalize to unseen video categories without fine-tuning.
Temporal grounding: Instead of discrete clips, return continuous playback starting at query-relevant timestamp. Already supported by architecture (we have frame-level timestamps).
Conclusion
By applying crisp attention to query-driven video extraction, we achieved 89% precision—a 12% relative improvement over dense attention—while reducing computational cost by 60%. The key insight: structured sparsity acts as implicit regularization, forcing the model to commit to strong semantic signals rather than hedging across weak correlations.
This principle extends beyond video retrieval to any task requiring intent understanding and semantic matching. When in doubt, constrain what the model can attend to. Less attention, focused on the right features, beats more attention spread thin.
Try the system at https://joyspace.ai.
For technical discussion or collaboration: hello@joyspace.ai