Quick Answer: SAM 2 for Video Annotation

SAM 2 reduces video annotation time from 37.8 seconds per frame (frame-by-frame with SAM 1) to 4.5 seconds per frame, an 8.4x speedup reported in Meta's SAM 2 research, published at ICLR 2025. It achieves this by tracking objects across frames from a single keyframe prompt using a streaming memory architecture, eliminating the need to re-identify objects on every frame. The model also handles occlusions automatically through a dedicated occlusion head that suppresses mask output when the target object is not visible.
Why Frame-by-Frame Video Annotation Breaks Down at Scale
Frame-by-frame video annotation treats a video as an ordered collection of individual images. An annotator opens frame 1, draws bounding boxes or segmentation masks around each object of interest, moves to frame 2, and repeats. This approach works acceptably for short clips with a small number of objects and a slow frame rate. It fails in every other scenario.
The Volume Problem
The fundamental issue is multiplicative. A one-minute video at 24 frames per second produces 1,440 frames. A dataset of 500 such videos produces 720,000 frames. Meta's SAM 2 research measured the baseline annotation time at 37.8 seconds per frame for production-quality labeling using frame-by-frame methods with SAM 1. At that rate, annotating 720,000 frames requires approximately 7,560 hours of work. At a standard 40-hour working week, that is 189 weeks, or more than three and a half years, of a single annotator's time. No reasonable annotation budget or timeline accommodates that.
Even at lower frame rates, the numbers remain impractical. Annotation projects for autonomous vehicle datasets, drone footage analysis, and surveillance video typically require tens of millions of labeled frames. Frame-by-frame manual annotation is not a scalable path to those numbers.
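
The workload arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch using only figures quoted in this article (24 fps, one-minute clips, 500 clips, 37.8 and 4.5 seconds per frame, a 40-hour week); the function name is illustrative.

```python
# Figures come from the article; annotation_hours is an illustrative helper name.

def annotation_hours(num_videos, seconds_per_video, fps, seconds_per_frame):
    """Return (total_frames, total_annotation_hours) for a dataset."""
    frames = num_videos * seconds_per_video * fps
    return frames, frames * seconds_per_frame / 3600


frames, sam1_hours = annotation_hours(500, 60, 24, 37.8)
_, sam2_hours = annotation_hours(500, 60, 24, 4.5)

print(frames)                  # 720000 frames in the dataset
print(round(sam1_hours))       # 7560 hours at 37.8 s/frame
print(round(sam1_hours / 40))  # 189 forty-hour weeks for one annotator
print(round(sam2_hours))       # 900 hours at 4.5 s/frame
```

Swapping in your own clip length, frame rate, and dataset size gives a quick first estimate of annotation effort before committing to a workflow.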
The Consistency Problem
Beyond raw volume, frame-by-frame annotation introduces consistency problems that degrade training data quality. When an annotator labels object boundaries independently on each frame, the position and shape of the mask fluctuate between frames in ways that reflect human variation rather than actual object movement. This creates a labeling artifact called bounding box drift or mask jitter, where the training signal teaches the model that object boundaries are noisier than they actually are.
Consistency also breaks down across annotators on long projects. An annotator who labeled the first 10,000 frames of a video dataset interprets edge cases differently from the annotator who labels frames 50,000 to 60,000. Without a mechanism that propagates annotations from keyframes to intermediate frames automatically, there is no structural way to enforce consistency across the dataset.
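
One way to put a number on mask jitter is the overlap (intersection over union, IoU) between masks on consecutive frames. The sketch below is an illustration rather than a method described in the SAM 2 research: using IoU as a jitter proxy is an assumption, masks are represented as plain sets of pixel coordinates, and all function names are hypothetical.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    if not union:
        return 1.0  # two empty masks are treated as identical
    return len(mask_a & mask_b) / len(union)


def adjacent_frame_iou(masks):
    """IoU between each pair of consecutive frames in one object track."""
    return [iou(a, b) for a, b in zip(masks, masks[1:])]


def rect(top, left, height, width):
    """Helper: a filled rectangle as a set of pixels."""
    return {(r, c)
            for r in range(top, top + height)
            for c in range(left, left + width)}


# A static 4x4 object labeled by hand on three frames; the middle label is
# shifted one column to the right, which is annotator variation, not motion.
track = [rect(0, 0, 4, 4), rect(0, 1, 4, 4), rect(0, 0, 4, 4)]
print(adjacent_frame_iou(track))  # [0.6, 0.6]
```

A low adjacent-frame IoU can also reflect genuine object motion, so a score like this is better suited to comparing annotation methods on the same footage than to serving as an absolute quality threshold.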
The Object Identity Problem
In video annotation, objects must retain a consistent identity across the full clip. The same vehicle in frame 1 must carry the same label in frame 300, even if it is partially occluded in frames 150 to 200. Frame-by-frame annotation without automated tracking gives annotators no mechanical assistance in maintaining that identity. They must track it visually and manually, which is error-prone and becomes increasingly difficult as clip length increases, scene density increases, or occlusions become frequent.
How SAM 2 Addresses Each of These Problems

SAM 2 was designed specifically for video segmentation, and its architecture addresses all three failure modes of frame-by-frame annotation in a unified way. Understanding the architecture helps explain why the throughput improvements are as large as they are.
Streaming Memory for Temporal Consistency
SAM 2 processes video frames one at a time in a streaming fashion. For each frame, the model's image encoder generates feature embeddings from the current frame. A memory attention module then conditions those features on information stored in a memory bank from previously processed frames. The mask decoder uses both the current frame features and the memory context to produce a segmentation mask.
This memory mechanism is the architectural innovation that makes SAM 2 fundamentally different from applying SAM 1 to individual frames. The model maintains a representation of the target object across time. When an annotator clicks on an object in the first frame of a clip, that object's representation is stored in the memory bank and used to generate masks for subsequent frames without requiring additional clicks. The result is temporally consistent mask propagation from a single prompt.
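
The data flow described above can be illustrated with a toy memory bank. This is a simplified sketch and not SAM 2's implementation: the class name, the split into prompted and recent entries, and the window size of 3 are assumptions made for illustration, and the real model stores learned feature tensors and combines them through attention rather than keeping a list of frame indices.

```python
from collections import deque


class ToyMemoryBank:
    """Illustrative only: prompted-frame entries plus a sliding window
    of recently processed frames."""

    def __init__(self, max_recent=3):
        self.prompted = []                      # entries created by annotator prompts
        self.recent = deque(maxlen=max_recent)  # oldest entry drops off automatically

    def add(self, frame_idx, features, prompted=False):
        entry = (frame_idx, features)
        (self.prompted if prompted else self.recent).append(entry)

    def context(self):
        """Everything the current frame would be conditioned on."""
        return self.prompted + list(self.recent)


def propagate(frames, prompt_frame=0, max_recent=3):
    """Process frames in order; record which memory frames each frame sees."""
    bank = ToyMemoryBank(max_recent)
    seen = {}
    for idx, features in enumerate(frames):
        seen[idx] = [i for i, _ in bank.context()]
        bank.add(idx, features, prompted=(idx == prompt_frame))
    return seen


seen = propagate([f"frame{i}" for i in range(8)])
print(seen[5])  # [0, 2, 3, 4]: the prompted frame plus the three most recent
```

The point of the sketch is that the prompted frame stays available to every later frame while older unprompted frames age out, which is why a single click can keep influencing masks far down the clip.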
Occlusion Handling Through a Dedicated Prediction Head
SAM 2 includes an occlusion head, a dedicated component that predicts whether the target object is present in each frame. When the model predicts that an object has become occluded or has exited the frame, it suppresses mask output for that frame rather than hallucinating a mask in an incorrect location. When the object reappears, the memory bank context allows SAM 2 to resume accurate tracking from the last known object state.
This solves the object identity problem that plagues frame-by-frame annotation. The model maintains object identity through occlusions automatically, without requiring the annotator to re-identify the object in every frame where it reappears.
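
The suppression behavior can be sketched as a simple gate over per-frame outputs. The example assumes each frame comes with a presence score where positive means "object visible"; that score scale, the threshold of 0.0, and both function names are illustrative assumptions, not SAM 2's documented interface.

```python
def apply_occlusion_gate(masks, presence_scores, threshold=0.0):
    """Replace the mask with None on frames whose presence score is at or below threshold."""
    return [m if s > threshold else None for m, s in zip(masks, presence_scores)]


def visibility_spans(gated):
    """Return inclusive (start, end) frame ranges where the object has a mask."""
    spans, start = [], None
    for idx, mask in enumerate(gated):
        if mask is not None and start is None:
            start = idx
        elif mask is None and start is not None:
            spans.append((start, idx - 1))
            start = None
    if start is not None:
        spans.append((start, len(gated) - 1))
    return spans


masks = ["m0", "m1", "m2", "m3", "m4", "m5"]   # stand-ins for real mask arrays
scores = [2.1, 1.5, -0.7, -1.2, 0.9, 1.8]      # hypothetical presence scores
gated = apply_occlusion_gate(masks, scores)
print(gated)                    # ['m0', 'm1', None, None, 'm4', 'm5']
print(visibility_spans(gated))  # [(0, 1), (4, 5)]
```

Both spans belong to the same track, so the object keeps one identity across the gap; only the mask output is withheld while the object is predicted absent.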
Inference Speed That Makes Scale Feasible
SAM 2 achieves real-time inference at approximately 44 frames per second on a single NVIDIA A100 GPU, using PyTorch 2.3.1 with automatic mixed precision. This inference speed holds for all model variants except the largest. The throughput is sufficient for real-time interactive annotation, meaning annotators are not waiting for model predictions. They are reviewing them as fast as they can navigate the timeline.
Published Performance Benchmarks
How to Structure a SAM 2 Video Annotation Workflow
Integrating SAM 2 into a production video annotation pipeline requires more than switching on AI-assisted labeling. The workflow needs to be structured to capture the throughput advantage while maintaining the quality controls that production training data requires.
Step 1: Define the Annotation Schema Before Processing Any Video
Video annotation schemas must account for temporal behavior in ways that image annotation schemas do not. Before annotating a single frame, the schema should specify: which object categories are annotated, how object identity is maintained when objects enter and exit the frame, what constitutes an occluded object versus an out-of-frame object, the minimum number of visible pixels required before an object receives a label, and whether partially visible objects at frame edges receive annotations.
These decisions need to be made at the schema level rather than left to annotator judgment. In a large video dataset, inconsistent handling of edge cases accumulates into systematic labeling errors that reduce model performance in exactly the scenarios where the model most needs reliable training signal.
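
Encoding these decisions as data, rather than as prose guidelines, makes them enforceable by tooling. The following is a minimal sketch of what that could look like; the class, field names, and example values are hypothetical, and it covers only some of the decisions listed above (the occluded versus out-of-frame distinction would need additional fields).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoAnnotationSchema:
    categories: tuple
    min_visible_pixels: int        # below this, the object gets no label
    label_partial_at_edges: bool   # annotate objects cut off by the frame edge?
    keep_id_after_reentry: bool    # same track ID when an object leaves and returns?

    def should_label(self, category, visible_pixels, touches_frame_edge):
        """Apply the schema's rules to one object on one frame."""
        if category not in self.categories:
            return False
        if visible_pixels < self.min_visible_pixels:
            return False
        if touches_frame_edge and not self.label_partial_at_edges:
            return False
        return True


schema = VideoAnnotationSchema(
    categories=("car", "pedestrian"),
    min_visible_pixels=50,
    label_partial_at_edges=False,
    keep_id_after_reentry=True,
)

print(schema.should_label("car", 200, touches_frame_edge=False))  # True
print(schema.should_label("car", 20, touches_frame_edge=False))   # False
```

Because the schema is frozen, it cannot be changed partway through a project without creating a new, explicitly versioned object.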
Step 2: Select Keyframes Strategically
SAM 2 propagates masks forward through a video clip from any prompted frame, using its streaming memory to carry object state from frame to frame. The quality of propagation depends on how representative the prompted frames are of the object's appearance and position across the clip.
For short clips with limited object movement, a single keyframe prompt is often sufficient. For longer clips with significant object movement, appearance changes, or frequent occlusions, multiple keyframes are required. A practical keyframe selection approach: prompt on the first frame, verify propagation quality at the midpoint and end of the clip, and add keyframe prompts at any point where propagation accuracy falls below your quality threshold. This requires annotators to review the full propagated sequence rather than working frame by frame, which is structurally faster even when multiple keyframes are required.
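
The keyframe selection approach above can be written as a short loop. This is a sketch under stated assumptions: `propagation_quality` is a hypothetical callback (in practice a reviewer rating or a comparison against a spot-check annotation), and the decaying stub below exists only so the example runs.

```python
def select_keyframes(num_frames, propagation_quality, threshold):
    """Prompt on frame 0, then add keyframes at checkpoints that fall below threshold.

    propagation_quality(frame_idx, keyframes) is a hypothetical callback that
    returns a score in [0, 1] for a frame, given the keyframes prompted so far.
    """
    keyframes = [0]
    checkpoints = [num_frames // 2, num_frames - 1]  # midpoint and end of the clip
    for frame_idx in checkpoints:
        if propagation_quality(frame_idx, keyframes) < threshold:
            keyframes.append(frame_idx)
    return keyframes


def decaying_quality(frame_idx, keyframes):
    """Stub for illustration: quality falls off with distance from the nearest keyframe."""
    nearest = min(abs(frame_idx - k) for k in keyframes)
    return max(0.0, 1.0 - nearest / 200)


print(select_keyframes(100, decaying_quality, threshold=0.5))  # [0]
print(select_keyframes(300, decaying_quality, threshold=0.5))  # [0, 150, 299]
```

With the stub, the short clip needs only the initial prompt while the longer clip picks up keyframes at the midpoint and the end, mirroring the short-clip versus long-clip distinction in the text.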
Step 3: Apply Human Review at the Propagated Mask Level
The review workflow for SAM 2-assisted video annotation is fundamentally different from frame-by-frame review. Annotators review propagated mask sequences rather than individual frames, which means the review unit is the object track across the clip rather than the frame. An annotator checking a correctly propagated mask sequence for a single object across a 300-frame clip can do so far faster than reviewing 300 independent annotations.
Structure the review workflow to flag frames where SAM 2's occlusion head predicted low confidence, frames where mask area changes significantly between adjacent frames (which may indicate tracking failure), and frames near the beginning and end of each object's presence in the clip. These are the areas where propagation errors are most likely to occur.
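
The three flagging rules can be combined into one pass over a track. The sketch below assumes per-frame mask areas and per-frame confidences in [0, 1] are already available from your tooling; the function name, thresholds, and reason labels are illustrative, and for simplicity it treats the track as a single presence interval from the first to the last visible frame.

```python
def flag_frames_for_review(areas, confidences, min_confidence=0.5,
                           max_area_change=0.3, boundary_margin=2):
    """Return {frame_idx: [reasons]} for one object track.

    areas[i] is the mask area in pixels on frame i (0 when no mask was produced).
    confidences[i] is a per-frame presence confidence in [0, 1] (assumed input).
    """
    flags = {}

    def flag(idx, reason):
        flags.setdefault(idx, []).append(reason)

    # Rule 1: low presence confidence.
    for idx, conf in enumerate(confidences):
        if conf < min_confidence:
            flag(idx, "low_confidence")

    # Rule 2: large relative change in mask area between adjacent visible frames.
    for idx in range(1, len(areas)):
        prev, cur = areas[idx - 1], areas[idx]
        if prev > 0 and cur > 0 and abs(cur - prev) / prev > max_area_change:
            flag(idx, "area_jump")

    # Rule 3: frames near the first and last visible frame of the track.
    visible = [i for i, a in enumerate(areas) if a > 0]
    if visible:
        first, last = visible[0], visible[-1]
        for idx in visible:
            if idx - first < boundary_margin or last - idx < boundary_margin:
                flag(idx, "track_boundary")
    return flags


areas = [0, 100, 105, 110, 200, 205, 0]
confidences = [0.1, 0.9, 0.9, 0.9, 0.9, 0.4, 0.1]
flags = flag_frames_for_review(areas, confidences, boundary_margin=1)
print(sorted(flags))  # [0, 1, 4, 5, 6]
```

Reviewers can then jump directly to flagged frames and skim the rest of the track at playback speed.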
Step 4: Handle Occlusions Explicitly in Your Quality Protocol
Object occlusion is the most common source of propagation failure in SAM 2-assisted video annotation. When an object is occluded for an extended period and then reappears, SAM 2 resumes tracking from the last memory state. If the object's appearance has changed significantly during the occlusion (due to lighting change, viewpoint change, or partial disappearance), the resumed track may drift.
For annotation projects with frequent occlusions, build an explicit re-prompt protocol: when an annotator identifies a tracking failure after an occlusion, they provide a new prompt on the first post-occlusion frame where the object is clearly visible. This correction is stored in the memory bank and propagates forward from that point. The total number of manual corrections required is significantly lower than frame-by-frame annotation even in occlusion-heavy footage.
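
To support the re-prompt protocol, tooling can suggest where the new prompt should go. The sketch below finds, for each occlusion gap, the first later frame where the object is clearly visible. Defining "clearly visible" as a minimum visible area is an assumption for illustration, as are the function and parameter names; the annotator still decides whether a re-prompt is actually needed.

```python
def reprompt_frames(present, areas, min_clear_area):
    """For each occlusion gap, return the first later frame where the object is clearly visible.

    present[i] says whether the object is visible on frame i.
    areas[i] is its visible area in pixels on frame i.
    """
    frames = []
    awaiting_reprompt = False
    seen_object = False
    for idx, (is_present, area) in enumerate(zip(present, areas)):
        if not is_present:
            # Only count gaps that start after the object has appeared at least once.
            awaiting_reprompt = seen_object
        else:
            seen_object = True
            if awaiting_reprompt and area >= min_clear_area:
                frames.append(idx)
                awaiting_reprompt = False
    return frames


present = [True, True, False, False, True, True, True]
areas = [500, 480, 0, 0, 40, 300, 510]
print(reprompt_frames(present, areas, min_clear_area=200))  # [5]
```

Frame 4 is skipped because the object is only partially visible there; frame 5 is the first frame that meets the area requirement after the gap.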
Quality Control for SAM 2-Assisted Video Annotation
The throughput advantage of SAM 2 only translates into production value if the resulting annotations meet the quality thresholds required for model training. Quality control for SAM 2-assisted video annotation differs from both frame-by-frame annotation review and image annotation review.
Step 5: Run an Independent Quality Verification Pass Before Export
Propagated mask sequences should go through an independent quality verification pass before export to the training pipeline. The verification step should confirm that each object maintains consistent identity across the full clip, that occluded frames are handled according to the schema specification, and that mask boundaries are accurate on a sampled subset of frames rather than only on the keyframes.
Meta's SAM 2 data engine used a verification step in which a separate set of annotators classified each masklet as satisfactory or unsatisfactory before it was added to the training dataset. Unsatisfactory masklets were returned to the annotation pipeline for correction. For production annotation projects, building an equivalent verification stage is the quality control mechanism that makes SAM 2-assisted annotation reliable at scale.
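
A verification stage of this kind has two mechanical parts: choosing which frames to spot-check and routing each masklet by the verifier's verdict. The following is a minimal sketch; the function names, the dictionary shape of a masklet, and the fixed random seed are illustrative assumptions, and `is_satisfactory` stands in for the human verifier's decision.

```python
import random


def sample_frames(num_frames, keyframes, sample_size, seed=0):
    """Pick non-keyframe frames to spot-check, so review is not limited to prompted frames."""
    keyframe_set = set(keyframes)
    candidates = [i for i in range(num_frames) if i not in keyframe_set]
    rng = random.Random(seed)  # seeded so the same sample can be re-audited
    return sorted(rng.sample(candidates, min(sample_size, len(candidates))))


def verify_masklets(masklets, is_satisfactory):
    """Split masklets into accepted ones and ones returned for correction."""
    accepted, returned = [], []
    for masklet in masklets:
        (accepted if is_satisfactory(masklet) else returned).append(masklet)
    return accepted, returned


spot_checks = sample_frames(num_frames=300, keyframes=[0, 150], sample_size=10)

masklets = [{"id": 1, "verdict": "satisfactory"},
            {"id": 2, "verdict": "unsatisfactory"}]
accepted, returned = verify_masklets(
    masklets, lambda m: m["verdict"] == "satisfactory")
print([m["id"] for m in accepted], [m["id"] for m in returned])  # [1] [2]
```

Only the accepted list moves on to export; the returned list re-enters the annotation queue for correction.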
Quality Metrics to Track for Video Annotation Projects
Common Failure Modes in SAM 2 Video Annotation
What SAM 2 Does Not Replace in Video Annotation
SAM 2 accelerates the mechanical work of video annotation substantially. It does not replace the human judgment that determines annotation quality. Several aspects of video annotation remain human-dependent regardless of SAM 2's capabilities.
Schema Development
Domain-Specific Judgment
Quality Verification
Video Annotation Use Cases Where SAM 2 Creates the Most Value

SAM 2's throughput gains are not uniform across annotation types. The model creates the greatest value in video annotation scenarios where frame-by-frame approaches are most expensive.
Autonomous Vehicle Dataset Annotation
Sports and Motion Analysis
Drone and Aerial Footage Annotation
Medical and Surgical Video Annotation
How Scematics Supports Video Annotation at Scale
Scematics integrates SAM 2 into its video annotation platform, enabling annotators to use keyframe prompting and automated mask propagation within a workflow that includes configurable quality review stages, annotator performance analytics, and multi-format dataset export.
Platform Capabilities for Video Annotation
Managed Video Annotation Services
Related Resources
Video Annotation at Scale: Frequently Asked Questions
How much faster is SAM 2 than frame-by-frame video annotation?
What is the main problem with frame-by-frame video annotation?
What is a keyframe in SAM 2 video annotation?
How does SAM 2 handle objects that go behind other objects?
What video annotation export formats does Scematics support?
What does SAM 2 not do in video annotation?
How many objects can SAM 2 track simultaneously in a video?
Scale Your Video Annotation with SAM 2
Scematics integrates SAM 2 natively with keyframe propagation, configurable quality review workflows, annotator performance analytics, and multi-format export built in. Self-serve annotation platform access and fully managed video annotation services are both available.
If you are assessing SAM 2 for a specific video annotation project, the Scematics team can review your dataset requirements, footage characteristics, and quality thresholds to recommend the workflow configuration that delivers reliable training data at your target scale.