Video matting remains limited by the scale and realism of existing datasets. While leveraging segmentation data can enhance semantic stability, the lack of effective boundary supervision often leads to segmentation-like mattes that lack fine details. To address this, we introduce a learned Matting Quality Evaluator (MQE) that assesses the semantic and boundary quality of alpha mattes without ground truth. It produces a pixel-wise evaluation map that identifies reliable and erroneous regions, enabling fine-grained quality assessment. The MQE scales up video matting in two ways: (1) as an online matting-quality feedback signal during training that suppresses erroneous regions and provides comprehensive supervision, and (2) as an offline selection module for data curation that improves annotation quality by combining the strengths of leading video and image matting models. This process allows us to build VMReal, a large-scale real-world video matting dataset containing 28K clips and 2.4M frames. To handle large appearance variations in long videos, we further introduce a reference-frame training strategy that incorporates long-range frames beyond the local training window. MatAnyone 2 achieves state-of-the-art performance on both synthetic and real-world benchmarks, surpassing prior methods across all metrics.
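To make the idea of a pixel-wise evaluation map concrete, below is a minimal PyTorch-style sketch of what an MQE-like module could look like. The class name, architecture, channel sizes, and the 0.5 threshold are illustrative assumptions and do not reproduce the paper's actual design.

# Illustrative sketch only: a small network that takes an RGB frame and a
# predicted alpha matte and outputs a per-pixel reliability map (1 = reliable,
# 0 = erroneous), without needing ground-truth alpha.
import torch
import torch.nn as nn

class MattingQualityEvaluator(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        # Input: RGB frame (3 channels) concatenated with predicted alpha (1 channel).
        self.net = nn.Sequential(
            nn.Conv2d(4, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 3, padding=1),
        )

    def forward(self, image, alpha):
        logits = self.net(torch.cat([image, alpha], dim=1))
        return torch.sigmoid(logits)  # [B, 1, H, W] soft reliability map

# Example usage (hypothetical): threshold the soft map into a binary evaluation map.
# mqe = MattingQualityEvaluator()
# eval_map = (mqe(image, pred_alpha) > 0.5).float()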
Unlike MatAnyone, MatAnyone 2 is trained on our real-world VMReal dataset, where each alpha label is paired with a binary evaluation map. This enables a masked matting loss \( \mathcal{L}_{mat}^M \) computed only on reliable pixels, while the learned Matting Quality Evaluator (MQE) supplies \( \mathcal{L}_{eval} \) to supervise both core and boundary regions. To handle large appearance variations, a reference-frame strategy adds long-range temporal cues, improving robustness without extra memory cost.
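As a concrete illustration of the losses described above, here is a hedged Python sketch. The plain L1 base loss and the specific form of the feedback term are assumptions for illustration, not the paper's exact formulation; `mqe` stands for an evaluator module like the one sketched earlier.

import torch
import torch.nn.functional as F

def masked_matting_loss(pred_alpha, gt_alpha, eval_map, eps=1e-6):
    # L_mat^M: supervise only pixels the binary evaluation map marks as reliable.
    per_pixel = F.l1_loss(pred_alpha, gt_alpha, reduction="none")
    return (per_pixel * eval_map).sum() / (eval_map.sum() + eps)

def evaluator_feedback_loss(pred_alpha, image, mqe):
    # L_eval (illustrative stand-in): encourage predictions that a frozen MQE
    # rates as reliable, covering both core and boundary regions.
    quality = mqe(image, pred_alpha)   # [B, 1, H, W], 1 = reliable
    return (1.0 - quality).mean()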
Our automated dual-branch annotation pipeline. This pipeline enables large-scale construction of real-world video matting (VM) datasets, resulting in our VMReal dataset. We combine two complementary annotation branches: (1) the temporally stable \(B_V\) branch provides the base annotation, while (2) the detail-preserving \(B_I\) branch contributes fine boundary details. Pixels where \(B_V\) fails but \(B_I\) succeeds are selectively integrated to produce the fused alpha and its evaluation map. The example on the right shows that the fused annotation maintains semantic stability while preserving rich boundary details, making it suitable for VM training.
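The fusion rule described in the caption can be sketched as follows. The function name and the exact definition of the output evaluation map are assumptions; the per-pixel reliability maps are taken as given (e.g., produced by the MQE).

import numpy as np

def fuse_annotations(alpha_v, alpha_i, reliable_v, reliable_i):
    # alpha_v, alpha_i: alphas from the B_V and B_I branches, floats in [0, 1], shape [H, W]
    # reliable_v, reliable_i: binary per-pixel reliability maps, shape [H, W]
    take_i = (reliable_v == 0) & (reliable_i == 1)      # B_V fails but B_I succeeds
    fused_alpha = np.where(take_i, alpha_i, alpha_v)     # selectively integrate B_I
    eval_map = np.logical_or(reliable_v == 1, take_i).astype(np.uint8)  # reliable pixels of the fused label
    return fused_alpha, eval_map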
@article{yang2025matanyone2,
  title   = {{MatAnyone 2}: Scaling Video Matting via a Learned Quality Evaluator},
  author  = {Yang, Peiqing and Zhou, Shangchen and Hao, Kai and Tao, Qingyi},
  journal = {arXiv preprint arXiv:2512.11782},
  year    = {2025}
}