MatAnyone 2

Scaling Video Matting via a Learned Quality Evaluator

Peiqing Yang¹,   Shangchen Zhou¹†,   Kai Hao¹,   Qingyi Tao²
¹S-Lab, Nanyang Technological University   ²SenseTime Research, Singapore
†Project lead
(Teaser videos: input footage alongside the corresponding MatAnyone 2 alpha mattes)

MatAnyone 2 is a practical human video matting framework that preserves fine details by avoiding segmentation-like boundaries, while also showing enhanced robustness under challenging real-world conditions.

Abstract

Video matting remains limited by the scale and realism of existing datasets. While leveraging segmentation data can enhance semantic stability, the lack of effective boundary supervision often leads to segmentation-like mattes that miss fine details. To address this, we introduce a learned Matting Quality Evaluator (MQE) that assesses the semantic and boundary quality of alpha mattes without ground truth. It produces a pixel-wise evaluation map that identifies reliable and erroneous regions, enabling fine-grained quality assessment. The MQE scales up video matting in two ways: (1) as online matting-quality feedback during training, suppressing erroneous regions to provide comprehensive supervision, and (2) as an offline selection module for data curation, improving annotation quality by combining the strengths of leading video and image matting models. This pipeline allows us to build VMReal, a large-scale real-world video matting dataset containing 28K clips and 2.4M frames. To handle large appearance variations in long videos, we further introduce a reference-frame training strategy that incorporates long-range frames beyond the local window for effective training. Our MatAnyone 2 achieves state-of-the-art performance on both synthetic and real-world benchmarks, surpassing prior methods across all metrics.
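For intuition, a minimal PyTorch-style sketch of the MQE interface is shown below: given an RGB frame and a candidate alpha matte, it predicts a pixel-wise evaluation map in \([0, 1]\), with 1 marking reliable regions and 0 marking erroneous ones. The class name and the tiny convolutional stack are illustrative placeholders, not the actual architecture.

import torch
import torch.nn as nn

class MattingQualityEvaluator(nn.Module):
    """Sketch of an MQE head: (image, candidate alpha) -> pixel-wise quality map.
    A small conv stack stands in for the real, unspecified architecture."""

    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 1),
        )

    def forward(self, image, alpha):
        # image: (B, 3, H, W), alpha: (B, 1, H, W), both in [0, 1]
        x = torch.cat([image, alpha], dim=1)   # fuse appearance and matte cues
        return torch.sigmoid(self.net(x))      # evaluation map: 1 = reliable, 0 = erroneous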

Method

Unlike MatAnyone, MatAnyone 2 is trained on our real-world VMReal dataset, where each alpha label is paired with a binary evaluation map. This enables a masked matting loss \( \mathcal{L}_{mat}^M \) computed only on reliable pixels, while the learned Matting Quality Evaluator (MQE) supplies \( \mathcal{L}_{eval} \) to supervise both core and boundary regions. To handle large appearance variations, a reference-frame strategy adds long-range temporal cues beyond the local window, improving robustness without extra memory cost.
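As a concrete illustration, a minimal sketch of a masked matting loss follows, assuming a per-pixel L1 term; the paper's exact composition of \( \mathcal{L}_{mat}^M \) may differ.

import torch

def masked_matting_loss(pred_alpha, gt_alpha, eval_map, eps=1e-6):
    # pred_alpha, gt_alpha: (B, 1, H, W) alpha mattes in [0, 1]
    # eval_map:             (B, 1, H, W) binary map, 1 = reliable label, 0 = erroneous
    per_pixel = torch.abs(pred_alpha - gt_alpha)   # dense L1 error
    masked = per_pixel * eval_map                  # zero out unreliable pixels
    return masked.sum() / (eval_map.sum() + eps)   # average over reliable pixels only

In the same spirit, the online MQE feedback evaluates the model's own predictions and down-weights regions it flags as erroneous; the exact form of \( \mathcal{L}_{eval} \) is not reproduced here.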

Baseline Comparisons

(Side-by-side video comparisons: input, GVM vs. MatAnyone 2, and MatAnyone vs. MatAnyone 2)

Data Pipeline

Our automated dual-branch annotation pipeline. This pipeline enables large-scale construction of real-world Video Matting (VM) datasets, resulting in our VMReal dataset. We combine two complementary annotation branches: (1) the temporally stable \(B_V\) branch provides the base annotation, while (2) the detail-preserving \(B_I\) branch offers fine boundary details. Pixels where \(B_V\) fails but \(B_I\) succeeds are selectively integrated to produce the fused alpha and its evaluation map. The example on the right shows that the fused annotation preserves semantic stability as well as rich boundary details, making it suitable for VM training.
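A minimal NumPy sketch of the fusion rule described above is given here, assuming per-pixel boolean reliability maps ok_v and ok_i for the two branches (e.g., derived from the MQE); the names and the exact selection criterion are illustrative.

import numpy as np

def fuse_annotations(alpha_v, alpha_i, ok_v, ok_i):
    # alpha_v: alpha from the temporally stable video-matting branch (B_V), used as the base
    # alpha_i: alpha from the detail-preserving image-matting branch (B_I)
    # ok_v, ok_i: boolean (H, W) reliability maps for each branch
    fused = alpha_v.copy()
    take_i = (~ok_v) & ok_i                        # B_V fails but B_I succeeds
    fused[take_i] = alpha_i[take_i]                # borrow the detailed estimate
    eval_map = (ok_v | ok_i).astype(np.float32)    # 1 where at least one branch is reliable
    return fused, eval_map

Pixels that are unreliable in both branches receive a zero in the evaluation map and are excluded by the masked matting loss during training.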

MatAnyone 2 Demo

BibTeX

@article{yang2025matanyone2,
      title   = {{MatAnyone 2}: Scaling Video Matting via a Learned Quality Evaluator},
      author  = {Yang, Peiqing and Zhou, Shangchen and Hao, Kai and Tao, Qingyi},
      journal = {arXiv preprint arXiv:2512.11782},
      year    = {2025}
}