Auxiliary-free human video matting methods, which rely solely on input frames, often struggle with complex or ambiguous backgrounds. To address this, we propose MatAnyone, a robust framework tailored for target-assigned video matting. Specifically, building on a memory-based paradigm, we introduce a consistent memory propagation module via region-adaptive memory fusion, which adaptively integrates memory from the previous frame. This ensures semantic stability in core regions while preserving fine-grained details along object boundaries. For robust training, we present a larger, high-quality, and diverse dataset for video matting. Additionally, we incorporate a novel training strategy that efficiently leverages large-scale segmentation data, boosting matting stability. With this new network design, dataset, and training strategy, MatAnyone delivers robust and accurate video matting results in diverse real-world scenarios, outperforming existing methods.
MatAnyone is a memory-based framework for video matting. Given a target segmentation map in the first frame, our model achieves stable and high-quality matting through consistent memory propagation, using a region-adaptive memory fusion module to combine information from the previous and current frames. To overcome the scarcity of real video matting data, we adopt a new training strategy that effectively leverages matting data for fine-grained detail and segmentation data for semantic stability, each supervised with dedicated losses.
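As a rough illustration of the fusion idea, the sketch below (in PyTorch, with module and tensor names that are our own assumptions rather than the released code) blends the value features propagated from the previous frame with those queried for the current frame via a per-pixel weight, so core regions stay semantically stable while boundary regions follow the current frame.

```python
import torch
import torch.nn as nn

class RegionAdaptiveMemoryFusion(nn.Module):
    """Minimal sketch of region-adaptive memory fusion (names and shapes assumed).

    A small head predicts a per-pixel "change" weight from the current-frame
    query features and the memory propagated from the previous frame. Core
    regions (small change) keep the propagated memory value; boundary regions
    (large change) take the value freshly queried from the current frame.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.change_head = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 1),
            nn.Sigmoid(),  # per-pixel weight in [0, 1]
        )

    def forward(self, current_value: torch.Tensor, previous_value: torch.Tensor) -> torch.Tensor:
        # current_value / previous_value: (B, C, H, W) value features for the
        # current frame and propagated from the previous frame, respectively.
        w = self.change_head(torch.cat([current_value, previous_value], dim=1))
        # w -> 1 near boundaries (trust the current frame),
        # w -> 0 in core regions (reuse previous-frame memory for stability).
        return w * current_value + (1.0 - w) * previous_value
```

The soft, per-pixel weight is what makes the fusion "region-adaptive": it lets a single module preserve fine boundary detail without sacrificing temporal consistency in the object interior.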
Assigning the target object in the first frame gives flexibility for instance/interactive video matting. Thanks to the success of promptable segmentation methods, the target object can be assigned with just a few clicks (the segmentation mask is annotated in the figure). MatAnyone demonstrates superior performance in instance video matting, particularly in maintaining stable object tracking and preserving fine-grained alpha matte details.
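For example, a first-frame target mask can be obtained with a few clicks using a promptable segmenter such as SAM and then handed to the matting model. In the sketch below, the checkpoint path and click coordinates are illustrative, and `run_matanyone` is a hypothetical placeholder rather than the released interface.

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load the video frames (RGB, uint8).
frames = []
cap = cv2.VideoCapture("input.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
first_frame = frames[0]

# Promptable segmentation on the first frame (SAM used as one example).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)
predictor.set_image(first_frame)

# A couple of positive clicks on the target (coordinates are examples).
point_coords = np.array([[320, 180], [350, 260]])
point_labels = np.array([1, 1])  # 1 = foreground click
masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=False,
)
first_frame_mask = masks[0]  # HxW boolean mask assigning the target

# Hypothetical matting call: the first-frame mask assigns the target,
# and alpha mattes are propagated through the remaining frames.
alphas = run_matanyone(frames, first_frame_mask)  # placeholder API
```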
Given the first-frame segmentation mask, the model predicts the first-frame alpha matte from it, and the quality of this prediction affects performance on all subsequent frames. The sequential prediction in the memory-based paradigm enables recurrent refinement at inference time without retraining. This (1) enhances robustness to the given segmentation mask and (2) refines matting details to image-matting-level quality.
Figure: input frame vs. iteratively refined alpha matte (initialized from the segmentation mask).
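A minimal sketch of this recurrent refinement is shown below; `model.step` and its arguments are hypothetical stand-ins for a single memory read/update and decode pass, not the actual API.

```python
import torch

@torch.no_grad()
def refine_first_frame(model, first_frame, seg_mask, num_iters: int = 10):
    """Sketch of recurrent first-frame refinement (hypothetical API).

    The memory-based model is applied repeatedly to the *same* first frame,
    each pass conditioning on the alpha matte produced by the previous pass.
    No retraining is needed: the inference loop itself turns a coarse
    segmentation mask into a detailed alpha matte and improves robustness
    to imperfect input masks.
    """
    alpha = seg_mask.float()  # initialize memory with the given segmentation
    for _ in range(num_iters):
        # One memory read/update + decode on the first frame (placeholder call).
        alpha = model.step(first_frame, prev_alpha=alpha)
    return alpha
```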
@article{yang2025matanyone,
  title   = {{MatAnyone}: Stable Video Matting with Consistent Memory Propagation},
  author  = {Yang, Peiqing and Zhou, Shangchen and Zhao, Jixin and Tao, Qingyi and Loy, Chen Change},
  journal = {arXiv preprint arXiv:2501.14677},
  year    = {2025}
}