Abstract
MaskDistill, a unified masked image modeling technique, reconstructs normalized semantic features for improved performance in image classification and semantic segmentation.
Masked image modeling has demonstrated great potential to eliminate the label-hungry problem of training large-scale vision Transformers, achieving impressive performance on various downstream tasks. In this work, we propose a unified view of masked image modeling after revisiting existing methods. Under the unified view, we introduce a simple yet effective method, termed MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions, conditioned on corrupted input images. Experimental results on image classification and semantic segmentation show that MaskDistill achieves performance comparable or superior to state-of-the-art methods. When using the huge vision Transformer and pretraining for 300 epochs, MaskDistill obtains 88.3% fine-tuning top-1 accuracy on ImageNet-1k (224 size) and 58.8% semantic segmentation mIoU on ADE20k (512 size). The code and pretrained models will be available at https://aka.ms/unimim.
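The objective described in the abstract (regressing normalized teacher features at the masked positions only) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the per-patch layer normalization, and the smooth-L1 distance are assumptions chosen for illustration, and plain Python lists stand in for the tensors a real training loop would use.

```python
import math

def layer_norm(vec, eps=1e-6):
    """Normalize one feature vector to zero mean and (near) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def smooth_l1(pred, target, beta=1.0):
    """Mean smooth-L1 distance between two equal-length vectors (assumed loss)."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)

def masked_distill_loss(student_pred, teacher_feat, mask):
    """Average the per-patch loss over masked positions only.

    student_pred, teacher_feat: one feature vector per patch.
    mask: one bool per patch, True where the input patch was masked out.
    """
    losses = [
        smooth_l1(pred, layer_norm(feat))  # target = normalized teacher feature
        for pred, feat, masked in zip(student_pred, teacher_feat, mask)
        if masked
    ]
    if not losses:
        raise ValueError("mask selects no patches")
    return sum(losses) / len(losses)

# Toy example: 3 patches with 4-dimensional features; patches 0 and 2 are masked.
teacher_feat = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0, 1.0], [2.0, 4.0, 6.0, 8.0]]
mask = [True, False, True]
perfect_pred = [layer_norm(f) for f in teacher_feat]
zero_pred = [[0.0] * 4 for _ in teacher_feat]

loss_perfect = masked_distill_loss(perfect_pred, teacher_feat, mask)
loss_zero = masked_distill_loss(zero_pred, teacher_feat, mask)
```

In this sketch, a student that exactly predicts the normalized teacher features incurs zero loss, and predictions at unmasked positions do not affect the result. In practice the student would see the corrupted image while the teacher sees the original; that data flow is omitted here.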
Community
I implemented MaskDistill from scratch in PyTorch and reproduced the paper's results with ViT-Base. Code and pre-trained weights are open sourced:
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation (2026)
- Masked Next-Scale Prediction for Self-supervised Scene Text Recognition (2026)
- Vision Foundation Models as Generalist Tokenizers for Image Generation (2026)
- TAP into the Patch Tokens: Leveraging Vision Foundation Model Features for AI-Generated Image Detection (2026)
- Learning from Semantic Dictionaries: Discriminative Codebook Contrastive Learning for Unified Visual Representation and Generation (2026)
- Let ViT Speak: Generative Language-Image Pre-training (2026)
- AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection (2026)
Get this paper in your agent:
hf papers read 2210.10615
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper 1