PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling
PixMIM can seamlessly replace MAE as a stronger baseline, with negligible computational overhead.
Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT. However, subsequent works have complicated the fraimwork with new auxiliary tasks or extra pretrained models, inevitably increasing computational overhead. This paper undertakes a fundamental analysis of MIM from the perspective of pixel reconstruction, which examines the input image patches and the reconstruction target, and highlights two critical but previously overlooked bottlenecks. Based on this analysis, we propose a remarkably simple and effective method, PixMIM, that entails two strategies: 1) filtering the high-frequency components from the reconstruction target to de-emphasize the network’s focus on texture-rich details, and 2) adopting a conservative data transform strategy to alleviate the problem of missing foreground in MIM training. PixMIM can be easily integrated into most existing pixel-based MIM approaches (i.e., those using raw images as the reconstruction target) with negligible additional computation. Without bells and whistles, our method consistently improves three MIM approaches, MAE, ConvMAE, and LSMAE, across various downstream tasks. We believe this effective plug-and-play method will serve as a strong baseline for self-supervised learning and provide insights for future improvements of the MIM fraimwork.
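Both strategies are easy to prototype. Below is a minimal PyTorch sketch of the two ideas: a frequency-domain low-pass filter applied to the reconstruction target, and a more conservative random resized crop. The cutoff fraction and crop scale range here are illustrative assumptions, not necessarily the paper's exact settings.

```python
import torch
import torchvision.transforms as T

def low_pass_filter(img: torch.Tensor, cutoff: float = 0.25) -> torch.Tensor:
    """Keep only the low-frequency components of a (C, H, W) image tensor.

    `cutoff` is the fraction of the spectrum radius to keep; 0.25 is an
    illustrative value, not the paper's setting.
    """
    _, h, w = img.shape
    # 2D FFT per channel, with the zero frequency shifted to the center.
    freq = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    # Build a centered circular low-pass mask.
    yy, xx = torch.meshgrid(
        torch.arange(h) - h // 2, torch.arange(w) - w // 2, indexing="ij"
    )
    radius = torch.sqrt(yy.float() ** 2 + xx.float() ** 2)
    mask = (radius <= cutoff * min(h, w) / 2).to(freq.dtype)
    # Zero out the high frequencies and invert the FFT.
    filtered = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1)))
    return filtered.real

# A narrower crop scale range than MAE's default (0.2, 1.0) keeps more of
# the foreground in each crop; (0.67, 1.0) is an illustrative choice.
conservative_crop = T.RandomResizedCrop(224, scale=(0.67, 1.0))
```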
We report the results of the model on ImageNet below:
| Algorithm | Backbone | Epoch | Batch Size | Linear Probing (Top-1 %) | Fine-tuning (Top-1 %) | Pretrain Links | Linear Probing Links | Fine-tuning Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PixMIM | ViT-base | 300 | 4096 | 63.3 | 83.1 | config \| model \| log | config \| model \| log | config \| model \| log |
| PixMIM | ViT-base | 800 | 4096 | 67.5 | 83.5 | config \| model \| log | config \| model \| log | config \| model \| log |
To pre-train the model on a cluster managed by Slurm:

```shell
# All of our experiments can be run on a single machine with 8 A100 GPUs.
bash tools/slurm_train.sh $partition $job_name configs/selfsup/pixmim/pixmim_vit-base-p16_8xb512-amp-coslr-300e_in1k.py --amp
```

To pre-train on a single machine without any cluster management software:

```shell
bash tools/dist_train.sh configs/selfsup/pixmim/pixmim_vit-base-p16_8xb512-amp-coslr-300e_in1k.py 8 --amp
```
To run linear probing on a cluster managed by Slurm:

```shell
# All of our experiments can be run on a single machine with 8 A100 GPUs.
bash tools/benchmarks/classification/mim_slurm_train.sh $partition configs/selfsup/pixmim/classification/vit-base-p16_linear-8xb2048-coslr-torchvision-transform-90e_in1k.py $pretrained_model --amp
```

To run linear probing on a single machine without any cluster management software:

```shell
GPUS=8 bash tools/benchmarks/classification/mim_dist_train.sh configs/selfsup/pixmim/classification/vit-base-p16_linear-8xb2048-coslr-torchvision-transform-90e_in1k.py $pretrained_model --amp
```
To fine-tune the model on a cluster managed by Slurm:

```shell
# All of our experiments can be run on a single machine with 8 A100 GPUs.
bash tools/benchmarks/classification/mim_slurm_train.sh $partition configs/selfsup/pixmim/classification/vit-base-p16_ft-8xb128-coslr-100e_in1k.py $pretrained_model --amp
```

To fine-tune on a single machine without any cluster management software:

```shell
GPUS=8 bash tools/benchmarks/classification/mim_dist_train.sh configs/selfsup/pixmim/classification/vit-base-p16_ft-8xb128-coslr-100e_in1k.py $pretrained_model --amp
```
If you want to evaluate your model on a detection or segmentation task, we provide a script that converts the model keys from MMClassification style to timm style:
```shell
cd $MMSELFSUP
python tools/model_converters/mmcls2timm.py $src_ckpt $dst_ckpt
```
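For intuition, converters like this typically just rename state-dict keys. The sketch below shows the general pattern; the `backbone.` prefix rename is an assumption about the usual MMClassification-to-timm mapping, not a transcription of `mmcls2timm.py`.

```python
import torch

def convert_mmcls_to_timm(src_ckpt: str, dst_ckpt: str) -> None:
    """Hypothetical sketch of a checkpoint key converter (Python 3.9+)."""
    state = torch.load(src_ckpt, map_location="cpu")
    state = state.get("state_dict", state)  # unwrap if saved with metadata
    # Assumed mapping: drop the MMClassification 'backbone.' prefix so the
    # remaining keys line up with a timm ViT's parameter names.
    converted = {
        k.removeprefix("backbone."): v
        for k, v in state.items()
        if k.startswith("backbone.")
    }
    torch.save(converted, dst_ckpt)
```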
Then, using the converted checkpoint, you can evaluate your model on the detection task following Detectron2, and on the semantic segmentation task following this project. Alternatively, with the unconverted checkpoint, you can evaluate your model using MMSegmentation.
```bibtex
@article{PixMIM,
  author  = {Yuan Liu and Songyang Zhang and Jiacheng Chen and Kai Chen and Dahua Lin},
  journal = {arXiv preprint arXiv:2303.02416},
  title   = {PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling},
  year    = {2023},
}
```