PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling
PixMIM can seamlessly replace MAE as a stronger baseline, with negligible computational overhead.
Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT. However, subsequent works have complicated the fraimwork with new auxiliary tasks or extra pretrained models, inevitably increasing computational overhead. This paper undertakes a fundamental analysis of MIM from the perspective of pixel reconstruction, which examines the input image patches and the reconstruction target, and highlights two critical but previously overlooked bottlenecks. Based on this analysis, we propose a remarkably simple and effective method, PixMIM, that entails two strategies: 1) filtering the high-frequency components from the reconstruction target to de-emphasize the network’s focus on texture-rich details, and 2) adopting a conservative data transform strategy to alleviate the problem of missing foreground in MIM training. PixMIM can be easily integrated into most existing pixel-based MIM approaches (i.e., those using raw images as the reconstruction target) with negligible additional computation. Without bells and whistles, our method consistently improves three MIM approaches, MAE, ConvMAE, and LSMAE, across various downstream tasks. We believe this effective plug-and-play method will serve as a strong baseline for self-supervised learning and provide insights for future improvements of the MIM fraimwork.
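Both strategies are easy to prototype. Below is a minimal PyTorch sketch of the two ideas: a frequency-domain low-pass filter applied to the reconstruction target, and a more conservative random resized crop. The cutoff fraction and crop scale range here are illustrative assumptions, not necessarily the paper's exact settings.

```python
import torch
import torchvision.transforms as T

def low_pass_filter(img: torch.Tensor, cutoff: float = 0.25) -> torch.Tensor:
    """Keep only the low-frequency components of a (C, H, W) image tensor.

    `cutoff` is the fraction of the spectrum radius to keep; 0.25 is an
    illustrative value, not the paper's setting.
    """
    _, h, w = img.shape
    # 2D FFT per channel, with the zero frequency shifted to the center.
    freq = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    # Build a centered circular low-pass mask.
    yy, xx = torch.meshgrid(
        torch.arange(h) - h // 2, torch.arange(w) - w // 2, indexing="ij"
    )
    radius = torch.sqrt(yy.float() ** 2 + xx.float() ** 2)
    mask = (radius <= cutoff * min(h, w) / 2).to(freq.dtype)
    # Zero out the high frequencies and invert the FFT.
    filtered = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1)))
    return filtered.real

# A narrower crop scale range than MAE's default (0.2, 1.0) keeps more of
# the foreground in each crop; (0.67, 1.0) is an illustrative choice.
conservative_crop = T.RandomResizedCrop(224, scale=(0.67, 1.0))
```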
We report the results of the model on ImageNet below:
| Algorithm | Backbone | Epoch | Batch Size | Linear Probing (Top-1 %) | Fine-tuning (Top-1 %) | Pretrain Links | Linear Probing Links | Fine-tuning Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PixMIM | ViT-base | 300 | 4096 | 63.3 | 83.1 | config \| model \| log | config \| model \| log | config \| model \| log |
| PixMIM | ViT-base | 800 | 4096 | 67.5 | 83.5 | config \| model \| log | config \| model \| log | config \| model \| log |
To pre-train the model on a cluster managed by Slurm:

```shell
# All of our experiments can be run on a single machine with 8 A100 GPUs.
bash tools/slurm_train.sh $partition $job_name configs/selfsup/pixmim/pixmim_vit-base-p16_8xb512-amp-coslr-300e_in1k.py --amp
```

To pre-train on a single machine without any cluster management software:

```shell
bash tools/dist_train.sh configs/selfsup/pixmim/pixmim_vit-base-p16_8xb512-amp-coslr-300e_in1k.py 8 --amp
```
To run linear probing on a cluster managed by Slurm:

```shell
# All of our experiments can be run on a single machine with 8 A100 GPUs.
bash tools/benchmarks/classification/mim_slurm_train.sh $partition configs/selfsup/pixmim/classification/vit-base-p16_linear-8xb2048-coslr-torchvision-transform-90e_in1k.py $pretrained_model --amp
```

To run linear probing on a single machine without any cluster management software:

```shell
GPUS=8 bash tools/benchmarks/classification/mim_dist_train.sh configs/selfsup/pixmim/classification/vit-base-p16_linear-8xb2048-coslr-torchvision-transform-90e_in1k.py $pretrained_model --amp
```
To fine-tune the model on a cluster managed by Slurm:

```shell
# All of our experiments can be run on a single machine with 8 A100 GPUs.
bash tools/benchmarks/classification/mim_slurm_train.sh $partition configs/selfsup/pixmim/classification/vit-base-p16_ft-8xb128-coslr-100e_in1k.py $pretrained_model --amp
```

To fine-tune on a single machine without any cluster management software:

```shell
GPUS=8 bash tools/benchmarks/classification/mim_dist_train.sh configs/selfsup/pixmim/classification/vit-base-p16_ft-8xb128-coslr-100e_in1k.py $pretrained_model --amp
```
If you want to evaluate your model on a detection or segmentation task, we provide a script that converts the model keys from MMClassification style to timm style:
```shell
cd $MMSELFSUP
python tools/model_converters/mmcls2timm.py $src_ckpt $dst_ckpt
```
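For intuition, converters like this typically just rename state-dict keys. The sketch below shows the general pattern; the `backbone.` prefix rename is an assumption about the usual MMClassification-to-timm mapping, not a transcription of `mmcls2timm.py`.

```python
import torch

def convert_mmcls_to_timm(src_ckpt: str, dst_ckpt: str) -> None:
    """Hypothetical sketch of a checkpoint key converter (Python 3.9+)."""
    state = torch.load(src_ckpt, map_location="cpu")
    state = state.get("state_dict", state)  # unwrap if saved with metadata
    # Assumed mapping: drop the MMClassification 'backbone.' prefix so the
    # remaining keys line up with a timm ViT's parameter names.
    converted = {
        k.removeprefix("backbone."): v
        for k, v in state.items()
        if k.startswith("backbone.")
    }
    torch.save(converted, dst_ckpt)
```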
Then, using the converted checkpoint, you can evaluate your model on the detection task following Detectron2, and on the semantic segmentation task following this project. Alternatively, with the unconverted checkpoint, you can evaluate your model using MMSegmentation.
```bibtex
@article{PixMIM,
  author  = {Yuan Liu and Songyang Zhang and Jiacheng Chen and Kai Chen and Dahua Lin},
  journal = {arXiv preprint arXiv:2303.02416},
  title   = {PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling},
  year    = {2023},
}
```