Paper brief
ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions
ViT-CoMer keeps the flexibility of a plain Vision Transformer, then adds a parallel convolutional branch so dense prediction tasks can use richer multi-scale features.
Why this paper matters
Vision Transformers are attractive backbones because they can reuse strong pretrained weights. The problem is that a plain ViT produces features at a single scale, while dense prediction tasks such as object detection and semantic segmentation benefit from features at multiple spatial resolutions.
ViT-CoMer asks a practical question: can we keep a plain ViT backbone, avoid redesigning the whole network, and still give dense prediction models the multi-scale information they expect?
The bite
The key idea is to run two complementary branches in parallel. The ViT branch contributes global representation power and compatibility with pretrained weights. The convolutional branch contributes local, multi-resolution features that are useful for dense prediction.
Instead of replacing ViT with a vision-specific transformer, ViT-CoMer lets convolutional features interact with ViT features through lightweight modules. This makes the backbone feel less like a full architecture rewrite and more like a feature interaction layer around a reusable ViT.
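To make the "interaction layer around a reusable ViT" idea concrete, here is a minimal PyTorch sketch of the control flow, not the paper's implementation: a ViT stage and a CNN stage run in parallel, and an interaction module exchanges their outputs after each stage. The class names, the stage split, and the identity placeholders are assumptions for illustration.

```python
import torch
import torch.nn as nn

class IdentityInteraction(nn.Module):
    """Placeholder interaction: passes both features through unchanged."""
    def forward(self, vit_feat, cnn_feat):
        return vit_feat, cnn_feat

class ParallelBackbone(nn.Module):
    """Sketch of the parallel-branch wiring: ViT stages and CNN stages run
    side by side, and an interaction module exchanges features after each
    stage. The stage split and module names are assumptions."""
    def __init__(self, vit_stages, cnn_stages, interactions):
        super().__init__()
        self.vit_stages = nn.ModuleList(vit_stages)
        self.cnn_stages = nn.ModuleList(cnn_stages)
        self.interactions = nn.ModuleList(interactions)

    def forward(self, vit_feat, cnn_feat):
        for vit_stage, cnn_stage, interact in zip(
                self.vit_stages, self.cnn_stages, self.interactions):
            vit_feat, cnn_feat = interact(vit_stage(vit_feat), cnn_stage(cnn_feat))
        return vit_feat, cnn_feat

# Tiny smoke test with identity stages, just to show the control flow.
backbone = ParallelBackbone([nn.Identity()], [nn.Identity()], [IdentityInteraction()])
v, c = backbone(torch.randn(1, 256, 16, 16), torch.randn(1, 256, 32, 32))
```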
How it works
The method is built around two modules. The first is the Multi-Receptive Field Feature Pyramid module, or MRFP. It builds a convolutional feature pyramid at several resolutions and mixes it using depth-wise convolutions with different receptive fields.
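Below is a minimal sketch of the multi-receptive-field idea, not the paper's exact MRFP module: depth-wise convolutions with different kernel sizes are applied to the same feature map and combined. The kernel sizes (3, 5, 7) and the sum-then-project aggregation are assumptions.

```python
import torch
import torch.nn as nn

class MultiReceptiveFieldBlock(nn.Module):
    """Depth-wise convolutions with different kernel sizes applied to the
    same feature map, then combined by a point-wise projection."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        ])
        self.proj = nn.Conv2d(channels, channels, 1)  # mix channels after the depth-wise step

    def forward(self, x):
        # Sum the outputs of the different receptive fields, then project.
        return self.proj(sum(branch(x) for branch in self.branches))

# Applied independently at each pyramid level, e.g. 1/8, 1/16, 1/32 resolution.
block = MultiReceptiveFieldBlock(256)
out = block(torch.randn(1, 256, 32, 32))  # shape preserved: (1, 256, 32, 32)
```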
The second is the CNN-Transformer Bidirectional Fusion Interaction module, or CTI. CTI blends the ViT feature with the convolutional feature pyramid. A useful way to understand it is as a two-way exchange: ViT receives local multi-scale detail, while the CNN branch receives global transformer context.
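The following is a deliberately simplified sketch of that two-way exchange, assuming a plain resize-and-add fusion; the paper's CTI module fuses the features in a more elaborate way. The class name, the 1x1 projections, and the bilinear resampling are illustrative choices, not the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalFusion(nn.Module):
    """Two-way exchange between a ViT feature map and a CNN feature pyramid,
    using simple resampling and addition."""
    def __init__(self, channels):
        super().__init__()
        self.to_vit = nn.Conv2d(channels, channels, 1)  # project CNN detail before injecting into ViT
        self.to_cnn = nn.Conv2d(channels, channels, 1)  # project ViT context before injecting into CNN

    def forward(self, vit_feat, pyramid):
        # vit_feat: (B, C, H, W); pyramid: list of (B, C, Hi, Wi) at several scales
        h, w = vit_feat.shape[-2:]
        # CNN -> ViT: resample every pyramid level to the ViT resolution and add it.
        local = sum(
            F.interpolate(p, size=(h, w), mode="bilinear", align_corners=False)
            for p in pyramid
        )
        vit_out = vit_feat + self.to_vit(local)
        # ViT -> CNN: broadcast the global ViT feature back to each pyramid level.
        ctx = self.to_cnn(vit_feat)
        pyr_out = [
            p + F.interpolate(ctx, size=p.shape[-2:], mode="bilinear", align_corners=False)
            for p in pyramid
        ]
        return vit_out, pyr_out

fusion = BidirectionalFusion(256)
vit_feat = torch.randn(1, 256, 16, 16)
pyramid = [torch.randn(1, 256, 32, 32), torch.randn(1, 256, 16, 16), torch.randn(1, 256, 8, 8)]
vit_out, pyr_out = fusion(vit_feat, pyramid)
```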
The paper also uses a learnable, zero-initialized scaling factor when adding convolutional information back into the ViT stream. That detail matters because it reduces disruption early in training: the model can start close to the original ViT behavior and gradually learn how much convolutional interaction to use.
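The stabilizing detail is easy to show in isolation. In the sketch below, `gamma` is a learnable per-channel scale initialized to zero, so the injected convolutional signal contributes nothing at the start of training; the residual form and the token shapes are assumptions.

```python
import torch
import torch.nn as nn

class ZeroInitInjection(nn.Module):
    """Adds an injected signal to the ViT stream, scaled by a learnable
    per-channel factor that starts at zero."""
    def __init__(self, channels):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(channels))  # zero-initialized, learnable

    def forward(self, vit_tokens, injected):
        # vit_tokens, injected: (B, N, C). At initialization gamma is zero,
        # so the output equals the plain ViT tokens.
        return vit_tokens + self.gamma * injected

inject = ZeroInitInjection(768)
tokens = torch.randn(2, 196, 768)
conv_signal = torch.randn(2, 196, 768)
assert torch.allclose(inject(tokens, conv_signal), tokens)  # no effect at step 0
```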
What to look at in the results
The most useful comparisons are not just whether ViT-CoMer improves a score, but where the improvement appears. Dense prediction is where multi-scale features should matter, so object detection and semantic segmentation are the right places to inspect first.
Also check how the method behaves under different pretraining settings. One of the main reasons to keep a plain ViT branch is the ability to benefit from strong pretrained ViT weights without committing to a heavily modified backbone.
Practical takeaways
- Plain ViT is powerful, but dense prediction often needs multi-scale spatial detail.
- A parallel CNN branch can add local multi-resolution information without discarding ViT.
- Feature interaction modules are the central design move, not just extra convolution.
- Zero-initialized scaling is a small but important stabilizing detail.
- When reading the paper, focus on detection and segmentation results first.