Summary Table of NLP Models (HF-based) – Vision-Type Models

| Name | Full Name | Architecture | Base Model | Year Developed | Training Dataset | Library & Framework | Use Cases | HF URL | GitHub URL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BEiT | Bidirectional Encoder representation from Image Transformers | Vision Transformer | ViT | 2021 | ImageNet-21k, ImageNet-1k | PyTorch, Hugging Face Transformers | Image classification, semantic segmentation | https://huggingface.co/microsoft/beit-base-patch16-224 | https://github.com/microsoft/unilm/tree/master/beit |
| BiT | Big Transfer | ResNet | ResNet | 2019 | JFT-300M, ImageNet-21k | TensorFlow, Hugging Face Transformers | Image classification, transfer learning | https://huggingface.co/google/bit-50 | https://github.com/google-research/big_transfer |
| Conditional DETR | Conditional DETR | Transformer | DETR | 2021 | COCO | PyTorch, Hugging Face Transformers | Object detection | https://huggingface.co/microsoft/conditional-detr-resnet-50 | https://github.com/Atten4Vis/ConditionalDETR |
| ConvNeXT | ConvNeXT | Convolutional Neural Network | ResNet | 2022 | ImageNet-1k | PyTorch, Hugging Face Transformers | Image classification | https://huggingface.co/facebook/convnext-tiny-224 | https://github.com/facebookresearch/ConvNeXt |
| ConvNeXTV2 | ConvNeXT V2 | Convolutional Neural Network | ConvNeXT | 2023 | ImageNet-22k | PyTorch, Hugging Face Transformers | Image classification | https://huggingface.co/facebook/convnextv2-tiny-1k-224 | https://github.com/facebookresearch/ConvNeXt-V2 |
| CvT | Convolutional vision Transformer | Vision Transformer | ViT | 2021 | ImageNet-1k | PyTorch, Hugging Face Transformers | Image classification | https://huggingface.co/microsoft/cvt-13 | https://github.com/microsoft/CvT |
| DAB-DETR | Dynamic Anchor Boxes DETR | Transformer | Conditional DETR | 2022 | COCO 2017 | PyTorch, Hugging Face Transformers | Object detection | https://huggingface.co/IDEA-Research/dab-detr-resnet-50 | https://github.com/IDEA-Research/DAB-DETR |
| Deformable DETR | Deformable DETR | Transformer | DETR | 2020 | COCO | PyTorch, Hugging Face Transformers | Object detection | https://huggingface.co/SenseTime/deformable-detr | https://github.com/fundamentalvision/Deformable-DETR |
| DeiT | Data-efficient image Transformers | Vision Transformer | ViT | 2020 | ImageNet-1k | PyTorch, Hugging Face Transformers | Image classification | https://huggingface.co/facebook/deit-base-distilled-patch16-224 | https://github.com/facebookresearch/deit |
| Depth Anything | Depth Anything | Vision Transformer | DPT | 2024 | MiDaS dataset, custom large-scale dataset | PyTorch, Hugging Face Transformers | Monocular depth estimation | https://huggingface.co/LiheYoung/depth-anything-small-hf | https://github.com/LiheYoung/Depth-Anything |
| Depth Anything V2 | Depth Anything V2 | Dense Prediction Transformer (DPT) | DINOv2 | 2024 | 595K synthetic images, 62M+ real unlabeled images | PyTorch, Hugging Face Transformers | Monocular depth estimation | https://huggingface.co/LiheYoung/depth-anything-small-hf | https://github.com/LiheYoung/Depth-Anything |
| DepthPro | Depth Pro | Multi-scale Vision Transformer | | 2024 | Mix of real and synthetic images | PyTorch | Monocular depth estimation, AR applications | | |
| DETA | Detection Transformers with Assignment | Transformer | Swin Transformer | 2023 | COCO | PyTorch, Hugging Face Transformers | Object detection | https://huggingface.co/jozhang97/deta-swin-large | https://github.com/jozhang97/DETA |
| DETR | DEtection TRansformer | Transformer | ResNet | 2020 | COCO | PyTorch, Hugging Face Transformers | Object detection | https://huggingface.co/facebook/detr-resnet-50 | https://github.com/facebookresearch/detr |
| DiNAT | Dilated Neighborhood Attention Transformer | Hierarchical Vision Transformer | NAT | 2022 | ImageNet-1k | PyTorch, NATTEN | Image classification, object detection, segmentation | https://huggingface.co/shi-labs/dinat-mini-in1k-224 | https://github.com/SHI-Labs/Neighborhood-Attention-Transformer |
| DINOv2 | DINO v2 | Vision Transformer | ViT | 2023 | Curated dataset from diverse sources | PyTorch, Hugging Face Transformers | Image classification, visual feature extraction | https://huggingface.co/facebook/dinov2-base | https://github.com/facebookresearch/dinov2 |
| DINOv2 with Registers | DINO v2 with Registers | Vision Transformer | DINOv2 | 2025 | Same as DINOv2 | PyTorch, Hugging Face Transformers | Image classification, visual feature extraction | https://huggingface.co/facebook/dinov2-with-registers-base | https://github.com/facebookresearch/dinov2 |
| DiT | Document Image Transformer | Vision Transformer | BEiT | 2022 | Various document datasets | PyTorch, Hugging Face Transformers | Document image analysis, layout analysis, table detection | https://huggingface.co/microsoft/dit-base | https://github.com/microsoft/unilm/tree/master/dit |
| DPT | Dense Prediction Transformer | Vision Transformer | ViT | 2021 | Various, including NYU Depth V2 | PyTorch, Hugging Face Transformers | Monocular depth estimation, semantic segmentation | https://huggingface.co/Intel/dpt-large | https://github.com/isl-org/DPT |
| EfficientFormer | EfficientFormer | Transformer | | 2022 | ImageNet-1K | PyTorch | Image classification, object detection, segmentation | https://huggingface.co/docs/transformers/model_doc/efficientformer | https://github.com/snap-research/EfficientFormer |
| EfficientNet | EfficientNet | Convolutional Neural Network | MobileNetV2 | 2019 | ImageNet | TensorFlow, PyTorch | Image classification, transfer learning | https://huggingface.co/google/efficientnet-b0 | https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet |
| FocalNet | Focal Modulation Network | Vision Transformer | | 2022 | ImageNet-1K, ImageNet-22K | PyTorch | Image classification, object detection, semantic segmentation | https://huggingface.co/microsoft/focalnet-tiny | https://github.com/microsoft/FocalNet |
| GLPN | Global-Local Path Networks | Hierarchical mix-Transformer | SegFormer | 2022 | NYU Depth V2, KITTI | PyTorch, Hugging Face Transformers | Monocular depth estimation | https://huggingface.co/vinvino02/glpn-kitti | https://github.com/vinvino02/GLPDepth |
| Hiera | Hierarchical Vision Transformer | Vision Transformer | | 2023 | ImageNet-1K | PyTorch | Image and video recognition | https://huggingface.co/facebook/hiera-base-224 | https://github.com/facebookresearch/hiera |
| I-JEPA | Image Joint Embedding Predictive Architecture | Joint Embedding Predictive Architecture | | 2024 | Large-scale image datasets | PyTorch | Self-supervised image representation learning | | |
| ImageGPT | Generative Pretraining from Pixels | GPT-2-like | GPT-2 | 2020 | ImageNet | PyTorch, Hugging Face Transformers | Image generation, image classification | https://huggingface.co/docs/transformers/model_doc/imagegpt | https://github.com/openai/image-gpt |
| LeViT | LeViT | Vision Transformer | | 2021 | ImageNet | PyTorch, Hugging Face Transformers | Image classification | https://huggingface.co/docs/transformers/model_doc/levit | https://github.com/huggingface/transformers |
| Mask2Former | Masked-attention Mask Transformer | Transformer | Swin Transformer | 2022 | COCO, ADE20K, Cityscapes | PyTorch, Detectron2 | Instance segmentation, panoptic segmentation, semantic segmentation | https://huggingface.co/docs/transformers/model_doc/mask2former | https://github.com/facebookresearch/Mask2Former |
| MaskFormer | MaskFormer | Transformer | | 2021 | ADE20K, Cityscapes, COCO, Mapillary Vistas | PyTorch, Hugging Face Transformers | Semantic segmentation, instance segmentation, panoptic segmentation | https://huggingface.co/facebook/maskformer-swin-base-ade | https://github.com/facebookresearch/MaskFormer |
| MobileNetV1 | MobileNet Version 1 | Convolutional Neural Network | | 2017 | ImageNet | TensorFlow, PyTorch | Mobile and embedded vision applications | https://huggingface.co/google/mobilenet_v1_0.75_192 | https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilene |
| MobileNetV2 | MobileNet Version 2 | Convolutional Neural Network | MobileNetV1 | 2018 | ImageNet | TensorFlow, Keras, PyTorch | Mobile and embedded vision applications, image classification, object detection | https://huggingface.co/google/mobilenet_v2_1.0_224 | https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet |
| MobileViT | Mobile Vision Transformer | Vision Transformer | | 2021 | ImageNet | PyTorch, Hugging Face Transformers | Image classification, object detection | https://huggingface.co/apple/mobilevit-small | https://github.com/apple/ml-cvnets |
| MobileViTV2 | Mobile Vision Transformer Version 2 | Vision Transformer | MobileViT | 2023 | ImageNet | PyTorch, Hugging Face Transformers | Image classification, object detection | https://huggingface.co/apple/mobilevitv2-1.0 | https://github.com/apple/ml-cvnets |
| NAT | Neighborhood Attention Transformer | Vision Transformer | | 2022 | ImageNet | PyTorch | Image classification, object detection, segmentation | https://huggingface.co/shi-labs/nat-mini-in1k-224 | https://github.com/SHI-Labs/Neighborhood-Attention-Transformer |
| PoolFormer | PoolFormer | Transformer | | 2022 | ImageNet-1K | PyTorch | Image classification | https://huggingface.co/docs/transformers/model_doc/poolformer | https://github.com/sail-sg/poolformer |
| PVT | Pyramid Vision Transformer | Vision Transformer | | 2021 | ImageNet | PyTorch, Hugging Face Transformers | Image classification, object detection, segmentation | https://huggingface.co/microsoft/pvt-tiny-224 | https://github.com/whai362/PVT |
| PVTv2 | Pyramid Vision Transformer Version 2 | Vision Transformer | PVT | 2022 | ImageNet | PyTorch, Hugging Face Transformers | Image classification, object detection, segmentation | https://huggingface.co/microsoft/pvt-v2-b0-224 | https://github.com/whai362/PVT |
| RegNet | Designing Network Design Spaces | Convolutional Neural Network | | 2020 | ImageNet | PyTorch, FAIR | Image classification, object detection | https://huggingface.co/docs/transformers/model_doc/regnet | https://github.com/facebookresearch/pycls |
| ResNet | Residual Network | Convolutional Neural Network | | 2015 | ImageNet | PyTorch, TensorFlow, Keras | Image classification, object detection, segmentation | https://huggingface.co/microsoft/resnet-50 | https://github.com/KaimingHe/deep-residual-networks |
| RT-DETR | Real-Time Detection Transformer | Transformer | | 2024 | COCO | PyTorch, Hugging Face Transformers | Real-time object detection | https://huggingface.co/docs/transformers/model_doc/rt_detr | https://github.com/PaddlePaddle/PaddleDetection/tree/develop/configs/rtdetr |
| RT-DETRv2 | Real-Time Detection Transformer Version 2 | Transformer | RT-DETR | 2024 | COCO | PyTorch, Hugging Face Transformers | Real-time object detection | https://huggingface.co/docs/transformers/model_doc/rt_detr | https://github.com/PaddlePaddle/PaddleDetection/tree/develop/configs/rtdetr |
| SegFormer | Segmentation Transformer | Vision Transformer | | 2021 | ADE20K, Cityscapes | PyTorch, Hugging Face Transformers | Semantic segmentation | https://huggingface.co/docs/transformers/model_doc/segformer | https://github.com/NVlabs/SegFormer |
| SegGPT | Segmenting Everything In Context | Transformer | GPT | 2023 | SA-1B | PyTorch | Image segmentation, visual grounding | https://huggingface.co/BAAI/SegGPT | https://github.com/baaivision/Painter |
| SuperGlue | SuperGlue | Graph Neural Network | | 2020 | MegaDepth, COCO | PyTorch | Feature matching, image registration | https://huggingface.co/docs/transformers/model_doc/superglue | https://github.com/magicleap/SuperGluePretrainedNetwork |
| SuperPoint | SuperPoint | Convolutional Neural Network | | 2018 | MS-COCO | PyTorch | Feature detection, description | https://huggingface.co/docs/transformers/model_doc/superpoint | https://github.com/magicleap/SuperPointPretrainedNetwork |
| SwiftFormer | SwiftFormer | Transformer-based with efficient additive attention | | 2023 | ImageNet-1K | PyTorch, Hugging Face Transformers | Image classification, mobile vision applications | https://huggingface.co/MBZUAI/swiftformer-s | https://github.com/huggingface/transformers/blob/main/src/transformers/models/swiftformer/modeling_swiftformer.py |
| Swin Transformer | Swin Transformer | Hierarchical Transformer | | 2021 | ImageNet-1K, ImageNet-22K | PyTorch, Hugging Face Transformers | Image classification, object detection, semantic segmentation | https://huggingface.co/microsoft/swin-tiny-patch4-window7-224 | https://github.com/microsoft/Swin-Transformer |
| Swin Transformer V2 | Swin Transformer V2 | Hierarchical Transformer with improved training stability | Swin Transformer | 2022 | ImageNet-22K | PyTorch, Hugging Face Transformers | Image classification, object detection, semantic segmentation | https://huggingface.co/microsoft/swinv2-tiny-patch4-window8-256 | https://github.com/microsoft/Swin-Transformer |
| Swin2SR | Swin2SR | Swin Transformer for Super-Resolution | Swin Transformer | 2022 | DIV2K, Flickr2K | PyTorch | Image super-resolution | https://huggingface.co/caidas/swin2SR-classical-sr-x2-64 | https://github.com/mv-lab/swin2sr |
| Table Transformer | Table Transformer | Transformer-based | DETR | 2022 | PubTables-1M | PyTorch, Hugging Face Transformers | Table structure recognition | https://huggingface.co/microsoft/table-transformer-detection | https://github.com/microsoft/table-transformer |
| TextNet | TextNet | CNN-based | | 2018 | SynthText, Total-Text | PyTorch | Scene text detection and recognition | https://huggingface.co/microsoft/trocr-base-printed | https://github.com/tonghe90/textnet |
| Timm Wrapper | PyTorch Image Models Wrapper | Various | | 2025 | ImageNet | PyTorch, Hugging Face Transformers | Image classification | https://huggingface.co/docs/transformers/en/model_doc/timm_wrapper | https://github.com/huggingface/transformers |
| UperNet | Unified Perceptual Parsing Network | Transformer | Various (e.g., Swin, ConvNeXt) | 2018 | ADE20K, Cityscapes | PyTorch, Hugging Face Transformers | Semantic segmentation | https://huggingface.co/docs/transformers/model_doc/upernet | https://github.com/huggingface/transformers |
| VAN | Visual Attention Network | Attention-based CNN | | 2022 | ImageNet-1K | PyTorch, Hugging Face Transformers | Image classification | https://huggingface.co/Visual-Attention-Network/van-base | https://github.com/Visual-Attention-Network/VAN-Classification |
| Vision Transformer (ViT) | Vision Transformer | Transformer | | 2020 | ImageNet | PyTorch, TensorFlow, Hugging Face Transformers | Image classification | https://huggingface.co/google/vit-base-patch16-224 | https://github.com/google-research/vision_transformer |
| ViT Hybrid | Vision Transformer Hybrid | Hybrid CNN-Transformer | | 2020 | ImageNet-21K, ImageNet-1K | PyTorch, Hugging Face Transformers | Image classification | https://huggingface.co/google/vit-hybrid-base-bit-384 | https://github.com/google-research/vision_transformer |
| ViTDet | Vision Transformer for Object Detection | Transformer-based | ViT | 2022 | COCO | PyTorch, Detectron2 | Object detection | https://huggingface.co/facebook/vit-det-base | https://github.com/facebookresearch/detectron2 |
| ViTMAE | Vision Transformer with Masked Autoencoders | Transformer-based | ViT | 2021 | ImageNet-1K | PyTorch, Hugging Face Transformers | Self-supervised learning, image classification | https://huggingface.co/facebook/vit-mae-base | https://github.com/facebookresearch/mae |
| ViTMatte | Vision Transformer for Image Matting | Transformer-based | ViT | 2022 | Adobe Image Matting Dataset | PyTorch | Image matting | https://huggingface.co/hustvl/vitmatte-small-composition-1k | https://github.com/hustvl/ViTMatte |
| ViTMSN | Vision Transformer with Masked Siamese Networks | Transformer-based | ViT | 2022 | ImageNet-1K | PyTorch | Self-supervised learning, image classification | https://huggingface.co/facebook/vit-msn-small | https://github.com/facebookresearch/msn |
| ViTPose | Vision Transformer for Human Pose Estimation | Transformer-based | ViT | 2022 | COCO | PyTorch, MMPose | Human pose estimation | https://huggingface.co/open-mmlab/vit-pose-base | https://github.com/open-mmlab/mmpose |
| YOLOS | You Only Look at One Sequence | Transformer-based | DETR | 2021 | COCO | PyTorch, Hugging Face Transformers | Object detection | https://huggingface.co/hustvl/yolos-tiny | https://github.com/hustvl/YOLOS |
| ZoeDepth | ZoeDepth | Transformer-based | DPT | 2023 | NYU Depth V2, KITTI | PyTorch | Monocular depth estimation | https://huggingface.co/shariqfarooq/ZoeDepth | https://github.com/isl-org/ZoeDepth |
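
The HF URL column mixes two kinds of links: checkpoint pages such as https://huggingface.co/google/vit-base-patch16-224, whose path is the repository ID used to load the model, and documentation pages under https://huggingface.co/docs/, which describe a model family but cannot be loaded directly. The sketch below shows one way to tell them apart and, for image-classification entries, to build a Transformers pipeline from a checkpoint URL. The helper names `hub_model_id` and `load_classifier` are hypothetical and not part of any library; `load_classifier` assumes the `transformers` package is installed and downloads weights when called.

```python
from typing import Optional
from urllib.parse import urlparse


def hub_model_id(hf_url: str) -> Optional[str]:
    """Return the 'owner/name' repo ID from a Hub URL, or None for other links."""
    parsed = urlparse(hf_url)
    if parsed.netloc != "huggingface.co":
        return None
    parts = [p for p in parsed.path.split("/") if p]
    # Documentation pages (huggingface.co/docs/...) are not loadable checkpoints.
    if len(parts) != 2 or parts[0] == "docs":
        return None
    return "/".join(parts)


def load_classifier(hf_url: str):
    """Build an image-classification pipeline for a checkpoint URL (hypothetical helper)."""
    # Imported lazily so the URL helper works without transformers installed.
    from transformers import pipeline

    model_id = hub_model_id(hf_url)
    if model_id is None:
        raise ValueError(f"not a checkpoint URL: {hf_url}")
    return pipeline("image-classification", model=model_id)


# A checkpoint URL yields a repo ID; a documentation URL yields None.
print(hub_model_id("https://huggingface.co/google/vit-base-patch16-224"))
print(hub_model_id("https://huggingface.co/docs/transformers/model_doc/levit"))
```

Rows whose use case is not image classification would need a different pipeline task (for example "object-detection" or "depth-estimation"), so the task string in `load_classifier` is only appropriate for the classification entries.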