Summary Table of NLP Models (HF-based) – Video Type Models

| Name | Full Name | Architecture | Base Model | Developed | Training Dataset | Library & Framework | Use Cases | HF URL | GitHub URL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TimeSformer | TimeSformer (Time-Space Transformer) | Transformer | Vision Transformer (ViT) | 2021 | Evaluated on datasets like Kinetics-400 and Kinetics-600 | PyTorch | Video classification and action recognition tasks | | https://github.com/facebookresearch/TimeSformer |
| VideoMAE | Video Masked Autoencoders | Masked autoencoder | Vision Transformer (ViT) | 2022 | Pre-trained on large-scale video datasets; specifics vary by implementation | PyTorch | Video classification, action recognition, and efficient video representation learning | https://huggingface.co/docs/transformers/en/model_doc/videomae | |
| ViViT | Video Vision Transformer | Pure transformer-based model | Vision Transformer (ViT) | 2021 | Trained and evaluated on datasets such as Kinetics-400, Kinetics-600, Epic Kitchens, Something-Something V2, and Moments in Time | TensorFlow and JAX | Video classification and action recognition tasks | https://huggingface.co/docs/transformers/en/model_doc/vivit | https://github.com/google- |