Summary Table of NLP Models (HF-based) – Video Type Models

Name	Full Name	Architecture	Base Model	Developed	Training Dataset	Library & Framework	Use Cases	HF URL	GitHub URL
TimeSformer	TimeSformer (Time-Space Transformer)	Transformer	Vision Transformer (ViT)	2021	Evaluated on datasets like Kinetics-400 and Kinetics-600	PyTorch	Video classification and action recognition tasks		https://github.com/facebookresearch/TimeSformer
VideoMAE	Video Masked Autoencoders	Masked autoencoder	Vision Transformer (ViT)	2022	Self-supervised pre-training on video datasets such as Kinetics-400 and Something-Something V2	PyTorch	Video classification, action recognition, and efficient video representation learning	https://huggingface.co/docs/transformers/en/model_doc/videomae
ViViT	Video Vision Transformer	Pure transformer-based model	Vision Transformer (ViT)	2021	Trained and evaluated on datasets such as Kinetics-400, Kinetics-600, Epic Kitchens, Something-Something V2, and Moments in Time	TensorFlow and JAX	Video classification and action recognition tasks	https://huggingface.co/docs/transformers/en/model_doc/vivit	https://github.com/google-