Summary Table of NLP Models (HF-based) – Audio Type Models

Name	Full Name	Architecture	Base Model	Developed	Training Dataset	Lib. & Framework	Use Cases	HF URL	GitHub URL
Audio Spectrogram Transformer	Audio Spectrogram Transformer	Transformer	ViT	2021	AudioSet	PyTorch, Hugging Face Transformers	Audio classification, sound event detection	https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer	https://github.com/YuanGongND/ast
Bark	Bark	GPT-like, Transformer	GPT-2	2023	Proprietary dataset	PyTorch, Hugging Face Transformers	Text-to-speech, voice synthesis	https://huggingface.co/docs/transformers/model_doc/bark	https://github.com/suno-ai/bark
CLAP	Contrastive Language-Audio Pretraining	Dual-encoder	CLIP	2022	AudioSet, Clotho	PyTorch, Hugging Face Transformers	Audio-text matching, zero-shot audio classification	https://huggingface.co/docs/transformers/model_doc/clap	https://github.com/microsoft/CLAP
DAC	Descript Audio Codec	Convolutional encoder-decoder with residual vector quantization	None (trained from scratch)	2023	AudioSet	PyTorch, Hugging Face Transformers	Audio compression, audio generation	https://huggingface.co/docs/transformers/model_doc/dac	https://github.com/descriptinc/descript-audio-codec
EnCodec	EnCodec	Convolutional neural network	None (trained from scratch)	2022	LibriTTS, VoxCeleb2	PyTorch, Hugging Face Transformers	Neural audio codec, audio compression	https://huggingface.co/docs/transformers/model_doc/encodec	https://github.com/facebookresearch/encodec
FastSpeech2Conformer	FastSpeech2Conformer	Conformer, FastSpeech2	None (trained from scratch)	2021	LJSpeech	PyTorch, ESPnet	Text-to-speech, voice synthesis	https://huggingface.co/docs/transformers/model_doc/fastspeech2_conformer	https://github.com/espnet/espnet
HuBERT	Hidden-Unit BERT	Transformer	BERT	2021	LibriSpeech	PyTorch, Hugging Face Transformers	Speech recognition, speech representation learning	https://huggingface.co/docs/transformers/model_doc/hubert	https://github.com/pytorch/fairseq/tree/main/examples/hubert
M-CTC-T	Massively Multilingual CTC Transformer	Transformer encoder with CTC head	None (trained from scratch)	2022	Common Voice, VoxPopuli	PyTorch, Hugging Face Transformers	Multilingual speech recognition, language identification	https://huggingface.co/docs/transformers/model_doc/mctct	https://github.com/flashlight/wav2letter/tree/main/recipes/mling_pl
Mimi	Mimi	Convolutional encoder-decoder with Transformer layers	None (trained from scratch)	2024	Proprietary dataset	PyTorch, Hugging Face Transformers	Neural audio codec, speech tokenization (used by Moshi)	https://huggingface.co/docs/transformers/model_doc/mimi	https://github.com/kyutai-labs/moshi
MMS	Massively Multilingual Speech	Transformer	XLS-R	2023	MLS, VoxPopuli, BABEL, CommonVoice	PyTorch, Hugging Face Transformers	Multilingual speech recognition, language identification	https://huggingface.co/docs/transformers/model_doc/mms	https://github.com/facebookresearch/fairseq/tree/main/examples/mms
Moonshine	Moonshine	Sequence-to-sequence	None (trained from scratch)	2024	200,000 hours of audio and transcripts	PyTorch, Hugging Face Transformers	Automatic speech recognition, English transcription	https://huggingface.co/UsefulSensors/moonshine-base	Not publicly available
Moshi	Moshi	Transformer	Helium (text language model)	2024	7M hours unsupervised audio, Fisher dataset, 170 hours supervised multi-stream, 20,000 hours synthetic data	PyTorch, Hugging Face Transformers	Speech-to-speech generation, real-time dialogue	https://huggingface.co/kyutai/moshiko-pytorch-bf16	https://github.com/kyutai-labs/moshi
MusicGen	MusicGen	Transformer	None (trained from scratch)	2023	Not specified	PyTorch, Hugging Face Transformers	Music generation from text prompts	https://huggingface.co/facebook/musicgen-large	https://github.com/facebookresearch/audiocraft
MusicGen Melody	MusicGen Melody	Transformer	MusicGen	2023	Not specified	PyTorch, Hugging Face Transformers	Music generation from text and melody inputs	https://huggingface.co/facebook/musicgen-melody	https://github.com/facebookresearch/audiocraft
Pop2Piano	Pop2Piano	Transformer	None (trained from scratch)	2023	Not specified	PyTorch, Hugging Face Transformers	Pop song to piano cover generation	https://huggingface.co/sweetcocoa/pop2piano	https://github.com/sweetcocoa/pop2piano
Seamless-M4T	Seamless Multilingual and Multimodal Machine Translation	Transformer	None (trained from scratch)	2023	Not specified	PyTorch, Hugging Face Transformers	Multilingual and multimodal translation	https://huggingface.co/facebook/seamless-m4t-large	https://github.com/facebookresearch/seamless_communication
SeamlessM4T-v2	Seamless Multilingual and Multimodal Machine Translation v2	Transformer	Seamless-M4T	2024	Not specified	PyTorch, Hugging Face Transformers	Improved multilingual and multimodal translation	https://huggingface.co/facebook/seamless-m4t-v2-large	https://github.com/facebookresearch/seamless_communication
SEW	Squeezed and Efficient Wav2Vec	Convolutional Neural Network	Wav2Vec	2021	LibriSpeech	PyTorch, Hugging Face Transformers	Speech recognition, audio feature extraction	https://huggingface.co/asapp/sew-tiny-100k	https://github.com/asappresearch/sew
SEW-D	Squeezed and Efficient Wav2Vec with Disentangled Attention	Convolutional Neural Network	SEW	2021	LibriSpeech	PyTorch, Hugging Face Transformers	Efficient speech recognition, audio feature extraction	https://huggingface.co/asapp/sew-d-tiny-100k	https://github.com/asappresearch/sew
Speech2Text	Speech2Text	Sequence-to-sequence	None (trained from scratch)	2021	CommonVoice, LibriSpeech	PyTorch, Hugging Face Transformers	Speech recognition, speech translation	https://huggingface.co/facebook/s2t-small-librispeech-asr	https://github.com/huggingface/transformers/tree/main/src/transformers/models/speech_to_tex
Speech2Text2	Speech2Text2	Transformer	None (trained from scratch)	2021	CommonVoice, LibriSpeech	PyTorch, Hugging Face Transformers	Speech recognition, speech translation	https://huggingface.co/facebook/s2t-small-librispeech-asr	https://github.com/huggingface/transformers/tree/main/src/transformers/models/speech_to_text_2
SpeechT5	SpeechT5	Encoder-Decoder Transformer	None (trained from scratch)	2022	Not specified	PyTorch, Hugging Face Transformers	Speech recognition, speech synthesis, voice conversion	https://huggingface.co/microsoft/speecht5_vc	https://github.com/microsoft/SpeechT5
UniSpeech	UniSpeech	Transformer	None (trained from scratch)	2021	Not specified	PyTorch, Hugging Face Transformers	Speech recognition, speaker identification	https://huggingface.co/microsoft/unispeech-base-100h	https://github.com/microsoft/UniSpeech
UniSpeech-SAT	UniSpeech Speaker Aware Pre-Training	HuBERT-based	HuBERT	2021	94,000 hours of public audio data	PyTorch, Hugging Face Transformers	Universal speech representation, speaker identification	https://huggingface.co/microsoft/unispeech-sat-base-100h-libri-ft	https://github.com/microsoft/UniSpeech
UnivNet	UnivNet	GAN	None (trained from scratch)	2021	Not specified	PyTorch, Hugging Face Transformers	Neural vocoder, waveform generation	https://huggingface.co/dg845/univnet-dev	Not publicly available
VITS	Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech	Variational Autoencoder, GAN	None (trained from scratch)	2021	Not specified	PyTorch, Hugging Face Transformers	Text-to-speech synthesis	https://huggingface.co/facebook/mms-tts	https://github.com/jaywalnut310/vits
Wav2Vec2	Wav2Vec2	Transformer	None (trained from scratch)	2020	LibriSpeech	PyTorch, Hugging Face Transformers	Speech recognition, speech representation learning	https://huggingface.co/facebook/wav2vec2-base	https://github.com/pytorch/fairseq/tree/main/examples/wav2vec
Wav2Vec2-BERT	Wav2Vec2-BERT	Transformer	Wav2Vec2	2024	4.5M hours of audio	PyTorch, Hugging Face Transformers	Speech recognition, speech representation learning	https://huggingface.co/facebook/wav2vec2-bert-base	Not publicly available
Wav2Vec2-Conformer	Wav2Vec2-Conformer	Conformer	Wav2Vec2	2021	Not specified	PyTorch, Hugging Face Transformers	Speech recognition, speech representation learning	https://huggingface.co/facebook/wav2vec2-conformer-rel-pos-large	https://github.com/pytorch/fairseq/tree/main/examples/wav2vec
Wav2Vec2Phoneme	Wav2Vec2Phoneme	Transformer	Wav2Vec2	2021	Not specified	PyTorch, Hugging Face Transformers	Phoneme recognition, speech-to-phoneme conversion	https://huggingface.co/facebook/wav2vec2-lv-60-espeak-cv-ft	https://github.com/pytorch/fairseq/tree/main/examples/wav2vec
WavLM	WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing	Transformer	None (trained from scratch)	2021	960 hrs LibriSpeech (Base), 94k hrs (Base+, Large)	PyTorch, Hugging Face Transformers	Speech recognition, speaker verification, speech representation learning	https://huggingface.co/microsoft/wavlm-base	https://github.com/microsoft/unilm/tree/master/wavlm
Whisper	Whisper	Encoder-Decoder Transformer	None (trained from scratch)	2022	680,000 hours of labeled audio data	PyTorch, Hugging Face Transformers	Automatic speech recognition, speech translation	https://huggingface.co/openai/whisper-large	https://github.com/openai/whisper
XLS-R	XLS-R: Self-supervised Cross-lingual Speech Representation Learning	Transformer	wav2vec 2.0	2021	436,000 hours of unlabeled speech data from 128 languages	PyTorch, fairseq	Speech translation, speech recognition, language identification, speaker identification	https://huggingface.co/facebook/wav2vec2-xls-r-300m	https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec/xlsr
XLSR-Wav2Vec2	XLSR-Wav2Vec2: Unsupervised Cross-Lingual Representation Learning For Speech Recognition	Transformer	wav2vec 2.0	2020	Not specified	PyTorch, Hugging Face Transformers	Cross-lingual speech recognition, speech representation learning	https://huggingface.co/facebook/wav2vec2-large-xlsr-53	https://github.com/pytorch/fairseq/tree/main/examples/wav2vec
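
The HF URL column mixes two kinds of links: documentation pages (under /docs/) and model checkpoints (org/name). Only the second kind can be passed to the Transformers `pipeline` function as a `model` argument. The sketch below is one possible way to tell them apart and load a speech-recognition row. The helper names `checkpoint_id_from_url` and `build_asr_pipeline` are illustrative and not part of any library, and the example assumes `transformers` and `torch` are installed.

```python
from typing import Optional
from urllib.parse import urlparse

HUB_HOST = "huggingface.co"


def checkpoint_id_from_url(url: str) -> Optional[str]:
    """Return 'org/name' for a Hub model URL, or None for other links.

    Hypothetical helper: it only inspects the URL shape and does not
    check that the checkpoint actually exists on the Hub.
    """
    parsed = urlparse(url)
    if parsed.netloc != HUB_HOST:
        return None
    parts = [p for p in parsed.path.split("/") if p]
    # Documentation pages live under /docs/... and are not loadable checkpoints.
    if len(parts) != 2 or parts[0] == "docs":
        return None
    return "/".join(parts)


def build_asr_pipeline(url: str):
    """Create a speech-recognition pipeline for a table row.

    Calling this downloads model weights, so the import is done lazily.
    """
    from transformers import pipeline

    checkpoint = checkpoint_id_from_url(url)
    if checkpoint is None:
        raise ValueError(f"not a model checkpoint URL: {url}")
    return pipeline("automatic-speech-recognition", model=checkpoint)
```

For example, `build_asr_pipeline("https://huggingface.co/openai/whisper-large")` would return a pipeline that can be called on a local audio file path. Rows whose use case is not speech recognition (text-to-speech, audio classification, codecs) would need a different pipeline task or the model-specific classes.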