Summary Table of NLP Models (HF-based) – Audio Type Models

| Name | Full Name | Architecture | Base Model | Developed | Training Dataset | Library & Framework | Use Cases | HF URL | GitHub URL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Audio Spectrogram Transformer | Audio Spectrogram Transformer | Transformer | ViT | 2021 | AudioSet | PyTorch, Hugging Face Transformers | Audio classification, sound event detection | https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer | https://github.com/YuanGongND/ast |
| Bark | Bark | GPT-like, Transformer | GPT-2 | 2023 | Proprietary dataset | PyTorch, Hugging Face Transformers | Text-to-speech, voice synthesis | https://huggingface.co/docs/transformers/model_doc/bark | https://github.com/suno-ai/bark |
| CLAP | Contrastive Language-Audio Pretraining | Dual-encoder | CLIP | 2022 | AudioSet, Clotho | PyTorch, Hugging Face Transformers | Audio-text matching, zero-shot audio classification | https://huggingface.co/docs/transformers/model_doc/clap | https://github.com/microsoft/CLAP |
| DAC | Descript Audio Codec | Transformer | None (trained from scratch) | 2023 | AudioSet | PyTorch, Hugging Face Transformers | Audio compression, audio generation | https://huggingface.co/docs/transformers/model_doc/dac | https://github.com/descriptinc/descript-audio-codec |
| EnCodec | EnCodec | Convolutional neural network | None (trained from scratch) | 2022 | LibriTTS, VoxCeleb2 | PyTorch, Hugging Face Transformers | Neural audio codec, audio compression | https://huggingface.co/docs/transformers/model_doc/encodec | https://github.com/facebookresearch/encodec |
| FastSpeech2Conformer | FastSpeech2Conformer | Conformer, FastSpeech2 | None (trained from scratch) | 2021 | LJSpeech | PyTorch, ESPnet | Text-to-speech, voice synthesis | https://huggingface.co/docs/transformers/model_doc/fastspeech2_conformer | https://github.com/espnet/espnet |
| HuBERT | Hidden Unit BERT | Transformer | BERT | 2021 | LibriSpeech | PyTorch, Hugging Face Transformers | Speech recognition, speech representation learning | https://huggingface.co/docs/transformers/model_doc/hubert | https://github.com/pytorch/fairseq/tree/main/examples/hubert |
| MCTCT | Multi-Channel Transformer Transducer | Transformer Transducer | None (trained from scratch) | 2022 | CHiME-6 | PyTorch, ESPnet | Multi-channel speech recognition | https://huggingface.co/docs/transformers/model_doc/mctct | https://github.com/espnet/espnet |
| Mimi | Mimi | Transformer | None (trained from scratch) | 2023 | Proprietary dataset | PyTorch, Hugging Face Transformers | Speech recognition, speech translation | https://huggingface.co/docs/transformers/model_doc/mimi | Not publicly available |
| MMS | Massively Multilingual Speech | Transformer | XLS-R | 2023 | MLS, VoxPopuli, BABEL, CommonVoice | PyTorch, Hugging Face Transformers | Multilingual speech recognition, language identification | https://huggingface.co/docs/transformers/model_doc/mms | https://github.com/facebookresearch/fairseq/tree/main/examples/mms |
| Moonshine | Moonshine | Sequence-to-sequence | None (trained from scratch) | 2024 | 200,000 hours of audio and transcripts | PyTorch, Hugging Face Transformers | Automatic speech recognition, English transcription | https://huggingface.co/UsefulSensors/moonshine-base | Not publicly available |
| Moshi | Moshi | Transformer | Helium (text language model) | 2024 | 7M hours unsupervised audio, Fisher dataset, 170 hours supervised multi-stream, 20,000 hours synthetic data | PyTorch, Hugging Face Transformers | Speech-to-speech generation, real-time dialogue | https://huggingface.co/kyutai/moshiko-pytorch-bf16 | https://github.com/kyutai-labs/moshi |
| MusicGen | MusicGen | Transformer | None (trained from scratch) | 2023 | Not specified | PyTorch, Hugging Face Transformers | Music generation from text prompts | https://huggingface.co/facebook/musicgen-large | https://github.com/facebookresearch/audiocraft |
| MusicGen Melody | MusicGen Melody | Transformer | MusicGen | 2023 | Not specified | PyTorch, Hugging Face Transformers | Music generation from text and melody inputs | https://huggingface.co/facebook/musicgen-melody | https://github.com/facebookresearch/audiocraft |
| Pop2Piano | Pop2Piano | Transformer | None (trained from scratch) | 2023 | Not specified | PyTorch, Hugging Face Transformers | Pop song to piano cover generation | https://huggingface.co/sweetcocoa/pop2piano | https://github.com/sweetcocoa/pop2piano |
| SeamlessM4T | Seamless Multilingual and Multimodal Machine Translation | Transformer | None (trained from scratch) | 2023 | Not specified | PyTorch, Hugging Face Transformers | Multilingual and multimodal translation | https://huggingface.co/facebook/seamless-m4t-large | https://github.com/facebookresearch/seamless_communication |
| SeamlessM4T-v2 | Seamless Multilingual and Multimodal Machine Translation v2 | Transformer | SeamlessM4T | 2024 | Not specified | PyTorch, Hugging Face Transformers | Improved multilingual and multimodal translation | https://huggingface.co/facebook/seamless-m4t-v2-large | https://github.com/facebookresearch/seamless_communication |
| SEW | Squeezed and Efficient Wav2Vec | Convolutional neural network | Wav2Vec | 2021 | LibriSpeech | PyTorch, Hugging Face Transformers | Speech recognition, audio feature extraction | https://huggingface.co/asapp/sew-tiny-100k | https://github.com/asappresearch/sew |
| SEW-D | Squeezed and Efficient Wav2Vec with Disentangled Attention | Convolutional neural network | SEW | 2021 | LibriSpeech | PyTorch, Hugging Face Transformers | Efficient speech recognition, audio feature extraction | https://huggingface.co/asapp/sew-d-tiny-100k | https://github.com/asappresearch/sew |
| Speech2Text | Speech2Text | Sequence-to-sequence | None (trained from scratch) | 2021 | CommonVoice, LibriSpeech | PyTorch, Hugging Face Transformers | Speech recognition, speech translation | https://huggingface.co/facebook/s2t-small-librispeech-asr | https://github.com/huggingface/transformers/tree/main/src/transformers/models/speech_to_tex |
| Speech2Text2 | Speech2Text2 | Transformer | None (trained from scratch) | 2021 | CommonVoice, LibriSpeech | PyTorch, Hugging Face Transformers | Speech recognition, speech translation | https://huggingface.co/facebook/s2t-small-librispeech-asr | https://github.com/huggingface/transformers/tree/main/src/transformers/models/speech_to_text_2 |
| SpeechT5 | SpeechT5 | Encoder-decoder Transformer | None (trained from scratch) | 2022 | Not specified | PyTorch, Hugging Face Transformers | Speech recognition, speech synthesis, voice conversion | https://huggingface.co/microsoft/speecht5_vc | https://github.com/microsoft/SpeechT5 |
| UniSpeech | UniSpeech | Transformer | None (trained from scratch) | 2021 | Not specified | PyTorch, Hugging Face Transformers | Speech recognition, speaker identification | https://huggingface.co/microsoft/unispeech-base-100h | https://github.com/microsoft/UniSpeech |
| UniSpeech-SAT | UniSpeech Speaker Aware Pre-Training | HuBERT-based | HuBERT | 2021 | 94,000 hours of public audio data | PyTorch, Hugging Face Transformers | Universal speech representation, speaker identification | https://huggingface.co/microsoft/unispeech-sat-base-100h-libri-ft | https://github.com/microsoft/UniSpeech |
| UnivNet | UnivNet | GAN | None (trained from scratch) | 2021 | Not specified | PyTorch, Hugging Face Transformers | Neural vocoder, waveform generation | https://huggingface.co/dg845/univnet-dev | Not publicly available |
| VITS | Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech | Variational autoencoder, GAN | None (trained from scratch) | 2021 | Not specified | PyTorch, Hugging Face Transformers | Text-to-speech synthesis | https://huggingface.co/facebook/mms-tts | https://github.com/jaywalnut310/vits |
| Wav2Vec2 | Wav2Vec2 | Transformer | None (trained from scratch) | 2020 | LibriSpeech | PyTorch, Hugging Face Transformers | Speech recognition, speech representation learning | https://huggingface.co/facebook/wav2vec2-base | https://github.com/pytorch/fairseq/tree/main/examples/wav2vec |
| Wav2Vec2-BERT | Wav2Vec2-BERT | Transformer | Wav2Vec2 | 2024 | 4.5M hours of audio | PyTorch, Hugging Face Transformers | Speech recognition, speech representation learning | https://huggingface.co/facebook/wav2vec2-bert-base | Not publicly available |
| Wav2Vec2-Conformer | Wav2Vec2-Conformer | Conformer | Wav2Vec2 | 2021 | Not specified | PyTorch, Hugging Face Transformers | Speech recognition, speech representation learning | https://huggingface.co/facebook/wav2vec2-conformer-rel-pos-large | https://github.com/pytorch/fairseq/tree/main/examples/wav2vec |
| Wav2Vec2Phoneme | Wav2Vec2Phoneme | Transformer | Wav2Vec2 | 2021 | Not specified | PyTorch, Hugging Face Transformers | Phoneme recognition, speech-to-phoneme conversion | https://huggingface.co/facebook/wav2vec2-lv-60-espeak-cv-ft | https://github.com/pytorch/fairseq/tree/main/examples/wav2vec |
| WavLM | WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing | Transformer | None (trained from scratch) | 2021 | 960 hrs LibriSpeech (Base), 94k hrs (Base+, Large) | PyTorch, Hugging Face Transformers | Speech recognition, speaker verification, speech representation learning | https://huggingface.co/microsoft/wavlm-base | https://github.com/microsoft/unilm/tree/master/wavlm |
| Whisper | Whisper | Encoder-decoder Transformer | None (trained from scratch) | 2022 | 680,000 hours of labeled audio data | PyTorch, Hugging Face Transformers | Automatic speech recognition, speech translation | https://huggingface.co/openai/whisper-large | https://github.com/openai/whisper |
| XLS-R | XLS-R: Self-supervised Cross-lingual Speech Representation Learning | Transformer | wav2vec 2.0 | 2021 | 436,000 hours of unlabeled speech data from 128 languages | PyTorch, fairseq | Speech translation, speech recognition, language identification, speaker identification | https://huggingface.co/facebook/wav2vec2-xls-r-300m | https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec/xlsr |
| XLSR-Wav2Vec2 | XLSR-Wav2Vec2: Unsupervised Cross-Lingual Representation Learning For Speech Recognition | Transformer | wav2vec 2.0 | 2020 | Not specified | PyTorch, Hugging Face Transformers | Cross-lingual speech recognition, speech representation learning | https://huggingface.co/facebook/wav2vec2-large-xlsr-53 | https://github.com/pytorch/fairseq/tree/main/examples/wav2vec |
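
Several entries in the HF URL column point at a model repository on the Hub (for example `https://huggingface.co/openai/whisper-large`) rather than at a documentation page. As a minimal sketch of how such an entry might be used, the Python snippet below extracts the `org/name` checkpoint id from a repository URL and passes it to the Transformers `pipeline` function. The helper names `checkpoint_id_from_url` and `load_asr_pipeline` are hypothetical and not part of any library, and the sketch assumes the chosen checkpoint supports the `automatic-speech-recognition` task.

```python
from urllib.parse import urlparse


def checkpoint_id_from_url(hf_url: str) -> str:
    """Return the 'org/name' checkpoint id from a Hub repository URL.

    Hypothetical helper: documentation URLs (e.g. .../docs/transformers/...)
    have more than two path segments and are rejected.
    """
    parts = [p for p in urlparse(hf_url).path.split("/") if p]
    if len(parts) != 2:
        raise ValueError(f"not a model repository URL: {hf_url}")
    return "/".join(parts)


def load_asr_pipeline(hf_url: str):
    """Build a speech-recognition pipeline for a checkpoint listed in the table.

    Assumes the `transformers` package is installed and that the checkpoint
    supports the automatic-speech-recognition task. The import is done lazily
    so the URL helper above can be used without `transformers`.
    """
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model=checkpoint_id_from_url(hf_url),
    )


# Example usage (downloads the checkpoint; "sample.wav" is a placeholder path):
# asr = load_asr_pipeline("https://huggingface.co/openai/whisper-large")
# print(asr("sample.wav")["text"])
```

Rows whose HF URL is a documentation page (such as Bark or CLAP) do not name a checkpoint, so the helper raises `ValueError` for them; a checkpoint id would have to be looked up on the linked page instead.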