In this repository, you will find several tutorials discussing what Automatic Speech Recognition (ASR) is, its general concepts, specific models, and multiple sub-domains of ASR such as Speech Classification, Voice Activity Detection, Speaker Recognition, Speaker Identification, and Speaker Diarization.
- ASR_with_NeMo: Discussion of the task of ASR, handling of data, understanding the acoustic features, using an acoustic model and training it on an ASR dataset, and finally evaluating the model's performance.
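  As a taste of the end result, a minimal sketch of loading a pretrained model and transcribing audio (assumes NeMo is installed; `sample.wav` is a hypothetical 16 kHz mono file):

  ```python
  import nemo.collections.asr as nemo_asr

  # Download a pretrained character-based CTC model from NGC.
  asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained("QuartzNet15x5Base-En")

  # Transcribe a batch of audio files; returns one transcript per file.
  transcripts = asr_model.transcribe(["sample.wav"])
  print(transcripts[0])
  ```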
- ASR_with_Subword_Tokenization: Modern ASR models benefit from several improvements in neural network design and data processing. In this tutorial, we discuss how we can use tokenizers (commonly found in NLP) to significantly improve the efficiency of ASR models without sacrificing any accuracy during transcription.
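  For illustration, a small sketch of how a subword-based NeMo model exposes its tokenizer (the pretrained model name is assumed to be available on NGC):

  ```python
  import nemo.collections.asr as nemo_asr

  # Subword-based CTC models are exposed as EncDecCTCModelBPE and carry a tokenizer.
  model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_citrinet_256")

  # The tokenizer splits text into subword units instead of individual characters.
  print(model.tokenizer.text_to_tokens("speech recognition"))
  ```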
- ASR_CTC_Language_Finetuning: Until now, we have discussed how to train ASR models from scratch. Once we have pretrained ASR models, we can fine-tune them on domain-specific use cases, or even other languages! This notebook discusses how to fine-tune an English ASR model to another language, and covers several methods to improve the efficiency of transfer learning.
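  A hedged sketch of the core step for character-based models, swapping the decoder vocabulary before fine-tuning (the character set shown is a toy example):

  ```python
  import nemo.collections.asr as nemo_asr

  # Start from a pretrained English character-based CTC model.
  model = nemo_asr.models.EncDecCTCModel.from_pretrained("QuartzNet15x5Base-En")

  # Replace the decoder vocabulary with the target language's character set
  # (toy subset shown); the encoder weights are retained for transfer learning.
  model.change_vocabulary(new_vocabulary=[" ", "a", "b", "c", "d", "e"])

  # ...then point model.setup_training_data(...) at manifests in the new language.
  ```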
- Online_ASR_Microphone_Demo: A short notebook that enables us to speak into a microphone and transcribe speech in an online manner. Note that this is not the most efficient way to perform streaming ASR; it is more of a demo.
- Online_ASR_Microphone_Demo_Cache_Aware_Streaming: This notebook allows you to do real-time ("streaming") speech recognition on audio recorded from your microphone, using "cache-aware" NeMo ASR models specifically tuned for the streaming ASR use case.
- ASR_for_telephony_speech: Audio sources are not homogeneous, nor are the ways to store large audio datasets. Here, we discuss our observations and recommendations when working with audio obtained from telephony speech sources.
- Online_Noise_Augmentation: While academic datasets are useful for training ASR models, they are often pristine and do not represent real-world conditions. So we discuss how to make the model more noise-robust with online audio augmentation.
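  A sketch of configuring an online noise augmentor through the training-data config, assuming the restored config retains its `train_ds` section (the noise manifest path is hypothetical; the keys follow NeMo's augmentor config):

  ```python
  from omegaconf import OmegaConf, open_dict
  import nemo.collections.asr as nemo_asr

  model = nemo_asr.models.EncDecCTCModel.from_pretrained("QuartzNet15x5Base-En")

  # "noise_manifest.json" is a hypothetical manifest listing noise recordings.
  with open_dict(model.cfg.train_ds):
      model.cfg.train_ds.augmentor = OmegaConf.create({
          "noise": {
              "manifest_path": "noise_manifest.json",
              "prob": 0.5,        # perturb roughly half of the training samples
              "min_snr_db": 0,    # mix noise in at SNRs between 0 and 15 dB
              "max_snr_db": 15,
          }
      })

  # Re-create the training dataloader so the augmentation takes effect:
  # model.setup_training_data(model.cfg.train_ds)
  ```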
- Intro_to_Transducers: Previous tutorials discuss ASR models in the context of the Connectionist Temporal Classification (CTC) loss. In this tutorial, we introduce the Transducer loss and the components of this loss function that are constructed in the config file. This tutorial is a prerequisite to the ASR_with_Transducers tutorial.
- ASR_with_Transducers: In this tutorial, we take a deep dive into Transducer-based ASR models, discussing the similarity of setup and config to CTC models, and then train a small ContextNet model on the AN4 dataset. We then discuss how to change the decoding strategy of a trained Transducer from greedy search to beam search. Finally, we wrap up this tutorial by extracting the alignment matrix from a trained Transducer model.
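  A sketch of switching a trained Transducer from greedy to beam-search decoding (the pretrained model name is assumed to be available on NGC):

  ```python
  from omegaconf import open_dict
  import nemo.collections.asr as nemo_asr

  # A small pretrained ContextNet transducer.
  model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained("stt_en_contextnet_256")

  # Modify the decoding config, then apply it to the model.
  decoding_cfg = model.cfg.decoding
  with open_dict(decoding_cfg):
      decoding_cfg.strategy = "beam"
      decoding_cfg.beam.beam_size = 4
  model.change_decoding_strategy(decoding_cfg)
  ```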
- Self_Supervised_Pre_Training: It can often be difficult to obtain labeled data for ASR training. In this tutorial, we demonstrate how to pre-train a speech model in an unsupervised manner, and then fine-tune it with CTC loss.
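  A hedged sketch of the hand-off from pre-training to fine-tuning, assuming the self-supervised and CTC models share the same encoder architecture (model names as published on NGC):

  ```python
  import nemo.collections.asr as nemo_asr

  # A Conformer encoder pre-trained without labels (self-supervised).
  ssl_model = nemo_asr.models.SpeechEncDecSelfSupervisedModel.from_pretrained(
      "ssl_en_conformer_large"
  )

  # A CTC model with a matching encoder; initialize its encoder from the SSL weights.
  asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_conformer_ctc_large")
  asr_model.encoder.load_state_dict(ssl_model.encoder.state_dict())

  # ...then fine-tune asr_model on labeled data with the CTC loss as usual.
  ```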
- Offline_ASR_with_VAD_for_CTC_models: In this tutorial, we demonstrate how to use offline VAD to extract speech segments and transcribe them with CTC models. This helps exclude non-speech utterances and can save compute by removing unnecessary input to the ASR system.
- Multilang_ASR: We will learn how to work with existing checkpoints of multilingual ASR models and how to train new ones. It is possible to create a multilingual version of any ASR model that uses tokenizers. This notebook shows how to create a multilingual version of the small monolingual Conformer Transducer model.
- ASR_Example_CommonVoice_Finetuning: Learn how to fine-tune an ASR model using CommonVoice data for a new language with its own alphabet, Esperanto. We walk through processing the MCV data with HuggingFace Datasets, preparing the tokenizer and model, and then setting up fine-tuning.
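  For example, the Esperanto split of Common Voice can be pulled from the Hugging Face Hub roughly like this (the dataset version is illustrative, and access requires accepting the dataset's terms):

  ```python
  from datasets import load_dataset

  # Load the Esperanto ("eo") training split of Mozilla Common Voice.
  cv_train = load_dataset("mozilla-foundation/common_voice_11_0", "eo", split="train")

  # Each row carries the audio array plus its reference transcription.
  print(cv_train[0]["sentence"])
  ```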
- ASR_Context_Biasing: This tutorial aims to show how to improve the recognition accuracy of specific words in the NeMo Framework for CTC and Transducer (RNN-T) ASR models, using a fast context-biasing method with a CTC-based Word Spotter.
Please refer to the asr_adapter sub-folder, which contains tutorials on the use of Adapter modules to perform domain adaptation on ASR models, as well as on its sub-domains.
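As a taste of what those tutorials cover, a hedged sketch of adding a small adapter to a pretrained model and training only its weights (the adapter name and bottleneck size here are arbitrary):

```python
import nemo.collections.asr as nemo_asr
from nemo.collections.common.parts.adapter_modules import LinearAdapterConfig

model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_conformer_ctc_small")

# Insert a small bottleneck adapter into the encoder.
adapter_cfg = LinearAdapterConfig(
    in_features=model.cfg.encoder.d_model,  # encoder model dimension
    dim=32,                                 # adapter bottleneck size (arbitrary)
)
model.add_adapter(name="domain_adapter", cfg=adapter_cfg)
model.set_enabled_adapters(name="domain_adapter", enabled=True)

# Freeze the base model but leave the adapter weights trainable.
model.freeze()
model.unfreeze_enabled_adapters()
```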
- Streaming_ASR: Some ASR models cannot be used to evaluate very long audio segments due to their design; for example, self-attention models consume memory quadratic in the sequence length. For such cases, this notebook shows how to perform streaming audio recognition in a buffered manner.
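  A toy illustration of the buffering idea (the helper below is illustrative, not NeMo's API): long audio is cut into fixed-size chunks, each padded with surrounding context, and only the central region of each buffer contributes to the final transcript.

  ```python
  import numpy as np

  def buffered_chunks(audio: np.ndarray, sr: int, chunk_s: float = 1.6, buffer_s: float = 4.0):
      """Yield overlapping buffers of `buffer_s` seconds, advancing `chunk_s` at a time."""
      chunk = int(chunk_s * sr)
      buffer = int(buffer_s * sr)
      pad = (buffer - chunk) // 2
      # Pad both ends so the first and last chunks still get surrounding context.
      padded = np.concatenate([np.zeros(pad), audio, np.zeros(pad)])
      for start in range(0, len(audio), chunk):
          yield padded[start : start + buffer]
  ```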
- Buffered_Transducer_Inference: In this notebook, we explore a simple algorithm to perform streaming audio recognition in a buffered manner for Transducer models. This enables the use of Transducers on very long speech segments, similar to CTC models.
- Buffered_Transducer_Inference_with_LCS_Merge: An optional notebook that discusses a different merge algorithm that can be utilized for streaming/buffered inference with Transducer models. It is not a required tutorial, but is useful for researchers who wish to analyse and improve buffered inference algorithms.
- Speech_Commands: Here, we study the task of speech classification - a subset of speech recognition that allows us to classify a spoken utterance into a single label. This lets a user speak a command, which the model recognizes so that an action can be performed.
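  A minimal sketch of loading a pretrained command-recognition model and inspecting its label set (the model name is assumed to match the NGC catalog):

  ```python
  import nemo.collections.asr as nemo_asr

  # A pretrained MatchboxNet speech-command classifier.
  model = nemo_asr.models.EncDecClassificationModel.from_pretrained(
      "commandrecognition_en_matchboxnet3x1x64_v2"
  )

  # The closed set of command labels the model can predict.
  print(model.cfg.labels)
  ```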
- Online_Offline_Speech_Commands_Demo: We perform joint online-offline inference for speech command recognition. An online VAD model detects speech segments (i.e., whether the audio is actually speech or background), and if speech is detected, a speech command recognition model classifies it in an offline manner. Note that this demo illustrates a possible approach and is not meant for large-scale use.
- Voice_Activity_Detection: A special case of speech command recognition, where the task is to classify whether an audio segment contains speech or not. The VAD model is often a tiny model used in front of a larger ASR model.
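  A small sketch of loading a pretrained MarbleNet VAD model (model name as published on NGC) and inspecting its two classes:

  ```python
  import nemo.collections.asr as nemo_asr

  # MarbleNet VAD is a tiny binary classifier over short audio frames.
  vad_model = nemo_asr.models.EncDecClassificationModel.from_pretrained("vad_marblenet")

  # Expected labels: something like ['background', 'speech'].
  print(vad_model.cfg.labels)
  ```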
- Online_Offline_Microphone_VAD_Demo: Similar to the above, we demo online-offline inference for voice activity detection. We discuss metrics for comparing the performance of streaming VAD models, and how one can attempt streaming VAD inference with a microphone. As always, this demo illustrates a possible approach and is not meant for large-scale use.