Major Update for PyTorch and PyTorch Libraries
PyTorch released new updates yesterday to both PyTorch itself and the PyTorch domain libraries. PyTorch 1.10 is a major release that includes CUDA Graphs APIs and frontend and compiler improvements, among a host of other new features, bug fixes, and performance improvements.
The PyTorch 1.10 updates focus on improving training and performance as well as developer usability. The full release notes are available here.
Highlights include:
- CUDA Graphs APIs are integrated to reduce CPU overheads for CUDA workloads (see the sketch after this list).
- Several frontend APIs, such as FX, torch.special, and nn.Module Parametrization, have moved from beta to stable.
- Support for automatic fusion in JIT Compiler expands to CPUs in addition to GPUs.
- Android NNAPI support is now available in beta.
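To give a flavor of the CUDA Graphs APIs, here is a minimal capture-and-replay sketch using torch.cuda.CUDAGraph and the torch.cuda.graph context manager; the tiny linear model and tensor shapes are placeholders, and a CUDA device is required:
import torch

device = torch.device("cuda")
model = torch.nn.Linear(16, 4).to(device)
static_input = torch.randn(8, 16, device=device)

# Warm up on a side stream before capture, as the API requires
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay: refill the static input tensor and relaunch the captured work
# with a single graph launch, avoiding per-op CPU overhead
static_input.copy_(torch.randn(8, 16, device=device))
g.replay()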
New Library Releases in PyTorch 1.10
Alongside PyTorch 1.10, a number of new features and improvements to the PyTorch libraries were also released.
Some highlights include:
- TorchX - a new SDK for quickly building and deploying ML applications from research & development to production.
- TorchAudio - Added a text-to-speech pipeline, self-supervised model support, multi-channel support and an MVDR beamforming module, an RNN transducer (RNNT) loss function, and batch and filterbank support for the lfilter function. See the TorchAudio release notes here.
- TorchVision - Added new RegNet and EfficientNet models, FX-based feature extraction utilities, two new Automatic Augmentation techniques (RandAugment and TrivialAugment), and updated training recipes. See the TorchVision release notes here.
Introducing TorchX
TorchX is a new SDK for quickly building and deploying ML applications from research & development to production. It offers various built-in components that encode MLOps (Machine Learning Operations) best practices and make advanced features like distributed training and hyperparameter optimization accessible to all.
Users can get started with TorchX 0.1 with no added setup cost since it supports popular Machine Learning (ML) schedulers and pipeline orchestrators that are already widely adopted and deployed in production. No two production environments are the same. To accommodate varied use cases, TorchX's core APIs allow extensive customization at well-defined extension points so that even the most unique applications can be served without customizing the whole vertical stack.
Read the documentation for more details and try out this feature using this quickstart tutorial.
TorchAudio 0.10
(Beta) Text-to-speech pipeline
TorchAudio now adds the Tacotron2 model and pretrained weights, making it possible to build a text-to-speech pipeline with existing vocoder implementations like WaveRNN and Griffin-Lim. Building a TTS pipeline requires matching data processing and pretrained weights, which is often non-trivial for users, so TorchAudio introduces a bundle API that makes constructing pipelines for specific pretrained weights easy. The following example illustrates this.
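Here is a minimal sketch using one of the available bundles; the specific bundle (TACOTRON2_WAVERNN_CHAR_LJSPEECH, which pairs Tacotron2 with a WaveRNN vocoder) and the input text are illustrative:
import torchaudio

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH

# Build the text processor, Tacotron2 model, and vocoder from the bundle
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2()
vocoder = bundle.get_vocoder()

# Encode text
input, lengths = processor("Hello world!")

# Generate a mel-scale spectrogram
specgram, lengths, _ = tacotron2.infer(input, lengths)

# Convert the spectrogram to a waveform
waveforms, lengths = vocoder(specgram, lengths)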
For the details of this API please refer to the documentation. You can also try this from the tutorial.
(Beta) Self-Supervised Model Support
TorchAudio added the HuBERT model architecture and pretrained weight support for wav2vec 2.0 and HuBERT. HuBERT and wav2vec 2.0 are novel approaches to audio representation learning that yield high accuracy when fine-tuned on downstream tasks. Because these models can serve as baselines in future research, TorchAudio provides a simple way to run them. As with the TTS pipeline, the pretrained weights and associated information, such as the expected sample rate and output class labels (for fine-tuned weights), are put together as a bundle that can be used to build pipelines.
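As a minimal sketch, using one of the fine-tuned ASR bundles (the bundle choice and the random waveform below are illustrative placeholders):
import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_ASR_LARGE

# Build the model and load the pretrained weights
model = bundle.get_model()

# Placeholder input: one second of audio at the bundle's expected sample rate
waveform = torch.randn(1, int(bundle.sample_rate))

# Emissions over the output class labels (available via bundle.get_labels())
emissions, _ = model(waveform)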
Please refer to the documentation for more details and try out this feature using this tutorial.
(Beta) Multi-channel support and MVDR beamforming
Far-field speech recognition is a more challenging task than near-field recognition. Multi-channel methods such as beamforming help reduce noise and enhance the target speech.
TorchAudio now adds support for differentiable Minimum Variance Distortionless Response (MVDR) beamforming on multi-channel audio using Time-Frequency masks. Researchers can easily assemble it with any multi-channel ASR pipeline. There are three solutions (ref_channel, stv_evd, stv_power), and it supports both single-channel and multi-channel masks (multi-channel masks are averaged inside the method). It also provides an online option that recursively updates the parameters for streaming audio. PyTorch also provides a tutorial on applying MVDR beamforming to multi-channel audio in the example directory.
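Here is a rough sketch of the beta torchaudio.transforms.MVDR interface; the two-channel waveform and the random Time-Frequency masks below are placeholders for a real recording and a neural mask estimator:
import torch
import torchaudio

# A hypothetical 2-channel recording, shape (channel, time)
waveform = torch.rand(2, 16000)

# Complex multi-channel spectrogram, shape (channel, freq, time)
specgram = torch.stft(waveform, n_fft=400, return_complex=True)

# Placeholder Time-Frequency masks for target speech and noise
mask_s = torch.rand(specgram.shape[-2:])
mask_n = torch.rand(specgram.shape[-2:])

# Beamform with the reference-channel solution; online=True would enable
# recursive parameter updates for streaming audio
transform = torchaudio.transforms.MVDR(ref_channel=0, solution="ref_channel")
enhanced = transform(specgram, mask_s, mask_n)  # single-channel (freq, time)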
Please refer to the documentation for more details and try out this feature using the MVDR tutorial.
(Beta) RNN Transducer Loss
The RNN transducer (RNNT) loss is part of the RNN transducer pipeline, a popular architecture for speech recognition tasks. It has recently gained attention for use in streaming settings, and has also achieved state-of-the-art WER on the LibriSpeech benchmark.
TorchAudio's loss function supports float16 and float32 logits, has autograd and TorchScript support, and can be run on both CPU and GPU; the GPU path uses a custom CUDA kernel implementation for improved performance. The implementation is consistent with the original loss function in Sequence Transduction with Recurrent Neural Networks, but relies on code from Alignment Restricted Streaming Recurrent Neural Network Transducer.
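As a rough sketch of how the loss can be called, with logits following the joint-network convention of (batch, source length, target length + 1, classes); the sizes here are arbitrary placeholders:
import torch
from torchaudio.transforms import RNNTLoss

batch, src_len, tgt_len, num_classes = 2, 10, 5, 20

# Joint-network output: (batch, max source length, max target length + 1, classes)
logits = torch.rand(batch, src_len, tgt_len + 1, num_classes, requires_grad=True)

# Target label sequences and the valid lengths of each sequence
targets = torch.randint(0, num_classes - 1, (batch, tgt_len), dtype=torch.int32)
logit_lengths = torch.full((batch,), src_len, dtype=torch.int32)
target_lengths = torch.full((batch,), tgt_len, dtype=torch.int32)

# blank defaults to -1, i.e. the last class index
loss_fn = RNNTLoss()
loss = loss_fn(logits, targets, logit_lengths, target_lengths)
loss.backward()  # autograd support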
Please refer to the documentation for more details.
(Beta) Batch support and filter bank support
torchaudio.functional.lfilter now supports batch processing and multiple filters.
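Here is a hedged sketch of both modes; the single-filter call uses the long-standing interface, while the filter-bank call assumes that 2-D coefficient tensors of shape (num_filters, num_order + 1) apply one filter per row:
import torch
import torchaudio.functional as F

# A batch of three waveforms, shape (batch, time)
waveform = torch.rand(3, 16000)

# A single second-order filter: 1-D coefficient tensors of shape (num_order + 1,)
b_coeffs = torch.tensor([0.4, 0.2, 0.9])
a_coeffs = torch.tensor([1.0, 0.2, 0.6])
filtered = F.lfilter(waveform, a_coeffs, b_coeffs)  # (batch, time)

# A bank of two filters: 2-D coefficient tensors, one filter per row
b_bank = torch.tensor([[0.4, 0.2, 0.9], [0.3, 0.1, 0.5]])
a_bank = torch.tensor([[1.0, 0.2, 0.6], [1.0, 0.1, 0.3]])
filtered_bank = F.lfilter(waveform, a_bank, b_bank)  # (batch, num_filters, time)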
(Prototype) Emformer
Automatic speech recognition (ASR) research and productization have increasingly focused on on-device applications. Towards supporting such efforts, TorchAudio now includes Emformer, a memory-efficient transformer architecture that has achieved state-of-the-art results on LibriSpeech in low-latency streaming scenarios, as a prototype feature.
Please refer to the documentation for more details.
GPU builds
GPU builds that support custom CUDA kernels in TorchAudio, like the one used for the RNN transducer loss, have been added. With this change, TorchAudio's binary distribution now includes both CPU-only and CUDA-enabled versions. To use the CUDA-enabled binaries, PyTorch must also be installed with CUDA support.
TorchVision 0.11
(Stable) New Models
RegNet and EfficientNet are two popular architectures that can be scaled to different computational budgets. This release includes 22 pre-trained weights for their classification variants. The models were trained on ImageNet; see #4403, #4530 and #4293 for details, including the accuracies achieved on ImageNet val.
The models can be used as follows:
import torch
from torchvision import models

# Dummy input batch: one 224x224 RGB image
x = torch.rand(1, 3, 224, 224)

# Load a pretrained RegNet and run inference in eval mode
regnet = models.regnet_y_400mf(pretrained=True)
regnet.eval()
predictions = regnet(x)

# Load a pretrained EfficientNet and run inference in eval mode
efficientnet = models.efficientnet_b0(pretrained=True)
efficientnet.eval()
predictions = efficientnet(x)
See the full list of new models on the torchvision.models documentation page.
(Beta) FX-based Feature Extraction
A new feature extraction method has been added to the utilities. It uses torch.fx and enables you to retrieve the outputs of intermediate layers of a network, which is useful for feature extraction and visualization.
Here is an example of how to use the new utility:
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

x = torch.rand(1, 3, 224, 224)
model = resnet50()

# Map the internal node name to a user-facing output name
return_nodes = {
    "layer4.2.relu_2": "layer4"
}
model2 = create_feature_extractor(model, return_nodes=return_nodes)
intermediate_outputs = model2(x)
print(intermediate_outputs['layer4'].shape)
(Stable) New Data Augmentations
Two new Automatic Augmentation techniques were added: RandAugment and TrivialAugment.
They apply a series of transformations to the original data to boost model performance. The new techniques build on the previously added AutoAugment and focus on simplifying the approach, reducing the search space for the optimal policy, and improving the accuracy gain. These techniques enable users to reproduce recipes that achieve state-of-the-art performance on the offered models, and to apply them for transfer learning to achieve optimal accuracy on new datasets.
Both methods can be used as drop-in replacements for the AutoAugment technique, as seen below:
from torchvision import transforms

# image is a PIL Image (or a Tensor)
t = transforms.RandAugment()
# t = transforms.TrivialAugmentWide()
transformed = t(image)

# The new transforms can also be used inside a Compose pipeline
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandAugment(),  # or transforms.TrivialAugmentWide()
    transforms.ToTensor()])
Read the automatic augmentation transforms documentation for more details.
Updated Training Recipes
PyTorch updated its training reference scripts to add support for Exponential Moving Average, Label Smoothing, Learning-Rate Warmup, Mixup, Cutmix, and other SOTA primitives. These additions improved the classification Acc@1 of some pre-trained models by more than 4 points. A major update of the existing pre-trained weights is expected in the next release.
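As one concrete illustration, label smoothing (one of the primitives listed above) is available directly through the label_smoothing argument that CrossEntropyLoss gained in PyTorch 1.10; the remaining primitives live in the reference scripts themselves:
import torch
import torch.nn as nn

# Label smoothing is natively supported by CrossEntropyLoss as of PyTorch 1.10
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(8, 100)            # (batch, num_classes)
targets = torch.randint(0, 100, (8,))   # ground-truth class indices
loss = criterion(logits, targets)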
Have any questions? Feel free to contact us about using PyTorch with our Deep Learning, Machine Learning and AI systems.