Selfish Sparse RNN Training

Shiwei Liu, Decebal Constantin Mocanu, Yulong Pei, Mykola Pechenizkiy

Research output: Chapter in Book/Report/Conference proceeding › Chapter › Academic › peer-review


Sparse neural networks have been widely applied to reduce the resources required to train and deploy over-parameterized deep neural networks. For inference acceleration, methods that induce sparsity from a pre-trained dense network (dense-to-sparse) work effectively. Recently, dynamic sparse training (DST) has been proposed to train sparse neural networks without pre-training a dense network (sparse-to-sparse), so that the training process can also be accelerated. However, previous sparse-to-sparse methods mainly focus on Multilayer Perceptron Networks (MLPs) and Convolutional Neural Networks (CNNs), and fail to match the performance of dense-to-sparse methods in the Recurrent Neural Network (RNN) setting. In this paper, we propose an approach to train sparse RNNs with a fixed parameter count in a single run, without compromising performance. During training, we allow RNN layers to have a non-uniform redistribution across cell gates for better regularization. Further, we introduce SNT-ASGD, a variant of the averaged stochastic gradient optimizer, which significantly improves the performance of all sparse training methods for RNNs. Using these strategies, we achieve state-of-the-art sparse training results with various types of RNNs on the Penn TreeBank and Wikitext-2 datasets.
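
To make the sparse-to-sparse idea above concrete, here is a minimal sketch of a generic prune-and-regrow step that keeps the number of active parameters fixed. It is not the procedure from the paper: the function name `prune_and_regrow`, the magnitude-based pruning criterion, the random regrowth, and the flat weight list are all assumptions chosen for illustration, and the gate-wise redistribution and SNT-ASGD are not modelled.

```python
import random


def prune_and_regrow(weights, mask, fraction, rng):
    """One illustrative sparse-to-sparse update on a flat weight list.

    Deactivates the `fraction` of active weights with the smallest
    magnitude, then activates the same number of previously inactive
    positions at random (initialised to zero), so the count of active
    parameters is unchanged. Hypothetical helper, not the paper's code.
    """
    active = [i for i, m in enumerate(mask) if m]
    inactive = [i for i, m in enumerate(mask) if not m]
    # Cannot regrow more connections than there are inactive slots.
    k = min(int(fraction * len(active)), len(inactive))

    # Prune: deactivate the k active weights closest to zero.
    for i in sorted(active, key=lambda j: abs(weights[j]))[:k]:
        mask[i] = False
        weights[i] = 0.0

    # Regrow: activate k positions that were inactive before pruning.
    for i in rng.sample(inactive, k):
        mask[i] = True
        weights[i] = 0.0
    return weights, mask


rng = random.Random(0)
n = 20
mask = [i < 6 for i in range(n)]  # 6 of 20 weights active (70% sparse)
weights = [rng.uniform(-1, 1) if m else 0.0 for m in mask]

before = sum(mask)
weights, mask = prune_and_regrow(weights, mask, fraction=0.5, rng=rng)
after = sum(mask)
print(before, after)  # the active-parameter count is the same before and after
```

In a real training loop a step like this would run periodically between optimizer updates, with the mask reapplied after each update so that inactive weights stay at zero.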
Original language: English
Title of host publication: The Thirty-eighth International Conference on Machine Learning, ICML 2021
Publication status: Published - 18 Jul 2021
Event: 38th International Conference on Machine Learning, ICML 2021 - Virtual Conference
Duration: 18 Jul 2021 → 24 Jul 2021
Conference number: 38

Publication series

Name: PMLR, Proceedings of Machine Learning Research


Conference: 38th International Conference on Machine Learning, ICML 2021
Abbreviated title: ICML 2021
City: Virtual Conference

