Google
{hasim,andrewsenior,fsb}@google.com
is connected to the LSTM layer. The recurrent connections in the LSTM layer are directly from the cell output units to the cell input units, input gates, output gates and forget gates. The cell output units are connected to the output layer of the network. The total number of parameters W of an LSTM network with one cell per memory block, ignoring the biases, can be calculated as follows:

W = n_c × n_c × 4 + n_i × n_c × 4 + n_c × n_o + n_c × 3,

where n_c is the number of memory cells, n_i is the number of input units, and n_o is the number of output units.
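To make the formula concrete, here is a minimal Python sketch (not from the paper) that evaluates the parameter count for a given configuration; the example sizes at the bottom are hypothetical.

```python
def lstm_num_params(n_c, n_i, n_o):
    """Weights in a standard LSTM layer (one cell per memory block),
    ignoring biases, following the formula in the text.

    n_c: number of memory cells, n_i: number of input units,
    n_o: number of output units.
    """
    recurrent = n_c * n_c * 4   # cell outputs -> cell input and the three gates
    inputs = n_i * n_c * 4      # network inputs -> cell input and the three gates
    to_output = n_c * n_o       # cell outputs -> output layer
    per_cell_gates = n_c * 3    # one weight per cell for each of the three gates
    return recurrent + inputs + to_output + per_cell_gates

# Hypothetical sizes: 512 cells, 39-dimensional input, 126 output states.
print(lstm_num_params(n_c=512, n_i=39, n_o=126))
```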
of 200,000 frames. The trained models are evaluated in a speech recognition system on a test set of 23,000 hand-transcribed utterances and the word error rates (WERs) are reported. The vocabulary size of the language model used in the decoding is 2.6 million.

The DNNs are trained with SGD with a minibatch size of 200 frames on a Graphics Processing Unit (GPU). Each network is fully connected with logistic sigmoid hidden layers and with a softmax output layer representing phone HMM states. For consistency with the LSTM architectures, some of the networks have a low-rank projection layer [17]. The DNN inputs consist of stacked frames from an asymmetrical window, with 5 frames on the right and either 10 or 15 frames on the left (denoted 10w5 and 15w5, respectively).
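As an illustration of this asymmetric input window, the following sketch (hypothetical code, with assumed array shapes and an assumed edge-padding convention, since the paper does not specify one) stacks each frame with a configurable number of left- and right-context frames, e.g. 10w5 or 15w5.

```python
import numpy as np

def stack_frames(features, left=10, right=5):
    """Stack each frame with `left` preceding and `right` following frames.

    features: [T, D] array of per-frame acoustic features.
    Returns a [T, (left + 1 + right) * D] array; edges are padded by
    repeating the first/last frame (an assumption for this sketch).
    """
    T, D = features.shape
    padded = np.concatenate(
        [np.repeat(features[:1], left, axis=0),
         features,
         np.repeat(features[-1:], right, axis=0)], axis=0)
    windows = [padded[t:t + left + 1 + right].reshape(-1) for t in range(T)]
    return np.stack(windows)

# Example: a 10w5 window over 100 frames of 40-dimensional features.
x = stack_frames(np.random.randn(100, 40), left=10, right=5)
print(x.shape)  # (100, 640)
```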
The LSTM and conventional RNN architectures of various configurations are trained with ASGD with 24 threads, each asynchronously
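The multi-threaded asynchronous SGD mentioned here can be illustrated with a toy Hogwild-style sketch: several threads update a shared parameter vector in place without locking. This is a simplified, hypothetical stand-in on a linear model, not the paper's actual training setup.

```python
import threading
import numpy as np

def asgd_train(grad_fn, w, data_partitions, lr=0.01, steps=2000):
    """Toy asynchronous SGD: each thread samples from its own data
    partition and updates the shared weight vector w in place."""
    def worker(partition):
        rng = np.random.default_rng()
        for _ in range(steps):
            x, y = partition[rng.integers(len(partition))]
            w[:] -= lr * grad_fn(w, x, y)   # in-place update of shared weights

    threads = [threading.Thread(target=worker, args=(p,)) for p in data_partitions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w

# Toy usage: linear-regression gradient, 4 asynchronous workers.
def grad(w, x, y):
    return 2 * (w @ x - y) * x

dim = 8
true_w = np.ones(dim)
data = [(x, true_w @ x) for x in np.random.randn(200, dim)]
parts = [data[i::4] for i in range(4)]
w = asgd_train(grad, np.zeros(dim), parts)
print(np.round(w, 2))  # approximately all ones
```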
[Fig. 3. 2000 context dependent phone HMM states: frame accuracy (%) vs. number of training frames (billions).]
[Fig. 4. 8000 context dependent phone HMM states: frame accuracy (%) vs. number of training frames (billions).]
[Fig. 5. 126 context independent phone HMM states: WER (%) vs. number of training frames (billions).]
Figures 2, 3, and 4 show the frame accuracy results for 126, 2000 and 8000 state outputs, respectively. In the figures, the name of the network configuration encodes the network size and architecture. cN states the number (N) of memory cells in the LSTMs and the number of units in the hidden layer in the RNNs. rN states the number of recurrent projection units in the LSTMs.
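For reference, here is a small sketch of how the configuration names used in the figures could be decoded; the regular expressions and field names are my own, chosen to match the cN/rN convention described above.

```python
import re

def parse_config_name(name):
    """Decode names such as 'LSTM_c2048_r256_p256' or 'DNN_10w5_6_1024_lr256'.

    Returns the architecture plus any cN (cells / hidden units) and
    rN (recurrent projection units) fields; other fields are kept raw.
    """
    parts = name.split('_')
    info = {'arch': parts[0], 'raw': parts[1:]}
    for p in parts[1:]:
        m = re.fullmatch(r'c(\d+)', p)
        if m:
            info['cells'] = int(m.group(1))
        m = re.fullmatch(r'r(\d+)', p)
        if m:
            info['recurrent_projection'] = int(m.group(1))
    return info

print(parse_config_name('LSTM_c2048_r256_p256'))
# {'arch': 'LSTM', 'raw': ['c2048', 'r256', 'p256'],
#  'cells': 2048, 'recurrent_projection': 256}
```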
[Fig. 7. 8000 context dependent phone HMM states: WER (%) vs. number of training frames (billions).]

5. REFERENCES

[1] Yoshua Bengio, Patrice Simard, and Paolo Frasconi, "Learning long-term dependencies with gradient descent is difficult," Neural Networks, IEEE Transactions on, vol. 5, no. 2, pp. 157-166, 1994.

[2] Sepp Hochreiter and Jurgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, Nov. 1997.

[3] Felix A. Gers, Jurgen Schmidhuber, and Fred Cummins, "Learning to forget: Continual prediction with LSTM," Neural Computation, vol. 12, no. 10, pp. 2451-2471, 2000.

[4] Felix A. Gers, Nicol N. Schraudolph, and Jurgen Schmidhuber, "Learning precise timing with LSTM recurrent networks," Journal of Machine Learning Research, vol. 3, pp. 115-143, Mar. 2003.

[5] Felix A. Gers and Jurgen Schmidhuber, "LSTM recurrent networks learn simple context free and context sensitive languages," IEEE Transactions on Neural Networks, vol. 12, no. 6, pp. 1333-1340, 2001.

[6] Mike Schuster and Kuldip K. Paliwal, "Bidirectional recurrent neural networks," Signal Processing, IEEE Transactions on, vol. 45, no. 11, pp. 2673-2681, 1997.

[7] Alex Graves and Jurgen Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 12, pp. 5-6, 2005.

[8] Alex Graves, Marcus Liwicki, Santiago Fernandez, Roman Bertolami, Horst Bunke, and Jurgen Schmidhuber, "A novel connectionist system for unconstrained handwriting recognition," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 5, pp. 855-868, 2009.

[9] Abdel Rahman Mohamed, George E. Dahl, and Geoffrey E. Hinton, "Acoustic modeling using deep belief networks," IEEE Transactions on Audio, Speech & Language Processing, vol. 20, no. 1, pp. 14-22, 2012.

[10] George E. Dahl, Dong Yu, Li Deng, and Alex Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on

[12] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, "Speech recognition with deep recurrent neural networks," in Proceedings of ICASSP, 2013.

[13] Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur, "Recurrent neural network based language model," in Proceedings of INTERSPEECH. 2010, vol. 2010, pp. 1045-1048, International Speech Communication Association.

[14] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, and Andrew Y. Ng, "Large scale distributed deep networks," in NIPS, 2012, pp. 1232-1240.

[15] Gael Guennebaud, Benoît Jacob, et al., "Eigen v3," http://eigen.tuxfamily.org, 2010.

[16] Ronald J. Williams and Jing Peng, "An efficient gradient-based algorithm for online training of recurrent network trajectories," Neural Computation, vol. 2, pp. 490-501, 1990.

[17] T.N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, "Low-rank matrix factorization for deep neural network training with high-dimensional output targets," in Proc. ICASSP, 2013.