Alibaba Open-Sources Its Acoustic Modeling Technology for Speech Recognition


Editor's note: The author of this article is Zhang Shiliang, a senior algorithm engineer at Alibaba's Machine Intelligence Technology Lab. The article introduces Alibaba's new acoustic modeling technology for speech recognition, the deep feedforward sequential memory network (DFSMN). Speech recognition systems based on DFSMN have already been deployed successfully in courtroom transcription, intelligent customer service, video review and real-time subtitling, speaker verification, IoT, and other scenarios. We are now open-sourcing the DFSMN implementation built on the Kaldi speech recognition toolkit, together with the corresponding training recipes. With the released code and training pipeline, the best performance reported so far can be obtained on the public English LibriSpeech dataset.

This post presents DFSMN, an improved Feedforward Sequential Memory Network (FSMN) architecture for large-vocabulary continuous speech recognition. We release the source code and training recipes of DFSMN, built on the popular Kaldi speech recognition toolkit, and demonstrate that DFSMN achieves the best performance on the LibriSpeech speech recognition task.

Acoustic Modeling in Speech Recognition

Deep neural networks have become the dominant acoustic models in large vocabulary continuous speech recognition systems. Depending on how the networks are connected, there exist various types of neural network architectures, such as feedforward fully-connected neural networks (FNN), convolutional neural networks (CNN) and recurrent neural networks (RNN).

For acoustic modeling, it is crucial to exploit the long-term dependencies within the speech signal. Recurrent neural networks (RNNs) are designed to capture long-term dependencies in sequential data through a simple mechanism of recurrent feedback. RNNs can learn to model sequential data over an extended period of time, storing memory in their recurrent connections and carrying out rather complicated transformations on the sequential data. As opposed to FNNs, which can only learn to map a fixed-size input to a fixed-size output, RNNs can in principle learn to map one variable-length sequence to another. Therefore, RNNs, especially long short-term memory (LSTM) networks, have become the most popular choice in acoustic modeling for speech recognition.

In our previous work, we proposed a novel non-recurrent neural architecture, the feedforward sequential memory network (FSMN), which can effectively model long-term dependencies in sequential data without using any recurrent feedback. FSMN is inspired by filter design in digital signal processing: any infinite impulse response (IIR) filter can be well approximated by a high-order finite impulse response (FIR) filter. Because the recurrent layer in RNNs can be conceptually viewed as a first-order IIR filter, it can be closely approximated by a high-order FIR filter. Therefore, we extend the standard feedforward fully connected neural network by augmenting its hidden layers with memory blocks that adopt a tapped-delay-line structure, as in FIR filters. Fig. 1 (a) shows an FSMN with one memory block added to its ℓ-th hidden layer and Fig. 1 (b) shows the FIR-filter-like memory block in FSMN. As a result, the overall FSMN remains a pure feedforward structure, so it can be learned in a much more efficient and stable way than RNNs. The learnable FIR-like memory blocks in FSMNs encode long context information into a fixed-size representation, which helps the model capture long-term dependencies. Experimental results on the English Switchboard recognition task show that FSMN can outperform the popular BLSTM while being faster to train.
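To make the tapped-delay-line idea concrete, below is a minimal NumPy sketch of a bidirectional, vectorized FSMN memory block that applies element-wise FIR-like filter taps over the current, past, and future hidden activations of one layer. The function and parameter names (fsmn_memory_block, a_back, a_ahead) are illustrative only and do not come from the released code.

```python
# A minimal NumPy sketch (illustrative names, assumed shapes) of a bidirectional,
# vectorized FSMN memory block: learnable FIR-like taps applied element-wise to
# the current, past, and future hidden activations of one layer.
import numpy as np

def fsmn_memory_block(h, a_back, a_ahead):
    """h: (T, D) hidden activations of one layer.
    a_back: (N1 + 1, D) taps over the current and N1 past frames.
    a_ahead: (N2, D) taps over N2 future frames.
    Returns the (T, D) memory-block output."""
    T, D = h.shape
    n1, n2 = a_back.shape[0] - 1, a_ahead.shape[0]
    p = np.zeros_like(h)
    for t in range(T):
        # tapped-delay line over the current and past activations (FIR look-back)
        for i in range(min(n1, t) + 1):
            p[t] += a_back[i] * h[t - i]
        # taps over future activations (look-ahead for the bidirectional case)
        for j in range(1, min(n2, T - 1 - t) + 1):
            p[t] += a_ahead[j - 1] * h[t + j]
    return p

# Toy usage: 20 frames, 8-dimensional hidden layer, 10 look-back and 2 look-ahead taps.
h = np.random.randn(20, 8).astype(np.float32)
p = fsmn_memory_block(h,
                      np.random.randn(11, 8).astype(np.float32),
                      np.random.randn(2, 8).astype(np.float32))
print(p.shape)  # (20, 8)
```

Because the taps are ordinary learnable weights and there is no recurrence, all frames can be processed in parallel, which is why FSMN training is more efficient and stable than RNN training.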


Fig. 1. Illustration of FSMN and its tapped-delay memory block


Fig. 2. Illustration of Deep-FSMN (DFSMN) with skip connection

In this work, building on our previous FSMN work and on recent work on very deep neural network architectures, we present an improved FSMN structure, the Deep-FSMN (DFSMN) (as shown in Fig. 2), by introducing skip connections between the memory blocks of adjacent layers. These skip connections allow information to flow across layers and thus alleviate the vanishing gradient problem when building very deep structures. We can successfully build DFSMNs with dozens of layers that significantly outperform the previous FSMN.
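As an illustration of how such a stack might be wired, here is a hedged sketch (hypothetical names, not the Kaldi implementation) of a chain of DFSMN components in which each memory block adds the previous component's memory output as a skip connection. It reuses the fsmn_memory_block function from the sketch above and omits the low-rank linear projections and stride factors used in the actual model.

```python
# A hedged sketch (hypothetical names, not the Kaldi implementation) of a stack of
# DFSMN components. Each component is a feedforward layer plus an FSMN memory block,
# and each memory block adds a skip connection from the previous component's memory
# output, so gradients reach the bottom of a very deep stack directly.
# Reuses fsmn_memory_block() from the earlier sketch; the low-rank projection and
# stride factors of the real model are omitted for brevity.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def dfsmn_component(x, prev_memory, W, b, a_back, a_ahead):
    """x: (T, D) input; prev_memory: (T, D) memory output of the previous
    component, or None for the first one. Returns the new (T, D) memory output."""
    h = relu(x @ W + b)                        # feedforward hidden layer
    p = fsmn_memory_block(h, a_back, a_ahead)  # FIR-like memory block
    p += h                                     # connection from the hidden layer itself
    if prev_memory is not None:
        p += prev_memory                       # skip connection between memory blocks
    return p

# Toy usage: six components, as in the small DFSMN_S configuration.
T, D = 20, 8
x, memory = np.random.randn(T, D).astype(np.float32), None
for _ in range(6):
    W = (np.random.randn(D, D) * 0.1).astype(np.float32)
    b = np.zeros(D, dtype=np.float32)
    a_back = (np.random.randn(11, D) * 0.1).astype(np.float32)
    a_ahead = (np.random.randn(2, D) * 0.1).astype(np.float32)
    memory = dfsmn_component(x, memory, W, b, a_back, a_ahead)
    x = memory  # the next component takes the memory output as its input
print(memory.shape)  # (20, 8)
```

Without the prev_memory skip term, gradients would have to pass through every intermediate transformation of a very deep stack, which is where the vanishing gradient problem arises.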

We implement DFSMN based on the popular Kaldi speech recognition toolkit and release the source code at https://github.com/tramphero/... DFSMN is embedded into the Kaldi nnet1 framework by adding DFSMN-related components and CUDA kernel functions. We use mini-batch based training instead of multi-stream training, which is more stable and efficient.

Improving the State of the Art

We trained DFSMN on the LibriSpeech corpus, a large (1000-hour) corpus of English read speech derived from audiobooks from the LibriVox project, sampled at 16 kHz. We trained DFSMN under two official settings using Kaldi recipes: 1) a model trained on the "cleaned" data (960-hours setting); 2) a model trained on the speed-perturbed and volume-perturbed "cleaned" data (3000-hours setting).

For the plain 960-hours setting, the best model in the previous official Kaldi release is a cross-entropy trained BLSTM. For comparison, we trained DFSMN with the same front-end processing and decoding configurations as the official BLSTM, using the cross-entropy criterion. The experimental results are shown in Table 1. For the augmented 3000-hours setting, the previous best result is achieved by a TDNN trained with lattice-free MMI followed by sMBR-based discriminative training. In comparison, we trained DFSMN with cross-entropy followed by one epoch of sMBR-based discriminative training. The experimental results are shown in Table 2. In both settings, our DFSMN achieves significant performance improvements over the previous best results.
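The tables below report word error rate (WER), the standard metric for speech recognition: the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The small helper below is illustrative only and is not part of the released scripts.

```python
# Illustrative only: a small word error rate (WER) helper, not part of the
# released scripts. WER = (substitutions + deletions + insertions) / reference length,
# computed with a standard edit-distance dynamic program over word sequences.
def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return 100.0 * dp[len(r)][len(h)] / len(r)

print(round(wer("the cat sat on the mat", "the cat sat on mat"), 2))  # 16.67 (one deletion)
```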

Table 1. Performance (WER in %) of BLSTM and DFSMN trained on cleaned data.


Table 2. Performance (WER in %) of BLSTM and DFSMN trained on speed-perturbed and volume-perturbed cleaned data.


How to Get Our Implementation and Reproduce Our Results

We provide two ways to get the implementation and reproduce our results: 1) a GitHub project based on Kaldi; 2) a PATCH file containing the DFSMN-related code and example scripts.

Get the GitHub project

git clone https://github.com/tramphero/...

Apply the PATCH
The PATCH was created against the Kaldi speech recognition toolkit at commit "04b1f7d6658bc035df93d53cb424edc127fab819". You can apply it to your own Kaldi branch with the following commands:

Take a look at what changes are in the patch:

git apply --stat Alibaba_MIT_Speech_DFSMN.patch

Test the patch before you actually apply it:

git apply --check Alibaba_MIT_Speech_DFSMN.patch

If you don't get any errors, the patch can be applied cleanly:

git am --signoff < Alibaba_MIT_Speech_DFSMN.patch

The training scripts and experimental results for the LibriSpeech task are available at https://github.com/tramphero/... There are three DFSMN configurations with different model sizes: DFSMN_S, DFSMN_M, and DFSMN_L.


Train FSMN models on the cleaned-up data, with one run per DFSMN configuration (DFSMN_S, DFSMN_M, DFSMN_L):

local/nnet/run_fsmn_ivector.sh DFSMN_S

local/nnet/run_fsmn_ivector.sh DFSMN_M

local/nnet/run_fsmn_ivector.sh DFSMN_L


DFSMN_S is a small model with six DFSMN components, while DFSMN_L is a large model consisting of ten DFSMN components. For the 960-hours setting, training DFSMN_S takes about 2-3 days on a single M40 GPU. Detailed experimental results are listed in the RESULTS file.

For more details, take a look at our paper and the open-source project.

Author: Rentai
This article is original content from the Yunqi Community and may not be reproduced without permission.
