TheanoLM

Welcome to the TheanoLM documentation. See the About section for a short introduction to the project. Getting Started provides a tutorial on how to get started with training neural network language models and performing various operations with them. The development documentation is intended to help with extending the toolkit.

About

Introduction

TheanoLM is a recurrent neural network language modeling tool implemented using the Python library Theano. Theano allows the user to customize and extend the neural network very conveniently, while still generating highly efficient code that can utilize multiple GPUs or CPUs for parallel computation. TheanoLM allows the user to specify an arbitrary network architecture. New layer types and optimization methods can be easily implemented.

TheanoLM can be used for rescoring n-best lists and Kaldi lattices, decoding HTK word lattices, and generating text. It can be called from command line or from a Python script.

Implementations of many currently popular layer types are provided, such as long short-term memory (LSTM), gated recurrent units (GRU), bidirectional recurrent networks, gated linear units (GLU), and highway networks. Several Stochastic Gradient Descent (SGD) based optimizers are implemented, including RMSProp, AdaGrad, ADADELTA, and Adam.

There are several features that are especially useful with very large vocabularies. The effective vocabulary size can be reduced by using a class model. TheanoLM also supports subword vocabularies, created e.g. using Morfessor. In addition to the standard cross-entropy cost, one can use sampling-based noise-contrastive estimation (NCE) or BlackOut.

Publications

Seppo Enarvi and Mikko Kurimo (2016), TheanoLM — An Extensible Toolkit for Neural Network Language Modeling. In Proceedings of the 17th Annual Conference of the International Speech Communication Association (INTERSPEECH).

Seppo Enarvi, Peter Smit, Sami Virpioja, and Mikko Kurimo (2017), Automatic Speech Recognition with Very Large Conversational Finnish and Estonian Vocabularies. In IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Peter Smit, Siva Gangireddy, Seppo Enarvi, Sami Virpioja, and Mikko Kurimo (2017), Aalto System for the 2017 Arabic Multigenre Broadcast Challenge. In Proceedings of the 2017 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

License

TheanoLM is licensed under the Apache License, Version 2.0.

Getting Started

Installation

pip

TheanoLM is available from the Python Package Index. The easiest way to install it is using pip. It requires the NumPy, Theano, and H5py packages. Theano also requires Six and Nose. pip tries to install all the dependencies automatically. Notice that TheanoLM supports only Python 3. On some systems a different version of pip is used to install Python 3 packages; in Ubuntu the command is pip3. To install system-wide, use:

sudo pip3 install TheanoLM

This requires that you have superuser privileges. You can also install TheanoLM and the dependencies under your home directory by passing the --user argument to pip:

pip3 install --user TheanoLM

Installing all Python modules into a single environment can make managing the packages and dependencies somewhat difficult. A convenient alternative is to install TheanoLM in an isolated Python environment created using virtualenv or the standard library module venv that is shipped with Python 3.3 and later. You have to install wheel before installing other packages. For example, to create a virtual environment for TheanoLM in ~/theanolm, use:

python3 -m venv ~/theanolm
source ~/theanolm/bin/activate
pip3 install wheel
pip3 install TheanoLM

You’ll continue working in the virtual environment until you deactivate it with the deactivate command. It can be activated again using source ~/theanolm/bin/activate.

Anaconda

Perhaps a more convenient tool for managing a large collection of Python packages is the conda package manager. The Anaconda distribution includes the package manager and a number of packages for scientific computing. Make sure to select the Python 3 version.

Most of the dependencies are included in the distribution. Currently the bleeding-edge version of Theano is required. It can be installed through the mila-udem channel. It is also very easy to install the libgpuarray and pygpu dependencies, required for GPU computation, in the same way:

conda install -c mila-udem/label/pre theano pygpu libgpuarray

TheanoLM can be installed through the conda-forge channel:

conda install -c conda-forge TheanoLM

Linux

Linux distributions commonly provide most of the dependencies through their package repositories. You might want to keep the dependencies up to date using the system package manager. For example, you can install the dependencies (except Theano) in Ubuntu by issuing the following commands, before installing TheanoLM:

sudo apt-get install python3-numpy python3-h5py
sudo apt-get install python3-six python3-nose python3-nose-parameterized
sudo apt-get install python3-pip

Source Code

The source code is distributed through GitHub. For developing TheanoLM you need to work on the Git repository tree. The package can be installed using pip from the repository root. There is a convenient option --editable that causes pip to install stub scripts that call the program binaries from the repository. This way you avoid having to reinstall the project after every change.

I recommend forking the repository first on GitHub, so that you can commit changes to your personal copy of the repository. Then install Theano and H5py using pip or Anaconda. Clone the forked repository and run pip from the repository root:

git clone https://github.com/my-username/theanolm.git
cd theanolm
pip3 install --editable .

Basic Usage

The theanolm command recognizes several subcommands:

theanolm train
Trains a neural network language model.
theanolm score
Performs text scoring and perplexity computation using a neural network language model.
theanolm decode
Decodes a word lattice using a neural network to compute the language model probabilities.
theanolm sample
Generates sentences by sampling words from a neural network language model.
theanolm version
Displays the version number and exits.

The complete list of command line options available for each subcommand can be displayed with the --help argument, e.g.:

theanolm train --help

Using GPUs

Theano can automatically utilize NVIDIA GPUs for numeric computation. Whether the CPU or a GPU is used is selected by configuring Theano. This is totally transparent to TheanoLM.

First you need to have CUDA installed. The new GpuArray backend is now the only GPU backend that Theano supports. Before using it you have to install the libgpuarray library. Also, it currently requires cuDNN for all the necessary operations to work, and cuDNN requires a graphics card with compute capability 3.0 or higher. The backend is still under active development, so using the latest development versions of Theano and libgpuarray from GitHub is recommended.

The first GPU device can be selected by setting device=cuda0 in the $THEANO_FLAGS environment variable, or in the .theanorc configuration file. The simplest way to get started is to set $THEANO_FLAGS as follows:

export THEANO_FLAGS=floatX=float32,device=cuda0

floatX=float32 selects 32-bit floating point precision, which is not required anymore in the new backend, but is a good idea to conserve memory. In order to use multiple GPUs, one would map the cuda devices to dev names, e.g.:

export THEANO_FLAGS="floatX=float32,contexts=dev0->cuda0;dev1->cuda1"
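
If you prefer a persistent configuration, the same single-GPU settings can be placed in Theano's .theanorc configuration file instead of the environment variable. A minimal sketch of ~/.theanorc:

[global]
floatX = float32
device = cuda0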

For details on configuring Theano, see Theano Configuration in the API documentation.

Training a language model

Vocabulary

Because of the softmax normalization performed over the vocabulary at the output layer of a neural network, vocabulary size has a huge impact on training speed. Vocabulary size can be reduced by clustering words into classes and estimating a language model over the word classes, or by using subword units. Another option is to approximate the softmax normalization using hierarchical softmax, noise-contrastive estimation, or BlackOut. These options are explained below:

  • Class-based models are probably the fastest to train and evaluate, because the vocabulary size is usually a few thousand. TheanoLM will use unigram probabilities for words inside the classes. TheanoLM is not able to generate word classes automatically. You can use for example Percy Liang’s brown-cluster, ngram-class from SRILM, mkcls from GIZA++, or word2vec (with the -classes switch); a sketch of class creation follows this list. Creating the word classes can take a considerable amount of time.
  • A feasible alternative with agglutinative languages is to segment words into subword units. For example, a typical vocabulary created with Morfessor is on the order of 10,000 statistical morphs. The vocabulary and training text then contain morphs instead of words, and the <w> token is used to separate words.
  • A vocabulary as large as hundreds of thousands of words is possible when using hierarchical softmax (hsoftmax) output. The output layer is factorized into two levels, both performing normalization over an equal number of choices. Training will be considerably faster than with regular softmax, but the number of parameters will still be large, meaning that the amount of GPU memory may limit the usable vocabulary size.
  • A new alternative to hierarchical softmax is to approximate softmax by sampling a subset of the vocabulary for each mini-batch and contrasting the correct target words to these noise words only, instead of the whole vocabulary. Only the normal softmax output layer supports sampling. This is explained in the Cost function section below.
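
As a sketch of class creation with one of the tools mentioned above, the command below uses ngram-class from SRILM. The option names and the number of classes are assumptions about the SRILM tool rather than part of TheanoLM, and the resulting class file may still need to be converted into one of the vocabulary formats described in the next section:

ngram-class -text training-data.txt -numclasses 2000 -classes vocabulary.classes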

A vocabulary can be provided for the theanolm train command using the --vocabulary argument. If a vocabulary is not given, all the words from the training set will be added to the vocabulary. If a vocabulary is read from a file, those words are called a shortlist. The shortlist words will be predicted by the neural network. The rest of the words from the training data will be added to the vocabulary, but they will not be predicted by the neural network. Their probability can be computed using the <unk> token and their frequencies in the training data.

If classes are not used, a vocabulary file is simply a list of words, one per line, and the --vocabulary-format words argument should be given. Words that do not appear in the vocabulary will be mapped to the <unk> token. The vocabulary file can also contain classes in one of two formats, specified by the --vocabulary-format argument:

  • classes Each line contains a word and an integer class ID. Class membership probabilities p(word | class) are computed as unigram maximum likelihood estimates from the training data.
  • srilm-classes The vocabulary file is expected to contain word class definitions in SRILM format. Each line contains a class name, a class membership probability, and a word. Examples of both vocabulary formats are shown below.
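
For illustration only (the words, class names, and probabilities below are made up), a vocabulary in words format is just a list of words:

yes
no
maybe

The same vocabulary in srilm-classes format adds a class name and a class membership probability on each line:

CLASS-00001 0.6 yes
CLASS-00001 0.4 no
CLASS-00002 1.0 maybe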

Network structure description

The neural network architecture is specified in a text file that contains input and layer elements, one element per line. Input elements start with the word input and should contain the following fields:

  • type is either word or class and selects the input unit.
  • name is used to identify the input.

Layer elements start with the word layer and may contain the following fields:

  • type selects the layer class. Has to be specified for all layers. See below for possible values.
  • name is used to identify the layer. Has to be specified for all layers.
  • input specifies a network input or a layer whose output will be the input of this layer. Some layer types allow multiple inputs.
  • size gives the number of output connections. If not given, defaults to the number of input connections. Will be automatically set to the size of the vocabulary in the output layer.
  • dropout_rate may be set in the dropout layer.

Currently the following layer types are implemented:

  • projection projects words to continuous vectors. Required as the first layer.
  • tanh basic feedforward layer with tanh activation.
  • lstm long short-term memory.
  • gru gated recurrent unit.
  • blstm bidirectional LSTM.
  • bgru bidirectional GRU.
  • highwaytanh highway network layer with tanh activation.
  • dropout a layer without any units that just performs Dropout.
  • softmax normal softmax output layer. The last layer has to be softmax or hsoftmax.
  • hsoftmax two-level hierarchical softmax.

The elements have to be specified in the order in which the network is constructed, i.e. an element can have in its inputs only elements that have already been specified. Multiple layers may have the same element in their input. The first layer should be a projection layer. The last layer is where the network output will be read from. The description of a typical LSTM neural network language model could look like this:

input type=class name=class_input
layer type=projection name=projection_layer input=class_input size=100
layer type=lstm name=hidden_layer input=projection_layer size=300
layer type=softmax name=output_layer input=hidden_layer

A dropout layer is not a real layer in the sense that it does not contain any neurons. It can be added after another layer, and it only sets some activations randomly to zero at training time. This is helpful with larger networks to prevent overfitting. The effect can be controlled using the dropout_rate parameter. The larger the dropout rate, the slower the training converges.

A larger network with dropout layers, word input, and hierarchical softmax output, could be specified using the following description:

input type=word name=word_input
layer type=projection name=projection_layer input=word_input size=500
layer type=dropout name=dropout_layer_1 input=projection_layer dropout_rate=0.2
layer type=lstm name=hidden_layer_1 input=dropout_layer_1 size=1500
layer type=dropout name=dropout_layer_2 input=hidden_layer_1 dropout_rate=0.2
layer type=tanh name=hidden_layer_2 input=dropout_layer_2 size=1500
layer type=dropout name=dropout_layer_3 input=hidden_layer_2 dropout_rate=0.2
layer type=hsoftmax name=output_layer input=dropout_layer_3

Optimization

The objective of the implemented optimization methods is to maximize the likelihood of the training sentences. All the implemented optimization methods are based on Gradient Descent, meaning that the neural network parameters are updated by taking steps proportional to the negative of the gradient of the cost function. The true gradient is approximated by subgradients on subsets of the training data called “mini-batches”.

The size of the step taken when updating the neural network parameters is controlled by the “learning rate”. The initial value can be set using the --learning-rate argument. The average per-word gradient will be multiplied by this factor. In practice, the cost function is divided by the number of training examples in the mini-batch, so the gradient is scaled by the number of words. In most cases something between 0.1 and 1.0 works well, depending on the optimization method and data.

The number of sequences included in one mini-batch can be set with the --batch-size argument. Larger mini-batches are more efficient to compute on a GPU and result in more reliable gradient estimates. However, when a larger batch size is selected, the learning rate may have to be reduced to keep the optimization stable, which makes too large a batch size inefficient. Usually something like 16 or 32 works well.

The maximum sequence length may be given with the --sequence-length argument, which limits the time span over which the network can learn dependencies. Longer sentences will be split into multiple sequences. If the argument is not given, the sequences in a mini-batch correspond to sentences. There’s no point in using a value greater than 100, and smaller values such as 25 or 50 can be used to limit memory consumption and make the computation more efficient.

The optimization method can be selected using the --optimization-method argument. Methods that adapt the gradients before updating parameters can considerably improve the speed of convergence, but training may be less stable. In order to avoid the gradients exploding, gradient normalization is recommended. With the --max-gradient-norm argument one can set the maximum for the norm of the (adapted) gradients. Typically 5 or 15 works well. The table below suggests some values for learning rate. Those are a good starting point, assuming gradient normalization is used.

Optimization Method              --optimization-method   --learning-rate
Stochastic Gradient Descent      sgd                     1
Nesterov Momentum                nesterov                1 or 0.1
AdaGrad                          adagrad                 1 or 0.1
ADADELTA                         adadelta                10 or 1
SGD with RMSProp                 rmsprop-sgd             0.1
Nesterov Momentum with RMSProp   rmsprop-nesterov        0.01
Adam                             adam                    0.01

AdaGrad automatically scales the gradients before updating the neural network parameters. It seems to be the fastest method to converge and usually gets close to the optimum without manual annealing. ADADELTA is an extension of AdaGrad that is not as aggressive in scaling down the gradients. It seems to benefit from manual annealing, but still stays behind AdaGrad in terms of final model performance. Nesterov Momentum requires manual annealing, but may find a better final model.
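
As a sketch of how the options discussed above fit together on the command line (the values are only starting points picked from the ranges mentioned in this section):

theanolm train model.h5 \
  --training-set training-data.txt \
  --validation-file validation-data.txt \
  --optimization-method adagrad \
  --learning-rate 0.1 \
  --batch-size 16 \
  --sequence-length 50 \
  --max-gradient-norm 5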

Cost function

The objective of the optimization can be changed by selecting a different cost function using the --cost argument. The standard cross-entropy cost involves normalization by computing all the output probabilities. Recently proposed alternatives, noise-contrastive estimation (nce) and BlackOut (blackout), perform normalization only on a subset of the vocabulary during training. This subset, called noise words, is randomly sampled.

The sampling-based costs can be faster to compute, but they are less stable and converge more slowly. For each data word, k noise words are sampled; k can be set using the --num-noise-samples argument. The higher the number of noise samples, the more stable but the slower the training is.

Creating a different noise sample for every data word is very slow. The noise sample can be shared across the mini-batch using the --noise-sharing argument. The value batch creates just one noise sample for the entire mini-batch. The value seq creates one noise sample for each time step (word inside a sequence), but shares the noise samples between sequences. Because of how multinomial sampling is currently implemented in Theano, noise sharing is practically necessary and it limits the total number of noise samples per mini-batch to the vocabulary size.

The distribution from which the noise samples are drawn plays an important role. Uniform sampling is very fast, but rarely gives good results. It can be selected by setting the --noise-dampening argument to zero. Setting that argument to one corresponds to sampling from the unigram distribution of the training data. The problem with the unigram distribution is that very rare words may never get sampled. Usually the optimal value is a bit lower than one.
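
For example, sampling-based training could be enabled roughly as follows. The flags are the ones described above; the particular values are illustrative assumptions:

theanolm train model.h5 \
  --training-set training-data.txt \
  --cost nce \
  --num-noise-samples 500 \
  --noise-sharing seq \
  --noise-dampening 0.75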

Command line

The train command takes two mandatory arguments: the output model path and the --training-set argument followed by the paths to one or more training data files. The rest of the arguments have default values. You probably want to provide a validation text to monitor the progress of the training. Below is an example that shows what the command line may look like at its simplest:

theanolm train model.h5 \
  --training-set training-data.txt \
  --validation-file validation-data.txt

The input files can be either plain text or compressed with gzip. Text data is read one utterance per line. Start-of-sentence and end-of-sentence tags (<s> and </s>) will be added to the beginning and end of each utterance, if they are missing. If an empty line is encountered, it will be ignored, instead of being interpreted as the empty sentence <s> </s>.

The default lstm300 network architecture is used unless another architecture is selected with the --architecture argument. A larger network can be selected with lstm1500, or a path to a custom network architecture description can be given.

The no-improvement stopping condition can be used when validation data is provided. It halves the learning rate when validation set perplexity stops improving, and stops training when the perplexity does not improve at all with the current learning rate. The --validation-frequency argument defines how many cross-validations are performed in each epoch. The --patience argument defines how many times perplexity is allowed to increase before the learning rate is reduced.

Below is a more complex example that reads word classes from vocabulary.classes and uses Nesterov Momentum optimizer with annealing:

theanolm train \
  model.h5 \
  --training-set training-data.txt.gz \
  --validation-file validation-data.txt.gz \
  --vocabulary vocabulary.classes \
  --vocabulary-format srilm-classes \
  --architecture lstm1500 \
  --learning-rate 1.0 \
  --optimization-method nesterov \
  --stopping-condition no-improvement \
  --validation-frequency 8 \
  --patience 4

Model file

The model will be saved in HDF5 format. During training, TheanoLM will save the model every time a minimum of the validation set cost is found. The file contains the current values of the model parameters and the training hyperparameters. The model can be inspected with command-line tools such as h5dump (hdf5-tools Ubuntu package), and loaded into mathematical computation environments such as MATLAB, Mathematica, and GNU Octave.
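
For a quick look at what the file contains, the contents listing of h5dump can be used, for example:

h5dump -n model.h5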

If the file exists already when the training starts, and the saved model is compatible with the specified command line arguments, TheanoLM will automatically continue training from the previous state.

Recipes

There are examples for training language models in the recipes directory for two data sets. penn-treebank uses the data distributed with RNNLM basic examples. google uses the WMT 2011 News Crawl data, processed with the scripts provided by the 1 Billion Word Language Modeling Benchmark. The examples demonstrate class-based models, hierarchical softmax, and noise-contrastive estimation.

Applying a language model

Scoring a text corpus

The theanolm score command can be used to compute the perplexity of evaluation data, or to rescore an n-best list by computing the probability of each sentence. It takes two positional arguments: the path to the TheanoLM model and the text to be evaluated. Evaluation data is processed identically to training and validation data, i.e. explicit start-of-sentence and end-of-sentence tags are not needed at the beginning and end of each utterance, except when one wants to compute the probability of the empty sentence <s> </s>.

What the command prints can be controlled by the --output parameter. The value can be one of:

perplexity
Compute perplexity and other statistics of the entire corpus.
word-scores
Display log probability scores of each word, in addition to sentence and corpus perplexities.
utterance-scores
Write just the log probability score of each utterance, one per line. This can be used for rescoring n-best lists.

The easiest way to evaluate a model is to compute its perplexity on evaluation data, a lower perplexity meaning a better match. Note that perplexity values are meaningful to compare only when the vocabularies are identical. If you want to compare perplexities with back-off model perplexities computed e.g. using SRILM, note that SRILM ignores OOV words when computing the perplexity. You get the same behaviour from TheanoLM if you use --exclude-unk. TheanoLM includes sentence end tokens in the perplexity computation, so you should look at the ppl value in the SRILM output. The example below shows how one can compute the perplexity of a model on evaluation data, while ignoring OOV words:

theanolm score model.h5 test-data.txt --output perplexity --exclude-unk

When the vocabulary of the neural network model is limited to a subset of the words that occur in the training data (called a shortlist), it is possible to estimate the probability of the out-of-shortlist words using their unigram frequencies in the training data. This approach is enabled using the --shortlist argument, e.g.:

theanolm score model.h5 test-data.txt --output perplexity --shortlist

The probability of the <unk> token is distributed among the out-of-shortlist words that appear in the training data. Words that didn’t appear in the training data will be ignored. For this to work correctly, --exclude-unk shouldn’t be used when training the model.

Probabilities of individual words can be useful for debugging problems. The word-scores output can be compared to the -ppl -debug 2 output of SRILM. While the base chosen to represent log probabilities does not affect perplexity, the same base has to be used when comparing log probabilities. Internally TheanoLM uses the natural logarithm, and by default it also prints the log probabilities in the natural base. SRILM prints base-10 log probabilities, so in order to get comparable log probabilities, you should use --log-base 10 with TheanoLM. The example below shows how one can display individual word scores in base 10:

theanolm score model.h5 test-data.txt --output word-scores --log-base 10

Rescoring n-best lists

A typical use of a neural network language model is to rescore n-best lists generated during the first recognition pass. Often a word lattice that represents the search space can be created as a by-product in an ASR decoder. An n-best list can be decoded from a word lattice using lattice-tool from SRILM. Normally there are many utterances, so the lattice files are listed in, say, lattices.txt. The example below reads the lattices in HTK SLF format and writes 100-best lists to the nbest directory:

mkdir nbest
lattice-tool -in-lattice-list lattices.txt -read-htk -nbest-decode 100 \
             -out-nbest-dir nbest

It would be inefficient to call TheanoLM on each n-best list separately. A better approach is to concatenate them into a single file and prefix each line with the utterance ID:

for gz_file in nbest/*.gz
do
    utterance_id=$(basename "${gz_file}" .gz)
    zcat "${gz_file}" | sed "s/^/${utterance_id} /"
done >nbest-all.txt

lattice-tool output includes the acoustic and language model scores. TheanoLM needs only the sentences. You should use --log-base 10 if you’re rescoring an n-best list generated using SRILM:

cut -d' ' -f5- <nbest-all.txt >sentences.txt
theanolm score model.h5 sentences.txt \
    --output-file scores.txt --output utterance-scores \
    --log-base 10

The resulting file scores.txt contains one log probability on each line. These can be simply inserted into the original n-best list, or interpolated with the original language model scores using some weight lambda:

paste -d' ' scores.txt nbest-all.txt |
awk -v "lambda=0.5" \
    '{ nnscore = $1; boscore = $4;
       $1 = ""; $4 = nnscore*lambda + boscore*(1-lambda);
       print }' |
awk '{ $1=$1; print }' >nbest-interpolated.txt

The total score of a sentence can be computed by weighting the language model scores with some value lmscale and adding the acoustic score. The best sentences from each utterance are obtained by sorting by utterance ID and score, and taking the first sentence of each utterance. The fields we have in the n-best file are utterance ID, acoustic score, language model score, and number of words:

awk -v "lmscale=14.0" \
    '{ $2 = $2 + $3*lmscale; $3 = $4 = "";
       print }' <nbest-interpolated.txt |
sort -k1,1 -k2,2gr |
awk '$1 != id { id = $1; $2 = ""; print }' |
awk '{ $1=$1; print }' >1best.ref

Decoding word lattices

The theanolm decode command can be used to decode the best paths directly from word lattices using neural network language model probabilities. This is more efficient than creating an intermediate n-best list and rescoring every sentence:

theanolm decode model.h5 \
    --lattice-list lattices.txt --lattice-format slf \
    --output-file 1best.ref --output ref \
    --nnlm-weight 0.5 --lm-scale 14.0

The lattices may be in SLF format (originating from the HTK recognizer) or in the text CompactLattice format used by the Kaldi recognizer. The format is selected using the --lattice-format argument (either “slf” or “kaldi”). With the Kaldi format you also have to provide a mapping from words to the word IDs used in the lattices, using the --kaldi-vocabulary argument. Typically the file is called “words.txt” and stored in the lang directory.
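
For example, decoding Kaldi lattices into best-path transcripts could look roughly like this (the paths are placeholders and the weights are the same illustrative values as in the SLF example above):

theanolm decode model.h5 \
    --lattice-list lattices.txt --lattice-format kaldi \
    --kaldi-vocabulary lang/words.txt \
    --output-file 1best.ref --output ref \
    --nnlm-weight 0.5 --lm-scale 14.0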

In principle, the context length is not limited in recurrent neural networks, so an exhaustive search of word lattices would be too expensive. There are a number of parameters that can be used to constrain the search space by pruning unlikely tokens (partial hypotheses). These are:

--max-tokens-per-node : N
Retain at most N tokens at each node. Limiting the number of tokens is very effective in cutting the computational cost. Higher values mean a higher probability of finding the best path, but also a higher computational cost. A good starting point is 64.
--beam : logprob
Specifies the maximum log probability difference to the best token at a given time. Beam pruning starts to have an effect when the beam is smaller than 1000, but the effect on word error rate is small until the beam is smaller than 500.
--recombination-order : N
When two tokens have identical history up to N previous words, keep only the best token. Effectively this assumes that the influence of a word is limited to the probability of the next N words. Recombination seems to have little effect on word error rate until N gets close to 20.
--prune-relative : R
If this argument is given, the --max-tokens-per-node and --beam parameters will be adjusted relative to the number of tokens in each node. Those parameters will be divided by the number of tokens and multiplied by R. This is especially useful in cases such as character language models.
--abs-min-max-tokens : N
Specifies a minimum value for the maximum number of tokens, when using --prune-relative.
--abs-min-beam : logprob
Specifies a minimum value for the beam, when using --prune-relative.

The work can be divided into several jobs for a compute cluster, each processing the same number of lattices. For example, the following SLURM job script would create an array of 50 jobs. Each would run its own TheanoLM process and decode its own set of lattices, limiting the number of tokens at each node to 64:

#!/bin/sh
#SBATCH --gres=gpu:1
#SBATCH --array=0-49

srun --gres=gpu:1 theanolm decode model.h5 \
    --lattice-list lattices.txt \
    --output-file "${SLURM_ARRAY_TASK_ID}.ref" --output ref \
    --nnlm-weight 0.5 --lm-scale 14.0 \
    --max-tokens-per-node 64 --beam 500 --recombination-order 20 \
    --num-jobs 50 --job "${SLURM_ARRAY_TASK_ID}"

When the vocabulary of the neural network model is limited, but the vocabulary used to create the lattices is larger, the decoder needs to consider how to score the out-of-vocabulary words. The frequency of the OOV words in the training data may easily be so high that the model favors paths that contain many OOV words. It may be better to penalize OOV words by manually setting their log probability using the --unk-penalty argument. It is also possible to distribute the <unk> token probability to out-of-shortlist words using the --shortlist argument, in the same way as with the theanolm score command. However, the lattice decoder needs to assign some probability to words that did not exist in the training data, so you may want to combine these two arguments.

By setting --unk-penalty=-inf, paths that contain OOV words will get zero probability. The effect of interpolation weight can be confusing if either the lattice or the neural network model assigns -inf log probability to some word. The result of interpolation will be -inf regardless of the weight, as long as the weight of -inf is greater than zero. If -inf is weighted by zero, it will be ignored and the other probability will be used.
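
A sketch of combining the two arguments (the penalty value is only an illustrative assumption):

theanolm decode model.h5 \
    --lattice-list lattices.txt --lattice-format slf \
    --output-file 1best.ref --output ref \
    --nnlm-weight 0.5 --lm-scale 14.0 \
    --shortlist --unk-penalty=-5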

Rescoring word lattices

The theanolm decode command can also be used for rescoring and pruning word lattices. Simply select either SLF or Kaldi output using --output slf or --output kaldi. This is preferable to decoding only the best path if lattice information is needed in further steps. The pruning options are identical.

The CompactLattice format of Kaldi is actually a weighted FST. Each arc is associated with an acoustic cost and what is called a graph cost. The graph cost incorporates other things besides the language model probability, including pronunciation, transition, and silence probabilities. In order to compute the effect of those other factors, we can subtract the original LM scores from the graph scores.

Assuming that we want to replace old LM scores with those provided by TheanoLM without interpolation, it is possible to include the rest of the graph score by subtracting the old LM scores, interpolating with weight 0.5, and multiplying the LM scale by 2. Below is an example that does this, using standard Kaldi conventions for submitting a batch job:

${cmd} "JOB=1:${nj}" "${out_dir}/log/lmrescore_theanolm.JOB.log" \
  gunzip -c "${in_dir}/lat.JOB.gz" \| \
  lattice-lmrescore-const-arpa \
    --lm-scale=-1.0 \
    ark:- "${old_lm}" ark,t:- \| \
  theanolm decode ${nnlm} \
    --lattice-format kaldi \
    --kaldi-vocabulary "${lang_dir}/words.txt" \
    --output kaldi \
    --nnlm-weight 0.5 \
    --lm-scale $(perl -e "print 2 * ${lm_scale}") \
    --max-tokens-per-node "${max_tokens_per_node}" \
    --beam "${beam}" \
    --recombination-order "${recombination_order}" \
    "${theanolm_args[@]}" \
    --log-file "${out_dir}/log/theanolm_decode.JOB.log" \
    --log-level debug \| \
  lattice-minimize ark:- ark:- \| \
  gzip -c \>"${out_dir}/lat.JOB.gz"

The downside is that another command is needed for interpolating with the original (n-gram) language model scores. There are two example scripts for Kaldi in the TheanoLM repository. lmrescore_theanolm.sh creates rescored lattices without interpolation. lmrescore_theanolm_nbest.sh creates n-best lists, interpolating the lattice and NNLM probabilities. These can be used in the same manner as the other lattice rescoring steps in the Kaldi recipes, for example:

steps/lmrescore_theanolm.sh \
  --prune-beam 8 \
  --lm-scale 8.0 \
  --beam 600 \
  --recombination-order 20 \
  --max-tokens-per-node 120 \
  --cmd "utils/slurm.pl --mem 20G" \
  data/lang \
  nnlm.h5 \
  model/dev-decode \
  model/dev-rescore
local/score.sh \
  --cmd utils/slurm.pl \
  --min-lmwt 4 \
  data/dev \
  data/lang \
  model/dev-rescore

Generating text

A neural network language model can also be used to generate text, using the theanolm sample command:

theanolm sample model.h5 --num-sentences 10

Calling TheanoLM from Python

Scoring an utterance

You can also call TheanoLM from a Python script to score utterances. Assuming you have trained a neural network and saved it in model.h5, first load the model using Network.from_file:

from theanolm import Network, TextScorer
model = Network.from_file('model.h5')

Then create a text scorer. The constructor takes optional arguments concerning unknown word handling. You might want to ignore unknown words. In that case, use:

scorer = TextScorer(model, ignore_unk=True)

Now you can score the text string utterance using:

score = scorer.score_line(utterance, model.vocabulary)

Start and end of sentence tags (<s> and </s>) will be automatically inserted at the beginning and end of the utterance, if they’re missing. If the utterance is empty, None will be returned. Otherwise the returned value is the log probability of the utterance.

Development

Contributing

You’re welcome to contribute.

  1. Fork the repository on GitHub.
  2. Clone the forked repository into a local directory: git clone my-repository-url
  3. Create a new branch: git checkout -b my-new-feature
  4. Commit your changes: git commit -a
  5. Push to the branch: git push origin my-new-feature
  6. Submit a pull request on GitHub.

Source code packages

theanolm.commands package contains the main scripts for launching the subcommands.

theanolm.network package contains Network class, which constructs the network from layer objects and stores the neural network state (parameters). Each layer type is implemented in its own class that derives from BasicLayer. These classes specify the layer parameters and the mathematical structure using symbolic variables.

theanolm.parsing package contains classes for iterating text and converting it to mini-batches.

theanolm.training package contains Trainer class, which performs the training iterations. It is responsible for cross-validation and learning rate adjustment. It uses one of the optimization classes derived from BasicOptimizer to compute the gradients and adjust the network parameters.

theanolm.scoring package contains the TextScorer class for scoring sentences and LatticeDecoder class for decoding word lattices. TextScorer is used both for cross-validation during training and by the score command for evaluating text.

theanolm.textsampler.TextSampler class is used by the sample command for generating text.

Neural network structure

A Network object contains tensors input_word_ids, input_class_ids, and mask that represent the mini-batch input of the network, i.e. a set of n word sequences, where n is the batch size. These symbolic variables represent two-dimensional matrices. The first dimension is the time step, i.e. the index of a word inside a sequence, and the second dimension is the sequence. The mask indicates which elements are past the sequence end; the output will be ignored if the corresponding mask value is zero. Theano functions that utilize the network have these tensors as inputs. Their values will be read from a text file by a BatchIterator.

Layers receive a list of input layers in the constructor. The constructor creates the initial values of the layer parameters. Every layer implements the create_structure() method, which describes its output given its parameters and the output of its input layers.

The Network constructs the layer objects. The first layer object is a NetworkInput, which is not a real layer, but just provides either the word ID or class ID matrix as its output. The first layer following a NetworkInput should be a ProjectionLayer, which maps the integer word IDs to floating point vectors. Thus the projection layer and all the subsequent layers output a three-dimensional tensor, where the third dimension is the activation vector.

[Figure: batch processing]

API Documentation

theanolm package

Subpackages
theanolm.backend package
Submodules
theanolm.backend.classdistribution module
theanolm.backend.debugfunctions module
theanolm.backend.exceptions module
theanolm.backend.filetypes module
theanolm.backend.gpu module
theanolm.backend.matrixfunctions module
theanolm.backend.operations module
theanolm.backend.parameters module
theanolm.backend.probfunctions module
Module contents
theanolm.commands package
Submodules
theanolm.commands.decode module
theanolm.commands.sample module
theanolm.commands.score module
theanolm.commands.train module
theanolm.commands.version module
Module contents
theanolm.network package
Submodules
theanolm.network.additionlayer module
theanolm.network.architecture module
theanolm.network.basiclayer module
theanolm.network.bidirectionallayer module
theanolm.network.dropoutlayer module
theanolm.network.fullyconnectedlayer module
theanolm.network.glulayer module
theanolm.network.grulayer module
theanolm.network.highwaylayer module
theanolm.network.hsoftmaxlayer module
theanolm.network.lstmlayer module
theanolm.network.network module
theanolm.network.networkinput module
theanolm.network.projectionlayer module
theanolm.network.recurrentstate module
theanolm.network.samplingoutputlayer module
theanolm.network.softmaxlayer module
theanolm.network.weightfunctions module
Module contents
theanolm.parsing package
Submodules
theanolm.parsing.batchiterator module
theanolm.parsing.functions module
theanolm.parsing.linearbatchiterator module
theanolm.parsing.scoringbatchiterator module
theanolm.parsing.shufflingbatchiterator module
Module contents
theanolm.scoring package
Submodules
theanolm.scoring.kaldilattice module
theanolm.scoring.lattice module
theanolm.scoring.latticebatch module
theanolm.scoring.latticedecoder module
theanolm.scoring.rescoredlattice module
theanolm.scoring.slflattice module
theanolm.scoring.textscorer module
Module contents
theanolm.training package
Submodules
theanolm.training.adadeltaoptimizer module
theanolm.training.adagradoptimizer module
theanolm.training.adamoptimizer module
theanolm.training.basicoptimizer module
theanolm.training.cost module
theanolm.training.nesterovoptimizer module
theanolm.training.rmspropnesterovoptimizer module
theanolm.training.rmspropsgdoptimizer module
theanolm.training.sgdoptimizer module
theanolm.training.stoppers module
theanolm.training.trainer module
Module contents
theanolm.vocabulary package
Submodules
theanolm.vocabulary.statistics module
theanolm.vocabulary.vocabulary module
theanolm.vocabulary.wordclass module
Module contents
Submodules
theanolm.textsampler module
theanolm.version module
Module contents

wordclasses package

Submodules
wordclasses.bigramoptimizer module
wordclasses.functions module
wordclasses.numpybigramoptimizer module
wordclasses.theanobigramoptimizer module
wordclasses.wctool module
Module contents
