RNNLIB: Connectionist Temporal Classification and Transcription Layer

CTC is the core concept that makes it possible to transcribe unsegmented sequence data. RNNLIB implements it in a single layer called the Transcription Layer. We go into this particular layer in this post; the main reference is Graves' original paper.

The key point of CTC is to use a simple map that transforms the RNN outputs into an unsegmented labelling, and to construct a new objective function based on that map. The map does not need a precise alignment, which greatly simplifies the task and reduces the involvement of human experts.

The Name

“Connectionist” is the adjective form of “connectionism”. Connectionism is a term from cognitive science for modelling mental or behavioural phenomena as the emergent processes of interconnected networks of simple units. The most common forms use neural network models.

In the traditional neural network recipe, we model the input sequence independently at each time-step or frame. This can be referred to as framewise classification. Kadous extends the classification paradigm to multivariate time series and names it temporal classification. Mathematically, framewise classification models the distribution over output sequences of the same length as the input sequence, whereas temporal classification models the distribution over output sequences of all lengths. With this, we do not have to label every time step in the training data set.

Combining RNNs and temporal classification, Graves proposed connectionist temporal classification.

To distinguish it from classification, RNNLIB implements CTC as a Transcription Layer, indicating that with CTC we can directly transcribe an input sequence (e.g. an acoustic signal) into an output sequence (e.g. words).

The Theory

List of Symbols

Following the notations in the paper, we first list the symbols.

| Symbol | Meaning |
| --- | --- |
| $L$ | (finite) alphabet of labels |
| $L'$ | $L \cup \{blank\}$ |
| $\mathcal{X}$ | $(\mathbb{R}^m)^{*}$, the $m$-dimensional input space |
| $\mathcal{Z}$ | $L^{*}$, the output space, i.e. the set of all sequences over $L$ |
| $\mathcal{D_{X \times Z}}$ | underlying distribution of the data |
| $S$ | set of training examples, supposed to be drawn from $\mathcal{D_{X \times Z}}$ |
| $(\mathbf{x},\mathbf{z})$ | example in $S$, $\mathbf{x} = (x_1, x_2, \dotsc, x_T)$, $\mathbf{z} = (z_1, z_2, \dotsc, z_U)$ and $U \leq T$ |
| $h:\mathcal{X} \mapsto \mathcal{Z}$ | temporal classifier to be trained |
| $\mathcal{N}_{w}:(\mathbb{R}^{m})^{T} \mapsto (\mathbb{R}^n)^{T}$ | RNN with $m$ inputs, $n$ outputs and weight vector $w$, viewed as a continuous map |
| $\mathbf{y} = \mathcal{N}_{w}(\mathbf{x})$ | sequence of RNN outputs |
| $y_{k}^{t}$ | activation of output unit $k$ at time $t$ |
| $\pi$ | path, element of $L'^{T}$ |
| $\mathbf{l} \in L^{\leq T}$ | label sequence, or labelling |
| $\mathcal{B}:L'^{T} \mapsto L^{\leq T}$ | map from paths to labellings |
| $\mathbf{l}_{a\mathord{:}b}$ | sub-sequence of $\mathbf{l}$ from the $a$th to the $b$th label |
| $\mathbf{l}'$ | modified label sequence, with blanks added to the beginning and the end and inserted between every pair of labels in $\mathbf{l}$ |
| $\alpha_t(s)$ | forward variable, the total probability of $\mathbf{l}'_{1:s}$ at time $t$ |
| $\beta_t(s)$ | backward variable, the total probability of $\mathbf{l}'_{s:\vert\mathbf{l}'\vert}$ from time $t$ |
| $\tilde{\beta}_t(s)$ | backward variable, the total probability of $\mathbf{l}'_{s:\vert\mathbf{l}'\vert}$ starting at time $t+1$ |
| $O^{ML}(S,\mathcal{N}_{w})$ | maximum likelihood objective function |
| $\delta_{kk'}$ | Kronecker delta |

Training Procedure

The goal is to use $S$ to train a temporal classifier $h$ to classify previously unseen input sequences in a way that minimises the ML objective function:

$$O^{ML}(S,\mathcal{N}_{w}) = -\sum_{(\mathbf{x},\mathbf{z}) \in S}{\ln p(\mathbf{z}|\mathbf{x})} \label{eq:obj_ml}$$

To train the network with gradient descent, we need to differentiate \eqref{eq:obj_ml} with respect to the network outputs. Since the training examples are independent we can consider them separately:

$$\frac{\partial O^{ML}(\{(\mathbf{x},\mathbf{z})\},\mathcal{N}_{w})}{\partial y_{k}^{t}} = -\frac{\partial \ln p(\mathbf{z}|\mathbf{x})}{\partial y_{k}^{t}} \label{eq:obj}$$

Another thing we have to consider is how to map from network outputs to labellings. Use $\mathcal{B}$ to denote such a map. Given a path, we simply remove all blanks and repeated labels, and the remaining labels form a labelling (e.g. $\mathcal{B}(a-ab-)=\mathcal{B}(-aa--abb)=aab$).
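
For concreteness, here is a minimal Python sketch of the map $\mathcal{B}$ (the function name and the use of `-` as the blank symbol are illustrative choices, not RNNLIB's):

```python
def ctc_collapse(path, blank='-'):
    """The map B: merge repeated labels, then remove blanks."""
    out, prev = [], None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return ''.join(out)

assert ctc_collapse('a-ab-') == ctc_collapse('-aa--abb') == 'aab'
```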

Then we can define the conditional probability of a labelling,

$$p(\mathbf{l}|\mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l})}{p(\pi|\mathbf{x})}$$

where $p(\pi|\mathbf{x})$ is the conditional probability of a path given $\mathbf{x}$, defined as:

$$p(\pi|\mathbf{x}) = \prod_{t=1}^{T}{y_{\pi_t}^{t}}, \qquad \forall \pi \in L'^{T} \label{eq:path}$$

To allow for blanks in the output paths, we first extend $\mathbf{l}$ to $\mathbf{l}'$ with blanks added to the beginning and the end and inserted between every pair of labels. The length of $\mathbf{l}'$ is therefore $2|\mathbf{l}| + 1$. In calculating the probabilities of prefixes of $\mathbf{l}'$ we allow all transitions between blank and non-blank labels, and also those between any pair of distinct non-blank labels (because of the map $\mathcal{B}$, the repeated labels will be merged).
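
A similarly minimal sketch of the construction of $\mathbf{l}'$ (again the helper name and blank symbol are just illustrative):

```python
def with_blanks(labelling, blank='-'):
    """Build l' by putting a blank before, after and between all labels."""
    l_prime = [blank]
    for label in labelling:
        l_prime += [label, blank]
    return l_prime  # length 2 * len(labelling) + 1

assert with_blanks('cat') == ['-', 'c', '-', 'a', '-', 't', '-']
```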

Then to calculate \eqref{eq:obj}, we can define the forward and backward variables,

$$\alpha_t(s) \stackrel{\text{def}}{=} \sum_{\substack{\pi \in L'^{T}: \\ \mathcal{B}(\pi_{1:t}) = \mathbf{l}_{1:s/2}}}{\prod_{t'=1}^{t}{y_{\pi_{t'}}^{t'}}}$$

$$\beta_t(s) \stackrel{\text{def}}{=} \sum_{\substack{\pi \in L'^{T}: \\ \mathcal{B}(\pi_{t:T}) = \mathbf{l}_{s/2:|\mathbf{l}|}}}{\prod_{t'=t}^{T}{y_{\pi_{t'}}^{t'}}}$$

where $\frac{s}{2}$ is rounded down to an integer value. Note that the product of the forward and backward variables at a given $s$ and $t$ is the probability of all the paths corresponding to $\mathbf{l}$ that go through the symbol $\mathbf{l}'_{s}$ at time $t$, i.e.,

$$\alpha_t(s)\beta_t(s) = \sum_{\substack{\pi \in \mathcal{B}^{-1}(\mathbf{l}): \\ \pi_t = \mathbf{l}'_s}}{y_{\mathbf{l}'_s}^{t}\prod_{t=1}^{T}{y_{\pi_t}^{t}}}$$

Rearranging and substituting in from \eqref{eq:path} gives,

$$\frac{\alpha_t(s)\beta_t(s)}{y_{\mathbf{l}'_s}^{t}} = \sum_{\substack{\pi \in \mathcal{B}^{-1}(\mathbf{l}): \\ \pi_t = \mathbf{l}'_s}}{p(\pi|\mathbf{x})} \label{eq:fwd_bwd}$$

For any $t$, we can therefore sum over all $s$ to get $p(\mathbf{l} | \mathbf{x})$:

$$p(\mathbf{l}|\mathbf{x}) = \sum_{s=1}^{|\mathbf{l}'|}{\frac{\alpha_t(s)\beta_t(s)}{y_{\mathbf{l}'_s}^{t}}} \label{eq:labelling_fwd_bwd}$$

Thus to differentiate this with respect to $y_k^t$, we need only consider those paths going through label $k$ at time $t$ (the derivatives of the other paths are zero). Noting that the same label (or blank) may be repeated several times for a single labelling $\mathbf{l}$, we define the set of positions where label $k$ occurs as $lab(\mathbf{l},k) = \{s : \mathbf{l}'_s = k \}$, which may be empty.

Differentiating \eqref{eq:fwd_bwd}, we get,

$$\frac{\partial}{\partial y_{k}^{t}}\left(\frac{\alpha_t(s)\beta_t(s)}{y_{\mathbf{l}'_s}^{t}}\right) = \begin{cases} \dfrac{\alpha_t(s)\beta_t(s)}{{y_{k}^{t}}^{2}} & \text{if } k = \mathbf{l}'_s \\ 0 & \text{otherwise} \end{cases}$$

Therefore, using the notation $lab(\mathbf{l}, k)$,

$$\frac{\partial p(\mathbf{l}|\mathbf{x})}{\partial y_{k}^{t}} = \frac{1}{{y_{k}^{t}}^{2}}\sum_{s \in lab(\mathbf{l},k)}{\alpha_t(s)\beta_t(s)}$$

At this point, we can set $\mathbf{l} = \mathbf{z}$ and substitute into \eqref{eq:obj} to get the final gradient,

$$\frac{\partial O^{ML}(\{(\mathbf{x},\mathbf{z})\},\mathcal{N}_{w})}{\partial y_{k}^{t}} = -\frac{1}{p(\mathbf{z}|\mathbf{x})\,{y_{k}^{t}}^{2}}\sum_{s \in lab(\mathbf{z},k)}{\alpha_t(s)\beta_t(s)} \label{eq:grad}$$

where $p(\mathbf{z}|\mathbf{x})$ can be calculated from \eqref{eq:labelling_fwd_bwd}.

Next, we can give the gradient for the unnormalised outputs $u_k^t$. Recall that the derivative of the softmax function is,

$$\frac{\partial y_{k'}^{t}}{\partial u_{k}^{t}} = y_{k'}^{t}\,\delta_{kk'} - y_{k'}^{t}y_{k}^{t}$$

Then we get,

$$\begin{aligned} \frac{\partial O^{ML}(\{(\mathbf{x},\mathbf{z})\},\mathcal{N}_{w})}{\partial u_{k}^{t}} &= \sum_{k'}{\frac{\partial O^{ML}(\{(\mathbf{x},\mathbf{z})\},\mathcal{N}_{w})}{\partial y_{k'}^{t}}\frac{\partial y_{k'}^{t}}{\partial u_{k}^{t}}} \\ &= -\sum_{k'}{\frac{y_{k'}^{t}\,\delta_{kk'} - y_{k'}^{t}y_{k}^{t}}{p(\mathbf{z}|\mathbf{x})\,{y_{k'}^{t}}^{2}}\sum_{s \in lab(\mathbf{z},k')}{\alpha_t(s)\beta_t(s)}} \\ &= y_{k}^{t} - \frac{1}{y_{k}^{t}\,p(\mathbf{z}|\mathbf{x})}\sum_{s \in lab(\mathbf{z},k)}{\alpha_t(s)\beta_t(s)} \end{aligned}$$

We write the last step by noting that $\sum_{k'}\sum_{s \in lab(\mathbf{z}, k')}{(\cdot)} \equiv \sum_{s=1}^{|\mathbf{l}'|}{(\cdot)}$, so that, using \eqref{eq:labelling_fwd_bwd}, the $p(\mathbf{z}|\mathbf{x})$ cancels out.
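
As a sanity check on this formula, here is a hedged Python sketch of the resulting error signal; it assumes the forward and backward variables have already been computed, and the function and argument names are ours, not RNNLIB's:

```python
import numpy as np

def ctc_output_errors(y, alpha, beta, l_prime, p_z_given_x):
    """dO/du_k^t for the unnormalised (pre-softmax) outputs: y_k^t minus the
    alpha*beta mass of every position s in l' that carries label k, rescaled
    by y_k^t * p(z|x).  y is (T, K); alpha and beta are (T, len(l_prime))."""
    errors = y.copy()                      # the y_k^t term
    for s, k in enumerate(l_prime):        # accumulate over s in lab(z, k)
        errors[:, k] -= alpha[:, s] * beta[:, s] / (y[:, k] * p_z_given_x)
    return errors
```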

The CTC Forward-Backward Algorithm

The last thing we have to do is to calculate the forward and backward variables. We now show that, by defining a recursive form, these variables can be computed efficiently.

We allow all prefixes to start with either a blank ($b$) or the first symbol in $\mathbf{l}$ ($\mathbf{l}_1$).

This gives us the following rules for initialisation

$$\alpha_1(1) = y_{b}^{1}, \qquad \alpha_1(2) = y_{\mathbf{l}_1}^{1}, \qquad \alpha_1(s) = 0,\ \forall s > 2$$

and recursion

$$\alpha_t(s) = \begin{cases} \bar{\alpha}_t(s)\,y_{\mathbf{l}'_s}^{t} & \text{if } \mathbf{l}'_s = b \text{ or } \mathbf{l}'_{s-2} = \mathbf{l}'_s \\ \left(\bar{\alpha}_t(s) + \alpha_{t-1}(s-2)\right)y_{\mathbf{l}'_s}^{t} & \text{otherwise} \end{cases} \label{eq:alpha}$$

$$\text{where}\quad \bar{\alpha}_t(s) \stackrel{\text{def}}{=} \alpha_{t-1}(s) + \alpha_{t-1}(s-1)$$

Note that $\alpha_t(s) = 0,\ \forall s < |\mathbf{l}'|-2(T-t)-1$, because these variables correspond to states for which there are not enough time-steps left to complete the sequence.

Here we get another way to calculate $p(\mathbf{l} | \mathbf{x})$: it is the sum of the forward variables at time $T$ for the last two positions of $\mathbf{l}'$, i.e.,

$$p(\mathbf{l}|\mathbf{x}) = \alpha_T(|\mathbf{l}'|) + \alpha_T(|\mathbf{l}'|-1) \label{eq:porb}$$

Similarly, the backward variables can be initialised as,

$$\beta_T(|\mathbf{l}'|) = y_{b}^{T}, \qquad \beta_T(|\mathbf{l}'|-1) = y_{\mathbf{l}_{|\mathbf{l}|}}^{T}, \qquad \beta_T(s) = 0,\ \forall s < |\mathbf{l}'|-1$$

and recursion

$$\beta_t(s) = \begin{cases} \bar{\beta}_t(s)\,y_{\mathbf{l}'_s}^{t} & \text{if } \mathbf{l}'_s = b \text{ or } \mathbf{l}'_{s+2} = \mathbf{l}'_s \\ \left(\bar{\beta}_t(s) + \beta_{t+1}(s+2)\right)y_{\mathbf{l}'_s}^{t} & \text{otherwise} \end{cases} \label{eq:beta}$$

$$\text{where}\quad \bar{\beta}_t(s) \stackrel{\text{def}}{=} \beta_{t+1}(s) + \beta_{t+1}(s+1)$$

Note that $\beta_t(s) = 0, \forall s > 2t$.
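
Putting the two recursions together, here is a rough Python sketch of the forward-backward computation. It is a direct transcription of the equations above with illustrative names; in practice the variables must be rescaled or kept in log space to avoid underflow, which this sketch ignores:

```python
import numpy as np

def ctc_forward_backward(y, z, blank=0):
    """Forward and backward variables for the target labelling z (a list of
    label indices) given softmax outputs y of shape (T, |L'|)."""
    T = y.shape[0]
    l_prime = [blank]                          # build l': blanks around every label
    for k in z:
        l_prime += [k, blank]
    S = len(l_prime)                           # 2|z| + 1

    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, blank]                  # initialisation
    if S > 1:
        alpha[0, 1] = y[0, l_prime[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            if s > 1 and l_prime[s] != blank and l_prime[s] != l_prime[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * y[t, l_prime[s]]

    beta = np.zeros((T, S))
    beta[T - 1, S - 1] = y[T - 1, blank]       # initialisation
    if S > 1:
        beta[T - 1, S - 2] = y[T - 1, l_prime[S - 2]]
    for t in range(T - 2, -1, -1):
        for s in range(S):
            b = beta[t + 1, s]
            if s < S - 1:
                b += beta[t + 1, s + 1]
            if s < S - 2 and l_prime[s] != blank and l_prime[s] != l_prime[s + 2]:
                b += beta[t + 1, s + 2]
            beta[t, s] = b * y[t, l_prime[s]]

    p_z_given_x = alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)
    return alpha, beta, l_prime, p_z_given_x
```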

The following figure illustrates the forward-backward algorithm applied to the labelling 'CAT' (from the paper).

fig:alpha-beta

The Implementation

The TranscriptionLayer class inherits from the SoftmaxLayer class (see this post). The feed_forward() and feed_back() methods implement the general softmax function, so the layer only needs to implement the calculate_errors() method, which computes $\frac{\partial O}{\partial y_k^t}$. To use \eqref{eq:grad} to get the output errors, we first need to calculate the $\alpha$s and $\beta$s. The forward variables are obtained using \eqref{eq:alpha}.

But the backward variables are implemented in another form, given in Graves' dissertation. Consider the backward variable whose products start at time $t+1$,

$$\tilde{\beta}_t(s) \stackrel{\text{def}}{=} \sum_{\substack{\pi \in L'^{T}: \\ \mathcal{B}(\pi_{t:T}) = \mathbf{l}_{s/2:|\mathbf{l}|}}}{\prod_{t'=t+1}^{T}{y_{\pi_{t'}}^{t'}}}$$

Note that $\beta$ and $\tilde{\beta}$ have a simple relationship:

$$\beta_t(s) = \tilde{\beta}_t(s)\,y_{\mathbf{l}'_s}^{t} \label{eq:bwd_relaion}$$

Thus, we can get the recursion formula for $\tilde{\beta}$ by substituting \eqref{eq:bwd_relaion} into \eqref{eq:beta},

$$\tilde{\beta}_t(s) = \begin{cases} \tilde{\beta}_{t+1}(s)\,y_{\mathbf{l}'_s}^{t+1} + \tilde{\beta}_{t+1}(s+1)\,y_{\mathbf{l}'_{s+1}}^{t+1} & \text{if } \mathbf{l}'_s = b \text{ or } \mathbf{l}'_{s+2} = \mathbf{l}'_s \\ \tilde{\beta}_{t+1}(s)\,y_{\mathbf{l}'_s}^{t+1} + \tilde{\beta}_{t+1}(s+1)\,y_{\mathbf{l}'_{s+1}}^{t+1} + \tilde{\beta}_{t+1}(s+2)\,y_{\mathbf{l}'_{s+2}}^{t+1} & \text{otherwise} \end{cases}$$

Note that if $\mathbf{l}'_s \neq blank$, then $\mathbf{l}'_{s+1}$ must be $blank$.

And the output gradient \eqref{eq:grad} becomes,

$$\frac{\partial O^{ML}(\{(\mathbf{x},\mathbf{z})\},\mathcal{N}_{w})}{\partial y_{k}^{t}} = -\frac{1}{p(\mathbf{z}|\mathbf{x})\,y_{k}^{t}}\sum_{s \in lab(\mathbf{z},k)}{\alpha_t(s)\tilde{\beta}_t(s)}$$

where

$$p(\mathbf{z}|\mathbf{x}) = \sum_{s=1}^{|\mathbf{l}'|}{\alpha_t(s)\tilde{\beta}_t(s)}$$

Actually, the RNNLIB code computes $p(\mathbf{z}|\mathbf{x})$ using \eqref{eq:porb}.

To wrap up: CTC uses a forward-backward algorithm to efficiently compute the RNN output errors corresponding to a new ML objective function. With these errors, we can use any traditional gradient-based method to train the network.

Decoding

Once the network is trained, we want to use it to transcribe previously unseen input sequences $\mathbf{x}$. Decoding refers to the task of finding the most probable labelling $\mathbf{l}^*$,

$$\mathbf{l}^* = \mathop{\arg\!\max}\limits_{\mathbf{l}}{\,p(\mathbf{l}|\mathbf{x})}$$

There are two approximate algorithms.

Best Path Decoding

This method assumes that the most probable path corresponds to the most probable labelling,

$$\mathbf{l}^* \approx \mathcal{B}(\pi^*)$$

where $\pi^* = \mathop{\arg\!\max}\limits_{\pi}{\,p(\pi|\mathbf{x})}$.

This is trivial to compute: we simply concatenate the most active outputs at every time step. But it can lead to errors, because the map $\mathcal{B}$ is many-to-one.
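
A minimal sketch of best path decoding (assuming the blank is output unit 0 and the labels are the remaining column indices):

```python
import numpy as np

def best_path_decode(y, blank=0):
    """Best path decoding: the most active output at every time step,
    followed by the map B (merge repeats, then drop blanks)."""
    path = np.argmax(y, axis=1)            # y: (T, |L'|) softmax outputs
    out, prev = [], None
    for k in path:
        if k != prev and k != blank:
            out.append(int(k))
        prev = k
    return out
```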

Prefix Search Decoding

By modifying the forward variables, this method can efficiently calculate the probabilities of successive extensions of labelling prefixes.

Prefix search decoding is a best-first search through the tree of labellings, where the children of a given labelling are those that share it as a prefix. At each step the search extends the labelling whose children have the largest cumulative probability (see the figure below).

fig:prefix-search-decoding

Each node either ends ($e$) or extends the prefix at its parent node. The number above an extending node is the total probability of all labellings beginning with that prefix. The number above an end node is the probability of the single labelling ending at its parent. At every iteration the extensions of the most probable remaining prefix are explored. Search ends when a single labelling (here $XY$) is more probable than any remaining prefix.

To extend the tree, we need to compute the probabilities of extended prefixes, which can be done recursively. Let $\gamma_t(\mathbf{p}_n)$ be the probability of the network outputting prefix $\mathbf{p}$ by time $t$ such that a non-blank label is output at $t$. Similarly, let $\gamma_t(\mathbf{p}_b)$ be the probability of the network outputting prefix $\mathbf{p}$ by time $t$ such that the blank label is output at $t$, i.e.

$$\gamma_t(\mathbf{p}_n) \stackrel{\text{def}}{=} p\left(\pi_{1:t} : \mathcal{B}(\pi_{1:t}) = \mathbf{p},\, \pi_t = \mathbf{p}_{|\mathbf{p}|} \,\middle|\, \mathbf{x}\right), \qquad \gamma_t(\mathbf{p}_b) \stackrel{\text{def}}{=} p\left(\pi_{1:t} : \mathcal{B}(\pi_{1:t}) = \mathbf{p},\, \pi_t = blank \,\middle|\, \mathbf{x}\right)$$

Then for a length $T$ input sequence $\mathbf{x}$, $p(\mathbf{p} | \mathbf{x}) = \gamma_T(\mathbf{p}_n) + \gamma_T(\mathbf{p}_b)$. Also let $p(\mathbf{p}\dots | \mathbf{x})$ be the cumulative probability of all labellings not equal to $\mathbf{p}$ of which $\mathbf{p}$ is a prefix,

$$p(\mathbf{p}\dots|\mathbf{x}) \stackrel{\text{def}}{=} \sum_{\mathbf{l} \neq \emptyset}{p(\mathbf{p}+\mathbf{l}|\mathbf{x})}$$

where $\emptyset$ is the empty sequence. $p(\mathbf{p} \dotsc \mid \mathbf{x})$ is the value shown above an extending node in the prefix tree, and $p(\mathbf{p} \mid \mathbf{x})$ is the value shown above an end node.

In fact, by definition, the relation between $\gamma$ and $\alpha$ is,

$$\gamma_t(\mathbf{p}_n) = \alpha_t(2|\mathbf{p}|), \qquad \gamma_t(\mathbf{p}_b) = \alpha_t(2|\mathbf{p}|+1)$$

where the forward variables are computed for the labelling $\mathbf{l} = \mathbf{p}$.

Using \eqref{eq:alpha}, we get the recursion for $\gamma_t(\mathbf{p}_n)$ given $\gamma_{t-1}(\mathbf{p}_n)$, extending $\mathbf{p}^*$ to $\mathbf{p} = \mathbf{p}^* + k$ with label $k \in L$,

$$\gamma_t(\mathbf{p}_n) = y_{k}^{t}\left(\gamma_{t-1}(\mathbf{p}_n) + \text{newLabelProb}(\mathbf{p},t)\right), \qquad \gamma_t(\mathbf{p}_b) = y_{b}^{t}\left(\gamma_{t-1}(\mathbf{p}_b) + \gamma_{t-1}(\mathbf{p}_n)\right)$$

$$\text{where}\quad \text{newLabelProb}(\mathbf{p},t) = \begin{cases} \gamma_{t-1}(\mathbf{p}^*_b) & \text{if } \mathbf{p}^* \text{ ends in } k \\ \gamma_{t-1}(\mathbf{p}^*_b) + \gamma_{t-1}(\mathbf{p}^*_n) & \text{otherwise} \end{cases}$$

And the path probabilities are calculated as,

$$p(\mathbf{p}|\mathbf{x}) = \gamma_T(\mathbf{p}_n) + \gamma_T(\mathbf{p}_b), \qquad p(\mathbf{p}\dots|\mathbf{x}) = \gamma_1(\mathbf{p}_n) + \sum_{t=2}^{T}{y_{k}^{t}\,\text{newLabelProb}(\mathbf{p},t)} - p(\mathbf{p}|\mathbf{x})$$

The extension procedure starts from $\mathbf{p}^* = \emptyset$, with initialisation,

$$\gamma_t(\emptyset_n) = 0, \qquad \gamma_t(\emptyset_b) = \prod_{t'=1}^{t}{y_{b}^{t'}}, \qquad p(\emptyset|\mathbf{x}) = \gamma_T(\emptyset_b), \qquad p(\emptyset\dots|\mathbf{x}) = 1 - p(\emptyset|\mathbf{x})$$

and, for each extension $\mathbf{p} = \mathbf{p}^* + k$,

$$\gamma_1(\mathbf{p}_n) = \begin{cases} y_{k}^{1} & \text{if } \mathbf{p}^* = \emptyset \\ 0 & \text{otherwise} \end{cases}, \qquad \gamma_1(\mathbf{p}_b) = 0$$

and iterates until $\max_{\mathbf{p}}\,p(\mathbf{p} \dotsc \mid \mathbf{x}) < \max_{\mathbf{p}'}\,p(\mathbf{p}' \mid \mathbf{x})$.
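
Below is a rough Python sketch of the whole prefix search procedure under the same assumptions as before (labels are column indices of the softmax output, column 0 is the blank). It follows the recursions above but omits the probability-remaining pruning of Graves' pseudocode, and it is not how RNNLIB organises its code:

```python
import numpy as np

def prefix_search_decode(y, blank=0):
    """Prefix search decoding over softmax outputs y of shape (T, |L'|).
    Prefixes are tuples of label indices; gamma_n/gamma_b hold the
    gamma_t(p_n)/gamma_t(p_b) arrays for every expanded prefix."""
    T, K = y.shape
    labels = [k for k in range(K) if k != blank]

    gamma_n = {(): np.zeros(T)}                    # gamma_t(empty_n) = 0
    gamma_b = {(): np.cumprod(y[:, blank])}        # gamma_t(empty_b)
    prob_end = {(): gamma_b[()][T - 1]}            # p(p | x)
    prob_ext = {(): 1.0 - prob_end[()]}            # p(p... | x)

    best_label = best_prefix = ()
    while prob_ext[best_prefix] > prob_end[best_label]:
        p_star = best_prefix
        for k in labels:
            p = p_star + (k,)
            gn, gb = np.zeros(T), np.zeros(T)
            gn[0] = y[0, k] if p_star == () else 0.0
            prefix_prob = gn[0]
            for t in range(1, T):
                new_label_prob = gamma_b[p_star][t - 1]
                if not p_star or p_star[-1] != k:
                    new_label_prob += gamma_n[p_star][t - 1]
                gn[t] = y[t, k] * (new_label_prob + gn[t - 1])
                gb[t] = y[t, blank] * (gb[t - 1] + gn[t - 1])
                prefix_prob += y[t, k] * new_label_prob
            gamma_n[p], gamma_b[p] = gn, gb
            prob_end[p] = gn[T - 1] + gb[T - 1]
            prob_ext[p] = prefix_prob - prob_end[p]
            if prob_end[p] > prob_end[best_label]:
                best_label = p
        prob_ext[p_star] = 0.0                     # p* is now fully expanded
        best_prefix = max(prob_ext, key=prob_ext.get)
    return list(best_label)
```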

Given enough time, prefix search decoding always finds the most probable labelling. However, the maximum number of prefixes it must expand grows exponentially with the input sequence length, so we need a further heuristic.

Observing that the outputs of a trained CTC network tend to form a series of spikes separated by strongly predicted blanks, we can divide the output sequence into sections that are very likely to begin and end with a blank. We do this by choosing boundary points where the probability of observing a blank label is above a certain threshold, then applying the above algorithm to each section individually and concatenating the results to get the final transcription.
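
A small sketch of this sectioning heuristic (the threshold value is arbitrary):

```python
def blank_sections(y, blank=0, threshold=0.99):
    """Split the T time steps into (start, end) sections separated by frames
    where the blank probability exceeds the threshold; prefix search can then
    be run on each slice of y independently."""
    sections, start = [], 0
    for t in range(len(y)):
        if y[t][blank] > threshold:
            if t > start:
                sections.append((start, t))
            start = t + 1
    if start < len(y):
        sections.append((start, len(y)))
    return sections
```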

Last Modified: Wed Nov 18 13:58:52 2015 UTC+0800