# February Data Digest

Here is our selection of academic articles on deep learning published in February. It covers diverse topics: half-precision training (two different approaches to achieving up to 2x faster deep learning training), style transfer (a closed-form solution for photorealistic style transfer with smoothing), and reinforcement learning (an algorithm reported to be 10x more effective than previous ones and above the human level).

## Mixed Precision in Deep Networks Training

It's commonplace that deep learning benefits from increasing the size of the model and the amount of training data. For example, a ResNet model with 20 layers has 0.27M parameters, and the parameter count grows linearly with the number of layers: the 110-layer variant has 1.7M parameters (both figures refer to the CIFAR-10 versions of ResNet). An inevitable consequence is that the memory and computation requirements for model training increase as well. The most straightforward way to reduce the memory requirements is to use lower-precision arithmetic.

Two new articles dedicated to this topic were published on arXiv in February: Mixed Precision Training and Mixed Precision Training of Convolutional Neural Networks using Integer Operations.

#### Mixed Precision Training

A new study proposes three techniques for training in half precision (FP16) while still matching the model accuracy of single precision:

• single-precision FP32 master copy of weights and updates;
• special loss scaling;
• accumulating half-precision products into single precision.

Single-precision master copy of weights
In vanilla half-precision training all weights, activations, and gradients are stored as FP16. In half-precision arithmetic, values smaller than $2^{-24}$ become zero. The authors stress that approximately 5% of weight gradient values are zeroed in FP16 for this reason. To overcome this problem, they propose keeping an FP32 master copy of all weights: forward and backward propagation are performed in FP16, and the resulting updates are then applied to the weights stored in the master copy.
Storing an additional copy of the weights increases the memory requirements compared to vanilla half-precision training, but the impact on total memory usage is not so significant: compared to FP32 training, the overall memory consumption is still approximately halved.
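To see why the master copy helps, here is a minimal NumPy sketch (our own illustration, not code from the paper). The learning rate, gradient value, initial weight, and step count are assumptions, chosen so that every update equals $2^{-26}$, which is below the smallest positive FP16 value $2^{-24}$.

```python
import numpy as np

lr = 2.0 ** -16                      # illustrative learning rate (assumption)
grad_fp16 = np.float16(2.0 ** -10)   # illustrative gradient from an FP16 backward pass
steps = 4096

# FP16-only training: the update lr * grad = 2**-26 rounds to zero in FP16,
# so the weight never moves.
update_fp16 = np.float16(lr) * grad_fp16
w_fp16 = np.float16(0.0625)
for _ in range(steps):
    w_fp16 = np.float16(w_fp16 - update_fp16)

# FP32 master copy: the same FP16 gradient is cast up before the update,
# so the small update survives and accumulates.
w_master = np.float32(0.0625)
for _ in range(steps):
    w_master = w_master - np.float32(lr) * np.float32(grad_fp16)

# FP16 copy of the master weight, as it would be used in the next forward pass.
w_for_forward = np.float16(w_master)

print(float(update_fp16), float(w_fp16), float(w_master), float(w_for_forward))
```

The FP16-only weight stays at its initial value, while the master copy accumulates all 4096 small updates and the accumulated change becomes large enough to be visible in FP16 again.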

Loss scaling
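In FP16 arithmetic, small weight updates become zero. To mitigate this problem, the authors introduce a constant loss scaling factor ranging from 8 to 32K: the loss is multiplied by this factor before back-propagation, so every gradient is scaled by the same amount, and the gradients are unscaled again before the weight update. In case of overflow (which can be detected by inspecting the computed weight gradients), the authors propose skipping the weight update and simply moving on to the next iteration.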
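The mechanics of loss scaling can be sketched in a few lines of NumPy. This is a simplified single-weight illustration, not the authors' implementation: the function name `scaled_sgd_step`, the gradient values, and the scale of 1024 are all assumptions made for the example.

```python
import numpy as np

def scaled_sgd_step(w_master, true_grad, lr, scale):
    """One SGD step with a constant loss scale; returns (new FP32 weight, skipped)."""
    with np.errstate(over="ignore"):
        # Scaling the loss by `scale` scales every gradient by `scale` (chain rule);
        # the backward pass then stores the result in FP16.
        grad_fp16 = np.float16(true_grad * scale)
    if not np.isfinite(grad_fp16):
        return w_master, True                       # overflow detected: skip the update
    grad_fp32 = np.float32(grad_fp16) / np.float32(scale)   # unscale in FP32
    return np.float32(w_master - np.float32(lr) * grad_fp32), False

w0 = np.float32(0.0625)
tiny = 2.0 ** -27   # illustrative gradient that underflows in FP16 when unscaled

print(scaled_sgd_step(w0, tiny, lr=1.0, scale=1.0))       # gradient lost, weight unchanged
print(scaled_sgd_step(w0, tiny, lr=1.0, scale=1024.0))    # gradient preserved, weight changes
print(scaled_sgd_step(w0, 100.0, lr=1.0, scale=1024.0))   # scaled gradient overflows, step skipped
```

Unscaling happens in FP32, so the small gradient that was shifted into the FP16 range by the scale factor is recovered exactly before it reaches the weight.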

Arithmetic precision
The neural network arithmetic operations can be divided into three groups:

• vector dot-products $\displaystyle x^Ty=\sum_{i=1}^nx_iy_i$, where $x, y\in\mathbb{R}^{n\times 1}$;
• reductions $\displaystyle \bar{x}=\frac{1}{n}\sum_{i=1}^nx_i$;
• point-wise operations $\displaystyle \sigma(X)=\left(\sigma(X_1), \ldots,\sigma(X_m) \right)$, where $X\in\mathbb{R}^{m}$ and $\sigma(\cdot): \mathbb{R}\to\mathbb{R}$ is some (usually non-linear) real-valued function.

The authors stress that to maintain model accuracy, some networks require that the FP16 vector dot-product accumulates the partial products into an FP32 value, which is then converted to FP16 before storing.

Large reductions, which come up in batch-normalization and softmax layers, should be performed in FP32 and then stored back to FP16.

Since point-wise operations are memory-bandwidth limited, arithmetic precision does not affect their speed, and either FP16 or FP32 arithmetic can be used.
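The effect of the accumulator precision on a dot-product is easy to reproduce. The sketch below is our own toy example: the all-ones vectors of length 4096 are an assumption chosen to make the rounding visible, and the explicit Python loops only emulate what the hardware does.

```python
import numpy as np

def dot_fp16_accumulate(x, y):
    """Dot-product where products and the running sum are both rounded to FP16."""
    acc = np.float16(0.0)
    for a, b in zip(x, y):
        acc = np.float16(acc + a * b)
    return acc

def dot_fp32_accumulate(x, y):
    """Dot-product of FP16 inputs with FP32 partial sums, stored as FP16 at the end."""
    acc = np.float32(0.0)
    for a, b in zip(x, y):
        acc += np.float32(a) * np.float32(b)
    return np.float16(acc)   # single conversion before storing

x = np.ones(4096, dtype=np.float16)
y = np.ones(4096, dtype=np.float16)

print(float(dot_fp16_accumulate(x, y)))   # stalls once the sum reaches 2048
print(float(dot_fp32_accumulate(x, y)))   # exact result, representable in FP16
```

With an FP16 accumulator the sum stops growing at 2048, because the spacing between neighbouring FP16 values there is 2 and adding 1 is rounded away. The FP32 accumulator reaches 4096, which still fits into FP16 when stored.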

Experiments and conclusions
The authors considered a range of image classification, speech recognition, machine translation, and language modelling tasks and showed that half-precision arithmetic with the proposed techniques achieves accuracy comparable to single precision on a wide range of models. On some models trained on Volta GPUs they report 2-6x speedups, but in the general case the reduction in training time depends on library and framework optimizations.

It's worth mentioning that this result was achieved by the research group from Baidu in collaboration with researchers from Nvidia Corp.

#### Note on mixed precision arithmetic in PyTorch

To summarize, here is how to perform computation in PyTorch with half-precision arithmetic on a GPU:

• first cast the model (and inputs) to FP16: `model.cuda().half()`;

• store a master copy of the model weights in FP32 and define the optimizer, which will update the master copy during training (assuming `import torch`):

    # FP32 master copy of the FP16 model weights
    param_copy = [param.detach().clone().float() for param in model.parameters()]
    for param in param_copy:
        # detached copies do not track gradients by default
        param.requires_grad = True
    # the optimizer updates the master copy; lr=0.1 is an illustrative value
    optimizer = torch.optim.SGD(param_copy, lr=0.1)

• convert batch normalization layers to FP32 for accumulation, otherwise you may run into convergence issues (assuming `import torch.nn as nn`):

    for layer in model.modules():
        if isinstance(layer, nn.BatchNorm2d):
            layer.float()

• use a loss scale factor to mitigate the zeroing of small gradient updates: `loss = loss * scale_factor`;

• on each optimization step, after `model.zero_grad()` and `loss.backward()`:
  • cast the computed gradients to FP32 and unscale them if the loss was scaled;
  • update the weights in FP32 and copy the updated weights back to the model, casting them to FP16.

#### Mixed Precision Training of Convolutional Neural Networks using Integer Operations

Although state-of-the-art results in mixed precision training are mostly achieved with FP16 arithmetic, the authors of this study propose a new mixed precision training setup that uses Dynamic Fixed Point (DFP) tensors, each represented by an INT16 tensor combined with a shared tensor-wide exponent.

The authors define DFP tensor primitives to facilitate arithmetic operations (summation and multiplication): applied to two DFP-16 tensors, such an operation yields one DFP-32 tensor with a new shared exponent, and a down-conversion operation then scales the DFP-32 output back to a DFP-16 tensor.
Efficient implementation of the DFP-16 tensor primitives is supported through the "prototype 16-bit integer kernels in Intel's MKL-DNN library along with explicit exponent management." The experiments were run on the recently introduced Intel Xeon Phi Knights Mill hardware.
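To make the DFP idea more concrete, here is a rough NumPy sketch of a tensor stored as INT16 mantissas plus one shared exponent, with an element-wise multiplication and a down-conversion. It is a simplified illustration under our own assumptions, not the authors' primitives: the helper names (`to_dfp16`, `dfp16_mul`, `dfp32_to_dfp16`, `from_dfp`), the way the exponent is chosen, and the truncating shift are all hypothetical.

```python
import numpy as np

def to_dfp16(x):
    """Quantize a float array to (int16 mantissas, shared exponent e), so x ~ m * 2**e."""
    max_abs = float(np.max(np.abs(x)))
    if max_abs == 0.0:
        return np.zeros(x.shape, dtype=np.int16), 0
    # Pick e so that the largest magnitude uses (almost) the full int16 range.
    e = int(np.ceil(np.log2(max_abs))) - 15
    m = np.clip(np.round(x / 2.0 ** e), -32768, 32767).astype(np.int16)
    return m, e

def dfp16_mul(m_a, e_a, m_b, e_b):
    """Element-wise product of two DFP-16 tensors: int32 mantissas, exponents add up."""
    return m_a.astype(np.int32) * m_b.astype(np.int32), e_a + e_b

def dfp32_to_dfp16(m32, e32):
    """Down-convert: shift right until the largest mantissa fits into int16."""
    max_abs = int(np.max(np.abs(m32)))
    shift = max(0, max_abs.bit_length() - 15)
    return (m32 >> shift).astype(np.int16), e32 + shift

def from_dfp(m, e):
    """Convert a DFP tensor back to floating point."""
    return m.astype(np.float64) * 2.0 ** e

a = np.array([0.5, -0.25, 0.125])
b = np.array([0.75, 0.5, -1.0])
m_a, e_a = to_dfp16(a)
m_b, e_b = to_dfp16(b)
m32, e32 = dfp16_mul(m_a, e_a, m_b, e_b)
m16, e16 = dfp32_to_dfp16(m32, e32)
print(from_dfp(m16, e16), a * b)
```

All arithmetic on the mantissas is integer arithmetic; the only floating-point work is the conversion at the boundaries, and the exponent bookkeeping is a couple of integer additions per tensor.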

The authors stress that this approach doesn't require any hyperparameter tuning (which is necessary for FP16 mixed precision training), and they report results comparable to the state-of-the-art results for FP32 training, with potential 2x savings in computation.

## Style Transfer: Exposition

Deep convolutional neural networks (CNNs) are very effective for image recognition and classification tasks. A CNN trained for image recognition learns an internal representation of objects, which can be interpreted as content and style features. Content features are used to recognize objects during classification. The article Image Style Transfer Using Convolutional Neural Networks showed that correlations between features in the deep layers of a CNN encode the visual style of an image.

Suppose we have a content image $\mathbf{c}$, from which we want to take the content, and a style image $\mathbf{s}$, whose style we are going to apply to the content of $\mathbf{c}$ to produce a new image $\mathbf{x}$.

You can't explicitly tag features as "content" or "style", but you can define loss functions in a way that encourages transferring content from $\mathbf{c}$ and style from $\mathbf{s}$.

Define the feature map of CNN layer $\ell$ as $F_{\ell}[\cdot]\in\mathbb{R}^{N_{\ell}\times D_{\ell}}$, where $N_{\ell}$ is the number of filters and $D_{\ell}$ is the size of the vectorized feature map in layer $\ell$.
The common approach to content features is to use the squared error loss:

$$\mathcal{L_c}^{\ell}(\mathbf{c}, \mathbf{x})=\frac{1}{2N_{\ell}D_{\ell}}\sum_{i,j}\left(F_{\ell}[\mathbf{x}]-F_{\ell}[\mathbf{c}]\right)_{i,j}^2$$

Define the Gram matrix $G_{\ell}[\cdot]=F_{\ell}[\cdot]F_{\ell}^T[\cdot]\in\mathbb{R}^{N_{\ell}\times N_{\ell}}$, each element $G_{i,j}^{\ell}$ of which is the inner product of the vectorized feature maps $i$ and $j$ in layer $\ell$. The Gram matrix represents the feature correlations, which are used to capture style. The style loss is defined as

$$\mathcal{L_s}^{\ell}(\mathbf{s}, \mathbf{x})=\frac{1}{2N_{\ell}^2}\sum_{i,j}\left(G_{\ell}[\mathbf{x}]-G_{\ell}[\mathbf{s}]\right)_{i,j}^2$$

The total loss for the style transfer problem is defined as the weighted sum of content and style losses:

$$\mathcal{L}(\mathbf{c}, \mathbf{s}, \mathbf{x})=\alpha \sum_{\ell}\mathcal{L_c}^{\ell}(\mathbf{c}, \mathbf{x})+\beta\sum_{\ell} w_{\ell}\mathcal{L_s}^{\ell}(\mathbf{s}, \mathbf{x}),$$

where $w_{\ell}$ are weighting factors regulating the contribution of each layer to the total loss.
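The three losses translate almost directly into NumPy. In the sketch below the feature maps are plain random matrices standing in for real CNN activations (an assumption made to keep the example self-contained), the function names are ours, and the style loss compares the Gram matrices of $\mathbf{x}$ and $\mathbf{s}$.

```python
import numpy as np

def content_loss(F_x, F_c):
    """Squared error between two N x D feature maps, normalized by 2 * N * D."""
    n_filters, size = F_x.shape
    return np.sum((F_x - F_c) ** 2) / (2.0 * n_filters * size)

def gram(F):
    """Gram matrix of an N x D feature map: inner products between filter responses."""
    return F @ F.T

def style_loss(F_x, F_s):
    """Squared error between Gram matrices, normalized by 2 * N**2."""
    n_filters = F_x.shape[0]
    return np.sum((gram(F_x) - gram(F_s)) ** 2) / (2.0 * n_filters ** 2)

def total_loss(feats_x, feats_c, feats_s, alpha, beta, layer_weights):
    """Weighted sum of content and style losses over a list of layers."""
    l_content = sum(content_loss(fx, fc) for fx, fc in zip(feats_x, feats_c))
    l_style = sum(w * style_loss(fx, fs)
                  for w, fx, fs in zip(layer_weights, feats_x, feats_s))
    return alpha * l_content + beta * l_style

rng = np.random.default_rng(0)
F = rng.normal(size=(4, 6))          # stand-in for one layer: 4 filters, 6 positions
F_shifted = np.roll(F, 1, axis=1)    # same responses, moved to other positions

print(content_loss(F_shifted, F))    # positive: the spatial layout changed
print(style_loss(F_shifted, F))      # ~0: the Gram matrix ignores the layout
```

The last two lines show why the Gram matrix is a reasonable style descriptor: moving features to other spatial positions changes the content loss but leaves the style loss at (numerically) zero.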

To address the problem of artifacts in generated photorealistic images, another approach to style transfer was proposed in the article Universal Style Transfer via Feature Transforms: the authors formulate the transfer task as an image reconstruction process and apply the classic signal whitening and coloring transforms (WCT) to the features extracted in each intermediate layer.

The whitening transform is a decorrelation operation: consider a column vector $\mathbf{x}$ with zero mean and non-singular covariance matrix $C$; then $\mathbf{y}=W\mathbf{x}$, where $W^TW=C^{-1}$, is a whitened vector whose covariance matrix is the identity.
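As a quick numerical check of the whitening transform, the sketch below builds $W$ from the eigendecomposition of the covariance matrix. This symmetric choice of $W$ is only one of many matrices satisfying $W^TW=C^{-1}$, and the mixing matrix used to generate correlated data is a hypothetical example.

```python
import numpy as np

def whitening_matrix(C):
    """Return a symmetric W with W.T @ W = inv(C), via C = E diag(lam) E.T."""
    lam, E = np.linalg.eigh(C)
    return E @ np.diag(lam ** -0.5) @ E.T

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.3, 0.2]])           # hypothetical mixing matrix
X = A @ rng.normal(size=(3, 1000))        # correlated features, one sample per column
X = X - X.mean(axis=1, keepdims=True)     # enforce zero mean
C = X @ X.T / (X.shape[1] - 1)            # sample covariance matrix

W = whitening_matrix(C)
Y = W @ X                                 # whitened features
cov_Y = Y @ Y.T / (Y.shape[1] - 1)
print(np.round(cov_Y, 6))
```

The covariance of the whitened features comes out as the identity matrix up to floating-point error, which is exactly the decorrelation property described above.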

For style transfer, WCT was implemented as an autoencoder on each of the intermediate layers of the CNN. This approach was shown to be less prone to artifacts when applied to photorealistic images. Nevertheless, it still generates artifacts that cause inconsistent stylization. Another drawback is that training this model is computationally challenging.

One more step towards photorealistic style transfer was made in a recent article published in February.

#### A Closed-form Solution to Photorealistic Image Stylization

In this paper, the authors propose a novel, fast photorealistic image style transfer algorithm consisting of two steps: stylization and smoothing.
Closed-form solutions are provided for both steps.

The stylization step is based on an improved version of the autoencoder that performs the whitening and coloring transform (WCT) and is referred to as the PhotoWCT step.

The problem with WCT-stylized images is that repeating, semantically similar patterns can be stylized differently. To address this problem, a smoothing step is introduced with two goals:

• regions with similar content should be stylized similarly;
• the smoothed result shouldn't deviate significantly from the PhotoWCT result.

Motivated by ranking algorithms that rank data points in Euclidean space with respect to the intrinsic manifold structure of the data, the authors represent all pixels as nodes in a graph and define an affinity matrix $W\in\mathbb{R}^{N\times N}$, where $N$ is the number of pixels.

The ranking problem is stated as follows: given a set of points $Y=(y_1,\ldots,y_q,y_{q+1},\ldots, y_N)$, where the first $q$ points are marked as "queries", rank the remaining points according to their relevance to the query points.

The smoothing step can be formulated as the following optimization problem:
$$r^*=\arg\min_r\frac{1}{2}\left(\sum_{i,j=1}^Nw_{ij}\left\|\frac{r_i}{\sqrt{d_{ii}}}-\frac{r_j}{\sqrt{d_{jj}}}\right\|^2+\lambda\sum_{i=1}^N\left\|r_i-y_i\right\|^2\right),$$
where $\displaystyle d_{ii}=\sum_jw_{ij}$, $y_i$ is the pixel color in the PhotoWCT-stylized result $Y$, $r_i$ is the pixel color in the desired smoothed output $R$, and $\lambda$ controls the balance between the two terms.

The most wonderful thing is that the above problem has a closed-form solution:
$$R^*=(1-\alpha)(I-\alpha S)^{-1}Y,$$
where $I$ is the identity matrix, $\alpha=\frac{1}{1+\lambda}$, $D$ is the diagonal degree matrix of $W$ with entries $\sum_j w_{ij}$, and $S=D^{-1/2}WD^{-1/2}\in\mathbb{R}^{N\times N}$.
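The closed-form solution is short enough to try on a toy example. The sketch below is our own illustration, not the authors' code: the four-pixel affinity matrix and the colors are hypothetical, $W$ is assumed symmetric with positive row sums, and a linear solve is used instead of forming the inverse explicitly. The final check confirms that the result satisfies $(I-S)R+\lambda(R-Y)=0$, the linear system that is algebraically equivalent to the closed form with $\alpha=\frac{1}{1+\lambda}$.

```python
import numpy as np

def normalized_affinity(W):
    """S = D^{-1/2} W D^{-1/2}, where D is the diagonal matrix of row sums of W."""
    d = W.sum(axis=1)
    return W / np.sqrt(np.outer(d, d))

def smooth(Y, W, lam):
    """Closed-form smoothing R = (1 - alpha) (I - alpha S)^{-1} Y, alpha = 1 / (1 + lam)."""
    S = normalized_affinity(W)
    alpha = 1.0 / (1.0 + lam)
    identity = np.eye(W.shape[0])
    # Solve (I - alpha S) R = (1 - alpha) Y rather than inverting the matrix.
    return (1.0 - alpha) * np.linalg.solve(identity - alpha * S, Y)

# Hypothetical image of 4 pixels: pixels 0-1 and 2-3 are strongly similar,
# pixels 1-2 are only weakly connected.
W = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.1, 0.0],
              [0.0, 0.1, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
Y = np.array([[1.0, 0.0, 0.0],     # stylized RGB colors, one row per pixel
              [0.8, 0.2, 0.0],
              [0.0, 0.2, 0.8],
              [0.0, 0.0, 1.0]])
lam = 0.5

R = smooth(Y, W, lam)
S = normalized_affinity(W)
residual = (np.eye(4) - S) @ R + lam * (R - Y)
print(np.abs(residual).max())
```

For real images $N$ is the number of pixels, so in practice $W$ has to be sparse and the system would be solved with a sparse solver; the dense version above is only meant to show the algebra. A very large $\lambda$ keeps the output close to the PhotoWCT result $Y$, while a small $\lambda$ smooths more aggressively.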

You can see the impressive photorealistic style transfer in the original article.

## Reinforcement Learning

Reinforcement learning is the branch of machine learning in which model training is based on responses (rewards and penalties) that an agent obtains from the environment. A commonly used environment is the collection of Atari games. The problem is that an agent trained for one task can't apply previously acquired skills to another task.
This problem was partially solved in the article Asynchronous Methods for Deep Reinforcement Learning, which proposed the A3C (Asynchronous Advantage Actor-Critic) algorithm. In A3C, individual agents (actors) explore the environment for some time; then the process is suspended and they send what they have learned (in the form of gradients of the loss function) to the central component (parameter server or learner), which updates the actors' parameters.

#### IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

A new asynchronous actor-critic architecture for deep reinforcement learning is called the Importance Weighted Actor-Learner Architecture (IMPALA). The two main differences between A3C and IMPALA are:

• the gradient computations are performed by the learners in IMPALA;
• there can be several learners in IMPALA and these learners can exchange the computed gradients with each other.

The algorithm was trained on a recently published suite of 3D navigation and puzzle-solving tasks from DeepMind Lab and was shown to be 10 times more effective than the A3C-based algorithm.