backpropagation

Nowadays one can’t walk through a technical trade show, sit through a presentation, or read an online article without running into terms like AI and Deep Neural Network. If you feel like an outsider looking in but don’t have a lot of time to devote to studying this subject, this article is for you.

Historical context…

Neural networks have been around since the perceptron of the late 1950s and had their first renaissance in the late 80’s to early 90’s, but the promise of the technology was not fully realized due to a lack of compute power. However, the past 10 years have seen a resurgence of neural networks, as researchers further enhanced models developed in the 80’s and applied them to practical applications such as speech recognition and computer vision. This time the technology received wide adoption, aided by powerful compute available both at the “edge” and in the “cloud”.

Back-propagation

The late 80’s saw many useful neural network models, the most prominent being the multi-layer network trained with back-propagation (BKP). This network model learns the mapping between a set of input/output data pairs, known as the “training set”, much like how curve-fitting fits a known function (e.g., a polynomial) to a set of (x, y) data points.

Figure 1 – Simplified back-propagation model.

A BKP network is made of an input layer, an output layer, and one or more hidden layers. Neurons in adjacent layers are generally fully connected, and each connection carries a weight that is initialized to a normalized random value.

BKP operates in two modes: training and inference. During training, the training input vector is presented to the input layer and the training output vector is presented to the output layer. In each layer, the input vector is multiplied by the weight matrix to produce a weighted sum for every neuron, a bias is added to the sum, and the result is passed through an activation function. Common choices are a sigmoidal (logistic) function, the hyperbolic tangent function (tanh for short), and the linear-above-threshold function known as ReLU; softmax, a multi-class generalization of the sigmoid, is typically used only at the output layer of a classifier. The output of one layer becomes the input to the next layer, and so on, until the entire network is computed. This forward computation flow is known as forward-propagation.

Figure 2 – Processing in a BKP neuron.
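The forward-propagation step described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names (`layer_forward`, `forward`) and the tiny 2-2-1 network with hand-picked weights are assumptions made for this example, not something taken from the figures.

```python
import math

def sigmoid(x):
    """Logistic (sigmoidal) activation."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Linear-above-threshold activation."""
    return max(0.0, x)

def layer_forward(inputs, weights, biases, activation):
    """Compute one layer; weights[j][i] connects input i to neuron j."""
    outputs = []
    for w_row, b in zip(weights, biases):
        total = sum(w * x for w, x in zip(w_row, inputs)) + b  # weighted sum + bias
        outputs.append(activation(total))
    return outputs

def forward(inputs, layers):
    """Feed the output of each layer into the next one."""
    for weights, biases, activation in layers:
        inputs = layer_forward(inputs, weights, biases, activation)
    return inputs

# Hypothetical network: 2 inputs -> 2 hidden neurons (ReLU) -> 1 output (sigmoid).
hidden = ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0], relu)
output = ([[1.0, -1.0]], [0.0], sigmoid)
y = forward([1.0, 0.0], [hidden, output])
print(y)
```

Swapping `relu` for `math.tanh` or `sigmoid` in the hidden layer changes only the activation; the weighted-sum-plus-bias structure stays the same.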

The output vector produced by forward-propagation is then compared with the training output vector and an error is computed, a computation known as the loss function in modern DNN speak. A typical loss function, such as the sum of squared errors, accumulates the difference between the training output and the network output, where k iterates over the M training data points and j is the output neuron index.
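One common loss that matches this description is the sum-of-squared-errors. The symbols here are assumptions for illustration: t_{kj} is the training (target) output of neuron j for data point k, and y_{kj} is the corresponding network output. Other losses, such as cross-entropy, follow the same indexing.

```latex
E = \frac{1}{2} \sum_{k=1}^{M} \sum_{j} \left( t_{kj} - y_{kj} \right)^{2}
```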

Once the error is computed, the back-propagation process begins. The purpose of back-propagation is to attribute the prediction error proportionally to the neurons of the previous layer by computing how much each of these neurons contributed to the error. Connection weights are then adjusted accordingly: connections from active neurons that improved the prediction are strengthened, and connections from active neurons that worsened the prediction are weakened.

Figure 3 – Back-propagation flow.
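To make the back-propagation flow concrete, here is a minimal sketch of one training step for a network with a single hidden layer and one output neuron. It assumes sigmoid activations and a squared-error loss; the helper names (`make_network`, `train_step`), the learning rate, and the single training sample are hypothetical choices for this example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_network(n_in, n_hidden, rng):
    """Hypothetical helper: one hidden layer, one output neuron, small random weights."""
    return {
        "w_hidden": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                     for _ in range(n_hidden)],
        "b_hidden": [0.0] * n_hidden,
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b_out": 0.0,
    }

def train_step(net, x, target, lr=0.5):
    # Forward-propagation.
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(net["w_hidden"], net["b_hidden"])]
    y = sigmoid(sum(w * hj for w, hj in zip(net["w_out"], h)) + net["b_out"])
    loss = 0.5 * (target - y) ** 2

    # Back-propagation: each delta is d(loss)/d(weighted sum) of a neuron.
    delta_out = (y - target) * y * (1.0 - y)
    delta_hidden = [delta_out * net["w_out"][j] * h[j] * (1.0 - h[j])
                    for j in range(len(h))]

    # Weight update: step against the gradient.
    for j in range(len(h)):
        net["w_out"][j] -= lr * delta_out * h[j]
        net["b_hidden"][j] -= lr * delta_hidden[j]
        for i in range(len(x)):
            net["w_hidden"][j][i] -= lr * delta_hidden[j] * x[i]
    net["b_out"] -= lr * delta_out
    return loss

net = make_network(n_in=2, n_hidden=2, rng=random.Random(0))
losses = [train_step(net, [1.0, 0.0], 1.0) for _ in range(200)]
print(losses[0], losses[-1])
```

Note that the hidden-layer deltas are computed from the output weights before those weights are changed, so the error is attributed using the same weights that produced the prediction.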

Gradient descent

The aforementioned weight update process is a learning algorithm called Gradient Descent (GD). Given a BKP network model and a training data set, an error surface is defined over all possible weight values, and the initial weight vector picks the starting point on that surface. GD learning works by repeatedly stepping in the direction of steepest descent on the error surface. Imagine you are on a mountain top, ready to get back down, but without a map. One way is to follow the steepest descending path around you, and keep doing so until you reach the lowest point. However, as experience tells us, this method does not always lead to the lowest point. Sometimes we get stuck in a local valley. In optimization speak, this is called a local minimum. Similarly, BKP training may converge on a local minimum, a sub-optimal solution, as the two figures below depict. Left is a local minimum; right is a global minimum.
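The mountain analogy can be tried out on a one-dimensional toy error surface. The function below is an assumption chosen for illustration (it is not derived from any network): it has a shallow valley on the right and a deeper one on the left, so where gradient descent ends up depends on where it starts.

```python
def error(w):
    """Toy error surface with two valleys (illustrative only)."""
    return w**4 - 3 * w**2 + w

def gradient(w):
    """Derivative of the toy error surface."""
    return 4 * w**3 - 6 * w + 1

def gradient_descent(w, lr=0.01, steps=1000):
    """Repeatedly step downhill along the negative gradient."""
    for _ in range(steps):
        w -= lr * gradient(w)
    return w

w_right = gradient_descent(2.0)   # settles in the shallow valley near w = 1.13
w_left = gradient_descent(-2.0)   # settles in the deeper valley near w = -1.30
print(w_right, error(w_right))
print(w_left, error(w_left))
```

Both runs stop where the gradient is zero, but only the run started at -2.0 reaches the global minimum; the run started at 2.0 is stuck in a local minimum.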

There are a few ways to improve the chances of converging to a better solution. A close relative of GD, the Stochastic Gradient Descent (SGD) algorithm, updates the weights after each training sample or small randomly drawn batch rather than after a pass over the whole training set, usually reshuffling the order in which training data are presented to the DNN; the resulting noise in the updates improves the chance of escaping a poor local minimum. Another way is to add a momentum term to the weight update, allowing learning to “climb hills” on the error surface in order to discover lower “valleys”.
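Both ideas, shuffled mini-batches and a momentum term, fit in a short sketch. The example below fits a straight line rather than a neural network so that it stays small; the function name `sgd_momentum`, the hyperparameter values, and the synthetic data are all assumptions for illustration.

```python
import random

def sgd_momentum(xs, ys, lr=0.02, momentum=0.9, batch_size=4, epochs=300, seed=0):
    """Fit y = w*x + b with mini-batch SGD plus a momentum term."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    vw, vb = 0.0, 0.0                      # velocity carried over between updates
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)                  # present the data in a new order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error (halved) over this batch only.
            grad_w = sum((w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum((w * x + b - y) for x, y in batch) / len(batch)
            # Momentum: blend the previous step with the new gradient step.
            vw = momentum * vw - lr * grad_w
            vb = momentum * vb - lr * grad_b
            w += vw
            b += vb
    return w, b

xs = [i / 10 for i in range(-10, 11)]
ys = [2.0 * x + 1.0 for x in xs]           # synthetic data from the line y = 2x + 1
w, b = sgd_momentum(xs, ys)
print(w, b)
```

Setting `momentum=0.0` reduces this to plain mini-batch SGD, which makes it easy to compare the two on the same data.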

What’s so “deep” about DNN?

Modern neural networks, dubbed Deep Neural Networks (DNNs), are variants of BKP with layers of feature detectors added to the front end.

Figure 2 – Biological receptive field.

These front-end feature detectors are inspired by receptive fields in biological vision. A low-level detector can only detect a simple feature, such as a line segment in a local image neighborhood (e.g., 3×3). If adjacent detectors covering a larger region feed neurons in an upper layer, these upper-layer neurons can detect more complex patterns over the larger region. It follows that, by stacking more layers, more complex features over larger image regions can be detected. DNNs are “deep” because they are made of many layers.
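A low-level 3×3 detector of this kind can be sketched as a small convolution. In a real DNN the kernel values are learned during training; the hand-written vertical-line kernel and the 5×5 test image below are assumptions made purely for illustration.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            row.append(sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

# Hand-written detector that responds to a vertical line in its center column.
vertical_line = [[-1, 2, -1],
                 [-1, 2, -1],
                 [-1, 2, -1]]

# 5x5 image containing a vertical line in the middle column.
image = [[0, 0, 1, 0, 0],
         [0, 0, 1, 0, 0],
         [0, 0, 1, 0, 0],
         [0, 0, 1, 0, 0],
         [0, 0, 1, 0, 0]]

response = convolve2d(image, vertical_line)
for row in response:
    print(row)
```

The response is strongest where the 3×3 window is centered on the line and negative where the line falls on the edge of the window. Feeding such response maps into another layer of detectors is what lets upper layers respond to patterns spanning a larger region.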

Figure 4 – AlexNet showing increasingly complex features over layers.

DNN applications

Applications of DNNs can be categorized into the following:

Object classification – determine object class given an image
Object detection/tracking – locate objects in an image and follow the same object across consecutive images
Object segmentation – find image regions belonging to different objects
Action recognition – recognize actions or intent of observed subjects
Super-resolution – generate new pixels thereby increasing image resolution
Style transfer/Colorization – modify images based on known artistic styles

For the first three applications, the key contribution of DNNs is the addition of convolution layers at the front end of the network to extract complex features; these features are learned rather than hand-designed.

Good DNN performance requires large, well-structured training data sets. Researchers often benchmark DNN performance using well-known, publicly available data sets, such as:

ImageNet/ILSVRC (images associated with words, i.e., nouns)
PASCAL VOC (images associated with words)
COCO (image database of complex scenes with object segmentation)
MNIST (handwritten-digit database often used in DNN introductions)

Organizations hosting these data sets frequently hold competitions using the data sets as a proving ground for accuracy (e.g., mean average precision, mAP) and speed (frames per second).

Notable DNNs for object recognition in recent years include:

AlexNet – by Alex Krizhevsky et al., Univ. of Toronto
VGGNet – by the Visual Geometry Group, Univ. of Oxford
Single-Shot Multi-Box Detector (SSD) – by Liu et al., Univ. of North Carolina at Chapel Hill
Region-based Fully Convolutional Network (R-FCN) – by Dai et al., Microsoft Research
GoogLeNet (Inception) – cascading Inception modules; by Google
You Only Look Once (YOLO) – by Redmon et al., Univ. of Washington
Residual Network (ResNet) – by He et al., Microsoft Research

Their relative performance is summarized by the figure below:

Figure 5 – Performance of popular DNN models.

Summary

In this blog post I provided a brief history of neural networks and an overview of the BKP network, which is at the heart of most DNNs today, then quickly covered some popular DNNs for object recognition. At this point, you should be up to date on key DNN terminology and the ecosystem. If you wish to dive deeper, follow this blog series; in the rest of the series I will dive deeper into the DNNs applicable to each application category.

(This article is solely the expressed opinion of the author and does not necessarily reflect the position of his current and past employers or other association)

Larrylisky's Wiki

A Little of Everything

Deep Neural Network in 10 minutes
