Read the paper: ImageNet Classification with Deep Convolutional Neural Networks

This article is a reading summary of the AlexNet paper.

On ImageNet LSVRC-2010, AlexNet achieved top-1 and top-5 error rates of 37.5% and 17.0% after training on 1.2 million high-resolution images spanning 1,000 categories. (Top-5 error: the model proposes 5 candidate categories for an image, and the prediction counts as correct if any of them matches the manually labeled category; top-1 error uses only the single highest-scoring category.) A variant of the network achieved a top-5 error rate of 15.3% in ImageNet LSVRC-2012. AlexNet has 60 million parameters and 650,000 neurons, and consists of 5 convolutional layers, some of which are followed by a max-pooling layer, and 3 fully-connected layers. To reduce overfitting, dropout is used in the fully-connected layers, as described in more detail below.
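As a concrete illustration of the top-1/top-5 metric, here is a small NumPy sketch; the function name and the toy scores are made up for illustration, not taken from the paper.

```python
import numpy as np

def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is NOT among the k highest-scoring classes."""
    # indices of the k largest scores per sample (their internal order does not matter)
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = np.any(top_k == labels[:, None], axis=1)
    return 1.0 - hits.mean()

# toy example: 4 samples, 10 classes, random scores and labels
rng = np.random.default_rng(0)
scores = rng.random((4, 10))
labels = rng.integers(0, 10, size=4)
print(top_k_error(scores, labels, k=1))  # top-1 error
print(top_k_error(scores, labels, k=5))  # top-5 error (never larger than top-1)
```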

The data comes from ImageNet: the training set contains 1.2 million images, the validation set 50,000 images, and the test set 150,000 images, categorized into 1,000 classes. The images come in varying resolutions, but AlexNet requires a fixed-size input, so Alex's team down-sampled every image to 256×256: given a rectangular image, they first rescaled it so that the shorter side had a length of 256, and then cropped the central 256×256 patch out of the result.
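A minimal sketch of that preprocessing step using Pillow; the function name and file path are illustrative, not from the paper's code.

```python
from PIL import Image

def rescale_and_center_crop(path, size=256):
    """Rescale so the shorter side equals `size`, then take the central size x size crop."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = size / min(w, h)                              # shorter side -> 256
    img = img.resize((round(w * scale), round(h * scale)))
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size))

# example: rescale_and_center_crop("some_image.jpg").size == (256, 256)
```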

At the time, the standard neuron activation function was tanh(), a saturating nonlinearity that is much slower to train with gradient descent than a non-saturating nonlinearity, so AlexNet uses the ReLU function as its activation. Figure 1 of the paper shows that a 4-layer convolutional network with ReLU reaches a 25% training error rate on the CIFAR-10 dataset about 6 times faster than the same network with tanh under the same conditions.
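A tiny NumPy sketch of the saturation argument (illustrative only): the tanh gradient collapses toward zero for large activations, while the ReLU gradient stays at 1 wherever the unit is active.

```python
import numpy as np

x = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])

relu = np.maximum(0.0, x)          # non-saturating: grows without bound for x > 0
relu_grad = (x > 0).astype(float)  # gradient is exactly 1 wherever the unit is active

tanh = np.tanh(x)
tanh_grad = 1.0 - tanh ** 2        # saturating: gradient ~ 0 for large |x|, slowing learning

print(relu_grad)   # [0. 0. 0. 1. 1.]
print(tanh_grad)   # roughly [0.000 0.071 1.000 0.071 0.000]
```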

AlexNet is trained in parallel on two GTX 580 GPUs with 3 GB of memory each, with half of the kernels (or neurons) placed on each GPU, and the GPUs communicating only at certain layers.

The ReLU function does not have a bounded output range the way tanh and sigmoid do, so its output is normalized with local response normalization (LRN). The idea of LRN comes from a concept in neurobiology called "lateral inhibition", meaning that activated neurons inhibit their neighbors. The calculation formula is

b^i_{x,y} = a^i_{x,y} / ( k + α · Σ_{j = max(0, i−n/2)}^{min(N−1, i+n/2)} (a^j_{x,y})² )^β

where:

a^i_{x,y} denotes the activation of the neuron at position (x, y) computed by the ith convolutional kernel and then passed through the ReLU

b^i_{x,y} denotes the value after normalization

n denotes the number of adjacent kernel maps the sum runs over; this hyperparameter is usually set to 5

N denotes the total number of convolutional kernels

k = 2, α = 10⁻⁴, and β = 0.75 are the remaining hyperparameters
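A direct NumPy transcription of the formula above for a single image; the shapes and the example activation tensor are illustrative.

```python
import numpy as np

def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize across adjacent kernel maps; `a` has shape (N_kernels, H, W), post-ReLU."""
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

acts = np.random.rand(96, 55, 55).astype(np.float32)  # e.g. conv1 output after ReLU
print(local_response_norm(acts).shape)                # (96, 55, 55)
```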

Overlapping pooling means that adjacent pooling windows overlap. More precisely, a pooling layer can be regarded as a grid of pooling units spaced s pixels apart, each summarizing a neighborhood of size z × z centered at the location of the pooling unit; i.e., the pooling size is z and the stride is s. When s < z the pooling is overlapping. AlexNet uses s = 2 and z = 3 throughout the network.
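A minimal sketch of overlapping max-pooling on a single 2-D feature map, written as explicit loops for clarity rather than efficiency.

```python
import numpy as np

def max_pool(x, z=3, s=2):
    """z x z max-pooling with stride s; windows overlap whenever s < z."""
    H, W = x.shape
    out_h, out_w = (H - z) // s + 1, (W - z) // s + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * s:i * s + z, j * s:j * s + z].max()
    return out

fmap = np.arange(36, dtype=np.float32).reshape(6, 6)
print(max_pool(fmap))  # 2x2 output; adjacent 3x3 windows share one row/column of inputs
```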

The output of the last fully-connected layer is fed to a 1000-way softmax layer to produce predictions over the 1000 labels. Response-normalization layers follow the 1st and 2nd convolutional layers, max-pooling layers follow the response-normalization layers and the 5th convolutional layer, and the ReLU activation function is applied to the output of every convolutional and fully-connected layer.
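Putting that layer ordering together, here is a single-tower PyTorch sketch of the architecture. It assumes a 227×227 input so the spatial sizes work out (the paper states 224×224), and it does not reproduce the two-GPU split of the kernels; it is an illustration of the layout, not the original implementation.

```python
import torch
import torch.nn as nn

lrn = dict(size=5, alpha=1e-4, beta=0.75, k=2.0)
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),    # conv1
    nn.LocalResponseNorm(**lrn), nn.MaxPool2d(3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),  # conv2
    nn.LocalResponseNorm(**lrn), nn.MaxPool2d(3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(), # conv3
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(), # conv4
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(), # conv5
    nn.MaxPool2d(3, stride=2),
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), # fc6
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),        # fc7
    nn.Linear(4096, 1000),                                    # fc8 -> 1000-way softmax
)

print(alexnet(torch.randn(1, 3, 227, 227)).shape)  # torch.Size([1, 1000])
```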

Early on, the most common way to reduce overfitting on image data was to artificially enlarge the dataset. AlexNet uses two forms of data augmentation:

First, mirror reflection and random cropping.

The images are first mirror-reflected horizontally, and then 224×224 patches are randomly cropped from both the original 256×256 images and their reflections. This enlarges the training set by a factor of 2048, although the resulting training samples are highly interdependent; without this scheme the network suffers from severe overfitting, which would have forced the use of a much smaller network. At test time, AlexNet extracts five 224×224 patches from each test image (the four corner patches and the center patch) together with their horizontal reflections (10 patches in total), and the prediction is the average of the softmax outputs over those 10 patches.
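A sketch of both the training-time and test-time cropping using torchvision transforms; using torchvision here is an assumption of this write-up, not the paper's original pipeline.

```python
import torch
from torchvision import transforms

# training: random 224x224 crop from the 256x256 image plus a random horizontal flip
# (enumerating all 32*32 translations times 2 reflections gives the factor of 2048)
train_tf = transforms.Compose([
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# testing: four corner crops + the center crop and their reflections, 10 patches in total
test_tf = transforms.Compose([
    transforms.TenCrop(224),  # returns a tuple of 10 PIL images
    transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])

# at test time the 10 patches are run through the network and their softmax outputs averaged:
# probs = model(patches).softmax(dim=1).mean(dim=0)   # patches: (10, 3, 224, 224)
```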

Second, varying the intensities of the RGB channels in the training images.

PCA (principal component analysis) is performed on the set of RGB pixel values across the entire ImageNet training set. To each training image, multiples of the principal components found are added, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian distribution with mean zero and standard deviation 0.1. That is, to each RGB pixel I_{x,y} = [I^R_{x,y}, I^G_{x,y}, I^B_{x,y}]^T the quantity [p_1, p_2, p_3] [α_1 λ_1, α_2 λ_2, α_3 λ_3]^T is added.

Here p_i and λ_i are the ith eigenvector and eigenvalue of the 3 × 3 covariance matrix of the RGB pixel values, and α_i is the random variable mentioned above. Each α_i is drawn only once for all the pixels of a particular training image and is not redrawn until that image is used for training again. This scheme approximately captures an important property of natural images: object identity is invariant to changes in the intensity and color of the illumination.
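A per-image NumPy sketch of this color augmentation. Note one simplification: the paper computes the PCA once over the whole training set, whereas this self-contained sketch computes it from the pixels of the single input image.

```python
import numpy as np

def pca_color_augment(image, std=0.1, rng=None):
    """image: (H, W, 3) RGB array with values in [0, 1]; returns a color-jittered copy."""
    rng = rng if rng is not None else np.random.default_rng()
    pixels = image.reshape(-1, 3)
    cov = np.cov(pixels, rowvar=False)    # 3x3 covariance of the RGB values
    lam, p = np.linalg.eigh(cov)          # eigenvalues lambda_i, eigenvectors p_i (columns)
    alpha = rng.normal(0.0, std, size=3)  # drawn once per image
    shift = p @ (alpha * lam)             # [p1 p2 p3] [a1*l1, a2*l2, a3*l3]^T
    return np.clip(image + shift, 0.0, 1.0)

img = np.random.rand(256, 256, 3)
print(pca_color_augment(img).shape)       # (256, 256, 3)
```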

Dropout sets the output of each hidden neuron to zero with probability 0.5; dropped neurons contribute neither to the forward pass nor to backpropagation. In AlexNet the dropout probability is set to 0.5 in the first two fully-connected layers, and at test time all neurons are used but their outputs are multiplied by 0.5.
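A minimal NumPy sketch of this train/test asymmetry; note that modern implementations usually use "inverted dropout", which rescales at training time instead.

```python
import numpy as np

def dropout_train(x, p=0.5, rng=None):
    """Training: zero each output independently with probability p (no rescaling here)."""
    rng = rng if rng is not None else np.random.default_rng()
    mask = rng.random(x.shape) >= p
    return x * mask

def dropout_test(x, p=0.5):
    """Testing: keep every neuron but multiply its output by (1 - p) = 0.5."""
    return x * (1.0 - p)

h = np.ones(8)
print(dropout_train(h))  # roughly half of the entries zeroed
print(dropout_test(h))   # all entries scaled to 0.5
```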

AlexNet is trained with stochastic gradient descent with a batch size of 128, momentum of 0.9, and weight decay of 0.0005. The weight decay here is not just a regularizer: it also reduces the model's training error. The weight update rule is

v_{i+1} = 0.9 · v_i − 0.0005 · ε · w_i − ε · ⟨∂L/∂w | w_i⟩_{D_i}
w_{i+1} = w_i + v_{i+1}

where i is the iteration index, v is the momentum variable, ε is the learning rate, and ⟨∂L/∂w | w_i⟩_{D_i} is the average over the ith batch D_i of the derivative of the objective with respect to w, evaluated at w_i.
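The update rule above, transcribed into a short NumPy function; the variable names are mine, and the gradient in the toy usage is fake.

```python
import numpy as np

def sgd_update(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One step of the rule above: v <- 0.9*v - 0.0005*lr*w - lr*grad ;  w <- w + v."""
    v = momentum * v - weight_decay * lr * w - lr * grad
    w = w + v
    return w, v

# toy usage: one parameter vector and a fake batch-averaged gradient
w = np.zeros(4)
v = np.zeros_like(w)
grad = np.array([0.1, -0.2, 0.3, 0.0])
w, v = sgd_update(w, v, grad, lr=0.01)
print(w, v)
```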

Additionally, in AlexNet the weights of every layer are initialized from a Gaussian distribution with zero mean and standard deviation 0.01, and the biases of convolutional layers 2, 4, and 5, as well as of the fully-connected hidden layers, are initialized to 1. This accelerates the early stages of learning by giving the ReLUs positive inputs. The biases of the remaining layers are initialized to 0.
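A PyTorch-style sketch of this initialization scheme, applicable to a model laid out like the earlier `alexnet` sketch; the helper name is mine, and treating the output layer's bias as one of the "remaining layers" set to 0 follows my reading of the paper.

```python
import torch.nn as nn

def init_alexnet_style(model):
    """All weights ~ N(0, 0.01); biases 1 for conv layers 2, 4, 5 and the hidden
    fully-connected layers; biases 0 everywhere else (including the output layer)."""
    convs = [m for m in model.modules() if isinstance(m, nn.Conv2d)]
    fcs = [m for m in model.modules() if isinstance(m, nn.Linear)]
    for i, m in enumerate(convs, start=1):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        nn.init.constant_(m.bias, 1.0 if i in (2, 4, 5) else 0.0)
    for m in fcs:
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        nn.init.constant_(m.bias, 1.0 if m is not fcs[-1] else 0.0)
```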