Convolutional neural networks (CNNs) are an effective recognition method developed in recent years that has attracted wide attention. In the 1960s, while studying neurons responsible for local sensitivity and direction selection in the cat's visual cortex, Hubel and Wiesel found that their unique network structure could effectively reduce the complexity of feedback neural networks, and on this basis the convolutional neural network (CNN) was proposed. Today, CNNs have become a research hotspot in many scientific fields, especially pattern classification. Because the network avoids complex image preprocessing and can take the original image directly as input, it has been widely used.
The neocognitron proposed by K. Fukushima in 1980 was the first network to implement the convolutional idea. Subsequently, more researchers improved the network. Among them, a representative research achievement is the "improved cognitron" proposed by Alexander and Taylor, which combines the advantages of various improvements and avoids time-consuming error backpropagation.
Generally speaking, the basic structure of a CNN consists of two kinds of layers. The first is the feature extraction layer: the input of each neuron is connected to a local receptive field of the previous layer to extract local features. Once a local feature has been extracted, its positional relationship to other features is fixed as well. The second is the feature mapping layer: each computing layer of the network consists of multiple feature maps, each feature map is a plane, and all neurons on the plane share equal weights. The feature mapping structure uses the sigmoid function, whose influence-function kernel is small, as the activation function of the convolutional network, making the feature maps shift-invariant. In addition, since neurons on a mapping plane share weights, the number of free parameters of the network is reduced. Each convolution layer in a convolutional neural network is followed by a computing layer that performs local averaging and secondary feature extraction; this characteristic two-stage feature extraction structure reduces the feature resolution.
CNNs are mainly used to recognize two-dimensional patterns that are invariant to displacement, scaling, and other forms of distortion. Because the feature detection layers of a CNN learn from training data, explicit feature extraction is avoided: the features are learned implicitly from the training data. Furthermore, because the neurons on a feature mapping plane share the same weights, the network can learn in parallel, which is a major advantage of convolutional networks over networks whose neurons are fully connected to one another. With their special structure of locally shared weights, convolutional neural networks have unique advantages in speech recognition and image processing. Their layout is closer to an actual biological neural network, and weight sharing reduces the complexity of the network. In particular, images, as multi-dimensional input vectors, can be fed directly into the network, which avoids the complexity of data reconstruction during feature extraction and classification.
1 Neural Network
Firstly, let us introduce neural networks; for details, please refer to reference 1. Here is a brief introduction. Each unit of a neural network looks as follows:
The corresponding formula is as follows:
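The formula itself did not survive in this copy. Since the unit is identified just below as a logistic regression model, the intended form is presumably the standard single unit:

$$h_{W,b}(x) = f\left(W^{\top} x + b\right) = f\left(\textstyle\sum_{i} W_i x_i + b\right), \qquad f(z) = \frac{1}{1 + e^{-z}}$$

where $x$ is the input vector, $W$ the weights, $b$ the bias, and $f$ the sigmoid activation.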
This unit is also known as a logistic regression model. When multiple units are combined into a hierarchical structure, a neural network model is formed. The following figure shows a neural network with one hidden layer.
The corresponding formula is as follows:
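This formula is also missing; for the one-hidden-layer network just described, the standard forward pass (notation assumed, continuing the single-unit formula above) would be:

$$a^{(2)} = f\left(W^{(1)} x + b^{(1)}\right), \qquad h_{W,b}(x) = f\left(W^{(2)} a^{(2)} + b^{(2)}\right)$$

where $a^{(2)}$ is the vector of hidden-layer activations and $f$ is applied element-wise.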
Similarly, it can be extended to 2, 3, 4, 5, … hidden layers.
The training method for neural networks is similar to that of logistic regression, but because of the multiple layers, the chain rule is needed to propagate derivatives through the hidden-layer nodes, i.e. gradient descent plus the chain rule; the technical name for this is backpropagation. This article does not cover the training algorithm for the time being.
2 Convolutional Neural Networks
In image processing, an image is often represented as a vector of pixels. A 1000×1000 image, for example, becomes a 1,000,000-dimensional vector. In the neural network of the previous section, if the number of hidden units equals the number of input units, i.e. 1,000,000, then the weight matrix from the input layer to the hidden layer has 1,000,000 × 1,000,000 = 10^12 parameters, which is far too many. Therefore, to make the neural network approach practical for image processing, we must first reduce the number of parameters and speed things up. It is like the legendary Evil-Warding Sword Manual of martial-arts novels: an ordinary person gets nowhere with it, but once the inner strength grows and the swordplay speeds up, it becomes formidable.
2.1 Local Receptive Fields
Convolutional neural networks have two tricks for reducing the number of parameters. The first is the local receptive field. It is generally believed that human perception of the outside world proceeds from local to global, and the spatial structure of images is similar: nearby pixels are strongly correlated, while distant pixels are only weakly correlated. Therefore, each neuron does not need to perceive the whole image; it only needs to perceive a local region, and the local information is then combined at higher levels to obtain global information. This idea of partial connectivity is also inspired by the structure of the biological visual system: neurons in the visual cortex receive information locally (that is, they respond only to stimuli in certain regions). In the following figure, the left side shows full connectivity and the right side shows local connectivity.
In the figure on the right, if each neuron is connected to only a 10×10 patch of pixel values, then there are 1,000,000 × 100 = 10^8 weight parameters, one ten-thousandth of the original number. And the 10×10 weights corresponding to each 10×10 pixel patch amount exactly to a convolution operation.
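As a worked comparison of the two parameter counts:

$$\underbrace{10^6 \times 10^6}_{\text{fully connected}} = 10^{12} \qquad \text{vs.} \qquad \underbrace{10^6 \times (10 \times 10)}_{\text{locally connected}} = 10^{8} = \frac{10^{12}}{10^{4}}$$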
2.2 Parameter Sharing
But in fact there are still too many parameters, so we bring in the second trick: weight sharing. In the local connectivity above, each neuron corresponds to 100 parameters, and there are 1,000,000 neurons in total. If all 1,000,000 neurons share the same 100 parameters, then the number of parameters drops to 100.
How should we understand weight sharing? We can regard these 100 parameters (that is, the convolution operation) as a way of extracting features that is independent of position. The underlying principle is that the statistics of one part of an image are the same as those of any other part. This means that features learned on one part can also be used on another, so the same learned features can be applied at every position in the image.
More intuitively, suppose a small patch, say 8×8, is randomly sampled from a large image and some features are learned from this small sample. The features learned from the 8×8 sample can then be used as detectors and applied anywhere in the image. In particular, the features learned from the 8×8 sample can be convolved with the original large image, yielding an activation value for each feature at every position of the large image.
The following figure shows a 5×5 image being convolved with a 3×3 convolution kernel. Each convolution is a way of extracting features, like a sieve that filters out the parts of the image that meet its condition (the larger the activation value, the better the match).
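To make the operation concrete, here is a minimal NumPy sketch of the "valid" convolution just described (strictly speaking cross-correlation, since CNNs conventionally skip the kernel flip); the 5×5 image and the averaging kernel are made-up toy values, not from the original figure:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' convolution: slide the kernel over every position where it
    fully overlaps the image and take the elementwise product-sum."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 "image"
kernel = np.ones((3, 3)) / 9.0                    # toy 3x3 averaging kernel
print(conv2d_valid(image, kernel).shape)          # (3, 3), i.e. (5-3+1, 5-3+1)
```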
2.3 Multiple Convolution Kernels
The case above with only 100 parameters corresponds to a single 10×10 convolution kernel, which is obviously not enough for feature extraction. We can add more convolution kernels, say 32 kernels, to learn 32 features. The case of multiple convolution kernels is shown in the following figure:
On the right, different colors indicate different convolution kernels. Each convolution kernel turns the image into another image. For example, two convolution kernels generate two images, which can be viewed as two channels of one image, as shown in the figure below. Note there is a small mistake in that figure: w1 should read w0 and w2 should read w1. Below they are still referred to as w1 and w2.
The following figure shows a convolution operation on four channels with two convolution kernels, generating two output channels. Note that each of the four channels corresponds to its own kernel slice. Ignore w2 first and look only at w1: the value at some position (i, j) of w1's output is obtained by adding the convolution results at (i, j) on the four channels and then applying the activation function.
Therefore, in the process of convolving four channels into two channels in the figure above, the number of parameters is 4×2×2×2, where 4 is the number of input channels, the first 2 is the number of output channels generated, and the final 2×2 is the convolution kernel size.
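A short NumPy sketch of this multi-channel convolution, with made-up random inputs; note that the per-channel results at each (i, j) are summed exactly as described above (an activation function would then be applied to the sum):

```python
import numpy as np

def multi_channel_conv(x, w):
    """x: (C_in, H, W) input; w: (C_out, C_in, kh, kw) kernels.
    For each output channel, the per-input-channel convolution results
    at (i, j) are added together, as described in the text."""
    c_out, c_in, kh, kw = w.shape
    oh, ow = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((c_out, oh, ow))
    for o in range(c_out):          # e.g. the two kernels w1, w2
        for c in range(c_in):       # the four input channels
            for i in range(oh):
                for j in range(ow):
                    out[o, i, j] += np.sum(x[c, i:i + kh, j:j + kw] * w[o, c])
    return out  # an activation function would then be applied elementwise

x = np.random.rand(4, 5, 5)             # 4 input channels (toy values)
w = np.random.rand(2, 4, 2, 2)          # 2 output channels, 2x2 kernels
print(w.size)                           # 4*2*2*2 = 32 parameters
print(multi_channel_conv(x, w).shape)   # (2, 4, 4)
```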
2.4 Pooling (Down-Sampling)
After the features have been obtained through convolution, the next step is to use them for classification. In theory, one could train a classifier, such as softmax, on all the extracted features, but this is computationally challenging. For example, for a 96×96 pixel image, suppose we have learned 400 features, each defined over an 8×8 input. Convolving each feature with the image yields a (96 − 8 + 1) × (96 − 8 + 1) = 7921-dimensional convolved feature, and since there are 400 features, each example yields a convolved feature vector of 7921 × 400 = 3,168,400 dimensions. Learning a classifier with more than 3 million input features is very inconvenient and prone to over-fitting.
To solve this problem, recall that we chose convolutional features because images have a "stationarity" property: a feature that is useful in one image region is likely to be useful in another. Therefore, to describe a large image, a natural idea is to aggregate statistics of the features at different locations. For example, one can compute the average (or maximum) of a particular feature over a region of the image. These aggregated statistics not only have a much lower dimension (compared with using all extracted features) but also tend to improve results (less over-fitting). This aggregation operation is called pooling, sometimes average pooling or max pooling depending on how the pooled value is computed.
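A minimal NumPy sketch of non-overlapping pooling as just described; the 6×6 feature map is a made-up toy input:

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling over size x size blocks (max or average)."""
    h, w = x.shape[0] // size, x.shape[1] // size
    blocks = x[:h * size, :w * size].reshape(h, size, w, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

fmap = np.random.rand(6, 6)           # toy feature map
print(pool2d(fmap).shape)             # (3, 3) after 2x2 max pooling
print(pool2d(fmap, mode="mean").shape)  # (3, 3) after 2x2 average pooling
```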
At this point, the basic structure and principles of convolutional neural networks have been explained.
2.5 Multilayer Convolution
In practice, multiple convolution layers are often stacked, followed by fully connected layers for training. The motivation for multi-layer convolution is that the features learned by a single convolution layer tend to be local; the higher the layer, the more global the learned features become.
3 ImageNet-2010 Network Structure
ImageNet LSVRC is an image classification competition whose training set contains 1.27M+ images, with 50,000 validation images and 150,000 test images. This article uses Alex Krizhevsky's 2010 CNN structure for explanation. The structure won the championship in 2010 with a top-5 error rate of 15.3%. It is worth mentioning that in the 2014 ImageNet LSVRC competition, the champion GoogLeNet reached a top-5 error rate of 6.67%, so there is clearly still huge room for improvement in deep learning.
The following picture shows Alex's CNN structure. Note that the model is trained with a 2-GPU parallel structure: convolution layers 1, 2, 4, and 5 split the model parameters into two halves, one per GPU. Going further, parallel training can be divided into data parallelism and model parallelism. In data parallelism, the model structure is the same on every GPU but the training data is partitioned; the models are trained separately and then merged. In model parallelism, the parameters of several layers are partitioned across GPUs, the same data is used for training on each GPU, and the outputs are concatenated to serve as the input of the next layer.
The basic parameters of the above model are:
Input: 224×224 image, 3 channels.
The first layer of convolution: 96 5×5 convolution kernels, 48 on each GPU.
The first layer of Max-Pooling: 2×2 kernel.
The second layer of convolution: 256 3×3 convolution kernels, 128 on each GPU.
The second layer of Max-Pooling: 2×2 kernel.
The third layer of convolution: fully connected to the previous layer, 384 3×3 convolution kernels, split across the two GPUs as 192 each.
The fourth layer of convolution: 384 3×3 convolution kernels, 192 on each GPU. This layer connects to the previous one without an intervening pooling layer.
The fifth layer of convolution: 256 3×3 convolution kernels, 128 on each GPU.
The fifth layer of Max-Pooling: 2×2 kernel.
The first fully connected layer: 4096 dimensions; the output of the fifth layer's max-pooling is flattened into a one-dimensional vector and used as the input of this layer.
The second fully connected layer: 4096 dimensions.
Softmax layer: the output has 1000 dimensions, each giving the probability that the image belongs to that category.
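For illustration, here is a single-GPU PyTorch sketch mirroring the parameter list above. The text gives no strides, padding, or activation function, so the defaults below (stride 1, no padding, ReLU) are assumptions, and the 2-GPU split is ignored:

```python
import torch
import torch.nn as nn

# Single-GPU sketch of the layer list above; strides/padding/activations
# are assumptions, and the 2-GPU split is merged (e.g. 96 kernels, not 2 x 48).
model = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=5), nn.ReLU(),     # conv 1: 96 5x5 kernels
    nn.MaxPool2d(2),                                # max-pooling 1: 2x2
    nn.Conv2d(96, 256, kernel_size=3), nn.ReLU(),   # conv 2: 256 3x3 kernels
    nn.MaxPool2d(2),                                # max-pooling 2: 2x2
    nn.Conv2d(256, 384, kernel_size=3), nn.ReLU(),  # conv 3: 384 3x3 kernels
    nn.Conv2d(384, 384, kernel_size=3), nn.ReLU(),  # conv 4: no pooling before it
    nn.Conv2d(384, 256, kernel_size=3), nn.ReLU(),  # conv 5: 256 3x3 kernels
    nn.MaxPool2d(2),                                # max-pooling 5: 2x2
    nn.Flatten(),                                   # flatten to a 1-D vector
    nn.LazyLinear(4096), nn.ReLU(),                 # fully connected 1: 4096-d
    nn.Linear(4096, 4096), nn.ReLU(),               # fully connected 2: 4096-d
    nn.Linear(4096, 1000),                          # softmax layer (1000 logits)
)

x = torch.randn(1, 3, 224, 224)  # 224x224 RGB input, as in the text
print(model(x).shape)            # torch.Size([1, 1000])
```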
4 DeepID Network Structure
The DeepID network structure is a convolutional neural network developed by Sun Yi of the Chinese University of Hong Kong to learn face features. Each input face is represented as a 160-dimensional vector. The learned vectors are classified with other models, achieving 97.45% accuracy in face verification experiments. The original authors later improved the CNN further and reached 99.15% accuracy.
As shown in the figure below, the structure shares most of its specific parameters with the ImageNet model above, so only the differing parts are explained here.
The structure above ends with a single fully connected layer, followed by the softmax layer; this fully connected layer serves as the representation of the image. The fully connected layer takes both the fourth convolution layer's output and the third max-pooling layer's output as its inputs, so that it can learn both local and global features.
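A toy PyTorch sketch of this multi-scale idea; every layer size below is an illustrative assumption rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DeepIDSketch(nn.Module):
    """Toy sketch of the idea above: the final hidden layer takes BOTH the
    4th conv layer's output and the 3rd max-pooling's output, so the 160-d
    face representation mixes local and global features. All layer sizes
    here are illustrative assumptions, not the paper's exact values."""
    def __init__(self):
        super().__init__()
        self.convs123 = nn.Sequential(
            nn.Conv2d(3, 20, 4), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 40, 3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(40, 60, 3), nn.ReLU(), nn.MaxPool2d(2),  # 3rd pooling
        )
        self.conv4 = nn.Sequential(nn.Conv2d(60, 80, 2), nn.ReLU())
        self.fc = nn.LazyLinear(160)  # the 160-d representation

    def forward(self, x):
        pool3 = self.convs123(x)
        conv4 = self.conv4(pool3)
        # concatenate the two flattened feature scales, as the text describes
        both = torch.cat([pool3.flatten(1), conv4.flatten(1)], dim=1)
        return self.fc(both)

feat = DeepIDSketch()(torch.randn(1, 3, 39, 31))  # toy input size
print(feat.shape)                                 # torch.Size([1, 160])
```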