ImageNet Classification with Deep Convolutional Neural Networks: Paper Translation (Part 2)

ImageNet Classification with Deep Convolutional Neural Networks: Paper Translation (Part 1)

Code

AlexNet implementation (PyTorch): https://github.com/Lornatang/pytorch/blob/master/official/net/alexnet.py

4 Reducing Overfitting


Our neural network architecture has 60 million parameters. Although the 1000 classes of ILSVRC make each training example impose 10 bits of constraint on the mapping from image to label, this turns out to be insufficient to learn so many parameters without considerable overfitting. Below, we describe the two primary ways in which we combat overfitting.


4.1 Data Augmentation


The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations (e.g., [25, 4, 5]). We employ two distinct forms of data augmentation, both of which allow transformed images to be produced from the original images with very little computation, so the transformed images do not need to be stored on disk. In our implementation, the transformed images are generated in Python code on the CPU while the GPU is training on the previous batch of images. So these data augmentation schemes are, in effect, computationally free.


The first form of data augmentation consists of generating image translations and horizontal reflections. We do this by extracting random 224×224 patches (and their horizontal reflections) from the 256×256 images and training our network on these extracted patches⁴. This increases the size of our training set by a factor of 2048, though the resulting training examples are, of course, highly interdependent. Without this scheme, our network suffers from substantial overfitting, which would have forced us to use much smaller networks. At test time, the network makes a prediction by extracting five 224×224 patches (the four corner patches and the center patch) as well as their horizontal reflections (hence ten patches in all), and averaging the predictions made by the network’s softmax layer on the ten patches.

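For readers cross-checking the linked PyTorch code, the crop-and-flip scheme is easy to reproduce. Below is a minimal NumPy sketch, our own illustration rather than the authors' code; `random_crop_and_flip` and `ten_crop` are hypothetical helper names, and images are assumed to be HWC arrays:

```python
import numpy as np

def random_crop_and_flip(img, crop=224):
    """Training-time augmentation: a random 224x224 patch of a 256x256
    image, reflected horizontally with probability 0.5."""
    h, w, _ = img.shape
    y = np.random.randint(0, h - crop + 1)  # offset in [0, 32] for 256 -> 224
    x = np.random.randint(0, w - crop + 1)
    patch = img[y:y + crop, x:x + crop]
    if np.random.rand() < 0.5:
        patch = patch[:, ::-1]              # horizontal reflection
    return patch

def ten_crop(img, crop=224):
    """Test-time: the four corner patches, the center patch, and their
    horizontal reflections, ten patches in all."""
    h, w, _ = img.shape
    cy, cx = (h - crop) // 2, (w - crop) // 2
    corners = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop), (cy, cx)]
    patches = [img[y:y + crop, x:x + crop] for y, x in corners]
    patches += [p[:, ::-1] for p in patches]
    return np.stack(patches)
```

The quoted factor of 2048 comes from treating the crop as having 32 × 32 possible offsets, times 2 for reflection; the test-time prediction averages the softmax outputs over the ten patches returned by `ten_crop`.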

The second form of data augmentation consists of altering the intensities of the RGB channels in training images. Specifically, we perform PCA on the set of RGB pixel values throughout the ImageNet training set. To each training image, we add multiples of the found principal components, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1. Therefore to each RGB image pixel $I_{xy} = [I^R_{xy}, I^G_{xy}, I^B_{xy}]^T$ we add the following quantity:

$$[\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3][\alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3]^T$$

where $\mathbf{p}_i$ and $\lambda_i$ are the $i$th eigenvector and eigenvalue of the $3 \times 3$ covariance matrix of RGB pixel values, respectively, and $\alpha_i$ is the aforementioned random variable. Each $\alpha_i$ is drawn only once for all the pixels of a particular training image until that image is used for training again, at which point it is re-drawn. This scheme approximately captures an important property of natural images, namely, that object identity is invariant to changes in the intensity and color of the illumination. This scheme reduces the top-1 error rate by over 1%.
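
In code, the whole scheme is a 3×3 eigendecomposition plus one vector addition per image. The sketch below is our own NumPy illustration of the equation above (in the paper the PCA is computed once over the full training set; here it is recomputed per image only to keep the example self-contained):

```python
import numpy as np

def pca_color_augment(img, std=0.1):
    """Add [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T to every pixel of an HWC
    image, with a_i ~ N(0, 0.1) drawn once per presentation."""
    pixels = img.reshape(-1, 3).astype(np.float64)
    cov = np.cov(pixels, rowvar=False)          # 3x3 covariance of RGB values
    lam, p = np.linalg.eigh(cov)                # eigenvalues, eigenvectors (columns)
    alpha = np.random.normal(0.0, std, size=3)  # the random variables alpha_i
    delta = p @ (alpha * lam)                   # magnitudes proportional to lam_i
    return img + delta                          # same shift applied to all pixels
```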

4.2 Dropout


Combining the predictions of many different models is a very successful way to reduce test errors [1, 3], but it appears to be too expensive for big neural networks that already take several days to train. There is, however, a very efficient version of model combination that only costs about a factor of two during training. The recently-introduced technique, called “dropout” [10], consists of setting to zero the output of each hidden neuron with probability 0.5. The neurons which are “dropped out” in this way do not contribute to the forward pass and do not participate in backpropagation. So every time an input is presented, the neural network samples a different architecture, but all these architectures share weights. This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. At test time, we use all the neurons but multiply their outputs by 0.5, which is a reasonable approximation to taking the geometric mean of the predictive distributions produced by the exponentially-many dropout networks.


We use dropout in the first two fully-connected layers of Figure 2. Without dropout, our network exhibits substantial overfitting. Dropout roughly doubles the number of iterations required to converge.

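The paper's variant of dropout (zero at training time, scale by 0.5 at test time) takes only a few lines; this is a generic sketch, not the authors' implementation:

```python
import numpy as np

def dropout(x, p=0.5, train=True):
    """Dropout as described above: during training each activation is
    set to zero with probability p; at test time all neurons are used
    but their outputs are multiplied by (1 - p) = 0.5."""
    if train:
        keep = (np.random.rand(*x.shape) >= p)  # 0 drops a unit, 1 keeps it
        return x * keep
    return x * (1.0 - p)
```

Modern frameworks usually implement the equivalent "inverted" form, dividing by (1 - p) during training so the test-time forward pass needs no scaling; `torch.nn.Dropout` behaves this way.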


5 Details of learning


[Figure 3: 96 convolutional kernels of size 11×11×3 learned by the first convolutional layer on the 224×224×3 input images. The top 48 kernels were learned on GPU 1 while the bottom 48 kernels were learned on GPU 2. See Section 6.1 for details.]

We trained our models using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. We found that this small amount of weight decay was important for the model to learn. In other words, weight decay here is not merely a regularizer: it reduces the model’s training error. The update rule for weight $w$ was

$$v_{i+1} := 0.9 \cdot v_i - 0.0005 \cdot \epsilon \cdot w_i - \epsilon \cdot \left\langle \frac{\partial L}{\partial w} \Big|_{w_i} \right\rangle_{D_i}$$

$$w_{i+1} := w_i + v_{i+1}$$

where $i$ is the iteration index, $v$ is the momentum variable, $\epsilon$ is the learning rate, and $\left\langle \frac{\partial L}{\partial w} \big|_{w_i} \right\rangle_{D_i}$ is the average over the $i$th batch $D_i$ of the derivative of the objective with respect to $w$, evaluated at $w_i$.
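
Written out, one training step under this rule looks like the following sketch (plain Python with the paper's constants; `grad` stands for the batch-averaged derivative in the angle brackets above):

```python
def sgd_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One step of the update rule above:
         v <- 0.9 * v - 0.0005 * lr * w - lr * grad
         w <- w + v
    `grad` is the derivative of the objective w.r.t. w, averaged over
    the current batch and evaluated at the current w."""
    v = momentum * v - weight_decay * lr * w - lr * grad
    w = w + v
    return w, v
```

Note that the weight-decay term is multiplied by the learning rate here, which differs slightly from the `weight_decay` convention in some modern optimizers.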

We initialized the weights in each layer from a zero-mean Gaussian distribution with standard deviation 0.01. We initialized the neuron biases in the second, fourth, and fifth convolutional layers, as well as in the fully-connected hidden layers, with the constant 1. This initialization accelerates the early stages of learning by providing the ReLUs with positive inputs. We initialized the neuron biases in the remaining layers with the constant 0.

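As a concrete illustration, this initialization could be written in PyTorch as below; a sketch only, assuming `model` contains exactly the five convolutional and three fully-connected layers of Figure 2, in order:

```python
import torch.nn as nn

def init_alexnet_weights(model):
    """Weights ~ N(0, 0.01) everywhere. Biases: 1 in conv layers 2, 4, 5
    and in the fully-connected hidden layers; 0 in the remaining layers."""
    convs = [m for m in model.modules() if isinstance(m, nn.Conv2d)]
    fcs = [m for m in model.modules() if isinstance(m, nn.Linear)]
    for m in convs + fcs:
        nn.init.normal_(m.weight, mean=0.0, std=0.01)    # all weights
        nn.init.constant_(m.bias, 0.0)                   # default bias 0
    for m in [convs[1], convs[3], convs[4]] + fcs[:-1]:  # conv 2, 4, 5 + FC hidden
        nn.init.constant_(m.bias, 1.0)                   # positive inputs for ReLUs
```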

We used an equal learning rate for all layers, which we adjusted manually throughout training. The heuristic which we followed was to divide the learning rate by 10 when the validation error rate stopped improving with the current learning rate. The learning rate was initialized at 0.01 and reduced three times prior to termination. We trained the network for roughly 90 cycles through the training set of 1.2 million images, which took five to six days on two NVIDIA GTX 580 3GB GPUs.

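The divide-by-10-on-plateau heuristic maps naturally onto a plateau scheduler. A hedged PyTorch sketch, where `model`, `train_one_epoch`, and `validation_error` are hypothetical:

```python
import torch.optim as optim

# Hyperparameters from Section 5: lr 0.01, momentum 0.9, weight decay 0.0005.
optimizer = optim.SGD(model.parameters(), lr=0.01,
                      momentum=0.9, weight_decay=5e-4)
# Divide the learning rate by 10 when the validation error stops improving.
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1)

for epoch in range(90):                # roughly 90 cycles through the training set
    train_one_epoch(model, optimizer)  # hypothetical training loop
    scheduler.step(validation_error(model))  # hypothetical validation helper
```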

6 Results


Our results on ILSVRC-2010 are summarized in Table 1. Our network achieves top-1 and top-5 test set error rates of 37.5% and 17.0%⁵. The best performance achieved during the ILSVRC-2010 competition was 47.1% and 28.2% with an approach that averages the predictions produced from six sparse-coding models trained on different features [2], and since then the best published results are 45.7% and 25.7% with an approach that averages the predictions of two classifiers trained on Fisher Vectors (FVs) computed from two types of densely-sampled features [24].


[Table 1: Comparison of results on ILSVRC-2010 test set. In italics are best results achieved by others: sparse coding [2] 47.1% / 28.2%, SIFT + FVs [24] 45.7% / 25.7%, CNN 37.5% / 17.0% (top-1 / top-5 error).]

We also entered our model in the ILSVRC-2012 competition and report our results in Table 2. Since the ILSVRC-2012 test set labels are not publicly available, we cannot report test error rates for all the models that we tried. In the remainder of this paragraph, we use validation and test error rates interchangeably because in our experience they do not differ by more than 0.1% (see Table 2). The CNN described in this paper achieves a top-5 error rate of 18.2%. Averaging the predictions of five similar CNNs gives an error rate of 16.4%. Training one CNN, with an extra sixth convolutional layer over the last pooling layer, to classify the entire ImageNet Fall 2011 release (15M images, 22K categories), and then “fine-tuning” it on ILSVRC-2012 gives an error rate of 16.6%. Averaging the predictions of two CNNs that were pre-trained on the entire Fall 2011 release with the aforementioned five CNNs gives an error rate of 15.3%. The second-best contest entry achieved an error rate of 26.2% with an approach that averages the predictions of several classifiers trained on FVs computed from different types of densely-sampled features [7].


Finally, we also report our error rates on the Fall 2009 version of ImageNet with 10,184 categories and 8.9 million images. On this dataset we follow the convention in the literature of using half of the images for training and half for testing. Since there is no established test set, our split necessarily differs from the splits used by previous authors, but this does not affect the results appreciably.

[Table 2: Comparison of error rates on ILSVRC-2012 validation and test sets. In italics are best results achieved by others. Models with an asterisk* were “pre-trained” to classify the entire ImageNet 2011 Fall release. See Section 6 for details.]

Our top-1 and top-5 error rates on this dataset are 67.4% and 40.9%, attained by the net described above but with an additional, sixth convolutional layer over the last pooling layer. The best published results on this dataset are 78.1% and 60.9% [19].


6.1 Qualitative Evaluations


Figure 3 shows the convolutional kernels learned by the network’s two data-connected layers. The network has learned a variety of frequency- and orientation-selective kernels, as well as various colored blobs. Notice the specialization exhibited by the two GPUs, a result of the restricted connectivity described in Section 3.5. The kernels on GPU 1 are largely color-agnostic, while the kernels on GPU 2 are largely color-specific. This kind of specialization occurs during every run and is independent of any particular random weight initialization (modulo a renumbering of the GPUs).


⁵ The error rates without averaging predictions over ten patches as described in Section 4.1 are 39.0% and 18.3%.


Figure 4: (Left) Eight ILSVRC-2010 test images and the five labels considered most probable by our model. The correct label is written under each image, and the probability assigned to the correct label is also shown with a red bar (if it happens to be in the top 5). (Right) Five ILSVRC-2010 test images in the first column. The remaining columns show the six training images that produce feature vectors in the last hidden layer with the smallest Euclidean distance from the feature vector for the test image.

In the left panel of Figure 4 we qualitatively assess what the network has learned by computing its top-5 predictions on eight test images. Notice that even off-center objects, such as the mite in the top-left, can be recognized by the net. Most of the top-5 labels appear reasonable. For example, only other types of cat are considered plausible labels for the leopard. In some cases (grille, cherry) there is genuine ambiguity about the intended focus of the photograph.


Another way to probe the network’s visual knowledge is to consider the feature activations induced by an image at the last, 4096-dimensional hidden layer. If two images produce feature activation vectors with a small Euclidean separation, we can say that the higher levels of the neural network consider them to be similar. Figure 4 shows five images from the test set and the six images from the training set that are most similar to each of them according to this measure. Notice that at the pixel level, the retrieved training images are generally not close in L2 to the query images in the first column. For example, the retrieved dogs and elephants appear in a variety of poses. We present the results for many more test images in the supplementary material.

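Measured this way, retrieval is a one-liner over stored activations. A sketch with names of our own choosing, where `train_feats` is an N×4096 array of last-hidden-layer activations for the training set:

```python
import numpy as np

def nearest_training_images(query_feat, train_feats, k=6):
    """Indices of the k training images whose 4096-d feature vectors
    have the smallest Euclidean distance to the query's features."""
    dists = np.linalg.norm(train_feats - query_feat, axis=1)
    return np.argsort(dists)[:k]
```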

Computing similarity by using Euclidean distance between two 4096-dimensional, real-valued vectors is inefficient, but it could be made efficient by training an auto-encoder to compress these vectors to short binary codes. This should produce a much better image retrieval method than applying autoencoders to the raw pixels [14], which does not make use of image labels and hence has a tendency to retrieve images with similar patterns of edges, whether or not they are semantically similar.


7 Discussion


Our results show that a large, deep convolutional neural network is capable of achieving recordbreaking results on a highly challenging dataset using purely supervised learning. It is notable that our network’s performance degrades if a single convolutional layer is removed. For example, removing any of the middle layers results in a loss of about 2% for the top-1 performance of the network. So the depth really is important for achieving our results.


To simplify our experiments, we did not use any unsupervised pre-training even though we expect that it will help, especially if we obtain enough computational power to significantly increase the size of the network without obtaining a corresponding increase in the amount of labeled data. Thus far, our results have improved as we have made our network larger and trained it longer but we still have many orders of magnitude to go in order to match the infero-temporal pathway of the human visual system. Ultimately we would like to use very large and deep convolutional nets on video sequences where the temporal structure provides very helpful information that is missing or far less obvious in static images.


References


[1] R.M. Bell and Y. Koren. Lessons from the Netflix prize challenge. ACM SIGKDD Explorations Newsletter, 9(2):75–79, 2007.

[2] A. Berg, J. Deng, and L. Fei-Fei. Large scale visual recognition challenge 2010. www.imagenet.org/challenges. 2010.

[3] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[4] D. Cireşan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. arXiv preprint arXiv:1202.2745, 2012.

[5] D.C. Cireşan, U. Meier, J. Masci, L.M. Gambardella, and J. Schmidhuber. High-performance neural networks for visual object classification. arXiv preprint arXiv:1102.0183, 2011.

[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR09, 2009.

[7] J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla, and L. Fei-Fei. ILSVRC-2012, 2012. URL http://www.image-net.org/challenges/LSVRC/2012/.

[8] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1):59–70, 2007.

[9] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology, 2007. URL http://authors.library.caltech.edu/7694.

[10] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R.R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[11] K. Jarrett, K. Kavukcuoglu, M.A. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In International Conference on Computer Vision, pages 2146–2153. IEEE, 2009.

[12] A. Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.

[13] A. Krizhevsky. Convolutional deep belief networks on CIFAR-10. Unpublished manuscript, 2010.

[14] A. Krizhevsky and G.E. Hinton. Using very deep autoencoders for content-based image retrieval. In ESANN, 2011.

[15] Y. Le Cun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel, et al. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, 1990.

[16] Y. LeCun, F.J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–97. IEEE, 2004.

[17] Y. LeCun, K. Kavukcuoglu, and C. Farabet. Convolutional networks and applications in vision. In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, pages 253–256. IEEE, 2010.

[18] H. Lee, R. Grosse, R. Ranganath, and A.Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 609–616. ACM, 2009.

[19] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV - European Conference on Computer Vision, Florence, Italy, October 2012.

[20] V. Nair and G.E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proc. 27th International Conference on Machine Learning, 2010.

[21] N. Pinto, D.D. Cox, and J.J. DiCarlo. Why is real-world visual object recognition hard? PLoS Computational Biology, 4(1):e27, 2008.

[22] N. Pinto, D. Doukhan, J.J. DiCarlo, and D.D. Cox. A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS Computational Biology, 5(11):e1000579, 2009.

[23] B.C. Russell, A. Torralba, K.P. Murphy, and W.T. Freeman. LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, 77(1):157–173, 2008.

[24] J. Sánchez and F. Perronnin. High-dimensional signature compression for large-scale image classification. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1665–1672. IEEE, 2011.

[25] P.Y. Simard, D. Steinkraus, and J.C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the Seventh International Conference on Document Analysis and Recognition, volume 2, pages 958–962, 2003.

[26] S.C. Turaga, J.F. Murray, V. Jain, F. Roth, M. Helmstaedter, K. Briggman, W. Denk, and H.S. Seung. Convolutional networks can learn to generate affinity graphs for image segmentation. Neural Computation, 22(2):511–538, 2010.

Article sourced from http://tongtianta.site/paper/1954
Editor: Lornatang
Proofreader: Lornatang
