Bengio Representation Learning (2012)  

 Y. Bengio et al.  Representation Learning: A Review and New Perspectives. Technical report, arXiv:1206.5538v2, 1-30, 2012

2016.8.20  Updated 2016.8.29

 This review paper by Bengio is cited often, so many people probably want to read it; a translation is posted here.




Note that English and Japanese word order differ: where a literal gloss is written as S|A|B| wa V|C|D|, the natural Japanese reading is B A S wa D C V.









6. Probabilistic Models

7. Directly Learning A Parametric Map from Input to Representation

8. Representation Learning as Manifold Learning

9. Connections between Probabilistic and Direct Encoding models

10. Global Training of Deep Models

11. Building-In Invariance

12. Conclusion


Abstract

The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. 

Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. 

This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. 

This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.


●The performance of machine learning methods is heavily dependent on the choice of data representation (or features) on which they are applied. 

For that reason, much of the actual effort in deploying machine learning algorithms goes into the design of preprocessing pipelines and data transformations that result in a representation of the data that can support effective machine learning. 

Such feature engineering is important but labor-intensive and highlights the weakness of current learning algorithms: their inability to extract and organize the discriminative information from the data. 

Feature engineering is a way to take advantage of human ingenuity and prior knowledge to compensate for that weakness. 

In order to expand the scope and ease of applicability of machine learning, it would be highly desirable to make learning algorithms less dependent on feature engineering, so that novel applications could be constructed faster, and more importantly, to make progress towards Artificial Intelligence (AI).

An AI must fundamentally understand the world around us, and we argue that this can only be achieved if it can learn to identify and disentangle the underlying explanatory factors hidden in the observed milieu of low-level sensory data. 

●This paper is about representation learning, i.e., learning representations of the data that make it easier to extract useful information when building classifiers or other predictors. 

In the case of probabilistic models, a good representation is often one that captures the posterior distribution of the underlying explanatory factors for the observed input. 

A good representation is also one that is useful as input to a supervised predictor. 

Among the various ways of learning representations, this paper focuses on deep learning methods: those that are formed by the composition of multiple non-linear transformations, with the goal of yielding more abstract - and ultimately more useful - representations. 

Here we survey this rapidly developing area with special emphasis on recent progress. 

We consider some of the fundamental questions that have been driving research in this area. 

Specifically, what makes one representation better than another? 

Given an example, how should we compute its representation, i.e. perform feature extraction? 

Also, what are appropriate objectives for learning good representations? 


Representation learning has become a field in itself in the machine learning community, with regular workshops at the leading conferences such as NIPS and ICML, and a new conference dedicated to it, ICLR, sometimes under the header of Deep Learning or Feature Learning.

Although depth is an important part of the story, many other priors are interesting and can be conveniently captured when the problem is cast as one of learning a representation, as discussed in the next section. 

The rapid increase in scientific activity on representation learning has been accompanied and nourished by a remarkable string of empirical successes both in academia and in industry. 

Below, we briefly highlight some of these high points. 

Speech Recognition and Signal Processing

●Speech was one of the early applications of neural networks, in particular convolutional (or time-delay) neural networks.

The recent revival of interest in neural networks, deep learning, and representation learning has had a strong impact in the area of speech recognition, with breakthrough results (Dahl et al., 2010; Deng et al., 2010; Seide et al., 2011a; Mohamed et al., 2012; Dahl et al., 2012; Hinton et al., 2012) obtained by several academics as well as researchers at industrial labs bringing these algorithms to a larger scale and into products. 

For example, Microsoft released in 2012 a new version of their MAVIS (Microsoft Audio Video Indexing Service) speech system based on deep learning (Seide et al., 2011a).

These authors managed to reduce the word error rate on four major benchmarks by about 30% (e.g. from 27.4% to 18.5% on RT03S) compared to state-of-the-art models based on Gaussian mixtures for the acoustic modeling and trained on the same amount of data (309 hours of speech). 

The relative improvement in error rate obtained by Dahl et al. (2012) on a smaller large-vocabulary speech recognition benchmark (Bing mobile business search dataset, with 40 hours of speech) is between 16% and 23%.
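
The "about 30%" figure quoted above is a relative reduction, not a difference in percentage points. A minimal sketch of the arithmetic, using the RT03S word error rates given in the text; the function name `relative_error_reduction` is an illustrative choice, not something from the paper:

```python
def relative_error_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error that the improved system removes."""
    if baseline <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline - improved) / baseline


# Word error rates on RT03S quoted in the text (in percent).
rt03s = relative_error_reduction(27.4, 18.5)
print(f"RT03S relative reduction: {rt03s:.1%}")
```

The absolute gain here is 8.9 percentage points, which corresponds to roughly a third of the baseline error.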

●Representation-learning algorithms have also been applied to music, substantially beating the state-of-the-art in polyphonic transcription (Boulanger-Lewandowski et al., 2012), with relative error improvement between 5% and 30% on a standard benchmark of 4 datasets. 

Deep learning also helped to win MIREX (Music Information Retrieval) competitions, e.g. in 2011 on audio tagging (Hamel et al., 2011).

Object Recognition

●The beginnings of deep learning in 2006 focused on the MNIST digit image classification problem (Hinton et al., 2006; Bengio et al., 2007), breaking the supremacy of SVMs (1.4% error) on this dataset.

The latest records are still held by deep networks: Ciresan et al. (2012) currently claims the title of state-of-the-art for the unconstrained version of the task (e.g., using a convolutional architecture), with 0.27% error, and Rifai et al. (2011c) is state-of-the-art for the knowledge free version of MNIST, with 0.81% error. 

●In the last few years, deep learning has moved from digits to object recognition in natural images, and the latest breakthrough has been achieved on the ImageNet dataset, bringing down the state-of-the-art error rate from 26.1% to 15.3% (Krizhevsky et al., 2012).

Natural Language Processing

●Besides speech recognition, there are many other Natural Language Processing (NLP) applications of representation learning. 

Distributed representations for symbolic data were introduced by Hinton (1986), and first developed in the context of statistical language modeling by Bengio et al. (2003) in so-called neural net language models (Bengio, 2008).

They are all based on learning a distributed representation for each word, called a word embedding. 

Adding a convolutional architecture, Collobert et al. (2011) developed the SENNA system that shares representations across the tasks of language modeling, part-of-speech tagging, chunking, named entity recognition, semantic role labeling and syntactic parsing.

SENNA approaches or surpasses the state-of-the-art on these tasks but is simpler and much faster than traditional predictors. 

Learning word embeddings can be combined with learning image representations in a way that allows text and images to be associated.

This approach has been used successfully to build Google's image search, exploiting huge quantities of data to map images and queries in the same space (Weston et al., 2010) and it has recently been extended to deeper multi-modal representations (Srivastava and Salakhutdinov, 2012). 

●The neural net language model was also improved by adding recurrence to the hidden layers (Mikolov et al., 2011), allowing it to beat the state-of-the-art (smoothed n-gram models) not only in terms of perplexity (exponential of the average negative log-likelihood of predicting the right next word, going down from 140 to 102) but also in terms of word error rate in speech recognition (since the language model is an important component of a speech recognition system), decreasing it from 17.2% (KN5 baseline) or 16.9% (discriminative language model) to 14.4% on the Wall Street Journal benchmark task. 
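
Perplexity, as defined in the parenthesis above, is the exponential of the average negative log-likelihood of the correct next word. A minimal sketch of that definition follows; the function name and the toy probabilities are illustrative assumptions, not values from the cited experiments:

```python
import math
from typing import Sequence


def perplexity(probs_of_correct_word: Sequence[float]) -> float:
    """exp(average negative log-likelihood) over a sequence of predictions."""
    if not probs_of_correct_word:
        raise ValueError("need at least one prediction")
    avg_nll = -sum(math.log(p) for p in probs_of_correct_word) / len(probs_of_correct_word)
    return math.exp(avg_nll)


# A model that spreads its probability evenly over 100 candidate words
# has a perplexity of (approximately, up to rounding) 100.
uniform = perplexity([1 / 100] * 5)

# Assigning more probability to the correct words lowers the perplexity.
sharper = perplexity([0.5, 0.25, 0.5, 0.25])
print(uniform, sharper)
```

Read this way, a drop from 140 to 102 means the model is, on average, about as uncertain as a uniform choice among 102 words rather than 140.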

Similar models have been applied in statistical machine translation (Schwenk et al., 2012; Le et al., 2013), improving perplexity and BLEU scores.

Recursive auto-encoders (which generalize recurrent networks) have also been used to beat the state-of-the-art in full sentence paraphrase detection (Socher et al., 2011a) almost doubling the F1 score for paraphrase detection.

Representation learning can also be used to perform word sense disambiguation (Bordes et al., 2012), bringing up the accuracy from 67.8% to 70.2% on the subset of Senseval-3 where the system could be applied (with subject-verb-object sentences).

Finally, it has also been successfully used to surpass the state-of-the-art in sentiment analysis (Glorot et al., 2011b; Socher et al., 2011b).

Multi-Task and Transfer Learning, Domain Adaptation

●Transfer learning is the ability of a learning algorithm to exploit commonalities between different learning tasks in order to share statistical strength, and transfer knowledge across tasks. 

As discussed below, we hypothesize that representation learning algorithms have an advantage for such tasks because they learn representations that capture underlying factors, a subset of which may be relevant for each particular task, as illustrated in Figure 1. 

This hypothesis seems confirmed by a number of empirical results showing the strengths of representation learning algorithms in transfer learning scenarios.

●Most impressive are the two transfer learning challenges held in 2011 and won by representation learning algorithms. 

First, the Transfer Learning Challenge, presented at an ICML 2011 workshop of the same name, was won using unsupervised layer-wise pre-training (Bengio, 2011; Mesnil et al., 2011). 

A second Transfer Learning Challenge was held the same year and won by Goodfellow et al. (2011).

Results were presented at NIPS 2011’s Challenges in Learning Hierarchical Models Workshop. 

In the related domain adaptation setup, the target remains the same but the input distribution changes (Glorot et al., 2011b; Chen et al., 2012). 

In the multi-task learning setup, representation learning has also been found advantageous (Krizhevsky et al., 2012; Collobert et al., 2011), because of shared factors across tasks.


3.1 Priors for Representation Learning in AI

In Bengio and LeCun (2007), one of us introduced the notion of AI-tasks, which are challenging for current machine learning algorithms, and involve complex but highly structured dependencies.

One reason why explicitly dealing with representations is interesting is because they can be convenient to express many general priors about the world around us, i.e., priors that are not task-specific but would be likely to be useful for a learning machine to solve AI-tasks. 

Examples of such general-purpose priors are the following: 

Smoothness: assumes the function to be learned f is such that x ≈ y generally implies f(x) ≈ f(y).

This most basic prior is present in most machine learning, but is insufficient to get around the curse of dimensionality, see Section 3.2. 

Multiple explanatory factors: the data generating distribution is generated by different underlying factors, and for the most part what one learns about one factor generalizes in many configurations of the other factors.

The objective to recover or at least disentangle these underlying factors of variation is discussed in Section 3.5. 

This assumption is behind the idea of distributed representations, discussed in Section 3.3 below. 

A hierarchical organization of explanatory factors: the concepts that are useful for describing the world around us can be defined in terms of other concepts, in a hierarchy, with more abstract concepts higher in the hierarchy, defined in terms of less abstract ones. 

This assumption is exploited with deep representations, elaborated in Section 3.4 below. 

Semi-supervised learning: with inputs X and target Y to predict, a subset of the factors explaining X's distribution explains much of Y, given X.

Hence representations that are useful for P(X) tend to be useful when learning P(Y|X), allowing sharing of statistical strength between the unsupervised and supervised learning tasks, see Section 4. 

Shared factors across tasks: with many Y's of interest or many learning tasks in general, tasks (e.g., the corresponding P(Y|X,task)) are explained by factors that are shared with other tasks, allowing sharing of statistical strengths across tasks, as discussed in the previous section (Multi-Task and Transfer Learning, Domain Adaptation).

Manifolds: probability mass concentrates near regions that have a much smaller dimensionality than the original space where the data lives. 

This is explicitly exploited in some of the auto-encoder algorithms and other manifold-inspired algorithms described respectively in Sections 7.2 and 8. 

Natural clustering: different values of categorical variables such as object classes are associated with separate manifolds. 

More precisely, the local variations on the manifold tend to preserve the value of a category, and a linear interpolation between examples of different classes in general involves going through a low density region, i.e., P(X|Y = i) for different i tend to be well separated and not overlap much. 

For example, this is exploited in the Manifold Tangent Classifier discussed in Section 8.3. 

This hypothesis is consistent with the idea that humans have named categories and classes because of such statistical structure (discovered by their brain and propagated by their culture), and machine learning tasks often involve predicting such categorical variables.

Temporal and spatial coherence: consecutive (from a sequence) or spatially nearby observations tend to be associated with the same value of relevant categorical concepts, or result in a small move on the surface of the high-density manifold. 

More generally, different factors change at different temporal and spatial scales, and many categorical concepts of interest change slowly. 

When attempting to capture such categorical variables, this prior can be enforced by making the associated representations slowly changing, i.e., penalizing changes in values over time or space. This prior was introduced in Becker and Hinton (1992) and is discussed in Section 11.3. 
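
One way to read "penalizing changes in values over time" is as a regularizer added to the training objective. The sketch below assumes a mean squared difference between consecutive representations; that particular form and the name `slowness_penalty` are illustrative choices, not the specific criterion of Becker and Hinton (1992):

```python
from typing import Sequence


def slowness_penalty(features: Sequence[Sequence[float]]) -> float:
    """Mean squared change of the representation between consecutive time steps."""
    if len(features) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(features, features[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return total / (len(features) - 1)


# Representations of three consecutive observations (toy values).
slow = [[0.0, 1.0], [0.1, 1.0], [0.2, 1.0]]   # drifts gently
fast = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]   # flips at every step

print(slowness_penalty(slow), slowness_penalty(fast))
```

Adding such a term to a training loss favors features that stay nearly constant across neighboring frames or nearby image locations.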

Sparsity: for any given observation x, only a small fraction of the possible factors are relevant. 

In terms of representation, this could be represented by features that are often zero (as initially proposed by Olshausen and Field (1996)), or by the fact that most of the extracted features are insensitive to small variations of x. 

This can be achieved with certain forms of priors on latent variables (peaked at 0), or by using a nonlinearity whose value is often flat at 0 (i.e., 0 and with a 0 derivative), or simply by penalizing the magnitude of the Jacobian matrix (of derivatives) of the function mapping input to representation. 

This is discussed in Sections 6.1.1 and 7.2.
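
Two of the options mentioned for the sparsity prior can be written down directly: a penalty on the magnitude of the features, and a penalty on the Jacobian of the input-to-representation map. The sketch below assumes a single-layer sigmoid encoder h = sigmoid(Wx + b), for which the squared Frobenius norm of the Jacobian has a closed form; the encoder, the weights and the function names are assumptions made for illustration:

```python
import math
from typing import List, Sequence


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def encode(x: Sequence[float], W: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Hypothetical encoder: one sigmoid unit per row of W."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj) for row, bj in zip(W, b)]


def l1_penalty(h: Sequence[float]) -> float:
    """Sum of absolute feature values; pushes features towards zero."""
    return sum(abs(v) for v in h)


def jacobian_penalty(h: Sequence[float], W: Sequence[Sequence[float]]) -> float:
    """Squared Frobenius norm of dh/dx, using dh_j/dx_i = h_j (1 - h_j) W_ji."""
    return sum((hj * (1.0 - hj)) ** 2 * sum(w * w for w in row) for hj, row in zip(h, W))


W = [[1.0, -2.0], [0.5, 0.5]]   # toy weights
b = [0.0, 0.0]
h = encode([0.0, 0.0], W, b)    # both units output 0.5 at the origin

print(l1_penalty(h), jacobian_penalty(h, W))
```

The Jacobian penalty is small where units are saturated, which matches the remark that most extracted features should be insensitive to small variations of x.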

Simplicity of Factor Dependencies: in good high-level representations, the factors are related to each other through simple, typically linear dependencies.

This can be seen in many laws of physics, and is assumed when plugging a linear predictor on top of a learned representation. 

●We can view many of the above priors as ways to help the learner discover and disentangle some of the underlying (and a priori unknown) factors of variation that the data may reveal. 

This idea is pursued further in Sections 3.5 and 11.4. 

3.2 Smoothness and the Curse of Dimensionality

●For AI-tasks, such as vision and NLP, it seems hopeless to rely only on simple parametric models (such as linear models) because they cannot capture enough of the complexity of interest unless provided with the appropriate feature space.

Conversely, machine learning researchers have sought flexibility in local non-parametric learners such as kernel machines with a fixed generic local-response kernel (such as the Gaussian kernel). 

Unfortunately, as argued at length by Bengio and Monperrus (2005); Bengio et al. (2006a); Bengio and LeCun (2007); Bengio (2009); Bengio et al. (2010), most of these algorithms only exploit the principle of local generalization, i.e., the assumption that the target function (to be learned) is smooth enough, so they rely on examples to explicitly map out the wrinkles of the target function. 

Generalization is mostly achieved by a form of local interpolation between neighboring training examples. 

Although smoothness can be a useful assumption, it is insufficient to deal with the curse of dimensionality, because the number of such wrinkles (ups and downs of the target function) may grow exponentially with the number of relevant interacting factors, when the data are represented in raw input space.
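
A small sketch can make the reliance on examples concrete. It assumes a Gaussian-kernel weighted average (a Nadaraya-Watson style estimator) as the local learner and a one-dimensional sine wave as the target; both are illustrative stand-ins rather than anything evaluated in the paper:

```python
import math
from typing import Sequence


def gaussian_kernel(x: float, y: float, bandwidth: float) -> float:
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def local_predict(x: float, train_x: Sequence[float], train_y: Sequence[float],
                  bandwidth: float) -> float:
    """Predict by interpolating between nearby training examples."""
    weights = [gaussian_kernel(x, xi, bandwidth) for xi in train_x]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("query is too far from every training example")
    return sum(w * y for w, y in zip(weights, train_y)) / total


def target(x: float) -> float:
    """Toy target with one up and one down on [0, 1]."""
    return math.sin(2.0 * math.pi * x)


dense_x = [i / 20 for i in range(21)]   # examples cover every wrinkle
sparse_x = [0.0, 0.5, 1.0]              # examples miss the peak at x = 0.25

dense_pred = local_predict(0.25, dense_x, [target(x) for x in dense_x], bandwidth=0.05)
sparse_pred = local_predict(0.25, sparse_x, [target(x) for x in sparse_x], bandwidth=0.05)
print(dense_pred, sparse_pred, target(0.25))
```

With dense coverage the prediction at the peak is close to the true value of 1, while the sparse set yields a value near 0: the learner only knows about wrinkles that its training examples trace out.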

We advocate learning algorithms that are flexible and non-parametric but do not rely exclusively on the smoothness assumption. 

Instead, we propose to incorporate generic priors such as those enumerated above into representation-learning algorithms. 

Smoothness-based learners (such as kernel machines) and linear models can still be useful on top of such learned representations.

In fact, the combination of learning a representation and kernel machine is equivalent to learning the kernel, i.e., the feature space. 

Kernel machines are useful, but they depend on a prior definition of a suitable similarity metric, or a feature space in which naive similarity metrics suffice. 

We would like to use the data, along with very generic priors, to discover those features, or equivalently, a similarity function. 

3.3 Distributed representations

●Good representations are expressive, meaning that a reasonably-sized learned representation can capture a huge number of possible input configurations. 

A simple counting argument helps us to assess the expressiveness of a model producing a representation: how many parameters does it require compared to the number of input regions (or configurations) it can distinguish?

Learners of one-hot representations, such as traditional clustering algorithms, Gaussian mixtures, nearest-neighbor algorithms, decision trees, or Gaussian SVMs all require O(N) parameters (and/or O(N) examples) to distinguish O(N) input regions.

One could naively believe that one cannot do better. 

However, RBMs, sparse coding, auto-encoders or multi-layer neural networks can all represent up to O(2^k) input regions using only O(N) parameters (with k the number of non-zero elements in a sparse representation, and k = N in non-sparse RBMs and other dense representations).

These are all distributed or sparse representations. 
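
The counting argument can be checked on a toy case. The sketch below assumes k binary threshold features (one per input coordinate) as the distributed code and a nearest-centroid assignment with k centroids as the one-hot code; the inputs, thresholds and centroids are made up for illustration:

```python
from itertools import product
from typing import Sequence, Tuple


def distributed_code(x: Sequence[float], thresholds: Sequence[float]) -> Tuple[int, ...]:
    """One binary feature per unit: is coordinate i above its threshold?"""
    return tuple(int(xi > t) for xi, t in zip(x, thresholds))


def one_hot_code(x: Sequence[float], centroids: Sequence[Sequence[float]]) -> int:
    """Index of the nearest centroid, as in a k-means style assignment."""
    dists = [sum((a - c) ** 2 for a, c in zip(x, cen)) for cen in centroids]
    return dists.index(min(dists))


k = 4
inputs = list(product([-1.0, 1.0], repeat=k))   # 2**k input configurations
thresholds = [0.0] * k                          # k units in the distributed code
centroids = [tuple(1.0 if i == j else -1.0 for i in range(k)) for j in range(k)]

distributed_regions = len({distributed_code(x, thresholds) for x in inputs})
one_hot_regions = len({one_hot_code(x, centroids) for x in inputs})
print(distributed_regions, one_hot_regions)
```

With the same number of units, the distributed code separates all 2**k configurations, whereas the one-hot code can never distinguish more than k regions.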

The generalization of clustering to distributed representations is multi-clustering, where either several clusterings take place in parallel or the same clustering is applied on different parts of the input, such as in the very popular hierarchical feature extraction for object recognition based on a histogram of cluster categories detected in different patches of an image (Lazebnik et al., 2006; Coates and Ng, 2011a). 

The exponential gain from distributed or sparse representations is discussed further in section 3.2 (and Figure 3.2) of Bengio (2009). 

It comes about because each parameter (e.g. the parameters of one of the units in a sparse code, or one of the units in a Restricted Boltzmann Machine) can be re-used in many examples that are not simply near neighbors of each other, whereas with local generalization, different regions in input space are basically associated with their own private set of parameters, e.g., as in decision trees, nearest-neighbors, Gaussian SVMs, etc. 

In a distributed representation, an exponentially large number of possible subsets of features or hidden units can be activated in response to a given input. 

In a single-layer model, each feature is typically associated with a preferred input direction, corresponding to a hyperplane in input space, and the code or representation associated with that input is precisely the pattern of activation (which features respond to the input, and how much). 

This is in contrast with a non-distributed representation such as the one learned by most clustering algorithms, e.g., k-means, in which the representation of a given input vector is a one-hot code identifying which one of a small number of cluster centroids best represents the input.

3.4 Depth and abstraction

●Depth is a key aspect of the representation learning strategies we consider in this paper.

As we will discuss, deep architectures are often challenging to train effectively and this has been the subject of much recent research and progress. 

However, despite these challenges, they carry two significant advantages that motivate our long-term interest in discovering successful training strategies for deep architectures. 

These advantages are: (1) deep architectures promote the re-use of features, and (2) deep architectures can potentially lead to progressively more abstract features at higher layers of representations (more removed from the data).

Feature re-use.

●The notion of re-use, which explains the power of distributed representations, is also at the heart of the theoretical advantages behind deep learning, i.e., constructing multiple levels of representation or learning a hierarchy of features. 

The depth of a circuit is the length of the longest path from an input node of the circuit to an output node of the circuit. 

The crucial property of a deep circuit is that its number of paths, i.e., ways to re-use different parts, can grow exponentially with its depth. 
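
Both quantities in this definition, the longest input-to-output path and the number of distinct paths, can be computed on a small directed acyclic graph. The sketch below assumes a hypothetical layered circuit with fully connected layers; the helper names `build_layered_circuit` and `depth_and_paths` are illustrative:

```python
from functools import lru_cache
from typing import Dict, List, Tuple


def build_layered_circuit(num_layers: int, width: int) -> Dict[str, List[str]]:
    """Hypothetical circuit: "in" -> num_layers fully connected layers -> "out"."""
    if num_layers < 1 or width < 1:
        raise ValueError("need at least one layer and one node per layer")
    graph: Dict[str, List[str]] = {"in": [f"h1_{i}" for i in range(width)], "out": []}
    for layer in range(1, num_layers + 1):
        if layer < num_layers:
            successors = [f"h{layer + 1}_{i}" for i in range(width)]
        else:
            successors = ["out"]
        for i in range(width):
            graph[f"h{layer}_{i}"] = list(successors)
    return graph


def depth_and_paths(graph: Dict[str, List[str]], source: str, sink: str) -> Tuple[int, int]:
    """Return (length of the longest source-to-sink path, number of such paths)."""

    @lru_cache(maxsize=None)
    def visit(node: str) -> Tuple[int, int]:
        if node == sink:
            return (0, 1)
        best, count = -1, 0
        for nxt in graph[node]:
            d, c = visit(nxt)
            if c > 0:                  # only successors that reach the sink
                best = max(best, d + 1)
                count += c
        return (best, count)

    return visit(source)


for layers in (1, 3, 10):
    print(layers, depth_and_paths(build_layered_circuit(layers, 2), "in", "out"))
```

In this toy family the depth grows linearly with the number of layers while the number of paths doubles with each added layer, which is the exponential re-use the text refers to.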

Formally, one can change the depth of a given circuit by changing the definition of what each node can compute, but only by a constant factor. 

The typical computations we allow in each node include: weighted sum, product, artificial neuron model (such as a monotone nonlinearity on top of an affine transformation), computation of a kernel, or logic gates.

Theoretical results clearly show families of functions where a deep representation can be exponentially more efficient than one that is insufficiently deep (Hastad, 1986; Hastad and Goldmann, 1991; Bengio et al., 2006a; Bengio and LeCun, 2007; Bengio and Delalleau, 2011). 

If the same family of functions can be represented with fewer parameters (or more precisely with a smaller VC-dimension), learning theory would suggest that it can be learned with fewer examples, yielding improvements in both computational efficiency (fewer nodes to visit) and statistical efficiency (fewer parameters to learn, and re-use of these parameters over many different kinds of inputs).

Abstraction and invariance.

●Deep architectures can lead to abstract representations because more abstract concepts can often be constructed in terms of less abstract ones. 

In some cases, such as in the convolutional neural network (LeCun et al., 1998b), we build this abstraction in explicitly via a pooling mechanism (see section 11.2). 

More abstract concepts are generally invariant to most local changes of the input. 

That makes the representations that capture these concepts generally highly non-linear functions of the raw input. 

This is obviously true of categorical concepts, where more abstract representations detect categories that cover more varied phenomena (e.g. larger manifolds with more wrinkles) and thus they potentially have greater predictive power.

Abstraction can also appear in high-level continuous-valued attributes that are only sensitive to some very specific types of changes in the input. 

Learning these sorts of invariant features has been a long-standing goal in pattern recognition. 

3.5 Disentangling Factors of Variation

●Beyond being distributed and invariant, we would like our representations to disentangle the factors of variation. 

Different explanatory factors of the data tend to change independently of each other in the input distribution, and only a few at a time tend to change when one considers a sequence of consecutive real-world inputs. 

●Complex data arise from the rich interaction of many sources. 

These factors interact in a complex web that can complicate AI-related tasks such as object classification. 

For example, an image is composed of the interaction between one or more light sources, the object shapes and the material properties of the various surfaces present in the image. 

Shadows from objects in the scene can fall on each other in complex patterns, creating the illusion of object boundaries where there are none and dramatically affecting the perceived object shape.

How can we cope with these complex interactions? 

How can we disentangle the objects and their shadows? 

Ultimately, we believe the approach we adopt for overcoming these challenges must leverage the data itself, using vast quantities of unlabeled examples, to learn representations that separate the various explanatory sources. 

Doing so should give rise to a representation significantly more robust to the complex and richly structured variations extant in natural data sources for AI-related tasks. 

●It is important to distinguish between the related but distinct goals of learning invariant features and learning to disentangle explanatory factors. 

The central difference is the preservation of information. 

Invariant features, by definition, have reduced sensitivity in the direction of invariance. 

This is the goal of building features that are insensitive to variation in the data that are uninformative to the task at hand. 

Unfortunately, it is often difficult to determine a priori which set of features and variations will ultimately be relevant to the task at hand.

Further, as is often the case in the context of deep learning methods, the feature set being trained may be destined to be used in multiple tasks that may have distinct subsets of relevant features. 

Considerations such as these lead us to the conclusion that the most robust approach to feature learning is to disentangle as many factors as possible, discarding as little information about the data as is practical. 

If some form of dimensionality reduction is desirable, then we hypothesize that the local directions of variation least represented in the training data should be first to be pruned out (as in PCA, for example, which does it globally instead of around each example).

3.6 Good criteria for learning representations?

●One of the challenges of representation learning that distinguishes it from other machine learning tasks such as classification is the difficulty in establishing a clear objective, or target for training. 

In the case of classification, the objective is (at least conceptually) obvious: we want to minimize the number of misclassifications on the training dataset.

In the case of representation learning, our objective is far-removed from the ultimate objective, which is typically learning a classifier or some other predictor. 

Our problem is reminiscent of the credit assignment problem encountered in reinforcement learning. 

We have proposed that a good representation is one that disentangles the underlying factors of variation, but how do we translate that into appropriate training criteria? 

Is it even necessary to do anything but maximize likelihood under a good model, or can we introduce priors such as those enumerated above (possibly data-dependent ones) that help the representation better do this disentangling?

This question remains clearly open but is discussed in more detail in Sections 3.5 and 11.4. 


●In 2006, a breakthrough in feature learning and deep learning was initiated by Geoff Hinton and quickly followed up in the same year (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007), and soon after by Lee et al. (2008) and many more later. 

It has been extensively reviewed and discussed in Bengio (2009). 
それは、Bengio (2009) で広範囲にレビューされ、議論されました。

A central idea, referred to as greedy layerwise unsupervised pre-training, was to learn a hierarchy of features one level at a time, using unsupervised feature learning to learn a new transformation at each level to be composed with the previously learned transformations; essentially, each iteration of unsupervised feature learning adds one layer of weights to a deep neural network. 

Finally, the set of layers could be combined to initialize a deep supervised predictor, such as a neural network classifier, or a deep generative model, such as a Deep Boltzmann Machine (Salakhutdinov and Hinton, 2009). 
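
The greedy layerwise procedure described above can be sketched as a loop that trains one layer at a time on the output of the layers below it. The sketch below makes a strong simplifying assumption: each single-layer learner is a PCA projection followed by a fixed tanh non-linearity, standing in for the RBMs or auto-encoders used in the cited work. All function names are hypothetical.

```python
import numpy as np


def fit_pca_layer(data, n_features):
    """Fit one unsupervised layer: centre the data, keep the top principal directions."""
    mean = data.mean(axis=0)
    cov = np.cov(data - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_features]  # indices of the largest eigenvalues
    return eigvecs[:, order], mean


def apply_layer(data, weights, mean):
    """Encode with the learned projection, then apply a fixed tanh non-linearity."""
    return np.tanh((data - mean) @ weights)


def greedy_layerwise_pretrain(data, layer_sizes):
    """Train one layer at a time, each on the representation produced so far."""
    layers, representation = [], data
    for size in layer_sizes:
        weights, mean = fit_pca_layer(representation, size)
        layers.append((weights, mean))
        representation = apply_layer(representation, weights, mean)
    return layers, representation


rng = np.random.default_rng(0)
x = rng.normal(size=(200, 8))   # toy data: 200 examples, 8 inputs
layers, features = greedy_layerwise_pretrain(x, layer_sizes=[6, 4, 2])
print([w.shape for w, _ in layers], features.shape)
```

The list `layers` is what would then be used to initialize a deep supervised predictor or a deep generative model; that fine-tuning step is not shown here.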

●This paper is mostly about feature learning algorithms that can be used to form deep architectures. 

In particular, it was empirically observed that layerwise stacking of feature extraction often yielded better representations, e.g., in terms of classification error (Larochelle et al., 2009; Erhan et al., 2010b), quality of the samples generated by a probabilistic model (Salakhutdinov and Hinton, 2009) or in terms of the invariance properties of the learned features (Goodfellow et al., 2009).

Whereas this section focuses on the idea of stacking single-layer models, Section 10 follows up with a discussion on joint training of all the layers. 

●After greedy layerwise unsupervised pre-training, the resulting deep features can be used either as input to a standard supervised machine learning predictor (such as an SVM) or as initialization for a deep supervised neural network (e.g., by appending a logistic regression layer or purely supervised layers of a multi-layer neural network). 

The layerwise procedure can also be applied in a purely supervised setting, called the greedy layerwise supervised pre-training (Bengio et al., 2007). 

For example, after the first one-hidden-layer MLP is trained, its output layer is discarded and another one-hidden-layer MLP can be stacked on top of it, etc. 

Although the results reported in Bengio et al. (2007) were not as good as those for unsupervised pre-training, they were nonetheless better than with no pre-training at all. 

Alternatively, the outputs of the previous layer can be fed as extra inputs for the next layer (in addition to the raw input), as successfully done in Yu et al. (2010). 
別の方法として、前の層の出力は、与えることができます|次の層への追加入力として (生入力への追加として) |Yuが成功裏に実施したように|。

Another variant (Seide et al., 2011b) pre-trains in a supervised way all the previously added layers at each step of the iteration, and in their experiments this discriminant variant yielded better results than unsupervised pre-training. 

●Whereas combining single layers into a supervised model is straightforward, it is less clear how layers pre-trained by unsupervised learning should be combined to form a better unsupervised model. 

We cover here some of the approaches to do so, but no clear winner emerges and much work has to be done to validate existing proposals or improve them. 

●The first proposal was to stack pre-trained RBMs into a Deep Belief Network (Hinton et al., 2006) or DBN, where the top layer is interpreted as an RBM and the lower layers as a directed sigmoid belief network. 

However, it is not clear how to approximate maximum likelihood training to further optimize this generative model. 

One option is the wake-sleep algorithm (Hinton et al., 2006) but more work should be done to assess the efficiency of this procedure in terms of improving the generative model. 

●The second approach that has been put forward is to combine the RBM parameters into a Deep Boltzmann Machine (DBM), by basically halving the RBM weights to obtain the DBM weights (Salakhutdinov and Hinton, 2009). 

The DBM can then be trained by approximate maximum likelihood, as discussed in more detail later (Section 10.2). 

This joint training has brought substantial improvements, both in terms of likelihood and in terms of classification performance of the resulting deep feature learner (Salakhutdinov and Hinton, 2009). 

●Another early approach was to stack RBMs or auto-encoders into a deep auto-encoder (Hinton and Salakhutdinov, 2006). 

If we have a series of encoder-decoder pairs $(f^{(i)}(\cdot), g^{(i)}(\cdot))$, then the overall encoder is the composition of the encoders, $f^{(N)}(\ldots f^{(2)}(f^{(1)}(\cdot)))$, and the overall decoder is its “transpose” (often with transposed weight matrices as well), $g^{(1)}(g^{(2)}(\ldots g^{(N)}(\cdot)))$. 
もし一連の符号化器・復号器対 $(f^{(i)}(\cdot), g^{(i)}(\cdot))$ があれば、全体の符号化器は、符号化器の合成 $f^{(N)}(\ldots f^{(2)}(f^{(1)}(\cdot)))$ であり、全体のデコーダー(復号器)は、そのトランスポーズ(転置) $g^{(1)}(g^{(2)}(\ldots g^{(N)}(\cdot)))$ です。
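
The composition of encoders and the mirrored composition of decoders can be sketched as follows. This assumes tied weights (each decoder uses the transpose of its encoder's matrix), tanh encoders, linear decoders and random, untrained weights; these are choices made for the illustration, not the setup of Hinton and Salakhutdinov (2006). The helper names are hypothetical.

```python
import numpy as np


def make_pair(weights):
    """Build one encoder/decoder pair with tied weights: f uses W, g uses W^T."""
    def encode(x):
        return np.tanh(x @ weights)

    def decode(h):
        return h @ weights.T   # linear decoder, an assumption of this sketch

    return encode, decode


def deep_encode(pairs, x):
    """Apply the encoders bottom-up: f^(N)(... f^(2)(f^(1)(x)))."""
    for encode, _ in pairs:
        x = encode(x)
    return x


def deep_decode(pairs, h):
    """Apply the decoders top-down: g^(1)(g^(2)(... g^(N)(h)))."""
    for _, decode in reversed(pairs):
        h = decode(h)
    return h


rng = np.random.default_rng(0)
sizes = [10, 6, 3]   # input size followed by two hidden-layer sizes
pairs = [make_pair(rng.normal(scale=0.1, size=(m, n)))
         for m, n in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=(5, 10))
code = deep_encode(pairs, x)
reconstruction = deep_decode(pairs, code)
reconstruction_error = np.mean((x - reconstruction) ** 2)
print(code.shape, reconstruction.shape)
```

Joint training would adjust all the weight matrices together to reduce `reconstruction_error`; the optimization itself is left out of the sketch.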

The deep auto-encoder (or its regularized version, as discussed in Section 7.2) can then be jointly trained, with all the parameters optimized with respect to a global reconstruction error criterion. 
深層の自己符号化器 (もしくはその正則化版、7.2節で説明) は、共同で学習され得ます|グローバル再構築誤差基準に関して最適化されたすべてのパラメタとともに|。

More work on this avenue clearly needs to be done, and it was probably avoided for fear of the challenges in training deep feedforward networks, discussed in Section 10 along with very encouraging recent results. 

●Yet another recently proposed approach to training deep architectures (Ngiam et al., 2011) is to consider the iterative construction of a free energy function (i.e., with no explicit latent variables, except possibly for a top-level layer of hidden units) for a deep architecture as the composition of transformations associated with lower layers, followed by top-level hidden units. 

The question is then how to train a model defined by an arbitrary parametrized (free) energy function. 

Ngiam et al. (2011) have used Hybrid Monte Carlo (Neal, 1993), but other options include contrastive divergence (Hinton, 1999; Hinton et al., 2006), score matching (Hyvärinen, 2005; Hyvärinen, 2008), denoising score matching (Kingma and LeCun, 2010; Vincent, 2011), ratio matching (Hyvärinen, 2007) and noise-contrastive estimation (Gutmann and Hyvärinen, 2010). 


●Within the community of researchers interested in representation learning, there have developed two broad parallel lines of inquiry: one rooted in probabilistic graphical models and one rooted in neural networks. 
 表現学習に興味をもつ研究者コミュニティのなかでは、二本の幅広い平行線の探求がおこりました: ひとつは、確率グラフィカルモデルに根付いていて、もう一つはニューラルネットワークに根付いています。

Fundamentally, the difference between these two paradigms is whether the layered architecture of a deep learning model is to be interpreted as describing a probabilistic graphical model or as describing a computation graph. 

In short, are hidden units to be considered as latent random variables or as computational nodes? 

●To date, the dichotomy between these two paradigms has remained in the background, perhaps because they appear to have more characteristics in common than separating them. 

We suggest that this is likely a function of the fact that much recent progress in both of these areas has focused on single-layer greedy learning modules and the similarities between the types of single-layer models that have been explored: mainly, the restricted Boltzmann machine (RBM) on the probabilistic side, and the auto-encoder variants on the neural network side. 

Indeed, as shown by one of us (Vincent, 2011) and others (Swersky et al., 2011), in the case of the restricted Boltzmann machine, training the model via an inductive principle known as score matching (Hyvärinen, 2005) (to be discussed in Section 6.4.3) is essentially identical to applying a regularized reconstruction objective to an auto-encoder. 

Another strong link between pairs of models on both sides of this divide arises when the computational graph for computing the representation in the neural network model corresponds exactly to the computational graph for inference in the probabilistic model, and this also happens to correspond to the structure of the graphical model itself (e.g., as in the RBM). 

●The connection between these two paradigms becomes more tenuous when we consider deeper models where, in the case of a probabilistic model, exact inference typically becomes intractable. 

In the case of deep models, the computational graph diverges from the structure of the model. 

For example, in the case of a deep Boltzmann machine, unrolling variational (approximate) inference into a computational graph results in a recurrent graph structure. 
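
One way to see why the unrolled graph is recurrent is to write out mean-field updates for a deep Boltzmann machine with two hidden layers. The sketch below assumes binary units, no biases, and the energy $-v^T W_1 h_1 - h_1^T W_2 h_2$; the weights are random and untrained, and the names `mean_field`, `w1`, `w2` are hypothetical.

```python
import numpy as np


def sigmoid(a):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-a))


def mean_field(v, w1, w2, n_steps=10):
    """Unrolled mean-field updates for a two-hidden-layer DBM (biases omitted)."""
    mu1 = sigmoid(v @ w1)       # bottom-up initialisation of the first hidden layer
    mu2 = sigmoid(mu1 @ w2)     # and of the second hidden layer
    for _ in range(n_steps):
        # The first hidden layer receives bottom-up and top-down input; the same
        # weights are reused at every step, which makes the unrolled graph recurrent.
        mu1 = sigmoid(v @ w1 + mu2 @ w2.T)
        mu2 = sigmoid(mu1 @ w2)
    return mu1, mu2


rng = np.random.default_rng(0)
w1 = rng.normal(scale=0.1, size=(6, 4))   # visible (6) to first hidden layer (4)
w2 = rng.normal(scale=0.1, size=(4, 3))   # first hidden (4) to second hidden layer (3)
v = rng.integers(0, 2, size=(5, 6)).astype(float)   # five binary visible vectors

mu1, mu2 = mean_field(v, w1, w2)
print(mu1.shape, mu2.shape)
```

With `n_steps=0` the function reduces to a single feedforward pass, which is the point where the computation still matches a plain layered encoder.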

We have performed preliminary exploration (Savard, 2011) of deterministic variants of deep auto-encoders whose computational graph is similar to that of a deep Boltzmann machine (in fact very close to the mean-field variational approximations associated with the Boltzmann machine), and that is one interesting intermediate point to explore (between the deterministic approaches and the graphical model approaches). 

●In the next few sections we will review the major developments in single-layer training modules used to support feature learning and particularly deep learning.

We divide these sections between (Section 6) the probabilistic models, with inference and training schemes that directly parametrize the generative - or decoding - pathway and (Section 7) the typically neural network-based models that directly parametrize the encoding pathway. 

Interestingly, some models, like Predictive Sparse Decomposition (PSD) (Kavukcuoglu et al., 2008) inherit both properties, and will also be discussed (Section 7.2.4). 

We then present a different view of representation learning, based on the associated geometry and the manifold assumption, in Section 8. 

●First, let us consider an unsupervised single-layer representation learning algorithm spanning all three views: probabilistic, auto-encoder, and manifold learning. 

Principal Components Analysis  主成分解析

●We will use probably the oldest feature extraction algorithm, principal components analysis (PCA), to illustrate the probabilistic, auto-encoder and manifold views of representation learning. 

PCA learns a linear transformation $h = f(x) = W^T x + b$ of input $x \in \mathbb{R}^{d_x}$, where the columns of the $d_x \times d_h$ matrix $W$ form an orthogonal basis for the $d_h$ orthogonal directions of greatest variance in the training data. 
PCAは、学習します|線形変換 $h = f(x) = W^T x + b$ を|。

The result is $d_h$ features (the components of representation $h$) that are decorrelated. 
結果は、$d_h$ 個の特徴です。
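
The transformation and the decorrelation property can be checked numerically. The following is a minimal sketch, assuming PCA is computed by an eigen-decomposition of the sample covariance matrix and that $b$ is chosen so that the features are centred; the function name `pca_fit` and the toy data are hypothetical.

```python
import numpy as np


def pca_fit(data, n_components):
    """Return W (d_x by d_h, orthonormal columns) and b so that h = W^T x + b."""
    mean = data.mean(axis=0)
    cov = np.cov(data - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # directions of greatest variance
    w = eigvecs[:, order]
    b = -w.T @ mean                                   # centres the resulting features
    return w, b


rng = np.random.default_rng(0)
mixing = rng.normal(size=(5, 5))
x = rng.normal(size=(500, 5)) @ mixing   # correlated toy inputs, one example per row

w, b = pca_fit(x, n_components=2)
h = x @ w + b                            # row-wise form of h = W^T x + b
feature_cov = np.cov(h, rowvar=False)    # off-diagonal entries should be close to zero
print(np.round(feature_cov, 6))
```

The diagonal of `feature_cov` holds the variances along the two leading directions, in decreasing order.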

The three interpretations of PCA are the following:   PCAの三つの解釈は以下のとおりです。

a) it is related to probabilistic models (Section 6) such as probabilistic PCA, factor analysis and the traditional multivariate Gaussian distribution (the leading eigenvectors of the covariance matrix are the principal components); 
a) それは、関係しています|確率モデルに|確率PCAや、因子解析や、伝統的な多変数ガウス分布(共分散行列の主要固有ベクトルは主成分です)|。

b) the representation it learns is essentially the same as that learned by a basic linear auto-encoder (Section 7.2); and 
b) それが学習する表現は、基本的な線形自己符号化器が学習するものと本質的に同じものです。

c) it can be viewed as a simple linear form of manifold learning (Section 8), i.e., characterizing a lower-dimensional region in input space near which the data density is peaked. 
c) それは、みなすことができます|単純な線形形式であると|多様体学習の|。

Thus, PCA may be in the back of the reader’s mind as a common thread relating these various viewpoints. 

Unfortunately, the expressive power of linear features is very limited: they cannot be stacked to form deeper, more abstract representations, since the composition of linear operations yields another linear operation. 
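
The claim that stacked linear layers collapse into a single linear layer is easy to verify numerically. In this sketch the matrices `w1` and `w2` are random placeholders, not learned features.

```python
import numpy as np

rng = np.random.default_rng(0)
w1 = rng.normal(size=(4, 6))   # first linear layer: 6 inputs to 4 features
w2 = rng.normal(size=(3, 4))   # second linear layer: 4 features to 3 features
x = rng.normal(size=6)

stacked = w2 @ (w1 @ x)        # two linear layers applied in sequence
collapsed = (w2 @ w1) @ x      # one linear layer with the product matrix
print(np.allclose(stacked, collapsed))

# Inserting a non-linearity between the layers breaks the equivalence.
with_nonlinearity = w2 @ np.tanh(w1 @ x)
print(np.allclose(stacked, with_nonlinearity))
```

This is why a non-linearity between layers is needed before depth can add expressive power.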

Here, we focus on recent algorithms that have been developed to extract non-linear features, which can be stacked in the construction of deep networks, although some authors simply insert a non-linearity between learned single-layer linear projections (Le et al., 2011c; Chen et al., 2012). 

●Another rich family of feature extraction techniques that this review does not cover in any detail due to space constraints is Independent Component Analysis or ICA (Jutten and Herault, 1991; Bell and Sejnowski, 1997). 

Instead, we refer the reader to Hyvärinen et al. (2001a) and Hyvärinen et al. (2009). 

Note that, while in the simplest case (complete, noise-free) ICA yields linear features, in the more general case it can be equated with a linear generative model with non-Gaussian independent latent variables, similar to sparse coding (Section 6.1.1), which results in non-linear features.

Therefore, ICA and its variants like Independent and Topographic ICA (Hyvärinen et al., 2001b) can be and have been used to build deep networks (Le et al., 2010, 2011c): see Section 11.2. 
ですから、ICAや、その変形 (独立およびトポグラフィックなICAのような) は、使用でき、使用されてきました|深層ネットワーク構築するために|。

The notion of obtaining independent components also appears similar to our stated goal of disentangling underlying explanatory factors through deep networks. 

However, for complex real-world distributions, it is doubtful that the relationship between truly independent underlying factors and the observed high-dimensional data can be adequately characterized by a linear transformation. 

6. Probabilistic Models  確率モデル

6.1 Directed Graphical Models

6.2 Undirected Graphical Models

6.3 Generalizations of the RBM to Real-valued data

6.4 RBM parameter estimation

7. Directly Learning A Parametric Map from Input to Representation  入力から表現へパラメトリックマップを直接的に学習する

7.1 Auto-Encoders

7.2 Regularized Auto-Encoders

8. Representation Learning as Manifold Learning  多様体学習としての表現学習

8.1 Learning a parametric mapping based on a neighborhood graph

8.2 Learning to represent non-linear manifolds

8.3 Leveraging the modeled tangent spaces

9. Connections between Probabilistic and Direct Encoding models  確率モデルと直接符号化モデルとの結びつき

9.1 PSD: a probabilistic interpretation

9.2 Regularized Auto-Encoders Capture Local Structure of the Density

9.3 Learning Approximate Inference

9.4 Sampling Challenges

9.5 Evaluating and Monitoring Performance

10. Global Training of Deep Models  深層モデルの大域的訓練

10.1 The Challenge of Training Deep Architectures

10.2 Joint Training of Deep Boltzmann Machines

11. Building-In Invariance  不変の組み込み

11.1 Generating transformed examples

11.2 Convolution and pooling

11.3 Temporal coherence and slow features

11.4 Algorithms to Disentangle Factors of Variation

12. Conclusion  結論





