Bengio Representation Learning (2012)

　Y. Bengio et al. Representation Learning: A Review and New Perspectives. Technical report, arXiv:1206.5538v2, 1-30, 2012

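Posted 2016.8.20; updated 2016.8.29

This review paper by Bengio is cited often, so I expect many people would like to read it; I am posting a Japanese translation here.

The paper circulates freely on the Internet and is easy to obtain, so I believe that reproducing the English text here causes no economic gain or loss and raises no serious copyright problem. If there is a problem, please let me know.

I have tried to translate literally, and to choose renderings that preserve the original sense of each word. Because English and Japanese word order differ, a literal rendering of the form S｜A｜B｜、は、V｜C｜D｜。 would read B A S は、D C V。 in natural Japanese order.

Please do not stop at decoding the translation to get the meaning; afterwards, read the English and make sure you can understand it directly.

This is not my field of specialty, so some of the translated terms are imprecise. I intend to harmonize the terminology as the translation progresses.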

1 INTRODUCTION 　はじめに

2 WHY SHOULD WE CARE ABOUT LEARNING REPRESENTATIONS? 　なぜ私たちは表現学習を気にしなければならないのか

3 WHAT MAKES A REPRESENTATION GOOD? 　何が表現をよくするのか？

4 BUILDING DEEP REPRESENTATIONS 　　深層表現の構築

5 SINGLE-LAYER LEARNING MODULES　　単一層学習モジュール

6 Probabilistic Models　　確率モデル

7 Directly Learning A Parametric Map from Input to Representation　　入力から表現へのパラメトリックマップを直接的に学習する

8 Representation Learning as Manifold Learning　　多様体学習としての表現学習

9 Connections between Probabilistic and Direct Encoding models　　確率モデルと直接符号化モデルとの結びつき

10 Global Training of Deep Models　　深層モデルの大域的訓練

11 Building-In Invariance　　不変性の組み込み

12 Conclusion　　結論

Abstract　　要約

The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data.
機械学習アルゴリズムの成功は、一般的に、依存しています｜データ表現に｜。私たちは、仮定します｜これは、であると｜様々な異なる表現が、もつれさせたり隠したりできるから｜多かれ少なかれ様々な変動の説明的因子を｜データの背後にある｜。

Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors.
特定の領域知識は、表現をデザインすることを助けるために使うことはできますが、総称プライアによる学習も使うことができます。そして人工知能の探求は、動機付けします｜より強力な表現-学習アルゴリズムをデザインすることに｜そのようなプライアを実装しながら｜。

This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks.
この論文は、レビューします｜最近の研究を｜教師なし特徴学習と深層学習の分野における｜。そしてカバーします｜確率モデル、自己符号化器、多様体学習、深層ネットワークにおける進展を｜。

This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.
このことは、動機付けします｜長期にわたる未解決の疑問を｜適切な目的に関する｜よい表現を学習するための、そして表現を計算する (すなわち推論の) ための｜。さらに、幾何学的なつながりに関する疑問を｜表現学習、密度推定、多様体学習の間の｜。

1 INTRODUCTION 　はじめに

●The performance of machine learning methods is heavily dependent on the choice of data representation (or features) on which they are applied.
　機械学習の種々の方法のはたらきは、強く依存しています｜それらが適用されるデータ表現(ないしはデータ特徴)の選択に｜。

For that reason, much of the actual effort in deploying machine learning algorithms goes into the design of preprocessing pipelines and data transformations that result in a representation of the data that can support effective machine learning.
その理由のために、たくさんの実際の努力｜機械学習のアルゴリズムの展開における｜、は、前処理パイプラインやデータ変換のデザインに注がれます。そして、その結果、有効な機械学習をサポートすることのできるデータの表現が得られます。

Such feature engineering is important but labor-intensive and highlights the weakness of current learning algorithms: their inability to extract and organize the discriminative information from the data.
このような特徴工学は、重要ですが、労働集約型でもあり、強調します｜現在の学習アルゴリズムの弱さ、すなわち、データから識別に役立つ情報を抽出し組織化することができないことを｜。

Feature engineering is a way to take advantage of human ingenuity and prior knowledge to compensate for that weakness.
特徴工学は、道なのです｜人間の創意工夫能力や予備知識を利活用する｜この弱点を克服するために｜。

In order to expand the scope and ease of applicability of machine learning, it would be highly desirable to make learning algorithms less dependent on feature engineering, so that novel applications could be constructed faster, and more importantly, to make progress towards Artificial Intelligence (AI).
機械学習のスコープ(作用範囲)や適用容易さを拡大するために、高度に望ましいことは、学習アルゴリズムを特徴工学により低依存にすることと (新規の応用がより速く構築できるために)、もっと重要なことですが、人工知能に向かって前進することです。

An AI must fundamentally understand the world around us, and we argue that this can only be achieved if it can learn to identify and disentangle the underlying explanatory factors hidden in the observed milieu of low-level sensory data.
人工知能は、基本的に私たちの周りの世界を理解しなければなりません。そして私たちは主張します｜このことは初めて実現できると｜人工知能が特定しもつれを解くことを学習できることによって｜潜在する説明的な因子を｜観測された低レベルの感覚データの環境の中に隠されている｜。

●This paper is about representation learning, i.e., learning representations of the data that make it easier to extract useful information when building classifiers or other predictors.
　この論文は、扱います｜表現学習を、すなわち、有用な情報を抽出することを容易にしてくれるデータの学習表現を｜分類器もしくは他の予測子を構築する際に｜。

In the case of probabilistic models, a good representation is often one that captures the posterior distribution of the underlying explanatory factors for the observed input.
確率モデルの場合には、良い表現はしばしば、です｜表現｜捕獲する｜事後分布を｜潜在する説明的な因子の｜観測された入力の｜。

A good representation is also one that is useful as input to a supervised predictor.
よい表現は、また、でもあります｜教師付き予測子への入力として有用な表現｜。

Among the various ways of learning representations, this paper focuses on deep learning methods: those that are formed by the composition of multiple non-linear transformations, with the goal of yielding more abstract - and ultimately more useful - representations.
表現の学習には様々な方法がありますが、この論文は、深層学習の方法に焦点をあてます。それらは、形成されます｜多重の非線形変換を合成することによって｜より抽象的で、そして究極的にはより有用な表現を産み出すことを目標にして｜。

Here we survey this rapidly developing area with special emphasis on recent progress.
ここで私たちは調査します｜この急速に発展している分野を｜最近の進展を特別に強調して｜。

We consider some of the fundamental questions that have been driving research in this area.
私たちは考察します｜基本的な疑問のいくつかを｜この分野の研究を駆動してきた｜。

Specifically, what makes one representation better than another?
とくに、ある表現を他の表現よりもよくするものは何でしょうか。

Given an example, how should we compute its representation, i.e. perform feature extraction?
例を与えられて、どうやってその表現を計算するのでしょうか、すなわち、どうやって特徴抽出を行うのでしょうか。

Also, what are appropriate objectives for learning good representations?
また、何が適切な目標なのでしょうか｜よい表現を学習するための｜。

2 WHY SHOULD WE CARE ABOUT LEARNING REPRESENTATIONS? 　なぜ私たちは表現学習を気にしなければならないのか

Representation learning has become a field in itself in the machine learning community, with regular workshops at the leading conferences such as NIPS and ICML, and a new conference dedicated to it, ICLR, sometimes under the header of Deep Learning or Feature Learning.
表現学習は、なりました｜それ自身で一つの分野に｜機械学習のコミュニティにおいて。定例のワークショップが開かれます｜主要な学会において｜NIPSやICMLなどの｜。また、表現学習に特化した新しい学会であるICLRもあります。しばしば深層学習とか、特徴学習というヘッダータイトルのもとに｜。

Although depth is an important part of the story, many other priors are interesting and can be conveniently captured when the problem is cast as one of learning a representation, as discussed in the next section.
深さは、この物語の重要な部分ではありますが、他の多くのプライアも重要で、都合よく捕獲することができます｜問題が提示されたとき｜表現の学習の一つとして｜次の節で議論するように｜。

The rapid increase in scientific activity on representation learning has been accompanied and nourished by a remarkable string of empirical successes both in academia and in industry.
急速な増加｜表現学習の学問活動における｜、は、伴われ、育まれてきました｜一連の並外れた経験的成功によって｜学界と産業界の双方における｜。

Below, we briefly highlight some of these high points.
以下に、短くこれらのハイな点のいくつかについて、ハイライトをあてます。

Speech Recognition and Signal Processing 　スピーチ認識と信号処理

●Speech was one of the early applications of neural networks, in particular convolutional (or time-delay) neural networks.
　スピーチは、一つでした｜ニューラルネットの初期の応用の、特に、畳み込み(または時間遅延)ニューラルネットワークの｜。

The recent revival of interest in neural networks, deep learning, and representation learning has had a strong impact in the area of speech recognition, with breakthrough results (Dahl et al., 2010; Deng et al., 2010; Seide et al., 2011a; Mohamed et al., 2012; Dahl et al., 2012; Hinton et al., 2012) obtained by several academics as well as researchers at industrial labs bringing these algorithms to a larger scale and into products.
最近の復活｜ニューラルネットや深層学習や表現学習における興味の｜、は、スピーチ認識の分野に強いインパクトを与えました。突破口的な結果が、いくつかの教育研究機関や産業界の研究所の研究者たちによって達成され、得られたアルゴリズムは、大規模に、そして製品に使われました。

For example, Microsoft has released in 2012 a new version of their MAVIS (Microsoft Audio Video Indexing Service) speech system based on deep learning (Seide et al., 2011a).
例えば、マイクロソフトは、リリースしました｜2012年に、新バージョンのMAVIS スピーチシステムを｜深層学習に基づいて｜。

These authors managed to reduce the word error rate on four major benchmarks by about 30% (e.g. from 27.4% to 18.5% on RT03S) compared to state-of-the-art models based on Gaussian mixtures for the acoustic modeling and trained on the same amount of data (309 hours of speech).
これらの著者は、成し遂げました｜単語誤り率を低減することを｜４つの主要なベンチマークにおいて｜約30％だけ (例えば RT03S では27.4％から18.5％に)｜最新モデルと比較して｜音響モデリングのガウシアン混合に基づき、同じ量のデータ (309時間のスピーチ) により訓練された｜。
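The "about 30%" above is a relative reduction, not a difference in percentage points. A minimal sketch of that arithmetic, using only the RT03S figures quoted in the text (the function name `relative_reduction` is introduced here purely for illustration):

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Relative reduction of an error rate, as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - new) / baseline


# RT03S word error rates quoted in the text: 27.4% -> 18.5%.
rt03s = relative_reduction(27.4, 18.5)
print(f"absolute drop: {27.4 - 18.5:.1f} points, relative drop: {rt03s:.1%}")
```

The absolute drop is 8.9 points, which is roughly a third of the 27.4% baseline; that ratio is what the text summarizes as "about 30%".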

The relative improvement in error rate obtained by Dahl et al. (2012) on a smaller large-vocabulary speech recognition benchmark (Bing mobile business search dataset, with 40 hours of speech) is between 16% and 23%.
相対的な改善 ((Dahlが得た誤差率における)) は、もう少し小さい大規模語彙スピーチ認識ベンチマークにおいて、16％から23％の間です。

●Representation-learning algorithms have also been applied to music, substantially beating the state-of-the-art in polyphonic transcription (Boulanger-Lewandowski et al., 2012), with relative error improvement between 5% and 30% on a standard benchmark of 4 datasets.
　表現学習アルゴリズムは、音楽にも適用され、実質的にポリフォニーの編曲の最新技術を打倒しました。

Deep learning also helped to win MIREX (Music Information Retrieval) competitions, e.g. in 2011 on audio tagging (Hamel et al., 2011).
深層学習は、にも貢献しました｜MIREX競技会での勝利、例えば、2011年のaudio tagging｜。

Object Recognition 　オブジェクト認識

●The beginnings of deep learning in 2006 have focused on the MNIST digit image classification problem (Hinton et al., 2006; Bengio et al., 2007), breaking the supremacy of SVMs (1.4% error) on this dataset.
　2006年の深層学習の開始は、MNIST数字イメージ分類問題に着目し、SVMの優位性を打倒しました。

The latest records are still held by deep networks: Ciresan et al. (2012) currently claims the title of state-of-the-art for the unconstrained version of the task (e.g., using a convolutional architecture), with 0.27% error, and Rifai et al. (2011c) is state-of-the-art for the knowledge free version of MNIST, with 0.81% error.
最新の記録も深層ネットワークが保持しています。

●In the last few years, deep learning has moved from digits to object recognition in natural images, and the latest breakthrough has been achieved on the ImageNet dataset, bringing down the state-of-the-art error rate from 26.1% to 15.3% (Krizhevsky et al., 2012).
　この二三年、深層学習は数字から、自然イメージのオブジェクト認識に移行しました。

Natural Language Processing 　自然言語処理

●Besides speech recognition, there are many other Natural Language Processing (NLP) applications of representation learning.
　スピーチ認識のほかにも、あります｜他の多くの自然言語処理の応用が｜表現学習の｜。

Distributed representations for symbolic data were introduced by Hinton (1986), and first developed in the context of statistical language modeling by Bengio et al. (2003) in so-called neural net language models (Bengio, 2008).
記号データの分布表現は、導入されました｜Hinton (1986) によって。そして、最初に発展させられました｜統計的言語モデリングのコンテキストにおいて｜Bengio et al. (2003) によって｜いわゆるニューラルネット言語モデルにおいて｜。

They are all based on learning a distributed representation for each word, called a word embedding.
それらはすべて、基づいています｜各々の言葉の分布表現の学習に｜言葉の埋め込みと呼ばれている｜。

Adding a convolutional architecture, Collobert et al. (2011) developed the SENNA system that shares representations across the tasks of language modeling, part-of-speech tagging, chunking, named entity recognition, semantic role labeling and syntactic parsing.
畳み込みアーキテクチャを加えて、Collobert et al. (2011) は、開発しました｜SENNA システムを｜それは共有します｜表現を｜言語モデリングや、part-of-speech taggingや、chunkingや、named entity認識や、semantic role labelingや、syntactic parsingなどのタスクを横断して｜。

SENNA approaches or surpasses the state-of-the-art on these tasks but is simpler and much faster than traditional predictors.
SENNAは、最先端に追いつき追い越します｜これらのタスクにおいて｜。しかし、伝統的な予測子よりもより簡単で高速です。

Learning word embeddings can be combined with learning image representations in a way that allow to associate text and images.
言葉の埋め込みの学習は、結びつけられることが可能です｜イメージ表現の学習に｜テキストとイメージを関係付けることを許す方法によって｜。

This approach has been used successfully to build Google's image search, exploiting huge quantities of data to map images and queries in the same space (Weston et al., 2010) and it has recently been extended to deeper multi-modal representations (Srivastava and Salakhutdinov, 2012).
この方法は、成功裏に用いられました｜グーグルのイメージ探索を構築するために｜大規模のデータを活用して｜イメージとクエリー(探索要求)を同じ空間にマップするために｜。そして、最近、より深層の多重モード表現に拡張されました。

●The neural net language model was also improved by adding recurrence to the hidden layers (Mikolov et al., 2011), allowing it to beat the state-of-the-art (smoothed n-gram models) not only in terms of perplexity (exponential of the average negative log-likelihood of predicting the right next word, going down from 140 to 102) but also in terms of word error rate in speech recognition (since the language model is an important component of a speech recognition system), decreasing it from 17.2% (KN5 baseline) or 16.9% (discriminative language model) to 14.4% on the Wall Street Journal benchmark task.
　ニューラルネット言語モデルは、改良されました｜付加することにより｜隠れ層への再帰を。それにより最先端(平滑化n-gramモデル)を打破しました｜

Similar models have been applied in statistical machine translation (Schwenk et al., 2012; Le et al., 2013), improving perplexity and BLEU scores.
同様なモデルは、適用されました｜統計的機械翻訳に｜。そして、perplexityと、BLEU得点を改良しました。
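Perplexity, as defined in the passage above, is the exponential of the average negative log-likelihood of predicting the right next word. A minimal sketch of that definition follows; the probabilities are made-up illustration values, not results from any of the cited systems.

```python
import math


def perplexity(probs_of_correct_word):
    """exp of the average negative log-likelihood of the correct next word.

    Each entry is the probability (> 0) the model gave to the word that
    actually came next.
    """
    if not probs_of_correct_word:
        raise ValueError("need at least one prediction")
    nll = [-math.log(p) for p in probs_of_correct_word]
    return math.exp(sum(nll) / len(nll))


# Hypothetical model A always gives the true word probability 1/140,
# hypothetical model B always gives it 1/102.
print(perplexity([1 / 140] * 4))
print(perplexity([1 / 102] * 4))
```

A model that spreads its belief uniformly over V candidate words has perplexity V, so a drop from 140 to 102 can be read as the model being, on average, as uncertain as a uniform choice among 102 words instead of 140.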

Recursive auto-encoders (which generalize recurrent networks) have also been used to beat the state-of-the-art in full sentence paraphrase detection (Socher et al., 2011a), almost doubling the F1 score for paraphrase detection.
再帰自己符号化器 (再帰ネットワークを一般化します) は、使用されました｜最先端を打破するために｜full sentence paraphrase 検出において｜。殆どF1スコアを倍増しました。

Representation learning can also be used to perform word sense disambiguation (Bordes et al., 2012), bringing up the accuracy from 67.8% to 70.2% on the subset of Senseval-3 where the system could be applied (with subject-verb-object sentences).
表現学習は、使用できます｜word sense disambiguationを実施するためにも｜。

Finally, it has also been successfully used to surpass the state-of-the-art in sentiment analysis (Glorot et al., 2011b; Socher et al., 2011b).
最後に、それは成功裏に使用されました｜sentiment解析において最先端を凌駕するためにも｜。

Multi-Task and Transfer Learning, Domain Adaptation 多重タスクと転移学習、領域適合

●Transfer learning is the ability of a learning algorithm to exploit commonalities between different learning tasks in order to share statistical strength, and transfer knowledge across tasks.
　トランスファ学習は、学習アルゴリズムの能力です｜共通性を利活用する｜異なる学習タスクの間の｜統計的な強さを共有し、タスク間で知識をトランスファするために｜。

As discussed below, we hypothesize that representation learning algorithms have an advantage for such tasks because they learn representations that capture underlying factors, a subset of which may be relevant for each particular task, as illustrated in Figure 1.
以下に議論するように、私たちは仮定します｜表現学習アルゴリズムは、利点を有すと｜そのようなタスクにとって｜なぜならそれらは伏在する因子を捉える表現を学習するので｜。

This hypothesis seems confirmed by a number of empirical results showing the strengths of representation learning algorithms in transfer learning scenarios.
この仮定は、確定されるようにみえます｜多くの経験的な結果によって｜転移学習のシナリオにおける表現学習アルゴリズムの強さを示す｜。

●Most impressive are the two transfer learning challenges held in 2011 and won by representation learning algorithms.
　最も印象的なのは、2011年に行われた二つのトランスファ学習の戦いで、表現学習アルゴリズムが勝利しました。

First, the Transfer Learning Challenge, presented at an ICML 2011 workshop of the same name, was won using unsupervised layer-wise pre-training (Bengio, 2011; Mesnil et al., 2011).
最初は、ICML2011の同名のワークショップで提示されたトランスファ学習チャレンジで、教師なしの層ごとの事前訓練が勝ちました。

A second Transfer Learning Challenge was held the same year and won by Goodfellow et al. (2011).
二つ目のトランスファ学習チャレンジは、同年に開催され、Goodfellow et al.(2011)が勝利しました。

Results were presented at NIPS 2011’s Challenges in Learning Hierarchical Models Workshop.
その結果は、提示されました｜NIPS2011の階層的モデル学習の挑戦ワークショップにおいて｜。

In the related domain adaptation setup, the target remains the same but the input distribution changes (Glorot et al., 2011b; Chen et al., 2012).
関係する領域適合配置において、ターゲットは、変化しませんが、入力分布が変化します。

In the multi-task learning setup, representation learning has also been found advantageous (Krizhevsky et al., 2012; Collobert et al., 2011), because of shared factors across tasks.
多重タスク学習配置においては、表現学習も有利であることがわかりました。

3 WHAT MAKES A REPRESENTATION GOOD? 　何が表現をよくするのか？

3.1 Priors for Representation Learning in AI 　人工知能における表現学習のためのプライア

In Bengio and LeCun (2007), one of us introduced the notion of AI-tasks, which are challenging for current machine learning algorithms, and involve complex but highly structured dependencies.
Bengio and LeCun (2007) において、我々の一人は導入しました｜AIタスクの概念を｜。それは、現在の機械学習アルゴリズムにとって難題であり、複雑ではあるが高度に構造化された依存関係を含みます。

One reason why explicitly dealing with representations is interesting is because they can be convenient to express many general priors about the world around us, i.e., priors that are not task-specific but would be likely to be useful for a learning machine to solve AI-tasks.
何故明確に表現を取り扱うことが面白いのかの理由の一つは、それらが、我々の周りの世界についての沢山の一般プライアを表現するのに都合がよくなりえるからです。

Examples of such general-purpose priors are the following:
そのような一般目的のプライアの例を以下に示します。

Smoothness: assumes the function to be learned f is s.t. x ≈ y generally implies f(x) ≈ f(y).
滑らかさ：　学習されるべき関数は、x ≒ yであれば一般に、f(x) ≒ f(y) であると仮定する。

This most basic prior is present in most machine learning, but is insufficient to get around the curse of dimensionality, see Section 3.2.
この最も基本的なプライアは、殆どの機械学習に存在するが、次元性ののろいを迂回するには不十分です。

Multiple explanatory factors: the data generating distribution is generated by different underlying factors, and for the most part what one learns about one factor generalizes in many configurations of the other factors.
多重説明因子：データを生成する分布は、生成されます｜様々な潜在する因子によって｜。そして、一つの因子について学んだものは、他の因子の多くの配置構造においても一般的です。

The objective to recover or at least disentangle these underlying factors of variation is discussed in Section 3.5.
これらの伏在する変動因子を復元したり、少なくとももつれを解くという目標は、3.5節で議論します。

This assumption is behind the idea of distributed representations, discussed in Section 3.3 below.
この仮定は、分布表現のアイデアの背後にあります、以下の3.3節で議論します。

A hierarchical organization of explanatory factors: the concepts that are useful for describing the world around us can be defined in terms of other concepts, in a hierarchy, with more abstract concepts higher in the hierarchy, defined in terms of less abstract ones.
説明因子の階層組織：

This assumption is exploited with deep representations, elaborated in Section 3.4 below.

Semi-supervised learning: with inputs X and target Y to predict, a subset of the factors explaining X’s distribution explain much of Y, given X.
半教師付き学習：

Hence representations that are useful for P(X) tend to be useful when learning P(Y|X), allowing sharing of statistical strength between the unsupervised and supervised learning tasks, see Section 4.

Shared factors across tasks: with many Y’s of interest or many learning tasks in general, tasks (e.g., the corresponding P(Y|X,task)) are explained by factors that are shared with other tasks, allowing sharing of statistical strengths across tasks, as discussed in the previous section (Multi-Task and Transfer Learning, Domain Adaptation).
タスクを横切る共有因子：

Manifolds: probability mass concentrates near regions that have a much smaller dimensionality than the original space where the data lives.
多様体：

This is explicitly exploited in some of the auto-encoder algorithms and other manifold-inspired algorithms described respectively in Sections 7.2 and 8.

Natural clustering: different values of categorical variables such as object classes are associated with separate manifolds.
自然のクラスタリング：

More precisely, the local variations on the manifold tend to preserve the value of a category, and a linear interpolation between examples of different classes in general involves going through a low density region, i.e., P(X|Y = i) for different i tend to be well separated and not overlap much.

For example, this is exploited in the Manifold Tangent Classifier discussed in Section 8.3.

This hypothesis is consistent with the idea that humans have named categories and classes because of such statistical structure (discovered by their brain and propagated by their culture), and machine learning tasks often involves predicting such categorical variables.

Temporal and spatial coherence: consecutive (from a sequence) or spatially nearby observations tend to be associated with the same value of relevant categorical concepts, or result in a small move on the surface of the high-density manifold.
時間的・空間的コヒーレンス：

More generally, different factors change at different temporal and spatial scales, and many categorical concepts of interest change slowly.

When attempting to capture such categorical variables, this prior can be enforced by making the associated representations slowly changing, i.e., penalizing changes in values over time or space. This prior was introduced in Becker and Hinton (1992) and is discussed in Section 11.3.
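The phrase "penalizing changes in values over time" can be made concrete with a small sketch. The function below is one possible form of such a penalty (mean squared change between consecutive representation vectors); the name `slowness_penalty` and the example sequences are assumptions for illustration, not the formulation of Becker and Hinton (1992).

```python
def slowness_penalty(features):
    """Mean squared change between consecutive feature vectors.

    `features` is a sequence of equal-length vectors, one per time step.
    A small value means the representation changes slowly over time.
    """
    if len(features) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(features, features[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return total / (len(features) - 1)


# Hypothetical representations of three consecutive video frames.
slow = [[0.0, 1.0], [0.1, 1.0], [0.2, 1.0]]   # drifts gently
fast = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]   # flips at every step
print(slowness_penalty(slow), slowness_penalty(fast))
```

Added to a training objective with some weight, a term of this kind would favor the `slow` sequence over the `fast` one, which is the sense in which the prior is "enforced".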

Sparsity: for any given observation x, only a small fraction of the possible factors are relevant.
希薄性：

In terms of representation, this could be represented by features that are often zero (as initially proposed by Olshausen and Field (1996)), or by the fact that most of the extracted features are insensitive to small variations of x.

This can be achieved with certain forms of priors on latent variables (peaked at 0), or by using a nonlinearity whose value is often flat at 0 (i.e., 0 and with a 0 derivative), or simply by penalizing the magnitude of the Jacobian matrix (of derivatives) of the function mapping input to representation.

This is discussed in Sections 6.1.1 and 7.2. 　　このことは議論します｜6.1.1節と7.2節で｜。
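Two of the routes to sparsity named above, a nonlinearity that is flat at 0 and a penalty on the Jacobian of the input-to-representation map, can be sketched together. The rectifier max(0, a) is used here as one example of a nonlinearity with value 0 and derivative 0 over part of its domain; the weights, biases and input are made-up values.

```python
def relu_encoder(W, b, x):
    """h = max(0, W x + b); also returns the pre-activations W x + b."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return [max(0.0, a) for a in pre], pre


def jacobian_sq_norm(W, pre):
    """Squared Frobenius norm of dh/dx for the rectifier encoder above.

    Row i of the Jacobian equals W[i] where unit i is active (pre > 0)
    and is all zeros where the unit is switched off.
    """
    return sum(w * w for row, a in zip(W, pre) if a > 0 for w in row)


W = [[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]   # hypothetical weights
b = [0.0, -1.0, 0.0]                         # hypothetical biases
x = [1.0, 0.5]

h, pre = relu_encoder(W, b, x)
print(h)                         # two of the three features are exactly zero
print(jacobian_sq_norm(W, pre))  # only the active unit contributes
```

Inactive units contribute nothing to the Jacobian, so the representation is both sparse (many zeros) and insensitive to small variations of x along the directions of the switched-off features.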

Simplicity of Factor Dependencies: in good high-level representations, the factors are related to each other through simple, typically linear dependencies.
因子依存状態の単純性：高レベルの良い表現において、因子は互いに関係しています｜単純で、典型的には線形な依存性を通して｜。

This can be seen in many laws of physics, and is assumed when plugging a linear predictor on top of a learned representation.
このことは、物理学の多くの法則に見ることができます。そして、仮定します｜学習された表現のトップに線形予測子をつなぐときに｜。

●We can view many of the above priors as ways to help the learner discover and disentangle some of the underlying (and a priori unknown) factors of variation that the data may reveal.
　私たちはみなすことができます｜以上のpriorsが、方法であると｜学習者が発見しもつれを解くための｜データが示す変動の伏在していて(アプリオリには知られていない)因子を｜。

This idea is pursued further in Sections 3.5 and 11.4.
このアイデアは、3.5節と11.4節でさらに追求します。

3.2 Smoothness and the Curse of Dimensionality 　なめらかさと、次元性ののろい

●For AI-tasks, such as vision and NLP, it seems hopeless to rely only on simple parametric models (such as linear models) because they cannot capture enough of the complexity of interest unless provided with the appropriate feature space.
　ビジョンやNLPのようなAIのタスクにとって、単純なパラメトリックモデル (線形モデルのような) にのみ依存することは、望みがないように思えます。なぜなら、それらは、興味対象の複雑性を十分に捕獲できないからです｜適切な特徴空間が与えられない限り｜。

Conversely, machine learning researchers have sought flexibility in local non-parametric learners such as kernel machines with a fixed generic local-response kernel (such as the Gaussian kernel).
逆に、機械学習の研究者たちは、

Unfortunately, as argued at length by Bengio and Monperrus (2005); Bengio et al. (2006a); Bengio and LeCun (2007); Bengio (2009); Bengio et al. (2010), most of these algorithms only exploit the principle of local generalization, i.e., the assumption that the target function (to be learned) is smooth enough, so they rely on examples to explicitly map out the wrinkles of the target function.
不幸なことに、これらのアルゴリズムの殆どは、利用するだけです｜局所一般化の原理を｜すなわち、目標関数 (学習されるべき) は十分滑らかであるという仮定を｜。そのため、それらは例に頼ります｜目標関数のしわを陽にマップするために｜。

Generalization is mostly achieved by a form of local interpolation between neighboring training examples.
一般化は、大抵は達成されます｜局所的内挿の形によって｜隣接する練習例の間の｜。
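Local interpolation between neighboring training examples can be illustrated with a kernel-weighted average using a Gaussian kernel. This is a minimal sketch of one such local learner (the Nadaraya-Watson estimator), chosen here as an assumption for illustration; the data and bandwidth are made up.

```python
import math


def gaussian_kernel(x, y, sigma=0.5):
    """Local-response kernel: close to 1 for nearby points, close to 0 far away."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))


def kernel_regress(x, train_x, train_y, sigma=0.5):
    """Nadaraya-Watson estimate: kernel-weighted average of training targets."""
    weights = [gaussian_kernel(x, xi, sigma) for xi in train_x]
    return sum(w * yi for w, yi in zip(weights, train_y)) / sum(weights)


# A target with one "wrinkle" per training example.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [0.0, 1.0, 0.0, 1.0]

for query in (0.0, 0.5, 1.0):
    print(query, round(kernel_regress(query, train_x, train_y), 3))
```

Each prediction is dominated by the few training points nearest the query, so every up or down of the target has to be covered by its own examples; nothing learned at one wrinkle helps at another.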

Although smoothness can be a useful assumption, it is insufficient to deal with the curse of dimensionality, because the number of such wrinkles (ups and downs of the target function) may grow exponentially with the number of relevant interacting factors, when the data are represented in raw input space.
滑らかさは、有用な仮定でありえますが、それは、次元性の呪いに対処するには不十分です。なぜなら、そのようなしわの数 (目標関数の上昇や下降) は、指数関数的に成長するかもしれません｜適切な相互作用因子の数が増えるとともに｜データが生入力空間に表示されるときに｜。

We advocate learning algorithms that are flexible and non-parametric but do not rely exclusively on the smoothness assumption.
私たちは推奨します｜学習アルゴリズムを｜フレキシブルでノンパラメトリックであるけれど、滑らかさ仮定には排他的には依存しないところの｜。

Instead, we propose to incorporate generic priors such as those enumerated above into representation-learning algorithms.
それよりむしろ、私たちは、提案します｜総称プライアを組み込むことを｜上に列挙したような｜表現-学習アルゴリズムに｜。

Smoothness based learners (such as kernel machines) and linear models can still be useful on top of such learned representations.
滑らかさをベースにした学習者たち (カーネル機械など) や線形モデルたちは、そのような学習された表現のトップになおも有用でありえます。

In fact, the combination of learning a representation and kernel machine is equivalent to learning the kernel, i.e., the feature space.
実際、表現を学習することと、カーネル機械の組み合わせは、等価です｜核、すなわち、特徴空間を学ぶことに｜。
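The equivalence between a representation and a kernel rests on the identity k(x, y) = ⟨φ(x), φ(y)⟩: once a feature map φ is chosen, the kernel is determined. The sketch below checks this for a fixed, hand-written φ (the degree-2 polynomial feature map in two dimensions), used here only as an illustration; in representation learning φ would be learned from data rather than written by hand.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Hypothetical feature map: degree-2 monomials of a 2-D input."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]


def induced_kernel(x, y):
    """Kernel defined by the feature space: a plain dot product after phi."""
    return dot(phi(x), phi(y))


x, y = [1.0, 2.0], [3.0, -1.0]
print(induced_kernel(x, y), dot(x, y) ** 2)   # the two values agree
```

A linear or kernel method applied on top of φ behaves exactly like a kernel machine with the induced kernel on the raw input, which is why choosing (or learning) the feature space amounts to choosing (or learning) the kernel.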

Kernel machines are useful, but they depend on a prior definition of a suitable similarity metric, or a feature space in which naive similarity metrics suffice.
カーネル機械は有用です。しかし、それらは依存します｜適切な類似性メトリックの先見的定義に、もしくは、ナイーブな類似性メトリックで十分な特徴空間に｜。

We would like to use the data, along with very generic priors, to discover those features, or equivalently, a similarity function.
私たちは、これらのデータを使って、非常に総称プライアとともに、発見します｜これらの特徴や、等価的に、類似性関数を｜。

3.3 Distributed representations　　分布表現

●Good representations are expressive, meaning that a reasonably-sized learned representation can capture a huge number of possible input configurations.
　良い表現は、表情豊かです。それは、意味します｜合理的な大きさの学習された表現は、莫大な数の可能な入力配置を捕獲できるということを｜。

A simple counting argument helps us to assess the expressiveness of a model producing a representation: how many parameters does it require compared to the number of input regions (or configurations) it can distinguish?
簡単な counting argument は、我々が評価することを助けます｜ある表現を生成するモデルの表現性を：すなわち、何個のパラメタをそれは要求するのか｜それが区別できる入力領域 (または配置)の数と比較して｜。

Learners of one-hot representations, such as traditional clustering algorithms, Gaussian mixtures, nearest-neighbor algorithms, decision trees, or Gaussian SVMs all require O(N) parameters (and/or O(N) examples) to distinguish O(N) input regions.
One-hot 表現、例えば、伝統的クラスタリング･アルゴリズム、ガウシアン混合、最近傍アルゴリズム、決定木、ガウシアンSVM、などの学習者は、みな、必要とします｜O(N) 個のパラメタ (and/or O(N) 個の例) を｜O(N) 個の入力領域を区別するために｜。

One could naively believe that one cannot do better.
人は無邪気に信じます｜これよりもよくはできないことを｜。

However, RBMs, sparse coding, auto-encoders or multi-layer neural networks can all represent up to O(2^k) input regions using only O(N) parameters (with k the number of non-zero elements in a sparse representation, and k = N in non-sparse RBMs and other dense representations).
しかし、RBM、疎コーディング、自己符号化器、多重層ニューラルネットワークは、みな、O(2^k) 個までの入力領域を表すことができます｜O(N) 個のパラメタだけを使いながら｜(k は疎表現における非ゼロ要素の数で、非疎なRBMや他の密な表現では k = N)｜。

These are all distributed or sparse representations.
これらはみな分布表現か、疎表現です。
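The counting argument can be checked by brute force on a toy case. In the sketch below, k threshold features (one per input coordinate, a deliberately simple assumption) assign a binary code to each input; counting the distinct codes over one point per orthant shows how many regions k features tell apart.

```python
from itertools import product


def binary_code(x):
    """One threshold feature per coordinate: unit i fires when x[i] > 0."""
    return tuple(int(xi > 0) for xi in x)


k = 10
inputs = product([-1.0, 1.0], repeat=k)      # one sample point per orthant
codes = {binary_code(x) for x in inputs}
n_regions = len(codes)

print(f"{k} binary features distinguish {n_regions} regions")
print(f"a one-hot code needs {n_regions} units for the same job")
```

With k = 10 the distributed code separates 2^10 = 1024 regions, whereas a one-hot learner such as a clustering would need one centroid (and at least one example) per region.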

The generalization of clustering to distributed representations is multi-clustering, where either several clusterings take place in parallel or the same clustering is applied on different parts of the input, such as in the very popular hierarchical feature extraction for object recognition based on a histogram of cluster categories detected in different patches of an image (Lazebnik et al., 2006; Coates and Ng, 2011a).
クラスタリングの分布表現への一般化は、多重クラスタリングです。

The exponential gain from distributed or sparse representations is discussed further in section 3.2 (and Figure 3.2) of Bengio (2009).
分布表現や疎表現による指数関数的な利得は、Bengio (2009) の3.2節 (および図3.2) でさらに議論されています。

It comes about because each parameter (e.g. the parameters of one of the units in a sparse code, or one of the units in a Restricted Boltzmann Machine) can be re-used in many examples that are not simply near neighbors of each other, whereas with local generalization, different regions in input space are basically associated with their own private set of parameters, e.g., as in decision trees, nearest-neighbors, Gaussian SVMs, etc.
それは起こります。なぜなら、各パラメタは、多くの例において再利用できるからです。

In a distributed representation, an exponentially large number of possible subsets of features or hidden units can be activated in response to a given input.
分布表現において、特徴や隠れユニットの指数関数的に多数の可能サブセットは、活性化できます｜与えられた入力に応答して｜。

In a single-layer model, each feature is typically associated with a preferred input direction, corresponding to a hyperplane in input space, and the code or representation associated with that input is precisely the pattern of activation (which features respond to the input, and how much).
単一層モデルにおいて、各特徴は、典型的に関係つけられています｜好みの入力方向に｜入力空間における超平面に対応する｜。そして、その入力に関係するコードないし表現は、まさに活性化のパターン (どの特徴が入力に応答するか、そしてどの程度か) です。

This is in contrast with a non-distributed representation such as the one learned by most clustering algorithms, e.g., k-means, in which the representation of a given input vector is a one-hot code identifying which one of a small number of cluster centroids best represents the input.
これは、対照的です｜非分布表現 ((ほとんどのクラスタリング･アルゴリズムで学習されたものとか)) とは｜。
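The contrast between the two kinds of code can be seen on a single input. The sketch below computes a k-means-style one-hot code (index of the nearest centroid) and a distributed code (rectified response of several linear features, each tied to an input direction). The centroids, directions and biases are made-up values, and the rectifier is just one possible choice of activation.

```python
def one_hot_code(x, centroids):
    """Nearest-centroid code, as k-means would assign at test time."""
    d2 = [sum((a - c) ** 2 for a, c in zip(x, cen)) for cen in centroids]
    best = d2.index(min(d2))
    return [1 if i == best else 0 for i in range(len(centroids))]


def distributed_code(x, directions, biases):
    """Pattern of activation: which features respond to x, and how much."""
    return [max(0.0, sum(w * a for w, a in zip(d, x)) + b)
            for d, b in zip(directions, biases)]


x = [0.9, 0.8]
centroids = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
directions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
biases = [0.0, 0.0, 0.0]

print(one_hot_code(x, centroids))              # exactly one unit is on
print(distributed_code(x, directions, biases)) # several units respond at once
```

The one-hot code only says which centroid won; the distributed code keeps the joint pattern of responses, so different combinations of active features can label different inputs.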

3.4 Depth and abstraction 　　深さと抽象作用

●Depth is a key aspect to representation learning strategies we consider in this paper.
　深さは、主要なアスペクト (様相) です｜表現学習戦略の｜この論文で考察する｜。

As we will discuss, deep architectures are often challenging to train effectively and this has been the subject of much recent research and progress.
これから議論するように、深層アーキテクチャは、しばしば挑戦的です｜効果的に訓練することに。そしてこれは主題でした｜多くの最近の研究やその成果の｜。

However, despite these challenges, they carry two significant advantages that motivate our long-term interest in discovering successful training strategies for deep architectures.
しかし、これらの挑戦にもかかわらず、これらは持っています｜二つの顕著な利点を｜私たちの長期的な興味を動機つけてくれる｜成功する訓練方策をみつけることに｜深層の建築様式にとって｜。

These advantages are: (1) deep architectures promote the re-use of features, and (2) deep architectures can potentially lead to progressively more abstract features at higher layers of representations (more removed from the data).
これらの利点とは：(1) 深層の建築様式は特徴の再利用を促進すること、そして (2) 深層の建築様式は潜在的に、導くことができること｜漸進的により抽象的な特徴に｜表現のより高位の層において (データからより離れて)｜。

Feature re-use. 　特徴の再利用

●The notion of re-use, which explains the power of distributed representations, is also at the heart of the theoretical advantages behind deep learning, i.e., constructing multiple levels of representation or learning a hierarchy of features.
　再利用の概念｜分布表現の能力を説明してくれる｜、は、あります｜理論的優位点の心髄にも｜深層学習の背後にある｜、すなわち、多重レベルの表現を構築したり特徴の階層を学習することの｜。

The depth of a circuit is the length of the longest path from an input node of the circuit to an output node of the circuit.
回路の深度は、最長パスの長さです｜回路の入力ノードから回路の出力ノードまでの｜。

The crucial property of a deep circuit is that its number of paths, i.e., ways to re-use different parts, can grow exponentially with its depth.
深層回路の決定的な特性は、パスの数、すなわち、様々なパーツを再利用する仕方、がその深度とともに指数関数的に増加することです。
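Both definitions above, depth as the longest input-to-output path and the number of paths as the number of ways to re-use parts, can be computed on a small circuit. The sketch below builds a fully connected layered circuit (a simplifying assumption; real circuits need not be layered) and counts both quantities by memoized recursion.

```python
from functools import lru_cache


def layered_circuit(width, n_layers):
    """Fully connected layered DAG, as a map node -> list of predecessors.

    Layer 0 holds the inputs; a single output node reads the last layer.
    """
    preds = {(0, i): [] for i in range(width)}
    for layer in range(1, n_layers + 1):
        for i in range(width):
            preds[(layer, i)] = [(layer - 1, j) for j in range(width)]
    preds["out"] = [(n_layers, j) for j in range(width)]
    return preds


def depth_and_paths(preds, node="out"):
    """Return (longest path length, number of input-to-node paths)."""
    @lru_cache(maxsize=None)
    def visit(n):
        if not preds[n]:              # an input node
            return 0, 1
        results = [visit(p) for p in preds[n]]
        return 1 + max(d for d, _ in results), sum(c for _, c in results)
    return visit(node)


for n_layers in range(1, 6):
    circuit = layered_circuit(width=2, n_layers=n_layers)
    depth, paths = depth_and_paths(circuit)
    print(f"nodes={len(circuit):2d}  depth={depth}  paths={paths}")
```

In this toy circuit the number of nodes grows linearly with depth (two per layer) while the number of paths doubles with every added layer, which is the exponential re-use the text refers to.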

Formally, one can change the depth of a given circuit by changing the definition of what each node can compute, but only by a constant factor.
形式的には、人は変えることができます｜与えられた回路の深度を｜変更することにより｜各ノードが計算できるものの定義を、一定の係数倍だけですが｜。

The typical computations we allow in each node include: weighted sum, product, artificial neuron model (such as a monotone nonlinearity on top of an affine transformation), computation of a kernel, or logic gates.
典型的な計算｜各ノードにおいて私たちが許す｜、は、含みます｜重み付き総和、積、人工ニューロンモデル (アフィン変換の上に単調な非線形性を重ねたもののような)、カーネルの計算、または、論理ゲートを｜。

Theoretical results clearly show families of functions where a deep representation can be exponentially more efficient than one that is insufficiently deep (Hastad, 1986; Hastad and Goldmann, 1991; Bengio et al., 2006a; Bengio and LeCun, 2007; Bengio and Delalleau, 2011).
理論的結論は、明確に示します｜関数族を｜深層表現が指数関数的により効果的になりえる｜不十分に深層な関数よりも｜。

If the same family of functions can be represented with fewer parameters (or more precisely with a smaller VC-dimension), learning theory would suggest that it can be learned with fewer examples, yielding improvements in both computational efficiency (less nodes to visit) and statistical efficiency (less parameters to learn, and re-use of these parameters over many different kinds of inputs).
もし同じ関数族が、表現されうるのであれば｜より少ないパラメタで (より正確には、より小さいVC次元で) ｜、学習理論は、示唆するでしょう｜その関数族は学習されうることを｜より少ない例で｜改良をもたらしながら｜計算効率 (訪問するノードの減少) と統計的効率 (学習するパラメタ数の減少と、たくさんの異なる種類の入力にわたってこれらのパラメタを再利用すること) の両方において｜。

Abstraction and invariance. 　抽象作用と不変

●Deep architectures can lead to abstract representations because more abstract concepts can often be constructed in terms of less abstract ones.
　深層アーキテクチャは、導くことができます｜抽象表現に｜より多く抽象的な概念は、より少なく抽象的な概念をもって構築することができるので｜。

In some cases, such as in the convolutional neural network (LeCun et al., 1998b), we build this abstraction in explicitly via a pooling mechanism (see section 11.2).
いくつかの場合、畳み込みニューラルネットワークにおけるような場合には、私たちは組み込みます｜この抽象作用を｜プーリングメカニズムを通して明示的に｜。

More abstract concepts are generally invariant to most local changes of the input.
より抽象的な概念は、一般的に、入力のほとんどの局所変化に対して不変です。

That makes the representations that capture these concepts generally highly non-linear functions of the raw input.
それは、します｜これらの概念を捕獲する表現を、一般的に高度に非線形な生入力の関数に｜。

This is obviously true of categorical concepts, where more abstract representations detect categories that cover more varied phenomena (e.g. larger manifolds with more wrinkles) and thus they potentially have greater predictive power.
これは、あきらかに範疇の概念には当てはまります。そこでは、より抽象的な表現が、検出し｜範疇を｜より変動ある現象 (例えば、より多くのシワを持つより大きな多様体) をカバーする｜、従って、潜在的により大きな予言能力をもちます。

Abstraction can also appear in high-level continuous-valued attributes that are only sensitive to some very specific types of changes in the input.
抽象作用は、また現れ得ます｜高レベルの連続値のアトリビュート(特性)に｜ある非常に特別な変化型にのみ敏感な｜入力において｜。

Learning these sorts of invariant features has been a long-standing goal in pattern recognition.
これらの種類の不変特徴を学習することは、長年にわたる目標でした｜パターン認識における｜。

3.5 Disentangling Factors of Variation 　　変動の因子のもつれ解き

●Beyond being distributed and invariant, we would like our representations to disentangle the factors of variation.
　分布し、不変であることを超えて、私たちは、望みます｜私たちの表現が変動因子のもつれを解くことを｜。

Different explanatory factors of the data tend to change independently of each other in the input distribution, and only a few at a time tend to change when one considers a sequence of consecutive real-world inputs.
種々の説明因子｜データの｜、は、変動する傾向にあります｜互いに独立に｜入力分布において｜、そして一度に少数の因子だけが、変動する傾向にあります｜人が一連の連続する実世界入力を検討するときに｜。

●Complex data arise from the rich interaction of many sources.
　複雑なデータは、起因します｜多くの源の盛んな相互作用から｜。

These factors interact in a complex web that can complicate AI-related tasks such as object classification.
これらの因子は相互作用します｜複雑なウェブ(網)において｜オブジェクト分類のようなＡＩ関係のタスクを複雑化する｜。

For example, an image is composed of the interaction between one or more light sources, the object shapes and the material properties of the various surfaces present in the image.
例えば、一つのイメージは、成ります｜一つまたは多数の光源の相互作用や、そのイメージにある様々な表面の形や材質から｜。

Shadows from objects in the scene can fall on each other in complex patterns, creating the illusion of object boundaries where there are none and dramatically affecting the perceived object shape.
オブジェクトからの影｜そのシーンにおける｜、は、落ちます｜お互いのうえに、複雑なパターンで｜錯覚をつくりながら｜オブジェクト境界の｜境界など無いところに｜、そして、劇的に、影響を与えます｜知覚されるオブジェクトの形に｜。

How can we cope with these complex interactions?
どうして対処できるのでしょう｜こんな複雑な相互作用に｜？

How can we disentangle the objects and their shadows?
どうしてもつれを解くことができるのでしょう｜オブジェクトとその影の｜？

Ultimately, we believe the approach we adopt for overcoming these challenges must leverage the data itself, using vast quantities of unlabeled examples, to learn representations that separate the various explanatory sources.
結局、私たちは、信じます｜これらの挑戦を克服するために私たちが採用する方法は、データ自身を利活用しなければならないことを｜大量の名前のついていない例を使いながら｜様々な説明的源を分離する表現を学習するために｜。

Doing so should give rise to a representation significantly more robust to the complex and richly structured variations extant in natural data sources for AI-related tasks.
そうすることは、うみだすはずです｜著しくより頑強な表現を｜複雑で豊かに構造された変動に対して｜自然のデータ源に現存している｜ＡＩ関係のタスクのために｜。

●It is important to distinguish between the related but distinct goals of learning invariant features and learning to disentangle explanatory factors.
　重要です｜区別することは｜関係はあるがはっきり異なる二つの目標を｜不変な特徴を学習することと、説明的因子のもつれを解くことを学習することという｜。

The central difference is the preservation of information.
中心的違いは、情報の保存です。

Invariant features, by definition, have reduced sensitivity in the direction of invariance.
不変の特徴は、定義により、減少した感度をもちます｜不変の方向において｜。

This is the goal of building features that are insensitive to variation in the data that are uninformative to the task at hand.
これは、特徴を構築することの目標です｜データの変動に感度の無い｜てもとのタスクには役に立たない｜。

Unfortunately, it is often difficult to determine a priori which set of features and variations will ultimately be relevant to the task at hand.
不幸なことに、しばしば困難です｜先見的に決定することは｜特徴や変動のどのセットが、究極的に、てもとのタスクに関連するのかを｜。

Further, as is often the case in the context of deep learning methods, the feature set being trained may be destined to be used in multiple tasks that may have distinct subsets of relevant features.
さらに、深層学習の方法の文脈においてよくあることですが、訓練される特徴のセットは、運命づけられているかもしれません｜多重のタスクで用いられるように｜関連する特徴の異なるサブセットをもつかもしれない｜。

Considerations such as these lead us to the conclusion that the most robust approach to feature learning is to disentangle as many factors as possible, discarding as little information about the data as is practical.
これらのような考察は、私たちを導きます｜結論に｜特徴学習の最も頑強な方法は、なるべく沢山の因子のもつれを解くことであるという｜実際的になるべく少しの情報しか切り捨てないで｜。

If some form of dimensionality reduction is desirable, then we hypothesize that the local directions of variation least represented in the training data should be first to be pruned out (as in PCA, for example, which does it globally instead of around each example).
もしなんらかの次元縮小が望ましいのでしたら、そのとき私たちは仮定します｜訓練データに最も表現されていない変動の局所方向は、最初に刈り取られるべきであると (例えば、各例の周りではなくグローバルにそれを行うPCA(主成分解析)におけるように)｜。

3.6 Good criteria for learning representations? 　　表現学習のための良い基準は？

●One of the challenges of representation learning that distinguishes it from other machine learning tasks such as classification is the difficulty in establishing a clear objective, or target for training.
　表現学習の挑戦の一つ｜それを分類のような他の機械学習タスクから区別する｜、は、です｜明白な目的、すなわち訓練の目標を確立することの困難さ｜。

In the case of classification, the objective is (at least conceptually) obvious: we want to minimize the number of misclassifications on the training dataset.
分類の場合、目的は (少なくとも観念的には) 明白です。私たちは、最小化することを欲します｜分類失敗の数を｜訓練データセットにおける｜。

In the case of representation learning, our objective is far-removed from the ultimate objective, which is typically learning a classifier or some other predictor.
表現学習の場合、私たちの目標は、遠く離れます｜究極目的から｜典型的には、分類子や予測子を学習するという｜。

Our problem is reminiscent of the credit assignment problem encountered in reinforcement learning.
私たちの問題は、思い起こさせます｜クレジット･アサインメント問題を｜強化学習における｜。

We have proposed that a good representation is one that disentangles the underlying factors of variation, but how do we translate that into appropriate training criteria?
私たちは、提案しました｜良い表現は、表現です｜潜在する変動因子のもつれを解く｜。しかしどうして私たちは、それを適切な訓練クライテリアに変換するのでしょうか？

Is it even necessary to do anything but maximize likelihood under a good model or can we introduce priors such as those enumerated above (possibly data-dependent ones) that help the representation better do this disentangling?
必要ですらあるのでしょうか｜尤度を最大化すること以外の何かをすることは｜良いモデルのもとで｜？または、私たちは、導入できるのでしょうか｜先に列挙したようなプライア (多分データに依存するもの) を｜表現がこのもつれ解きをよりよく行うように手伝ってくれる｜？

This question remains clearly open but is discussed in more detail in Sections 3.5 and 11.4.
この問題は、明白に未決のまま残ります。しかし、3.5節と11.4節でより詳しく議論します。

4 BUILDING DEEP REPRESENTATIONS 　　深層表現の構築

●In 2006, a breakthrough in feature learning and deep learning was initiated by Geoff Hinton and quickly followed up in the same year (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007), and soon after by Lee et al. (2008) and many more later.
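　2006年に、特徴学習や深層学習におけるブレークスルーが、開始されました｜ジェフ・ヒントンによって｜、そして、同じ年のうちにすぐにフォローアップされ (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007)、その後まもなく Lee et al. (2008) によって、さらにその後多くの研究によってフォローアップされました。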

It has been extensively reviewed and discussed in Bengio (2009).
それは、Bengio (2009) で広範囲にレビューされ、議論されました。

A central idea, referred to as greedy layerwise unsupervised pre-training, was to learn a hierarchy of features one level at a time, using unsupervised feature learning to learn a new transformation at each level to be composed with the previously learned transformations; essentially, each iteration of unsupervised feature learning adds one layer of weights to a deep neural network.
中心的アイデア｜貪欲な層毎の教師なし事前学習と名づけられた｜、は、学習することでした｜特徴の階層を｜一度に一レベルずつ｜教師なしの特徴学習を使いながら｜各レベルで新しい変形を学習するために｜以前に学習された変形と合成される｜。本質的に、教師なし特徴学習の各イタレーションは、一層の重みを深層ニューラルネットワークに付加します。
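
The layer-at-a-time procedure can be sketched as follows. This is a minimal illustration, not the method of any cited paper: the single-layer learner chosen here (a tied-weight auto-encoder with a sigmoid encoder and linear decoder, trained by plain SGD), the toy data, and all function names and hyperparameters are assumptions made for the example.

```python
import math
import random


def encode(x, W, b):
    """Sigmoid encoder for one layer: h = sigmoid(W x + b)."""
    return [1.0 / (1.0 + math.exp(-(bj + sum(w * xi for w, xi in zip(row, x)))))
            for row, bj in zip(W, b)]


def train_layer(data, n_hidden, epochs=200, lr=0.1, seed=0):
    """Fit one tied-weight auto-encoder layer by SGD on squared reconstruction error."""
    rng = random.Random(seed)
    n_in = len(data[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    b = [0.0] * n_hidden  # encoder bias
    c = [0.0] * n_in      # decoder bias
    for _ in range(epochs):
        for x in data:
            h = encode(x, W, b)
            # Linear decoder that re-uses the transposed encoder weights.
            recon = [c[i] + sum(W[j][i] * h[j] for j in range(n_hidden))
                     for i in range(n_in)]
            err = [recon[i] - x[i] for i in range(n_in)]
            # Gradient of the error with respect to the hidden pre-activations.
            dh = [sum(W[j][i] * err[i] for i in range(n_in)) * h[j] * (1.0 - h[j])
                  for j in range(n_hidden)]
            for j in range(n_hidden):
                for i in range(n_in):
                    W[j][i] -= lr * (err[i] * h[j] + dh[j] * x[i])
                b[j] -= lr * dh[j]
            for i in range(n_in):
                c[i] -= lr * err[i]
    return W, b


def greedy_pretrain(data, layer_sizes):
    """Train layers one at a time; each new layer sees the previous layer's output."""
    layers, reps = [], data
    for n_hidden in layer_sizes:
        W, b = train_layer(reps, n_hidden)
        layers.append((W, b))
        reps = [encode(x, W, b) for x in reps]  # input for the next level
    return layers, reps


data = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
layers, top_features = greedy_pretrain(data, layer_sizes=[3, 2])
print(len(layers), [len(h) for h in top_features])
```

Each call to `train_layer` adds one layer of weights, and the lower layers stay fixed while the next one is trained. The resulting `layers` could then initialize a deeper supervised or generative model, as the following sentences describe.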

Finally, the set of layers could be combined to initialize a deep supervised predictor, such as a neural network classifier, or a deep generative model, such as a Deep Boltzmann Machine (Salakhutdinov and Hinton, 2009).
最後に、層のセットは、組み合わされて初期化します｜深層の教師つき予測子(ニューラルネットワークの分類器のような)を、または、深層の生成的モデル(深層のボルツマンマシンのような)を｜。

●This paper is mostly about feature learning algorithms that can be used to form deep architectures.
　この論文は、大部分、です｜特徴学習アルゴリズムについて｜深層アーキテクチャを形成するために使うことができる｜。

In particular, it was empirically observed that layerwise stacking of feature extraction often yielded better representations, e.g., in terms of classification error (Larochelle et al., 2009; Erhan et al., 2010b), quality of the samples generated by a probabilistic model (Salakhutdinov and Hinton, 2009) or in terms of the invariance properties of the learned features (Goodfellow et al., 2009).
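特に、経験的に、観察されました｜特徴抽出の層ごとのスタッキングは、しばしばより良い表現を生成することが｜例えば、分類誤差 (Larochelle et al., 2009; Erhan et al., 2010b)、確率モデルによって生成されるサンプルの質 (Salakhutdinov and Hinton, 2009)、あるいは学習された特徴の不変特性 (Goodfellow et al., 2009) に関して｜。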

Whereas this section focuses on the idea of stacking single-layer models, Section 10 follows up with a discussion on joint training of all the layers.
この節は、単一層モデルのスタッキングというアイデアに焦点をあてますが、10節はそれをフォローアップして、全ての層の連合訓練を議論します。

●After greedy layerwise unsupervised pre-training, the resulting deep features can be used either as input to a standard supervised machine learning predictor (such as an SVM) or as initialization for a deep supervised neural network (e.g., by appending a logistic regression layer or purely supervised layers of a multi-layer neural network).
　貪欲な層毎の教師なし事前学習の後、得られた深層特徴は、使うことができます｜どちらかに｜標準の教師つき機械学習予測子(例えばSVM)への入力としてか、もしくは、教師つき深層ニューラルネットの初期化としてか (例えば、ロジスティック回帰層、あるいは多層ニューラルネットワークの純粋に教師つきの層を追加することによって)｜。

The layerwise procedure can also be applied in a purely supervised setting, called the greedy layerwise supervised pre-training (Bengio et al., 2007).
層ごとの手続きは、適用できます｜純粋に教師つきの場面でも｜貪欲層ごと教師つき事前学習とよばれる｜。

For example, after the first one-hidden-layer MLP is trained, its output layer is discarded and another one-hidden-layer MLP can be stacked on top of it, etc.
例えば、最初の１隠れ層MLPを訓練したあと、その出力層を切り取り、別の１隠れ層MLPをその上に積み重ねる、などなど。

Although results reported in Bengio et al. (2007) were not as good as for unsupervised pre-training, they were nonetheless better than without pretraining at all.
Bengio et al. (2007) で報告された結果は、教師なし事前学習の場合ほど良くはありませんでしたが、それでも、事前学習がまったくない場合よりは良くなっていました。

Alternatively, the outputs of the previous layer can be fed as extra inputs for the next layer (in addition to the raw input), as successfully done in Yu et al. (2010).
別の方法として、前の層の出力は、与えることができます｜次の層への追加入力として (生入力への追加として) ｜Yuが成功裏に実施したように｜。

Another variant (Seide et al., 2011b) pre-trains in a supervised way all the previously added layers at each step of the iteration, and in their experiments this discriminant variant yielded better results than unsupervised pre-training.
もう一つの変種 (Seide et al., 2011b) は、事前学習します｜教師つきのやり方で、すべての以前に追加した層をイタレーションの各ステップにおいて｜、そして、彼らの実験では、この判別的な変種は、教師なし事前学習よりもより良い結果を与えました。

●Whereas combining single layers into a supervised model is straightforward, it is less clear how layers pre-trained by unsupervised learning should be combined to form a better unsupervised model.
　単一層を教師つきモデルに結合させることは、単純ですが、教師なし学習で事前学習した層をいかにして結合して教師なしのより良いモデルになるのかは、そんなに明白ではありません。

We cover here some of the approaches to do so, but no clear winner emerges and much work has to be done to validate existing proposals or improve them.
そうするためにいくつかの方法をためしてみますが、明白な勝者はでてきません。存在している提案を実証したり、改良したりするには、多くの研究が必要です。

●The first proposal was to stack pre-trained RBMs into a Deep Belief Network (Hinton et al., 2006) or DBN, where the top layer is interpreted as an RBM and the lower layers as a directed sigmoid belief network.
　最初の提案は、事前学習したRBMを積み重ねてDBNにすることでした。DBNでは、一番上の層はRBMと解釈され、その下の層は有向のシグモイド信念ネットワークと解釈されます。

However, it is not clear how to approximate maximum likelihood training to further optimize this generative model.
しかし、明確ではありません｜いかにして最大尤度訓練を近似してこの生成モデルをさらに最適化するかが｜。

One option is the wake-sleep algorithm (Hinton et al., 2006) but more work should be done to assess the efficiency of this procedure in terms of improving the generative model.
一つの選択は、起きる-眠るアルゴリズムですが、さらなる研究が必要です｜この手続きの有効性を評価するためには｜生成モデルを改良することに関して｜。

●The second approach that has been put forward is to combine the RBM parameters into a Deep Boltzmann Machine (DBM), by basically halving the RBM weights to obtain the DBM weights (Salakhutdinov and Hinton, 2009).
　第二の方法｜提出された｜、は、RBMパラメタをDBMに結びつけることです｜基本的にRBM重みを半分にすることによって｜DBM重みを得るために｜。

The DBM can then be trained by approximate maximum likelihood as discussed in more details later (Section 10.2).
DBMは、学習できます｜近似最大尤度によって｜のちほどより詳しく議論するように｜。

This joint training has brought substantial improvements, both in terms of likelihood and in terms of classification performance of the resulting deep feature learner (Salakhutdinov and Hinton, 2009).
この共同学習は、かなりの改良をもたらしました｜尤度に関して、および、結果として生じる特徴学習者の分類性能に関して｜。

●Another early approach was to stack RBMs or autoencoders into a deep auto-encoder (Hinton and Salakhutdinov, 2006).
　初期の別のアプローチは、でした｜RBMまたは自己符号化器を積み重ねて深層の自己符号化器にすること｜。

If we have a series of encoder-decoder pairs $(f^{(i)}(\cdot), g^{(i)}(\cdot))$, then the overall encoder is the composition of the encoders, $f^{(N)}(\ldots f^{(2)}(f^{(1)}(\cdot)))$, and the overall decoder is its “transpose” (often with transposed weight matrices as well), $g^{(1)}(g^{(2)}(\ldots g^{(N)}(\cdot)))$.
もし一連の符号化器-復号器対 $(f^{(i)}(\cdot), g^{(i)}(\cdot))$ があれば、全体の符号化器は、符号化器の合成 $f^{(N)}(\ldots f^{(2)}(f^{(1)}(\cdot)))$ であり、全体のデコーダー(復号器)は、そのトランスポーズ(転置) $g^{(1)}(g^{(2)}(\ldots g^{(N)}(\cdot)))$ です (しばしば重み行列も転置されます)。
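
The order of composition can be made concrete with a short sketch. The three encoder-decoder pairs below are toy invertible scalar functions chosen only for illustration, and `compose` and `stack_autoencoder` are hypothetical helper names.

```python
from functools import reduce


def compose(functions):
    """Return a function that applies `functions` in order, left to right."""
    return lambda x: reduce(lambda value, fn: fn(value), functions, x)


def stack_autoencoder(pairs):
    """Build the overall encoder and decoder from (f_i, g_i) pairs."""
    encoders = [f for f, _ in pairs]
    decoders = [g for _, g in pairs]
    encoder = compose(encoders)                  # f_N(... f_2(f_1(x)))
    decoder = compose(list(reversed(decoders)))  # g_1(g_2(... g_N(h)))
    return encoder, decoder


# Toy pairs: each g_i exactly inverts its f_i (real auto-encoders only approximate this).
pairs = [
    (lambda x: x + 1, lambda h: h - 1),
    (lambda x: 2 * x, lambda h: h / 2),
    (lambda x: x - 5, lambda h: h + 5),
]
encoder, decoder = stack_autoencoder(pairs)
code = encoder(4)
print(code, decoder(code))
```

The decoders are applied in the reverse order of the encoders, which is why the innermost decoder in the formula is the one belonging to the top layer.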

The deep auto-encoder (or its regularized version, as discussed in Section 7.2) can then be jointly trained, with all the parameters optimized with respect to a global reconstruction error criterion.
深層の自己符号化器 (もしくはその正規化版、7.2節で説明) は、共同で学習され得ます｜グローバル再構築誤差基準に関して最適化されたすべてのパラメタとともに｜。

More work on this avenue clearly needs to be done, and it was probably avoided by fear of the challenges in training deep feedforward networks, discussed in Section 10 along with very encouraging recent results.
この道に沿ってもっと研究がなされるべきです、そして、それは多分避けられてきました｜深層のフィードフォワードのネットワークを学習するという挑戦の恐れによって｜10節で、非常に有望な最近の結果とともに議論される｜。

●Yet another recently proposed approach to training deep architectures (Ngiam et al., 2011) is to consider the iterative construction of a free energy function (i.e., with no explicit latent variables, except possibly for a top-level layer of hidden units) for a deep architecture as the composition of transformations associated with lower layers, followed by top-level hidden units.
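　さらにもう一つの、深層アーキテクチャを学習するために最近提案された方法 (Ngiam et al., 2011) は、自由エネルギー関数 (すなわち、最上位の隠れユニット層を除いて明示的な潜在変数をもたないもの) の反復的な組み立てを考えることです｜深層アーキテクチャのための｜下位層に関連づけられた変換の合成として｜その後に最上位の隠れユニットが続く｜。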

The question is then how to train a model defined by an arbitrary parametrized (free) energy function.
問題は、そのとき、どうやって学習させるかということです｜任意にパラメタ化された(自由)エネルギー関数によって定義されたモデルを｜。

Ngiam et al. (2011) have used Hybrid Monte Carlo (Neal, 1993), but other options include contrastive divergence (Hinton, 1999; Hinton et al., 2006), score matching (Hyvärinen, 2005; Hyvärinen, 2008), denoising score matching (Kingma and LeCun, 2010; Vincent, 2011), ratio-matching (Hyvärinen, 2007) and noise-contrastive estimation (Gutmann and Hyvärinen, 2010).
Ngiamらは、ハイブリッド・モンテカルロを使いましたが、その他のオプションは、対照発散、スコア・マッチング、ノイズ除去スコア・マッチング、レイショ・マッチング、ノイズ対照推定などを含みます。

5 SINGLE-LAYER LEARNING MODULES　　単一層学習モジュール

●Within the community of researchers interested in representation learning, there have developed two broad parallel lines of inquiry: one rooted in probabilistic graphical models and one rooted in neural networks.
　表現学習に興味をもつ研究者コミュニティのなかでは、二本の幅広い平行線の探求がおこりました：ひとつは、確率グラフィカルモデルに根付いていて、もう一つはニューラルネットワークに根付いています。

Fundamentally, the difference between these two paradigms is whether the layered architecture of a deep learning model is to be interpreted as describing a probabilistic graphical model or as describing a computation graph.
根本的に、これら二つのパラダイムの間の違いは、深層学習モデルの層状アーキテクチャが、確率グラフィカルモデルを記述するとして、もしくは、計算グラフを記述するとして解釈できるかどうかということです。

In short, are hidden units considered latent random variables or as computational nodes?
簡単に言うと、隠れたユニットは、潜在ランダム変数か、計算ノードとみなされるのでしょうか。

●To date, the dichotomy between these two paradigms has remained in the background, perhaps because they appear to have more characteristics in common than separating them.
　今日まで、これら二つのパラダイムの二分は、背景にとどまってきました｜多分、それらは、両者を分ける特徴よりも、共通の特徴を多くもつようにみえるからでしょう｜。

We suggest that this is likely a function of the fact that much recent progress in both of these areas has focused on single-layer greedy learning modules and the similarities between the types of single-layer models that have been explored: mainly, the restricted Boltzmann machine (RBM) on the probabilistic side, and the auto-encoder variants on the neural network side.
私たちは示唆します｜これは多分、次の事実の結果であろうと｜両方のエリアにおける最近の進歩の多くが、単一層貪欲学習モジュールと、探求されてきた単一層モデルのタイプの間の類似性に焦点をあててきたという｜。主として、確率側は限定ボルツマンマシン(RBM)に、ニューラルネット側は、自己符号化器バリアントに。

Indeed, as shown by one of us (Vincent, 2011) and others (Swersky et al., 2011), in the case of the restricted Boltzmann machine, training the model via an inductive principle known as score matching (Hyvärinen, 2005) (to be discussed in sec. 6.4.3) is essentially identical to applying a regularized reconstruction objective to an autoencoder.
実際、私たちの一人 (Vincent, 2011) や他の人たち (Swersky et al., 2011) が示したように、限定ボルツマンマシンの場合、スコア・マッチングとして知られている誘導(帰納)原理 (6.4.3節で議論) を通してモデルを訓練することは、正規化された再構築目的を自己符号化器に適用することと本質的に同一なのです。

Another strong link between pairs of models on both sides of this divide is when the computational graph for computing representation in the neural network model corresponds exactly to the computational graph that corresponds to inference in the probabilistic model, and this happens to also correspond to the structure of graphical model itself (e.g., as in the RBM).
もう一つの強いリンク｜二つのモデルの間の｜この分離の両側の｜、は、です、とき｜（｜計算グラフ｜表現を計算するための｜ニューラルネットワークモデルの｜、が、厳密に対応する｜計算グラフに｜確率モデルにおける推論に対応する｜）、そして、これは、対応することもあります｜グラフィカルモデル自身の構造に｜RBMにおけるように｜。

●The connection between these two paradigms becomes more tenuous when we consider deeper models where, in the case of a probabilistic model, exact inference typically becomes intractable.
　これら二つのパラダイムの間の結合は、さらに希薄になります｜より深層のモデルを考える時に｜確率モデルの場合で、厳密な推論ができなくなるような｜。

In the case of deep models, the computational graph diverges from the structure of the model.
深層モデルの場合、計算グラフは、モデルの構造から発散します。

For example, in the case of a deep Boltzmann machine, unrolling variational (approximate) inference into a computational graph results in a recurrent graph structure.
例えば、深層ボルツマンマシンの場合、変分(近似)推論を計算グラフに展開することは、再帰的(リカレント)なグラフ構造になります。

We have performed preliminary exploration (Savard, 2011) of deterministic variants of deep auto-encoders whose computational graph is similar to that of a deep Boltzmann machine (in fact very close to the mean-field variational approximations associated with the Boltzmann machine), and that is one interesting intermediate point to explore (between the deterministic approaches and the graphical model approaches).
私たちは、実施しました｜予備的な探求を｜深層自己符号化器の決定論的バリアント(変量)の｜計算グラフが深層ボルツマンマシンのそれに似ている｜。そしてそれは、です｜ある興味ある中間点｜(決定論的アプローチとグラフィカルモデル・アプローチの間の)探求の｜。

●In the next few sections we will review the major developments in single-layer training modules used to support feature learning and particularly deep learning.
　以後の節において、私たちは、レビューします｜単一層学習モデルの主要な発展を｜特徴学習や深層学習をサポートするために｜。

We divide these sections between (Section 6) the probabilistic models, with inference and training schemes that directly parametrize the generative - or decoding - pathway and (Section 7) the typically neural network-based models that directly parametrize the encoding pathway.
私たちはこれらの節を分割します｜6節の確率モデルと、7節のニューラルネットに基づいたモデルに｜。

Interestingly, some models, like Predictive Sparse Decomposition (PSD) (Kavukcuoglu et al., 2008) inherit both properties, and will also be discussed (Section 7.2.4).
面白いことに、いくつかのモデル、例えば予測疎分解 (PSD) (Kavukcuoglu et al., 2008) は、両方の特徴を引き継いでいて、これも議論します (7.2.4節)。

We then present a different view of representation learning, based on the associated geometry and the manifold assumption, in Section 8.
私たちは、そこで、提示します｜表現学習の異なる見解を｜関連する幾何と多様体仮定に基づく｜8節で｜。

●First, let us consider an unsupervised single-layer representation learning algorithm spanning all three views: probabilistic, auto-encoder, and manifold learning.
　最初に、考察しましょう｜教師なしの単一層表現学習アルゴリズムを｜これら三つの見方をまたぐ｜確率、自己符号化器、多様体学習という｜。

Principal Components Analysis 　主成分解析

●We will use probably the oldest feature extraction algorithm, principal components analysis (PCA), to illustrate the probabilistic, auto-encoder and manifold views of representation learning.
私たちは、用います｜多分、最も古い特徴抽出アルゴリズムである主成分解析(PCA)を｜説明するために｜表現学習の確率的、自己符号化器的、多様体的見方を｜。

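PCA learns a linear transformation $h = f(x) = W^T x + b$ of input $x \in \mathbb{R}^{d_x}$, where the columns of the $d_x \times d_h$ matrix $W$ form an orthogonal basis for the $d_h$ orthogonal directions of greatest variance in the training data.
PCAは、学習します｜入力 $x \in \mathbb{R}^{d_x}$ の線形変換 $h = f(x) = W^T x + b$ を｜。ここで、$d_x \times d_h$ 行列 $W$ の列は、直交基底を成します｜訓練データにおいて分散が最大となる $d_h$ 個の直交方向の｜。

The result is $d_h$ features (the components of representation $h$) that are decorrelated.
結果は、$d_h$ 個の特徴 (表現 $h$ の成分) で、それらは無相関化されています。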
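
As a minimal sketch of the mapping h = W^T x + b, the code below finds only the first principal direction (d_h = 1) of a small made-up 2-D dataset. It uses power iteration on the covariance matrix as a stand-in for a full eigendecomposition, and it takes b = -W^T mean so that the feature is centred; the data, the function names and these choices are all assumptions for illustration.

```python
import math


def pca_first_component(data, iterations=100):
    """Return (mean, w): the data mean and the unit direction of greatest variance."""
    n, d = len(data), len(data[0])
    mean = [sum(row[i] for row in data) / n for i in range(d)]
    centered = [[row[i] - mean[i] for i in range(d)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by the covariance and renormalize.
    # The start vector must not be orthogonal to the leading eigenvector.
    w = [1.0] + [0.0] * (d - 1)
    for _ in range(iterations):
        w = [sum(cov[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return mean, w


def encode(x, mean, w):
    """One PCA feature: h = w^T x + b with b = -w^T mean."""
    return sum(wi * (xi - mi) for wi, xi, mi in zip(w, x, mean))


data = [[0.0, 0.1], [1.0, 0.9], [2.0, 2.1], [3.0, 2.9]]
mean, w = pca_first_component(data)
features = [encode(x, mean, w) for x in data]
print(w, features)
```

The points lie close to the line y = x, so the recovered direction is close to (1, 1) normalized to unit length, and the single feature orders the points along that line.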

The three interpretations of PCA are the following:
PCAの三つの解釈は以下のとおりです。

a) it is related to probabilistic models (Section 6) such as probabilistic PCA, factor analysis and the traditional multivariate Gaussian distribution (the leading eigenvectors of the covariance matrix are the principal components);
a) それは、関係しています｜確率モデルに｜確率PCAや、因子解析や、伝統的な多変数ガウス分布(共分散行列の主要固有ベクトルは主成分です)｜。

b) the representation it learns is essentially the same as that learned by a basic linear auto-encoder (Section 7.2); and
b) それが学習する表現は、基本的な線形自己符号化器が学習するものと本質的に同じものです。

c) it can be viewed as a simple linear form of linear manifold learning (Section 8), i.e., characterizing a lower-dimensional region in input space near which the data density is peaked.
c) それは、みなすことができます｜単純線形形であると｜線形多様体学習の｜、すなわち、特徴づけることであると｜入力空間における低次元領域を｜その近くでデータ密度がピークをもつ｜。

Thus, PCA may be in the back of the reader’s mind as a common thread relating these various viewpoints.
かくして、PCAは、読者の心の後にいるかもしれません｜これら種々の観点に関係する共通の糸として｜。

Unfortunately the expressive power of linear features is very limited: they cannot be stacked to form deeper, more abstract representations since the composition of linear operations yields another linear operation.
不幸にも、線形特徴の表現力は非常に限られています：それらは積み重ねることができません｜より深層で、より抽象的な表現を構成するために｜線形作用を結合しても、別の線形作用しか生まれないので｜。
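
The collapse of stacked linear layers can be checked numerically. In this small sketch the matrices `W1` and `W2` and the input `x` are arbitrary illustrative values; inserting `tanh` between the layers is one example of a non-linearity and is an assumption of the example.

```python
import math


def matmul(A, B):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def apply(M, x):
    """Matrix-vector product M x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


W1 = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]  # first linear layer, 2 -> 3
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]    # second linear layer, 3 -> 2
x = [0.5, -2.0]

two_linear_layers = apply(W2, apply(W1, x))
single_layer = apply(matmul(W2, W1), x)  # the same map as one matrix
with_nonlinearity = apply(W2, [math.tanh(v) for v in apply(W1, x)])

print(two_linear_layers, single_layer, with_nonlinearity)
```

The first two results coincide because W2(W1 x) = (W2 W1) x, so the extra layer adds no expressive power. With a non-linearity between the layers the result generally differs from any single linear map applied to x.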

Here, we focus on recent algorithms that have been developed to extract non-linear features, which can be stacked in the construction of deep networks, although some authors simply insert a non-linearity between learned single-layer linear projections (Le et al., 2011c; Chen et al., 2012).
ここに、私たちは、焦点をあてます｜最近のアルゴリズムに｜非線形特徴を抽出するために開発された｜深層ネットワークを構築するために積み重ねることもできる｜著者のいくにんかは、学習された単一層線形投影の間に非線形性を単純に挿入するだけですが｜。

●Another rich family of feature extraction techniques that this review does not cover in any detail due to space constraints is Independent Component Analysis or ICA (Jutten and Herault, 1991; Bell and Sejnowski, 1997).
　もう一つの特徴抽出の豊かな族｜このレビューが、スペース制限のため詳しくはカバーしない｜、は、独立成分分析です。

Instead, we refer the reader to Hyvärinen et al. (2001a); Hyvärinen et al. (2009).
そのかわり、以下の文献を参照してください。

Note that, while in the simplest case (complete, noise-free) ICA yields linear features, in the more general case it can be equated with a linear generative model with non-Gaussian independent latent variables, similar to sparse coding (section 6.1.1), which result in non-linear features.
注意してください。最も単純な場合、ICAは、線形特徴を産みますが、より一般的な場合、それは、疎コーディングに似た、非ガウシアン独立潜在変数の線形生成モデルに等しくなるので、非線形な特徴をうみます。

Therefore, ICA and its variants like Independent and Topographic ICA (Hyvärinen et al., 2001b) can and have been used to build deep networks (Le et al., 2010, 2011c): see section 11.2.
ですから、ICAや、その変形 (独立およびトポグラフィックなICAのような) は、使用でき、使用されてきました｜深層ネットワーク構築するために｜。

The notion of obtaining independent components also appears similar to our stated goal of disentangling underlying explanatory factors through deep networks.
独立成分を得ようとする考えは、似ているようにみえます｜潜在する説明因子のもつれを解くという私たちの明言した目標に｜深層ネットワークをとおして｜。

However, for complex real-world distributions, it is doubtful that the relationship between truly independent underlying factors and the observed high-dimensional data can be adequately characterized by a linear transformation.
しかし、複雑な実世界分布にとって、疑わしいです｜真に独立な潜在因子と観測される高次元データの間の関係が線形変換によって適切に特徴づけることができるということは｜。

6. Probabilistic Models　　確率モデル

6.1 Directed Graphical Models

6.2 Undirected Graphical Models

6.3 Generalizations of the RBM to Real-valued data

6.4 RBM parameter estimation

7. Directly Learning A Parametric Map from Input to Representation　　入力から表現へパラメトリックマップを直接的に学習する

7.1 Auto-Encoders

7.2 Regularized Auto-Encoders

8. Representation Learning as Manifold Learning　　多様体学習としての表現学習

8.1 Learning a parametric mapping based on a neighborhood graph

8.2 Learning to represent non-linear manifolds

8.3 Leveraging the modeled tangent spaces

9. Connections between Probabilistic and Direct Encoding models　　確率モデルと直接符号化モデルとの結びつき

9.1 PSD: a probabilistic interpretation

9.2 Regularized Auto-Encoders Capture Local Structure of the Density

9.3 Learning Approximate Inference

9.4 Sampling Challenges

9.5 Evaluating and Monitoring Performance

10. Global Training of Deep Models　　深層モデルの大域的訓練

10.1 The Challenge of Training Deep Architectures

10.2 Joint Training of Deep Boltzmann Machines

11. Building-In Invariance　　不変の組み込み

11.1 Generating transformed examples

11.2 Convolution and pooling

11.3 Temporal coherence and slow features

11.4 Algorithms to Disentangle Factors of Variation

12. Conclusion　　結論

以下省略

ホームページアドレス: http://www.geocities.jp/think_leisurely/
