Author: Ethem Alpaydın | Series Editor: Thomas Dietterich; Associate Editors: Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns | Pages: 579 | Publisher: The MIT Press
The goal of building systems that can adapt to their environments and learn from their experience has attracted researchers from many fields, including computer science, engineering, mathematics, physics, neuroscience, and cognitive science. Out of this research has come a wide variety of learning techniques that are transforming many industrial and scientific fields. Recently, several research communities have converged on a common set of issues surrounding supervised, semi-supervised, unsupervised, and reinforcement learning problems. The MIT Press series on Adaptive Computation and Machine Learning seeks to unify the many diverse strands of machine learning research and to foster high-quality research and innovative applications. The MIT Press is pleased to publish this second edition of Ethem Alpaydın's introductory textbook. The book presents a readable and concise introduction to machine learning that reflects these diverse research strands while providing a unified treatment of the field. It covers all of the main problem formulations and introduces the most important algorithms and techniques, drawing on methods from computer science, neural computation, information theory, and statistics. The second edition expands and updates coverage in several areas, notably kernel machines and graphical models, which have advanced rapidly over the past five years. This updated work remains a compelling textbook for introductory machine learning courses at the undergraduate and beginning graduate level.
Contents
Series Foreword xvii
Figures xix
Tables xxix
Preface xxxi
Acknowledgments xxxiii
Notes for the Second Edition xxxv
Notations xxxix
1 Introduction 1
1.1 What Is Machine Learning? 1
1.2 Examples of Machine Learning Applications 4
1.2.1 Learning Associations 4
1.2.2 Classification 5
1.2.3 Regression 9
1.2.4 Unsupervised Learning 11
1.2.5 Reinforcement Learning 13
1.3 Notes 14
1.4 Relevant Resources 16
1.5 Exercises 18
1.6 References 19
2 Supervised Learning 21
2.1 Learning a Class from Examples 21
2.2 Vapnik-Chervonenkis (VC) Dimension 27
2.3 Probably Approximately Correct (PAC) Learning 29
2.4 Noise 30
2.5 Learning Multiple Classes 32
2.6 Regression 34
2.7 Model Selection and Generalization 37
2.8 Dimensions of a Supervised Machine Learning Algorithm 41
2.9 Notes 42
2.10 Exercises 43
2.11 References 44
3 Bayesian Decision Theory 47
3.1 Introduction 47
3.2 Classification 49
3.3 Losses and Risks 51
3.4 Discriminant Functions 53
3.5 Utility Theory 54
3.6 Association Rules 55
3.7 Notes 58
3.8 Exercises 58
3.9 References 59
4 Parametric Methods 61
4.1 Introduction 61
4.2 Maximum Likelihood Estimation 62
4.2.1 Bernoulli Density 63
4.2.2 Multinomial Density 64
4.2.3 Gaussian (Normal) Density 64
4.3 Evaluating an Estimator: Bias and Variance 65
4.4 The Bayes’ Estimator 66
4.5 Parametric Classification 69
4.6 Regression 73
4.7 Tuning Model Complexity: Bias/Variance Dilemma 76
4.8 Model Selection Procedures 80
4.9 Notes 84
4.10 Exercises 84
4.11 References 85
5 Multivariate Methods 87
5.1 Multivariate Data 87
5.2 Parameter Estimation 88
5.3 Estimation of Missing Values 89
5.4 Multivariate Normal Distribution 90
5.5 Multivariate Classification 94
5.6 Tuning Complexity 99
5.7 Discrete Features 102
5.8 Multivariate Regression 103
5.9 Notes 105
5.10 Exercises 106
5.11 References 107
6 Dimensionality Reduction 109
6.1 Introduction 109
6.2 Subset Selection 110
6.3 Principal Components Analysis 113
6.4 Factor Analysis 120
6.5 Multidimensional Scaling 125
6.6 Linear Discriminant Analysis 128
6.7 Isomap 133
6.8 Locally Linear Embedding 135
6.9 Notes 138
6.10 Exercises 139
6.11 References 140
7 Clustering 143
7.1 Introduction 143
7.2 Mixture Densities 144
7.3 k-Means Clustering 145
7.4 Expectation-Maximization Algorithm 149
7.5 Mixtures of Latent Variable Models 154
7.6 Supervised Learning after Clustering 155
7.7 Hierarchical Clustering 157
7.8 Choosing the Number of Clusters 158
7.9 Notes 160
7.10 Exercises 160
7.11 References 161
8 Nonparametric Methods 163
8.1 Introduction 163
8.2 Nonparametric Density Estimation 165
8.2.1 Histogram Estimator 165
8.2.2 Kernel Estimator 167
8.2.3 k-Nearest Neighbor Estimator 168
8.3 Generalization to Multivariate Data 170
8.4 Nonparametric Classification 171
8.5 Condensed Nearest Neighbor 172
8.6 Nonparametric Regression: Smoothing Models 174
8.6.1 Running Mean Smoother 175
8.6.2 Kernel Smoother 176
8.6.3 Running Line Smoother 177
8.7 How to Choose the Smoothing Parameter 178
8.8 Notes 180
8.9 Exercises 181
8.10 References 182
9 Decision Trees 185
9.1 Introduction 185
9.2 Univariate Trees 187
9.2.1 Classification Trees 188
9.2.2 Regression Trees 192
9.3 Pruning 194
9.4 Rule Extraction from Trees 197
9.5 Learning Rules from Data 198
9.6 Multivariate Trees 202
9.7 Notes 204
9.8 Exercises 207
9.9 References 207
10 Linear Discrimination 209
10.1 Introduction 209
10.2 Generalizing the Linear Model 211
10.3 Geometry of the Linear Discriminant 212
10.3.1 Two Classes 212
10.3.2 Multiple Classes 214
10.4 Pairwise Separation 216
10.5 Parametric Discrimination Revisited 217
10.6 Gradient Descent 218
10.7 Logistic Discrimination 220
10.7.1 Two Classes 220
10.7.2 Multiple Classes 224
10.8 Discrimination by Regression 228
10.9 Notes 230
10.10 Exercises 230
10.11 References 231
11 Multilayer Perceptrons 233
11.1 Introduction 233
11.1.1 Understanding the Brain 234
11.1.2 Neural Networks as a Paradigm for Parallel Processing 235
11.2 The Perceptron 237
11.3 Training a Perceptron 240
11.4 Learning Boolean Functions 243
11.5 Multilayer Perceptrons 245
11.6 MLP as a Universal Approximator 248
11.7 Backpropagation Algorithm 249
11.7.1 Nonlinear Regression 250
11.7.2 Two-Class Discrimination 252
11.7.3 Multiclass Discrimination 254
11.7.4 Multiple Hidden Layers 256
11.8 Training Procedures 256
11.8.1 Improving Convergence 256
11.8.2 Overtraining 257
11.8.3 Structuring the Network 258
11.8.4 Hints 261
11.9 Tuning the Network Size 263
11.10 Bayesian View of Learning 266
11.11 Dimensionality Reduction 267
11.12 Learning Time 270
11.12.1 Time Delay Neural Networks 270
11.12.2 Recurrent Networks 271
11.13 Notes 272
11.14 Exercises 274
11.15 References 275
12 Local Models 279
12.1 Introduction 279
12.2 Competitive Learning 280
12.2.1 Online k-Means 280
12.2.2 Adaptive Resonance Theory 285
12.2.3 Self-Organizing Maps 286
12.3 Radial Basis Functions 288
12.4 Incorporating Rule-Based Knowledge 294
12.5 Normalized Basis Functions 295
12.6 Competitive Basis Functions 297
12.7 Learning Vector Quantization 300
12.8 Mixture of Experts 300
12.8.1 Cooperative Experts 303
12.8.2 Competitive Experts 304
12.9 Hierarchical Mixture of Experts 304
12.10 Notes 305
12.11 Exercises 306
12.12 References 307
13 Kernel Machines 309
13.1 Introduction 309
13.2 Optimal Separating Hyperplane 311
13.3 The Nonseparable Case: Soft Margin Hyperplane 315
13.4 ν-SVM 318
13.5 Kernel Trick 319
13.6 Vectorial Kernels 321
13.7 Defining Kernels 324
13.8 Multiple Kernel Learning 325
13.9 Multiclass Kernel Machines 327
13.10 Kernel Machines for Regression 328
13.11 One-Class Kernel Machines 333
13.12 Kernel Dimensionality Reduction 335
13.13 Notes 337
13.14 Exercises 338
13.15 References 339
14 Bayesian Estimation 341
14.1 Introduction 341
14.2 Estimating the Parameter of a Distribution 343
14.2.1 Discrete Variables 343
14.2.2 Continuous Variables 345
14.3 Bayesian Estimation of the Parameters of a Function 348
14.3.1 Regression 348
14.3.2 The Use of Basis/Kernel Functions 352
14.3.3 Bayesian Classification 353
14.4 Gaussian Processes 356
14.5 Notes 359
14.6 Exercises 360
14.7 References 361
15 Hidden Markov Models 363
15.1 Introduction 363
15.2 Discrete Markov Processes 364
15.3 Hidden Markov Models 367
15.4 Three Basic Problems of HMMs 369
15.5 Evaluation Problem 369
15.6 Finding the State Sequence 373
15.7 Learning Model Parameters 375
15.8 Continuous Observations 378
15.9 The HMM with Input 379
15.10 Model Selection in HMM 380
15.11 Notes 382
15.12 Exercises 383
15.13 References 384
16 Graphical Models 387
16.1 Introduction 387
16.2 Canonical Cases for Conditional Independence 389
16.3 Example Graphical Models 396
16.3.1 Naive Bayes’ Classifier 396
16.3.2 Hidden Markov Model 398
16.3.3 Linear Regression 401
16.4 d-Separation 402
16.5 Belief Propagation 402
16.5.1 Chains 403
16.5.2 Trees 405
16.5.3 Polytrees 407
16.5.4 Junction Trees 409
16.6 Undirected Graphs: Markov Random Fields 410
16.7 Learning the Structure of a Graphical Model 413
16.8 Influence Diagrams 414
16.9 Notes 414
16.10 Exercises 417
16.11 References 417
17 Combining Multiple Learners 419
17.1 Rationale 419
17.2 Generating Diverse Learners 420
17.3 Model Combination Schemes 423
17.4 Voting 424
17.5 Error-Correcting Output Codes 427
17.6 Bagging 430
17.7 Boosting 431
17.8 Mixture of Experts Revisited 434
17.9 Stacked Generalization 435
17.10 Fine-Tuning an Ensemble 437
17.11 Cascading 438
17.12 Notes 440
17.13 Exercises 442
17.14 References 443
18 Reinforcement Learning 447
18.1 Introduction 447
18.2 Single State Case: K-Armed Bandit 449
18.3 Elements of Reinforcement Learning 450
18.4 Model-Based Learning 453
18.4.1 Value Iteration 453
18.4.2 Policy Iteration 454
18.5 Temporal Difference Learning 454
18.5.1 Exploration Strategies 455
18.5.2 Deterministic Rewards and Actions 456
18.5.3 Nondeterministic Rewards and Actions 457
18.5.4 Eligibility Traces 459
18.6 Generalization 461
18.7 Partially Observable States 464
18.7.1 The Setting 464
18.7.2 Example: The Tiger Problem 465
18.8 Notes 470
18.9 Exercises 472
18.10 References 473
19 Design and Analysis of Machine Learning Experiments 475
19.1 Introduction 475
19.2 Factors, Response, and Strategy of Experimentation 478
19.3 Response Surface Design 481
19.4 Randomization, Replication, and Blocking 482
19.5 Guidelines for Machine Learning Experiments 483
19.6 Cross-Validation and Resampling Methods 486
19.6.1 K-Fold Cross-Validation 487
19.6.2 5×2 Cross-Validation 488
19.6.3 Bootstrapping 489
19.7 Measuring Classifier Performance 489
19.8 Interval Estimation 493
19.9 Hypothesis Testing 496
19.10 Assessing a Classification Algorithm’s Performance 498
19.10.1 Binomial Test 499
19.10.2 Approximate Normal Test 500
19.10.3 t Test 500
19.11 Comparing Two Classification Algorithms 501
19.11.1 McNemar’s Test 501
19.11.2 K-Fold Cross-Validated Paired t Test 501
19.11.3 5 × 2 cv Paired t Test 502
19.11.4 5 × 2 cv Paired F Test 503
19.12 Comparing Multiple Algorithms: Analysis of Variance 504
19.13 Comparison over Multiple Datasets 508
19.13.1 Comparing Two Algorithms 509
19.13.2 Multiple Algorithms 511
19.14 Notes 512
19.15 Exercises 513
19.16 References 514
A Probability 517
A.1 Elements of Probability 517
A.1.1 Axioms of Probability 518
A.1.2 Conditional Probability 518
A.2 Random Variables 519
A.2.1 Probability Distribution and Density Functions 519
A.2.2 Joint Distribution and Density Functions 520
A.2.3 Conditional Distributions 520
A.2.4 Bayes’ Rule 521
A.2.5 Expectation 521
A.2.6 Variance 522
A.2.7 Weak Law of Large Numbers 523
A.3 Special Random Variables 523
A.3.1 Bernoulli Distribution 523
A.3.2 Binomial Distribution 524
A.3.3 Multinomial Distribution 524
A.3.4 Uniform Distribution 524
A.3.5 Normal (Gaussian) Distribution 525
A.3.6 Chi-Square Distribution 526
A.3.7 t Distribution 527
A.3.8 F Distribution 527
A.4 References 527
Index 529