机器学习系列文章-集成学习综述

文章分为两个部分：

第一部分，介绍集成学习的理论原理。
第二部分，实践部分。主要使用sklearn封装的包。

第一部分理论原理

什么是集成学习

首先集成学习（ensemble learning）不是具体的算法，应该算一种算法思想（Algorithm Framework，类比EM算法）。在机器学习中，通过一定组合策略，将多种学习器组合起来，以此提升泛化能力的框架统称为集成学习。

集成学习框架可以分成两个部分：基分类器集合、组合策略。

最常用的集成学习框架有：Bagging、Boosting、Stacking，在机器学习的比赛中被广泛使用。

Bagging

Bootstrap aggregating 简称Bagging，下面的流程图很清楚的说明了Bagging的框架流程。

算法流程

算法输入：$TrainData \ D={(x_1,y_1),(x_2,y_2),…,(x_m,y_m)}$

基学习器： $L$

训练迭代的次数： $T$

算法过程：

1、For t =1,2,3，…,T DO:

2、对训练集$D$进行第t次随机采样，共采集m次，得到包含m个样本的采样集$Dt$

3、$h_t=L(D,D_t)$

4、end for

算法输出：$H(x)=argmax_{y \in Y}{\sum_{t=1}^{T}I(h_t(x)=y)}$

关键步骤：

随机采样：每次采样，从训练集中采样固定数量的样本。并且是有放回的。每次未被采样的数据称为袋外数据（Out Of Bag, 简称OOB ），作为测试集，检测基学习器的泛化能力。
Bagging的基学习器可以是“异质”的。
Bagging的组合策略有：投票法（分类问题）、平均值（回归问题）。

随机森林（Random Forest）是最常用的Bagging算法架构，基学习器使用CART决策树。当然RF对原架构有部分改动：

RF中每个基分类器（决策树），每个节点对特征进行部分采样，然后在采样集中选择最优特征。提高基分类器的泛化能力。

普通的CART并不对特征进行采样。

Boosting（Adaboost）

在Bagging框架下，基学习器是相互独立的学习过程。而在boosing框架中，基学习器存在前后依赖关系。可以将弱学习器（强于随机猜想）提升为强学习器。

“boosting”的图片搜索结果

数据集有$n$个点，Boosting框架给数据集赋予了权重分布结构${w_i}_{i=1}^n$表示这个点重要性（这些权重值会在代价函数中体现）。对于前项学习器分类错误的点，提高权重（即提高下一项学习器代价函数对于“错”点关注）。最后得到一个学习器序列${y_i}_{i=1}^M$,最后通过加权得到最终的模型。

上面的例子只是简单介绍。Boosting最出名的算法框架为Adaboost（adaptive boosting）。

Adaboost 算法流程

以二元分类为例：

算法输入：$TrainData \ D={(x_1,y_1),(x_2,y_2),…,(x_m,y_m)}$

基学习器： $L$

训练迭代的次数： $T$

算法过程：

1、初始化数据集权值分布：$D_1(x)=1/m$

2、For t=1,2,3,…,T DO

3、$h_t=L(D,D_t)$

4、$\epsilon_t=P_{x\sim D_t}(h_t(x)\ne f(x))$

5、if $\epsilon_t >0.5$ then break

6、$\alpha_t=\dfrac{1}{2}ln(\dfrac{1-\epsilon_t}{\epsilon_t} )$

7、$D_{t+1}(x)=\dfrac{D_t(x)}{Z_t}exp(-\alpha_tf(x)h_t(x))$

8、end for

算法输出：$H(x)=sign{\sum_{t=1}^{T}}\alpha_t h_t(x)$

注记：Boosting只能针对二分类的问题。对于多分类问题已经有相关的推广工作。

Multi-class AdaBoost ：https://web.stanford.edu/~hastie/Papers/samme.pdf

该论文的算法已经在sklearn中进行封装实现。第二部分我们将讲解。

Stacking

第二部分实践

sklearn中有集成学习丰富的封装包：

The sklearn.ensemble module includes ensemble-based methods for classification, regression and anomaly detection.

User guide: See the Ensemble methods section for further details.

`ensemble.AdaBoostClassifier`([…])	An AdaBoost classifier.
`ensemble.AdaBoostRegressor`([base_estimator, …])	An AdaBoost regressor.
`ensemble.BaggingClassifier`([base_estimator, …])	A Bagging classifier.
`ensemble.BaggingRegressor`([base_estimator, …])	A Bagging regressor.
`ensemble.ExtraTreesClassifier`([…])	An extra-trees classifier.
`ensemble.ExtraTreesRegressor`([n_estimators, …])	An extra-trees regressor.
`ensemble.GradientBoostingClassifier`([loss, …])	Gradient Boosting for classification.
`ensemble.GradientBoostingRegressor`([loss, …])	Gradient Boosting for regression.
`ensemble.IsolationForest`([n_estimators, …])	Isolation Forest Algorithm
`ensemble.RandomForestClassifier`([…])	A random forest classifier.
`ensemble.RandomForestRegressor`([…])	A random forest regressor.
`ensemble.RandomTreesEmbedding`([…])	An ensemble of totally random trees.
`ensemble.VotingClassifier`(estimators[, …])	Soft Voting/Majority Rule classifier for unfitted estimators.

AdaBoostClassifier参数介绍

源码路径：sklearn\ensemble\weight_boosting.py

主要参数：

base_estimator

> base_estimator : object, optional (default=DecisionTreeClassifier)
>         The base estimator from which the boosted ensemble is built.
>         Support for sample weighting is required, as well as proper `classes_`
>         and `n_classes_` attributes.
>

基分类器。默认为决策树（DecisionTreeClassifier）。基分类器的选择有个trick在algorithm参数中一并介绍。

n_estimators

n_estimators : integer, optional (default=50)
    The maximum number of estimators at which boosting is terminated.
    In case of perfect fit, the learning procedure is stopped early.

最大训练迭代次数。默认值为50。另外从boosting原理看，也可以理解为最大生成基学习器的序列长度。

learning_rate

> learning_rate : float, optional (default=1.)
>         Learning rate shrinks the contribution of each classifier by
>         ``learning_rate``. There is a trade-off between ``learning_rate`` and
>         ``n_estimators``.
>

学习率（步长）。源码中的介绍也提到参数的选择和参数n_estimators是一个trade-off。

algorithm

> algorithm : {'SAMME', 'SAMME.R'}, optional (default='SAMME.R')
>         If 'SAMME.R' then use the SAMME.R real boosting algorithm.
>         ``base_estimator`` must support calculation of class probabilities.
>         If 'SAMME' then use the SAMME discrete boosting algorithm.
>         The SAMME.R algorithm typically converges faster than SAMME,
>         achieving a lower test error with fewer boosting iterations.
>

算法，其实应该是权重更新算法。参数有两个选择：’SAMME’, ‘SAMME.R’，默认是“’SAMME.R”。理解两个算法的区别就需要我们回溯到原论文：《Multi-class AdaBoost》：

SAMME 算法

SAMME.R 算法

enter image description here

简单的理解：如果使用SAMME.R算法，需要基分类器的输出是概率值。即在sklearn框架中需要class中有predict_proba函数。

可以通过下面的python脚本查看哪些分类器是满足的。

import inspect
from sklearn.utils.testing import all_estimators
for name, clf in all_estimators(type_filter='classifier'):
    if 'sample_weight' in inspect.getargspec(clf().fit)[0]:
       print(name)

random_state

> random_state : int, RandomState instance or None, optional (default=None)
>         If int, random_state is the seed used by the random number generator;
>         If RandomState instance, random_state is the random number generator;
>         If None, the random number generator is the RandomState instance used
>         by `np.random`.
>

随机种子。参数随意，如果要结果能复现，需要相同的随机种子。

AdaBoostRegressor参数简介

源码路径：sklearn\ensemble\weight_boosting.py

主要参数：

base_estimator

> base_estimator : object, optional (default=DecisionTreeRegressor)
>         The base estimator from which the boosted ensemble is built.
>         Support for sample weighting is required.
>

基回归器。默认为回归树（DecisionTreeRegressor）。

n_estimators

> n_estimators : integer, optional (default=50)
>         The maximum number of estimators at which boosting is terminated.
>         In case of perfect fit, the learning procedure is stopped early.
>

最大训练迭代次数。默认值为50。同AdaBoostClassifier。

learning_rate

> learning_rate : float, optional (default=1.)
>         Learning rate shrinks the contribution of each regressor by
>         ``learning_rate``. There is a trade-off between ``learning_rate`` and
>         ``n_estimators``.
>

学习率（步长）。

loss

> loss : {'linear', 'square', 'exponential'}, optional (default='linear')
>         The loss function to use when updating the weights after each
>         boosting iteration.
>

误差。

这个参数只有AdaBoostRegressor有，Adaboost.R2算法需要用到。有线性‘linear’, 平方‘square’和指数 ‘exponential’三种选择, 默认是线性，一般使用线性就足够了，除非你怀疑这个参数导致拟合程度不好。这个值的意义在原理篇我们也讲到了，它对应了我们对第k个弱分类器的中第i个样本的误差的处理，即：如果是线性误差，则eki=|yi−Gk(xi)|Ekeki=|yi−Gk(xi)|Ek；如果是平方误差，则eki=(yi−Gk(xi))2E2keki=(yi−Gk(xi))2Ek2，如果是指数误差，则eki=1−exp（−yi+Gk(xi))Ek）eki=1−exp（−yi+Gk(xi))Ek），EkEk为训练集上的最大误差Ek=max|yi−Gk(xi)|i=1,2…m

random_state

> random_state : int, RandomState instance or None, optional (default=None)
>         If int, random_state is the seed used by the random number generator;
>         If RandomState instance, random_state is the random number generator;
>         If None, the random number generator is the RandomState instance used
>         by `np.random`.       
>

随机数种子。

调参心得传授

集成学习的参数按照层次可以分为两层：

1、集成框架的参数。

2、基分类器的参数。

例子

参考文献

1、ROC和AUC介绍以及如何计算AUC（http://alexkong.net/2013/06/introduction-to-auc-and-roc/）

2、sklearn（http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics）

3、wiki百科（https://zh.wikipedia.org/wiki/ROC%E6%9B%B2%E7%BA%BF）

4、统计学习方法

5、深度学习

第一部分 理论原理

什么是集成学习

Bagging

算法流程

Boosting（Adaboost）

Adaboost 算法流程

Stacking

第二部分 实践

AdaBoostClassifier参数介绍

base_estimator

n_estimators

learning_rate

algorithm

SAMME 算法

SAMME.R 算法

random_state

AdaBoostRegressor参数简介

base_estimator

n_estimators

learning_rate

loss

random_state

调参心得传授

例子

参考文献

第一部分理论原理

第二部分实践