An overview of Ensemble learning

Ensemble learning combines multiple machine learning models into one predictive model in order to produce a stronger overall prediction, that is, to decrease variance and/or bias.

Bias and Variance in Ensemble learning

Error due to Bias: The error due to bias is taken as the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict.
Error due to Variance: The error due to variance is taken as the variability of a model prediction for a given data point.
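
To make these two error sources concrete, here is a minimal sketch (assuming NumPy and scikit-learn, with a made-up target function) that estimates the bias and variance of a model's prediction at a single test point by retraining it on many freshly drawn training sets:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(3 * x)            # hypothetical ground-truth function

x_test = 0.5
y_true = true_f(x_test)             # the "red spot" we are trying to hit

preds = []
for _ in range(500):
    X = rng.uniform(0, 1, size=(50, 1))                   # a fresh training sample
    y = true_f(X.ravel()) + rng.normal(0, 0.3, size=50)   # noisy labels
    model = DecisionTreeRegressor(max_depth=3).fit(X, y)
    preds.append(model.predict([[x_test]])[0])            # one "blue dot"

preds = np.array(preds)
print("bias    :", preds.mean() - y_true)   # error due to bias
print("variance:", preds.var())             # error due to variance
```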

We can plot four different cases representing combinations of high and low bias and variance (assume the red spot is the true value and the blue dots are predictions):

Graphical illustration of bias and variance

Bagging

Bagging (short for Bootstrap Aggregation) is a simple ensemble method: take multiple random samples of the training data (with replacement), use each sample to build a separate model and produce separate predictions, and then average these predictions to obtain the final prediction value.
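
As a rough sketch of the idea (assuming NumPy and scikit-learn; the function names are just illustrative), bagging is a bootstrap loop followed by an average:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, n_models=25, seed=0):
    """Fit one tree per bootstrap sample (sampling rows with replacement)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)        # bootstrap sample indices
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Average the separate predictions into the final prediction value."""
    return np.mean([m.predict(X) for m in models], axis=0)
```

scikit-learn ships the same idea ready-made as BaggingRegressor and BaggingClassifier.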

Bagging for Bias

Bagging takes random samples from the data and fits the same type of model to each sample. By the basic identity E[∑ Xi/n] = E[Xi] (for identically distributed Xi), the bias of the averaged prediction is essentially the same as the bias of each individual model's prediction. Bagging therefore does not reduce bias, which is why it needs a "strong learner" (a low-bias model) as the base learner.

Bagging for Variance

In addition, we know that Var(∑Xi/n) = Var(Xi)/n holds if the models are independent, in which case averaging reduces the variance a great deal, while Var(∑Xi/n) = Var(Xi) holds if all the models are identical, in which case the variance stays the same. The models in bagging are correlated but not identical (the same model is fit to different random samples), so bagging sits between these two extremes. To make this precise, suppose we have n identically distributed random variables, each with variance σ² and positive pairwise correlation ρ; then the variance of their average is:

ρσ² + (1 − ρ)σ²/n
Thus bagging can decrease variance because averaging shrinks the second term. To decrease variance further, the models in bagging should be less correlated, which is exactly the idea behind random forest.
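
The formula above is easy to verify numerically; a small NumPy check (values chosen arbitrarily) looks like this:

```python
# For n i.d. variables with variance sigma^2 and pairwise correlation rho,
# Var(mean) should equal rho*sigma^2 + (1 - rho)*sigma^2 / n.
import numpy as np

n, rho, sigma2 = 10, 0.6, 4.0
cov = sigma2 * ((1 - rho) * np.eye(n) + rho * np.ones((n, n)))

rng = np.random.default_rng(0)
draws = rng.multivariate_normal(np.zeros(n), cov, size=200_000)
empirical = draws.mean(axis=1).var()
theoretical = rho * sigma2 + (1 - rho) * sigma2 / n
print(empirical, theoretical)   # both close to ~2.56
```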

Random Forest

As mentioned above, the base models in random forest are less correlated because each tree is grown on a random sample of the data and considers only a subset of the features at each split (usually √p, where p is the number of features). In this way, random forest builds a large collection of de-correlated trees and then averages them, so it can shrink both the first term and the second term.
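
A hedged usage sketch with scikit-learn's RandomForestClassifier (X_train, y_train, X_test are placeholder names), showing the two sources of de-correlation, row bootstrapping and per-split feature subsampling:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300,      # a large collection of trees to average
    max_features="sqrt",   # consider only about sqrt(p) features at each split
    bootstrap=True,        # each tree sees a random sample of rows
    random_state=0,
)
# rf.fit(X_train, y_train); rf.predict(X_test)   # placeholder data names
```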

Boosting

Boosting for Bias

Boosting is an iterative technique which adjusts the weight of each observation based on the previous round of classification: if an observation was classified incorrectly, its weight is increased, and vice versa.
From the optimization point of view, boosting uses the forward-stagewise method to minimize the loss function. In this way, boosting can be seen as fitting a new basis model to reduce the following loss, one step at a time:

Forward stagewise (step m): (β_m, γ_m) = argmin over (β, γ) of ∑_i L(y_i, f_(m−1)(x_i) + β·b(x_i; γ)), then f_m(x) = f_(m−1)(x) + β_m·b(x; γ_m)

As a result, the bias decreases sequentially.

Boosting for Variance

In boosting, each model is built on top of the last one, so the correlation between the models is high. As a result, the variance of the ensemble stays high: boosting cannot decrease variance.

Adaboost

The first realization of boosting that saw great success in application was Adaptive Boosting or AdaBoost for short.

By changing the weights of the training samples according to the errors in each iteration, AdaBoost learns a number of different classifiers from the same base model and linearly combines them to improve the classification performance.


AdaBoost

In addition, AdaBoost is equivalent to Forward Stagewise Additive Modeling using the exponential loss function.
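
A short usage sketch with scikit-learn's AdaBoostClassifier, using a depth-1 decision stump as the weak base model (X_train and y_train are placeholders; older scikit-learn versions call the first argument base_estimator):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak base model (a stump)
    n_estimators=200,    # number of re-weighting rounds / classifiers
    learning_rate=0.5,   # shrinks each classifier's vote in the linear combination
    random_state=0,
)
# ada.fit(X_train, y_train)   # X_train, y_train are placeholder names
```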

AdaBoost can be sensitive to outliers and label noise because it fits an exponential loss function, and the exponential loss is sensitive to such points: an outlier or mislabeled point suffers a very large penalty, since the loss exp(−y·f(x)) grows exponentially as the margin becomes more negative. The classifier is therefore pulled towards these points when minimizing the loss. Several papers propose other loss functions for boosting that are less sensitive to outliers and noise, such as SavageBoost.
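
A tiny NumPy illustration of this sensitivity: for a confidently mislabeled point the margin y·f(x) is strongly negative, and the exponential loss blows up much faster than, say, the logistic loss:

```python
import numpy as np

margins = np.array([2.0, 0.5, -0.5, -4.0])    # y * f(x) for four points
exp_loss = np.exp(-margins)                   # AdaBoost's exponential loss
log_loss = np.log1p(np.exp(-margins))         # logistic (binomial) loss
print(exp_loss)   # [ 0.135  0.607  1.649 54.598]  -> the noisy point dominates
print(log_loss)   # [ 0.127  0.474  0.974  4.018]  -> much milder penalty
```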

Gradient Boosting

AdaBoost and related algorithms were first recast in a statistical framework by Breiman, who called them arcing algorithms.

Arcing is an acronym for Adaptive Reweighting and Combining. Each step in an arcing algorithm consists of a weighted minimization followed by a recomputation of [the classifiers] and [weighted input].
Prediction Games and Arcing Algorithms [PDF], 1997

This framework was further developed by Friedman, who called it Gradient Boosting Machines (Greedy Function Approximation: A Gradient Boosting Machine [PDF], 1999).

The gradient boosting method is:

gradient boosting

In each stage, boosting introduces a weak learner to compensate for the shortcomings of the existing weak learners. From this point of view, the "shortcomings" are identified by gradients in gradient boosting, and by high-weight data points in AdaBoost.
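
A bare-bones sketch of this idea for regression with squared error (assuming NumPy and scikit-learn; the function names are illustrative): with L(y, f) = (y − f)²/2 the negative gradient is just the residual y − f, so each new tree is fit to the current residuals:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, lr=0.1, max_depth=3):
    f0 = y.mean()                                 # initial constant prediction
    pred = np.full_like(y, f0, dtype=float)
    trees = []
    for _ in range(n_rounds):
        residual = y - pred                       # negative gradient ("shortcoming")
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += lr * tree.predict(X)              # shrunken additive update
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(f0, trees, X, lr=0.1):
    # lr must match the value used during fitting
    return f0 + lr * sum(t.predict(X) for t in trees)
```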

Elements in Gradient Boosting

Gradient boosting involves three elements:

  1. A loss function to be optimized.
    The loss function must be differentiable, but the one used depends on the type of problem being solved. For example, regression may use squared error and classification may use logarithmic loss. You can also define your own: any differentiable loss function can be used (see the pseudo-residual sketch after this list).

  2. A weak learner to make predictions.
    Decision trees are widely used as the weak learner in gradient boosting.

  3. An additive model to add weak learners to minimize the loss function.
    Traditionally, gradient descent is used to update a set of parameters, such as the coefficients in a linear regression or weights in a neural network, to minimize the loss function. After calculating error or loss, the weights are updated to minimize that error.
    Instead of parameters, we have weak learner sub-models, or more specifically decision trees. In gradient boosting, we add a new tree that moves the ensemble in the right direction by reducing the residual loss, i.e., by following the negative gradient.
    Generally this approach is called functional gradient descent or gradient descent with functions.
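
As a small illustration of points 1 and 3 together (a hedged sketch, not any particular library's API), the only piece that changes between loss functions is the negative gradient, or pseudo-residual, that the next tree is fit to:

```python
import numpy as np

def pseudo_residuals(y, pred, loss="squared"):
    """Negative gradient of the loss with respect to the current prediction."""
    if loss == "squared":        # L = (y - f)^2 / 2   ->  -dL/df = y - f
        return y - pred
    if loss == "absolute":       # L = |y - f|         ->  -dL/df = sign(y - f)
        return np.sign(y - pred)
    if loss == "logistic":       # y in {0, 1}, pred is a log-odds score
        return y - 1.0 / (1.0 + np.exp(-pred))
    raise ValueError(f"unknown loss: {loss}")
```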

One way to produce a weighted combination of classifiers which optimizes [the cost] is by gradient descent in function space
Boosting Algorithms as Gradient Descent in Function Space [PDF], 1999

Regularization in Gradient Boosting

Gradient boosting is a greedy algorithm and can overfit a training dataset quickly, so regularization helps a lot. The main options are listed below (a parameter sketch follows at the end of this section).

  • Tree Constraints
    When we use a tree as the base learner, we can set some constraints to keep each tree a weak learner:
      • depth of the trees
      • number of trees
      • number of nodes or leaves per tree
      • minimum improvement in loss required for a split
  • Shrinkage

The simplest implementation of shrinkage in the context of boosting is to scale the contribution of each tree by a factor 0 < ν < 1 when it is added to the current approximation.
The Elements of Statistical Learning, p. 383

The factor ν, also called the learning rate, is commonly set in the range of 0.1 to 0.3. With a smaller learning rate, the number of boosting iterations needed increases, so training takes more time; there is a trade-off between the number of trees and the learning rate.

  • Random sampling
    We can learn from random forest that trees built on subsamples of the training dataset have lower variance. The same idea can also be used to decrease the correlation between the trees in the sequence of a gradient boosting model.
    This variation of boosting is called stochastic gradient boosting.

At each iteration a subsample of the training data is drawn at random (without replacement) from the full training dataset. The randomly selected subsample is then used, instead of the full sample, to fit the base learner.
Stochastic Gradient Boosting [PDF], 1999

A few variants of stochastic boosting that can be used:

  1. Subsample rows before creating each tree.
  2. Subsample columns before creating each tree
  3. Subsample columns before considering each split.

Random forest uses column subsampling (typically before each split) to greatly decrease the correlation between the base learners.

According to user feedback, using column sub-sampling prevents over-fitting even more so than the traditional row sub-sampling
XGBoost: A Scalable Tree Boosting System, 2016

  • Penalized Learning
    Just like lasso and ridge, we can also add L1 and L2 regularization in Gradient Boosting.
    Classical decision trees like CART are not used as weak learners, instead a modified form called a regression tree is used that has numeric values in the leaf nodes (also called terminal nodes). The values in the leaves of the trees can be called weights. The leaf weight values of the trees can be regularized using popular regularization functions.
    Of course, there is more than one way to define the complexity; the one below works well in practice.
XGBoost regularization term: Ω(f) = γT + ½λ ∑ w_j², where T is the number of leaves in a tree and w_j are the leaf weights.

The additional regularization term helps to smooth the final learnt weights to avoid over-fitting. Intuitively, the regularized objective will tend to select a model employing simple and predictive functions.
XGBoost: A Scalable Tree Boosting System, 2016
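
Pulling the above together, here is a parameter sketch using the xgboost package (assuming it is installed; the values are illustrative, not recommendations) that exposes the tree constraints, shrinkage, subsampling, and L1/L2 penalties discussed above:

```python
import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=500,       # number of trees
    max_depth=4,            # tree constraint: depth
    min_child_weight=1.0,   # tree constraint: minimum leaf weight
    gamma=0.1,              # tree constraint: minimum loss reduction per split
    learning_rate=0.1,      # shrinkage factor nu
    subsample=0.8,          # row subsampling per tree (stochastic boosting)
    colsample_bytree=0.8,   # column subsampling per tree
    reg_alpha=0.0,          # L1 penalty on leaf weights
    reg_lambda=1.0,         # L2 penalty on leaf weights
)
# model.fit(X_train, y_train)   # X_train, y_train are placeholder names
```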

Stacking

To be continued....

Reference:

  1. Understanding the bias and variance
  2. The Elements of Statistical Learning
  3. Gradient_boosting