A List of Common Machine Learning Algorithms

1. Linear Regression

2. Logistic Regression

3. Decision Tree

4. Support Vector Machine (SVM)

5. Naive Bayes

6. k-Nearest Neighbors (kNN)

7. K-Means

8. Random Forest

9. Dimensionality Reduction Algorithms

10. Gradient Boosting Algorithms

GBM

XGBoost

LightGBM

CatBoost

1. Linear Regression

Linear regression fits a straight line Y = a*X + b to the data, where:

Y --- dependent variable

a --- slope

X --- independent variable

b --- intercept
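As a worked example of the linear fit Y = a*X + b implied by the variable definitions above, the slope and intercept can be computed directly with the closed-form least-squares formulas. The data below is hypothetical, chosen so the true line is Y = 2X + 1:

```python
import numpy as np

# Hypothetical sample data generated from Y = 2X + 1 with no noise
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([3.0, 5.0, 7.0, 9.0])

# Closed-form least-squares estimates for Y = a*X + b
a = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b = Y.mean() - a * X.mean()

print(a, b)  # slope 2.0, intercept 1.0
```

The sklearn snippet below computes the same quantities (exposed as `linear.coef_` and `linear.intercept_`), just wrapped in an estimator object.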

Python code

# Import Library
# Import other necessary libraries like pandas, numpy...
from sklearn import linear_model
# Load Train and Test datasets
# Identify feature and response variable(s); values must be numeric numpy arrays
x_train = input_variables_values_training_datasets
y_train = target_variables_values_training_datasets
x_test = input_variables_values_test_datasets
# Create linear regression object
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score
linear.fit(x_train, y_train)
linear.score(x_train, y_train)
# Equation coefficient and Intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
# Predict Output
predicted = linear.predict(x_test)

2. Logistic Regression

odds = p / (1 - p) = probability of event occurrence / probability of event non-occurrence

ln(odds) = ln(p / (1 - p))

logit(p) = ln(p / (1 - p)) = b0 + b1*X1 + b2*X2 + b3*X3 + ... + bk*Xk
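A quick numeric check of the odds/logit relationship above, using a hypothetical probability p. Applying the inverse transform (the sigmoid) to the logit recovers p:

```python
import math

p = 0.8                       # hypothetical probability of the event
odds = p / (1 - p)            # 4-to-1: the event is 4x as likely as its absence
logit = math.log(odds)        # ln(odds), the left-hand side of the logit equation

# The inverse (sigmoid) maps the logit back to a probability
p_back = 1 / (1 + math.exp(-logit))
print(odds, logit, p_back)
```

Logistic regression fits the coefficients b0..bk of the logit equation; the model's predicted probabilities are sigmoids of that linear combination.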

Python code

# Import Library
from sklearn.linear_model import LogisticRegression
# Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create logistic regression object
model = LogisticRegression()
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
# Equation coefficient and Intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)
# Predict Output
predicted = model.predict(x_test)

3. Decision Tree

Python code

# Import Library
# Import other necessary libraries like pandas, numpy...
from sklearn import tree
# Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create tree object for classification; the criterion can be
# 'gini' (the default) or 'entropy' (information gain)
model = tree.DecisionTreeClassifier(criterion='gini')
# model = tree.DecisionTreeRegressor() for regression
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
# Predict Output
predicted = model.predict(x_test)
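The 'gini' criterion named in the snippet measures node impurity as 1 − Σ p_i², where p_i are the class proportions inside the node; the tree picks splits that reduce it. A minimal hand computation:

```python
# Gini impurity of a node: 1 - sum(p_i^2) over the class proportions p_i
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini(['a', 'a', 'a', 'a']))  # 0.0 -- pure node, nothing to split
print(gini(['a', 'a', 'b', 'b']))  # 0.5 -- maximally mixed for two classes
```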

4. SVM (Support Vector Machine)

SVM is a classification method. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. SVM then finds a hyperplane that best separates the classes.
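To illustrate the points-in-n-dimensional-space view: a linear classifier of this kind separates the points with a hyperplane w·x + b = 0 and labels each point by which side it falls on. A toy sketch with hypothetical 2-D points and a hand-picked hyperplane:

```python
import numpy as np

# Hypothetical separating hyperplane in 2-D feature space: w.x + b = 0
w = np.array([1.0, -1.0])
b = 0.0

points = np.array([[2.0, 1.0],   # falls on the positive side of the hyperplane
                   [1.0, 3.0]])  # falls on the negative side

# Each point's class is the sign of its signed distance to the hyperplane
labels = np.sign(points @ w + b)
print(labels)  # [ 1. -1.]
```

An actual SVM learns w and b from the training data by maximizing the margin; the sklearn snippet below does this for you.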

Python code

# Import Library
from sklearn import svm
# Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create SVM classification object; there are various options associated
# with it, this is a simple setup for classification
model = svm.SVC()
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
# Predict Output
predicted = model.predict(x_test)

5. Naive Bayes

P(c|x) is the posterior probability of the class (target) given the predictor (attribute).

P(c) is the prior probability of the class.

P(x|c) is the likelihood: the probability of the predictor given the class.

P(x) is the prior probability of the predictor.
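Plugging hypothetical numbers into Bayes' theorem, P(c|x) = P(x|c)·P(c) / P(x), where P(x) is expanded over both classes by the law of total probability:

```python
# Hypothetical probabilities for a two-class problem
p_c = 0.3            # P(c): prior probability of class c
p_x_given_c = 0.8    # P(x|c): likelihood of predictor x under class c
p_x_given_not_c = 0.2

# P(x) by total probability over the two classes
p_x = p_x_given_c * p_c + p_x_given_not_c * (1 - p_c)

# Posterior P(c|x) via Bayes' theorem
posterior = p_x_given_c * p_c / p_x
print(posterior)  # about 0.63: x makes class c roughly twice as likely as before
```

Naive Bayes applies exactly this update, with the extra "naive" assumption that the features are conditionally independent given the class, so the likelihood factorizes per feature.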

Python code

# Import Library
from sklearn.naive_bayes import GaussianNB
# Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create Gaussian Naive Bayes object; there are other distributions for
# multinomial classes, like Bernoulli Naive Bayes
model = GaussianNB()
# Train the model using the training sets and check score
model.fit(X, y)
# Predict Output
predicted = model.predict(x_test)

6. kNN (k-Nearest Neighbors)

kNN can be used for both classification and regression problems, but in industry it is more widely used for classification. The k-nearest neighbors algorithm is a simple algorithm that stores all available cases and classifies a new case by a majority vote of its k neighbors: the new case is assigned to the class most common among its k nearest neighbors, as measured by a distance function.

kNN maps easily onto real life. If you want to learn about a person you have no information on, you might look at their close friends and the circles they move in, and infer from there.
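The store-everything-then-vote idea can be sketched in a few lines of plain Python. The 2-D points and labels below are hypothetical, with class 'A' clustered near the origin and class 'B' far away:

```python
from collections import Counter
import math

def knn_predict(train, labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    nearest = sorted(range(len(train)),
                     key=lambda i: math.dist(train[i], query))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

train = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]  # hypothetical 2-D points
labels = ['A', 'A', 'A', 'B', 'B']
print(knn_predict(train, labels, (0.5, 0.5)))  # 'A' -- all 3 nearest are 'A'
```

There is no training step at all: the cost is paid at prediction time, which is why kNN gets slow on large datasets.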

Python code

# Import Library
from sklearn.neighbors import KNeighborsClassifier
# Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create KNeighbors classifier object
model = KNeighborsClassifier(n_neighbors=6)  # default value for n_neighbors is 5
# Train the model using the training sets and check score
model.fit(X, y)
# Predict Output
predicted = model.predict(x_test)

7. K-Means

K-means is an unsupervised algorithm that solves clustering problems. It follows a simple procedure to classify a given dataset into a chosen number of clusters (say, k clusters). Data points inside a cluster are homogeneous, and heterogeneous with respect to other clusters.

How K-means forms clusters:

1. K-means picks k points, one per cluster, known as centroids.

2. Each data point joins the cluster of its closest centroid.

3. Each centroid is recomputed as the mean of its cluster's members, and steps 2-3 repeat until the centroids stop moving.
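The assign-then-recompute loop at the heart of K-means can be sketched directly. The 1-D data below is hypothetical, with two obvious groups and k = 2:

```python
import numpy as np

# Hypothetical 1-D data with two clear groups, and k=2 initial centroids
points = np.array([1.0, 2.0, 9.0, 10.0])
centroids = np.array([1.0, 9.0])

for _ in range(10):
    # Step 1: assign each point to its nearest centroid
    assign = np.abs(points[:, None] - centroids[None, :]).argmin(axis=1)
    # Step 2: move each centroid to the mean of its assigned points
    centroids = np.array([points[assign == j].mean() for j in range(2)])

print(centroids)  # [1.5 9.5]
```

Here the loop converges after one pass: {1, 2} averages to 1.5 and {9, 10} to 9.5, after which assignments no longer change.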

Python code

1# Import Library

2from sklearn.cluster import KMeans

3# Assumed you have, X (attributes) for training data set and x_test(attributes) of test_dataset

4# Create KNeighbors classifier object model

5k_means = KMeans(n_clusters=3, random_state=0)

6# Train the model using the training sets and check score

7model.fit(X)

8# Predict Output

9predicted= model.predict(x_test)

8. Random Forest

Python code

# Import Library
from sklearn.ensemble import RandomForestClassifier
# Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create Random Forest object
model = RandomForestClassifier()
# Train the model using the training sets and check score
model.fit(X, y)
# Predict Output
predicted = model.predict(x_test)

9. Dimensionality Reduction Algorithms

Python code

# Import Library
from sklearn import decomposition
# Assumed you have training and test data sets as train and test
# Create PCA object; the default n_components is min(n_samples, n_features)
pca = decomposition.PCA(n_components=k)
# For Factor analysis:
# fa = decomposition.FactorAnalysis()
# Reduce the dimension of the training dataset using PCA
train_reduced = pca.fit_transform(train)
# Reduce the dimension of the test dataset
test_reduced = pca.transform(test)
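To see how much information survives the reduction, the fitted PCA object exposes the fraction of total variance each retained component keeps. A self-contained sketch on synthetic (hypothetical) data where one feature dominates the variance:

```python
import numpy as np
from sklearn import decomposition

# Hypothetical data: 100 samples, 5 features, most variance along one axis
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
X[:, 0] *= 10  # inflate the variance of the first feature

pca = decomposition.PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of total variance kept by each retained component
print(pca.explained_variance_ratio_)
print(X_reduced.shape)  # (100, 2)
```

In practice, plotting the cumulative sum of `explained_variance_ratio_` is a common way to choose how many components to keep.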

10. Gradient Boosting Algorithms

10.1. GBM

Python code

# Import Library
from sklearn.ensemble import GradientBoostingClassifier
# Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create Gradient Boosting Classifier object
model = GradientBoostingClassifier()
# Train the model using the training sets and check score
model.fit(X, y)
# Predict Output
predicted = model.predict(x_test)

10.2. XGBoost

XGBoost is another classic gradient boosting algorithm, known as the decisive algorithm behind a number of Kaggle competition wins.

XGBoost has very high predictive power, making it a top choice when accuracy matters: it combines a linear model with tree learning algorithms, which makes it nearly 10x faster than earlier gradient boosting implementations.

One of the most interesting things about XGBoost is that it is also known as a regularized boosting technique, which helps reduce overfitting. It has broad language support, including Scala, Java, R, Python, Julia, and C++.

Python code

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Assumed dataset is a numeric array with features in the first 10 columns
# and the target in the last column
X = dataset[:, 0:10]
Y = dataset[:, 10:]
seed = 1
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=seed)
model = XGBClassifier()
model.fit(X_train, y_train)
# Make predictions for test data
y_pred = model.predict(X_test)

10.3. LightGBM

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, with the following advantages:

- Faster training speed and higher efficiency
- Lower memory usage
- Better accuracy
- Support for parallel and GPU learning
- Capable of handling large-scale data

Python code

import numpy as np
import lightgbm as lgb
# 500 entities, each containing 10 features
data = np.random.rand(500, 10)
# binary target
label = np.random.randint(2, size=500)
train_data = lgb.Dataset(data, label=label)
test_data = train_data.create_valid('test.svm')
param = {'num_leaves': 31, 'num_trees': 100, 'objective': 'binary'}
param['metric'] = 'auc'
num_round = 10
bst = lgb.train(param, train_data, num_round, valid_sets=[test_data])
bst.save_model('model.txt')
# 7 entities, each containing 10 features
data = np.random.rand(7, 10)
ypred = bst.predict(data)

10.4. CatBoost

The best part of CatBoost is that it does not require the extensive training data that other ML models do, and it can handle a variety of data formats without compromising its robustness.

CatBoost can handle categorical variables automatically without throwing type conversion errors, which lets you focus on tuning the model rather than on trivial errors.

Python code

import pandas as pd
import numpy as np
from catboost import CatBoostRegressor
# Assumed the train and test DataFrames have already been loaded
# Imputing missing values for both train and test
train.fillna(-999, inplace=True)
test.fillna(-999, inplace=True)
# Creating a training set for modeling and a validation set to check model performance
X = train.drop(['Item_Outlet_Sales'], axis=1)
y = train.Item_Outlet_Sales
from sklearn.model_selection import train_test_split
X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.7, random_state=1234)
categorical_features_indices = np.where(X.dtypes != np.float64)[0]
# Building the model
model = CatBoostRegressor(iterations=50, depth=3, learning_rate=0.1, loss_function='RMSE')
model.fit(X_train, y_train, cat_features=categorical_features_indices, eval_set=(X_validation, y_validation), plot=True)
submission = pd.DataFrame()
submission['Item_Identifier'] = test['Item_Identifier']
submission['Outlet_Identifier'] = test['Outlet_Identifier']
submission['Item_Outlet_Sales'] = model.predict(test)