Author: 冬之晓
Date: November 2, 2018, Beijing
We will use the classic Iris dataset, which contains information about three different species of iris flowers. For each flower, the dataset records measurements of four variables. The Iris dataset has several interesting properties.
# packages to load
# Check the versions of libraries
import warnings
warnings.filterwarnings('ignore')

# Python version
import sys
print('Python: {}'.format(sys.version))

# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))

# numpy
import numpy as np  # linear algebra
print('numpy: {}'.format(np.__version__))

# pandas
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
print('pandas: {}'.format(pd.__version__))

# seaborn
import seaborn as sns
print('seaborn: {}'.format(sns.__version__))
sns.set(color_codes=True)

# matplotlib
import matplotlib
import matplotlib.pyplot as plt
print('matplotlib: {}'.format(matplotlib.__version__))
# %matplotlib inline

# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))

# Input data files are available in the "../input/" directory.
import os

# Importing metrics for evaluation
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
Data collection is the process of gathering and measuring data, information, or any variables of interest in a standardized, established way that lets the collector answer questions, test hypotheses, and evaluate the outcomes of a particular collection.
The Iris dataset consists of 3 different species of iris (Setosa, Versicolor, and Virginica); their sepal and petal measurements are stored in a 150x4 numpy.ndarray.
Reading the data
Read the data and inspect its type, shape, and basic information.
# import Dataset to play with it
import pandas as pd

dataset = pd.read_csv('../data/Iris.csv')
print(type(dataset))
print(dataset.shape)
print(dataset.size)
print(dataset.info())
import pandas as pd

dataset = pd.read_csv('../data/Iris.csv')
print(dataset['Species'].unique())
print(dataset['Species'].value_counts())
import pandas as pd

dataset = pd.read_csv('../data/Iris.csv')
print(dataset.head(5))    # first rows
print(dataset.tail())     # last rows
print(dataset.sample(5))  # random sample
import pandas as pd

dataset = pd.read_csv('../data/Iris.csv')
print(dataset.describe())
import pandas as pd

dataset = pd.read_csv('../data/Iris.csv')
print(dataset.where(dataset['Species'] == 'Iris-setosa'))
print(dataset[dataset['SepalLengthCm'] > 7.2])
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

dataset = pd.read_csv('../data/Iris.csv')
# Modify the graph above by assigning each species an individual color.
sns.FacetGrid(dataset, hue="Species", size=5) \
    .map(plt.scatter, "SepalLengthCm", "SepalWidthCm") \
    .add_legend()
plt.show()
For details on this function, see: http://seaborn.pydata.org/generated/seaborn.FacetGrid.map.html#seaborn.FacetGrid.map
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv('../data/Iris.csv')
dataset.plot(kind='box', subplots=True, layout=(2, 3), sharex=False, sharey=False)
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

dataset = pd.read_csv('../data/Iris.csv')
sns.boxplot(x="Species", y="PetalLengthCm", data=dataset)
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

dataset = pd.read_csv('../data/Iris.csv')
ax = sns.boxplot(x="Species", y="PetalLengthCm", data=dataset)
ax = sns.stripplot(x="Species", y="PetalLengthCm", data=dataset, jitter=True, edgecolor="gray")
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

dataset = pd.read_csv('../data/Iris.csv')
ax = sns.boxplot(x="Species", y="PetalLengthCm", data=dataset)
ax = sns.stripplot(x="Species", y="PetalLengthCm", data=dataset, jitter=True, edgecolor="gray")

# Recolor two of the boxes
boxtwo = ax.artists[2]
boxtwo.set_facecolor('red')
boxtwo.set_edgecolor('black')
boxthree = ax.artists[1]
boxthree.set_facecolor('yellow')
boxthree.set_edgecolor('black')
plt.show()
For details on this function, see: http://seaborn.pydata.org/generated/seaborn.boxplot.html#seaborn.boxplot
We can also create a histogram of each input variable to get an idea of its distribution.
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv('../data/Iris.csv')
dataset.hist(figsize=(15, 20))
plt.show()
It looks like perhaps two of the input variables have a Gaussian distribution. This is worth noting, because we can use algorithms that exploit this assumption.
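One hedged way to probe a Gaussian-looking variable is a Shapiro-Wilk normality test. The sketch below uses synthetic stand-in columns (hypothetical values, not read from the CSV), just to show how the test separates a normal-like variable from a skewed one:

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins (hypothetical numbers, since the CSV is not bundled here):
rng = np.random.default_rng(0)
normal_col = rng.normal(loc=5.8, scale=0.8, size=150)   # roughly like SepalLengthCm
skewed_col = rng.exponential(scale=1.0, size=150)       # clearly non-Gaussian

_, p_normal = stats.shapiro(normal_col)
_, p_skewed = stats.shapiro(skewed_col)
print('normal-like p-value:', round(p_normal, 4))
print('skewed p-value:', round(p_skewed, 6))
# A large p-value means the Gaussian assumption is plausible; a tiny one rejects it.
```

The same call can be applied column by column to the real dataset once it is loaded.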
The diagonal panels are histograms of each variable; the other panels are scatter plots between pairs of variables.
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv('../data/Iris.csv')
pd.plotting.scatter_matrix(dataset, figsize=(10, 10))
plt.show()
Note the diagonal grouping of some attribute pairs. This suggests high correlation and a predictable relationship.
A violin plot, also called a kernel density plot, combines a box plot and a kernel density estimate into a single figure; it gets its name from its violin-like shape.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

dataset = pd.read_csv('../data/Iris.csv')
sns.violinplot(data=dataset, x="Species", y="PetalLengthCm")
plt.show()
For details on this function, see: http://seaborn.pydata.org/generated/seaborn.violinplot.html?highlight=violinplot#seaborn.violinplot
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

dataset = pd.read_csv('../data/Iris.csv')
sns.pairplot(dataset, hue="Species")
plt.show()
For details on this function, see: http://seaborn.pydata.org/generated/seaborn.pairplot.html?highlight=pairplot#seaborn.pairplot
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

dataset = pd.read_csv('../data/Iris.csv')
sns.FacetGrid(dataset, hue="Species", size=5).map(sns.kdeplot, "PetalLengthCm").add_legend()
plt.show()
For details on this function, see: http://seaborn.pydata.org/generated/seaborn.FacetGrid.html#seaborn.FacetGrid
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

dataset = pd.read_csv('../data/Iris.csv')
# Use seaborn's jointplot to make a hexagonal bin plot.
# Set desired size and ratio and choose a color.
sns.jointplot(x="SepalLengthCm", y="SepalWidthCm", data=dataset, size=10, ratio=10, kind='hex', color='green')
plt.show()
For details on this function, see: http://seaborn.pydata.org/generated/seaborn.jointplot.html#seaborn.jointplot
In an Andrews plot, each data point x = (x1, x2, x3, ...) defines a finite Fourier series: f(t) = x1/√2 + x2·sin(t) + x3·cos(t) + x4·sin(2t) + x5·cos(2t) + ...
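This finite Fourier series can be evaluated directly. The helper below is an illustrative sketch (not the pandas implementation) that reproduces the standard harmonic pattern sin t, cos t, sin 2t, cos 2t, ...:

```python
import numpy as np

def andrews_curve(x, t):
    """Evaluate the finite Fourier series defined by one observation x at t."""
    result = x[0] / np.sqrt(2.0)
    for i, xi in enumerate(x[1:], start=1):
        k = (i + 1) // 2  # harmonic index: sin t, cos t, sin 2t, cos 2t, ...
        result += xi * (np.sin(k * t) if i % 2 == 1 else np.cos(k * t))
    return result

x = np.array([5.1, 3.5, 1.4, 0.2])   # one hypothetical iris measurement
print(andrews_curve(x, 0.0))         # at t=0 only x1/sqrt(2) + x3 survive: ~5.0062
```

pandas plots f(t) for every row over t in [-pi, pi], so rows with similar measurements trace similar curves.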
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import andrews_curves

dataset = pd.read_csv('../data/Iris.csv')
andrews_curves(dataset.drop("Id", axis=1), "Species", colormap='rainbow')
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

dataset = pd.read_csv('../data/Iris.csv')
# Draw a heatmap of the correlation matrix calculated by dataset.corr()
sns.heatmap(dataset.corr(), annot=True, cmap='cubehelix_r')
plt.show()
For details on this function, see: http://seaborn.pydata.org/generated/seaborn.heatmap.html?highlight=heatmap#seaborn.heatmap
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import radviz  # pandas.tools.plotting is deprecated

dataset = pd.read_csv('../data/Iris.csv')
radviz(dataset.drop("Id", axis=1), "Species")
plt.show()
Main goals of data cleaning
Problems that need to be addressed include
In this section, nearly 20 learning algorithms are applied.
Evaluation terminology for classification algorithms
Evaluation metrics for classification algorithms
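Classification metrics such as precision, recall, and F1 can be recomputed by hand from a confusion matrix. A minimal sketch with hypothetical counts (rows are true classes, columns are predictions):

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = true class, columns = predicted class.
cm = np.array([[11, 0, 0],
               [0, 11, 2],
               [0, 0, 6]])

for i, label in enumerate(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']):
    tp = cm[i, i]
    precision = tp / cm[:, i].sum()   # column sum: everything predicted as class i
    recall = tp / cm[i, :].sum()      # row sum: everything truly of class i
    f1 = 2 * precision * recall / (precision + recall)
    print(f'{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')

accuracy = np.trace(cm) / cm.sum()    # correct predictions over all predictions
print('accuracy is', accuracy)
```

These are exactly the quantities that `classification_report`, `confusion_matrix`, and `accuracy_score` report in the experiments below.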
import pandas as pd
# packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# K-Nearest Neighbours
from sklearn.neighbors import KNeighborsClassifier
Model = KNeighborsClassifier(n_neighbors=8)
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is', accuracy_score(y_pred, y_test))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      1.00      1.00        13
 Iris-virginica       1.00      1.00      1.00         6

    avg / total       1.00      1.00      1.00        30

[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]
accuracy is 1.0
For the theory behind the k-nearest neighbors algorithm, see: https://www.cnblogs.com/pinard/p/6061661.html
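To make the idea concrete, here is a from-scratch sketch of the k-NN majority vote on illustrative toy data (the experiment above uses scikit-learn's KNeighborsClassifier, not this helper):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k training points nearest to x (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest_labels = y_train[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest_labels, return_counts=True)
    return values[np.argmax(counts)]

# Two well-separated toy clusters:
X_train = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # → 0
```

The choice of k (here `n_neighbors=8` in the experiment) trades off noise sensitivity against decision-boundary smoothness.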
import pandas as pd
# packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Radius Neighbors
from sklearn.neighbors import RadiusNeighborsClassifier
Model = RadiusNeighborsClassifier(radius=8.0)
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is', accuracy_score(y_test, y_pred))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      1.00      1.00        13
 Iris-virginica       1.00      1.00      1.00         6

    avg / total       1.00      1.00      1.00        30

[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]
accuracy is 1.0
For the library parameters of k-nearest neighbors and radius-based nearest neighbors, see: https://www.cnblogs.com/pinard/p/6065607.html
import pandas as pd
# packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# LogisticRegression
from sklearn.linear_model import LogisticRegression
Model = LogisticRegression()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is', accuracy_score(y_pred, y_test))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      0.85      0.92        13
 Iris-virginica       0.75      1.00      0.86         6

    avg / total       0.95      0.93      0.94        30

[[11  0  0]
 [ 0 11  2]
 [ 0  0  6]]
accuracy is 0.9333333333333333
For the theory behind logistic regression, see: https://www.cnblogs.com/pinard/p/6029432.html
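At its core, logistic regression maps a linear score to a probability with the sigmoid function; a minimal sketch:

```python
import math

def sigmoid(z):
    """Map a linear score z = w.x + b to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))              # → 0.5, the decision boundary
print(round(sigmoid(4.0), 4))    # → 0.982, strongly in favor of the positive class
```

For a three-class problem like Iris, scikit-learn combines several such binary models (one-vs-rest by default in this version).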
import pandas as pd
# packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Passive Aggressive Classifier
from sklearn.linear_model import PassiveAggressiveClassifier
Model = PassiveAggressiveClassifier()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is', accuracy_score(y_pred, y_test))

(Running this emits a FutureWarning about the new max_iter/tol defaults of PassiveAggressiveClassifier, and an UndefinedMetricWarning because no test samples were predicted as Iris-versicolor.)

                 precision    recall  f1-score   support

    Iris-setosa       0.73      1.00      0.85        11
Iris-versicolor       0.00      0.00      0.00        13
 Iris-virginica       0.40      1.00      0.57         6

    avg / total       0.35      0.57      0.42        30

[[11  0  0]
 [ 4  0  9]
 [ 0  0  6]]
accuracy is 0.5666666666666667
For the theory behind the Passive-Aggressive algorithm, see: http://scikit-learn.org/0.19/modules/linear_model.html#passive-aggressive
In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on Bayes' theorem with strong (naive) independence assumptions between the features. Below is an experiment with Gaussian naive Bayes.
GaussianNB assumes the class-conditional distribution of each feature is normal, i.e.: P(x_i | y) = (1/√(2πσ_y²)) · exp(−(x_i − μ_y)² / (2σ_y²))
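A minimal sketch of evaluating the Gaussian class-conditional likelihood that GaussianNB relies on (the mean and variance below are hypothetical, not fitted from the dataset):

```python
import math

def gaussian_likelihood(x, mu, sigma2):
    """P(x | y) under the GaussianNB assumption: class mean mu, class variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Density of a feature value at its class-conditional mean:
print(gaussian_likelihood(5.0, mu=5.0, sigma2=0.12))
```

GaussianNB estimates a (mu, sigma2) pair per feature per class from the training data and multiplies these likelihoods together with the class prior.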
import pandas as pd
# packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
Model = GaussianNB()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is', accuracy_score(y_pred, y_test))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      1.00      1.00        13
 Iris-virginica       1.00      1.00      1.00         6

    avg / total       1.00      1.00      1.00        30

[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]
accuracy is 1.0
For the theory behind naive Bayes, see: https://www.cnblogs.com/pinard/p/6069267.html
import pandas as pd
# packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# MultinomialNB
from sklearn.naive_bayes import MultinomialNB
Model = MultinomialNB()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is', accuracy_score(y_pred, y_test))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      0.91      0.95        11
Iris-versicolor       0.83      0.77      0.80        13
 Iris-virginica       0.62      0.83      0.71         6

    avg / total       0.85      0.83      0.84        30

[[10  1  0]
 [ 0 10  3]
 [ 0  1  5]]
accuracy is 0.8333333333333334
import pandas as pd
# packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# BernoulliNB
from sklearn.naive_bayes import BernoulliNB
Model = BernoulliNB()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is', accuracy_score(y_pred, y_test))

(Running this emits an UndefinedMetricWarning because no test samples were predicted as Iris-setosa or Iris-versicolor.)

                 precision    recall  f1-score   support

    Iris-setosa       0.00      0.00      0.00        11
Iris-versicolor       0.00      0.00      0.00        13
 Iris-virginica       0.20      1.00      0.33         6

    avg / total       0.04      0.20      0.07        30

[[ 0  0 11]
 [ 0  0 13]
 [ 0  0  6]]
accuracy is 0.2
import pandas as pd
# packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Support Vector Machine
from sklearn.svm import SVC
Model = SVC()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is', accuracy_score(y_pred, y_test))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      1.00      1.00        13
 Iris-virginica       1.00      1.00      1.00         6

    avg / total       1.00      1.00      1.00        30

[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]
accuracy is 1.0
For the theory behind support vector machines, see: http://www.cnblogs.com/pinard/p/6097604.html
import pandas as pd
# packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Nu-Support Vector Classification
from sklearn.svm import NuSVC
Model = NuSVC()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is', accuracy_score(y_pred, y_test))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      1.00      1.00        13
 Iris-virginica       1.00      1.00      1.00         6

    avg / total       1.00      1.00      1.00        30

[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]
accuracy is 1.0
import pandas as pd
# packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Linear Support Vector Classification
from sklearn.svm import LinearSVC
Model = LinearSVC()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is', accuracy_score(y_pred, y_test))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      0.77      0.87        13
 Iris-virginica       0.67      1.00      0.80         6

    avg / total       0.93      0.90      0.90        30

[[11  0  0]
 [ 0 10  3]
 [ 0  0  6]]
accuracy is 0.9
import pandas as pd
# packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Decision Tree
from sklearn.tree import DecisionTreeClassifier
Model = DecisionTreeClassifier()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is', accuracy_score(y_pred, y_test))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       0.93      1.00      0.96        13
 Iris-virginica       1.00      0.83      0.91         6

    avg / total       0.97      0.97      0.97        30

[[11  0  0]
 [ 0 13  0]
 [ 0  1  5]]
accuracy is 0.9666666666666667
For the theory behind decision trees, see: https://www.cnblogs.com/pinard/p/6050306.html
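Decision trees choose splits that reduce node impurity; a minimal sketch of the Gini criterion (DecisionTreeClassifier's default):

```python
def gini(counts):
    """Gini impurity of a node holding the given per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([50, 0, 0]))              # → 0.0, a pure node
print(round(gini([10, 10, 10]), 4))  # → 0.6667, maximally mixed for 3 classes
```

At each node, the tree evaluates candidate thresholds on each feature and keeps the split whose children have the lowest weighted impurity.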
import pandas as pd
# packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# ExtraTreeClassifier
from sklearn.tree import ExtraTreeClassifier
Model = ExtraTreeClassifier()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is', accuracy_score(y_pred, y_test))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       0.93      1.00      0.96        13
 Iris-virginica       1.00      0.83      0.91         6

    avg / total       0.97      0.97      0.97        30

[[11  0  0]
 [ 0 13  0]
 [ 0  1  5]]
accuracy is 0.9666666666666667
import pandas as pd
# packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Multi-layer Perceptron
from sklearn.neural_network import MLPClassifier
Model = MLPClassifier()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)

# Summary of the predictions
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is', accuracy_score(y_pred, y_test))

(Running this emits an UndefinedMetricWarning because every test sample was predicted as Iris-versicolor.)

                 precision    recall  f1-score   support

    Iris-setosa       0.00      0.00      0.00        11
Iris-versicolor       0.43      1.00      0.60        13
 Iris-virginica       0.00      0.00      0.00         6

    avg / total       0.19      0.43      0.26        30

[[ 0 11  0]
 [ 0 13  0]
 [ 0  6  0]]
accuracy is 0.43333333333333335
A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control overfitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (the default).
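The with-replacement draw described above can be sketched directly (illustrative code with a hypothetical sample size):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 150                                     # size of the original sample
bootstrap_idx = rng.integers(0, n, size=n)  # draw n row indices with replacement
unique_fraction = len(set(bootstrap_idx)) / n
print(f'fraction of distinct rows in one draw: {unique_fraction:.2f}')
# In expectation about 1 - 1/e ≈ 63% of the rows appear in each bootstrap sample.
```

Each tree in the forest is fitted on a different such draw, which is what decorrelates the trees and makes their average more stable.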
Random forest experiment
import pandas as pd
# packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Random Forest
from sklearn.ensemble import RandomForestClassifier
Model = RandomForestClassifier(max_depth=2)
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_pred, y_test))
# Accuracy score
print('accuracy is', accuracy_score(y_pred, y_test))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       0.93      1.00      0.96        13
 Iris-virginica       1.00      0.83      0.91         6

    avg / total       0.97      0.97      0.97        30

[[11  0  0]
 [ 0 13  1]
 [ 0  0  5]]
accuracy is 0.9666666666666667
For the theory behind random forests, see: https://www.cnblogs.com/pinard/p/6156009.html
import pandas as pd
# packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Bagging
from sklearn.ensemble import BaggingClassifier
Model = BaggingClassifier()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_pred, y_test))
# Accuracy score
print('accuracy is', accuracy_score(y_pred, y_test))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       0.93      1.00      0.96        13
 Iris-virginica       1.00      0.83      0.91         6

    avg / total       0.97      0.97      0.97        30

[[11  0  0]
 [ 0 13  1]
 [ 0  0  5]]
accuracy is 0.9666666666666667
For the theory behind bagging, see: https://www.cnblogs.com/pinard/p/6156009.html
import pandas as pd
# packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# AdaBoost
from sklearn.ensemble import AdaBoostClassifier
Model = AdaBoostClassifier()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_pred, y_test))
# Accuracy score
print('accuracy is', accuracy_score(y_pred, y_test))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       0.93      1.00      0.96        13
 Iris-virginica       1.00      0.83      0.91         6

    avg / total       0.97      0.97      0.97        30

[[11  0  0]
 [ 0 13  1]
 [ 0  0  5]]
accuracy is 0.9666666666666667
For the theory behind AdaBoost, see: https://www.cnblogs.com/pinard/p/6133937.html
import pandas as pd
# packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
Model = GradientBoostingClassifier()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_pred, y_test))
# Accuracy score
print('accuracy is', accuracy_score(y_pred, y_test))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       0.93      1.00      0.96        13
 Iris-virginica       1.00      0.83      0.91         6

    avg / total       0.97      0.97      0.97        30

[[11  0  0]
 [ 0 13  1]
 [ 0  0  5]]
accuracy is 0.9666666666666667
For the theory behind gradient boosted decision trees (GBDT), see: https://www.cnblogs.com/pinard/p/6140514.html
import pandas as pd
# packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Linear Discriminant Analysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
Model = LinearDiscriminantAnalysis()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_pred, y_test))
# Accuracy score
print('accuracy is', accuracy_score(y_pred, y_test))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      1.00      1.00        13
 Iris-virginica       1.00      1.00      1.00         6

    avg / total       1.00      1.00      1.00        30

[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]
accuracy is 1.0
For the theory behind linear discriminant analysis (LDA), see: https://www.cnblogs.com/pinard/p/6244265.html
import pandas as pd
# packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Quadratic Discriminant Analysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
Model = QuadraticDiscriminantAnalysis()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_pred, y_test))
# Accuracy score
print('accuracy is', accuracy_score(y_pred, y_test))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      1.00      1.00        13
 Iris-virginica       1.00      1.00      1.00         6

    avg / total       1.00      1.00      1.00        30

[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]
accuracy is 1.0
For the theory behind quadratic discriminant analysis (QDA), see: http://scikit-learn.org/stable/modules/lda_qda.html#dimensionality-reduction-using-linear-discriminant-analysis
[2] Andrews D F. Plots of High-Dimensional Data[J]. Biometrics, 1972, 28(1):125-136.