An Example Analysis of a Machine Learning Workflow in Python

Author: 冬之晓 (Dongzhixiao)
Date: November 2, 2018, Beijing

1. Introduction

2. The Machine Learning Workflow

3. Problem Definition

3.1 Problem Characteristics

3.2 Objective

3.3 Variables

4. Input and Output

5. Installing Tools and Dependency Packages

# Packages to load, with library version checks
import warnings
warnings.filterwarnings('ignore')

# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# matplotlib
import matplotlib
import matplotlib.pyplot as plt
print('matplotlib: {}'.format(matplotlib.__version__))
# numpy
import numpy as np  # linear algebra
print('numpy: {}'.format(np.__version__))
# pandas
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
print('pandas: {}'.format(pd.__version__))
# seaborn
import seaborn as sns
print('seaborn: {}'.format(sns.__version__))
sns.set(color_codes=True)
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
# %matplotlib inline  (uncomment when running inside a Jupyter notebook)
# Input data files are available in the "../input/" directory.
import os
# Importing metrics for evaluation
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
Python: 3.6.6 |Anaconda custom (64-bit)| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]
scipy: 1.1.0
matplotlib: 2.2.3
numpy: 1.15.1
pandas: 0.20.3
seaborn: 0.9.0
sklearn: 0.19.2

6. Exploratory Data Analysis

6.1 Data Collection and Initial Exploration

# Import the dataset to play with it
import pandas as pd
dataset = pd.read_csv('../data/Iris.csv')
print(type(dataset))
print(dataset.shape)
print(dataset.size)
dataset.info()  # info() prints directly; no need to wrap it in print()
(150, 6)
900
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
Id               150 non-null int64
SepalLengthCm    150 non-null float64
SepalWidthCm     150 non-null float64
PetalLengthCm    150 non-null float64
PetalWidthCm     150 non-null float64
Species          150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.1+ KB
import pandas as pd
dataset = pd.read_csv('../data/Iris.csv')
print(dataset['Species'].unique())
print(dataset["Species"].value_counts())
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
Iris-versicolor    50
Iris-virginica     50
Iris-setosa        50
Name: Species, dtype: int64
import pandas as pd
dataset = pd.read_csv('../data/Iris.csv')
print(dataset.head(5))   # first rows
print(dataset.tail())    # last rows
print(dataset.sample(5)) # random sample
   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa
      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm         Species
145  146            6.7           3.0            5.2           2.3  Iris-virginica
146  147            6.3           2.5            5.0           1.9  Iris-virginica
147  148            6.5           3.0            5.2           2.0  Iris-virginica
148  149            6.2           3.4            5.4           2.3  Iris-virginica
149  150            5.9           3.0            5.1           1.8  Iris-virginica
     Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm          Species
71   72            6.1           2.8            4.0           1.3  Iris-versicolor
27   28            5.2           3.5            1.5           0.2      Iris-setosa
38   39            4.4           3.0            1.3           0.2      Iris-setosa
89   90            5.5           2.5            4.0           1.3  Iris-versicolor
73   74            6.1           2.8            4.7           1.2  Iris-versicolor
import pandas as pd
dataset = pd.read_csv('../data/Iris.csv')
print(dataset.describe()) 
               Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count  150.000000     150.000000    150.000000     150.000000    150.000000
mean    75.500000       5.843333      3.054000       3.758667      1.198667
std     43.445368       0.828066      0.433594       1.764420      0.763161
min      1.000000       4.300000      2.000000       1.000000      0.100000
25%     38.250000       5.100000      2.800000       1.600000      0.300000
50%     75.500000       5.800000      3.000000       4.350000      1.300000
75%    112.750000       6.400000      3.300000       5.100000      1.800000
max    150.000000       7.900000      4.400000       6.900000      2.500000
import pandas as pd
dataset = pd.read_csv('../data/Iris.csv')
print(dataset.where(dataset['Species']=='Iris-setosa'))
print(dataset[dataset['SepalLengthCm']>7.2])
(where() keeps all 150 rows: the 50 Iris-setosa rows keep their values and every other row becomes NaN; long output truncated)
      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm         Species
105  106            7.6           3.0            6.6           2.1  Iris-virginica
107  108            7.3           2.9            6.3           1.8  Iris-virginica
117  118            7.7           3.8            6.7           2.2  Iris-virginica
118  119            7.7           2.6            6.9           2.3  Iris-virginica
122  123            7.7           2.8            6.7           2.0  Iris-virginica
130  131            7.4           2.8            6.1           1.9  Iris-virginica
131  132            7.9           3.8            6.4           2.0  Iris-virginica
135  136            7.7           3.0            6.1           2.3  Iris-virginica

6.2 Visualization

6.2.1 Scatter Plot

import pandas as pd
dataset = pd.read_csv('../data/Iris.csv')
import seaborn as sns
import matplotlib.pyplot as plt
# Give each species its own color.
sns.FacetGrid(dataset, hue="Species", height=5) \
   .map(plt.scatter, "SepalLengthCm", "SepalWidthCm") \
   .add_legend()
plt.show()

For details on this function, see: http://seaborn.pydata.org/generated/seaborn.FacetGrid.map.html#seaborn.FacetGrid.map

6.2.2 Box Plot

import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_csv('../data/Iris.csv')
dataset.plot(kind='box', subplots=True, layout=(2,3), sharex=False, sharey=False)
plt.show()

For details on this function, see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html?highlight=plot#pandas.DataFrame.plot

6.2.2 Box Plot (continued)

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('../data/Iris.csv')
sns.boxplot(x="Species", y="PetalLengthCm", data=dataset)
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('../data/Iris.csv')
ax = sns.boxplot(x="Species", y="PetalLengthCm", data=dataset)
ax = sns.stripplot(x="Species", y="PetalLengthCm", data=dataset, jitter=True, edgecolor="gray")
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('../data/Iris.csv')
ax = sns.boxplot(x="Species", y="PetalLengthCm", data=dataset)
ax = sns.stripplot(x="Species", y="PetalLengthCm", data=dataset, jitter=True, edgecolor="gray")

# Recolor individual boxes through the axes' artist list (one artist per x category)
boxtwo = ax.artists[2]        # third box: Iris-virginica
boxtwo.set_facecolor('red')
boxtwo.set_edgecolor('black')
boxthree = ax.artists[1]      # second box: Iris-versicolor
boxthree.set_facecolor('yellow')
boxthree.set_edgecolor('black')

plt.show()

For details on this function, see: http://seaborn.pydata.org/generated/seaborn.boxplot.html#seaborn.boxplot

6.2.3 Histograms

We can also create a histogram of each input variable to get a sense of its distribution.

import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_csv('../data/Iris.csv')
dataset.hist(figsize=(15,20))
plt.show()

It looks as though two of the input variables may have a roughly Gaussian distribution. This is worth noting, because we can choose algorithms that exploit this assumption.

For details on this function, see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html?highlight=hist#pandas.DataFrame.hist
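The Gaussian impression from the histograms can be checked numerically, for example with a Shapiro-Wilk normality test per column. A minimal sketch, using scikit-learn's built-in copy of the iris data so it runs without the CSV file:

```python
from sklearn.datasets import load_iris
from scipy import stats

iris = load_iris()
# Run a Shapiro-Wilk normality test on each measurement column;
# a small p-value is evidence against normality.
for name, col in zip(iris.feature_names, iris.data.T):
    stat, p = stats.shapiro(col)
    print('{}: W={:.3f}, p={:.4f}'.format(name, stat, p))
```

Note that pooling all three species into one column can mask per-class normality; GaussianNB, for instance, fits a normal distribution per class, not per pooled column.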

6.2.4 Pairwise Scatter Plots

The diagonal panels show histograms; the off-diagonal panels are scatter plots of each pair of variables.

import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_csv('../data/Iris.csv')
pd.plotting.scatter_matrix(dataset,figsize=(10,10))
plt.show()

Note the diagonal grouping of some attribute pairs. This suggests high correlation and a predictable relationship.

For details on this function, see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.plotting.scatter_matrix.html?highlight=plotting
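The grouping visible in the scatter matrix can be quantified with the correlation matrix. A sketch using scikit-learn's built-in iris copy (which has no Id column to skew the numbers):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Pairwise Pearson correlations; petal length and petal width
# turn out to be very strongly correlated.
corr = df.corr()
print(corr.round(2))
```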

6.2.5 Violin Plot

A violin plot combines a box plot and a kernel density plot in a single figure; it takes its name from its violin-like shape.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('../data/Iris.csv')
sns.violinplot(data=dataset,x="Species", y="PetalLengthCm")
plt.show()

For details on this function, see: http://seaborn.pydata.org/generated/seaborn.violinplot.html?highlight=violinplot#seaborn.violinplot

6.2.6 Pair Plot

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('../data/Iris.csv')
sns.pairplot(dataset, hue="Species")
plt.show()

For details on this function, see: http://seaborn.pydata.org/generated/seaborn.pairplot.html?highlight=pairplot#seaborn.pairplot

6.2.7 kdeplot

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('../data/Iris.csv')
sns.FacetGrid(dataset, hue="Species", height=5).map(sns.kdeplot, "PetalLengthCm").add_legend()
plt.show()

For details on this function, see: http://seaborn.pydata.org/generated/seaborn.FacetGrid.html#seaborn.FacetGrid

6.2.8 jointplot

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('../data/Iris.csv')
# Use seaborn's jointplot to make a hexagonal bin plot
#Set desired size and ratio and choose a color.
sns.jointplot(x="SepalLengthCm", y="SepalWidthCm", data=dataset, height=10, ratio=10, kind='hex', color='green')
plt.show()

For details on this function, see: http://seaborn.pydata.org/generated/seaborn.jointplot.html#seaborn.jointplot

6.2.9 Andrews Curves

Each point $x = (x_1, x_2, \ldots, x_d)$ defines a finite Fourier series:

$$f(t) = \frac{x_1}{\sqrt{2}} + x_2 \sin(t) + x_3 \cos(t) + x_4 \sin(2t) + x_5 \cos(2t) + \cdots$$
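The series is straightforward to evaluate directly; `andrews_f` below is a hypothetical helper (pandas' `andrews_curves` does this internally for every row):

```python
import numpy as np

def andrews_f(x, t):
    """Evaluate the finite Fourier series f(t) for one sample x = (x_1, ..., x_d)."""
    result = x[0] / np.sqrt(2)
    for i, xi in enumerate(x[1:], start=1):
        k = (i + 1) // 2  # harmonic order: 1, 1, 2, 2, 3, 3, ...
        result += xi * (np.sin(k * t) if i % 2 == 1 else np.cos(k * t))
    return result

# One Iris-setosa sample (SepalLength, SepalWidth, PetalLength, PetalWidth)
x = np.array([5.1, 3.5, 1.4, 0.2])
t = np.linspace(-np.pi, np.pi, 100)
print(andrews_f(x, t)[:3])
```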

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('../data/Iris.csv')
from pandas.plotting import andrews_curves
andrews_curves(dataset.drop("Id", axis=1), "Species",colormap='rainbow')
plt.show()
Traceback (truncated): saving the figure as SVG fails with UnicodeEncodeError: 'gbk' codec can't encode character '\u2212' in position 5: illegal multibyte sequence (the Unicode minus sign in a tick label cannot be encoded under the GBK locale used by this export pipeline).

For the underlying theory, see: http://www.jucs.org/jucs_11_11/visualization_of_high_dimensional/jucs_11_11_1806_1819_garc_a_osorio.pdf

For details on this function, see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.plotting.andrews_curves.html?highlight=andrews_curves

6.2.10 Heatmap

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('../data/Iris.csv')
sns.heatmap(dataset.corr(), annot=True, cmap='cubehelix_r')  # heatmap of the correlation matrix calculated by dataset.corr()
plt.show()
Traceback (truncated): saving the figure as SVG fails with UnicodeEncodeError: 'gbk' codec can't encode character '\u2212' in position 5: illegal multibyte sequence (the same GBK tick-label encoding issue as above).

For details on this function, see: http://seaborn.pydata.org/generated/seaborn.heatmap.html?highlight=heatmap#seaborn.heatmap

6.2.11 RadViz (Radial Visualization)

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('../data/Iris.csv')
from pandas.plotting import radviz  # pandas.tools.plotting is deprecated; use pandas.plotting
radviz(dataset.drop("Id", axis=1), "Species")
plt.show()
Traceback (truncated): saving the figure as SVG fails with UnicodeEncodeError: 'gbk' codec can't encode character '\u2212' in position 5: illegal multibyte sequence (the same GBK tick-label encoding issue as above).

For details on this function, see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.plotting.radviz.html?highlight=radviz#pandas.plotting.radviz

6.3 Data Preprocessing

6.4 Data Cleaning
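The iris data happens to need little cleaning, but checking for missing values and standardizing the features are the usual steps for sections 6.3 and 6.4. A minimal sketch, again using scikit-learn's built-in copy of the data:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Data cleaning: confirm there are no missing values to impute.
print(df.isnull().sum())

# Preprocessing: scale each feature to zero mean and unit variance,
# which distance-based models such as k-nearest neighbors benefit from.
X_scaled = StandardScaler().fit_transform(df.values)
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```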

7. Model Exploration

7.1 K-Nearest Neighbors

import pandas as pd
# Packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values  # note: this keeps the Id column as a feature; iloc[:, 1:-1] would drop it
y = dataset.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# K-Nearest Neighbours
from sklearn.neighbors import KNeighborsClassifier

Model = KNeighborsClassifier(n_neighbors=8)
Model.fit(X_train, y_train)

y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is', accuracy_score(y_test, y_pred))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      1.00      1.00        13
 Iris-virginica       1.00      1.00      1.00         6

    avg / total       1.00      1.00      1.00        30

[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]
accuracy is 1.0

For the theory behind the k-nearest neighbors algorithm, see: https://www.cnblogs.com/pinard/p/6061661.html
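A perfect score on a 30-sample test split can flatter the model; cross-validation averages over many splits and gives a steadier estimate. A sketch with scikit-learn's built-in iris copy:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# 10-fold cross-validated accuracy for the same k=8 model
scores = cross_val_score(KNeighborsClassifier(n_neighbors=8), iris.data, iris.target, cv=10)
print('mean accuracy: {:.3f} (+/- {:.3f})'.format(scores.mean(), scores.std()))
```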

7.2 Radius Neighbors Classifier

import pandas as pd
# Packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.neighbors import RadiusNeighborsClassifier
Model = RadiusNeighborsClassifier(radius=8.0)
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)
# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is', accuracy_score(y_test, y_pred))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      1.00      1.00        13
 Iris-virginica       1.00      1.00      1.00         6

    avg / total       1.00      1.00      1.00        30

[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]
accuracy is 1.0

For the scikit-learn parameters of k-nearest neighbors and radius neighbors, see: https://www.cnblogs.com/pinard/p/6065607.html

7.3 Logistic Regression (Linear Model)

import pandas as pd
# Packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# LogisticRegression
from sklearn.linear_model import LogisticRegression
Model = LogisticRegression()
Model.fit(X_train, y_train)

y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is',accuracy_score(y_pred,y_test))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      0.85      0.92        13
 Iris-virginica       0.75      1.00      0.86         6

    avg / total       0.95      0.93      0.94        30

[[11  0  0]
 [ 0 11  2]
 [ 0  0  6]]
accuracy is 0.9333333333333333

For the theory behind logistic regression, see: https://www.cnblogs.com/pinard/p/6029432.html
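Beyond hard labels, logistic regression also exposes per-class probabilities via `predict_proba`. A sketch on scikit-learn's built-in iris copy:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)
model = LogisticRegression()
model.fit(X_train, y_train)
# Each row holds P(class | x) for the three species and sums to 1.
proba = model.predict_proba(X_test)
print(proba[:3].round(3))
```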

7.4 Passive-Aggressive Classification (Linear Model)

import pandas as pd
# Packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.linear_model import PassiveAggressiveClassifier
Model = PassiveAggressiveClassifier()  # the defaults trigger a FutureWarning in 0.19; set max_iter and tol explicitly to silence it
Model.fit(X_train, y_train)

y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is',accuracy_score(y_pred,y_test))

G:\Anaconda3\lib\site-packages\sklearn\linear_model\stochastic_gradient.py:128: FutureWarning: max_iter and tol parameters have been added in <class 'sklearn.linear_model.passive_aggressive.PassiveAggressiveClassifier'> in 0.19. If both are left unset, they default to max_iter=5 and tol=None. If tol is not None, max_iter defaults to max_iter=1000. From 0.21, default max_iter will be 1000, and default tol will be 1e-3.
G:\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.

                 precision    recall  f1-score   support

    Iris-setosa       0.73      1.00      0.85        11
Iris-versicolor       0.00      0.00      0.00        13
 Iris-virginica       0.40      1.00      0.57         6

    avg / total       0.35      0.57      0.42        30

[[11  0  0]
 [ 4  0  9]
 [ 0  0  6]]
accuracy is 0.5666666666666667

For the theory behind passive-aggressive algorithms, see: http://scikit-learn.org/0.19/modules/linear_model.html#passive-aggressive

7.5 Naive Bayes

In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on Bayes' theorem with a strong (naive) independence assumption between features. The experiment below uses Gaussian naive Bayes: GaussianNB assumes the class-conditional distribution of each feature is normal, as in the following formula:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)$$
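The class-conditional density above is just the normal pdf with per-class mean and standard deviation; evaluating it directly matches `scipy.stats.norm`. A sketch, with `gaussian_likelihood` a hypothetical helper:

```python
import numpy as np
from scipy.stats import norm

def gaussian_likelihood(x, mu, sigma):
    """P(x | y) under the GaussianNB assumption: a normal pdf with class mean mu and std sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Agrees with scipy's normal pdf for an arbitrary value:
print(gaussian_likelihood(5.0, 5.8, 0.8))
print(norm.pdf(5.0, loc=5.8, scale=0.8))
```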

import pandas as pd
# Packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
Model = GaussianNB()
Model.fit(X_train, y_train)

y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is',accuracy_score(y_pred,y_test))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      1.00      1.00        13
 Iris-virginica       1.00      1.00      1.00         6

    avg / total       1.00      1.00      1.00        30

[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]
accuracy is 1.0

For the theory behind naive Bayes, see: https://www.cnblogs.com/pinard/p/6069267.html

7.6 Multinomial Naive Bayes Classifier

import pandas as pd
# Packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# MultinomialNB
from sklearn.naive_bayes import MultinomialNB
Model = MultinomialNB()
Model.fit(X_train, y_train)

y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is',accuracy_score(y_pred,y_test))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      0.91      0.95        11
Iris-versicolor       0.83      0.77      0.80        13
 Iris-virginica       0.62      0.83      0.71         6

    avg / total       0.85      0.83      0.84        30

[[10  1  0]
 [ 0 10  3]
 [ 0  1  5]]
accuracy is 0.8333333333333334

7.7 Bernoulli Naive Bayes Classifier

$$P(x_i \mid y) = P(i \mid y)\, x_i + (1 - P(i \mid y))(1 - x_i)$$
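BernoulliNB models each feature as binary (continuous inputs are binarized against a threshold), which explains its poor fit to the raw iris measurements below. The per-feature likelihood in the formula is easy to evaluate directly; `bernoulli_likelihood` is a hypothetical helper:

```python
def bernoulli_likelihood(x_i, p_iy):
    """P(x_i | y) for a binary feature: p_iy when x_i = 1, (1 - p_iy) when x_i = 0."""
    return p_iy * x_i + (1 - p_iy) * (1 - x_i)

print(bernoulli_likelihood(1, 0.3))  # feature present
print(bernoulli_likelihood(0, 0.3))  # feature absent
```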

import pandas as pd
# Packages needed for evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# BernoulliNB
from sklearn.naive_bayes import BernoulliNB
Model = BernoulliNB()
Model.fit(X_train, y_train)

y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is',accuracy_score(y_pred,y_test))

G:\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
                 precision    recall  f1-score   support

    Iris-setosa       0.00      0.00      0.00        11
Iris-versicolor       0.00      0.00      0.00        13
 Iris-virginica       0.20      1.00      0.33         6

    avg / total       0.04      0.20      0.07        30

[[ 0  0 11]
 [ 0  0 13]
 [ 0  0  6]]
accuracy is 0.2

Summary of naive Bayes algorithms: GaussianNB assumes continuous, normally distributed features; MultinomialNB assumes count features; BernoulliNB assumes binary features, which is why it scores only 0.2 on the raw continuous iris measurements above.
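BernoulliNB's poor score above reflects a feature mismatch rather than a broken model: it expects 0/1 inputs. A sketch (using the built-in iris data and a per-feature median threshold, both our own choices here) shows accuracy recovering once the features are binarized:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

# Threshold each feature at its training-set median to get 0/1 inputs
medians = np.median(X_train, axis=0)
Xb_train = (X_train > medians).astype(int)
Xb_test = (X_test > medians).astype(int)

model = BernoulliNB().fit(Xb_train, y_train)
acc = accuracy_score(y_test, model.predict(Xb_test))
print('accuracy with binarized features:', acc)
```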

7.8 Support Vector Machine (SVM)

import pandas as pd
# Imports needed for model evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Support Vector Machine
from sklearn.svm import SVC

Model = SVC()
Model.fit(X_train, y_train)

y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score

print('accuracy is',accuracy_score(y_pred,y_test))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      1.00      1.00        13
 Iris-virginica       1.00      1.00      1.00         6

    avg / total       1.00      1.00      1.00        30

[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]
accuracy is 1.0

For the theory behind support vector machines, see: http://www.cnblogs.com/pinard/p/6097604.html

7.9 Nu-Support Vector Classification

import pandas as pd
# Imports needed for model evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Nu-Support Vector Classification (NuSVC)
from sklearn.svm import NuSVC

Model = NuSVC()
Model.fit(X_train, y_train)

y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score

print('accuracy is',accuracy_score(y_pred,y_test))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      1.00      1.00        13
 Iris-virginica       1.00      1.00      1.00         6

    avg / total       1.00      1.00      1.00        30

[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]
accuracy is 1.0

7.10 Linear Support Vector Machine

import pandas as pd
# Imports needed for model evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Linear Support Vector Classification
from sklearn.svm import LinearSVC

Model = LinearSVC()
Model.fit(X_train, y_train)

y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score

print('accuracy is',accuracy_score(y_pred,y_test))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      0.77      0.87        13
 Iris-virginica       0.67      1.00      0.80         6

    avg / total       0.93      0.90      0.90        30

[[11  0  0]
 [ 0 10  3]
 [ 0  0  6]]
accuracy is 0.9

Summary of support vector machines: on this split the kernelized SVC and NuSVC both reach 1.0 accuracy, while the linear-only LinearSVC misclassifies three versicolor samples.
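The SVC runs above all use default hyperparameters. In practice `C` and `gamma` are usually tuned with cross-validation; a hedged sketch using `GridSearchCV` on the built-in iris data (the grid values below are illustrative, not prescribed by this article):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

# 5-fold cross-validated search over C and gamma for an RBF kernel
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
grid.fit(X_train, y_train)

print('best parameters:', grid.best_params_)
test_acc = grid.score(X_test, y_test)
print('test accuracy:', test_acc)
```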

7.11 Decision Tree

import pandas as pd
# Imports needed for model evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Decision Tree classifier
from sklearn.tree import DecisionTreeClassifier

Model = DecisionTreeClassifier()

Model.fit(X_train, y_train)

y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is',accuracy_score(y_pred,y_test))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       0.93      1.00      0.96        13
 Iris-virginica       1.00      0.83      0.91         6

    avg / total       0.97      0.97      0.97        30

[[11  0  0]
 [ 0 13  0]
 [ 0  1  5]]
accuracy is 0.9666666666666667

For the theory behind decision trees, see: https://www.cnblogs.com/pinard/p/6050306.html

7.12 Extra Tree (Extremely Randomized Tree)

import pandas as pd
# Imports needed for model evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# ExtraTreeClassifier
from sklearn.tree import ExtraTreeClassifier

Model = ExtraTreeClassifier()

Model.fit(X_train, y_train)

y_pred = Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is',accuracy_score(y_pred,y_test))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       0.93      1.00      0.96        13
 Iris-virginica       1.00      0.83      0.91         6

    avg / total       0.97      0.97      0.97        30

[[11  0  0]
 [ 0 13  0]
 [ 0  1  5]]
accuracy is 0.9666666666666667

Summary of decision trees
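One practical bonus of tree models is interpretability: a fitted tree reports how much each feature contributed to its splits. A minimal sketch on the built-in iris data (not part of the runs above):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
model = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# feature_importances_ sums to 1; larger values mean more influence on splits
for name, imp in zip(iris.feature_names, model.feature_importances_):
    print('{}: {:.3f}'.format(name, imp))
```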

7.13 Neural Network

import pandas as pd
# Imports needed for model evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.neural_network import MLPClassifier
Model=MLPClassifier()
Model.fit(X_train,y_train)
y_pred=Model.predict(X_test)
# Summary of the predictions
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
#Accuracy Score
print('accuracy is ',accuracy_score(y_pred,y_test))

G:\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
                 precision    recall  f1-score   support

    Iris-setosa       0.00      0.00      0.00        11
Iris-versicolor       0.43      1.00      0.60        13
 Iris-virginica       0.00      0.00      0.00         6

    avg / total       0.19      0.43      0.26        30

[[ 0 11  0]
 [ 0 13  0]
 [ 0  6  0]]
accuracy is 0.43333333333333335

For detailed documentation of this class, see: http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier
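The low MLP accuracy above (the model collapses to predicting a single class) is a typical symptom of feeding an MLP unscaled inputs with too few iterations. A hedged sketch on the built-in iris data, standardizing features first (our own fix, not from the original run):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

# Standardize features before the MLP; unscaled inputs often leave the
# optimizer stuck, which is one plausible cause of the 0.43 accuracy above
model = make_pipeline(StandardScaler(),
                      MLPClassifier(max_iter=1000, random_state=0))
model.fit(X_train, y_train)
print('test accuracy:', model.score(X_test, y_test))
```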

Overview of ensemble learning

7.14 Random Forest

import pandas as pd
# Imports needed for model evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.ensemble import RandomForestClassifier
Model=RandomForestClassifier(max_depth=2)
Model.fit(X_train,y_train)
y_pred=Model.predict(X_test)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_pred,y_test))
#Accuracy Score
print('accuracy is ',accuracy_score(y_pred,y_test))

G:\Anaconda3\lib\site-packages\sklearn\ensemble\weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
  from numpy.core.umath_tests import inner1d
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       0.93      1.00      0.96        13
 Iris-virginica       1.00      0.83      0.91         6

    avg / total       0.97      0.97      0.97        30

[[11  0  0]
 [ 0 13  1]
 [ 0  0  5]]
accuracy is 0.9666666666666667

For the theory behind random forests, see: https://www.cnblogs.com/pinard/p/6156009.html
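Random forests also come with a built-in validation estimate: each tree is trained on a bootstrap sample, so the samples it never saw can score it. A sketch using the built-in iris data (the `n_estimators` value is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
# oob_score=True evaluates each tree on the samples its bootstrap left out,
# giving a validation estimate without a separate hold-out set
model = RandomForestClassifier(n_estimators=100, oob_score=True,
                               random_state=0)
model.fit(iris.data, iris.target)
print('out-of-bag accuracy:', model.oob_score_)
```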

7.15 Bagging

import pandas as pd
# Imports needed for model evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.ensemble import BaggingClassifier
Model=BaggingClassifier()
Model.fit(X_train,y_train)
y_pred=Model.predict(X_test)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_pred,y_test))
#Accuracy Score
print('accuracy is ',accuracy_score(y_pred,y_test))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       0.93      1.00      0.96        13
 Iris-virginica       1.00      0.83      0.91         6

    avg / total       0.97      0.97      0.97        30

[[11  0  0]
 [ 0 13  1]
 [ 0  0  5]]
accuracy is 0.9666666666666667

For the theory behind bagging, see: https://www.cnblogs.com/pinard/p/6156009.html

7.16 AdaBoost (Adaptive Boosting) Classifier

import pandas as pd
# Imports needed for model evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.ensemble import AdaBoostClassifier
Model=AdaBoostClassifier()
Model.fit(X_train,y_train)
y_pred=Model.predict(X_test)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_pred,y_test))
#Accuracy Score
print('accuracy is ',accuracy_score(y_pred,y_test))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       0.93      1.00      0.96        13
 Iris-virginica       1.00      0.83      0.91         6

    avg / total       0.97      0.97      0.97        30

[[11  0  0]
 [ 0 13  1]
 [ 0  0  5]]
accuracy is 0.9666666666666667

For the theory behind AdaBoost, see: https://www.cnblogs.com/pinard/p/6133937.html

7.17 Gradient Boosting Classifier

import pandas as pd
# Imports needed for model evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.ensemble import GradientBoostingClassifier
Model=GradientBoostingClassifier()
Model.fit(X_train,y_train)
y_pred=Model.predict(X_test)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_pred,y_test))
#Accuracy Score
print('accuracy is ',accuracy_score(y_pred,y_test))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       0.93      1.00      0.96        13
 Iris-virginica       1.00      0.83      0.91         6

    avg / total       0.97      0.97      0.97        30

[[11  0  0]
 [ 0 13  1]
 [ 0  0  5]]
accuracy is 0.9666666666666667

For the theory behind gradient boosted trees (GBDT), see: https://www.cnblogs.com/pinard/p/6140514.html
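Because gradient boosting builds its ensemble one stage at a time, scikit-learn can replay the test predictions after each stage via `staged_predict`, which makes it cheap to see how many boosting rounds are actually needed. A sketch on the built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

model = GradientBoostingClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# staged_predict yields test predictions after each boosting stage,
# without retraining the model for every candidate n_estimators
staged_acc = [accuracy_score(y_test, p) for p in model.staged_predict(X_test)]
print('accuracy after 10 stages:', staged_acc[9])
print('accuracy after 100 stages:', staged_acc[-1])
```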

7.18 Linear Discriminant Analysis (LDA)

import pandas as pd
# Imports needed for model evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
Model=LinearDiscriminantAnalysis()
Model.fit(X_train,y_train)
y_pred=Model.predict(X_test)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_pred,y_test))
#Accuracy Score
print('accuracy is ',accuracy_score(y_pred,y_test))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      1.00      1.00        13
 Iris-virginica       1.00      1.00      1.00         6

    avg / total       1.00      1.00      1.00        30

[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]
accuracy is 1.0

For the theory behind linear discriminant analysis (LDA), see: https://www.cnblogs.com/pinard/p/6244265.html
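Besides classification, LDA doubles as a supervised dimensionality-reduction method: with three classes it can project the data onto at most two discriminant axes. A sketch on the built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
# With 3 classes, LDA yields at most (n_classes - 1) = 2 components
lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(iris.data, iris.target)
print('reduced shape:', X_2d.shape)
print('explained variance ratio:', lda.explained_variance_ratio_)
```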

7.19 Quadratic Discriminant Analysis (QDA)

import pandas as pd
# Imports needed for model evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset = pd.read_csv('../data/Iris.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
Model=QuadraticDiscriminantAnalysis()
Model.fit(X_train,y_train)
y_pred=Model.predict(X_test)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_pred,y_test))
#Accuracy Score
print('accuracy is ',accuracy_score(y_pred,y_test))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      1.00      1.00        13
 Iris-virginica       1.00      1.00      1.00         6

    avg / total       1.00      1.00      1.00        30

[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]
accuracy is 1.0

For the theory behind quadratic discriminant analysis (QDA), see: http://scikit-learn.org/stable/modules/lda_qda.html#dimensionality-reduction-using-linear-discriminant-analysis

8. Summary
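Rather than comparing the classifiers on one fixed 80/20 split, a fairer summary uses cross-validation. A hedged sketch comparing a few of the models from this workflow on the built-in iris data (the model selection and cv=5 are our own choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
models = {
    'GaussianNB': GaussianNB(),
    'SVC': SVC(),
    'DecisionTree': DecisionTreeClassifier(random_state=0),
    'LDA': LinearDiscriminantAnalysis(),
}
results = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy: mean +/- std across folds
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    results[name] = scores.mean()
    print('{}: {:.3f} (+/- {:.3f})'.format(name, scores.mean(), scores.std()))
```

Averaging over folds makes the 1.0 scores seen on the single split above look less definitive: on such a small test set (30 samples), one misclassification moves accuracy by over 3%.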

9. References

[1] Fisher R A. The use of multiple measurements in taxonomic problems[J]. Annals of Eugenics, 1936, 7(2): 179-188.

[2] Andrews D F. Plots of High-Dimensional Data[J]. Biometrics, 1972, 28(1):125-136.