Machine Learning Tutorials

Iris Tutorial Part 1

This article is part of a series of articles and tutorials related to trading system development using the scientific method.

It is presented by Dr. Howard Bandy

The hosting website is www.blueowlpress.com

Load the tools and the data

In [1]:
#  The Iris Dataset has a long history, dating from 1936 when Edgar Anderson 
#  collected the data and published a paper describing it.
#  Also in 1936, Ronald Fisher used linear discriminant analysis to analyze 
#  the data and build a model of it.  
#  This link takes you to the Wikipedia page:
#  https://en.wikipedia.org/wiki/Iris_flower_data_set

#  The data consists of 150 data points -- 50 flowers from each of three species.
#  The three species are Iris-setosa, Iris-versicolor, and Iris-virginica.
#  There are four measurements from each flower:
#    sepal-length
#    sepal-width  
#    petal-length
#    petal-width
#  The species of each flower has been identified by an expert botanist.
#  This is an example of supervised learning.

#  A common theme in machine learning is to begin with inspection of the data.
#  To that end, this tutorial begins by showing the results of several metrics
#  of the data and several visual presentations of its properties.

#  Several visual libraries are illustrated.
#  The first set of plots is made using the plot methods of pandas (which wrap matplotlib),
#  then matplotlib directly (with the inline option), and finally seaborn.

#  As you will see in the charts displayed below,
#  one of the species is easily identified.  
#  We can draw a straight line separating it from the other two,
#  so that species is linearly separable.
#  The other two species have overlapping distributions and are not
#  linearly separable from each other.

#  This tutorial borrows heavily from two published tutorials:
#
#  Your First Machine Learning Project in Python Step-By-Step, by Dr. Jason Brownlee.
#  https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
#
#  Python Data Visualizations, by Ben Hamner, CTO of Kaggle.
#  https://www.kaggle.com/benhamner/python-data-visualizations
#
#  I highly recommend that you visit the websites of these two authors,
#  study their tutorials, subscribe to their publications, and buy their books.
#  Dr. Brownlee:  https://machinelearningmastery.com/
#  Ben Hamner:  https://www.kaggle.com/benhamner
In [2]:
#  Begin with a check of the software.

# The version of python
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy 
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy 
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib 
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas 
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn 
print('sklearn: {}'.format(sklearn.__version__))
Python: 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
scipy: 0.19.0
numpy: 1.12.1
matplotlib: 2.0.2
pandas: 0.20.1
sklearn: 0.18.1
In [3]:
#  Load the libraries
#  If you installed the Anaconda distribution, 
#  these were all installed at that time.

import pandas

from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

#  Enable inline plotting from matplotlib

%matplotlib inline

#  Set some styles for seaborn

sns.set(style="white", color_codes=True)
In [4]:
#  Load the dataset into a pandas dataframe named 'iris'
#  The dataset resides at the University of California - Irvine.
#  https://archive.ics.uci.edu/ml/datasets/iris
#  The UCI dataset repository has many datasets available for your use.
#  All have been curated and well studied.

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

#  Define the names of the variables as we want them

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'species']
iris = pandas.read_csv(url, names=names)
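
#  If the UCI URL is ever unreachable, an equivalent copy of the data ships
#  with scikit-learn.  This alternative load is not part of the original
#  tutorials; it builds a comparable dataframe (named iris_alt here to avoid
#  overwriting the download above).  The bundled species labels lack the
#  'Iris-' prefix, so it is added back.

#  from sklearn.datasets import load_iris
#  bunch = load_iris()
#  iris_alt = pandas.DataFrame(bunch.data, columns=names[:4])
#  iris_alt['species'] = ['Iris-' + s for s in bunch.target_names[bunch.target]]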

Examine the data

In [5]:
#  Identify the type of dataset
#  Expect it to be a pandas dataframe

print(type(iris))
<class 'pandas.core.frame.DataFrame'>
In [6]:
#  Check the shape of the data.
#  Expect it to be 150 rows and 5 columns.

print(iris.shape)
(150, 5)
In [7]:
#  Print the first 20 data points -- the head of the dataset

print(iris.head(20))
    sepal-length  sepal-width  petal-length  petal-width      species
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa
12           4.8          3.0           1.4          0.1  Iris-setosa
13           4.3          3.0           1.1          0.1  Iris-setosa
14           5.8          4.0           1.2          0.2  Iris-setosa
15           5.7          4.4           1.5          0.4  Iris-setosa
16           5.4          3.9           1.3          0.4  Iris-setosa
17           5.1          3.5           1.4          0.3  Iris-setosa
18           5.7          3.8           1.7          0.3  Iris-setosa
19           5.1          3.8           1.5          0.3  Iris-setosa
In [8]:
#  Use the describe function to describe some of the 
#  statistical properties of the data.

print(iris.describe())
       sepal-length  sepal-width  petal-length  petal-width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
In [9]:
#  Use the groupby method to determine the class distribution

print(iris.groupby('species').size())
species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
dtype: int64
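
#  An equivalent count using value_counts, plus a quick check for missing
#  values (a small addition to the borrowed tutorials).
#  Expect 50 rows per species and zero nulls in every column.

print(iris['species'].value_counts())
print(iris.isnull().sum())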

Visually inspect the data

In [10]:
#  Create and show box and whisker plots

#  Pass figsize directly to the pandas plot call,
#  which creates its own figure.
iris.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False, figsize=(8,4))
plt.show()
In [11]:
#  Create and show histograms
#  Each plot is of all 150 data elements

iris.hist()
plt.show()
In [12]:
#  Create and show a scatter plot matrix

scatter_matrix(iris)
plt.show()
In [13]:
#  Use the .plot method of pandas dataframes to make a scatterplot of two of the Iris features.

iris.plot(kind="scatter", x="sepal-length", y="sepal-width")
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d32fd90b38>
In [14]:
#  Use the seaborn library to make a similar plot
#  A seaborn jointplot shows bivariate scatterplots and univariate histograms in the same figure

sns.jointplot(x="sepal-length", y="sepal-width", data=iris, size=6)
Out[14]:
<seaborn.axisgrid.JointGrid at 0x1d33063e198>
In [15]:
#  Use seaborn's FacetGrid to color the scatterplot by species

sns.FacetGrid(iris, hue="species", size=6) \
   .map(plt.scatter, "sepal-length", "sepal-width") \
   .add_legend()
Out[15]:
<seaborn.axisgrid.FacetGrid at 0x1d3307b1e48>
In [16]:
#  We can look at an individual feature in seaborn using a boxplot

#  This plot shows that Iris-setosa can be separated from the other
#  two species by a horizontal line drawn at
#  a petal-length value of about 2.5

sns.boxplot(x="species", y="petal-length", data=iris)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d3308ef8d0>
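#  A quick numeric check of that claim (an addition to the borrowed
#  tutorials; the 2.5 threshold is read from the boxplot above).

setosa = iris[iris['species'] == 'Iris-setosa']
others = iris[iris['species'] != 'Iris-setosa']
print('max petal-length of Iris-setosa:   ', setosa['petal-length'].max())
print('min petal-length of other species: ', others['petal-length'].min())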
In [17]:
#  We can extend this plot by adding a layer of individual points on top of
#  it using seaborn's stripplot
#
#  Use jitter=True so that all the points don't fall in a single vertical line
#  above each species
#
#  Saving the resulting axes as ax each time causes the second plot to be drawn
#  on the same axes as the first

ax = sns.boxplot(x="species", y="petal-length", data=iris)
ax = sns.stripplot(x="species", y="petal-length", data=iris, jitter=True, edgecolor="gray")
In [18]:
#  A violin plot combines the benefits of the previous two plots and simplifies them
#  Denser regions of the data are drawn fatter, and sparser regions thinner

sns.violinplot(x="species", y="petal-length", data=iris, size=6)
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d32fee9748>
In [19]:
#  Another plot useful for looking at univariate relations is the kdeplot,
#  which creates and visualizes a kernel density estimate of the underlying feature

#  We again see that Iris-setosa is linearly separable from the other two species,
#  but Iris-versicolor and Iris-virginica overlap in petal-length.

sns.FacetGrid(iris, hue="species", size=6) \
   .map(sns.kdeplot, "petal-length") \
   .add_legend()
Out[19]:
<seaborn.axisgrid.FacetGrid at 0x1d3300f87f0>
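#  To put a number on that overlap, show the petal-length range of each
#  species (a small addition; the ranges come straight from the data).

print(iris.groupby('species')['petal-length'].agg(['min', 'max']))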
In [20]:
#  This is the seaborn pairplot, which shows the bivariate relation
#  between each pair of features
# 
#  From the pairplot, we can see that the Iris-setosa species is separated from the other
#  two across all feature combinations

sns.pairplot(iris, hue="species", size=3)
Out[20]:
<seaborn.axisgrid.PairGrid at 0x1d32fdade48>
In [21]:
#  Seaborn computes and plots a regression for each pair of attributes,
#  all species combined

sns.pairplot(iris, kind="reg", size=3)
Out[21]:
<seaborn.axisgrid.PairGrid at 0x1d332724be0>
In [22]:
#  Or one regression for each species

sns.pairplot(iris, hue="species", kind="reg", size=3)
Out[22]:
<seaborn.axisgrid.PairGrid at 0x1d330ea5c18>