Machine Learning Models

We are performing supervised learning.  Every data point has a target. 

For the training phase, the target is known, and the machine learning libraries find the best fit of the equations that relate the predictor variables to the target variable.  That model is saved.  For the validation phase, a set of data that is both more recent than the training data and has not been previously used is passed to the model.  The model returns the predicted value of the target for each data point.  Each predicted value is compared with the known value, giving a score for the accuracy of the model.
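The two phases can be sketched in a few lines.  This is only an illustration under stated assumptions: the data values are made up, the split point is arbitrary, and a one-variable threshold rule (the hypothetical fit_threshold function) stands in for the model that a real machine learning library would fit.

```python
import pickle

# Made-up data, oldest first: (predictor value, known target).
data = [(0.2, 0), (0.9, 1), (0.4, 0), (0.8, 1), (0.1, 0),
        (0.7, 1), (0.3, 0), (0.6, 1), (0.95, 1), (0.15, 0)]

# The older rows are used for training, the more recent rows for validation.
split = 6
train, validate = data[:split], data[split:]

def fit_threshold(rows):
    """Stand-in for a library fit: choose the cut-off with the best training accuracy."""
    def accuracy(cut):
        return sum((x >= cut) == bool(y) for x, y in rows) / len(rows)
    candidates = sorted(x for x, _ in rows)
    return max(candidates, key=accuracy)

# Training phase: fit the model, then save it.
threshold = fit_threshold(train)
saved_model = pickle.dumps(threshold)

# Validation phase: reload the model and score it on data it has not seen.
model = pickle.loads(saved_model)
predictions = [int(x >= model) for x, _ in validate]
score = sum(p == y for p, (_, y) in zip(predictions, validate)) / len(validate)
print("validation accuracy:", score)
```

In this toy example the rule fits the training rows perfectly but misses one of the four validation rows, which is why the score is computed on data that was held back.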

Using daily data and predicting whether tomorrow’s close will be higher than today’s close, we have a target for every day.  The model will provide a prediction for every day.  We compute results and mark-to-market every day.  We have an opportunity to adjust our position every day.  Rather than looking for impulse signals to buy and sell, we are looking for state signals to beLong or beFlat.  Machine learning fits well with trading system development and trading management techniques, and with the metrics safe-f and CAR25.
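As a rough sketch of how such a daily target could be built, assuming a list of made-up closing prices ordered oldest first (the function name next_day_up_targets is hypothetical):

```python
# Made-up daily closing prices, oldest first.
closes = [100.0, 101.5, 101.0, 102.2, 102.2, 103.0]

def next_day_up_targets(closes):
    """Return 1 where tomorrow's close is higher than today's close, else 0.

    The most recent day gets no target, because tomorrow's close is not yet known.
    """
    return [1 if tomorrow > today else 0
            for today, tomorrow in zip(closes, closes[1:])]

targets = next_day_up_targets(closes)

# The state a perfectly informed model would signal for each day.
ideal_states = ["beLong" if t == 1 else "beFlat" for t in targets]

for close, state in zip(closes, ideal_states):
    print(close, state)
```

Note that an unchanged close counts as not higher, so that day's ideal state is beFlat.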

Machine learning is, in general, the solution of a complex set of simultaneous equations.  We need a little terminology, some background, and perhaps a gentle introduction before continuing to trading systems.

Introduction

We begin with a simple system, well known in the machine learning literature as the Iris Data Set.  This data set, along with many other useful data sets, is hosted at the University of California, Irvine.  Tutorials for using the iris data set abound.  The models we will build have three characteristics.  They are:

  • supervised (compare with unsupervised)
  • classification (compare with regression)
  • stationary (compare with time series)

Models with these characteristics are used in applications such as credit scoring, plant identification, prediction of business success, spam email identification, and disease diagnosis.

The iris data set is used to build a model for plant identification.  An expert botanist examined 150 iris plants, 50 each of three species:  Iris-Setosa, Iris-Versicolor, and Iris-Virginica.  She prepared a note card for each plant with the following information:  species name, sepal length, sepal width, petal length, and petal width, with all measurements in cm.  She now gives you the deck of note cards and asks you to develop a model that will allow less experienced botanists to identify the species of an unknown iris, given these four measurements.

Each note card is a data point, an example, an instance.  Each card has five data fields.  The known species name is the target; the four measurements are the features, the independent variables, the predictors.
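One way to picture a note card in code is as a record with five fields, which is then separated into features and target.  This is a minimal sketch: the NoteCard class and split_card function are hypothetical names, and the three cards use illustrative measurements.

```python
from typing import NamedTuple

class NoteCard(NamedTuple):
    """One data point: the target plus four features (measurements in cm)."""
    species: str
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float

# Three illustrative cards, one per species.
deck = [
    NoteCard("Iris-Setosa", 5.1, 3.5, 1.4, 0.2),
    NoteCard("Iris-Versicolor", 7.0, 3.2, 4.7, 1.4),
    NoteCard("Iris-Virginica", 6.3, 3.3, 6.0, 2.5),
]

def split_card(card):
    """Separate a card into its feature list and its target."""
    features = [card.sepal_length, card.sepal_width,
                card.petal_length, card.petal_width]
    return features, card.species

X = [split_card(card)[0] for card in deck]   # features, one row per card
y = [split_card(card)[1] for card in deck]   # targets, one per card
print(X[0], y[0])
```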

The data is stationary.  We do not know the order in which the plants were identified by the expert, and we assume that it would not improve our model if we knew.  The order of cards in the deck is not predictive.  Importantly, we will make use of the stationarity in developing the model — we will shuffle the deck.
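Shuffling the deck before dividing it might look like the sketch below.  The integers stand in for the 150 note cards, and the seed and the 80/20 split are arbitrary choices made for illustration, not values taken from the text.

```python
import random

# Stand-ins for the 150 note cards, in the order the deck was handed over.
deck = list(range(150))

# A fixed seed (an arbitrary choice) makes the shuffle repeatable.
rng = random.Random(42)
shuffled = deck[:]          # copy, so the original order is left untouched
rng.shuffle(shuffled)

# Deal the shuffled deck into a training pile and a test pile (80/20 here).
n_train = len(shuffled) * 4 // 5
train_cards = shuffled[:n_train]
test_cards = shuffled[n_train:]

print(len(train_cards), "training cards,", len(test_cards), "test cards")
```

Shuffling is acceptable here only because the order of the cards carries no information.  With time series data, such as daily prices, the order matters and the more recent data is held back for validation instead.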

There are three categories or classes.  If we want the model to identify which species an unknown belongs to, we can have the model give one of three categories as its output — Setosa, Versicolor, or Virginica.  If we are primarily interested in one species, say Virginica, we can have the model give one of two categories as its output — Virginica or not Virginica.
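The choice between three categories and two is just a relabeling of the target.  A small sketch, using a made-up handful of species labels (the function name to_binary is hypothetical):

```python
# A made-up handful of targets, as they would appear on the note cards.
species = ["Iris-Setosa", "Iris-Versicolor", "Iris-Virginica",
           "Iris-Virginica", "Iris-Setosa"]

def to_binary(label, species_of_interest="Iris-Virginica"):
    """Collapse the three-class target into Virginica / not Virginica."""
    return "Virginica" if label == species_of_interest else "not Virginica"

three_class_targets = species                        # Setosa, Versicolor, or Virginica
two_class_targets = [to_binary(s) for s in species]  # Virginica or not Virginica

print(sorted(set(three_class_targets)))
print(two_class_targets)
```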

We will be using Python Version 3, along with several libraries.  I recommend the Anaconda distribution, which is free and is available for Windows, Mac, and Linux.  The download contains two environments — Spyder and Jupyter. 

Getting started with Spyder. 

Spyder is installed automatically with Anaconda. 

Before the very first use of Spyder, pull down the Anaconda menu, right-click the Spyder icon (the web), click More to show more options, and click “Pin to taskbar.”  The web icon will be added to the taskbar at the bottom of the Windows screen.

From then forward, click the icon to open Spyder.  It will remember the files you had loaded and load them.  It will remember the file you had open and open it in the left pane.  Editing is straightforward and works as you expect it to.  Click the green arrow at the top to run the current program.  The file is automatically saved before execution.  Output appears in the pane in the lower right.

Getting started with Jupyter.     

This is a link to the official website of the Jupyter project.  Jupyter was installed when you installed Anaconda, so you can skip the installation page.  But check the documentation and nbviewer pages for information and examples.

This link leads to a read-the-docs site that explains how to install and use Jupyter notebooks.

The DataCamp website has a tutorial that you might find useful.  They also have several cheat sheets related to Python, Jupyter, and other libraries you will use.

Next:

Using the Iris Data in a Jupyter notebook
