Calculated Risks, Gerd Gigerenzer, 2002. Do not be put off by the publication date: this wonderful book is as applicable today as it was in 2002. In a very readable style, without complex mathematics, Dr. Gigerenzer explains many of the concepts we use when we analyze our trading systems: the illusion of certainty, false positives, false negatives, probability, frequency, and the evaluation of risk. It is a nice complement to Dr. Kahneman's book.

The authors explain that their coverage of predictive modeling includes machine learning, pattern recognition, and data mining, and expands into a broader guide to the process of developing models and quantifying their predictive accuracy.

A major theme throughout the book is detection of overfitting. Techniques to manage overfitting are discussed in detail. These include data preprocessing, normalization, standardization, transformation of distributions, feature selection, train-test split, cross validation, goodness of fit, and error metrics.
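To make one of these safeguards concrete, a k-fold cross-validation loop can be sketched in plain Python. This is an illustrative sketch, not code from the book; the fold-splitting scheme and the `fit`/`score` function signatures are my own assumptions.

```python
# Minimal k-fold cross-validation sketch: split the data into k roughly
# equal folds, train on k-1 folds, score on the held-out fold, and
# average the out-of-fold scores.

def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) pairs for k roughly equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def cross_validate(fit, score, xs, ys, k=5):
    """Average the held-out-fold score of a model over k folds."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(scores) / len(scores)
```

Because every score is computed on data the model never saw during fitting, the averaged score is a far more honest estimate of out-of-sample performance than in-sample goodness of fit.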

Linear and non-linear models are described, with detailed examples of use with actual data.

The illustrations are superb. Fully disclosed code in R is included.

This is a very readable handbook that I highly recommend to everyone developing predictive models.

Data Science for Business, Foster Provost & Tom Fawcett, 2013. This book was developed by professor Provost over several years as he taught these concepts at NYU’s Stern School of Business. Tom Fawcett is a principal data scientist at Data Scientists LLC in Mountain View, California. The book is widely used — it is the primary text for data science courses in well over 100 university programs around the world.

From the book’s preface: “This book does not presume a sophisticated mathematical background. However, by its very nature the material is somewhat technical – the goal is to impart a significant understanding of data science, not just to give a high-level overview. In general, we have tried to minimize the mathematics and make the exposition as ‘conceptual’ as possible.”

The book focuses on classification models. Descriptions and explanations of techniques are accompanied by clear and illustrative diagrams. There are a few equations. There is no computer code in the book, but course materials available from the authors' website use Python, scikit-learn, and Jupyter notebooks.

Python for Data Analysis (Second Edition), Wes McKinney, 2018. Wes is the developer of Pandas and has been described as “the man behind the most important tool in data science.” Pandas extends the data analysis capabilities of Python beyond those provided by NumPy and SciPy, with a set of data structures and functions designed to handle financial time series data. Wes developed Pandas while a quantitative analyst at AQR, Cliff Asness's $200 billion hedge fund, and now works for Two Sigma, a quantitative hedge fund with $50 billion under management. His bio is interesting.

Python Data Science Handbook, Jake VanderPlas, 2017. Jake is a regular presenter at Python and machine learning conferences. This book covers IPython (now Jupyter Notebook), Pandas, matplotlib, and machine learning. I highly recommend anything by Dr. VanderPlas.

Data Science from Scratch, Joel Grus, 2015. This book presents a clear explanation of some of the concepts central to data science. It begins with “A crash course in Python,” provides a quick review of linear algebra and the Python data structures needed, and covers both frequentist and Bayesian statistics.

Chapter 8 begins to get into data science with a description of the gradient descent method of finding the set of parameters that minimize an objective (error) function. The “from scratch” approach shows all the details.
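The idea is easy to sketch in a few lines of Python. This is a minimal illustration in the book's from-scratch spirit, not Grus's actual code; the quadratic objective, learning rate, and stopping rule are illustrative choices.

```python
# Minimal gradient descent: repeatedly step against the gradient of the
# objective until the steps become negligibly small.

def gradient_descent(grad, start, learning_rate=0.1, tolerance=1e-8, max_iter=10_000):
    """Minimize a one-dimensional objective given its derivative `grad`."""
    x = start
    for _ in range(max_iter):
        step = learning_rate * grad(x)
        x -= step                      # move downhill
        if abs(step) < tolerance:      # stop when progress stalls
            break
    return x

# Example: minimize f(x) = (x - 3)^2, whose derivative is 2 * (x - 3).
# The minimizer converges toward x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
```

The same loop maximizes an objective if the sign of the step is flipped (gradient ascent), which is the form used when a likelihood is being maximized rather than an error minimized.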

Chapter 10 covers methods for exploring the data: examining the distribution, plotting, normalizing, rescaling, and dimensionality reduction.
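The two most common rescaling operations take only a few lines each. These sketches are mine, not the book's; they use only the standard library.

```python
# Two standard preprocessing transforms, written from scratch:
# min-max rescaling to [0, 1], and standardization to z-scores.

from statistics import mean, stdev

def rescale(values):
    """Min-max rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift and scale to zero mean and unit standard deviation (z-scores)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]
```

Rescaling matters for any technique driven by distances between points (k-nearest neighbors, for example), where a feature measured in large units would otherwise dominate the others.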

Chapter 11 takes up machine learning proper: models, overfitting, underfitting, the bias-variance tradeoff, and feature extraction.

The book continues with explanations of k-nearest neighbors, the curse of dimensionality, naive Bayes, linear regression, multiple regression, logistic regression, measures of goodness of fit, decision trees, and neural networks. Random forests, one of the ensemble techniques, are implemented in surprisingly concise code.
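In the same from-scratch spirit, a complete k-nearest-neighbors classifier fits in a handful of lines. This sketch is illustrative rather than the book's exact code; it assumes Euclidean distance and a simple majority vote.

```python
# From-scratch k-nearest-neighbors classifier: find the k labeled points
# closest to the query point and return the most common label among them.

from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_classify(k, labeled_points, new_point):
    """labeled_points is a list of (point, label) pairs; returns the
    majority label among the k points nearest to new_point."""
    by_distance = sorted(labeled_points, key=lambda pl: dist(pl[0], new_point))
    k_nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(k_nearest_labels).most_common(1)[0][0]
```

The brevity is the point the book makes: with the core logic exposed like this, effects such as the curse of dimensionality (distances becoming uninformative as dimensions grow) are easy to investigate directly.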

While this is not the best book for learning Python, machine learning, or model development, it is valuable in explaining each of these topics with fully disclosed logic and computer code.

Python: Deeper Insights into Machine Learning (Second Edition with expanded material), Sebastian Raschka, David Julian, John Hearty, 2017. Sebastian is at the forefront of machine learning using Python. This book gets well into feature engineering, which is the special sauce of machine learning.

Automate the Boring Stuff with Python, Al Sweigart, 2015. Free under Creative Commons. Also available at Amazon.

Use this book for a nicely paced self-study introduction to Python, or as a textbook to accompany the Udemy course.