Resources — Articles

Articles you might find interesting, listed alphabetically by title.

All that glitters is not gold: Comparing backtest and out-of-sample performance on a large cohort of trading algorithms, Dr. Thomas Wiecki, Andrew Campbell, Justin Lent, Dr. Jessica Stauth, Quantopian Inc.  

From the abstract: When automated trading strategies are developed and evaluated using backtests on historical pricing data, there exists a tendency to overfit to the past. Using a unique dataset of 888 algorithmic trading strategies developed and backtested on the Quantopian platform with at least 6 months of out-of-sample performance, we study the prevalence and impact of backtest overfitting. Specifically, we find that commonly reported backtest evaluation metrics like the Sharpe ratio offer little value in predicting out of sample performance (R² < 0.025).
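
To make the R² figure in the abstract concrete, here is a minimal sketch of how one might compute a Sharpe ratio per strategy and then measure how well backtest Sharpe predicts out-of-sample Sharpe. Everything in it is an assumption for illustration: the returns are synthetic random noise rather than the paper's dataset, the Sharpe ratio is annualized over 252 trading days with a zero risk-free rate, and the names sharpe_ratio, r_squared, and simulate_returns are hypothetical.

```python
import random
import statistics


def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    mean = statistics.fmean(returns)
    stdev = statistics.stdev(returns)  # sample standard deviation
    return (mean / stdev) * periods_per_year ** 0.5


def r_squared(x, y):
    """R² of a simple linear regression of y on x (squared Pearson correlation)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Synthetic stand-in data: 100 hypothetical strategies with no real edge.
rng = random.Random(42)


def simulate_returns(n_days):
    """Daily returns drawn from pure noise (mean 0, 1% daily volatility)."""
    return [rng.gauss(0.0, 0.01) for _ in range(n_days)]


backtest_sharpes = [sharpe_ratio(simulate_returns(504)) for _ in range(100)]
oos_sharpes = [sharpe_ratio(simulate_returns(126)) for _ in range(100)]

r2 = r_squared(backtest_sharpes, oos_sharpes)
print(f"R² of backtest Sharpe vs. out-of-sample Sharpe: {r2:.4f}")
```

Because the synthetic strategies have no edge, the backtest and out-of-sample Sharpe ratios are independent and the R² comes out close to zero. That is the situation an R² below 0.025 describes: the backtest metric explains almost none of the variation in live performance.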

Avoiding Common Mistakes with Time Series Analysis, Tom Fawcett.  Dr. Fawcett is a co-author (with Foster Provost) of Data Science for Business, one of the books we recommend.  The article offers valuable insight into working with time series.

Basics of Classifier Evaluation, Part 1 and Part 2, Tom Fawcett. 

Quoting Tom: “If it’s easy, it’s probably wrong,” “What’s wrong with accuracy?” and “Ranking is better than classifying.”

Pseudo-Mathematics and Financial Charlatanism:  The Effects of Backtest Overfitting on Out-of-Sample Performance, David H. Bailey, Jonathan M. Borwein, Marcos López de Prado, and Qiji Jim Zhu.

From the conclusions:  While the literature on regression overfitting is extensive, we believe that this is the first study to discuss the issue of overfitting on the subject of investment simulations (backtests) and its negative effect on OOS performance. 

Visualizing high-dimensional datasets using PCA and t-SNE in Python, Luuk Derksen.  

Excerpt from the post:  The first step around any data related challenge is to start by exploring the data itself. This could be by looking at, for example, the distributions of certain variables or looking at potential correlations between variables.

The problem nowadays is that most datasets have a large number of variables. In other words, they have a high number of dimensions along which the data is distributed. Visually exploring the data can then become challenging and most of the time even practically impossible to do manually. However, such visual exploration is incredibly important in any data-related problem. Therefore it is key to understand how to visualise high-dimensional datasets. This can be achieved using techniques known as dimensionality reduction. This post will focus on two techniques that will allow us to do this: PCA and t-SNE.
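
As a rough illustration of the dimensionality reduction idea in the excerpt, the sketch below projects a five-dimensional dataset onto its first two principal components so it could be drawn as a 2-D scatter plot. It is a dependency-free approximation using power iteration with deflation, not the code from the post; in practice one would normally reach for a library implementation such as scikit-learn's PCA. The function name pca_project and the synthetic data are assumptions, and t-SNE is not covered here.

```python
import random


def pca_project(rows, n_components=2, iters=200, seed=0):
    """Project rows onto their top principal components (power iteration sketch)."""
    n, d = len(rows), len(rows[0])

    # Center each column so the covariance matrix is computed around the mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        # Power iteration: repeatedly multiply by cov and renormalize.
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                         for i in range(d))
        components.append(v)
        # Deflation: remove the found direction so the next pass finds the next one.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]

    projected = [[sum(x[j] * c[j] for j in range(d)) for c in components]
                 for x in centered]
    return projected, components


# Synthetic example: two informative dimensions plus three low-variance noise dimensions.
data_rng = random.Random(1)
data = [[data_rng.gauss(0, 5), data_rng.gauss(0, 2)]
        + [data_rng.gauss(0, 0.1) for _ in range(3)]
        for _ in range(200)]

points_2d, components = pca_project(data)
print(points_2d[:3])  # first three points in the reduced 2-D space
```

Each row of points_2d is an (x, y) pair that can be passed to any plotting tool. The first coordinate captures the direction of greatest variance in the data and the second captures the next greatest, orthogonal to the first.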
