Modeling Intent in R and/or Python

Learning or experimenting with Tidytext has been on my radar for at least a few years. Only recently did I have a need to pick it up. As with most learning, it leads you down paths of more knowledge (read: rabbit holes) than you foresaw. This post is a hat-tip to the resources I used, knitting them together in a sample use case with an extension using parallel processing for the R implementation.

First mention must go to Manuel Amunategui for his post on Intent Modeling, which includes a link to his Python code. After making a few changes I had his code running happily; you can find my working notebook here. I used Windows and VSCode for the Python implementation. The general logic Manuel uses for modeling intent is to:

  1. Tag the transcripts to identify the most common verbs and nouns
  2. Cluster the observations based on the presence of these key verbs and key nouns
  3. Conduct n-gram analysis on the clusters to identify patterns of key word sequences
  4. Use the keyword sequences to identify the relevant transcripts worth reviewing more closely to understand intent.

It is such a simple yet powerful approach, made possible by the spaCy package in Python for parts-of-speech tagging and of course the nltk package for n-gram analysis. For the R implementation I used spacyr and Tidytext.
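To make the n-gram step concrete, here is a toy, stdlib-only Python sketch of counting 4-grams. The tokens and counts here are invented for illustration; the real pipeline applies nltk (or Tidytext) to the actual transcripts.

```python
from collections import Counter

def ngrams(tokens, n):
    # slide a window of length n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# toy "transcript", already tokenized; invented for illustration
tokens = "i am a victim of identity theft and i am a victim of fraud".split()
counts = Counter(ngrams(tokens, 4))
counts.most_common(1)  # → [(('i', 'am', 'a', 'victim'), 2)]
```

The same `Counter` pattern, applied per cluster, yields the frequency tables used later in the post.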

The Data

Manuel uses the Consumer Complaints Database from the Bureau of Consumer Financial Protection, which looks like a great source of open data for experimenting with models. There are over 2.4 million chat transcripts. For the purposes of our analysis we look at just 200,000 of them, the number Manuel uses in his original notebook.

What about R?

So Manuel’s Python code runs swimmingly after a few minor adjustments for my environment, which is great and hat-tip to Manuel. However, when I initially set out on my intent modeling journey using the approach Manuel describes, I did so in R. For parts-of-speech tagging I used the spacyr package, and to speed up the tagging I wrapped it in a foreach “loop” using all my cores. In addition to identifying key verbs and key nouns, I used noun phrases as well. What are noun phrases? From the spaCy docs:

Noun chunks are “base noun phrases” – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, “the lavish green grass” or “the world’s largest tech fund”.

The reason for using noun phrases was to increase the information available when tokenizing the transcripts and computing n-gram analyses. For example, the noun phrase “a legitimate claim” becomes “a_legitimate_claim”, and therefore a single token, when joined with underscores. A tri-gram analysis where one or more of the tokens is a noun phrase combination should theoretically hold more information than one without. The tradeoff, of course, is a lower count of the individual words making up the noun phrase. Whether this tradeoff makes sense will depend on the use case. Let’s look at some results:
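As a toy illustration of the underscore-joining idea (the phrase list and sentence here are invented; in practice spacyr's noun-chunk output supplies the phrases):

```python
def tokenize_with_phrases(text, phrases):
    # collapse each known noun phrase into a single underscored token, then split
    for p in phrases:
        text = text.replace(p, p.replace(" ", "_"))
    return text.split()

tokens = tokenize_with_phrases("they denied a legitimate claim", ["a legitimate claim"])
# → ['they', 'denied', 'a_legitimate_claim']
```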

We see the top 20 verbs, nouns and noun phrases.

Top 20 verbs
Top 20 nouns
Top 20 noun phrases

Note the masked personal data. That is merely for Data Loss Prevention (DLP) purposes. It would be appropriate to remove those tokens, but for this toy example I felt it beneficial to point out the nuance.

As Manuel points out, the threshold for determining which verbs, nouns and, in our case, noun phrases to include as keywords will depend on your data. We use a threshold of 10k occurrences. There are 115 verbs, 140 nouns and 74 noun phrases which exceed this threshold across the 200k chat transcripts in our sample. We combine these keywords and one-hot encode the presence of each keyword in each of the 200k samples. Below is a snippet of the resulting matrix:
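The one-hot encoding itself is simple; a minimal Python sketch (the keyword list here is a made-up subset, the real one being the 329 keywords above):

```python
keywords = ["call", "credit", "a_legitimate_claim"]  # illustrative subset

def one_hot(tokens, keywords):
    # 1 if the keyword appears anywhere in the transcript's tokens, else 0
    present = set(tokens)
    return [1 if k in present else 0 for k in keywords]

row = one_hot(["they", "denied", "a_legitimate_claim"], keywords)
# → [0, 0, 1]
```

Stacking one such row per transcript gives the 200k-row binary matrix shown in the snippet.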

one-hot encoded keyword matrix

At this point we can employ a `kmeans` clustering algorithm to cluster the observations based on the presence of the keywords. We follow the method used by Manuel and specify 50 centers. After replacing the noun phrases in the original text with their underscored equivalents, we can perform some n-gram analysis to understand the frequency of particular word sequences.
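In R this is a one-liner with `kmeans(mat, centers = 50)`; for intuition, here is a bare-bones k-means sketch in Python operating on rows like those of the one-hot matrix (toy data and two centres instead of 50):

```python
import random

def kmeans(rows, k, iters=10, seed=42):
    # assign each row to its nearest centre, then move each centre
    # to the mean of its members; repeat
    rnd = random.Random(seed)
    centres = [list(r) for r in rnd.sample(rows, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for r in rows:
            dists = [sum((a - b) ** 2 for a, b in zip(r, c)) for c in centres]
            clusters[dists.index(min(dists))].append(r)
        for i, members in enumerate(clusters):
            if members:  # leave an empty cluster's centre where it was
                centres[i] = [sum(col) / len(members) for col in zip(*members)]
    return clusters

clusters = kmeans([[1, 0], [1, 0], [0, 1], [0, 1]], k=2)
```

This is purely illustrative; the actual analysis uses base R's `kmeans` on the full keyword matrix.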

Of course, we perform the n-gram analysis based on the clusters. Looking at the number of observations per cluster it appears the 15th cluster has the most observations. We will use that cluster for illustrative purposes.

observations per cluster

We see the most frequent 4-gram in cluster 15 has to do with people complaining about identity theft. It should also be clear how tokenizing noun phrases gives us additional information…instead of a 4-gram of only 4 words, we see 5 words, thanks to spaCy and noun chunks.

Top 4-grams in cluster 15

At this point, it’s possible to tie the n-grams back to the actual complaints…and review them for a better sense of the issues related to identity theft.

Complaints including the term “a victim of identity theft”

Well, that’s it. I hope you enjoyed reading. All resources are below, including links to my code. Cheers!



A Personal Portfolio Allocation Approach

My experience in financial markets to date has mostly been in trading and investment banking: executing index arbitrage and various other strategies, and product managing a team building execution algorithms which automate strategies that minimize market impact and make relative and conditional trading decisions. More recently, following a move from South Africa to Canada, I have had the opportunity to cut my teeth on the importance of smart order routing, including latency optimization for minimizing latency arbitrage and information leakage. In my spare time I contribute to various open source packages in the R/Finance ecosystem, mostly quantstrat and blotter, which are amazingly powerful packages for building and testing quantitative strategies.

None of the above skills transfer directly to a portfolio optimization problem. So why bother with portfolio optimization from a personal perspective? A few reasons. Firstly, an opportunity to learn and use some of the portfolio analytics functionality available as open source in various R/Finance packages. Secondly, an opportunity to apply institutional techniques to a personal portfolio. This is important: as someone whose day job does not involve staring at screens of asset prices (those were the days), I simply do not have the time to follow the daily ebb and flow of the markets, nor to pick stocks with much due diligence. Of course, stock picking is a net loser’s game for the majority, albeit with entertainment value for those who enjoy the gyrations of their portfolio’s net equity. The third reason, related to the second, is that I need a process I can apply to my disposable income, ideally one with less volatility than your average meme-stock basket.

Below is a view of a portfolio I started towards the end of last year. Whilst the cumulative time-weighted return is outperforming my rudimentary S&P/TSX Composite Index benchmark, that was not the case during parts of May and June. Fortunately the portfolio recovered, helped by some speculative plays in BlackBerry and by averaging into some technology names that took an absolute beating despite prospects that were unchanged, or even improving, each quarter. Nevertheless, it was a time of introspection and an acknowledgement that I needed a process for at least part of my portfolio, the non-stock-picking allocation part. With a diligent portfolio optimization process I should be able to reduce the risk in my portfolio, with relatively infrequent rebalancing requirements, which will help minimize execution fees.

This blog post documents my very recently conceived process for future reference, to share with others who may find it valuable, and of course to receive feedback and/or criticism where the process is flawed or can be improved.

Investable Universe

The starting point for any portfolio allocation process is to select a range of possible investments. This universe will be largely dependent on your objectives (preserve capital, exceed inflation, outperform S&P500 in absolute or risk-adjusted terms, minimize drawdowns, etc) and constraints (market access, capital available, etc). For a Global Tactical Asset Allocation (GTAA) approach to selecting your investment universe, there are several starting points you could consider. One of the seminal papers on the topic is Meb Faber’s A Quantitative Approach to Tactical Asset Allocation. Meb starts with the following assets:

US Large Cap (S&P 500), Foreign Developed (MSCI EAFE), US 10-Year Government Bonds, Commodities (Goldman Sachs Commodity Index), Real Estate Investment Trusts (NAREIT Index).

In a paper published by the team at ReSolve Asset Management, Adaptive Asset Allocation: A Primer, the authors use 10 asset classes spanning U.S. stocks, European stocks, Japanese stocks, emerging market stocks, U.S. REITs, international REITs, U.S. 7-10 year Treasuries, U.S. 20+ year Treasuries, commodities and gold.

For my purposes, considering this is currently a pet project for a tiny portfolio allocation, I figured I would limit my range of possible investments to the iShares ETFs offered by BlackRock and select the top-ranked securities per major asset class using trailing 6-month returns. As at 14 July 2021 there were 394 possible ETFs. When downloading historical data for these names (more on the data in a bit) I was unable to download data for 15 of the 394 ETFs. The breakdown of the remaining ETFs by asset class using Net Assets is below.

Distribution of Net Assets across Asset Class – iShares ETFs

So we have a healthy mix of Asset Classes represented in the range of iShares ETFs offered by BlackRock. Clearly Net Assets are skewed towards Equity and Fixed Income. This is not surprising. What about the count of ETFs offered per asset class above?

Distribution of Count of ETFs across Asset Class – iShares ETFs

As expected, the number of ETFs across asset classes is slightly less skewed than when looking at Net Assets.

The next question I asked myself was how the absolute value of Net Assets was distributed across each of the iShares ETFs.

Distribution of Net Assets by ETF – iShares ETFs

The notable outlier at almost 300bn USD in Net Assets is the iShares Core S&P 500 ETF (not observable in the plot, but in an interactive session hovering over the bar with plotly will highlight the name of the ETF). This is not surprising, and highlights why securities can move like they do when it is announced they will be added to the S&P 500 Index. Tesla and Moderna are recent examples.

Ok, with this information it is clear there are many fringe ETFs with low Net Assets. At this point I chose a Net Assets threshold of 5bn USD: any ETF below it would be automatically excluded from my investment universe. With that filter the investment universe reduces to 84 ETFs. When deciding on an appropriate USD cutoff I looked at a scatterplot of Net Assets against Days Since Inception. Clearly, the longer an ETF has been around, the more opportunity it has had to receive investments. Of course there is also more opportunity for drawdowns and redemptions to reduce Net Assets, but on the whole there was a positive relationship between Net Assets and Days Since Inception. I excluded ETFs with Net Assets > 50bn USD from this plot for better visualization.

Net Assets to Days Since Inception, fitted with the slope from a linear regression

There are several ETFs above the threshold which have existed for at least 4-5 years, so I am happy with this rudimentary cutoff. I will revisit it in future iterations of my process. Ultimately it is the liquidity of the underlying ETFs and their constituents that matters most, but for now we will use an ETF’s Net Assets as a proxy.

So, with the remaining list of 84 ETFs what is the distribution across Asset Class, Sub Asset Class, Region, Market and Investment Style?

Asset Class pie – 84 iShares ETFs with Net Assets > 5bn USD
Sub Asset Class pie – 84 iShares ETFs with Net Assets > 5bn USD
Region pie – 84 iShares ETFs with Net Assets > 5bn USD
Market pie – 84 iShares ETFs with Net Assets > 5bn USD
Investment Style pie – 84 iShares ETFs with Net Assets > 5bn USD

It’s clear the remaining 84 ETFs are nicely distributed across Asset Class, Region, Market etc. The sole active ETF is the BlackRock Ultra Short-Term Bond ETF. At the right price, there is a market for anything.


Most of the data for the above plots was taken from the product screener downloadable from the iShares website. For the portfolio optimization process, which required daily returns going as far back as possible, I used the IBrokers package in R. With relatively little effort I could configure my local instance of Trader Workstation (TWS) to accept API connections. Once I signed up for the relevant market data subscriptions with Interactive Brokers (1 USD per month for the Cboe One Add-On Bundle, waived if monthly commissions reach 5 USD, and 10 USD per month for the US Securities Snapshot and Futures Value Bundle, waived if monthly commissions reach 30 USD) I was able to download historical bar data, as well as real-time quote and bar data, for all the iShares ETFs.

In a previous call to download data for all 394 ETFs, 15 failed, hence my starting point of the remaining 379 ETFs. Given the 84 ETFs with Net Assets greater than 5bn USD, my next task was to download monthly bar data from which to compute my momentum filter, using 6-month returns. Before doing that, however, I downloaded a snapshot of real-time quotes for all 84 ETFs just to get an idea of the spreads (in basis points) in the market at the time. That chart is below.

iShares ETF spreads – 21 July 14:30

These spreads appear pretty narrow, which bodes well for trading purposes and serves as another proxy for liquidity. A closer look at the tick size of each spread showed that most were as narrow as they could be given the tick size constraint of one penny.
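For reference, the spread-in-basis-points calculation is just the quoted spread over the midpoint (the quote here is illustrative; the real quotes came from the IBrokers snapshot):

```python
def spread_bps(bid, ask):
    # quoted spread relative to the midpoint, in basis points
    mid = (bid + ask) / 2
    return (ask - bid) / mid * 1e4

# a one-penny-wide quote on a ~$100 ETF is about 1bp
round(spread_bps(99.995, 100.005), 2)  # → 1.0
```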

Momentum Filter

After downloading bar data for the remaining 84 ETFs, I was able to compute the 6-month return per asset class. I did this somewhat crudely, merely converting the monthly prices to quarterly returns (using periodReturn from the quantmod package) and summing the last 2 observations. Below are bar plots of the distribution of returns per Asset Class for the remaining 84 ETFs.
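A stdlib sketch of that crude 6-month return, as a stand-in for quantmod's periodReturn (taking every third monthly price as the quarterly series is my assumption, not the post's actual code):

```python
def six_month_return(monthly_prices):
    # quarterly simple returns from every 3rd monthly price,
    # then sum the last two observations (crude, as described above)
    quarterly = monthly_prices[::3]
    rets = [quarterly[i] / quarterly[i - 1] - 1 for i in range(1, len(quarterly))]
    return rets[-1] + rets[-2]

six_month_return([100, 101, 102, 110, 112, 115, 121])  # two 10% quarters → 0.2
```

Summing two simple quarterly returns ignores compounding, which is part of what makes it crude.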

Equity ETFs – 6m return
Fixed Income ETFs – 6m return
Commodity ETFs – 6m return
Real Estate ETF – 6m return

Now of course, momentum, both in time series and cross-sectionally, is a well-studied and well-documented factor. My choice of 6 months is based on what I have seen in the literature. Future iterations may look at alternate horizons and alternate metrics for computing a momentum ranking. For now it will suffice.

We notice in the above plots how pervasive Equity ETFs are, followed by Fixed Income ETFs, then 2 Commodity ETFs (Gold and Silver) and the sole Real Estate ETF, the iShares US Real Estate ETF. For future iterations I will need to look at a broader representation of these Asset Classes, likely by reducing or even eliminating my 5bn USD constraint for starters. Again, for now…it will suffice. The goal is to get to portfolio optimization using the Hierarchical Risk Parity technique documented by Marcos Lopez de Prado.

With the above information we will select a single security from each Asset Class to start, choosing the ETF with the highest return in each. In some papers the authors discard observations below the 10th and above the 90th percentile in order to minimize potential idiosyncratic risk (thanks Brian for the reminder). For ETFs this is less of a concern.
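The selection step is then just a max-by-group; a hypothetical sketch (tickers and returns invented for illustration):

```python
def top_by_class(etfs):
    # etfs: (ticker, asset_class, 6m_return) tuples; keep the best performer per class
    best = {}
    for ticker, asset_class, ret in etfs:
        if asset_class not in best or ret > best[asset_class][1]:
            best[asset_class] = (ticker, ret)
    return {ac: ticker for ac, (ticker, _) in best.items()}

top_by_class([("IGV", "Equity", 0.12), ("IVV", "Equity", 0.10),
              ("TLT", "Fixed Income", 0.03), ("IAU", "Commodity", 0.05)])
# → {'Equity': 'IGV', 'Fixed Income': 'TLT', 'Commodity': 'IAU'}
```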

Hierarchical Risk Parity

Right, given a list of prospective investments, how do I determine the weights to assign to each? There are many approaches one can take here, from quadratic optimization using a mean-variance approach to global solvers using particle swarm optimization or genetic algorithms. For a thorough implementation with functionality to specify multiple objectives and constraints, using an open source package in the R/Finance ecosystem, you can refer to the introductory vignette for the PortfolioAnalytics package.

However, in Building Diversified Portfolios that Outperform Out-of-Sample, Marcos Lopez de Prado suggests an approach which is meant to outperform the more familiar ones. At this point I was not going to test different approaches, but rather learn the particular approach suggested by Marcos. In future iterations I will definitely be doing portfolio backtesting and comparisons across multiple portfolio optimization techniques and objectives. In the paper he illustrates, with an example and code (don’t you just love papers that are published with reproducible results and accompanying code), the idea of a Hierarchical Risk Parity approach to optimizing a given portfolio. Gautier Marti, who now works with Marcos, gives a slightly more refined version in his blog post, with code ready to run inside a Jupyter Notebook. Gautier’s code is in Python 3 and Marcos’ is in Python 2. To get Marcos’ code running in Python 3 you will need to edit a few lines…in particular, replace xrange with range and use // instead of a single / in the getRecBipart function from the paper.
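The `//` edit matters because `getRecBipart` bisects lists of indices; in Python 3 a bare `/` yields a float, which cannot be used as a slice bound. A quick illustration:

```python
# Python 2's integer '/' became true division in Python 3;
# '//' restores the floor behaviour the bisection relies on
items = list(range(5))
half = len(items) // 2  # 2, an int: a valid slice bound
items[:half]            # → [0, 1]
len(items) / 2          # 2.5, a float: items[:2.5] would raise TypeError
```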

Being a stronger R user than Python user, my next task was to use the code kindly contributed by Ilya Kipnis in his blog post here. Using the Python generateData function from the paper I was only able to reproduce half of the data inside the correlation and covariance matrices, which was kindly disclosed in Ilya’s blog post but not included in the paper. I tried in both Python 2 and 3, and neither worked. This was not a showstopper; it just means I am unable to reproduce the results from the paper exactly. My goal from that point was to get identical results with the correlation and covariance matrices I did have, and see if I could get similar results using Ilya’s code and Marcos’ code.

Once I had figured out some Python 2 and Python 3 conflicts, I was getting identical results. Great…this means I have 2 implementations of the HRP approach proposed by Marcos in his paper. From here I needed real data, specifically correlation and covariance matrices for my prospective investments. Using the IBrokers package again (shoutout to Jeff Ryan and Josh Ulrich, the author and maintainer of this package respectively), I downloaded daily data over a 10-year period for each of IGV, TLT, IAU and IYR. Since I could only get data back to 2016 for all of the ETFs, I cut off significant amounts of data for the ETFs which had the entire 10-year dataset. Again, this is not ideal, but for getting an MVP solution off the ground it would suffice.

Using the Return.calculate function from the popular PerformanceAnalytics package to compute returns, I then computed the correlation and covariance matrices. From there I had the option of using either R or Python for computing the necessary weights. One thing to note is that Marcos’ code outputs weights in their cluster order, whereas Ilya’s outputs them in the order they appear in the correlation and covariance matrices. Using the matrices computed from my daily returns for the 4 ETFs, I was able to replicate the weights proposed by Marcos’ code and Ilya’s code.

The theory behind the approach is discussed at length by Marcos and elaborated on by Gautier, and Ilya does a great job of distilling it down to the main components in his blog post. So I will refrain from commenting on the method, except to say that I need to spend some time in the Recursive Bisection function to properly understand the mechanism for assigning weights using a minimum variance and clustered approach.

The weights for my investments IGV, TLT, IAU and IYR respectively were:

[1] 0.1043488 0.3841945 0.3761123 0.1353444

The code to reproduce these results in Python and R is included below, together with the correlation and covariance matrices (I could only upload them as .xls documents…so you will need to convert to .csv or edit the code to read .xls documents):

# R code
corMat <- read.csv("corMat_23072021.csv", row.names = 1)
covMat <- read.csv("covMat_23072021.csv", row.names = 1)

# Plot dendrogram for some visualization of the order
clust <- hclust(dist(corMat), method = 'single')
clustOrder <- clust$order

getIVP <- function(covMat) {
  # get inverse variance portfolio from diagonal of covariance matrix
  invDiag <- 1/diag(as.matrix(covMat))
  invDiag/sum(invDiag)
}

getClusterVar <- function(covMat, cItems) {
  # compute cluster variance from the inverse variance portfolio above
  covMatSlice <- covMat[cItems, cItems]
  weights <- getIVP(covMatSlice)
  t(weights) %*% as.matrix(covMatSlice) %*% weights
}

recurFun <- function(covMat, sortIx) {
  # get first half of sortIx which is a cluster order
  subIdx <- 1:trunc(length(sortIx)/2)
  # subdivide ordering into first half and second half
  cItems0 <- sortIx[subIdx]
  cItems1 <- sortIx[-subIdx]
  # compute cluster variances of covariance matrices indexed
  # on first half and second half of ordering
  cVar0 <- getClusterVar(covMat, cItems0)
  cVar1 <- getClusterVar(covMat, cItems1)
  alpha <- 1 - cVar0/(cVar0 + cVar1)
  # updating weights outside the function using scoping mechanics
  w[cItems0] <<- w[cItems0] * alpha
  w[cItems1] <<- w[cItems1] * (1 - alpha)
  # rerun the function on a half if the length of that half is greater than 1
  if(length(cItems0) > 1) recurFun(covMat, cItems0)
  if(length(cItems1) > 1) recurFun(covMat, cItems1)
}

getRecBipart <- function(covMat, sortIx) {
  # keeping track of weights vector in the global environment
  assign("w", value = rep(1, ncol(covMat)), envir = .GlobalEnv)
  # run recursion function
  recurFun(covMat, sortIx)
  w
}

out <- getRecBipart(covMat, clustOrder)

# Python 3 code
import matplotlib.pyplot as mpl
import scipy.cluster.hierarchy as sch
import numpy as np
import pandas as pd

# Now, take corMat and covMat generated from R, input into Marcos' Python, and check output is the same…
col_list = ["IGV.Close", "TLT.Close", "IAU.Close", "IYR.Close"]
corr = pd.read_csv("corMat_23072021.csv", usecols=col_list)
cov = pd.read_csv("covMat_23072021.csv", usecols=col_list)

def getIVP(cov, **kargs):
    # Compute the inverse-variance portfolio
    ivp = 1. / np.diag(cov)
    ivp /= ivp.sum()
    return ivp

def getClusterVar(cov, cItems):
    # Compute variance per cluster
    cov_ = cov.loc[cItems, cItems]  # matrix slice
    w_ = getIVP(cov_).reshape(-1, 1)
    cVar = np.dot(np.dot(w_.T, cov_), w_)[0, 0]
    return cVar

def plotCorrMatrix(path, corr, labels=None):
    # Heatmap of the correlation matrix
    if labels is None: labels = []
    mpl.pcolor(corr)
    mpl.colorbar()
    mpl.yticks(np.arange(.5, corr.shape[0] + .5), labels)
    mpl.xticks(np.arange(.5, corr.shape[0] + .5), labels)
    mpl.savefig(path)
    mpl.clf()
    mpl.close()  # reset pylab

def correlDist(corr):
    # A distance matrix based on correlation, where 0<=d[i,j]<=1
    # This is a proper distance metric
    dist = ((1 - corr) / 2.)**.5  # distance matrix
    return dist

def getQuasiDiag(link):
    # Sort clustered items by distance
    link = link.astype(int)
    sortIx = pd.Series([link[-1, 0], link[-1, 1]])
    numItems = link[-1, 3]  # number of original items
    while sortIx.max() >= numItems:
        sortIx.index = range(0, sortIx.shape[0] * 2, 2)  # make space
        df0 = sortIx[sortIx >= numItems]  # find clusters
        i = df0.index
        j = df0.values - numItems
        sortIx[i] = link[j, 0]  # item 1
        df0 = pd.Series(link[j, 1], index=i + 1)
        sortIx = sortIx.append(df0)  # item 2
        sortIx = sortIx.sort_index()  # re-sort
        sortIx.index = range(sortIx.shape[0])  # re-index
    return sortIx.tolist()

def getRecBipart(cov, sortIx):
    # Compute HRP alloc
    w = pd.Series(1.0, index=sortIx)
    cItems = [sortIx]  # initialize all items in one cluster
    while len(cItems) > 0:
        cItems = [i[j:k] for i in cItems for j, k in ((0, len(i) // 2), (len(i) // 2, len(i))) if len(i) > 1]  # bi-section
        for i in range(0, len(cItems), 2):  # parse in pairs
            cItems0 = cItems[i]  # cluster 1
            cItems1 = cItems[i + 1]  # cluster 2
            cVar0 = getClusterVar(cov, cItems0)
            cVar1 = getClusterVar(cov, cItems1)
            alpha = 1 - cVar0 / (cVar0 + cVar1)
            w[cItems0] *= alpha  # weight 1
            w[cItems1] *= 1 - alpha  # weight 2
    return w

#3) cluster
dist = correlDist(corr)
link = sch.linkage(dist, 'single')
sortIx = getQuasiDiag(link)
sortIx = corr.index[sortIx].tolist()  # recover labels
df0 = corr.iloc[sortIx, sortIx]  # reorder
#4) Capital allocation
cov_colnames = cov.columns.values
cov.rename(columns={'IGV.Close': 0, 'TLT.Close': 1, 'IAU.Close': 2, 'IYR.Close': 3}, inplace=True)
cov.index = cov.columns  # label rows to match, so .loc slicing works
hrp = getRecBipart(cov, sortIx)
print(hrp)

For future iterations I will need to look into:

1. Understand whether or not expenses (and distributions for that matter) are priced into the prices I'm using for returns, or adjust with expense ratio info
2. Look at multiple metrics for “momentum”
3. Different objective functions and constraints
4. Different rebalancing periods
5. Backtest multiple portfolio optimization techniques
6. Refine how many assets to pick from each asset class, using a top-down approach. I suspect my choice of benchmark will matter here, and that will depend on my ultimate objectives and constraints.
7. Re-consider my 5bn USD threshold…there are better liquidity filters to use without risking the possible exclusion of otherwise valuable and appropriate ETFs

Thanks for reading!

How to: run quantstrat in a Jupyter Notebook with a Python3 Kernel

I have recently started using Jupyter more frequently. Because I am curious and see some potential, I wanted to know if I could run quantstrat from a Jupyter notebook with a Python3 kernel. Well, you can…and the steps below are mostly for my own recollection, and hopefully of use to someone else out there.

Note: I am using Ubuntu 18.04

Firstly, you need rpy2. I tried for a while to install rpy2 directly in Jupyter, to no avail. So I installed Anaconda3, launched a Jupyter notebook from the Anaconda Navigator and was able to successfully install rpy2 in a Jupyter notebook with a Python3 kernel.

Secondly, use the %load_ext magic command to load the rpy2 IPython extension into the notebook, “essentially initializing the R interface and allowing the notebook to connect and pass objects between the two languages.” Thanks to Jared Stufft and the info in his LinkedIn post:

%load_ext rpy2.ipython

Third, run the below command in a conda-activated terminal to install the package from GitHub. To get it to work I had to first tag quantstrat with a release number on GitHub. A nice side benefit is that we can now more easily navigate different versions of quantstrat.

To install packages from CRAN, you can simply use install.packages() in your R code cell.

conda skeleton cran

Lastly, use the %%R magic command at the top of a cell block to tell Jupyter to run R code. Add quantstrat to your environment and run one of the demos to test, such as demo('bbands'). If you see trades printing and chart output, congrats. Now I don't know how big a deal this will be for me…but knowing I can interface between R and Python, using packages from GitHub, fills me with wonder and optimism.


Below are some screenshots of the R console output in the Jupyter notebook. I saved the gist here.

R console output image #1
R console output image #2

Leave a comment, a note, a thumbs up, anything if this is useful to you.

A replication of the Practical Application section in ‘The Probability of Backtest Overfitting’ – Bailey et al.

In their paper “The Probability of Backtest Overfitting”, Bailey et al. introduce a method for detecting overfitting. They refer to the method as CSCV, or Combinatorially Symmetric Cross Validation. Bailey et al. proceed to show that CSCV produces reasonable estimates of PBO for several useful examples.

To illustrate the ease with which a random series can produce an overfit trading rule, Bailey et al., in “Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance”, seek to identify an optimal monthly trading rule based on 4 parameters, namely: entry day, holding period, stop loss and side. Using the ranges c(1:22), c(1:20), c(0:10) and c(-1,1) respectively, their program iterates through each permutation (8800 in total) to determine the strategy with the highest Sharpe ratio (SR). Their example produces an optimal SR of 1.27. The authors extend the data mining example by applying a seasonal adjustment to the first 5 days of each month. Specifically, they scale those daily returns by a factor of a quarter of the sigma used to generate the original series. The Python program for replicating this data mining process can be found here. Using their CSCV method to compute the PBO of each series of trials, the authors illustrate how CSCV diagnoses the probability of backtest overfitting for an overfit random strategy versus a strategy with a legitimate seasonal component.
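For a feel of the mechanics, here is a compact, stdlib-only sketch of the CSCV idea. It is simplified: mean return stands in for the Sharpe ratio, and PBO is taken as the fraction of the C(S, S/2) symmetric splits in which the in-sample winner falls at or below the out-of-sample median; see the paper for the full logit-based definition.

```python
from itertools import combinations
from statistics import mean

def pbo_cscv(M, S):
    # M: T rows of N strategy returns; S: even number of row blocks
    T, N = len(M), len(M[0])
    blocks = [M[i * T // S:(i + 1) * T // S] for i in range(S)]
    below_median = 0
    splits = list(combinations(range(S), S // 2))
    for train_ids in splits:
        train = [row for i in train_ids for row in blocks[i]]
        test = [row for i in range(S) if i not in train_ids for row in blocks[i]]
        perf_is = [mean(r[n] for r in train) for n in range(N)]
        perf_oos = [mean(r[n] for r in test) for n in range(N)]
        n_star = perf_is.index(max(perf_is))  # strategy picked in-sample
        # relative out-of-sample rank of the in-sample winner (higher is better)
        omega = sum(p <= perf_oos[n_star] for p in perf_oos) / (N + 1)
        below_median += omega <= 0.5
    return below_median / len(splits)

# a strategy that wins in every period is never flagged as overfit
pbo_cscv([[1, 0], [1, 0], [1, 0], [1, 0]], S=2)  # → 0.0
```

The blotter implementation discussed below is the real tool; this sketch only conveys the combinatorial train/test symmetry at the heart of the method.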

The code and analysis below is an R-ported version of the Python program, together with some PBO output derived using the new blotter:::pbo function (currently in the 73_pbo branch, soon to be merged into master), for the purposes of replicating the paper introducing the CSCV method.

The main function is dataMine(), called backTest() in the original Python program. We store the output as an S3 object, making possible the extension of additional analysis on the pbo output. There are a number of helper functions, namely:

  • numBusinessDays – for assigning an integer to the respective date, representing which business day in the month it is. This is used when applying the seasonality adjustment to the first ‘nDays’ of each month. Applicable when param ‘nDays’ is greater than 0.
  • getRefDates_MonthBusinessDate – for determining the respective nth business day in each month. Each month will have a 1st business day, so we assign dates for each 1st business day, then again for each 2nd business day of each month, etc.
  • getTrades – taking into consideration the constraints of the strategy (entry day, holding period {param name ‘exit’}, stop-loss and side), this function returns the trades.
  • attachTimeSeries – we keep a record of the daily returns for each trial (8800 in the paper) which we apply to the pbo() function. This function attaches each iterated trial.
  • evalPerf – used for recording various metrics, including the psrStat and the Sharpe ratio, the performance metric used in the paper. Bailey et al. emphasise that other metrics can be used with their framework (e.g. Sortino ratio, Jensen’s Alpha, Probabilistic Sharpe Ratio, etc). Offering a range of metrics to the analyst, and allowing them to rank the performance of each for the purposes of detecting PBO, is an exciting piece of future work we will look to implement in quantstrat in due course.
  • computePSR – a helper function within evalPerf() for computing the psrStat and psr.
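As an example of the first helper, here is a rough Python sketch of what numBusinessDays computes. This is my reading of the helper's role, not the actual R code: weekdays are counted from the start of the month, holidays ignored.

```python
from datetime import date, timedelta

def num_business_days(month_start, d):
    # which business day of its month is d? count weekdays from the 1st through d
    count = 0
    cur = month_start
    while cur <= d:
        if cur.weekday() < 5:  # Monday-Friday
            count += 1
        cur += timedelta(days=1)
    return count

# 3 January 2000 (a Monday) was the 1st business day of that month
num_business_days(date(2000, 1, 1), date(2000, 1, 3))  # → 1
```

This is the integer the seasonality adjustment compares against nDays when deciding which returns to scale.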

We have kept the names of the helper functions identical to those used in the python program.

When the dataMine() function is called, we print the SR along with other strategy characteristics to the console for the first strategy and for every subsequent strategy which improves on the prior SR. The last line printed once the function has completed will be the iteration which generated the max Sharpe ratio. Each line prints 8 details, which are:

  1. iteration number,
  2. business day in month used for entry day,
  3. holding period,
  4. stop-loss,
  5. side (-1 for sell, 1 for buy),
  6. Sharpe ratio,
  7. Frequency (see calc in evalPerf function),
  8. psrStat or probabilistic Sharpe ratio statistic
dataMine <- function(nDays=0, factor=0){
  # 1) Input parameters --- to be changed by the user
  holdingPeriod <- 20
  sigma <- 1
  stopLoss <- 10
  length <- 1000
  # 2) Prepare series
  date <- as.Date("2000-01-01") # original python program uses 01/01/2000 start date
  dates <- list()

  while(length(dates) < length){
    if(as.POSIXlt(date)$wday != 0 & as.POSIXlt(date)$wday != 6 & length(dates) == 0){
      dates <- date
    } else if(as.POSIXlt(date)$wday != 0 & as.POSIXlt(date)$wday != 6 & length(dates) != 0){
      dates <- append(dates, date)
    }
    date <- as.Date(date) + 1
  }

  series <- matrix(0, nrow = length, ncol = 1)
  for(i in 1:length(series)){
    series[i] <- rnorm(1, mean = 0, sd = sigma)
    pDay <- as.Date(paste(format(as.Date(dates[i], format="%d/%m/%Y"),"%Y"),
                          format(as.Date(dates[i], format="%d/%m/%Y"),"%m"),
                          "01", sep = "-")) # first calendar day of the month
    if(numBusinessDays(pDay, dates[i]) <= nDays){
      series[i] <- series[i] + sigma * factor
    }
  }
  series <- cumsum(series)
  # 3) Optimize
  # get the daily index in each month for each date in the series,
  # so the 2nd BD in Jan 2000 is 03/01/2000 and in Feb 2000 is
  # 02/02/2000. refDates[[2]] will be a list of elements with these dates.
  refDates <- getRefDates_MonthBusinessDate(dates) 
  psr <- sr <- trades <- sl <- freq <- pnl <- trades_df <- constraints <- bDay <- NULL
  count <- 0
  # start of main loop - we loop over each date in the refDates list
  sr <- NULL
  for(pDay in 1:length(refDates)){
    refDates_0 <- refDates[[pDay]]
    if(length(refDates_0) == 0){ next }
    # 4) Get trades
    iterdf <- expand.grid(z=c(-1,1), y=c(-10:0), x=c(1:20))
    iterdf <- iterdf[c("x","y","z")]
    for(prod in 1:nrow(iterdf)){
      count = count + 1
      trades_0 <- getTrades(series, dates, refDates_0, iterdf[prod,1], iterdf[prod,2], iterdf[prod,3])
      dates_0 <- NULL
      pnl_0 <- NULL
      if(length(trades_0) == 0) { next }
      for(j in 1:length(trades_0)) { dates_0 <- append(dates_0, trades_0[[j]][1]) }
      for(j in 1:length(trades_0)) { pnl_0 <- append(pnl_0, as.numeric(trades_0[[j]][4])) }
      # 5) Eval performance
      if(length(pnl_0) > 2){
        # 6) Reconcile PnL
        pnl <- attachTimeSeries(pnl, pnl_0, dates_0, count)
        # 7) Evaluate
        eval_list <- list()
        eval_list <- evalPerf(pnl_0, dates[1], dates[length(dates)])
        names(eval_list) <- c("sr_0","psr_0","freq_0")
        for(j in 2:length(pnl_0)){
          pnl_0[j] <- pnl_0[j] + pnl_0[j - 1]
        }
        if(is.null(sr)){ sr <- 0 } # give sr a numeric value
        if(eval_list["sr_0"] > sr){ # if sr == None or sr_ > sr:
          psr <- eval_list["psr_0"]
          sr <- eval_list["sr_0"]
          trades_df <- data.frame(matrix(unlist(trades_0), nrow=length(trades_0), byrow=T),stringsAsFactors=FALSE)
          trades_df[,1] <- as.Date(trades_df[,1])
          colnames(trades_df) <- c("txndate","num","side","daypl")
          freq <- eval_list["freq_0"]
          bDay <- pDay # freq, pDay, prod = freq_, pDay_, prod_
          constraints <- c(iterdf[prod,1],iterdf[prod,2],iterdf[prod,3])
          cat(count, bDay, constraints, round(sr,2), round(freq,2), round(psr,2), fill = TRUE, sep = ",")
        }
      }
    }
  }
  ret <- list(Total_Iterations = count,
              PL = pnl,
              PSR = psr,
              SR = sr,
              trades = trades_df,
              Freq = freq,
              Business_Day = bDay, # business day of the best iteration
              HP_SL_Side = constraints, # HP = Holding Period, SL = Stop Loss, Side = Side
              Dates = dates,
              cumsum_Series = series) # series is a cumsum of the original return series

  class(ret) <- "dataMine"
  return(ret)
}

# ----------------------------------------------------------------------------------------
evalPerf <- function(pnl, date0, date1, sr_ref=0){
  freq <- (length(pnl) / as.numeric((difftime(date1, date0) + 1)) * 365.25)
  m1 = mean(pnl)
  m2 = sd(pnl)
  m3 = skewness(pnl)
  m4 = kurtosis(pnl, method = "fisher") # fisher=False in original, but unsure what alternative to use
  sr = m1 / m2 * freq ^ 0.5
  psr = computePSR(c(m1, m2, m3, m4), length(pnl), sr_ref=sr_ref / freq ^ 0.5, moments=4)[1]
  return(c(sr, psr, freq))
}

# ----------------------------------------------------------------------------------------
computePSR <- function(stats, obs, sr_ref=0, moments=4){
  # Compute PSR
  stats_0 <- c(0, 0, 0, 3) # defaults; first 'moments' entries overwritten below
  stats_0[1:moments] <- stats[1:moments]
  sr <- stats_0[1] / stats_0[2]
  psrStat <- (sr - sr_ref) * (obs - 1) ^ 0.5 / (1 - sr * stats_0[3] + sr ^ 2 * (stats_0[4] - 1) / 4) ^ 0.5
  psr <- pnorm(psrStat)
  return(c(psrStat, psr))
}
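As a quick sanity check on the PSR formula, here is a self-contained sketch (computePSR_demo is a hypothetical name, not part of the program above) evaluating the statistic for a normally distributed PnL:

```r
# Hypothetical, self-contained check of the PSR formula used above.
# For skewness 0 and kurtosis 3 (a normal PnL), the denominator reduces to
# sqrt(1 + sr^2/2), so the statistic is close to (sr - sr_ref) * sqrt(obs - 1).
computePSR_demo <- function(stats, obs, sr_ref = 0) {
  sr <- stats[1] / stats[2] # per-period Sharpe ratio
  psrStat <- (sr - sr_ref) * sqrt(obs - 1) /
    sqrt(1 - stats[3] * sr + sr^2 * (stats[4] - 1) / 4)
  c(psrStat = psrStat, psr = pnorm(psrStat))
}
round(computePSR_demo(c(0.1, 1, 0, 3), obs = 253), 4)
```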

# ----------------------------------------------------------------------------------------
attachTimeSeries <- function(series, series_0, index=NULL, label='', how='outer'){
  if(is.null(series)){
    series <- xts(series_0, as.Date(unlist(index)))
  } else {
    series_0 <- xts(series_0, as.Date(unlist(index)))
    series <- merge(series, series_0)
  }
  return(series)
}

# ----------------------------------------------------------------------------------------
getTrades <- function(series, dates, refDates, exit, stopLoss, side){
  # Get trades
  trades <- list()
  pnl <- position_0 <- 0
  j <- 1
  num <- NULL
  for(i in 2:length(dates)){
    # Find next roll and how many trading dates to it
    if(dates[i] >= refDates[j]){
      if(dates[i - 1] < refDates[j]){ num <- pnl <- 0 }
      if(j < length(refDates) - 1){
        while(dates[i] > refDates[j]) { j = j + 1 }
      }
    }
    if(is.null(num)){ next }
    # Trading rule
    position <- 0
    if(num < exit & pnl > stopLoss){ position <- side }
    if(position != 0 | position_0 != 0) {
      trades <- append(trades, list(list(as.Date(dates[i]), as.numeric(num), 
                                         as.numeric(position_0),
                                         as.numeric(position_0 * (series[i] - series[i - 1])))))
      pnl <- pnl + as.numeric(trades[[length(trades)]][4])
      position_0 <- position
    }
    num <- num + 1
  }
  return(trades)
}

# ----------------------------------------------------------------------------------------
getRefDates_MonthBusinessDate <- function(dates){
  refDates <- list() # python dict in de Prado program
  pDay <- list()
  first <- as.Date(paste(format(as.Date(dates[1], format="%d/%m/%Y"),"%Y"),
                         format(as.Date(dates[1], format="%d/%m/%Y"),"%m"),
                         "01", sep = "-")) # first calendar day of the first month
  m <- as.numeric(format(dates[1],"%m"))
  d <- numBusinessDays(first, dates[1]) + 1
  # For all dates, get the day (int)...will be all business days, and assign to pDay
  for(i in 1:length(dates)){
    if(m != as.numeric(format(dates[i],"%m"))){ 
      m <- as.numeric(format(as.Date(dates[i], format="%d/%m/%Y"),"%m"))
      d <- 1
    }
    pDay <- append(pDay, d)
    d <- d + 1
  }
  for(j in 1:30){ # python "range(1,30)"
    lst <- list()
    for(i in 1:length(dates)){
      if(pDay[i] == j){
        lst <- append(lst, dates[i])
      }
    }
    refDates[[j]] = lst
  }
  return(refDates)
}

# ----------------------------------------------------------------------------------------
numBusinessDays <- function(date0, date1){
  m <- 1
  date_0 <- date0
  repeat{
    date_0 <- as.Date(date_0) + 1
    if(date_0 >= date1){ break }
    if(as.POSIXlt(date_0)$wday != 0 & as.POSIXlt(date_0)$wday != 6){
      m <- m + 1 
    }
  }
  return(m)
}

We will implement the dataMine() function twice, as in the paper, once for a purely random series and again for a seasonally adjusted series. In the first case we leave the default values for parameters ‘nDays’ and ‘factor’, which are zero. In the second run, we use ‘nDays’ = 5 and ‘factor’ = 0.25.


# First build our Sharpe function - the performance metric used in the paper for determining PBO
# which gets passed to pbo() as the performance metric used
sharpe <- function(x,rf=0.03/252) {
  sr <- apply(x,2,function(col) {
    er = col - rf
    return(mean(er)/sd(er))
  })
  return(sr)
}

# Run dataMine() for a random series
t1 <- Sys.time()
dm_random <- dataMine(nDays = 0, factor = 0)
#> 1,1,1,-10,-1,0.77,22.99,1.52
#> 68,1,4,-10,1,0.93,57.48,1.85
#> 84,1,4,-2,1,0.93,56.43,1.86
#> 464,2,2,-10,1,1.29,34.49,2.7
#> 486,2,3,-10,1,1.46,45.98,2.97
#> 502,2,3,-2,1,1.54,45.72,3.16
#> 504,2,3,-1,1,1.55,44.94,3.19
t2 <- Sys.time()
timediff <- difftime(t2,t1)
#> Time difference of 22.62471 mins

# blotter:::pbo
dm_random$PL[!is.finite(dm_random$PL)] <- 0
pbo_random  <- blotter:::pbo(dm_random$PL, f=sharpe)
#> Performance function sharpe with threshold 0
#>    p_bo   slope    ar^2  p_loss 
#> 0.50000 0.11922 0.08100 0.48600
# histogram(random_series,type="density")
# xyplot(random_series,plotType="degradation")


# Run dataMine() for a seasonal series
t3 <- Sys.time()
dm_season <- dataMine(nDays = 5, factor = 0.25)
#> 2,1,1,-10,1,0.19,22.99,0.37
#> 24,1,2,-10,1,1.37,34.49,2.85
#> 46,1,3,-10,1,2.13,45.98,4.65
#> 68,1,4,-10,1,2.63,57.48,5.64
#> 486,2,3,-10,1,2.87,45.98,6.43
#> 504,2,3,-1,1,2.9,45.46,6.55
t4 <- Sys.time()
#> Time difference of 21.87491 mins
dm_season$PL[!is.finite(dm_season$PL)] <- 0 # Replace NAs with zeros
pbo_seasonal  <- blotter:::pbo(dm_season$PL, f=sharpe)
#> Performance function sharpe with threshold 0
#>    p_bo   slope    ar^2  p_loss 
#> 0.00000 0.19634 0.05000 0.00000
# histogram(season_series,type="density")
# xyplot(season_series,plotType="degradation")

The PBO results from both the random and seasonally adjusted series should be broadly similar to those reported in the paper. Bailey et al. get an optimal SR of 1.27 across all their trials for the random series and a PBO of 55%. For the seasonal series they report an optimal SR of 1.54 and a PBO of only 13%. Setting the seed to 777 (jackpot) for both runs of dataMine(), for the purposes of comparison and replication, we get an optimal SR of 1.55 for the random series with a PBO of 50%, and an optimal SR of 2.9 for the seasonal series with a PBO of 0%. These PBO scores are directionally consistent with the results of Bailey et al. (55% and 13% respectively). The optimal iteration occurs at number 504 for both series when the random seed is set to 777.
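The CSCV machinery behind pbo() can be sketched in a few lines of base R. This toy version (not the blotter:::pbo or pbo-package implementation) splits a matrix of trial PnLs into S blocks, and for each in-sample/out-of-sample block combination records the out-of-sample relative rank of the best in-sample trial; PBO is the fraction of combinations where that logit is non-positive:

```r
# Toy CSCV sketch of the PBO calculation; not the blotter:::pbo implementation.
set.seed(777)
M <- matrix(rnorm(160 * 10), ncol = 10) # 160 obs x 10 hypothetical trial PnLs
S <- 4                                  # number of blocks
blocks <- split(1:nrow(M), rep(1:S, each = nrow(M) / S))
combos <- combn(S, S / 2, simplify = FALSE) # all in-sample block combinations
logits <- sapply(combos, function(is) {
  ins <- unlist(blocks[is])
  oos <- unlist(blocks[-is])
  best <- which.max(colMeans(M[ins, ]))               # best trial in sample
  r <- rank(colMeans(M[oos, ]))[best] / (ncol(M) + 1) # its OOS relative rank
  log(r / (1 - r))
})
mean(logits <= 0) # the PBO estimate: how often the IS winner underperforms OOS
```

For pure noise trials like these we would expect a PBO estimate near 0.5, as in the random series above.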

Future work

  • Adding different performance metrics as methods to blotter:::pbo.
  • Adding a function for comparing overfit statistics for the different performance metrics. Presumably using the Sharpe ratio as your metric will give different results compared to using max drawdown, or Sortino Ratio etc. Can this tell us anything about the relevance and/or applicability of the different performance metrics?
  • Apply output from txnsim to pbo(), since txnsim() attempts to evaluate overfitting in the context of randomized versions of the original strategy whilst honouring the outward resemblance of the original strategy.
  • Optimise dataMine() with vectorization.


Thanks to Duncan Temple Lang (@duncantl), who developed the R package, RWordPress, and William K. Morris (@wkmor1), Yihui Xie (@xieyihui), and Jared Lander (@jaredlander), who developed the knit2wp function in knitr. These two packages make it possible to write a blog post in R Markdown and then publish it directly to a WordPress site as i did with this post. For instructions on how to do this, see this post by @HeatherUrry.

Round Turn Trade Simulation – in R

I was fortunate enough to talk about my latest open source work with Brian Peterson at the R/Finance conference in Chicago just less than 1 month ago. It was my first time at the conference, and I will certainly be back. The topics and their presentations are available on the website. With this post, I hope to share the main ideas of my talk.

Back in August 2017 I wrote a post about a new function in the R blotter package called mcsim(). The objective for writing that post was to introduce the function, and how and why it may be useful when evaluating quantitative trading strategies. In this article I aim to achieve the same goal for a new function named txnsim(), with a similar application but with more analytical value.

With mcsim() an analyst is able to simulate a trading strategy's portfolio PL, which gives some insight into the statistical properties of the strategy overall. Simulating portfolio PL can have drawbacks, however; in particular, simulations at the portfolio level may lack transparency or may not line up with historical market regimes. In these instances sampling round turn trades may be more useful.

With txnsim() the analyst is able to construct a random strategy that preserves as many of the stylized facts (or style) of the observed strategy as possible, while demonstrating no skill. The round turn trades of the random replicate strategies, while outwardly resembling the original strategy in summary time series statistics, are the result of random combinations of observed features taking place at random times in the tested time period. This effectively creates simulated traders with the same style but without skill. For this reason txnsim() is most appropriate for discerning skill vs. luck or overfitting.


Performance Simulations in the Literature

  • Pat Burns (2004) covers the use of random portfolios for performance measurement, and in a subsequent 2006 paper their use for evaluating trading strategies, which he terms a related but distinct task. He goes on to mention in the later paper that statistical tests for a signal's predictiveness were generally possible even in the presence of potential data snooping bias. Things have likely changed in the 12 years since: data snooping has become more prevalent, with more data, significantly advanced computing power and the ability to fit an open source model to almost any dataset.

Pat Burns – “If we generate a random subset of the paths, then we can make statistical statements about the quality of the strategy.”

  • Jaekle & Tomasini in their Trading Systems book refer to the analysis of trading systems using Monte Carlo analysis of trade PNL. In particular they mention the benefit of a confidence interval estimation for max drawdowns.

Jaekle & Tomasini – “Changing the order of the performed trades gives you valuable estimations about expected maximum drawdowns.”

  • In the Probability of Backtest Overfitting paper by Lopez de Prado et al in 2015, they present a method for assessing data snooping as it relates to backtests, which are used by investment firms and portfolio managers to allocate capital.

Lopez de Prado et al – “…because the signal-to-noise ratio is so weak, often
the result of such calibration is that parameters are chosen to profit from
past noise rather than future signal.”

Thanks to the consent of Matt Barry, we will soon be implementing code from the pbo package into blotter and extending from there to fit in with the general framework of blotter and our ambitions of broadening the overfit detection methods available in quantstrat and blotter.

  • Harvey et al, in their series of papers including Backtesting and the Cross-Section of Expected Returns discuss their general dismay at the reported significance of papers attempting to explain the cross-section of expected returns. They propose a method for deflating the Sharpe Ratio when taking into account the data snooping bias otherwise referred to as Multiple Hypothesis testing.

Harvey et al – “We argue that most claimed research findings in financial economics are likely false.”

We implemented Harvey’s haircut Sharpe ratio model in quantstrat.

What all these methods have in common, is an element of random sampling based on some constraint. What we propose in txnsim() is the random sampling of round turn trades bound by the constraint of the stylized facts of the observed strategy.

Compared with more well-known simulation methods, such as simulating portfolio P&L, Round Turn Trade Simulation has the following benefits:

  1. Increased transparency, since you can view the simulation detail down to the exact transaction, thereby comparing the original strategy being simulated to random entries and exits with the same overall dynamic.
  2. More realistic since you sample from trade durations and quantities actually observed inside the strategy, thereby creating a distribution around the trading dynamics, not just the daily P&L.

What all this means, of course, is you are effectively creating simulated traders with the same style but zero skill.


Stylized facts

If you consider the stylized facts of a series of transactions that are the output of a discretionary or systematic trading strategy, it should be clear that there is a lot of information available to work with. The stylized facts txnsim() uses for simulating round turns include:

  • Round turn trade durations
  • Ratio of long:short durations
  • Quantity of each round turn trade
  • Direction of round turns
  • Number of layers entered, limited by maximum position

Using these stylized facts, txnsim() samples either with or without replacement between flat periods, short periods and long periods and then layers onto these periods the sampled quantities from the original strategy with their respective durations.
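The sampling-with-truncation idea can be illustrated with a minimal sketch (hypothetical durations; not the txnsim() internals): draw long-period durations with replacement until the replicate matches the original strategy's total long duration, trimming the final draw that overshoots.

```r
# Minimal sketch of duration sampling with truncation; not txnsim() internals.
set.seed(333)
orig_long_durations <- c(5, 12, 3, 8, 20, 7) # hypothetical round turn durations (days)
target <- sum(orig_long_durations)           # total long duration to honour
sampled <- numeric(0)
while (sum(sampled) < target) {
  sampled <- c(sampled, sample(orig_long_durations, 1, replace = TRUE))
}
overshoot <- sum(sampled) - target
sampled[length(sampled)] <- sampled[length(sampled)] - overshoot # truncate last draw
sum(sampled) == target # the replicate matches the original total duration
```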


Round Turn Trades

In order to sample round turn trades, the analyst first needs to define what a round turn trade is for their purposes. In txnsim() there is a parameter named tradeDef which can take one of 3 arguments: 1. "flat.to.flat", 2. "flat.to.reduced", 3. "increased.to.reduced". The argument is subsequently passed to the blotter::perTradeStats() function from which we extract the original strategy's stylized facts.

For a more comprehensive explanation of the different trade definitions, i will refer you to the help documentation for the perTradeStats() function as well as the documentation for the txnsim() function in blotter.



The first empirical example we will take a look at is an analysis using txnsim() and the longtrend demo in blotter. My only modification to the demo was to end the strategy in Dec 2017, purely for the purposes of replicating my results.

As we can see in the blue Positionfill window, the strategy only enters into a position once, before exiting.

longtrend performance-1

If we look at how the strategy performed overall (after setting the seed to lucky number 333), relative to its random replicates we can see fairly quickly that it is a difficult strategy to beat.

txnsim - longtrend

To observe the stylized facts of the original versus the winning replicate strategy, we can contrast the Positionfills of both.

Positionfil longtrend-1.png

Replicate number 664 out of 1,000 was the most profitable strategy overall (we confirm this in a moment), and we get a good sense of how txnsim() honoured the stylized facts of the original strategy when determining the random entries and exits of the ultimate winning replicate number 664.

One of the many slots returned in the txnsim object is named “replicates” and includes the replicate start timestamps, the durations and the corresponding quantities. With the duration information in particular, we are able to chart the distribution of long period durations and flat period durations.

histogram long period durations longtrend_txnsim with replacement-1

Perhaps not surprisingly, we see the original strategy duration for long and flat periods roughly in the middle of the distributions of the replicates.

Since longtrend was a long only strategy, the flat period distribution is a mirror image of the long period distribution.

histogram flat period durations longtrend_txnsim with replacement-1.png


Included in the returned list object of class txnsim, are ranks and pvalues which summarise the performance of the originally observed strategy versus the random replicates.

longtrend - ranks_pvalues.PNG

longtrend - pvalues

As we can see, longtrend was in the 90th percentile for all performance metrics analysed except for stddev where it ranked inside the 84th percentile.

Using the hist method for objects of type txnsim we can plot any of the performance metric distributions to gauge how the observed strategy fared overall. Looking at the Sharpe ratio, we see graphically how longtrend does relative to the replicates as well as relative to configurable confidence intervals.

longtrend - hist sharpe

Maximum drawdown is another one of the performance metrics used and generally a favorite for simulations. We can see again, visually, that longtrend outperforms most random replicates on this measure.

longtrend - hist maxdd.png


Layers and Long/Short strategies with ‘bbands’

For any round turn trade methodology which is not measuring round turns as "flat.to.flat", things get more complicated.

The first major complication with any trade that levels into a position is that the sum of trade durations will be longer than the market data. The general pattern of the solution is that we sample as usual, to a duration equivalent to the duration of the first layer of the strategy. In essence we are sampling assuming round turns are defined as "flat.to.flat". Any sampled durations beyond this first layer are overlapped onto the first layer. The number of layers is determined by the number of times the first layer's total duration is divisible into the total duration. In this way the total number of layers and their durations are directly related to the original strategy.

The next complication is max position. A strategy may or may not utilize position limits, but this is irrelevant: we have no idea which parameters are used within a strategy, only what is observable ex post. For this reason we store the maximum long and short positions observed as a stylized fact. To ensure we do not breach these observed max long and short positions during layering we keep track of the respective cumsum of each long and short leveled trade.
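A minimal sketch of that constraint (hypothetical quantities; not the txnsim() code): reject any sampled layer whose quantity would push the cumulative layered position past the observed max.

```r
# Sketch of enforcing an observed max long position while layering; hypothetical values.
max_long <- 100                 # max long position observed in the original strategy
layer_qty <- c(40, 30, 50, 20)  # sampled layer quantities
accepted <- numeric(0)
for (q in layer_qty) {
  if (sum(accepted) + q <= max_long) { # cumulative position stays within the observed max
    accepted <- c(accepted, q)
  }
}
accepted       # the 50-lot layer is dropped: 40 + 30 + 50 would breach 100
sum(accepted)  # cumulative layered position stays at or below max_long
```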

For any trade definition other than, however, we need to be cognizant of flat periods when layering to ensure we do not layer into an otherwise sampled flat period. For this reason we match the sum duration of flat periods in the original strategy for every replicate. To complete the first layer with long and short periods, we sample these separately and truncate the respectively sampled long and short duration which takes us over our target duration. When determining a target long and short total duration to sample to, we use the ratio of long periods to short periods from the original strategy to distinguish between the direction of non-flat periods.

To highlight the ability of txnsim() to capture the stylized facts of more comprehensive strategies including Long/Short strategies with leveling we use a variation of the ‘bbands’ strategy. Since we apply a sub-optimal position-sizing adjustment to the original demo strategy in order to illustrate leveling, we do not expect the strategy to outperform the majority of its random counterparts.

A quick look at the chart.Posn() output of bbands should highlight the difference in characteristics between longtrend and bbands.

bbands chart and equity curve-1.png

We run 1k replicates and the resulting equity curves, as you can see here, confirm our suspicions. We have a lower probability of outperforming random replicates for this version of 'bbands'. In fact you will see there are periods during the backtest where we severely underperform the other random agents. Something I touch on in the future work section is the addition of something similar to Burns' non-overlapping periodic p-values so we can better visualize how the backtest performed through time.

bbands - txnsim.png

Taking a closer look at the performance and position taking of the “winning” random replicate, we get a sense of how the strategy attempts to mirror the original in terms of position sizing and duration of long versus short positions overall. It should also be evident how the replicate has honored the maximum long and short positions observed in the original strategy.

bbands txnsim winner.png

Comparing the position fills of the original strategy and the winning replicate more directly we get a better sense of the overall dynamic of both the original and the winning replicate.

bbands - Positionfills.png

What is potentially a red flag from this chart, is the difference in padding. The replicate clearly has less padding, meaning the total duration that the strategy is in the market will be less than the original.

When we plot the long and short duration distributions of the replicates and compare these to the original strategy, it highlights the magnitude of the discrepancy and is something we hope to resolve soon so we can move onto the txnsim vignette and hopefully the start of a paper on Round Turn Trade Simulations.

bbands - replicate long period durations.png


bbands - replicate short period durations.png


Future Work

As mentioned previously and in no particular order, future work items will include:

  • Refining the layering process to better replicate total trade duration of the original strategy
  • Adding p-value visualization through time, similar to Pat Burns’ 10-day non-overlapping pvalues
  • Adding other simulation methodologies
  • Simulation studies of multivariate portfolios bound by capital constraints
  • Basing simulations on simulated or resampled market data
  • Applying txnsim stylized facts to “OOS” market data
  • And of course, a vignette and hopefully a paper on Round Turn Trade Simulation



Round turn trade Monte Carlo simulates random traders who behave in a similar manner to an observed series of real or backtest transactions. We feel that round turn trade simulation offers insights significantly beyond what is available from:

  • equity curve Monte Carlo (implemented in blotter in mcsim),
  • from simple resampling (e.g. from pbo or boot),
  • or from the use of simulated input data (which typically fails to recover many important stylized facts of real market data).

Round turn trade Monte Carlo as implemented in txnsim directly analyzes what types of trades and P&L were plausible with a similar trade cadence to the observed series. It acts on the same real market data as the observed trades, efficiently searching the feasible space of possible trades given the stylized facts. It is, in our opinion, a significant contribution for any analyst seeking to evaluate the question of “skill vs. luck” of the observed trades, or for more broadly understanding what is theoretically possible with a certain trading cadence and style.

The source code for my talk and this post is on GitHub.

Thanks to my co-author Brian Peterson!

Monte Carlo Simulation for your Portfolio PL

‘Once you have a system, the biggest obstacle is trusting it.’
Tom Wills

‘What is the last thing you do before you climb on a ladder? You shake it. And that
is Monte Carlo simulation.’
Sam Savage, Stanford University



In my early days of looking at trading strategies, getting to the equity curve felt like the final piece of the puzzle. The piece that would provide enough perspective so that you could make the decision whether or not to proceed to production. I was always more comfortable with the scientific approach than my gut feel. Perhaps this was a function of my limited time in the market. After collaborating with Brian Peterson and studying the many great articles written by him and so many others in the community it became clear the equity curve was only the first piece of the puzzle.

Questions one should ask subsequent to generating a complete backtest are:

  • Is the result realistic?
  • Could we have overfit?
  • What range of values can we expect out of sample?
  • How much confidence can we have in the model?

A backtest generally represents limited information. It is merely one run of a particular strategy whose out-of-sample performance could be ambiguous. Sampling from this limited information however may allow us to get closer to the true process, thereby adding statistical significance to any inferences made from the backtest. mcsim() gives us the potential to do just that, simulating portfolio PL from either post-trade or backtested transactions, as well as providing multiple options for tweaking the sampling procedure according to strategy characteristics.
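The core idea can be sketched in a few lines (a toy example, not the mcsim() implementation): resample the backtest's daily PL with replacement and rebuild many cumulative equity paths, from which sample statistics and intervals can be drawn.

```r
# Toy equity-curve resampling; mcsim() wraps this idea with many more options.
set.seed(333)
daily_pl <- rnorm(252, mean = 10, sd = 100) # hypothetical daily portfolio PL
n <- 100                                    # number of simulated paths
paths <- replicate(n, cumsum(sample(daily_pl, replace = TRUE)))
dim(paths) # one column per simulated equity curve
quantile(paths[nrow(paths), ], c(0.025, 0.975)) # 95% interval for final PL
```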

Whilst mcsim() has been alive and well in blotter for a while now (blotter is the open source transaction infrastructure package in R for defining instruments, accounts and portfolios for trading systems and simulations) i am yet to write a post about it. My prior posts here and here were a prelude to the superior functionality now available with mcsim().



Included in the documentation for the mcsim() function is an example of its use with the longtrend demo. I will refer to the mcsim() output of that strategy for this post. mcsim() comes with 10 parameters which i describe next.

  • Portfolio, Account

These are objects stored in your blotter environment. If you are using post-trade cash PL in your simulation then these parameters can be left as their default NULL values. If using the longtrend demo for example you can use “longtrend” for both.

The Account object is used solely for retrieving your initial equity amount specified at the start of your strategy. The Portfolio object is used for retrieving dailyEqPL or dailyTxnPL (both existing blotter functions) depending on the ‘use’ parameter, which we cover a few parameters down.

  • n

‘n’ is used to specify the number of simulations you would like to run on your backtest. The longtrend help file uses 1k simulations. This takes me ~15secs on my Intel i5-4460. Disclaimer: no optimisation or refactoring was carried out on mcsim()…this is a TODO item. In my experience 1k simulations almost certainly achieves an asymptotically normal distribution with an acceptable level of statistical significance. For investigating the general shape, however, 100 simulations should suffice.

  • replacement

Any sampling procedure can generally be run with or without replacement. Sampling without replacement means results will have the same mean and final PL but different paths. Sampling with replacement results in a wider distribution of the paths the strategy could have taken, as well as results which vary in final PL and the other sample statistics (maxDD, stddev, quasi-Sharpe). This differentiation will be clear when running mcsim() for a given strategy with and without replacement, and calling the plot and hist functions on the mcsim object the results were assigned to. See the examples in the help docs, which we cover a little further down.

  • use

The ‘use’ parameter lets you specify whether or not you would like to use daily-marked portfolio PL, transaction PL (think intraday strategies) or cash PL if using an external xts object of post-trade returns. The method used in Tomasini & Jaekle (2009), Chapter 4 resembles that of sampling with use=’txns’.

  • l

The ‘l’ parameter (lower case L) stands for ‘length’ and allows you to sample blocks of PL observations. Assuming your strategy exhibits some form of autocorrelation it may make sense to incorporate a block length in your sampling procedure based on this observed autocorrelation. Alternatively some inverse multiple of your average holding period could also justify a value for block length ‘l’.

  • varblock

‘varblock’ is a boolean, which allows you to sample with a variably sized block length. If replacement=TRUE, the sampling is based on the tsboot() function in the boot package which you will note is a dependency for running mcsim(). The tsboot() function was the perfect solution for sampling time series, but it only allows for sampling with replacement which should give a sense of what the general preference is from a statistical perspective. When ‘varblock’ is set to TRUE together with ‘replacement’ the resulting block lengths are sampled from a geometric distribution with mean ‘l’. If replacement=FALSE the distribution of block lengths is uniformly random around ‘l’.

  • gap

For strategies which require leading market data before computing indicators on which to base signals and subsequent rules, users can use the ‘gap’ parameter to specify how many periods from the strategy initialization date to start the simulations. For longtrend we use a gap of 10 since it uses an SMA lookback period of 10 (monthly) observations.

  • CI

CI or Confidence Interval is the parameter used to specify the confidence level to display in the hist() function which charts a range of sample statistics for the strategy. Default is 95%.

  • cashPL

This parameter is used when running mcsim() on post-trade cash PL, and refers to a separate xts object created by the user.
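Two of the parameter choices above can be illustrated with a short sketch (toy data, not mcsim() internals): sampling without replacement is a permutation, so the final PL is pinned while the path varies; and with 'varblock' and 'replacement' both TRUE, block lengths are drawn from a geometric distribution with mean 'l'.

```r
# Toy illustration of 'replacement' and geometric block lengths; not mcsim() internals.
set.seed(42)
pl <- rnorm(100)
perm <- sample(pl, replace = FALSE) # a permutation: same observations, new order
isTRUE(all.equal(sum(perm), sum(pl))) # TRUE - final PL unchanged without replacement

l <- 5 # target mean block length
block_lengths <- rgeom(10000, prob = 1 / l) + 1 # geometric block lengths with mean l
mean(block_lengths) # close to l
```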



There are currently 4x S3 methods related to mcsim(), all of which return some portion of the ‘mcsim’ object as either a plot, hist or table (summary and quantile). For each method you can choose to set normalize=TRUE or FALSE. This will simply define the return space your simulations operate in…either cash or percent. Depending on the strategy duration and your reinvestment assumptions, there could be pros and cons to either approach. The distributions should be fairly similar in any event. In addition to ‘normalize’ there is also a ‘methods’ parameter for hist() which lets you specify the sample statistics to return in histogram plots. There are currently 5 stats, namely: ‘mean’, ‘median’, ‘stddev’, ‘maxDD’ and ‘sharpe’. The default is to return all methods as 5 separate histogram plots. This may be inappropriate when setting replacement=FALSE since all sample statistics barring maxDD will be identical (in the case of cash returns).

[Note: There are numerous benefits to using S3 methods, including for example that calls to plot() and hist() dispatch the appropriately defined methods related to mcsim(). In fact a total of 20 slots are returned with a call to mcsim() and can be analysed directly in your R environment. The Help file covers the 20 values returned from a call to mcsim().]

Looking at the first example in the Help file we see the following call to mcsim() with replacement=FALSE, stored in ‘nrsim’ for ‘no-replacement simulation’:

nrsim = mcsim("longtrend", "longtrend", n=1000, replacement=FALSE, l=1, gap=10, CI=0.95)



Calling plot() on the mcsim assigned object ‘nrsim’ returns the below xts plot (which will differ slightly on each simulation):

plot(nrsim, normalize=TRUE)


The same plot() call with normalize=FALSE yields a plot that converges to the same final PL.

plot(nrsim, normalize=FALSE)


The reason the normalized plot does not converge to a single final number is purely a function of how mcsim() carries out normalization on the replicates in the cash return space. Since sampling without replacement means sequences are not perfectly independent, percent returns without replacement may be a useful middle ground between cash returns with replacement (which fan out) and cash returns without replacement (which converge). Below is an example of the same call, this time with replacement:

wrsim = mcsim("longtrend", "longtrend", n=1000, replacement=TRUE, l=1, gap=10, CI=0.95)
plot(wrsim, normalize=FALSE)




From the range of histogram plots mcsim() generates we can infer expectations of the distribution of values for the 5 methods: mean, median, stddev, maxDD and sharpe. These are computed at the frequency implied by the ‘use’ parameter, except for maxDD. If use=’equity’ for example then the frequency is daily ie. distribution of daily mean PL, median PL, stddev PL and daily quasi-sharpe (we say quasi as the formula is not a true implementation of the original Sharpe ratio). Using user-specified confidence intervals (and thanks to the Central Limit Theorem) it is possible to visualize the range of values that can be expected from the backtest for the sample statistics in question. In most instances the distribution will presumably be centered around the backtest. However, in the case of the max Drawdown the simulation mean and the backtest could differ more significantly. Let’s look at some output from the longtrend demo.

hist(rsim, normalize=FALSE, methods="mean")


hist(rsim, normalize=FALSE, methods="stddev")


hist(rsim, normalize=FALSE, methods="maxDD")


hist(rsim, normalize=FALSE, methods="sharpe")


It is evident from the above charts that the Monte Carlo simulations result in an asymptotically normal distribution of the sample statistics (thanks to the Central Limit Theorem). In all cases excluding maxDD the distribution is centered around the backtest statistic, with the range of values for a 95% confidence interval quite clear to view. In the case of maxDD, the backtest has a max Drawdown of 30k, whilst the simulation mean is closer to 40k. A strategy with any form of risk management should have a better max Drawdown than the average of the replicates, which are randomized re-ordered sequences of PL observations. In Tomasini & Jaekle (2009) the authors infer that the 99th percentile max Drawdown could be an expected outcome for your strategy with a 1% probability. Assuming your backtest employs a risk management strategy i would argue this probability is well below 1%. Nevertheless, viewing the results with this level of conservatism may be prudent especially considering the likely presence of many other biases not accounted for.
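To make the mechanics concrete, the sketch below builds some of the same sample statistics from a matrix of replicates. Everything here is randomly generated stand-in data (‘rep_pl’ is not mcsim() output), so only the shape of the computation is meaningful:

```r
# Illustrative only: rows = days, cols = replicates of daily PL
set.seed(1)
rep_pl <- matrix(rnorm(252*1000, mean = 10, sd = 100), ncol = 1000)
means <- colMeans(rep_pl)
maxDD <- apply(rep_pl, 2, function(pl){
  eq <- cumsum(pl)          # cumulative equity curve per replicate
  max(cummax(eq) - eq)      # deepest peak-to-trough drawdown
})
quantile(means, c(0.025, 0.975)) # 95% interval for mean daily PL
quantile(maxDD, 0.99)            # conservative drawdown a la Tomasini & Jaekle
```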


print() – a recent update thanks to Brian tidies up the output by rounding to 3 decimals. Note that the simulation below differs from the results in the charts above, as the update was made after this post was first published.

To interrogate the values from the charts more directly, a call to print() (which summarizes the output from the summary.mcsim() S3 method) will suffice. Below is an example with the subsequent output (note all sample stat outputs are returned):

print(rsim, normalize=FALSE)


We can now see the exact values for the sample statistic metrics.




Monte Carlo and mcsim() limitations and disadvantages of Portfolio PL simulations 

As with any model there are assumptions. In the case of Monte Carlo analysis using mcsim() one of the most significant assumptions is that returns follow a Gaussian (normal) distribution. Depending on the distribution of the sample statistic of interest, you may be able to propose a more suitable distribution. The exponential impact of ‘fat tail’ events also should not be underestimated, with one of the world’s earliest and most successful hedge funds succumbing to these limitations.

Another limitation of Monte Carlo analysis is the sample from which it simulates may be overfit or overly optimized for in-sample out-performance. It is the job of the analyst to manage this risk though, and hopefully with the recently added quantstrat functions degfreedom(), deflated.Sharpe() and haircut.Sharpe() the analyst will be more empowered to manage this risk. For a useful demonstration of their use in quantstrat see the macdParameters demo.

Sampling from Portfolio PL may also be unrealistic since no consideration is given to market regimes and strategy-specific stylized facts such as the individual positions’ time in and out of the market. Where this data is available, however, the analyst has another recently developed tool at their disposal, namely txnsim(). This will be the subject of my next post.

Thanks for reading.

Thanks must also go to my co-author, Brian Peterson, from whose work and commitment all of what i have written would be impossible. Brian’s invaluable insights related to this and so many other topics have benefited the industry to no end. Another legend worth a mention is Joshua Ulrich who never gave up on me (or never gave away that he had) w.r.t my supposed git ignorance. The R/Finance open source community is strong and i am honoured to be a part of it.

R-view: Backtesting – Harvey & Liu (2015)

In this post i take an R-view of “Backtesting – Harvey & Liu (2015).” The authors propose an alternative to the commonly practiced 50% discount that is applied to reported Sharpe ratios when evaluating backtests of trading strategies. The reason for the discount is due to the inevitable data mining inherent in research, both past and present. Harvey & Liu (HL) propose a multiple hypothesis testing framework with a more stringent requirement for determining a test’s significance level compared with single tests. HL suggest that the single hypothesis test used by most researchers leads to too many false discoveries (Type I errors). This makes intuitive sense if you consider that the probability of finding a significant factor increases with the number of factors tested or the number of tests conducted.

To give an example quoted in Harvey & Liu – Evaluating Trading Strategies (which you may have heard before), imagine you receive a mail from a manager promoting a particular stock. The mail says to track the manager’s recommendations over time. The recommendation is either to go long or short. The manager is correct 10 weeks in a row with his recommendations. The probability of such a feat you figure is very small, in fact a 0.09765% probability of being a false discovery (0.5^10 = 0.0009765625) based on single hypothesis testing. But what if the manager sent a similar mail to 100k people telling half to go long and the other half to go short the same stock. Every week the manager trims half the people off his mailing list, the half for whom the recommendation did not work. Every week for 10 weeks the list is trimmed. By the end of week 10 a total of 97 lucky people have been the recipients of these perfect picks. The results were random, and the recipients would have been fooled.
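The arithmetic behind this example is easy to check:

```r
# Chance of 10 correct weekly calls from a single (uninformed) manager
p_single <- 0.5^10 # 0.0009765625, i.e. ~0.098%
# Start with 100,000 recipients and halve the list each week for 10 weeks
survivors <- 100000*0.5^10 # ~97.7, so 97 people see a 'perfect' record
```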

In multiple hypothesis testing the challenge is to guard against false discoveries. HL argue that the appropriate “haircut Sharpe ratio” is non-linear, in that the highest Sharpe ratios (SR’s) are only moderately penalized whilst marginal SR’s more so. The implication is that high SR’s are more likely true discoveries in a multiple hypothesis testing framework.

If you would like to jump straight to my R implementation it is at the end of this post. The authors make their Matlab code for Exhibits 5 & 6 available here. I simply converted it to R and added some extra bits for illustrative purposes. [As of Aug ’17 these functions are available in quantstrat as haircut.Sharpe and profit.hurdle]

HL mention 5 caveats to their framework, namely;

  • Sharpe ratios may not be appropriate metrics for strategies with negatively skewed expected payoffs, such as option strategies.
  • Sharpe ratios normalize returns based on their volatility (ie. market risk), which may not be the most appropriate reflection of risk for a strategy.
  • Determining the appropriate significance level for multiple testing (where in single tests 5% is the normal cutoff).
  • Which multiple testing method you choose could yield different conclusions. HL proposes 3 methods together with an average.
  • Lastly, and i would say most sensitive of all, is the number of tests used.

Linking Sharpe ratio with t-statistic

To explain the link between the Sharpe ratio and the t-stat and the application of a multiple testing p-value adjustment, HL use the simplest case of an individual investment strategy. Assume a null hypothesis that the strategy’s mean return is zero, with a 2-sided alternative that it is significantly different from zero. A strategy can be regarded as profitable if its mean returns are either side of zero since investors can generally go long or short. Since returns will at least be asymptotically normally distributed (thanks to the Central Limit Theorem) a t-statistic following a t-distribution can be constructed and tested for significance. However, due to the link between the Sharpe ratio and the t-stat it is possible to assess the significance of a strategy’s excess returns directly using the Sharpe ratio. Assume \hat{\mu} denotes the mean of your sample of historical returns (daily or weekly etc) and \hat{\sigma} denotes standard deviation, then

t-statistic = \frac{ \hat{\mu}}{(\hat{\sigma}/\sqrt{T})}

where T-1 is degrees of freedom and since

\widehat{SR} =  \hat{\mu}/\hat{\sigma}

it can be shown that

\widehat{SR} = t-statistic/\sqrt{T}

By implication a higher Sharpe ratio equates to a higher t-ratio, implying a higher significance level (lower p-value) for an investment strategy. If we denote p-value of the single test as {p^s} then we can present the p-value of the single test as

{p^s} = Pr( |r| > t-ratio)


{p^s} = Pr( |r| > \widehat{SR}.\sqrt{T})

Now if the researcher was exploring a particular economic theory then this p-value might make sense, but what if the researcher has tested multiple strategies and presents only the most profitable one? In this case the p-value of the single test may severely overstate the actual significance. A more truthful p-value would be an adjusted multiple testing p-value which assuming we denote as {p^m} could be represented as

{p^m} = Pr( max{|{r_i}|, i = 1,...,N} > t-ratio)


{p^m} = 1 - (1 - {p^s}{)^N}

By equating the multiple test p-value to the p-value of an equivalent single test, we get the defining equation of the haircut Sharpe ratio \widehat{SR}^{haircut}, which is

{p^m} = Pr( |r| > \widehat{SR}^{haircut}.\sqrt{T})

Using the example in HL, assuming a strategy with 20 years of monthly returns (T=240), an annual Sharpe ratio (SR) of 0.75, you would have a p-value for the single test of

{p^s} = Pr( |r| > t-ratio)

{p^s} \approxeq 0.0008

To compute the single p-value and the multiple testing p-value assuming researchers have tried N = 200 strategies, you can use the below R code to compute the adjusted or “haircut” Sharpe ratio;

# Set the variables
N = 200 # Number of strategies tested
T = 240 # Number of observations or degrees of freedom
SR_ann = 0.75 # Annualised Sharpe ratio
# Compute the single test p-value
p_single = 2 * (1-pnorm(SR_ann*sqrt(T/12)))
# Compute the multiple test p-value
p_multiple = 1 - (1 - p_single) ^ N
# Compute haircut Sharpe ratio (HSR)
z_stat = qt(1 - p_multiple/2, T-1) # t quantile with T-1 degrees of freedom
HSR = (z_stat/sqrt(T)) * sqrt(12)
# Sharpe ratio haircut
haircut = (SR_ann - HSR)/SR_ann

The code should output an HSR of 0.325 (HL yields 0.32 due to rounding the single and multiple p-values) which implies a haircut to the original Sharpe ratio of 56.6% (HL report a 60% haircut). The example above is no longer appropriate where t-statistics are dependent. More on this later.

Multiple Multiple-Testing methods

HL mentions 3 well known adjustment methods in the statistics literature, which are originally prescribed in the paper “…and the Cross-Section of Expected Returns” by Harvey, Liu and Zhu. These are Bonferroni, Holm, and Benjamini, Hochberg, and Yekutieli (BHY).

Bonferroni’s adjustment:

{p_i^{Bonferroni}} = min[M * {p_i}, 1]

Bonferroni applies the same adjustment to the p-value of each test, inflating the p-value by the number of tests. The multiple testing p-value is the minimum of each inflated p-value and 1 where 1 (or 100% if you prefer) is the upper bound of probability. HL use the example of p-values from 6 strategies where the p-values are (0.005, 0.009, 0.0128, 0.0135, 0.045, 0.06). According to a 5% significance cutoff the first 5 tests would be considered significant. Using the p.adjust function in R we can get the multiple adjusted p-values and according to Bonferroni only the first test would be considered significant.

# Set the variables
p_values = c(0.005, 0.009, 0.0128, 0.0135, 0.045, 0.06)
# Assign to adj.p_values the output of p.adjust()
adj.p_values = p.adjust(p_values, "bonferroni")
# Print the adjusted p-values
print(adj.p_values)


Holm’s adjustment:

p-value adjustments can be categorized into 2 categories, namely: single-step and sequential. Single-step corrections equally adjust p-values as in Bonferroni. Sequential adjustments are an adaptive procedure based on the distribution of p-values. Sequential methods gained prominence after a seminal paper by Schweder & Spjotvoll (1982) and section 7.3 of this paper gives a useful example of an application of multiple testing hypothesis diagnostic plotting in R. Holm is an example of a sequential multiple testing procedure. For Holm, the equivalent adjusted p-value is

{p_{(i)}^{Holm}} = min[\max_{j \leq i}((M - j + 1)*{p_{(j)}}), 1]

Using the same 6 strategies illustrated above, we can use p.adjust() to compute the Holm adjusted p-values.

# Run p.adjust() for Holm
HOLM_adj.p_values = p.adjust(p_values, "holm")
print(HOLM_adj.p_values)


We see in the output above that the first 2 tests are significant at the 5% cutoff, compared to only 1 with Bonferroni. This is because Bonferroni inflates all p-values equally, whereas Holm applies a sequential approach. It should therefore not surprise you that adjusted Sharpe ratios under Bonferroni will be lower than under Holm. At this point it is useful to mention that both Holm and Bonferroni attempt to prevent even 1 Type I error occurring, controlling what is called the family-wise error rate (FWER). The next adjustment proposed by HL is BHY, and the main difference from the previous 2 adjustment methods is that BHY attempts to control the false discovery rate (FDR), implying more lenience than Holm and Bonferroni and therefore expected to yield higher adjusted Sharpe ratios.

BHY – (Benjamini, Hochberg, and Yekutieli):

BHY’s formulation of the FDR can be represented as follows. Firstly all p-values are sorted in descending order and the adjusted p-value sequence is defined by pairwise comparisons.

BHY - formulation.PNG

c(M) is defined as follows

c(M) = {\sum_{j=1}^M} \frac{1}{j}

making c(M) = 2.45 for the example of M=6 strategies (ie. sum of 1/1 + 1/2 + 1/3 +1/4 + 1/5 + 1/6). Using p.adjust() we can see that there are now 4 significant strategies at the 5% cutoff. The output from p.adjust below is slightly different to HL’s BHY implementation as p.adjust() does not equate {p_{(6)}^{BHY}} with the least significant p-value 0.06.

# Run p.adjust() for BHY
BHY_adj.p_values = p.adjust(sort(p_values, decreasing = TRUE), "BY")
print(BHY_adj.p_values)


So to summarise these methods, BHY leads to 4 significant discoveries versus Holm’s 2 and Bonferroni’s 1. We expect BHY to be more lenient as it controls the false discovery rate whereas Holm and Bonferroni control the family-wise error rate, trying to eliminate making even 1 false discovery. Bonferroni is more stringent than Holm since it is a single-step adjustment versus the sequential approach of Holm. With these 3 methods HL attempt to adjust p-values to account for multiple testing and then convert these to haircut Sharpe ratios and in so doing control for data mining. For both Holm and BHY you need the empirical distribution of p-values from previously tried strategies. Harvey, Liu and Zhu (HLZ) model over 300 risk factors documented in the finance literature. However, using this model for the distribution of p-values is not complete since many tried strategies would not have been documented (referred to as Publication Bias) plus they are potentially correlated thereby violating the requirement for independence between tests. HLZ propose a new distribution to overcome these shortfalls.
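Running all three corrections side by side on the same six p-values makes the summary concrete (base R’s p.adjust() covers each method):

```r
# Compare Bonferroni, Holm and BHY on HL's six example p-values
p_values <- c(0.005, 0.009, 0.0128, 0.0135, 0.045, 0.06)
adj <- sapply(c("bonferroni", "holm", "BY"), function(m) p.adjust(p_values, m))
colSums(adj < 0.05) # discoveries at 5%: 1 (Bonferroni), 2 (Holm), 4 (BHY)
```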

Harvey, Liu and Zhu (HLZ)

As previously mentioned HLZ study over 300 factors tested for explaining the cross section of return patterns. They publish a list of their sources here in spreadsheet format. Just looking at the history of published factors they show that factor discovery increases every decade.

History of Factor Discoveries - HLZ.PNG

Using the t-statistics published with those papers (assuming they are economically and statistically sound) HLZ perform the 3 multiple testing procedures described above (Bonferroni, Holm and BHY). They conclude under the assumption that if all tried factors are published then an appropriate minimum threshold t-statistic for 5% significance is 2.8, which equates to a p-value of 0.50% for single tests. Whilst the assumption of all tried factors being published is not reasonable, HLZ argue that the analysis does serve to provide a minimum threshold for accepting significance of future tests.

HLZ limit their factor sample to unique risk factors so as to keep test dependence to a minimum. This is a requirement for the 3 multiple testing procedures described above. But we know the requirements for being published are fairly stringent and most likely limited to tests that show significant t-statistics. HLZ estimate that 71% of tried tests are not published (see appendix B of HLZ for details), and based on this together with their implementation of the 3 adjusting procedures HLZ propose a benchmark t-statistic of 3.18. Intuitively this is greater than previously (2.8) when the assumption was all tried factors were published.
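The single-test p-values implied by these benchmark t-statistics are easy to verify (normal approximation):

```r
# p-values implied by HLZ's benchmark t-statistics (normal approximation)
p_28 <- 2*(1 - pnorm(2.8))   # ~0.0051, the 0.50% quoted for t = 2.8
p_318 <- 2*(1 - pnorm(3.18)) # ~0.0015 for the stricter t = 3.18
```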

What about test dependence ie. correlation among test statistics?

This will be the case where factors tested model similar forms of risk, and are therefore correlated. Think about the multitude of price multiple factors. Assuming in the extreme that all factors are directly correlated at 100% then no adjustment to the single test would be required as you would be penalizing Type I errors (false discoveries) too harshly at the risk of increasing the Type II error rate (missed discoveries). In multiple testing, the statistics literature attempts to account for correlation through various methods including resampling the entire dataset. In finance research bootstrap procedures are being used to assess significance of individual tests and infer fund manager skill versus chance. Indeed, this is what we are aiming to achieve with mcsim() and txnsim() in the R:blotter package. However, since HLZ do not have access to the datasets used for the factors published in the majority of cases, they instead infer a distribution using the t-statistics published. They suggest a mixed distribution, in which the null hypothesis that mean returns are zero is drawn from a normal distribution and the alternative hypothesis that mean returns are non-zero is drawn from an exponential distribution. This should make economic sense if one assumes more profitable strategies are less likely to exist in an economically scarce world. Using some statistical magic (Section 4 and Appendix A in HLZ which i hope to address more specifically in a future post) HLZ propose model parameter estimates as follows:

Model Parameter Estimates.PNG

 Using the baseline assumptions for the number of unobserved tests, one can use Harvey & Liu’s code in Exhibit 5 & 6 (replicated in R below) to determine the appropriate haircut Sharpe ratio (HSR) or the required mean return for a given significance level, both taking into account an assumed number of tests.


Data mining is unavoidable and in some respects represents knowledge gain. Accounting for it when measuring statistical significance is a requirement. Judgement will have to be exercised when considering significance levels and which multiple testing technique is most appropriate. Bonferroni and Holm are appropriate for mission-critical tests such as for medical research or space missions. Controlling for the error rate of a single event occurring is desirable. Where the application is not a matter of life or death controlling for the rate of false discoveries may be more desirable. Perhaps this applies to finance. Perhaps not. Either way the requirements for statistical significance of trading strategies over time are becoming more stringent. This is natural; most of the true factors would have been discovered and the cost of mining data has diminished and continues to do so at an impressive rate. In other sciences such as physics and medicine/genomics, tests require 5 standard deviations to be considered significant. The field of finance is headed in the same direction as more factors are published. Assuming the widely used Sharpe ratio is an applicable metric for your new strategy’s evaluation, you would be well served to consider its haircut equivalent.


R implementation of Exhibit 5 (Haircut_SR.m and sample_random_multests.m)

require(MASS) # for mvrnorm()
### Generate empirical p-value distributions based on Harvey, Liu and Zhu (2014) —— Harvey and Liu
### (2014): "Backtesting", Duke University
sample_random_multests <- function(rho, m_tot, p_0, lambda, M_simu){
### Parameter input from Harvey, Liu and Zhu (2014) ############
### Default: para_vec = [0.2, 1377, 4.4589*0.1, 5.5508*0.001, M_simu] ###########
# p_0: probability for a random factor to have a zero mean
# lambda: average of monthly mean returns for true strategies
# m_tot: total number of trials
# rho: average correlation among returns
# M_simu: number of rows (simulations)
sigma <- 0.15/sqrt(12) # assumed level of monthly vol
N <- 240 # number of time-series observations
sig_vec <- c(1, rho*rep(1, m_tot - 1))
SIGMA <- toeplitz(sig_vec)
MU <- rep(0, m_tot)
shock_mat <- mvrnorm(MU, SIGMA*(sigma^2/N), n = M_simu)
prob_vec <- t(replicate(M_simu, runif(m_tot, 0, 1))) # M_simu x m_tot, to match shock_mat
mean_vec <- t(replicate(M_simu, rexp(m_tot, rate = 1/lambda)))
m_indi <- prob_vec > p_0 # TRUE for true (non-zero mean) strategies
mu_nul <- as.numeric(m_indi)*mean_vec # zero mean under the null, exponential draw otherwise
tstat_mat <- abs(mu_nul + shock_mat)/(sigma/sqrt(N))
tstat_mat
}
### Sharpe ratio adjustment due to testing multiplicity —— Harvey and Liu
### (2014): "Backtesting", Duke University
Haircut_SR <- function(sm_fre, num_obs, SR, ind_an, ind_aut, rho, num_test, RHO){
####### Parameter inputs ######
### 'sm_fre': Sampling frequency; [1,2,3,4,5] = [Daily, Weekly, Monthly, Quarterly, Annual];
### 'num_obs': No. of observations in the frequency specified in the previous step;
### 'SR': Sharpe ratio; either annualized or in the frequency specified in the previous step;
### 'ind_an': Indicator; if annualized, 'ind_an' = 1; otherwise = 0;
### 'ind_aut': Indicator; if adjusted for autocorrelations, 'ind_aut' = 0; otherwise = 1;
### 'rho': Autocorrelation coefficient at the specified frequency ;
### 'num_test': Number of tests allowed, Harvey, Liu and Zhu (2014) find 315 factors;
### 'RHO': Average correlation among contemporaneous strategy returns.
### Calculating the equivalent annualized Sharpe ratio 'sr_annual', after
### taking autocorrelation into account
if(sm_fre == 1){
fre_out <- 'Daily'
} else if(sm_fre == 2){
fre_out <- 'Weekly'
} else if(sm_fre == 3){
fre_out <- 'Monthly'
} else if(sm_fre == 4){
fre_out <- 'Quarterly'
} else {
fre_out <- 'Annual'
}
if(ind_an == 1){
sr_out <- 'Yes'
} else {
sr_out <- 'No'
}
if(ind_an == 1 & ind_aut == 0){
sr_annual <- SR
} else if(ind_an == 1 & ind_aut == 1){
if(sm_fre == 1){
sr_annual <- SR*(1 + (2*rho/(1-rho))*(1 - ((1-rho^(360))/(360*(1-rho)))))^(-0.5)
} else if(sm_fre == 2){
sr_annual <- SR*(1 + (2*rho/(1-rho))*(1 - ((1-rho^(52))/(52*(1-rho)))))^(-0.5)
} else if(sm_fre == 3){
sr_annual <- SR*(1 + (2*rho/(1-rho))*(1 - ((1-rho^(12))/(12*(1-rho)))))^(-0.5)
} else if(sm_fre == 4){
sr_annual <- SR*(1 + (2*rho/(1-rho))*(1 - ((1-rho^(4))/(4*(1-rho)))))^(-0.5)
} else if(sm_fre == 5){
sr_annual <- SR
}
} else if(ind_an == 0 & ind_aut == 0){
if(sm_fre == 1){
sr_annual <- SR*sqrt(360)
} else if(sm_fre == 2){
sr_annual <- SR*sqrt(52)
} else if(sm_fre == 3){
sr_annual <- SR*sqrt(12)
} else if(sm_fre == 4){
sr_annual <- SR*sqrt(4)
} else if(sm_fre == 5){
sr_annual <- SR
}
} else if(ind_an == 0 & ind_aut == 1){
if(sm_fre == 1){
sr_annual <- sqrt(360)*SR*(1 + (2*rho/(1-rho))*(1 - ((1-rho^(360))/(360*(1-rho)))))^(-0.5)
} else if(sm_fre == 2){
sr_annual <- sqrt(52)*SR*(1 + (2*rho/(1-rho))*(1 - ((1-rho^(52))/(52*(1-rho)))))^(-0.5)
} else if(sm_fre == 3){
sr_annual <- sqrt(12)*SR*(1 + (2*rho/(1-rho))*(1 - ((1-rho^(12))/(12*(1-rho)))))^(-0.5)
} else if(sm_fre == 4){
sr_annual <- sqrt(4)*SR*(1 + (2*rho/(1-rho))*(1 - ((1-rho^(4))/(4*(1-rho)))))^(-0.5)
} else if(sm_fre == 5){
sr_annual <- SR
}
}
### Number of monthly observations 'N' ###
if(sm_fre == 1){
N <- floor(num_obs*12/360)
} else if(sm_fre == 2){
N <- floor(num_obs*12/52)
} else if(sm_fre == 3){
N <- floor(num_obs*12/12)
} else if(sm_fre == 4){
N <- floor(num_obs*12/4)
} else if(sm_fre == 5){
N <- floor(num_obs*12/1)
}
### Number of tests allowed ###
M <- num_test
########### Intermediate outputs ##########
print(paste('Frequency =', fre_out))
print(paste('Number of Observations = ', num_obs))
print(paste('Initial Sharpe Ratio = ', SR))
print(paste('Sharpe Ratio Annualized = ', sr_out))
print(paste('Autocorrelation = ', rho))
print(paste('A/C Corrected Annualized Sharpe Ratio = ', sr_annual))
print(paste('Assumed Number of Tests = ', M))
print(paste('Assumed Average Correlation = ', RHO))
########## Sharpe ratio adjustment #########
m_vec <- 1:(M+1);
c_const <- sum(1./m_vec);
########## Input for Holm & BHY ##########
### Parameter input from Harvey, Liu and Zhu (2014) %%%%%%%
para0 <- matrix(c(0, 1295, 3.9660*0.1, 5.4995*0.001,
0.2, 1377, 4.4589*0.1, 5.5508*0.001,
0.4, 1476, 4.8604*0.1, 5.5413*0.001,
0.6, 1773, 5.9902*0.1, 5.5512*0.001,
0.8, 3109, 8.3901*0.1, 5.5956*0.001),
nrow = 5, ncol = 4, byrow = TRUE)
### Interpolated parameter values based on user specified level of correlation RHO %%%%%%%%%%
if (RHO >= 0 & RHO < 0.2){
para_inter <- ((0.2 - RHO)/0.2)*para0[1,] + ((RHO - 0)/0.2)*para0[2,]
} else if (RHO >= 0.2 & RHO < 0.4) {
para_inter <- ((0.4 - RHO)/0.2)*para0[2,] + ((RHO - 0.2)/0.2)*para0[3,]
} else if (RHO >= 0.4 & RHO < 0.6){
para_inter <- ((0.6 - RHO)/0.2)*para0[3,] + ((RHO - 0.4)/0.2)*para0[4,]
} else if (RHO >= 0.6 & RHO < 0.8){
para_inter <- ((0.8 - RHO)/0.2)*para0[4,] + ((RHO - 0.6)/0.2)*para0[5,]
} else if (RHO >= 0.8 & RHO < 1.0){
para_inter <- ((0.8 - RHO)/0.2)*para0[4,] + ((RHO - 0.6)/0.2)*para0[5,]
} else {
### Default: para_vec = [0.2, 1377, 4.4589*0.1, 5.5508*0.001,M_simu]
para_inter <- para0[2,] ### Set at the preferred level if RHO is misspecified
}
WW <- 2000 ### Number of repetitions
### Generate a panel of t-ratios (WW*Nsim_tests) ###
Nsim_tests <- (floor(M/para_inter[2]) + 1)*floor(para_inter[2]+1) # make sure Nsim_tests >= M
t_sample <- sample_random_multests(para_inter[1], Nsim_tests, para_inter[3], para_inter[4], WW)
# Sharpe ratio, monthly
sr <- sr_annual/sqrt(12)
T <- sr*sqrt(N) # t-ratio of the strategy over N monthly observations
p_val <- 2*(1 - pt(T, N - 1))
# Drawing observations from the underlying p-value distribution; simulate a
# large number (WW) of p-value samples
p_holm <- rep(1, WW)
p_bhy <- rep(1, WW)
for(ww in 1:WW){
yy <- t_sample[ww, 1:M]
t_value <- t(yy)
p_val_sub <- 2*(1 - pnorm(t_value, 0, 1))
### Holm
p_val_all <- append(t(p_val_sub), p_val)
p_val_order <- sort(p_val_all)
p_holm_vec <- NULL
for(i in 1:(M+1)){
p_new <- NULL
for(j in 1:i){
p_new <- append(p_new, (M+1-j+1)*p_val_order[j])
}
p_holm_vec <- append(p_holm_vec, min(max(p_new), 1))
}
p_sub_holm <- p_holm_vec[p_val_order == p_val]
p_holm[ww] <- p_sub_holm[1]
### BHY
p_bhy_vec <- NULL
for(i in 1:(M+1)){
kk <- (M+1) - (i-1)
if(kk == (M+1)){
p_new <- p_val_order[M+1]
} else {
p_new <- min((M+1)*(c_const/kk)*p_val_order[kk], p_0)
}
p_bhy_vec <- append(p_new, p_bhy_vec)
p_0 <- p_new
}
p_sub_bhy <- p_bhy_vec[p_val_order == p_val]
p_bhy[ww] <- p_sub_bhy[1]
}
### Bonferroni ###
p_BON <- min(M*p_val, 1)
### Holm ###
p_HOL <- median(p_holm)
### BHY ###
p_BHY <- median(p_bhy)
### Average ###
p_avg <- (p_BON + p_HOL + p_BHY)/3
# Invert to get z-score
z_BON <- qt(1 - p_BON/2, N - 1)
z_HOL <- qt(1 - p_HOL/2, N - 1)
z_BHY <- qt(1 - p_BHY/2, N - 1)
z_avg <- qt(1 - p_avg/2, N - 1)
# Annualized Sharpe ratio
sr_BON <- (z_BON/sqrt(N)) * sqrt(12)
sr_HOL <- (z_HOL/sqrt(N)) * sqrt(12)
sr_BHY <- (z_BHY/sqrt(N)) * sqrt(12)
sr_avg <- (z_avg/sqrt(N)) * sqrt(12)
# Calculate haircut
hc_BON <- (sr_annual - sr_BON)/sr_annual
hc_HOL <- (sr_annual - sr_HOL)/sr_annual
hc_BHY <- (sr_annual - sr_BHY)/sr_annual
hc_avg <- (sr_annual - sr_avg)/sr_annual
######### Final Output ###########
print("Bonferroni Adjustment:")
print(paste("Adjusted P-value =", p_BON))
print(paste("Haircut Sharpe Ratio =", sr_BON))
print(paste("Percentage Haircut =", hc_BON))
print("Holm Adjustment:")
print(paste("Adjusted P-value =", p_HOL))
print(paste("Haircut Sharpe Ratio =", sr_HOL))
print(paste("Percentage Haircut =", hc_HOL))
print("BHY Adjustment:")
print(paste("Adjusted P-value =", p_BHY))
print(paste("Haircut Sharpe Ratio =", sr_BHY))
print(paste("Percentage Haircut =", hc_BHY))
print("Average Adjustment:")
print(paste("Adjusted P-value =", p_avg))
print(paste("Haircut Sharpe Ratio =", sr_avg))
print(paste("Percentage Haircut =", hc_avg))
}

R implementation of Exhibit 6 (Profit_Hurdle.m)

Please note you will need the sample_random_multests() function from above to run the Profit_Hurdle R code.

### Required returns due to testing multiplicity —— Harvey and Liu
### (2014): "Backtesting", Duke University
Profit_Hurdle <- function (num_tests, num_obs, alpha_sig, vol_annual, RHO){
####### Parameter inputs ######
### 'num_tests': No. of tests one allows for in multiple tests;
### 'num_obs': No. of monthly observations for a strategy;
### 'alpha_sig': Significance level (e.g., 5%);
### 'vol_annual': Annual return volatility (e.g., 0.05 or 5%);
### 'RHO': Average correlation among contemporaneous strategy returns.
NN <- num_tests
Obs <- num_obs
alpha0 <- alpha_sig
vol_anu <- vol_annual
###Independent test ####
#B_ind = norminv( (1- alpha0/2),0,1);
B_ind <- qnorm( (1 - alpha0/2),0,1)
###Bonferroni ####
p0_mat <- alpha0/NN
t0_mat <- qnorm( (1 - p0_mat/2),0,1)
BF <- t0_mat
###Input for Holm and BHY ####
###Parameter input from Harvey, Liu and Zhu (2014) #######
para0 <- matrix(c(0, 1295, 3.9660*0.1, 5.4995*0.001,
0.2, 1377, 4.4589*0.1, 5.5508*0.001,
0.4, 1476, 4.8604*0.1, 5.5413*0.001,
0.6, 1773, 5.9902*0.1, 5.5512*0.001,
0.8, 3109, 8.3901*0.1, 5.5956*0.001),
nrow = 5, ncol = 4, byrow = TRUE)
### Interpolated parameter values based on user specified level of correlation RHO ###
if (RHO >= 0 & RHO < 0.2){
para_inter <- ((0.2 - RHO)/0.2)*para0[1,] + ((RHO - 0)/0.2)*para0[2,]
} else if (RHO >= 0.2 & RHO < 0.4) {
para_inter <- ((0.4 - RHO)/0.2)*para0[2,] + ((RHO - 0.2)/0.2)*para0[3,]
} else if (RHO >= 0.4 & RHO < 0.6){
para_inter <- ((0.6 - RHO)/0.2)*para0[3,] + ((RHO - 0.4)/0.2)*para0[4,]
} else if (RHO >= 0.6 & RHO < 0.8){
para_inter <- ((0.8 - RHO)/0.2)*para0[4,] + ((RHO - 0.6)/0.2)*para0[5,]
} else if (RHO >= 0.8 & RHO < 1.0){
para_inter <- ((0.8 - RHO)/0.2)*para0[4,] + ((RHO - 0.6)/0.2)*para0[5,]
} else {
### Default: para_vec = [0.2, 1377, 4.4589*0.1, 5.5508*0.001, M_simu]
para_inter <- para0[2,] ### Set at the preferred level if RHO is misspecified
}
WW <- 2000 ### Number of repetitions
### Generate a panel of t-ratios (WW*Nsim_tests) ###
Nsim_tests <- (floor(NN/para_inter[2]) + 1)*floor(para_inter[2]+1); # make sure Nsim_test >= num_tests
t_sample <- sample_random_multests(para_inter[1], Nsim_tests, para_inter[3], para_inter[4], WW)
### Holm #####
HL_mat <- NULL
for(ww in 1:WW){
yy <- t_sample[ww, 1:NN] ### Use the ww'th row of t-sample ###
p_sub <- 2*(1 - pnorm(yy))
p_new <- sort(p_sub)
KK <- length(p_new)
comp_vec <- NULL
for(kk in 1:KK){
comp_vec[kk] <- alpha0/(KK + 1 - kk)
}
comp_res <- p_new > comp_vec
comp_new <- cumsum(as.numeric(comp_res))
if(sum(comp_new) == 0){
HL <- 1.96
} else {
p0 <- p_new[comp_new == 1]
HL <- qnorm((1 - p0[1]/2),0,1)
}
HL_mat <- append(HL_mat, HL)
}
### BHY ####
BHY_mat <- NULL
for(ww in 1:WW){
yy <- t_sample[ww, 1:NN] ### Use the ww'th row of t-sample ###
p_sub <- 2*(1 - pnorm(yy))
if(length(p_sub) <= 1){
BH00 <- 1.96
} else {
p_new11 <- sort(p_sub, decreasing = TRUE)
KK <- length(p_new11)
comp_vec0 <- NULL
cons_vec <- 1:KK
cons_norm <- sum(1/cons_vec)
for(kk in 1:KK){
comp_vec0[kk] <- (alpha0*kk)/(KK*cons_norm)
}
comp_vec <- sort(comp_vec0, decreasing = TRUE)
comp_res11 <- as.numeric(p_new11 <= comp_vec)
if(sum(comp_res11) == 0){
BH00 <- 1.96;
} else {
p0 <- p_new11[comp_res11 == 1]
b0 <- which(abs(p_new11 - p0[1]) == min(abs(p_new11 - p0[1])))
if(b0 == 1){
p1 <- p0[1]
} else {
p1 <- p_new11[(b0 - 1)]
}
BH00 <- qnorm((1 - (p0[1]+p1)/4),0,1)
}
}
BHY_mat <- append(BHY_mat, BH00)
}
tcut_vec <- c(B_ind, BF, median(HL_mat), median(BHY_mat))
ret_hur <- ((vol_anu/sqrt(12))/sqrt(Obs))*tcut_vec
print(paste('Significance Level = ',alpha0*100))
print(paste('Number of Observations = ', num_obs))
print(paste('Annualized Return Volatility = ', vol_anu*100))
print(paste('Assumed Number of Tests = ', NN))
print(paste('Assumed Average Correlation = ', RHO))
print('Minimum Average Monthly Return:')
print(paste('Independent = ', ret_hur[1]*100))
print(paste('Bonferroni = ', ret_hur[2]*100))
print(paste('Holm = ', ret_hur[3]*100))
print(paste('BHY = ', ret_hur[4]*100))
print(paste('Average for Multiple Tests = ', mean(ret_hur[2:4])*100))
}
Profit_Hurdle(300, 240, 0.05, 0.1, 0.4)
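As a standalone sanity check (not part of the original code), the independent-test hurdle printed by the example call above can be reproduced by hand: it is just the two-sided normal cutoff scaled by monthly volatility over the square root of the number of observations.

```r
# Standalone sanity check; inputs mirror the example call above.
vol_annual <- 0.1   # 10% annual return volatility
num_obs    <- 240   # 20 years of monthly observations
alpha0     <- 0.05  # 5% significance level
t_cut  <- qnorm(1 - alpha0/2)                        # ~1.96 for a single test
hurdle <- ((vol_annual/sqrt(12))/sqrt(num_obs)) * t_cut
round(hurdle*100, 3)  # minimum average monthly return, in percent
```

This matches ret_hur[1] inside Profit_Hurdle(); the Bonferroni, Holm and BHY hurdles differ only in the t-ratio cutoff used.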

Quantitative Strategy Development Overview – Brian Peterson

I have had the pleasure of getting to know and work with Brian Peterson of late, building out the blotter::mcsim function in the blotter package. I will be writing about this function soon and where it is headed, but in this post i wanted to share a presentation Brian gave to the CapeR User Group last week on Developing and Backtesting Systematic Trading Strategies, with an application in R and in particular the quantstrat and blotter packages. Enjoy!


Link to the slides here!

Block Bootstrapped Monte Carlo – in R


A few weeks back i wrote a post including the source code for a Monte Carlo simulation function in R. The idea was to randomly sample daily returns produced by a backtest and build confidence intervals covering the middle 50% and 90% of sampled returns. Since then Brian Peterson got in touch with me, asking if i would work with him on getting some form of Monte Carlo simulation functionality standardized within the R blotter package. The blotter package forms part of the TradeAnalytics project on R-Forge (which includes “quantstrat”), where the project description reads…

“Transaction-oriented infrastructure for defining instruments, transactions, portfolios and accounts for trading systems and strategy simulation. Provides portfolio support for multi-asset class and multi-currency portfolios. Still in heavy development.”

Whilst working with Brian he suggested a few enhancements to the function from the original post, one of which was to include the option to perform a block bootstrap of the original backtest in order to capture any autocorrelation effects. This leaves two options as far as existing R packages go, namely the meboot and boot packages.
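As a minimal sketch of the idea (using a made-up return vector, not the simSample data used elsewhere in this post), a fixed-block bootstrap simply resamples block start points with replacement and keeps l consecutive observations together, which is what preserves short-run autocorrelation:

```r
# Minimal fixed-block bootstrap sketch; `ret` is a placeholder return
# vector, not the simSample backtest data.
set.seed(42)
ret <- rnorm(100)                               # stand-in daily returns
l   <- 5                                        # block length
n   <- length(ret)
# Draw enough block start points (with replacement) to cover the series
starts <- sample(1:(n - l + 1), ceiling(n/l), replace = TRUE)
# Expand each start into l consecutive indices, then trim to length n
idx <- unlist(lapply(starts, function(s) s:(s + l - 1)))[1:n]
boot_ret <- ret[idx]                            # one block-resampled path
```

Repeating the last three lines R times gives the full sample set; the mcblocksim() function below does exactly this, just vectorized over all replicates.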



Now meboot is a really neat package which i found out about thanks to this post by James Picerno over at The Capital Spectator. It implements a maximum entropy bootstrap, which eliminates the requirement for stationarity in a time series. Below is an example of what a meboot plot looks like using the same simSample file i used in the previous post, with 1,000 samples (or replicates, as the meboot package refers to them).


Looking at the above chart I am not sure it is the most appropriate tool for the job, since the sampling is confined to a very narrow range, producing large swings that tend to revert back to the original backtest. To make sure, though, i also ran the meboot() function over the backtest results generated by the “longtrend” demo strategy script in the blotter package. Again, the samples track the original backtest quite closely.


Whether or not meboot is the best tool for the job will ultimately come down to what the sample statistics say and which metrics you use to determine whether or not to proceed with a strategy into production. Either way it may be a useful option to include in any standardized Monte Carlo simulation functionality within blotter.



The boot package includes functions and datasets for bootstrapping from “Bootstrap Methods and Their Application” by A. C. Davison and D. V. Hinkley (1997, CUP). Of particular interest for time series bootstrapping is the tsboot() function, described in the package documentation on CRAN as follows: “Generate R bootstrap replicates of a statistic applied to a time series. The replicate time series can be generated using fixed or random block lengths or can be model based replicates.”

Now the tsboot object returned by the function does not include the samples themselves; instead you have to call the boot.array() function to get the sample indexing. Off the bat, though, i wasn’t sure the tsboot function was appropriate for simply block sampling a time series, although it very elegantly and efficiently achieves just that on top of calculating a statistic for each sampled series. So i spent some time debugging the function to learn how it performs the block bootstrapping, and managed to come up with my own function, which produces the below graph after sampling simSample 1,000 times (R = 1000) using a block length of 5 (l = 5) [full source code below the graph].
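For reference, here is roughly how tsboot() and boot.array() fit together. The data and statistic (mean) are arbitrary placeholders; with sim = "fixed" and block length l, boot.array() recovers the matrix of sampled time indices for each replicate:

```r
# Sketch of tsboot() with fixed-length blocks; data and statistic are
# placeholders, not this post's simSample backtest.
library(boot)
set.seed(1)
x  <- rnorm(100)                                  # stand-in return series
bt <- tsboot(x, statistic = mean, R = 10, l = 5, sim = "fixed")
ia <- boot.array(bt)   # 10 x 100 matrix of sampled time indices
dim(ia)
```

Each row of the index matrix can then be used to subset the original series, which is essentially what the "inds" matrix in mcblocksim() below reproduces by hand.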


# Record script start time ----------------------------------------
t1 <- Sys.time()
# Load required packages ------------------------------------------
library(xts) # xts objects; loads zoo for plot.zoo()
library(TTR) # ROC() for returns
# Build the function ----------------------------------------------
mcblocksim <- function(R, l){
# Read price data and build xts object
data <- read.csv("yourdirectory/simSample.csv", header = TRUE, stringsAsFactors=F)
s1.dates <- as.POSIXct(data[,2], format="%d-%m-%y") #beware the formatting may need adjusting for your excel settings
s1 <- xts(data[,3], s1.dates)
# Declare and assign values to other variables
n <- length(s1) #length of backtest results
#l <- 5 #block length for sampling (user-defined)
n.sim <- n
#R <- 5000 #number of samples required (user-defined)
endpt <- n - l + 1 #last valid block start point
nn <- ceiling(n.sim/l) #number of blocks
lens <- c(rep(l, nn - 1), 1 + (n.sim - 1)%%l) #lengths of each block
st <- matrix(sample(1:endpt, nn * R, replace = TRUE), R) #start points of each block
i.a <- list(starts = st, lengths = lens) #list of start points and lengths of each block
# Get sample indices - store in "inds" matrix
block <- NULL
indsim <- NULL
inds <- NULL
for(r in 1:nrow(i.a$starts)){
for(c in 1:ncol(i.a$starts)){
block <- seq(i.a$starts[r, c], i.a$starts[r, c] + i.a$lengths[c] - 1, length.out = i.a$lengths[c])
indsim <- append(indsim, block)
}
inds <- cbind(inds, indsim)
indsim <- NULL
}
# Declare and assign values to other variables
k <- NULL
tsbootARR <- NULL
tsbootxts <- NULL
tmp <- NULL
ret <- data.frame(ROC(s1[,1]))
ret[] <- 0
# Subset ret dataframe with inds and build "tsbootARR" using apply with cumsum()
for(k in 1:ncol(inds)){
tmp <- cbind(tmp, ret[inds[,k],])
}
tsbootARR <- apply(tmp, 2, function(x) cumsum(x))
which( #double-check no NAs remain in ret after the ROC() call above
# Build xts object for use in plot.zoo
tsbootxts <- xts(tsbootARR, s1.dates)
p <- plot.zoo(tsbootxts, main=paste(R,"sims of simSample", sep = " "), plot.type = "single", col = "lightgray", ylab = "Cumulative Return Distributions")
p <- lines(cumsum(ROC(s1)[-1,]), col = "red")
}
# End of function -------------------------------------------------
# Run function ----------------------------------------------------
mcblocksim(1000, 5)
# Record and print time to run script -----------------------------
t2 <- Sys.time()
print(t2 - t1)


Your choice of block length (l) can be anything (as long as it's less than the number of periods in your backtest), and how long you make your blocks could be based on a number of factors, including the average length of time your strategy is in the market or the average period over which your strategy experiences autocorrelation. Remember, the whole purpose of adding block bootstrap functionality is to account for autocorrelation characteristics in your backtest.

So, whilst the above mcblocksim() function is likely an improvement on the mcsimr() function i wrote about in my previous post, there is still quite a lot of work to do to get it ready for testing and committing to the blotter package. Particular to-do items include:

  • Revisiting the tsboot() function and determining whether or not some form of statistic can help users identify the optimal number of samples required
  • Potentially using the drawdowns() function in the PerformanceAnalytics package as a statistic in the tsboot() function
  • Applying confidence intervals to the samples, thereby eliminating outlying returns (as in mcsimr())
  • Adding any other strategy-specific noise to the samples

So lots to do but as they say, Rome wasn’t built in a day. Hopefully you find some use from the above source code in the mean time.

Thanks to Brian Peterson for his valuable input thus far.


Thanks for reading.