Monte Carlo Simulation for your Portfolio PL

‘Once you have a system, the biggest obstacle is trusting it.’
Tom Wills

‘What is the last thing you do before you climb on a ladder? You shake it. And that
is Monte Carlo simulation.’
Sam Savage, Stanford University



In my early days of looking at trading strategies, getting to the equity curve felt like the final piece of the puzzle. The piece that would provide enough perspective so that you could make the decision whether or not to proceed to production. I was always more comfortable with the scientific approach than my gut feel. Perhaps this was a function of my limited time in the market. After collaborating with Brian Peterson and studying the many great articles written by he and so many others in the community it became clear the equity curve was only the first piece of the puzzle.

Questions one should ask subsequent to generating a complete backtest are:

  • Is the result realistic?
  • Could we have overfit?
  • What range of values can we expect out of sample?
  • How much confidence can we have in the model?

A backtest generally represents limited information. It is merely one run of a particular strategy whose out-of-sample performance could be ambiguous. Sampling from this limited information however may allow us to get closer to the true process, thereby adding statistical significance to any inferences made from the backtest. mcsim() gives us the potential to do just that, simulating portfolio PL from either post-trade or backtested transactions, as well as providing multiple options for tweaking the sampling procedure according to strategy characteristics.

Whilst mcsim() has been alive and well in blotter for a while now (blotter is the open source transaction infrastructure package in R for defining instruments, accounts and portfolios for trading systems and simulations) i am yet to write a post about it. My prior posts here and here were a prelude to the superior functionality now available with mcsim().



Included in the documentation for the mcsim() function is an example of its use with the longtrend demo. I will refer to the mcsim() output of that strategy for this post. mcsim() comes with 10 parameters which i describe next.

  • Portfolio, Account

These are objects stored in your blotter environment. If you are using post-trade cash PL in your simulation then these parameters can be left as their default NULL values. If using the longtrend demo for example you can use “longtrend” for both.

The Account object is used solely for retrieving your initial equity amount specified at the start of your strategy. The Portfolio object is used for retrieving dailyEqPL or dailyTxnPL (both existing blotter functions) depending on the ‘use’ parameter, which we cover a few parameters down.

  • n

‘n’ is used to specify the number of simulations you would like to run on your backtest. The longtrend help file uses 1k simulations. This takes me ~15secs on my Intel i5-4460. Disclaimer: no optimisation or refactoring was carried out on mcsim()…this is a TODO item. In my experience 1k simulations almost certainly achieves an asymptotically normal distribution with an acceptable level of statistical significance. For investigating the general shape, however, 100 simulations should suffice.

  • replacement

Any sampling procedure can generally be run with or without replacement. Sampling without replacement means results will have the same mean and final PL but different paths. Sampling with replacement results in a wider distribution of the various paths the strategy could have taken, as well as results which vary final PL and the other sample statistics (maxDD, stddev, quasi-Sharpe). This differentiaion will be clear when running mcsim() for a given strategy with and without replacement, and calling the plot and hist functions on the mcsim object the results were assigned to. See the examples in the help docs, which we cover a little further down.

  • use

The ‘use’ parameter lets you specify whether or not you would like to use daily-marked portfolio PL, transaction PL (think intraday strategies) or cash PL if using an external xts object of post-trade returns. The method used in Tomasini & Jaekle (2009), Chapter 4 resembles that of sampling with use=’txns’.

  • l

The ‘l’ parameter (lower case L) stands for ‘length’ and allows you to sample blocks of PL observations. Assuming your strategy exhibits some form of autocorrelation it may make sense to incorporate a block length in your sampling procedure based on this observed autocorrelation. Alternatively some inverse multiple of your average holding period could also justify a value for block length ‘l’.

  • varblock

‘varblock’ is a boolean, which allows you to sample with a variably sized block length. If replacement=TRUE, the sampling is based on the tsboot() function in the boot package which you will note is a dependency for running mcsim(). The tsboot() function was the perfect solution for sampling time series, but it only allows for sampling with replacement which should give a sense of what the general preference is from a statistical perspective. When ‘varblock’ is set to TRUE together with ‘replacement’ the resulting block lengths are sampled from a geometric distribution with mean ‘l’. If replacement=FALSE the distribution of block lengths is uniformly random around ‘l’.

  • gap

For strategies which require leading market data before computing indicators on which to base signals and subsequent rules, users can use the ‘gap’ parameter to specify how many periods from the strategy initialization date to start the simulations. For longtrend we use a gap of 10 since it uses an SMA lookback period of 10 (monthly) observations.

  • CI

CI or Confidence Interval is the parameter used to specify the confidence level to display in the hist() function which charts a range of sample statistics for the strategy. Default is 95%.

  • cashPL

This parameter is used when running mcsim() on post-trade cash PL, and refers to a separate xts object created by the user.



There are currently 4x S3 methods related to mcsim(), all of which return some portion of the ‘mcsim’ object as either a plot, hist or table (summary and quantile). For each method you can choose to set normalize=TRUE or FALSE. This will simply define the return space your simulations operate in…either cash or percent. Depending on the strategy duration and your reinvestment assumptions, there could be pros and cons to either approach. The distributions should be fairly similar in any event. In addition to ‘normalize’ there is also a ‘methods’ parameter for hist() which lets you specify the sample statistics to return in histogram plots. There are currently 5 stats, namely: ‘mean’, ‘median’, ‘stddev’, ‘maxDD’ and ‘sharpe’. The default is to return all methods as 5 separate histogram plots. This may be inappropriate when setting replacement=FALSE since all sample statistics barring maxDD will be identical (in the case of cash returns).

[Note: There are numerous benefits to using S3 methods, including for example that calls to plot() and hist() dispatch the appropriately defined methods related to mcsim(). In fact a total of 20 slots are returned with a call to mcsim() and can be analysed directly in your R environment. The Help file covers the 20 values returned from a call to mcsim().]

Looking at the first example in the Help file we see the following assigned call to mcsim where replacement is FALSE: and stored in ‘nrsim’ for ‘no-replacement simulation.’

nrsim = mcsim("longtrend", "longtrend", n=1000, replacement=FALSE, l=1, gap=10, CI=0.95)



Calling plot() on the mcsim assigned object ‘nrsim’ returns the below xts plot (which will differ slightly on each simulation):

plot(nrsim, normalize=TRUE)


The same plot() call with normalize=FALSE yields a plot that converges to the same final PL.

plot(nrsim, normalize=FALSE)


The reason the plot with normalization does not converge to the same final number is purely a function of how mcsim() normalization is carried out on the replicates in the cash return space. Since sampling without replacement means sequences are not perfectly independent, using percent returns for without replacement may be a useful medium between cash returns with and without replacement which fan out and converge respectively. Below is an example of the same call, however this time with replacement;

wrsim = mcsim("longtrend", "longtrend", n=1000, replacement=TRUE, l=1, gap=10, CI=0.95)
plot(wrsim, normalize=FALSE)




From the range of histogram plots mcsim() generates we can infer expectations of the distribution of values for the 5 methods: mean, median, stddev, maxDD and sharpe. These are based on the frequency of the ‘use’ parameter except for maxDD. If use=’equity’ for example then the frequency is daily ie. distribution of daily mean PL, median PL, stddev PL and daily quasi-sharpe (we use quasi as the formula is not a true implementation of the original Sharpe ratio). Using user-specified confidence intervals (and thanks to the Central Limit theorem) it is possible to visualize the range of values that can be expected from the backtest for the sample statistics in question. In presumably most instances the distribution will be centered around the backtest. However, in the case of the max Drawdown the simulation mean and the backtest could be more significantly different. Let’s look at some output from the longtrend demo.

hist(rsim, normalize=FALSE, methods="mean")


hist(rsim, normalize=FALSE, methods="stddev")


hist(rsim, normalize=FALSE, methods="maxDD")


hist(rsim, normalize=FALSE, methods="sharpe")


It is evident from the above charts that the Monte Carlo simulations result in an asymptotic normal distribution of the sample statistics (thanks to the Central Limit Theorem). In all cases excluding maxDD the distribution is centered around the mean, with the range of values for a 95% confidence interval quite clear to view. In the case of maxDD, the backtest has a max Drawdown of 30k, whilst the mean is closer to 40k. A strategy with any form of risk management should have a better max Drawdown than the average of the replicates, which are randomized re-ordered sequences of PL observations. In Tomasini & Jaekle (2009) the authors infer that the 99th percentile max Drawdown could be an expected outcome for your strategy with a 1% probability. Assuming your backtest employs a risk management strategy i would argue this probability is well below 1%. Nevertheless, viewing the results with this level of conservatism may be prudent especially considering the likely presence of many other biases not accounted for.


print() – recent update thanks to Brian, which tidies up the output by rounding to 3 decimals. Note that the simulation is different to the results generated in the charts above, as the update was done post the publishing of this post –

To interrogate the values from the charts more directly, a call to print() (which summarizes the output from the summary.mcsim() S3 method) will suffice. Below is an example with the subsequent output (note all sample stat outputs are returned):

print(rsim, normalize=FALSE)


We can now see the exact values for the sample statistic metrics.




Monte Carlo and mcsim() limitations and disadvantages of Portfolio PL simulations 

As with any model there are assumptions. In the case of Monte Carlo analysis using mcsim() one of the most significant assumptions is that returns follow a Gaussian (normal) distribution. Depending on the distribution of the sample statistic of interest, you may be able to propose a more suitable distribution. The exponential impact of ‘fat tail’ events also cannot be underestimated, with one of the world’s earliest and most successful hedge funds succumbing to these limitations.

Another limitation of Monte Carlo analysis is the sample from which it simulates may be overfit or overly optimized for in-sample out-performance. It is the job of the analyst to manage this risk though, and hopefully with the recently added quantstrat functions degfreedom(), deflated.Sharpe() and haircut.Sharpe() the analyst will be more empowered to manage this risk. For a useful demonstration of their use in quantstrat see the macdParameters demo.

Sampling from Portfolio PL may also be unrealistic since no consideration is given to market regimes and strategy-specific stylized facts such as the individual positions’ time in and out the market. Where this data is available however, the analyst has another recently developed tool at their disposal, namely txnsim()This will be the subject of my next post.

Thanks for reading.

Thanks must also go to my co-author, Brian Peterson, from whose work and commitment all of what i have written would be impossible. Brian’s invaluable insights related to this and so many other topics have benefited the industry to no end. Another legend worth a mention is Joshua Ulrich who never gave up on me (or never gave away that he had) w.r.t my supposed git ignorance. The R/Finance open source community is strong and i am honoured to be a part of it.


R-view: Backtesting – Harvey & Liu (2015)

In this post i take an R-view of “Backtesting – Harvey & Liu (2015).” The authors propose an alternative to the commonly practiced 50% discount that is applied to reported Sharpe ratios when evaluating backtests of trading strategies. The reason for the discount is due to the inevitable data mining inherent in research, both past and present. Harvey & Liu (HL) propose a multiple hypothesis testing framework with a more stringent requirement for determining a test’s significance level compared with single tests. HL suggest that the single hypothesis test used by most researchers leads to too many false discoveries (Type I errors). This makes intuitive sense if you consider that the probability of finding a significant factor increases with the number of factors tested or the number of tests conducted.

To give an example quoted in Harvey & Liu – Evaluating Trading Strategies (which you may have heard before), imagine you receive a mail from a manager promoting a particular stock. The mail says to track the manager’s recommendations over time. The recommendation is either to go long or short. The manager is correct 10 weeks in a row with his recommendations. The probability of such a feat you figure is very small, in fact a 0.09765% probability of being a false discovery (0.5^10 = 0.0009765625) based on single hypothesis testing. But what if the manager sent a similar mail to 100k people telling half to go long and the other half to go short the same stock. Every week the manager trims half the people off his mailing list, the half for whom the recommendation did not work. Every week for 10 weeks the list is trimmed. By the end of week 10 a total of 97 lucky people have been the recipients of these perfect picks. The results were random, and the recipients would have been fooled.

In multiple hypothesis testing the challenge is to guard against false discoveries. HL argue that the appropriate “haircut Sharpe ratio” is non-linear, in that the highest Sharpe ratios (SR’s) are only moderately penalized whilst marginal SR’s more so. The implication is that high SR’s are more likely true discoveries in a multiple hypothesis testing framework.

If you would like to jump straight to my R implementation it is at the end of this post. The authors make their Matlab code for Exhibits 5 & 6 available here. I simply converted it to R and added some extra bits for illustrative purposes. [As of Aug ’17 these functions are available in quantstrat as haircut.Sharpe and profit.hurdle]

HL mention 5 caveats to their framework, namely;

  • Sharpe ratios may not be appropriate metrics for strategies with negatively skewed expected payoffs, such as option strategies.
  • Sharpe ratios normalize returns based on their volatility (ie. market risk), which may not be the most appropriate reflection of risk for a strategy.
  • Determining the appropriate significance level for multiple testing (where in single tests 5% is the normal cutoff).
  • Which multiple testing method you choose could yield different conclusions. HL proposes 3 methods together with an average.
  • Lastly, and i would say most sensitively is the number of tests used.

Linking Sharpe ratio with t-statistic

To explain the link between the Sharpe ratio and the t-stat and the application of a multiple testing p-value adjustment, HL use the simplest case of an individual investment strategy. Assume a null hypothesis in which the strategy’s mean return is significantly different from zero, therefore implying a 2-sided alternative hypothesis. A strategy can be regarded as profitable if its mean returns are either side of zero since investors can generally go long or short. Since returns will at least be asymptotically normally distributed (thanks to the Central Limit Theorem) a t-statistic following a t-distribution can be constructed and tested for significance. However, due to the link between the Sharpe ratio and the t-stat it is possible to assess the significance of a strategy’s excess returns directly using the Sharpe ratio. Assume \hat{\mu} denotes the mean of your sample of historical returns (daily or weekly etc) and \hat{\sigma} denotes standard deviation, then

t-statistic = \frac{ \hat{\mu}}{(\hat{\sigma}/\sqrt{T})}

where T-1 is degrees of freedom and since

\widehat{SR} =  \hat{\mu}/\hat{\sigma}

it can be shown that

\widehat{SR} = t-statistic/\sqrt{T}

By implication a higher Sharpe ratio equates to a higher t-ratio, implying a higher significance level (lower p-value) for an investment strategy. If we denote p-value of the single test as {p^s} then we can present the p-value of the single test as

{p^s} = Pr( |r| > t-ratio)


{p^s} = Pr( |r| > \widehat{SR}.\sqrt{T})

Now if the researcher was exploring a particular economic theory then this p-value might make sense, but what if the researcher has tested multiple strategies and presents only the most profitable one? In this case the p-value of the single test may severely overstate the actual significance. A more truthful p-value would be an adjusted multiple testing p-value which assuming we denote as {p^m} could be represented as

{p^m} = Pr( max{|{r_i}|, i = 1,...,N} > t-ratio)


{p^m} = 1 - (1 - {p^s}{)^N}

By equating the p-value of a single test to a multiple test p-value we get the defining equation of {p^m} which is

{p^m} = Pr( {|{r_i}|} > \widehat{SR}.\sqrt{T})

Using the example in HL, assuming a strategy with 20 years of monthly returns (T=240), an annual Sharpe ratio (SR) of 0.75, you would have a p-value for the single test of

{p^s} = Pr( |r| > t-ratio)

{p^s} \approxeq 0.0008

To compute the single p-value and the multiple testing p-value assuming researchers have tried N = 200 strategies, you can use the below R code to compute the adjusted or “haircut” Sharpe ratio;

# Set the variables
N = 200 # Number of strategies tested
T = 240 # Number of observations or degrees of freedom
SR_ann = 0.75 # Annualised Sharpe ratio
# Compute the single test p-value
p_single = 2 * (1-pnorm(SR_ann*sqrt(T/12)))
# Compute the multiple test p-value
p_multiple = 1 - (1 - p_single) ^ N
# Compute haircut Sharpe ratio (HSR)
z_stat = qt(1 - p_multiple/2, N-1)
HSR = (z_stat/sqrt(T)) * sqrt(12)
# Sharpe ratio haircut
haircut = (SR_ann - HSR)/SR_ann

The code should output an HSR of 0.325 (HL yields 0.32 due to rounding the single and multiple p-values) which implies a haircut to the original Sharpe ratio of 56.6% (HL report a 60% haircut). The example above is no longer appropriate where t-statistics are dependent. More on this later.

Multiple Multiple-Testing methods

HL mentions 3 well known adjustment methods in the statistics literature, which are originally prescribed in the paper “…and the Cross-Section of Expected Returns” by Harvey, Liu and Zhu. These are Bonferroni, Holm, and Benjamini, Hochberg, and Yekutieli (BHY).

Bonferroni’s adjustment:

{p_i^{Bonferroni}} = min[M * {p_i}, 1]

Bonferroni applies the same adjustment to the p-value of each test, inflating the p-value by the number of tests. The multiple testing p-value is the minimum of each inflated p-value and 1 where 1 (or 100% if you prefer) is the upper bound of probability. HL use the example of p-values from 6 strategies where the p-values are (0.005, 0.009, 0.0128, 0.0135, 0.045, 0.06). According to a 5% significance cutoff the first 5 tests would be considered significant. Using the p.adjust function in R we can get the multiple adjusted p-values and according to Bonferroni only the first test would be considered significant.

# Set the variables
p_values = c(0.005, 0.009, 0.0128, 0.0135, 0.045, 0.06)
# Assign to adj.p_values the output of p.adjust()
adj.p_values = p.adjust(p_values, "bonferroni")
# Print


Holm’s adjustment:

p-value adjustments can be categorized into 2 categories, namely: single-step and sequential. Single-step corrections equally adjust p-values as in Bonferroni. Sequential adjustments are an adaptive procedure based on the distribution of p-values. Sequential methods gained prominence after a seminal paper by Schweder & Spjotvoll (1982) and section 7.3 of this paper gives a useful example of an application of multiple testing hypothesis diagnostic plotting in R. Holm is an example of a sequential multiple testing procedure. For Holm, the equivalent adjusted p-value is

{p_{(i)}^{Holm}} = min[max((M - j + 1)*{p_{(j)}}),1]

Using the same 6 strategies illustrated above, we can use p.adjust() to compute the Holm adjusted p-values.

# Run p.adjust() for Holm
HOLM_adj.p_values = p.adjust(p_values, "holm")


We see in the output above that the first 2 tests are significant at the 5% cutoff, compared to only 1 with Bonferroni. This is due to the fact that Bonferroni adjusts single tests equally, whereas Holm applies a sequential approach. By conclusion it should not surprise you that adjusted Sharpe ratios under Bonferroni will therefore be lower than for Holm. At this point it is useful to mention that both Holm and Bonferroni attempt to prevent even 1 Type I error occurring, controlling what is called the family-wise error rate (FWER). The next adjustment proposed by HL is BHY and the main difference from the previous 2 adjustment methods is that BHY attempts to control the false discovery rate (FDR), implying more lenience than Holm and Bonferroni and therefore expected to yield higher adjusted Sharpe ratios.

BHY – (Benjamini, Hochberg, and Yekutieli):

BHY’s formulation of the FDR can be represented as follows. Firstly all p-values are sorted in descending order and the adjusted p-value sequence is defined by pairwaise comparisons.

BHY - formulation.PNG

c(M) is defined as follows

c(M) = {\sum_{j=1}^M} \frac{1}{j}

making c(M) = 2.45 for the example of M=6 strategies (ie. sum of 1/1 + 1/2 + 1/3 +1/4 + 1/5 + 1/6). Using p.adjust() we can see that there are now 4 significant strategies at the 5% cutoff. The output from p.adjust below is slightly different to HL’s BHY implementation as p.adjust() does not equate {p_{(6)}^{BHY}} with the least significant p-value 0.06.

# Run p.adjust() for BHY
BHY_adj.p_values = p.adjust(sort(p_values, decreasing = TRUE), "BY")


So to summarise these methods BHY leads to 4 significant discoveries versus Holm’s 2 and Bonferroni’s 1. We expect BHY to be more lenient as it controls the false discovery rate whereas Holm and Bonferroni control the family-wise error rate, trying to eliminate making even 1 false discovery. Bonferroni is more stringent than Holm since it is a single-step adjustment versus the sequential approach of Holm. With these 3 methods HL attempt to adjust p-values to account for multiple testing and then convert these to haircut Sharpe ratios and in so doing control for data mining. For both Holm and BHY you need the empirical distribution of p-values from previously tried strategies. Harvey, Liu and Zhu (HLZ) model over 300 risk factors documented in the finance literature. However, using this model for the distribution of p-values is not complete since many tried strategies would not have been documented (referred to as Publication Bias) plus they are potentially correlated thereby violating the requirement for independence between tests. HLZ propose a new distribution to overcome these shortfalls.

Harvey, Liu and Zhu (HLZ)

As previously mentioned HLZ study over 300 factors tested for explaining the cross section of return patterns. They publish a list of their sources here in spreadsheet format. Just looking at the history of published factors they show that factor discovery increases every decade.

History of Factor Discoveries - HLZ.PNG

Using the t-statistics published with those papers (assuming they are economically and statistically sound) HLZ perform the 3 multiple testing procedures described above (Bonferroni, Holm and BHY). They conclude under the assumption that if all tried factors are published then an appropriate minimum threshold t-statistic for 5% significance is 2.8, which equates to a p-value of 0.50% for single tests. Whilst the assumption of all tried factors being published is not reasonable, HLZ argue that the analysis does serve to provide a minimum threshold for accepting significance of future tests.

HLZ limit their factor sample to unique risk factors so as to keep test dependence to a minimum. This is a requirement for the 3 multiple testing procedures described above. But we know the requirements for being published are fairly stringent and most likely limited to tests that show significant t-statistics. HLZ estimate that 71% of tried tests are not published (see appendix B of HLZ for details), and based on this together with their implementation of the 3 adjusting procedures HLZ propose a benchmark t-statistic of 3.18. Intuitively this is greater than previously (2.8) when the assumption was all tried factors were published.

What about test dependence ie. correlation among test statistics?

This will be the case where factors tested model similar forms of risk, and are therefore correlated. Think about the multitude of price multiple factors. Assuming in the extreme that all factors are directly correlated at 100% then no adjustment to the single test would be required as you would be penalizing Type I errors (false discoveries) too harshly at the risk of increasing the Type II error rate (missed discoveries). In multiple testing, the statistics literature attempts to account for correlation through various methods including resampling the entire dataset. In finance research bootstrap procedures are being used to assess significance of individual tests and infer fund manager skill versus chance. Indeed, this is what we are aiming to achieve with mcsim() and txnsim() in the R:blotter package. However, since HLZ do not have access to the datasets used for the factors published in the majority of cases, they instead infer a distribution using the t-statistics published. They suggest a mixed distribution, in which the null hypothesis that mean returns are zero is drawn from a normal distribution and the alternative hypothesis that mean returns are non-zero is drawn from an exponential distribution. This should make economic sense if one assumes more profitable strategies are less likely to exist in an economically scarce world. Using some statistical magic (Section 4 and Appendix A in HLZ which i hope to address more specifically in a future post) HLZ propose model parameter estimates as follows:

Model Parameter Estimates.PNG

 Using the baseline assumptions for the number of unobserved tests, one can use Harvey & Liu’s code in Exhibit 5 & 6 (replicated in R below) to determine the appropriate haircut Sharpe ratio (HSR) or the required mean return for a given significance level, both taking into account an assumed number of tests.


Data mining is unavoidable and in some respects represents knowledge gain. Accounting for it when measuring statistical significance is a requirement. Judgement will have to be exercised when considering significance levels and which multiple testing technique is most appropriate. Bonferroni and Holm are appropriate for mission-critical tests such as for medical research or space missions. Controlling for the error rate of a single event occurring is desirable. Where the application is not a matter of life or death controlling for the rate of false discoveries may be more desirable. Perhaps this applies to finance. Perhaps not. Either way the requirements for statistical significance of trading strategies over time are becoming more stringent. This is natural; most of the true factors would have been discovered and the cost of mining data has diminished and continues to do so at an impressive rate. In other sciences such as physics and medicine/genomics, tests require 5 standard deviations to be considered significant. The field of finance is headed in the same direction as more factors are published. Assuming the widely used Sharpe ratio is an applicable metric for your new strategy’s evaluation, you would be well served to consider its haircut equivalent.


R implementation of Exhibit 5 (Haircut_SR.m and sample_random_multests.m)

R implementation of Exhibit 6 (Profit_Hurdle.m)

Please note you will need the sample_random_multests() function from above to run the Profit_Hurdle R code

Quantitative Strategy Development Overview – Brian Peterson

I have had the pleasure of getting to know and work with Brian Peterson of late building out the blotter::mcsim function in the blotter package. I will be writing about this function soon and where it is headed, but in this post i wanted to share a presentation Brian gave the CapeR User Group last week on Developing and Backtesting Systematic Trading Strategies with an application in R, and in particular using the quantstrat and blotter packages. Enjoy!


Link to the slides here!

Block Bootstrapped Monte Carlo – in R


A few weeks back i wrote a post including the source code for a Monte Carlo simulation function in R. The idea was to randomly sample daily returns produced by a backtest and build a confidence interval distribution of the middle 50% and 90% of returns. Since then Brian Peterson got in touch with me asking if i would work with him in getting some form of Monte Carlo simulation functionality standardized within the R blotter package. The blotter package forms part of the TradeAnalytics project on R-Forge (which includes “quantstrat”) where you will read the project description as…

“Transaction-oriented infrastructure for defining instruments, transactions, portfolios and accounts for trading systems and strategy simulation. Provides portfolio support for multi-asset class and multi-currency portfolios. Still in heavy development.”

Whilst working with Brian he suggested a few enhancements for the function in the original post, one of which was to include the option to perform a block bootstrap of the original backtest in order to capture any autocorrelation effects. This leaves 2 options as far as existing R packages go, namely; meboot() and boot().



Now meboot() is a really neat package which i found out about thanks to this post by James Picerno over at The Capital Spectator. It eliminates the requirement for stationarity in a time series. Below is an example of what a meboot plot looks like using the same simSample file i used in the previous post with 1,000 samples (or replicates as the meboot package refers to it).


Looking at the above chart I am not sure it is the most appropriate tool for the job, since the sampling is done within a very narrow range giving you large swings which tend to revert back to the original backtest. To make sure though i also ran the meboot() function over the backtest results generated by the “longtrend” demo strategy script in the blotter package. Again, samples track the original backtest quite closely.


Whether or not meboot is the best tool for the job will ultimately come down to what the sample statistics say and which metrics you use to determine whether or not to proceed with a strategy into production. Either way it may be a useful option to include in any standardized Monte Carlo simulation functionality within blotter.



The boot package includes functions and datasets for bootstrapping from “Bootstrap Methods and Their Application” by A. C. Davison and D. V. Hinkley (1997, CUP). Of particular interest for time series bootstrapping is the tsboot() function, whose purpose is described in the vignette on CRAN as to “Generate R bootstrap replicates of a statistic applied to a time series. The replicate time series can be generated using fixed or random block lengths or can be model based replicates.”

Now the tsboot object returned by the function does not include the samples themselves, but instead you will have to call the boot.array() function to get the sample indexing. Off the bat though i wasn’t sure the tsboot function was appropriate for simply block sampling a time series, albeit very eloquently and efficiently achieving just that on top of calculating a statistic of each sampled series. So i spent some time debugging the function trying to learn how it goes about performing the block bootstrapping and managed to come up with my own function which produces the below graph after sampling simSample 1,000 times (R = 1000) using a block length of 5 (l = 5) [full source code below the graph].


Your choice of block length (l) can be anything (as long as its less than the number of periods in your backtest) and how long you make your blocks could be based on a number of factors including the average length that your strategy is in the market or the average period in which your strategy experiences autocorrelation. Remember, the whole purpose of adding block bootstrap functionality was to account for autocorrelation characteristics in your backtest.

So, whilst the above mcblocksim() function is likely an improvement on the mcsimr() function i wrote about in my previous post, there is still quite a lot of work to do to get it ready for testing and committing to the blotter package. Particular to-do items include:

  • Revisiting the tsboot() function and determining whether or not some form of statistic can help users identify the optimal number of samples required
  • Potentially using the drawdowns() function in the Performance Analytics package as a statistic in the tsboot() function
  • Applying confidence intervals to the samples, thereby eliminating outlying returns (as in mcsimr())
  • Adding any other strategy-specific noise to the samples

So lots to do but as they say, Rome wasn’t built in a day. Hopefully you find some use from the above source code in the mean time.

Thanks to Brian Peterson for his valuable input thus far.


Thanks for reading.


A Monte Carlo Simulation function for your back-test results – in R

In this post on, Andrew Swanscott interviews Kevin Davey from KJ Trading Systems who discusses why looking at your back-test historical equity curve alone might not give you a true sense of a strategy’s risk profile. Kevin Davey also writes on the topic here for So i wrote a Monte Carlo-type simulation function (in R) to see graphically how my back-test results would have looked had the order of returns been randomized. I use ‘n’ to denote the number of simulations i would like to do, and exclude the highest and lowest 5% of return streams to form the blue ribbon. For the middle red ribbon i use the middle 50% of return streams (ie. excluding the 1st and last quartile). The second and last parameter (default TRUE) is a boolean which if TRUE will run the sampling procedure in R with replacement, and without replacement if set to FALSE. See this link if you need an intuitive explanation of what sampling with and without replacement means. Below is an example of the output for 10k simulations, with and without replacement for a random set of returns on a hypothetical strategy spanning 7 months. The black line is the actual historical equity curve. Whether or not the return distributions satisfy your risk appetite will come down to a more refined drawdown analysis together with the host of other risk metrics available for consideration. At least with the function below you have the stats in a dataframe and a graph to start…the starting point for any statistical analysis.


I read the data into R using the read.csv function. You can use the same data here (simSample.xlsx), just be sure to change the file type to CSV first, as WordPress only allows for storing .xls and .xlsx Excel files. Below is the source code to add the function to your environment. I add the required packages then create the mcsimr() function to do the import, manipulate the data and build the graph using ggplot. Remember to set your working directory as the directory in which you have stored the simSample.csv file.

If you would like to run the simulation 1,000 times without replacement, then the function should read:

mcsimr(1000, FALSE)

If you would like to run with replacement, include TRUE for the second argument or leave it blank, as the default is TRUE.


Ok, below is the full source code

I am aiming to have the function code on github soon [EDIT: The above is hosted (with love) by GitHub, however enhancements are still a work in progress], hopefully with some enhancements allowing the user to specify an xts object for the raw data (ie. backtest results) as well as letting the user decide which percentages to use for the bands. Some drawdown stats would be useful too. I have built something which incorporates a drawdown analysis of every ‘n’ return stream, however, it needs some more work to speed it up as you can imagine 10k xts objects with 500+ of daily return data could take some time to analyse individually.

Lastly, i wonder if something like this could fit into the QuantStrat or TradeAnalytics packages…or if there already is something similar do let me know. [EDIT: Since publishing this post Brian Peterson (maintainer of multiple Finance R packages like ReturnAnalytics, quantmod and Quanstrat) got in touch and asked if i would work with him in getting this function into the blotter package, a dependency of Quantstrat. Brian has also suggested a few improvements to the function including using Block Bootstrap to account for autocorrelation and plot.xts as opposed to ggplot for faster graphical rendering. So look out for an updated post on that.]

Happy hacking!


How i used 18 lines of R code to crack the Triangle Challenge

I recently came across a question that required logic and coding skills to solve. It went a little something like this:

Let there be a stick of length 1. Pick 2 points uniformly at random along the stick, and break the stick at those points. What is the probability of the three resulting pieces being able to form a triangle?

Interesting i thought. I do enjoy these types of challenges, and had fun trying to solve this problem. It took me a while to understand that the longest side (or one of the longest sides in the case of an isosceles triangle)  had to be smaller than the sum of the 2 smaller sides. In pseudo code a triangle would be possible IF:

Length of longest side < sum of other 2 sides, which is true IF

Length of longest side < 0.5 (half the length of the stick)

When i saw (in my mind’s eye) me breaking a stick in half and folding it so that both pieces overlap, it became obvious that i needed one of the pieces to be longer and then break that piece into 2 pieces of any length to form a would-be triangle. As a side note, i have been hoping to create my first gif for a while and this might be a useful one to animate a stick being broken into half, folded over and then one half growing in length and then breaking into 2 and forming a triangle.

I turned to the statistical computing language R to solve my problem, knowing i could randomly draw values from a uniform distribution with the runif() function.

Continue reading