A replication of the Practical Application section in ‘The Probability of Backtest Overfitting’ – Bailey et al.

In their paper “The Probability of Backtest Overfitting” [https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2326253], Bailey et al. introduce a method for detecting overfitting, which they call CSCV, or Combinatorially Symmetric Cross Validation. They proceed to show that CSCV produces reasonable estimates of the probability of backtest overfitting (PBO) for several useful examples.

To illustrate the ease with which a random series can produce an overfit trading rule, Bailey et al., in “Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance” [http://www.ams.org/notices/201405/rnoti-p458.pdf], seek to identify an optimal monthly trading rule based on four parameters: entry day, holding period, stop-loss and side. Using the ranges c(1:22), c(1:20), c(0:10) and c(-1,1) respectively, their program iterates through each permutation (8,800 in total) to determine the strategy with the highest Sharpe ratio (SR). Their example produces an optimal SR of 1.27. The authors extend the data-mining example by applying a seasonal adjustment to the first 5 days of each month; specifically, they shift the daily returns by a quarter of the sigma used to generate the original series. The Python program for replicating this data-mining process can be found here [http://www.quantresearch.info/CSCV_3.py.txt]. Using their CSCV method to compute the PBO of each set of trials, the authors illustrate how CSCV diagnoses the probability of backtest overfitting for an overfit random strategy versus a strategy with a legitimate seasonal component.

The code and analysis below are an R port of the Python program, together with some PBO output derived using the new blotter:::pbo function (currently in the 73_pbo branch, soon to be merged into master), for the purposes of replicating the paper that introduced the CSCV method.

The main function is dataMine(), called backTest() in the original Python program. We store the output as an S3 object, making it possible to extend the analysis of the pbo output. There are a number of helper functions, namely:

  • numBusinessDays – assigns an integer to each date, representing which business day of the month it is. This is used when applying the seasonality adjustment to the first ‘nDays’ of each month, and is only applicable when the ‘nDays’ parameter is greater than 0.
  • getRefDates_MonthBusinessDate – determines the respective nth business day in each month. Each month has a 1st business day, so we assign dates for each 1st business day, then for each 2nd business day of each month, and so on.
  • getTrades – returns the trades, taking into consideration the constraints of the strategy (entry day, holding period {param name ‘exit’}, stop-loss and side).
  • attachTimeSeries – keeps a record of the daily returns for each trial (8,800 in the paper), which we later pass to the pbo() function. This function attaches each iterated trial as a new column.
  • evalPerf – records various metrics for computing the psrStat as well as the Sharpe ratio, which is the performance metric used in the paper. Bailey et al. emphasise that other metrics can be used with their framework (e.g. Sortino ratio, Jensen’s Alpha, Probabilistic Sharpe Ratio, etc). Exposing a range of metrics to the analyst, and allowing them to rank the performance of each for the purposes of detecting PBO, is an exciting piece of future work we will look to implement in quantstrat in due course.
  • computePSR – a helper function within evalPerf() for computing the psrStat and psr.

We have kept the names of the helper functions identical to those used in the python program.

When the dataMine() function is called, we print the SR along with other strategy characteristics to the console for the first strategy and for every subsequent strategy which improves on the prior SR. The last line printed once the function has completed will therefore be the iteration which generated the maximum Sharpe ratio. Each line prints 8 details, which are:

  1. iteration number,
  2. business day in month used for entry day,
  3. holding period,
  4. stop-loss,
  5. side (-1 for sell, 1 for buy),
  6. Sharpe ratio,
  7. Frequency (see calc in evalPerf function),
  8. psrStat or probabilistic Sharpe ratio statistic
# Assumes the xts package (for xts() and merge()) and the PerformanceAnalytics
# package (for skewness() and kurtosis()) are loaded.
library(xts)
library(PerformanceAnalytics)

dataMine <- function(nDays=0, factor=0){
  # 1) Input parameters --- to be changed by the user
  holdingPeriod <- 20
  sigma <- 1
  stopLoss <- 10
  length <- 1000
  # 2) Prepare series
  date <- as.Date("2000-01-01") # original python program uses 01/01/2000 start date
  dates <- list()

  while(length(dates) < length){
    if(as.POSIXlt(date)$wday != 0 & as.POSIXlt(date)$wday != 6 & length(dates) == 0){
      dates <- date
    } else if(as.POSIXlt(date)$wday != 0 & as.POSIXlt(date)$wday != 6 & length(dates) != 0){
      dates <- append(dates, date)
    }
    date <- as.Date(date) + 1
  }

  series <- matrix(0, nrow = length, ncol = 1)
  for(i in 1:length(series)){
    series[i] <- rnorm(1, mean = 0, sd = sigma)
    pDay <- as.Date(paste(format(as.Date(dates[i], format="%d/%m/%Y"),"%Y"),
                          format(as.Date(dates[i], format="%d/%m/%Y"),"%m"),
                          1,
                          sep = "-")
    )
    if(numBusinessDays(pDay, dates[i]) <= nDays){
      series[i] <- series[i] + sigma * factor
    }
  }
  series <- cumsum(series)
  # 3) Optimize
  # get the daily index in each month for each date in the series,
  # so the 2nd BD in Jan 2000 is 03/01/2000 and in Feb 2000 is
  # 02/02/2000. refDates[[2]] will be a list of elements with these dates.
  refDates <- getRefDates_MonthBusinessDate(dates) 
  psr <- sr <- trades <- sl <- freq <- pDay <- pnl <- NULL
  count <- 0
  # start of main loop - we loop over each date in the refDates list
  sr <- NULL
  for(pDay in 1:length(refDates)){
    refDates_0 <- refDates[[pDay]]
    if(length(refDates_0) == 0){
      next
    }
    # 4) Get trades
    iterdf <- expand.grid(z=c(-1,1), y=c(-10:0), x=c(1:20))
    iterdf <- iterdf[c("x","y","z")]
    for(prod in 1:nrow(iterdf)){
      count = count + 1
      trades_0 <- getTrades(series, dates, refDates_0, iterdf[prod,1], iterdf[prod,2], iterdf[prod,3])
      dates_0 <- NULL
      pnl_0 <- NULL
      if(length(trades_0) == 0) { next }
      for(j in 1:length(trades_0)) { dates_0 <- append(dates_0, trades_0[[j]][1]) }
      for(j in 1:length(trades_0)) { pnl_0 <- append(pnl_0, as.numeric(trades_0[[j]][4])) }
      # 5) Eval performance
      if(length(pnl_0) > 2){
        # 6) Reconcile PnL
        pnl <- attachTimeSeries(pnl, pnl_0, dates_0, count)
        # 7) Evaluate
        eval_list <- list()
        eval_list <- evalPerf(pnl_0, dates[1], dates[length(dates)])
        names(eval_list) <- c("sr_0","psr_0","freq_0")
        for(j in 2:length(pnl_0)){
          pnl_0[j] <- pnl_0[j] + pnl_0[j - 1]
        }
        if(is.null(sr)){
          sr <- 0 # give sr a numeric value
        }
        if(eval_list["sr_0"] > sr){ # if sr == None or sr_ > sr:
          psr <- eval_list["psr_0"]
          sr <- eval_list["sr_0"]
          trades_df <- data.frame(matrix(unlist(trades_0), nrow=length(trades_0), byrow=T),stringsAsFactors=FALSE)
          trades_df[,1] <- as.Date(trades_df[,1])
          colnames(trades_df) <- c("txndate","num","side","daypl")
          freq <- eval_list["freq_0"]
          pDay <- pDay # freq, pDay, prod = freq_, pDay_, prod_
          constraints <- c(iterdf[prod,1],iterdf[prod,2],iterdf[prod,3])
          cat(count, pDay, constraints, round(sr,2), round(freq,2), round(psr,2), fill = TRUE, sep = ",")
        }
      }
    }
  }
  ret <- list(Total_Iterations = count,
              PL = pnl,
              PSR = psr,
              SR = sr,
              trades = trades_df,
              Freq = freq,
              Business_Day = pDay,
              HP_SL_Side = constraints, # HP = Holding Period, SL = Stop Loss, Side = Side
              Dates = dates,
              cumsum_Series = series) # series is a cumsum of the original return series

  class(ret) <- "dataMine"
  ret
}

# ----------------------------------------------------------------------------------------
evalPerf <- function(pnl, date0, date1, sr_ref=0){
  freq <- (length(pnl) / as.numeric((difftime(date1, date0) + 1)) * 365.25)
  m1 = mean(pnl)
  m2 = sd(pnl)
  m3 = skewness(pnl)
  m4 = kurtosis(pnl, method = "fisher") # fisher=False in original, but unsure what alternative to use
  sr = m1 / m2 * freq ^ 0.5
  psr = computePSR(c(m1, m2, m3, m4), length(pnl), sr_ref=sr_ref / freq ^ 0.5, moments=4)[1]
  return(c(sr, psr, freq))
}

# ----------------------------------------------------------------------------------------
computePSR <- function(stats, obs, sr_ref=0, moments=4){
  # Compute PSR
  stats_0 = c(0, 0, 0, 3) # defaults: zero mean, sd and skew, kurtosis of 3
  stats_0[1:moments] = stats[1:moments]
  sr = stats_0[1] / stats_0[2]
  psrStat = (sr - sr_ref) * (obs - 1) ^ 0.5 / (1 - sr * stats_0[3] + sr ^ 2 * (stats_0[4] - 1) / 4) ^ 0.5
  psr = pnorm(psrStat)
  return(c(psrStat, psr))
}

# ----------------------------------------------------------------------------------------
attachTimeSeries <- function(series, series_0, index=NULL, label='', how='outer'){
  if(is.null(series)){
    series <- xts(series_0, as.Date(unlist(index)))
  } else {
    series_0 <- xts(series_0, as.Date(unlist(index)))
    series <- merge(series, series_0)
  }
  return(series)
}


# ----------------------------------------------------------------------------------------
getTrades <- function(series, dates, refDates, exit, stopLoss, side){
  # Get trades
  trades <- list()
  pnl <- position_0 <- 0
  j <- 1
  num <- NULL
  for(i in 2:length(dates)){
    # Find next roll and how many trading dates to it
    if(dates[i] >= refDates[j]){
      if(dates[i - 1] < refDates[j]){ num <- pnl <- 0 }
      if(j < length(refDates) - 1){
        while(dates[i] > refDates[j]) { j = j + 1 }
      }
    }
    if(is.null(num)){ next }
    # Trading rule
    position <- 0
    if(num < exit & pnl > stopLoss){ position <- side }
    if(position != 0 | position_0 != 0) {
      trades <- append(trades, list(list(as.Date(dates[i]), as.numeric(num), 
                                         as.numeric(position), 
                                         as.numeric(position_0 * (series[i] - series[i - 1])))))
      pnl <- pnl + as.numeric(trades[[length(trades)]][4])
      position_0 <- position
    }
    num <- num + 1
  }
  return(trades)
}

# ----------------------------------------------------------------------------------------
getRefDates_MonthBusinessDate <- function(dates){
  refDates <- list() # python dict in de Prado program
  pDay <- list()
  first <- as.Date(paste(format(as.Date(dates[1], format="%d/%m/%Y"),"%Y"),
                         format(as.Date(dates[1], format="%d/%m/%Y"),"%m"),
                         1,
                         sep = "-")
  )
  m <- as.numeric(format(dates[1],"%m"))
  d <- numBusinessDays(first, dates[1]) + 1
  # For all dates, get the day (int)...will be all business days, and assign to pDay
  for(i in 1:length(dates)){
    if(m != as.numeric(format(dates[i],"%m"))){ 
      m <- as.numeric(format(as.Date(dates[i], format="%d/%m/%Y"),"%m"))
      d <- 1
    }
    pDay <- append(pDay, d)
    d <- d + 1
  }
  for(j in 1:30){ # python "range(1,30)"
    lst <- list()
    for(i in 1:length(dates)){
      if(pDay[i] == j){
        lst <- append(lst, dates[i])
      }
    }
    refDates[[j]] = lst
  }
  return(refDates)
}

# ----------------------------------------------------------------------------------------
numBusinessDays <- function(date0, date1){
  m <- 1
  date_0 <- date0
  while(TRUE){
    date_0 <- as.Date(date_0) + 1
    if(date_0 >= date1){ break }
    if(as.POSIXlt(date_0)$wday != 0 & as.POSIXlt(date_0)$wday != 6){
      m <- m + 1 
    }
  }
  return(m)
}
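
As a quick sanity check of the indexing logic (this example is not part of the original program): the first business day of January 2000 was Monday the 3rd, so Wednesday the 5th is the third business day of that month.

numBusinessDays(as.Date("2000-01-01"), as.Date("2000-01-05"))
#> [1] 3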

We will run the dataMine() function twice, as in the paper: once for a purely random series and again for a seasonally adjusted series. In the first case we leave the ‘nDays’ and ‘factor’ parameters at their default values of zero. In the second run, we use ‘nDays’ = 5 and ‘factor’ = 0.25.

Random

# First build our Sharpe function - the performance metric used in the paper for determining PBO
# which gets passed to pbo() as the performance metric used
sharpe <- function(x,rf=0.03/252) {
  sr <- apply(x,2,function(col) {
    er = col - rf
    return(mean(er)/sd(er))
  })
  return(sr)
}

# Run dataMine() for a random series
set.seed(777)
t1 <- Sys.time()
dm_random <- dataMine(nDays = 0, factor = 0)
#> 1,1,1,-10,-1,0.77,22.99,1.52
#> 68,1,4,-10,1,0.93,57.48,1.85
#> 84,1,4,-2,1,0.93,56.43,1.86
#> 464,2,2,-10,1,1.29,34.49,2.7
#> 486,2,3,-10,1,1.46,45.98,2.97
#> 502,2,3,-2,1,1.54,45.72,3.16
#> 504,2,3,-1,1,1.55,44.94,3.19
t2 <- Sys.time()
timediff <- difftime(t2,t1)
timediff
#> Time difference of 22.62471 mins

# blotter:::pbo
dm_random$PL[!is.finite(dm_random$PL)] <- 0 # Replace NAs with zeros
pbo_random  <- blotter:::pbo(dm_random$PL, f=sharpe)
summary(pbo_random)
#> Performance function sharpe with threshold 0
#>    p_bo   slope    ar^2  p_loss 
#> 0.50000 0.11922 0.08100 0.48600
# histogram(pbo_random, type="density")
# xyplot(pbo_random, plotType="degradation")

Seasonal

# Reuse the sharpe() performance function defined in the Random section above
# Run dataMine() for a seasonal series
t3 <- Sys.time()
set.seed(777)
dm_season <- dataMine(nDays = 5, factor = 0.25)
#> 2,1,1,-10,1,0.19,22.99,0.37
#> 24,1,2,-10,1,1.37,34.49,2.85
#> 46,1,3,-10,1,2.13,45.98,4.65
#> 68,1,4,-10,1,2.63,57.48,5.64
#> 486,2,3,-10,1,2.87,45.98,6.43
#> 504,2,3,-1,1,2.9,45.46,6.55
t4 <- Sys.time()
print(t4-t3)
#> Time difference of 21.87491 mins
dm_season$PL[!is.finite(dm_season$PL)] <- 0 # Replace NAs with zeros
pbo_seasonal  <- blotter:::pbo(dm_season$PL, f=sharpe)
summary(pbo_seasonal)
#> Performance function sharpe with threshold 0
#>    p_bo   slope    ar^2  p_loss 
#> 0.00000 0.19634 0.05000 0.00000
# histogram(pbo_seasonal, type="density")
# xyplot(pbo_seasonal, plotType="degradation")

The PBO results from both the random and seasonally adjusted series are broadly in line with what is reported in the paper. Bailey et al. get an optimal SR of 1.27 across all their trials for the random series, with a PBO of 55%. For the seasonal series they report an optimal SR of 1.54 and a PBO of only 13%. Setting the seed for both runs of dataMine() to 777 (jackpot) for the purposes of comparison and replication, we get an optimal SR of 1.55 for the random series with a PBO of 50%, and for the seasonal series an optimal SR of 2.9 with a PBO of 0%, which points in the same direction as Bailey et al.’s 55% and 13%. With the random seed set to 777, the optimal iteration occurs at number 504 for both series.

Future work

  • Adding different performance metrics as methods to blotter:::pbo.
  • Adding a function for comparing overfit statistics for the different performance metrics. Presumably using the Sharpe ratio as your metric will give different results compared to using max drawdown, or Sortino Ratio etc. Can this tell us anything about the relevance and/or applicability of the different performance metrics?
  • Apply output from txnsim to pbo(), since txnsim() attempts to evaluate overfitting in the context of randomized versions of the original strategy whilst honouring the outward resemblance of the original strategy.
  • Optimise dataMine() with vectorization.

 

Thanks to Duncan Temple Lang (@duncantl), who developed the RWordPress R package, and William K. Morris (@wkmor1), Yihui Xie (@xieyihui), and Jared Lander (@jaredlander), who developed the knit2wp function in knitr. These two packages make it possible to write a blog post in R Markdown and then publish it directly to a WordPress site, as I did with this post. For instructions on how to do this, see this post by @HeatherUrry.

Round Turn Trade Simulation – in R

I was fortunate enough to talk about my latest open source work with Brian Peterson at the R/Finance conference in Chicago just under a month ago. It was my first time at the conference, and I will certainly be back. The topics and their presentations are available on the website. With this post, I hope to share the main ideas of my talk.

Back in August 2017 I wrote a post about a new function in the R blotter package called mcsim(). The objective for writing that post was to introduce the function, and how and why it may be useful when evaluating quantitative trading strategies. In this article I aim to achieve the same goal for a new function named txnsim(), with a similar application but with more analytical value.

With mcsim() an analyst is able to simulate a trading strategy’s portfolio PL, which gives some insight into the statistical properties of the strategy overall. Simulating portfolio PL can have drawbacks; in particular, simulations at the portfolio level may lack transparency or may not line up with historical market regimes. In these instances sampling round turn trades may be more useful.

With txnsim() the analyst is able to construct a random strategy that preserves as many of the stylized facts (or style) of the observed strategy as possible, while demonstrating no skill. The round turn trades of the random replicate strategies, while outwardly resembling the original strategy in summary time series statistics, are the result of random combinations of observed features taking place at random times in the tested time period. This effectively creates simulated traders with the same style but without skill. For this reason txnsim() is most appropriate for discerning skill vs. luck or overfitting.

 

Performance Simulations in the Literature

  • Pat Burns (2004) covers the use of random portfolios for performance measurement, and in a subsequent 2006 paper for evaluating trading strategies, which he terms a related but distinct task. In the latter paper he notes that statistical tests of a signal’s predictiveness were generally possible even in the presence of potential data snooping bias. Things have likely changed in the 12 years since, in that data snooping has become more prevalent, with more data, significantly advanced computing power and the ability to fit an open source model to almost any dataset.

Pat Burns – “If we generate a random subset of the paths, then we can make statistical statements about the quality of the strategy.”

  • Jaekle & Tomasini in their Trading Systems book refer to the analysis of trading systems using Monte Carlo analysis of trade PNL. In particular they mention the benefit of a confidence interval estimation for max drawdowns.

Jaekle & Tomasini – “Changing the order of the performed trades gives you valuable estimations about expected maximum drawdowns.”

  • In the Probability of Backtest Overfitting paper by Lopez de Prado et al in 2015, they present a method for assessing data snooping as it relates to backtests, which are used by investment firms and portfolio managers to allocate capital.

Lopez de Prado et al – “…because the signal-to-noise ratio is so weak, often
the result of such calibration is that parameters are chosen to profit from
past noise rather than future signal.”

Thanks to the consent of Matt Barry, we will soon be implementing code from the pbo package into blotter and extending from there to fit in with the general framework of blotter and our ambitions of broadening the overfit detection methods available in quantstrat and blotter.

  • Harvey et al., in their series of papers including Backtesting and the Cross-Section of Expected Returns, discuss their general dismay at the reported significance of papers attempting to explain the cross-section of expected returns. They propose a method for deflating the Sharpe ratio to account for data snooping bias, otherwise referred to as multiple hypothesis testing.

Harvey et al – “We argue that most claimed research findings in financial economics are likely false.”

We implemented Harvey’s haircut Sharpe ratio model in quantstrat.

What all these methods have in common is an element of random sampling based on some constraint. What we propose in txnsim() is the random sampling of round turn trades, bound by the constraint of the stylized facts of the observed strategy.

Compared with more well-known simulation methods, such as simulating portfolio P&L, Round Turn Trade Simulation has the following benefits:

  1. Increased transparency, since you can view the simulation detail down to the exact transaction, thereby comparing the original strategy being simulated to random entries and exits with the same overall dynamic.
  2. More realistic since you sample from trade durations and quantities actually observed inside the strategy, thereby creating a distribution around the trading dynamics, not just the daily P&L.

What all this means, of course, is you are effectively creating simulated traders with the same style but zero skill.

 

Stylized facts

If you consider the stylized facts of a series of transactions that are the output of a discretionary or systematic trading strategy, it should be clear that there is a lot of information available to work with. The stylized facts txnsim() uses for simulating round turns include:

  • Round turn trade durations
  • Ratio of long:short durations
  • Quantity of each round turn trade
  • Direction of round turns
  • Number of layers entered, limited by maximum position

Using these stylized facts, txnsim() samples either with or without replacement between flat periods, short periods and long periods and then layers onto these periods the sampled quantities from the original strategy with their respective durations.

 

Round Turn Trades

In order to sample round turn trades, the analyst first needs to define what a round turn trade is for their purposes. In txnsim() there is a parameter named tradeDef which can take one of 3 values: “flat.to.flat”, “flat.to.reduced” or “increased.to.reduced”. The argument is subsequently passed to the blotter::perTradeStats() function, from which we extract the original strategy’s stylized facts.

For a more comprehensive explanation of the different trade definitions, I will refer you to the help documentation for the perTradeStats() function as well as the documentation for the txnsim() function in blotter.
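
To give a sense of the raw material txnsim() works from, here is a minimal sketch using perTradeStats() on a built blotter portfolio. The portfolio and symbol names, and the Start, End and Max.Pos column names, are assumptions for illustration; inspect the returned data.frame for the exact layout in your version of blotter.

library(blotter)
# Assumes the longtrend demo has been run, creating a portfolio named
# "longtrend" holding the symbol GSPC (names assumed for illustration)
trade_stats <- perTradeStats("longtrend", Symbol = "GSPC", tradeDef = "flat.to.flat")
durations <- as.numeric(difftime(trade_stats$End, trade_stats$Start, units = "days")) # round turn durations
direction <- sign(trade_stats$Max.Pos)      # +1 for long round turns, -1 for short
max.long  <- max(c(0, trade_stats$Max.Pos)) # observed maximum long position
max.short <- min(c(0, trade_stats$Max.Pos)) # observed maximum short position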

 

Longtrend

The first empirical example we will take a look at is an analysis using txnsim() and the longtrend demo in blotter. My only modification to the demo was to end the strategy in Dec 2017, purely for the purposes of replicating my results.

As we can see in the blue Positionfill window, the strategy only ever enters into a position once before exiting; it does not layer into positions.

[Figure: longtrend performance]

If we look at how the strategy performed overall (after setting the seed to lucky number 333), relative to its random replicates we can see fairly quickly that it is a difficult strategy to beat.

[Figure: txnsim replicate equity curves – longtrend]
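
For reference, a call along these lines produced the replicates used here. The argument names are assumptions based on the talk and the development branch documentation; check args(txnsim) (or blotter:::txnsim if it is not yet exported) in your installed version.

library(blotter)
set.seed(333) # lucky number 333, as above
# Assumed signature: portfolio name, number of replicates, sampling with
# replacement, and the round turn trade definition
ts_lt <- txnsim("longtrend", n = 1000, replacement = TRUE, tradeDef = "flat.to.flat")
plot(ts_lt) # assumes a plot method for txnsim objects, analogous to the hist method mentioned below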

To observe the stylized facts of the original versus the winning replicate strategy, we can contrast the Positionfills of both.

[Figure: Positionfill – longtrend original vs. winning replicate]

Replicate number 664 out of 1,000 was the most profitable strategy overall (we confirm this in a moment), and the chart gives a good sense of how txnsim() honoured the stylized facts of the original strategy when determining the random entries and exits of this ultimately winning replicate.

One of the many slots returned in the txnsim object is named “replicates” and includes the replicate start timestamps, the durations and the corresponding quantities. With the duration information in particular, we are able to chart the distribution of long period durations and flat period durations.
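
A hedged sketch of how such a chart could be produced from that slot (the internal layout of the replicates slot is an assumption; inspect str(ts_lt$replicates) for the actual structure in your build):

# Assumes each element of ts_lt$replicates is a table of sampled periods with a
# 'duration' column, as described above; adjust the accessor to the real layout
rep.durations <- unlist(lapply(ts_lt$replicates, function(r) as.numeric(r$duration)))
hist(rep.durations, breaks = 50,
     main = "Replicate period durations", xlab = "Duration")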

[Figure: histogram of long period durations – longtrend txnsim, flat.to.flat, with replacement]

Perhaps not surprisingly, we see the original strategy duration for long and flat periods roughly in the middle of the distributions of the replicates.

Since longtrend was a long only strategy, the flat period distribution is a mirror image of the long period distribution.

[Figure: histogram of flat period durations – longtrend txnsim, flat.to.flat, with replacement]

 

Included in the returned list object of class txnsim are ranks and pvalues, which summarise the performance of the originally observed strategy versus the random replicates.

[Figure: longtrend – ranks and p-values]

[Figure: longtrend – p-values]

As we can see, longtrend was in the 90th percentile for all performance metrics analysed except for stddev where it ranked inside the 84th percentile.

Using the hist method for objects of type txnsim we can plot any of the performance metric distributions to gauge how the observed strategy fared overall. Looking at the Sharpe ratio, we see graphically how longtrend does relative to the replicates as well as relative to configurable confidence intervals.

[Figure: longtrend – histogram of replicate Sharpe ratios]

Maximum drawdown is another one of the performance metrics used and generally a favorite for simulations. We can see again, visually, that longtrend outperforms most random replicates on this measure.

[Figure: longtrend – histogram of replicate max drawdowns]

 

Layers and Long/Short strategies with ‘bbands’

For any round turn trade methodology which is not measuring round turns as flat.to.flat, things get more complicated.

The first major complication with any trade that layers into a position is that the sum of trade durations will be longer than the market data. The general pattern of the solution is that we sample as usual, up to a duration equivalent to the duration of the first layer of the strategy; in essence we are sampling as if round turns were defined as “flat.to.flat”. Any sampled durations beyond this first layer are overlapped onto the first layer. The number of layers is determined by the number of times the first layer’s total duration divides into the total duration. In this way the total number of layers and their duration is directly related to the original strategy.
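
As an illustration of that arithmetic (the numbers are invented for the example, and the exact rounding used inside txnsim() may differ):

first.layer.duration <- 1500  # total duration of the first, flat.to.flat layer (bar-days)
total.duration       <- 4200  # sum of all sampled round turn durations (bar-days)
n.layers <- ceiling(total.duration / first.layer.duration)
n.layers
#> [1] 3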

The next complication is max position. A strategy may or may not utilize position limits, but this is irrelevant: we have no idea which parameters are used within a strategy, only what is observable ex post. For this reason we store the maximum long and short positions observed as a stylized fact. To ensure we do not breach these observed max long and short positions during layering, we keep track of the respective cumulative sum of each long and short layered trade.

For any trade definition other than flat.to.flat, however, we need to be cognizant of flat periods when layering to ensure we do not layer into an otherwise sampled flat period. For this reason we match the sum duration of flat periods in the original strategy for every replicate. To complete the first layer with long and short periods, we sample these separately and truncate the respectively sampled long and short duration which takes us over our target duration. When determining a target long and short total duration to sample to, we use the ratio of long periods to short periods from the original strategy to distinguish between the direction of non-flat periods.

To highlight the ability of txnsim() to capture the stylized facts of more comprehensive strategies, including long/short strategies with layering, we use a variation of the ‘bbands’ strategy. Since we apply a sub-optimal position-sizing adjustment to the original demo strategy in order to illustrate layering, we do not expect the strategy to outperform the majority of its random counterparts.

A quick look at the chart.Posn() output of bbands should highlight the difference in characteristics between longtrend and bbands.

[Figure: bbands chart.Posn output and equity curve]

We run 1k replicates and the resulting equity curves, as you can see below, confirm our suspicions. We have a lower probability of outperforming random replicates for this version of ‘bbands’. In fact, there are periods during the backtest where we severely underperform the other random agents. Something I touch on in the future work section is the addition of something similar to Burns’ non-overlapping periodic p-values, so we can better visualize how the backtest performed through time.

[Figure: bbands – txnsim replicate equity curves]

Taking a closer look at the performance and position taking of the “winning” random replicate, we get a sense of how the strategy attempts to mirror the original in terms of position sizing and duration of long versus short positions overall. It should also be evident how the replicate has honored the maximum long and short positions observed in the original strategy.

[Figure: bbands – winning txnsim replicate]

Comparing the position fills of the original strategy and the winning replicate more directly we get a better sense of the overall dynamic of both the original and the winning replicate.

[Figure: bbands – Positionfills, original vs. winning replicate]

What is potentially a red flag from this chart is the difference in padding. The replicate clearly has less padding, meaning the total duration that the strategy is in the market will be less than the original.

When we plot the long and short duration distributions of the replicates and compare these to the original strategy, it highlights the magnitude of the discrepancy and is something we hope to resolve soon so we can move onto the txnsim vignette and hopefully the start of a paper on Round Turn Trade Simulations.

[Figure: bbands – replicate long period durations]

 

[Figure: bbands – replicate short period durations]

 

Future Work

As mentioned previously and in no particular order, future work items will include:

  • Refining the layering process to better replicate total trade duration of the original strategy
  • Adding p-value visualization through time, similar to Pat Burns’ 10-day non-overlapping pvalues
  • Adding other simulation methodologies
  • Simulation studies of multivariate portfolios bound by capital constraints
  • Basing simulations on simulated or resampled market data
  • Applying txnsim stylized facts to “OOS” market data
  • And of course, a vignette and hopefully a paper on Round Turn Trade Simulation

 

Conclusion

Round turn trade Monte Carlo simulates random traders who behave in a similar manner to an observed series of real or backtest transactions. We feel that round turn trade simulation offers insights significantly beyond what is available from:

  • equity curve Monte Carlo (implemented in blotter in mcsim),
  • from simple resampling (e.g. from pbo or boot),
  • or from the use of simulated input data (which typically fails to recover many important stylized facts of real market data).

Round turn trade Monte Carlo as implemented in txnsim directly analyzes what types of trades and P&L were plausible with a similar trade cadence to the observed series. It acts on the same real market data as the observed trades, efficiently searching the feasible space of possible trades given the stylized facts. It is, in our opinion, a significant contribution for any analyst seeking to evaluate the question of “skill vs. luck” of the observed trades, or for more broadly understanding what is theoretically possible with a certain trading cadence and style.

The source code for my talk and this post is on GitHub.

Thanks to my co-author Brian Peterson!

Monte Carlo Simulation for your Portfolio PL

‘Once you have a system, the biggest obstacle is trusting it.’
Tom Wills

‘What is the last thing you do before you climb on a ladder? You shake it. And that
is Monte Carlo simulation.’
Sam Savage, Stanford University

 

Introduction

In my early days of looking at trading strategies, getting to the equity curve felt like the final piece of the puzzle; the piece that would provide enough perspective to decide whether or not to proceed to production. I was always more comfortable with the scientific approach than my gut feel. Perhaps this was a function of my limited time in the market. After collaborating with Brian Peterson and studying the many great articles written by him and so many others in the community, it became clear the equity curve was only the first piece of the puzzle.

Questions one should ask subsequent to generating a complete backtest are:

  • Is the result realistic?
  • Could we have overfit?
  • What range of values can we expect out of sample?
  • How much confidence can we have in the model?

A backtest generally represents limited information. It is merely one run of a particular strategy whose out-of-sample performance could be ambiguous. Sampling from this limited information however may allow us to get closer to the true process, thereby adding statistical significance to any inferences made from the backtest. mcsim() gives us the potential to do just that, simulating portfolio PL from either post-trade or backtested transactions, as well as providing multiple options for tweaking the sampling procedure according to strategy characteristics.

Whilst mcsim() has been alive and well in blotter for a while now (blotter is the open source transaction infrastructure package in R for defining instruments, accounts and portfolios for trading systems and simulations), I am yet to write a post about it. My prior posts here and here were a prelude to the superior functionality now available with mcsim().

 

mcsim()

Included in the documentation for the mcsim() function is an example of its use with the longtrend demo. I will refer to the mcsim() output of that strategy for this post. mcsim() comes with 10 parameters, which I describe next.

  • Portfolio, Account

These are objects stored in your blotter environment. If you are using post-trade cash PL in your simulation then these parameters can be left as their default NULL values. If using the longtrend demo for example you can use “longtrend” for both.

The Account object is used solely for retrieving your initial equity amount specified at the start of your strategy. The Portfolio object is used for retrieving dailyEqPL or dailyTxnPL (both existing blotter functions) depending on the ‘use’ parameter, which we cover a few parameters down.

  • n

‘n’ is used to specify the number of simulations you would like to run on your backtest. The longtrend help file uses 1k simulations, which takes me ~15 secs on my Intel i5-4460 (disclaimer: no optimisation or refactoring has been carried out on mcsim() yet; this is a TODO item). In my experience 1k simulations is almost always enough to achieve an approximately normal distribution of the sample statistics with an acceptable level of statistical significance. For investigating the general shape, however, 100 simulations should suffice.

  • replacement

Any sampling procedure can generally be run with or without replacement. Sampling without replacement means results will have the same mean and final PL but different paths. Sampling with replacement results in a wider distribution of the various paths the strategy could have taken, as well as results in which the final PL and the other sample statistics (maxDD, stddev, quasi-Sharpe) vary. This differentiation will be clear when running mcsim() for a given strategy with and without replacement, and calling the plot and hist functions on the mcsim object the results were assigned to. See the examples in the help docs, which we cover a little further down.

  • use

The ‘use’ parameter lets you specify whether or not you would like to use daily-marked portfolio PL, transaction PL (think intraday strategies) or cash PL if using an external xts object of post-trade returns. The method used in Tomasini & Jaekle (2009), Chapter 4 resembles that of sampling with use=’txns’.

  • l

The ‘l’ parameter (lower case L) stands for ‘length’ and allows you to sample blocks of PL observations. Assuming your strategy exhibits some form of autocorrelation it may make sense to incorporate a block length in your sampling procedure based on this observed autocorrelation. Alternatively some inverse multiple of your average holding period could also justify a value for block length ‘l’.
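
One way (not prescribed by mcsim() itself) to justify a block length is to look at the autocorrelation of the daily PL series. In the sketch below, dailyPL is an assumed xts object of daily portfolio PL (for example from dailyEqPL()):

library(xts)
pl  <- coredata(dailyPL)                       # dailyPL is assumed to exist
a   <- acf(pl, lag.max = 20, plot = FALSE)
sig <- qnorm(0.975) / sqrt(length(pl))         # approximate 95% significance band
l   <- max(c(1, which(abs(a$acf[-1]) > sig)))  # last significant lag, at least 1
l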

  • varblock

‘varblock’ is a boolean that allows you to sample with variably sized block lengths. If replacement=TRUE, the sampling is based on the tsboot() function in the boot package, which you will note is a dependency for running mcsim(). The tsboot() function was the perfect solution for sampling time series, but it only allows for sampling with replacement, which should give a sense of the general preference from a statistical perspective. When ‘varblock’ is set to TRUE together with ‘replacement’, the resulting block lengths are sampled from a geometric distribution with mean ‘l’. If replacement=FALSE, the distribution of block lengths is uniformly random around ‘l’.
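
A minimal, self-contained illustration of the kind of call involved (this is not the internal mcsim() code, just a sketch of boot::tsboot with geometrically distributed block lengths):

library(boot)
set.seed(42)
pl <- rnorm(252)                                  # stand-in daily PL series
b  <- tsboot(pl, statistic = function(x) sum(x),  # final PL of each resampled path
             R = 1000, l = 5, sim = "geom")       # variable blocks, mean length 5
head(b$t)                                         # resampled final PL values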

  • gap

For strategies which require leading market data before computing indicators on which to base signals and subsequent rules, users can use the ‘gap’ parameter to specify how many periods from the strategy initialization date to start the simulations. For longtrend we use a gap of 10 since it uses an SMA lookback period of 10 (monthly) observations.

  • CI

CI or Confidence Interval is the parameter used to specify the confidence level to display in the hist() function which charts a range of sample statistics for the strategy. Default is 95%.

  • cashPL

This parameter is used when running mcsim() on post-trade cash PL, and refers to a separate xts object created by the user.

 

Output

There are currently four S3 methods related to mcsim(), all of which return some portion of the ‘mcsim’ object as either a plot, a hist or a table (summary and quantile). For each method you can choose to set normalize=TRUE or FALSE. This simply defines the return space your simulations operate in, either cash or percent. Depending on the strategy duration and your reinvestment assumptions, there could be pros and cons to either approach. The distributions should be fairly similar in any event. In addition to ‘normalize’ there is also a ‘methods’ parameter for hist(), which lets you specify the sample statistics to return in histogram plots. There are currently 5 stats, namely: ‘mean’, ‘median’, ‘stddev’, ‘maxDD’ and ‘sharpe’. The default is to return all methods as 5 separate histogram plots. This may be inappropriate when setting replacement=FALSE, since all sample statistics barring maxDD will be identical (in the case of cash returns).

[Note: there are numerous benefits to using S3 methods, including that calls to plot() and hist() dispatch the appropriately defined methods for mcsim objects. A total of 20 slots are returned with a call to mcsim() and can be analysed directly in your R environment; the help file covers each of them.]

Looking at the first example in the help file, we see the following call to mcsim() with replacement=FALSE, assigned to ‘nrsim’ for ‘no-replacement simulation’:

nrsim = mcsim("longtrend", "longtrend", n=1000, replacement=FALSE, l=1, gap=10, CI=0.95)

 

plot()

Calling plot() on the mcsim assigned object ‘nrsim’ returns the below xts plot (which will differ slightly on each simulation):

plot(nrsim, normalize=TRUE)

[Figure: plot(nrsim, normalize=TRUE)]

The same plot() call with normalize=FALSE yields a plot that converges to the same final PL.

plot(nrsim, normalize=FALSE)

[Figure: plot(nrsim, normalize=FALSE)]

The reason the normalized plot does not converge to a single final number is purely a function of how mcsim() carries out normalization on the replicates in the cash return space. Since sampling without replacement means sequences are not perfectly independent, using percent returns without replacement may be a useful medium between cash returns with and without replacement, which fan out and converge respectively. Below is an example of the same call, this time with replacement:

wrsim = mcsim("longtrend", "longtrend", n=1000, replacement=TRUE, l=1, gap=10, CI=0.95)
plot(wrsim, normalize=FALSE)

[Figure: plot(wrsim, normalize=FALSE)]

 

hist()

From the range of histogram plots mcsim() generates we can infer expectations of the distribution of values for the 5 methods: mean, median, stddev, maxDD and sharpe. These are based on the frequency of the ‘use’ parameter, except for maxDD. If use='equity', for example, then the frequency is daily, i.e. the distribution of daily mean PL, median PL, stddev of PL and daily quasi-Sharpe (we say quasi as the formula is not a true implementation of the original Sharpe ratio). Using user-specified confidence intervals (and thanks to the Central Limit Theorem) it is possible to visualize the range of values that can be expected from the backtest for the sample statistics in question. In most instances the distribution will presumably be centered around the backtest. However, in the case of the max drawdown, the simulation mean and the backtest could differ more significantly. Let's look at some output from the longtrend demo (the 'rsim' object below is a with-replacement simulation, analogous to 'wrsim' above).

hist(rsim, normalize=FALSE, methods="mean")

[Figure: hist(rsim, normalize=FALSE, methods="mean")]

hist(rsim, normalize=FALSE, methods="stddev")

[Figure: hist(rsim, normalize=FALSE, methods="stddev")]

hist(rsim, normalize=FALSE, methods="maxDD")

[Figure: hist(rsim, normalize=FALSE, methods="maxDD")]

hist(rsim, normalize=FALSE, methods="sharpe")

[Figure: hist(rsim, normalize=FALSE, methods="sharpe")]

It is evident from the above charts that the Monte Carlo simulations result in an approximately normal distribution of the sample statistics (thanks to the Central Limit Theorem). In all cases excluding maxDD the distribution is centered around the mean, with the range of values for a 95% confidence interval quite clear to view. In the case of maxDD, the backtest has a max drawdown of 30k, whilst the simulation mean is closer to 40k. A strategy with any form of risk management should have a better max drawdown than the average of the replicates, which are randomized, re-ordered sequences of PL observations. In Tomasini & Jaekle (2009) the authors infer that the 99th percentile max drawdown could be an expected outcome for your strategy with a 1% probability. Assuming your backtest employs a risk management strategy, I would argue this probability is well below 1%. Nevertheless, viewing the results with this level of conservatism may be prudent, especially considering the likely presence of many other biases not accounted for.

 

print() – a recent update from Brian tidies up the output by rounding to 3 decimals. Note that the simulation below differs from the results generated in the charts above, as the update was made after this post was first published.

To interrogate the values from the charts more directly, a call to print() (which summarizes the output from the summary.mcsim() S3 method) will suffice. Below is an example with the subsequent output (note all sample stat outputs are returned):

print(rsim, normalize=FALSE)

[Output: print(rsim, normalize=FALSE)]

We can now see the exact values for the sample statistic metrics.

 

 

 

Monte Carlo and mcsim() limitations and disadvantages of Portfolio PL simulations 

As with any model there are assumptions. In the case of Monte Carlo analysis using mcsim(), one of the most significant assumptions is that returns follow a Gaussian (normal) distribution. Depending on the distribution of the sample statistic of interest, you may be able to propose a more suitable distribution. The impact of ‘fat tail’ events also should not be underestimated, with one of the world’s earliest and most successful hedge funds succumbing to these limitations.

Another limitation of Monte Carlo analysis is the sample from which it simulates may be overfit or overly optimized for in-sample out-performance. It is the job of the analyst to manage this risk though, and hopefully with the recently added quantstrat functions degfreedom(), deflated.Sharpe() and haircut.Sharpe() the analyst will be more empowered to manage this risk. For a useful demonstration of their use in quantstrat see the macdParameters demo.

Sampling from portfolio PL may also be unrealistic, since no consideration is given to market regimes and strategy-specific stylized facts such as the individual positions’ time in and out of the market. Where this data is available, however, the analyst has another recently developed tool at their disposal, namely txnsim(). This will be the subject of my next post.

Thanks for reading.

Thanks must also go to my co-author, Brian Peterson, without whose work and commitment none of what I have written would be possible. Brian’s invaluable insights related to this and so many other topics have benefited the industry to no end. Another legend worth a mention is Joshua Ulrich, who never gave up on me (or never gave away that he had) w.r.t. my supposed git ignorance. The R/Finance open source community is strong and I am honoured to be a part of it.

R-view: Backtesting – Harvey & Liu (2015)

In this post I take an R-view of “Backtesting – Harvey & Liu (2015).” The authors propose an alternative to the commonly practiced 50% discount that is applied to reported Sharpe ratios when evaluating backtests of trading strategies. The reason for the discount is the inevitable data mining inherent in research, both past and present. Harvey & Liu (HL) propose a multiple hypothesis testing framework with a more stringent requirement for determining a test’s significance level compared with single tests. HL suggest that the single hypothesis test used by most researchers leads to too many false discoveries (Type I errors). This makes intuitive sense if you consider that the probability of finding a significant factor increases with the number of factors tested or the number of tests conducted.

To give an example quoted in Harvey & Liu – Evaluating Trading Strategies (which you may have heard before), imagine you receive a mail from a manager promoting a particular stock. The mail says to track the manager’s recommendations over time. The recommendation is either to go long or short. The manager is correct 10 weeks in a row with his recommendations. The probability of such a feat you figure is very small, in fact a 0.09765% probability of being a false discovery (0.5^10 = 0.0009765625) based on single hypothesis testing. But what if the manager sent a similar mail to 100k people telling half to go long and the other half to go short the same stock. Every week the manager trims half the people off his mailing list, the half for whom the recommendation did not work. Every week for 10 weeks the list is trimmed. By the end of week 10 a total of 97 lucky people have been the recipients of these perfect picks. The results were random, and the recipients would have been fooled.
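
The arithmetic behind the example:

0.5 ^ 10        # probability of 10 correct calls in a row for one recipient
#> [1] 0.0009765625
100000 / 2 ^ 10 # recipients still on the list after 10 weekly halvings
#> [1] 97.65625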

In multiple hypothesis testing the challenge is to guard against false discoveries. HL argue that the appropriate “haircut Sharpe ratio” is non-linear, in that the highest Sharpe ratios (SR’s) are only moderately penalized whilst marginal SR’s more so. The implication is that high SR’s are more likely true discoveries in a multiple hypothesis testing framework.

If you would like to jump straight to my R implementation it is at the end of this post. The authors make their Matlab code for Exhibits 5 & 6 available here. I simply converted it to R and added some extra bits for illustrative purposes. [As of Aug ’17 these functions are available in quantstrat as haircut.Sharpe and profit.hurdle]

HL mention 5 caveats to their framework, namely:

  • Sharpe ratios may not be appropriate metrics for strategies with negatively skewed expected payoffs, such as option strategies.
  • Sharpe ratios normalize returns based on their volatility (ie. market risk), which may not be the most appropriate reflection of risk for a strategy.
  • Determining the appropriate significance level for multiple testing (where in single tests 5% is the normal cutoff).
  • Which multiple testing method you choose could yield different conclusions. HL proposes 3 methods together with an average.
  • Lastly, and I would say most sensitive of all, is the number of tests used.

Linking Sharpe ratio with t-statistic

To explain the link between the Sharpe ratio and the t-statistic and the application of a multiple testing p-value adjustment, HL use the simplest case of an individual investment strategy. Assume a null hypothesis that the strategy’s mean return is zero, tested against a two-sided alternative that the mean return is different from zero. A strategy can be regarded as profitable if its mean return is on either side of zero, since investors can generally go long or short. Since returns will at least be asymptotically normally distributed (thanks to the Central Limit Theorem), a t-statistic following a t-distribution can be constructed and tested for significance. However, due to the link between the Sharpe ratio and the t-stat, it is possible to assess the significance of a strategy’s excess returns directly using the Sharpe ratio. Assume \hat{\mu} denotes the mean of your sample of historical returns (daily or weekly etc.) and \hat{\sigma} denotes the standard deviation; then

t-statistic = \frac{ \hat{\mu}}{(\hat{\sigma}/\sqrt{T})}

where T-1 is degrees of freedom and since

\widehat{SR} =  \hat{\mu}/\hat{\sigma}

it can be shown that

\widehat{SR} = t-statistic/\sqrt{T}

By implication a higher Sharpe ratio equates to a higher t-ratio, implying a higher significance level (lower p-value) for an investment strategy. If we denote p-value of the single test as {p^s} then we can present the p-value of the single test as

{p^s} = Pr( |r| > t-ratio)

or

{p^s} = Pr( |r| > \widehat{SR}.\sqrt{T})

Now if the researcher was exploring a particular economic theory then this p-value might make sense, but what if the researcher has tested multiple strategies and presents only the most profitable one? In this case the p-value of the single test may severely overstate the actual significance. A more truthful p-value would be an adjusted multiple testing p-value which assuming we denote as {p^m} could be represented as

{p^m} = Pr( max{|{r_i}|, i = 1,...,N} > t-ratio)

or

{p^m} = 1 - (1 - {p^s}{)^N}

By equating the p-value of a single test to a multiple test p-value we get the defining equation of {p^m} which is

{p^m} = Pr( {|{r_i}|} > \widehat{SR}.\sqrt{T})

Using the example in HL, assuming a strategy with 20 years of monthly returns (T=240), an annual Sharpe ratio (SR) of 0.75, you would have a p-value for the single test of

{p^s} = Pr( |r| > t-ratio)

{p^s} \approxeq 0.0008

To compute the single p-value and the multiple testing p-value assuming researchers have tried N = 200 strategies, you can use the below R code to compute the adjusted or “haircut” Sharpe ratio;

# Set the variables
N = 200 # Number of strategies tested
T = 240 # Number of observations or degrees of freedom
SR_ann = 0.75 # Annualised Sharpe ratio
# Compute the single test p-value
p_single = 2 * (1-pnorm(SR_ann*sqrt(T/12)))
# Compute the multiple test p-value
p_multiple = 1 - (1 - p_single) ^ N
# Compute haircut Sharpe ratio (HSR)
z_stat = qt(1 - p_multiple/2, N-1)
HSR = (z_stat/sqrt(T)) * sqrt(12)
# Sharpe ratio haircut
haircut = (SR_ann - HSR)/SR_ann

The code should output an HSR of 0.325 (HL yields 0.32 due to rounding the single and multiple p-values) which implies a haircut to the original Sharpe ratio of 56.6% (HL report a 60% haircut). The example above is no longer appropriate where t-statistics are dependent. More on this later.

Multiple Multiple-Testing methods

HL mentions 3 well known adjustment methods in the statistics literature, which are originally prescribed in the paper “…and the Cross-Section of Expected Returns” by Harvey, Liu and Zhu. These are Bonferroni, Holm, and Benjamini, Hochberg, and Yekutieli (BHY).

Bonferroni’s adjustment:

{p_i^{Bonferroni}} = min[M * {p_i}, 1]

Bonferroni applies the same adjustment to the p-value of each test, inflating the p-value by the number of tests. The multiple testing p-value is the minimum of each inflated p-value and 1 where 1 (or 100% if you prefer) is the upper bound of probability. HL use the example of p-values from 6 strategies where the p-values are (0.005, 0.009, 0.0128, 0.0135, 0.045, 0.06). According to a 5% significance cutoff the first 5 tests would be considered significant. Using the p.adjust function in R we can get the multiple adjusted p-values and according to Bonferroni only the first test would be considered significant.

# Set the variables
p_values = c(0.005, 0.009, 0.0128, 0.0135, 0.045, 0.06)
# Assign to adj.p_values the output of p.adjust()
adj.p_values = p.adjust(p_values, "bonferroni")
# Print
adj.p_values

[Output: Bonferroni-adjusted p-values]

Holm’s adjustment:

p-value adjustments can be divided into two categories, namely: single-step and sequential. Single-step corrections adjust all p-values equally, as in Bonferroni. Sequential adjustments are an adaptive procedure based on the distribution of p-values. Sequential methods gained prominence after a seminal paper by Schweder & Spjotvoll (1982), and section 7.3 of this paper gives a useful example of an application of multiple testing hypothesis diagnostic plotting in R. Holm is an example of a sequential multiple testing procedure. For Holm, the equivalent adjusted p-value is

{p_{(i)}^{Holm}} = min[ max_{j \leq i}((M - j + 1)*{p_{(j)}}), 1]

Using the same 6 strategies illustrated above, we can use p.adjust() to compute the Holm adjusted p-values.

# Run p.adjust() for Holm
HOLM_adj.p_values = p.adjust(p_values, "holm")
HOLM_adj.p_values

[Output: Holm-adjusted p-values]

We see in the output above that the first two tests are significant at the 5% cutoff, compared to only one with Bonferroni. This is because Bonferroni adjusts all single tests equally, whereas Holm applies a sequential approach. It should therefore not surprise you that adjusted Sharpe ratios under Bonferroni will be lower than under Holm. At this point it is useful to mention that both Holm and Bonferroni attempt to prevent even one Type I error occurring, controlling what is called the family-wise error rate (FWER). The next adjustment proposed by HL is BHY, and the main difference from the previous two adjustment methods is that BHY attempts to control the false discovery rate (FDR), implying more lenience than Holm and Bonferroni and therefore expected to yield higher adjusted Sharpe ratios.

BHY – (Benjamini, Hochberg, and Yekutieli):

BHY’s formulation of the FDR can be represented as follows. First, all p-values are sorted in descending order, and the adjusted p-value sequence is defined by pairwise comparisons.

[Figure: BHY adjusted p-value formulation]

c(M) is defined as follows

c(M) = {\sum_{j=1}^M} \frac{1}{j}

making c(M) = 2.45 for the example of M=6 strategies (i.e. the sum of 1/1 + 1/2 + 1/3 + 1/4 + 1/5 + 1/6). Using p.adjust() we can see that there are now 4 significant strategies at the 5% cutoff. The output from p.adjust() below is slightly different to HL’s BHY implementation, as p.adjust() does not equate {p_{(6)}^{BHY}} with the least significant p-value, 0.06.
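
For reference, computing c(M) for M = 6 in R:

M <- 6
c_M <- sum(1 / (1:M))
round(c_M, 2)
#> [1] 2.45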

# Run p.adjust() for BHY
BHY_adj.p_values = p.adjust(sort(p_values, decreasing = TRUE), "BY")
BHY_adj.p_values

[Output: BHY-adjusted p-values]

So to summarise these methods: BHY leads to 4 significant discoveries versus Holm’s 2 and Bonferroni’s 1. We expect BHY to be more lenient, as it controls the false discovery rate, whereas Holm and Bonferroni control the family-wise error rate, trying to avoid making even one false discovery. Bonferroni is more stringent than Holm since it is a single-step adjustment versus Holm’s sequential approach. With these 3 methods HL adjust p-values to account for multiple testing and then convert them into haircut Sharpe ratios, thereby controlling for data mining. For both Holm and BHY you need the empirical distribution of p-values from previously tried strategies. Harvey, Liu and Zhu (HLZ) model over 300 risk factors documented in the finance literature. However, using this model for the distribution of p-values is not complete, since many tried strategies would not have been documented (referred to as publication bias), and the documented ones are potentially correlated, thereby violating the requirement for independence between tests. HLZ propose a new distribution to overcome these shortfalls.

Harvey, Liu and Zhu (HLZ)

As previously mentioned HLZ study over 300 factors tested for explaining the cross section of return patterns. They publish a list of their sources here in spreadsheet format. Just looking at the history of published factors they show that factor discovery increases every decade.

[Figure: History of factor discoveries, HLZ]

Using the t-statistics published with those papers (assuming they are economically and statistically sound), HLZ perform the three multiple testing procedures described above (Bonferroni, Holm and BHY). They conclude that, under the assumption that all tried factors are published, an appropriate minimum threshold t-statistic for 5% significance is 2.8, which equates to a p-value of 0.50% for a single test. Whilst the assumption that all tried factors are published is not realistic, HLZ argue that the analysis still provides a minimum threshold for accepting the significance of future tests.
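The mapping from a t-statistic of 2.8 to the quoted single-test p-value can be verified with a one-liner (using the normal approximation for a two-sided test):

# Two-sided p-value implied by a t-statistic of 2.8
2 * pnorm(-2.8)  # ~0.0051, i.e. roughly the 0.50% quoted above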

HLZ limit their factor sample to unique risk factors so as to keep test dependence to a minimum, a requirement of the three multiple testing procedures described above. But the requirements for being published are fairly stringent, and published results are most likely limited to tests that showed significant t-statistics. HLZ estimate that 71% of tried tests are not published (see Appendix B of HLZ for details); based on this, together with their implementation of the three adjustment procedures, HLZ propose a benchmark t-statistic of 3.18. Intuitively this is greater than the 2.8 obtained under the assumption that all tried factors were published.

What about test dependence, i.e. correlation among test statistics?

This will be the case where the factors tested model similar forms of risk and are therefore correlated; think of the multitude of price-multiple factors. Assuming, in the extreme, that all factors are perfectly correlated, no adjustment to the single-test p-value would be required: adjusting would penalise Type I errors (false discoveries) too harshly at the cost of increasing the Type II error rate (missed discoveries). In multiple testing, the statistics literature attempts to account for correlation through various methods, including resampling the entire dataset. In finance research, bootstrap procedures are used to assess the significance of individual tests and to infer fund manager skill versus chance; indeed, this is what we are aiming to achieve with mcsim() and txnsim() in the R blotter package. However, since HLZ do not in most cases have access to the datasets underlying the published factors, they instead infer a distribution from the published t-statistics. They suggest a mixture model in which a fraction of strategies have a true mean return of zero (the nulls, whose observed returns are just normally distributed noise), while the remainder have true mean returns drawn from an exponential distribution. This makes economic sense if one assumes that highly profitable strategies are scarce. Using some statistical machinery (Section 4 and Appendix A in HLZ, which I hope to address more specifically in a future post), HLZ propose model parameter estimates as follows:

[Table: HLZ model parameter estimates]
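To make the two-part mixture concrete, below is a rough sketch of simulating t-statistics from such a model. The parameter values (the proportion of true nulls, the exponential mean, the volatility and the sample length) are illustrative placeholders only, not HLZ's estimates:

# Simulate t-statistics from a null/exponential mixture (placeholder parameters)
set.seed(123)
M      <- 10000   # number of hypothetical strategy tests
p0     <- 0.6     # placeholder: proportion of tests whose true mean return is zero
lambda <- 0.005   # placeholder: mean of the exponential for non-null monthly returns
sigma  <- 0.03    # placeholder: monthly return volatility
N      <- 240     # months of data per test

true_null <- runif(M) < p0
mu <- ifelse(true_null, 0, rexp(M, rate = 1 / lambda))

# Observed sample means are normal noise around the true means; convert to t-stats
t_stat <- (mu + rnorm(M, 0, sigma / sqrt(N))) / (sigma / sqrt(N))
hist(t_stat, breaks = 100, main = "t-statistics from a null/exponential mixture")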

Using the baseline assumptions for the number of unobserved tests, one can use Harvey & Liu's code in Exhibits 5 and 6 (replicated in R below) to determine the appropriate haircut Sharpe ratio (HSR) or the required mean return for a given significance level, both taking into account an assumed number of tests.
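As a rough illustration of the mechanics (and emphatically not the Exhibit 5 replication referenced below), the haircut calculation boils down to: observed Sharpe ratio, to t-statistic, to p-value, to multiple-testing-adjusted p-value, and back to an adjusted Sharpe ratio. The sketch below treats the strategy as the most significant of M assumed tests, for which Holm and Bonferroni coincide; the function name and defaults are assumptions for illustration:

# A stripped-down sketch of the haircut Sharpe ratio idea
haircutSketch <- function(SR_annual, years, freq = 12, M = 300) {
  N  <- years * freq                      # number of return observations
  SR <- SR_annual / sqrt(freq)            # per-period Sharpe ratio
  t  <- SR * sqrt(N)                      # t-statistic of the mean return
  p  <- 2 * pnorm(-abs(t))                # single-test, two-sided p-value

  p_adj  <- min(M * p, 1)                 # Bonferroni/Holm for the smallest p-value
  t_adj  <- qnorm(1 - p_adj / 2)          # adjusted t-statistic
  SR_adj <- t_adj / sqrt(N) * sqrt(freq)  # back to an annualised Sharpe ratio

  c(SR = SR_annual, SR_adjusted = SR_adj,
    haircut = (SR_annual - SR_adj) / SR_annual)
}

# e.g. an annualised Sharpe of 1.0 from 10 years of monthly data, assuming 300 prior tests
haircutSketch(SR_annual = 1.0, years = 10)

The full Exhibit 5 replication goes further, simulating the whole set of test statistics from the HLZ model (hence the need for sample_random_multests()), so that Holm and BHY can be applied properly rather than via the rank-1 shortcut above.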

Conclusion

Data mining is unavoidable and in some respects represents knowledge gain. Accounting for it when measuring statistical significance is a requirement. Judgement will have to be exercised when choosing significance levels and deciding which multiple testing technique is most appropriate. Bonferroni and Holm are suited to mission-critical applications such as medical research or space missions, where preventing even a single false positive is desirable. Where the application is not a matter of life or death, controlling the rate of false discoveries may be more appropriate. Perhaps this applies to finance; perhaps not. Either way, the requirements for statistical significance of trading strategies are becoming more stringent over time. This is natural: most of the true factors are likely to have been discovered already, and the cost of mining data has diminished and continues to do so at an impressive rate. In other fields, such as particle physics and genomics, tests can require the equivalent of five standard deviations to be considered significant, and finance is headed in the same direction as more factors are published. Assuming the widely used Sharpe ratio is an applicable metric for evaluating your new strategy, you would be well served to consider its haircut equivalent.

References

R implementation of Exhibit 5 (Haircut_SR.m and sample_random_multests.m)

R implementation of Exhibit 6 (Profit_Hurdle.m)

Please note you will need the sample_random_multests() function from above to run the Profit_Hurdle R code

Quantitative Strategy Development Overview – Brian Peterson

I have had the pleasure of getting to know and work with Brian Peterson of late, building out the blotter::mcsim function in the blotter package. I will be writing soon about this function and where it is headed, but in this post I wanted to share a presentation Brian gave to the CapeR User Group last week on Developing and Backtesting Systematic Trading Strategies with an application in R, in particular using the quantstrat and blotter packages. Enjoy!

 

Link to the slides here!

Block Bootstrapped Monte Carlo – in R


A few weeks back I wrote a post including the source code for a Monte Carlo simulation function in R. The idea was to randomly sample the daily returns produced by a backtest and build confidence-interval bands from the middle 50% and 90% of simulated return streams. Since then Brian Peterson got in touch, asking if I would work with him to standardise some form of Monte Carlo simulation functionality within the R blotter package. The blotter package forms part of the TradeAnalytics project on R-Forge (which includes quantstrat), where the project description reads…

“Transaction-oriented infrastructure for defining instruments, transactions, portfolios and accounts for trading systems and strategy simulation. Provides portfolio support for multi-asset class and multi-currency portfolios. Still in heavy development.”

Whilst working with Brian, he suggested a few enhancements to the function in the original post, one of which was to include the option to perform a block bootstrap of the original backtest in order to capture any autocorrelation effects. That leaves two options as far as existing R packages go, namely the meboot and boot packages.

 

MEBOOT

Now meboot is a really neat package which I found out about thanks to this post by James Picerno over at The Capital Spectator. It removes the requirement for stationarity in the time series being resampled. Below is an example of what a meboot plot looks like, using the same simSample file I used in the previous post with 1,000 samples (or replicates, as the meboot package calls them).

[Plot: meboot replicates of simSample]

Looking at the above chart, I am not sure it is the most appropriate tool for the job, since the sampling is done within a very narrow range, giving large swings which tend to revert back to the original backtest. To make sure, though, I also ran the meboot() function over the backtest results generated by the "longtrend" demo strategy script in the blotter package. Again, the samples track the original backtest quite closely.

[Plot: meboot replicates of the longtrend backtest]

Whether or not meboot is the best tool for the job will ultimately come down to what the sample statistics say and which metrics you use to determine whether or not to proceed with a strategy into production. Either way it may be a useful option to include in any standardized Monte Carlo simulation functionality within blotter.
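For reference, a minimal sketch of generating maximum-entropy replicates with meboot looks something like the following; the return series here is simulated purely as a stand-in for a backtest's daily returns:

# Maximum-entropy bootstrap of an equity curve
library(meboot)

set.seed(7)
rets   <- rnorm(250, 0.0005, 0.01)   # stand-in for daily backtest returns
equity <- cumprod(1 + rets)

reps <- meboot(equity, reps = 1000)  # reps$ensemble holds one replicate per column

matplot(reps$ensemble, type = "l", lty = 1, col = adjustcolor("grey", 0.3),
        xlab = "Day", ylab = "Equity")
lines(equity, col = "black", lwd = 2)  # overlay the original series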

 

BOOT

The boot package includes functions and datasets for bootstrapping from "Bootstrap Methods and Their Application" by A. C. Davison and D. V. Hinkley (1997, CUP). Of particular interest for time series bootstrapping is the tsboot() function, whose purpose the package documentation on CRAN describes as follows: "Generate R bootstrap replicates of a statistic applied to a time series. The replicate time series can be generated using fixed or random block lengths or can be model based replicates."
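A minimal fixed-block example might look like this (the return series and the Sharpe ratio statistic are illustrative stand-ins); note the boot.array() call, which, as discussed below, is needed to recover the resample indices:

# Block bootstrap of a return series with tsboot()
library(boot)

set.seed(42)
rets <- rnorm(250, 0.0005, 0.01)                   # stand-in for daily backtest returns
sharpe <- function(x) mean(x) / sd(x) * sqrt(252)  # annualised Sharpe ratio statistic

tb <- tsboot(rets, statistic = sharpe, R = 1000, l = 5, sim = "fixed")
tb$t0       # Sharpe ratio of the original series
head(tb$t)  # Sharpe ratios of the bootstrap replicates

# The replicate series themselves are not stored; recover their index paths
idx <- boot.array(tb, indices = TRUE)
replicate_1 <- rets[idx[1, ]]  # first resampled return series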

Now the tsboot object returned by the function does not include the samples themselves; instead you have to call the boot.array() function to get the sample indexing. Off the bat, though, I wasn't sure the tsboot() function was appropriate for simply block sampling a time series, albeit it very elegantly and efficiently achieves just that on top of calculating a statistic for each sampled series. So I spent some time debugging the function to learn how it performs the block bootstrapping, and managed to come up with my own function, which produces the graph below after sampling simSample 1,000 times (R = 1000) using a block length of 5 (l = 5) [full source code below the graph].

[Plot: mcblocksim() block-bootstrapped replicates of simSample]

Your choice of block length (l) can be anything (as long as it is less than the number of periods in your backtest), and how long you make your blocks could be based on a number of factors, including the average length of time your strategy is in the market or the average period over which your strategy experiences autocorrelation. Remember, the whole purpose of adding block bootstrap functionality is to account for autocorrelation characteristics in your backtest.
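As an illustration of the basic idea (this is not the mcblocksim() source itself), a bare-bones block resampler over a vector of daily returns might look like this, with the function name and defaults being assumptions:

# Minimal block bootstrap of daily returns: R replicates with block length l
blockResample <- function(rets, R = 1000, l = 5) {
  n        <- length(rets)
  n_blocks <- ceiling(n / l)
  sapply(seq_len(R), function(i) {
    # sample block start points with replacement, then stitch the blocks together
    starts <- sample(seq_len(n - l + 1), n_blocks, replace = TRUE)
    idx    <- as.vector(sapply(starts, function(s) s:(s + l - 1)))[seq_len(n)]
    cumprod(1 + rets[idx])  # cumulative equity curve of the replicate
  })
}

samples <- blockResample(rnorm(250, 0.0005, 0.01))
matplot(samples, type = "l", lty = 1, col = adjustcolor("grey", 0.3),
        xlab = "Day", ylab = "Cumulative return")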

So, whilst the above mcblocksim() function is likely an improvement on the mcsimr() function I wrote about in my previous post, there is still quite a lot of work to do to get it ready for testing and committing to the blotter package. Particular to-do items include:

  • Revisiting the tsboot() function and determining whether or not some form of statistic can help users identify the optimal number of samples required
  • Potentially using the drawdowns() function in the PerformanceAnalytics package as a statistic in the tsboot() function
  • Applying confidence intervals to the samples, thereby eliminating outlying returns (as in mcsimr())
  • Adding any other strategy-specific noise to the samples

So lots to do, but as they say, Rome wasn't built in a day. Hopefully you find some use for the above source code in the meantime.

Thanks to Brian Peterson for his valuable input thus far.

 

Thanks for reading.

 

A Monte Carlo Simulation function for your back-test results – in R

In this post on bettersystemtrader.com, Andrew Swanscott interviews Kevin Davey from KJ Trading Systems, who discusses why looking at your backtest's historical equity curve alone might not give you a true sense of a strategy's risk profile. Kevin Davey also writes on the topic here for futuresmag.com. So I wrote a Monte Carlo-type simulation function (in R) to see graphically how my backtest results would have looked had the order of returns been randomized. I use 'n' to denote the number of simulations I would like to run, and exclude the highest and lowest 5% of return streams to form the blue ribbon. For the middle red ribbon I use the middle 50% of return streams (i.e. excluding the first and last quartiles). The second and last parameter (default TRUE) is a boolean which, if TRUE, runs the sampling procedure with replacement, and without replacement if set to FALSE. See this link if you need an intuitive explanation of what sampling with and without replacement means. Below is an example of the output for 10k simulations, with and without replacement, for a random set of returns on a hypothetical strategy spanning 7 months. The black line is the actual historical equity curve. Whether or not the return distributions satisfy your risk appetite will come down to a more refined drawdown analysis, together with the host of other risk metrics available for consideration. At least with the function below you have the stats in a dataframe and a graph to start with…the starting point for any statistical analysis.

[Plots: mcsimr() output without and with replacement]

I read the data into R using the read.csv() function. You can use the same data here (simSample.xlsx); just be sure to change the file type to CSV first, as WordPress only allows storing .xls and .xlsx Excel files. Below is the source code to add the function to your environment. I load the required packages, then create the mcsimr() function to do the import, manipulate the data and build the graph using ggplot. Remember to set your working directory to the directory in which you have stored the simSample.csv file.

If you would like to run the simulation 1,000 times without replacement, then the function should read:

mcsimr(1000, FALSE)

If you would like to run with replacement, include TRUE for the second argument or leave it blank, as the default is TRUE.

mcsimr(1000)

OK, below is the full source code.
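(The complete function is embedded as a GitHub gist. As a simplified sketch of the approach it takes, resampling the daily returns with or without replacement, building cumulative equity curves and plotting the middle 50% and 90% ribbons around the historical curve, something along the following lines would work; the file name, column layout and ribbon quantiles are assumptions rather than the original's exact choices.)

# Simplified sketch of the simulation logic (not the gist itself)
library(ggplot2)

mcsim_sketch <- function(n = 1000, replacement = TRUE, file = "simSample.csv") {
  rets <- read.csv(file)[, 1]   # assumed: a single column of daily returns
  days <- length(rets)

  # n resampled equity curves (days x n matrix)
  eq <- sapply(seq_len(n), function(i)
    cumprod(1 + sample(rets, days, replace = replacement)))

  df <- data.frame(
    day    = seq_len(days),
    actual = cumprod(1 + rets),
    lo90   = apply(eq, 1, quantile, 0.05),
    hi90   = apply(eq, 1, quantile, 0.95),
    lo50   = apply(eq, 1, quantile, 0.25),
    hi50   = apply(eq, 1, quantile, 0.75))

  ggplot(df, aes(day)) +
    geom_ribbon(aes(ymin = lo90, ymax = hi90), fill = "blue", alpha = 0.3) +
    geom_ribbon(aes(ymin = lo50, ymax = hi50), fill = "red", alpha = 0.4) +
    geom_line(aes(y = actual), colour = "black") +
    labs(x = "Day", y = "Equity")
}

# mcsim_sketch(1000, FALSE)   # 1,000 simulations without replacement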

I am aiming to have the function code on GitHub soon [EDIT: The above is hosted (with love) by GitHub; however, enhancements are still a work in progress], hopefully with some enhancements allowing the user to specify an xts object for the raw data (i.e. backtest results) as well as letting the user decide which percentages to use for the bands. Some drawdown stats would be useful too. I have built something which incorporates a drawdown analysis of every 'n' return stream; however, it needs some more work to speed it up, as you can imagine 10k xts objects each containing 500+ days of daily return data could take some time to analyse individually.

Lastly, I wonder if something like this could fit into the quantstrat or TradeAnalytics packages…or if there already is something similar, do let me know. [EDIT: Since publishing this post, Brian Peterson (maintainer of multiple finance R packages such as ReturnAnalytics, quantmod and quantstrat) got in touch and asked if I would work with him on getting this function into the blotter package, a dependency of quantstrat. Brian has also suggested a few improvements to the function, including using a block bootstrap to account for autocorrelation and plot.xts as opposed to ggplot for faster graphical rendering. So look out for an updated post on that.]

Happy hacking!

 

How I used 18 lines of R code to crack the Triangle Challenge

I recently came across a question that required logic and coding skills to solve. It went a little something like this:

Let there be a stick of length 1. Pick 2 points uniformly at random along the stick, and break the stick at those points. What is the probability of the three resulting pieces being able to form a triangle?

Interesting, I thought. I do enjoy these types of challenges, and had fun trying to solve this problem. It took me a while to understand that the longest side (or one of the longest sides, in the case of an isosceles triangle) had to be smaller than the sum of the two smaller sides. In pseudo code, a triangle would be possible IF:

Length of longest side < sum of other 2 sides, which is true IF

Length of longest side < 0.5 (half the length of the stick)

When I pictured (in my mind's eye) breaking a stick in half and folding it so that both pieces overlap, it became obvious that I needed one of the pieces to be longer, and then to break that longer piece into two pieces of any length, to form a would-be triangle. As a side note, I have been hoping to create my first gif for a while, and this might be a useful one: animating a stick being broken in half, folded over, then one half growing in length and breaking in two to form a triangle.

I turned to the statistical computing language R to solve my problem, knowing I could randomly draw values from a uniform distribution with the runif() function.
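As a sketch of the simulation idea (not the 18-line script from the post itself):

# Monte Carlo estimate of the probability that the three pieces form a triangle
set.seed(1)
n <- 1e6
u <- runif(n); v <- runif(n)     # two uniform break points on a unit stick
a <- pmin(u, v); b <- pmax(u, v)
x <- a; y <- b - a; z <- 1 - b   # the three resulting pieces
# a triangle is possible iff the longest piece is shorter than 1/2
mean(pmax(x, y, z) < 0.5)        # converges to 0.25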

Continue reading “How i used 18 lines of R code to crack the Triangle Challenge”