This is the story of a subtle error that, to my opinion, is a nice example of the special challenges of statistical programming. One of my main research interests is time series with long memory. These are often modeled by fractionally integrated models, where

Here is the time series, is between zero and one, is the lag operator defined so that and . Details on fractional differences can be found on Wikipedia or in the original paper of Hosking (1981). Essentially the fractional differencing filter defines an infinite order autoregressive model where the coefficients are a function of the memory parameter .

To generate a fractionally integrated series, we can bring the fractional differencing filter to the other side:

An important special case of this is the random walk model where . In this case for all and it is usually assumed that $\varepsilon_t=0$ for all . Then the model reduces to

Using the same stream of random number , it should therefore be possible to generate the exact same random walk in R using *cumsum()* from the base package as well as *diffseries()* from the package fracdiff with . But, as a former colleague demonstrated to me a while ago, this is not the case.

```
rm(list=ls())
set.seed(54321)
library(fracdiff)
series<-rnorm(1000)
a<-diffseries(series,d=-1)
b<-cumsum(series)
ts.plot(a, ylim=c(min(a,b),max(a,b)))
lines(b, col=2)
legend(x="bottomleft", col=c(1,2), lty=c(1,1), bty="n", legend=c("RW generated with cumsum()", "RW generated with diffseries()"))
```

As one can see from the graph above, the two series *a* and *b* that should be identical diverge faster from each other the longer the series becomes.

Recently I was reminded of this curious behavior when I was trying to implement some new statistical procedures for a research paper and they refused to work until I replaced *diffseries()* with the function *fast_fracdiff()*, that was proposed as a faster alternative for fractional differencing by Jensen and Nielsen (2014) and makes use of the convolution theorem.

```
fast_fracdiff <- function(x, d){
iT <- length(x)
np2 <- nextn(2*iT - 1, 2)
k <- 1:(iT-1)
b <- c(1, cumprod((k - d - 1)/k))
dx <- fft(fft(c(b, rep(0, np2 - iT))) * fft(c(x, rep(0, np2 - iT))), inverse = T) / np2;
return(Re(dx[1:iT]))
}
c<-fast_fracdiff(series, d=-1)
ts.plot(a, ylim=c(min(a,b),max(a,b)))
lines(b, col=2)
lines(c, col=3)
legend(x="bottomleft", col=c(1,2,3), lty=c(1,1,1), bty="n", legend=c("RW generated with cumsum()", "RW generated with diffseries()", "RW generated with fast_fracdiff()"))
```

As one can see, using *fast_fracdiff()* which can be found on the university webpage of Morten Nielsen with produces exactly the same series as *cumsum()*, as one would have expected from the beginning. The green lines representing the random walk generated using *fast_fracdiff()* lies directly above the red one that represents the series obtained using *cumsum()*, so that the latter is not visible. Now why does *diffseries()* behave differently? Lets have a look at the function.

```
diffseries <- function(x, d)
{
x <- as.data.frame(x)
names(x) <- "series"
x 1)
stop("only implemented for univariate time series")
if (any(is.na(x)))
stop("NAs in x")
n = 2)
x <- x - mean(x)
PI <- numeric(n)
PI[1] <- -d
for (k in 2:n) {
PI[k] <- PI[k-1]*(k - 1 - d)/k
}
ydiff <- x
for (i in 2:n) {
ydiff[i] <- x[i] + sum(PI[1:(i-1)]*x[(i-1):1])
}
## return numeric!
ydiff
}
```

As one can see above, in the line *x<-x-mean(x)* the inpust series is de-meaned prior to the fractional differentiation. This is because in a model with non-zero means, we have

where is the expected value of . However, this produces some unwanted behavior, since the fractionally differenced series returned by *diffseries()* always has a mean of zero.

```
x<-fracdiff.sim(n=1000, d=0.4)$series+10
mean(x)
```

```
## [1] 9.702761
```

```
y<-diffseries(x,0.4)
z<-fast_fracdiff(x,0.4)
mean(y)
```

```
## [1] -0.004302338
```

```
mean(z)
```

```
## [1] 0.6807775
```

But what causes the behavior in the first graph above? Why do the series drift apart? Here the input series is the series of innovations . Since these are standard normal, we have . That means has a standard deviation of .

If we use *diffseries()* to integrate the white noise sequence, these are demeaned before they are integrated.

As you can see from the last equation, this produces a random walk with drift, where the drift parameter is given by the mean of the of the innovations.

What do we lean from this? The de-meaning in *diffseries()* will probably not cause problems in most use cases. However, I certainly think it is problematic since the expected value in the formula above is the expected value after fractional differentiation. Therefore using *fast_fracdiff()* instead of *diffseries()* seems to be advantageous beyond the speed gains.

But there is another point to make here. It is obvious that *diffseries()* is written with the differentiation in mind. Using it to integrate – even though it should be theoretically possible – goes beyond the use cases that the developers had in mind. Errors caused by using the function this way are hard to spot, since the generated series is still a random walk and the slight drift is hard to spot unless it is compared with the *cumsum()* function. They could go unnoticed for years. This highlights very nicely the extra degree of care that is necessary for statistical programming.

## Refrences

Hosking, J. R. (1981). Fractional differencing. Biometrika, 165-176.

Jensen, A. N., & Nielsen, M. Ø. (2014). A fast fractional difference algorithm. Journal of Time Series Analysis, 35(5), 428-436.