Tuesday, August 19, 2014

Time Series Analysis in R

##################################################################
## Second Order Summaries
##################################################################
The theory for time series is based on the assumption of second-order stationarity after removing any trends (which will include seasonal trends). A time series is said to be stationary if the mean, variance, and lag covariance are constant over time. There are three methods of dealing with non-stationary time series listed above: differencing, using log transformations, using sinusoidal components.

Checking the acf functions to explore the auto-covariance and auto-correlation, make sure a series with a non-unit frequency.

##################################################################
## Checking acf and pacf
##################################################################
par(mfrow = c(1, 2), pty="s")
acf(Sample$embroidered_top)
pacf(Sample$embroidered_top)


The plots of the above series show the pattern typical of seasonal series, and the auto-correlation do not damp down for large lags.

How to identify an AR(p) process
- The ACF tails off exponentially
- The PACF drop to 0 after lag p, where p is the order of the AR process
- If the ACF tails off very slowly, the time series may not be stationary.

How to identify an MA(q) process
- The ACF drops to 0 after lag q, where q is the order of the MA process
- The PACF tail of exponentially

##################################################################
## Spetral analysis
##################################################################

The basic tool in estimating the spectral density is the Periodogram. Checking Periodogram, we effectively compute the squared correlation between the series and sine/cosine waves of frequency. And also, using repeated smoothing with modified Daniell smoothers, which are moving averages giving half weight to the end values of the span. The bandwidth is a measure of the size of smoothing window. Trial-and-error is needed to choose the spans. The spans should be odd integers, and it helps to produce a smooth plot if they are different and at least two are used.

spectrum(Sample$embroidered_top)

## Cumulative periodogram
par(mfrow = c(1, 1), pty="s")
cpgram(logtraindt)


##################################################################
## Fit ARIMA models
##################################################################
ARIMA models are best for short-term forecasting because long-term forecasting is no better than using the mean of the time series.

## split train and test data set
train <- Sample[1:(nrow(Sample)-4)]
test <- Sample[(nrow(Sample)-3):nrow(Sample)]

## transformation
logtraindt <- log(1+train$embroidered_top)/log(100)
logtestdt <- log(1+test$embroidered_top)/log(100)

## AR model investigation
## All likelihood assume a Gaussian distribution. The model with the smallest AIC is chosen.
ar1 <- ar(logtraindt, aic=T, order.max = 10); summary(ar1)
cpgram(ar1$resid, main="AR Max Model")

## case 1
fit <- arima(ts, order=c(1,0,0), list(order=c(2,1,0), period=5))
fore <- predict(fit, n.ahead=15)
# error bounds at 95% confidence level
U <- fore$pred + 2*fore$se
L <- fore$pred - 2*fore$se
ts.plot(ts, fore$pred, U, L, col=c(1,2,4,4), lty = c(1,1,2,2))
legend("topleft", c("Actual", "Forecast", "Error Bounds (95% Confidence)"), col=c(1,2,4), lty=c(1,1,2))

## case 2
## grid search for p,d,q
#for (i in 1:4){
#for (j in 1:4){
for (k in 1:4){
  arima1 <- arima(logtraindt, order=c(3,1,k))
  print(arima1$aic)
}
arima1 <- arima(logtraindt, order=c(4,4,3))
tsdiag(arima1)
arima1.fore <- predict(arima1, 8)
ts.plot(c(train$embroidered_top,arima1.fore$pred))
ts.plot(arima1.fore$pred,
        arima1.fore$pred - 2*arima1.fore$se,
        arima1.fore$pred + 2*arima1.fore$se)


##################################################################
## Plot ARIMA model Prediction
##################################################################
train$Date = as.Date(train$Date, "%Y-%m-%d")
pred.dt <- train$Date[length(train$Date)] + seq(1:8)*7
plot_dt=c(train$Date, pred.dt)
plot_pd=c(train$embroidered_top, test$embroidered_top, rep(NA,4))
plot_pd2=c(rep(NA, length(train$embroidered_top)),exp(arima1.fore$pred*log(100))-1)
plot_data=data.table(plot_dt, plot_pd, plot_pd2)
n=nrow(plot_data)
plot_data=plot_data[(n-52):n]

ggplot(plot_data, aes(x=plot_dt, y=plot_pd)) +
  geom_point(aes(x=plot_dt, y=plot_pd), col = "blue", size = 1) +
  geom_line(aes(x=plot_dt, y=plot_pd), size=1, linetype=1, color="blue") +
  geom_line(aes(x=plot_dt, y=plot_pd2), size=1, linetype=4, color="darkgreen") +
  geom_vline(xintercept = as.integer(as.Date("2016-10-23", "%Y-%m-%d")), linetype=4) +
  scale_x_date(labels=date_format("%b %y"), breaks=date_breaks("12 week")) +
  stat_smooth(color="red") +
  xlab("Search Index") +
  ylab("Date") +
  ggtitle("Forcast of 8 Weeks from ARIMA for Embroidered Top") +
  theme(plot.title = element_text(face = "bold", size = 20)) +
  theme(axis.text.x = element_text(face = "bold", size = 14)) +
  theme(axis.text.y = element_text(face = "bold", size = 14)) +
  theme(axis.title.x = element_text(face = "bold", size = 16)) +
  theme(strip.text.x = element_text(face = "bold", size = 16)) +
  theme(axis.title.y = element_text(face = "bold", size = 16, angle=90)) +
  guides(fill=FALSE)


##################################################################
### Exponential smoothing
#### Fit in to exponential model
##################################################################
ts.mail <- ts(data$count)
ts.mail.forecast <- HoltWinters(ts.mail , beta=FALSE, gamma=FALSE)
ts.mail.forecast
plot(ts.mail.forecast)
ts.mail.forecast2 <- forecast.HoltWinters(ts.mail.forecast, h=100)
plot(ts.mail.forecast2)

##################################################################
## Double Exponentila with Linear Trends
##################################################################
  mHolt <- holt(tslogcnt)
  summary(mHolt)
  plot(mHolt)
  attributes(mHolt)
  mHolt$fitted

  fit1 <- ses(tslogcnt, h=5)
  fit2 <- holt(tslogcnt, h=5)
  fit3 <- holt(tslogcnt, exponential=TRUE, h=5)
  fit4 <- holt(tslogcnt, damped=TRUE, h=5)
  fit5 <- holt(tslogcnt, exponential=TRUE, damped=TRUE, h=5)

  plot(fit3, plot.conf=FALSE)
  lines(fit1$mean,col=2)
  lines(fit2$mean,col=3)
  lines(fit4$mean,col=5)
  lines(fit5$mean,col=6)
  legend("topleft", lty=1, pch=1, col=1:6,
         c("Data", "SES", "Holt's", "Exponential", "Additive Damped", "Multiplicative Damped"))
}


##################################################################
## Stationary Test
##################################################################
## stl function to extract components
stl_dau <- stl(ts_dau, s.window="periodic", robust=T)
names(stl_dau)
ts.plot(stl_dau$time.series)

## stationary test
## In ADF test, P-value less that 0.05 to reject the null hypothesis to conclude stationary time series
adf.test(ts_dau)

## de-trend, de-seasonaly
stl_dau <- stl(ts_dau, s.window="periodic", robust=T)
head(stl_dau$time.series)
ts_dau_dt <- ts_dau - stl_dau$time.series[,2]
adf.test(ts_dau_dt)
ts_dau_sa <- ts_dau - stl_dau$time.series[,1]- stl_dau$time.series[,2]
adf.test(ts_dau_sa)

## lags, leads, ACF, PACF and CCF
ts_dau_lag1 <- dplyr::lag(ts_dau_sa, 1)
ts_dau_lead1 <- dplyr::lead(ts_dau_sa, 1)
acf_result <- Acf(ts_dau_sa)
pacf_result <- Pacf(ts_dau_sa)
## ccf(ts1, ts2)

##################################################################
## State Space Model
##################################################################
## tbats
ts_dau <- ts(log(data[,amount]), frequency=365.25, start=c(2015,1,17))
daily_data <- tbats(ts_dau) 
summary(daily_data)
t.forecast <- forecast(daily_data, h=14)
plot(t.forecast)
t.forecast
summary(t.forecast)
Box.test(t.forecast$residuals)

No comments:

Post a Comment