Tuesday, August 19, 2014

Time Series Analysis in R

##################################################################
## Second Order Summaries
##################################################################
The theory for time series is based on the assumption of second-order stationarity after removing any trends (including seasonal trends). A time series is said to be stationary if its mean, variance, and autocovariance at each lag are constant over time. Three common ways of dealing with a non-stationary series are differencing, applying a log transformation, and modelling and removing sinusoidal (seasonal) components.
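As a minimal sketch of the first two approaches (assuming x is a numeric weekly series such as Sample$embroidered_top; the variable names are illustrative):

## illustrative only: differencing and log transformation to remove trend and stabilize variance
x <- Sample$embroidered_top            # assumed weekly search counts
x_diff  <- diff(x)                     # first difference removes a slowly varying trend
x_sdiff <- diff(x, lag = 52)           # seasonal difference, assuming a 52-week period
x_log   <- log1p(x)                    # log(1 + x) stabilizes the variance of counts
plot.ts(cbind(x, x_diff = c(NA, x_diff), x_log = x_log))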

Use the acf and pacf functions to explore the auto-covariance and auto-correlation; make sure the series has been created with its proper (non-unit) frequency so that seasonal lags are labelled correctly.
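For example, a weekly series could be declared as follows (a sketch; the 52-week frequency is an assumption about this data set):

top_ts <- ts(Sample$embroidered_top, frequency = 52)   # assumed weekly data with yearly seasonality
frequency(top_ts)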

##################################################################
## Checking acf and pacf
##################################################################
par(mfrow = c(1, 2), pty="s")
acf(Sample$embroidered_top)
pacf(Sample$embroidered_top)


The plots of the above series show a pattern typical of seasonal series: the auto-correlations do not damp down for large lags.

How to identify an AR(p) process
- The ACF tails off exponentially.
- The PACF drops to 0 after lag p, where p is the order of the AR process.
- If the ACF tails off very slowly, the time series may not be stationary.

How to identify an MA(q) process (a short simulation illustrating both patterns follows this list)
- The ACF drops to 0 after lag q, where q is the order of the MA process.
- The PACF tails off exponentially.
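A small simulation sketch of these two signatures (simulated data only, not the project series):

set.seed(1)
sim_ar <- arima.sim(list(ar = c(0.6, 0.3)), n = 200)   # AR(2) process
sim_ma <- arima.sim(list(ma = c(0.6, 0.3)), n = 200)   # MA(2) process
par(mfrow = c(2, 2))
acf(sim_ar);  pacf(sim_ar)    # ACF tails off, PACF cuts off after lag 2
acf(sim_ma);  pacf(sim_ma)    # ACF cuts off after lag 2, PACF tails off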

##################################################################
## Spectral analysis
##################################################################

The basic tool for estimating the spectral density is the periodogram. The periodogram effectively computes the squared correlation between the series and sine/cosine waves at each frequency. The raw periodogram is then smoothed, typically by repeated smoothing with modified Daniell smoothers, which are moving averages giving half weight to the end values of the span. The bandwidth is a measure of the size of the smoothing window. Trial and error is needed to choose the spans: they should be odd integers, and it helps to produce a smooth plot if at least two different spans are used.

spectrum(Sample$embroidered_top)
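A hedged example of the smoothing described above (the span values are only a starting point for trial and error):

spectrum(Sample$embroidered_top, spans = c(3, 5))   # repeated modified Daniell smoothers; the bandwidth is shown on the plot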

## Cumulative periodogram (cpgram is in the MASS package)
par(mfrow = c(1, 1), pty="s")
cpgram(logtraindt)   ## logtraindt is the log-transformed training series defined in the ARIMA section below


##################################################################
## Fit ARIMA models
##################################################################
ARIMA models are best suited to short-term forecasting; at long horizons the forecasts revert toward the mean of the series, so they are no better than using the mean.

## split train and test data set
train <- Sample[1:(nrow(Sample)-4)]
test <- Sample[(nrow(Sample)-3):nrow(Sample)]

## transformation: log(1 + x) rescaled by log(100), mapping a 0-100 search index roughly onto [0, 1]
logtraindt <- log(1+train$embroidered_top)/log(100)
logtestdt <- log(1+test$embroidered_top)/log(100)

## AR model investigation
## All likelihoods assume a Gaussian distribution. The model with the smallest AIC is chosen.
ar1 <- ar(logtraindt, aic=T, order.max = 10); summary(ar1)
cpgram(ar1$resid, main="AR Max Model")
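To see what the AIC actually selected (the ar object stores the chosen order and the AIC differences):

ar1$order   # order chosen by AIC
ar1$aic     # AIC values relative to the best-fitting order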

## case 1: seasonal ARIMA on a generic series x (placeholder for any ts object)
fit <- arima(x, order=c(1,0,0), seasonal=list(order=c(2,1,0), period=5))
fore <- predict(fit, n.ahead=15)
## approximate 95% error bounds: prediction +/- 2 standard errors
U <- fore$pred + 2*fore$se
L <- fore$pred - 2*fore$se
ts.plot(x, fore$pred, U, L, col=c(1,2,4,4), lty=c(1,1,2,2))
legend("topleft", c("Actual", "Forecast", "Error Bounds (95% Confidence)"), col=c(1,2,4), lty=c(1,1,2))

## case 2
## grid search over p and q (d fixed at 1), printing the AIC of each candidate order
for (p in 1:4) {
  for (q in 1:4) {
    arima_pq <- try(arima(logtraindt, order = c(p, 1, q)), silent = TRUE)
    if (!inherits(arima_pq, "try-error")) cat("order (", p, ", 1,", q, ") AIC:", arima_pq$aic, "\n")
  }
}
arima1 <- arima(logtraindt, order=c(4,4,3))
tsdiag(arima1)
arima1.fore <- predict(arima1, 8)
ts.plot(c(train$embroidered_top,arima1.fore$pred))
ts.plot(arima1.fore$pred,
        arima1.fore$pred - 2*arima1.fore$se,
        arima1.fore$pred + 2*arima1.fore$se)
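The held-out test set is never checked above; a minimal sketch of scoring the first four forecasts against the four test weeks on the transformed scale:

rmse <- sqrt(mean((arima1.fore$pred[1:4] - logtestdt)^2))   # RMSE on the log scale
mae  <- mean(abs(arima1.fore$pred[1:4] - logtestdt))        # MAE on the log scale
c(RMSE = rmse, MAE = mae)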


##################################################################
## Plot ARIMA model Prediction
##################################################################
train$Date = as.Date(train$Date, "%Y-%m-%d")
pred.dt <- train$Date[length(train$Date)] + (1:8)*7   # 8 weekly dates after the last training week
plot_dt=c(train$Date, pred.dt)
plot_pd=c(train$embroidered_top, test$embroidered_top, rep(NA,4))   # pad with NA beyond the 4 test weeks
plot_pd2=c(rep(NA, length(train$embroidered_top)),exp(arima1.fore$pred*log(100))-1)
plot_data=data.table(plot_dt, plot_pd, plot_pd2)
n=nrow(plot_data)
plot_data=plot_data[(n-52):n]

ggplot(plot_data, aes(x=plot_dt, y=plot_pd)) +
  geom_point(aes(x=plot_dt, y=plot_pd), col = "blue", size = 1) +
  geom_line(aes(x=plot_dt, y=plot_pd), size=1, linetype=1, color="blue") +
  geom_line(aes(x=plot_dt, y=plot_pd2), size=1, linetype=4, color="darkgreen") +
  geom_vline(xintercept = as.integer(as.Date("2016-10-23", "%Y-%m-%d")), linetype=4) +
  scale_x_date(labels=date_format("%b %y"), breaks=date_breaks("12 week")) +
  stat_smooth(color="red") +
  xlab("Search Index") +
  ylab("Date") +
  ggtitle("Forcast of 8 Weeks from ARIMA for Embroidered Top") +
  theme(plot.title = element_text(face = "bold", size = 20)) +
  theme(axis.text.x = element_text(face = "bold", size = 14)) +
  theme(axis.text.y = element_text(face = "bold", size = 14)) +
  theme(axis.title.x = element_text(face = "bold", size = 16)) +
  theme(strip.text.x = element_text(face = "bold", size = 16)) +
  theme(axis.title.y = element_text(face = "bold", size = 16, angle=90)) +
  guides(fill=FALSE)


##################################################################
### Exponential smoothing
#### Fit in to exponential model
##################################################################
library(forecast)
ts.mail <- ts(data$count)
## simple exponential smoothing: no trend (beta=FALSE), no seasonality (gamma=FALSE)
ts.mail.forecast <- HoltWinters(ts.mail, beta=FALSE, gamma=FALSE)
ts.mail.forecast
plot(ts.mail.forecast)
ts.mail.forecast2 <- forecast(ts.mail.forecast, h=100)
plot(ts.mail.forecast2)

##################################################################
## Double Exponential Smoothing with Linear Trend
##################################################################
## tslogcnt: the log-transformed count series (a ts object)
mHolt <- holt(tslogcnt)
summary(mHolt)
plot(mHolt)
attributes(mHolt)
mHolt$fitted

## compare simple, Holt, exponential-trend and damped-trend smoothing
fit1 <- ses(tslogcnt, h=5)
fit2 <- holt(tslogcnt, h=5)
fit3 <- holt(tslogcnt, exponential=TRUE, h=5)
fit4 <- holt(tslogcnt, damped=TRUE, h=5)
fit5 <- holt(tslogcnt, exponential=TRUE, damped=TRUE, h=5)

plot(fit3, plot.conf=FALSE)
lines(fit1$mean, col=2)
lines(fit2$mean, col=3)
lines(fit4$mean, col=5)
lines(fit5$mean, col=6)
legend("topleft", lty=1, pch=1, col=1:6,
       c("Data", "SES", "Holt's", "Exponential", "Additive Damped", "Multiplicative Damped"))
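One way to compare these fits, as a sketch using accuracy() from the forecast package (in-sample error only, so it favours the more flexible models):

smooth_fits <- list(SES = fit1, Holt = fit2, Exponential = fit3,
                    AdditiveDamped = fit4, MultiplicativeDamped = fit5)
sapply(smooth_fits, function(f) accuracy(f)[1, "RMSE"])   # training RMSE of each model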


##################################################################
## Stationary Test
##################################################################
## stl function to extract components
stl_dau <- stl(ts_dau, s.window="periodic", robust=T)
names(stl_dau)
ts.plot(stl_dau$time.series)

## stationarity test (adf.test is in the tseries package)
## In the ADF test, a p-value below 0.05 rejects the null hypothesis of a unit root, suggesting the series is stationary
adf.test(ts_dau)

## de-trend and de-seasonalize using the stl components (column 1 = seasonal, column 2 = trend)
stl_dau <- stl(ts_dau, s.window="periodic", robust=T)
head(stl_dau$time.series)
ts_dau_dt <- ts_dau - stl_dau$time.series[,2]                            # remove trend
adf.test(ts_dau_dt)
ts_dau_sa <- ts_dau - stl_dau$time.series[,1] - stl_dau$time.series[,2]  # remove seasonal and trend
adf.test(ts_dau_sa)
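A complementary quick check is ndiffs() and nsdiffs() from the forecast package, which estimate how many regular and seasonal differences are needed for stationarity:

ndiffs(ts_dau)    # suggested number of first differences
nsdiffs(ts_dau)   # suggested number of seasonal differences (requires a seasonal ts object)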

## lags, leads, ACF, PACF and CCF
ts_dau_lag1 <- dplyr::lag(ts_dau_sa, 1)
ts_dau_lead1 <- dplyr::lead(ts_dau_sa, 1)
acf_result <- Acf(ts_dau_sa)
pacf_result <- Pacf(ts_dau_sa)
## ccf(ts1, ts2)

##################################################################
## State Space Model
##################################################################
## tbats (forecast package); ts_dau is daily data starting on day 17 of 2015 (2015-01-17)
ts_dau <- ts(log(data[,amount]), frequency=365.25, start=c(2015,17))
daily_data <- tbats(ts_dau)
summary(daily_data)
t.forecast <- forecast(daily_data, h=14)
plot(t.forecast)
t.forecast
summary(t.forecast)
Box.test(t.forecast$residuals)

Thursday, August 7, 2014

Time Series Analysis - Multivariate Autoregressive State-Space Models

memory.limit()
memory.size(max = TRUE)
rm(list=ls(all=TRUE))
sessionInfo()

require(data.table)
require(stringr)
require(lubridate)
require(scales)

require(ggplot2)
require(gridExtra)
require(ggthemes)

require(MARSS)

## Read in individual data;
setwd("~/Desktop/R/KLAB/11_18_2016/individual")

## Copy results from one directory to current directory
dir.create(file.path("/Users/tkmaemd/Desktop/R/KLAB/11_18_2016/individual", "results"))
file.copy(from="/Users/tkmaemd/Desktop/R/KLAB/11_18_2016/batch/results",
          to="/Users/tkmaemd/Desktop/R/KLAB/11_18_2016/individual",
          recursive = TRUE, copy.mode = TRUE)

## Read-in raw data
files <- list.files(pattern = "\\.csv$")
files
temp <- lapply(files, fread, sep=",")
data <- rbindlist( temp )
dim(data)


tmp=data[theme=='Dresses' & category=='Details']
Sample = dcast.data.table(tmp, Date + theme + category ~ keyword,value.var="ankle jeans")
names(Sample)=str_replace_all(names(Sample), " ", "_")
Sample <- Sample[, theme:=NULL]
Sample <- Sample[, category:=NULL]
Sample <- Sample[-1]

n=nrow(Sample)
dat = Sample[(n-51):n]
dat = dat[, Date:=NULL]

## model 1: one hidden state and i.i.d. observations - R is diagonal and equal
## Fit the single-state model, where the time series are assumed to come from the same population.
model1=list()
model1$R="diagonal and equal"
model1$Z=matrix(1,8,1) # 8 rows and 1 column: all eight series load on one hidden state
model1$A="scaling" # the default

## reference example with two signals ('bb' is not defined in this script, so these lines are left commented)
#kemfit = MARSS(bb, model=list(Z=matrix(1,2,1), R="diagonal and equal"))  ## 2 is used because that example has 2 signals
#print(kemfit$model)
#summary(kemfit$model)
kem1 = MARSS(t(dat), model=model1)
kem1$AIC
kem1$AICc

# model 2: Fit the single-state model, where the time series come from the same population but have different observation variances
model2=list()
model2$R="diagonal and unequal"
model2$Z=matrix(1,8,1) # 8 rows and 1 column
model2$A="scaling" # the default

kem2 = MARSS(t(dat), model=model2)
kem2$AIC
kem2$AICc

output <- Sample[(n-51):n]
output$Date <- as.Date(output$Date,"%Y-%m-%d")
b <- t(kem2$states)   ## b is the combined index (the estimated hidden state)
output$COMBINED <- round(100*b/max(b))
colors=rainbow(8)
ggplot(output, aes(x=Date, y=COMBINED, group = 1)) +
  geom_line(aes(x=Date, y=COMBINED), col = 'red', size = 2, linetype=1) +
  geom_line(aes(x=Date, y=bandana_print_dress), col = colors[1], size = 1, linetype=4) +
  geom_line(aes(x=Date, y=bohemian_dress), col =  colors[2], size = 1, linetype=4) +
  geom_line(aes(x=Date, y=embroidered_dress), col =  colors[3], size = 1, linetype=4) +
  geom_line(aes(x=Date, y=floral_dress), col =  colors[4], size = 1, linetype=4) +
  geom_line(aes(x=Date, y=fringe_dress), col =  colors[5] , size = 1, linetype=4) +
  geom_line(aes(x=Date, y=paisley_print_dress), col = colors[6], size = 1, linetype=4) +
  geom_line(aes(x=Date, y=ruffle_dress), col =  colors[7], size = 1, linetype=4) +
  geom_line(aes(x=Date, y=tie_dye_dress), size=1, linetype=4, color= colors[8]) +
  scale_x_date(labels=date_format("%b %y"), breaks=date_breaks("4 week")) +
  xlab("Search Index") +
  ylab("Date") +
  ggtitle("Combined 8 Index of 52 Weeks from STATE SPACE Model for Dress Detail Category") +
  theme(plot.title = element_text(face = "bold", size = 20)) +
  theme(axis.text.x = element_text(face = "bold", size = 14)) +
  theme(axis.text.y = element_text(face = "bold", size = 14)) +
  theme(axis.title.x = element_text(face = "bold", size = 16)) +
  theme(strip.text.x = element_text(face = "bold", size = 16)) +
  theme(axis.title.y = element_text(face = "bold", size = 16, angle=90)) +
  guides(fill=FALSE)


# model 3: Fit a model where each series has its own population process, with the same process parameters
model3=list()
model3$Q="diagonal and equal"
model3$R="diagonal and equal"
model3$U="equal"
model3$Z="identity"
model3$A="zero"
kem3 = MARSS(t(dat), model=model3)
kem3$AIC
kem3$AICc
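
A quick side-by-side of the three fits (smaller AICc is preferred):

data.frame(model = c("one state, equal R", "one state, unequal R", "eight states, equal Q"),
           AICc  = c(kem1$AICc, kem2$AICc, kem3$AICc))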