Friday, September 26, 2014

MCMC in R

Random Variable Generation

Monte Carlo methods rely on the possibility of producing a supposedly endless flow of random variables from well-known or new distributions. Such simulation is, in turn, based on the production of uniform random variables on the interval (0, 1).

# check the generator
Nsim = 10^4
x=runif(Nsim)
x1=x[-Nsim]
x2=x[-1]
par(mfrow=c(1,3))
hist(x)
plot(x1,x2)
acf(x)

# Inverse Transform
The Probability Integral Transform states that a random variable from any continuous distribution can be converted, through its own CDF, into a standard uniform random variable. Run in reverse, applying the inverse CDF (the quantile function) to a Uniform(0, 1) draw produces a draw from the target distribution.
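As a small illustration of the general recipe (an added toy example, not part of the original notes): if F is a continuous CDF with quantile function F^-1, then F^-1(U) with U ~ Uniform(0, 1) has distribution F; in R the quantile functions qnorm, qexp, etc. play the role of F^-1.

# inverse-CDF recipe: qnorm(runif(n)) has the same distribution as rnorm(n)
U=runif(10^4)
Z=qnorm(U)
par(mfrow=c(1,2))
hist(Z,freq=F)
hist(rnorm(10^4),freq=F)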

# generate exponential
Nsim = 10^4
U=runif(Nsim)
X=-log(U)
Y=rexp(Nsim)
par(mfrow=c(1,2))
hist(X,freq=F)
hist(Y,freq=F)

# generate Chisq(6): twice the sum of three Exp(1) draws is Gamma(3, rate 1/2), i.e. Chisq(6)
U=runif(3*10^4)
U=matrix(data=U, nrow=3)
X=-log(U)
X=2*apply(X,2,sum)
Y=rchisq(Nsim,6)
par(mfrow=c(1,2))
hist(X,freq=F)
hist(Y,freq=F)

# generate normal by Box-Muller
U=runif(2*10^4)
U=matrix(data=U, ncol=2)
X1=sqrt(-2*log(U[,1]))*cos(2*pi*U[,2])
X2=sqrt(-2*log(U[,1]))*sin(2*pi*U[,2])
Y=rnorm(10^4)
par(mfrow=c(1,3))
hist(X1,freq=F)
hist(X2,freq=F)
hist(Y,freq=F)

# Poisson for large lambda
Nsim=10^4
lambda=100
spread=3*sqrt(lambda)
t=round(seq(max(0,lambda-spread),lambda+spread,1))   # grid of likely values
prob=ppois(t,lambda)                                 # tabulated CDF on the grid
X=rep(0,Nsim)
for (i in 1:Nsim){
  u=runif(1)
  X[i]=t[1]+sum(prob<u)   # smallest grid value whose CDF reaches u (inverse-CDF lookup)
}
hist(X,freq=F)

# Accept-Reject Algorithm to generate a Beta with a Uniform candidate
1 Generate a candidate Y from the Uniform(0,1) proposal and an independent U ~ Uniform(0, M), where M bounds the target density
2 Accept X = Y only if U < f(Y), i.e. the candidate passes the test against the target Beta(a, b) density f

Nsim=2500
a=2.7
b=6.3
M=2.67
u=runif(Nsim, max=M)
y=runif(Nsim)
x=y[u<dbeta(y,a,b)]
hist(x,freq=F,breaks=100)

U = runif(100000,0,1)
accept = c()
for(i in 1:length(U)){
  U1 = runif(1, 0, 1)
  # accept U[i] when M*g(U[i])*U1 <= f(U[i]), with proposal g = Uniform(0,1), M = 3 and target f = Beta(6,3)
  if(dunif(U[i], 0, 1)*3*U1 <= dbeta(U[i], 6, 3)) {
    accept[i] = 'Yes'
  } else {
    accept[i] = 'No'
  }
}
T = data.frame(U, accept = factor(accept, levels= c('Yes','No')))
hist(T[,1][T$accept=='Yes'], breaks = seq(0,1,0.01), freq = FALSE, main = 'Histogram of X', xlab = 'X')

Monte Carlo Integration

The validity of Monte Carlo approximations relies on the Law of Large Numbers; their versatility comes from the fact that almost any integral can be written as an expectation. We apply the Law of Large Numbers for the convergence of the sample mean and the Central Limit Theorem to assess its error.
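A minimal sketch of the idea (a toy example added here): estimate the integral of h(x) = x^2 over (0, 1), i.e. E[h(U)] = 1/3, by the running sample mean of h at uniform draws, with an approximate CLT error band.

# Monte Carlo estimate of integral_0^1 x^2 dx = 1/3
Nsim=10^4
u=runif(Nsim)
hx=u^2
est=cumsum(hx)/(1:Nsim)                   # running estimate
err=sqrt(cumsum((hx-est)^2))/(1:Nsim)     # running standard error (rough approximation)
plot(est,type="l",xlab="Iterations",ylab="Estimate")
abline(h=1/3,col="red")
lines(est+2*err,lty=2)
lines(est-2*err,lty=2)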


Friday, September 19, 2014

View the source code for a function in R

The S3 method dispatch system
> methods(t)
[1] t.data.frame t.default    t.ts*    

   Non-visible functions are asterisked
> methods(class="ts")
 [1] aggregate.ts     as.data.frame.ts cbind.ts*        cycle.ts*    
 [5] diffinv.ts*      diff.ts          kernapply.ts*    lines.ts      
 [9] monthplot.ts*    na.omit.ts*      Ops.ts*          plot.ts      
[13] print.ts         time.ts*         [<-.ts*          [.ts*        
[17] t.ts*            window<-.ts*     window.ts*    

   Non-visible functions are asterisked
"Non-visible functions are asterisked" means the function is not exported from its package's namespace. You can still view its source code via the ::: function, or by using getAnywhere(). getAnywhere() is useful because you don't have to know which package the function came from.
> getAnywhere(t.ts)
A single object matching ‘t.ts’ was found
It was found in the following places
  registered S3 method for t from namespace stats
  namespace:stats
with value

function (x)
{
    cl <- oldClass(x)
    other <- !(cl %in% c("ts", "mts"))
    class(x) <- if (any(other))
        cl[other]
    attr(x, "tsp") <- NULL
    t(x)
}
<bytecode: 0x294e410>
<environment: namespace:stats>
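Since getAnywhere reports the namespace, the same definition can also be printed directly with the ::: operator:

> stats:::t.ts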

The S4 method dispatch system

The S4 system is a newer method dispatch system and is an alternative to the S3 system. Here is an example of an S4 function:
> library(Matrix)
Loading required package: lattice
> chol2inv
standardGeneric for "chol2inv" defined from package "base"

function (x, ...)
standardGeneric("chol2inv")
<bytecode: 0x000000000eafd790>
<environment: 0x000000000eb06f10>
Methods may be defined for arguments: x
Use  showMethods("chol2inv")  for currently available ones.

The output already offers a lot of information. standardGeneric is an indicator of an S4 function, and the output helpfully tells you how to see the defined S4 methods:

> showMethods(chol2inv)
Function: chol2inv (package base)
x="ANY"
x="CHMfactor"
x="denseMatrix"
x="diagonalMatrix"
x="dtrMatrix"
x="sparseMatrix"

getMethod can be used to see the source code of one of the methods:

> getMethod("chol2inv", "diagonalMatrix")
Method Definition:

function (x, ...)
{
    chk.s(...)
    tcrossprod(solve(x))
}
<bytecode: 0x000000000ea2cc70>
<environment: namespace:Matrix>

Signatures:
        x            
target  "diagonalMatrix"
defined "diagonalMatrix"

Some generics have methods with more complex (multi-argument) signatures, for example

require(raster)
showMethods(extract)
Function: extract (package raster)
x="Raster", y="data.frame"
x="Raster", y="Extent"
x="Raster", y="matrix"
x="Raster", y="SpatialLines"
x="Raster", y="SpatialPoints"
x="Raster", y="SpatialPolygons"
x="Raster", y="vector"

To see the source code for one of these methods the entire signature must be supplied, e.g.
getMethod("extract" , signature = c( x = "Raster" , y = "SpatialPolygons") )

Functions that call compiled code

Note that "compiled" does not refer to byte-compiled R code as created by the compiler package. The <bytecode: 0x294e410> line in the above output indicates that the function is byte-compiled, and you can still view the source from the R command line.

Functions that call .C, .Call, .Fortran, .External, .Internal, or .Primitive are calling entry points in compiled code, so you will have to look at sources of the compiled code if you want to fully understand the function. Packages may use .C, .Call, .Fortran, and .External; but not .Internal or .Primitive, because these are used to call functions built into the R interpreter.

Calls to some of the above functions may use an object instead of a character string to reference the compiled function. In those cases, the object is of class "NativeSymbolInfo", "RegisteredNativeSymbol", or "NativeSymbol"; and printing the object yields useful information. For example, optim calls .External2(C_optimhess, res$par, fn1, gr1, con) (note that's C_optimhess, not "C_optimhess"). optim is in the stats package, so you can type stats:::C_optimhess to see information about the compiled function being called.

Compiled code in a package

If you want to view compiled code in a package, you will need to download/unpack the package source. The installed binaries are not sufficient. A package's source code is available from the same CRAN (or CRAN compatible) repository that the package was originally installed from. The download.packages() function can get the package source for you.

download.packages(pkgs = "Matrix",
                  destdir = ".",
                  type = "source")

This will download the source version of the Matrix package and save the corresponding .tar.gz file in the current directory. Source code for compiled functions can be found in the src directory of the uncompressed and untarred file. The uncompressing and untarring step can be done outside of R, or from within R using the untar() function. It is possible to combine the download and expansion steps into a single call (note that only one package at a time can be downloaded and unpacked in this way):

untar(download.packages(pkgs = "Matrix",
                        destdir = ".",
                        type = "source")[,2])

Alternatively, if the package development is hosted publicly (e.g. via GitHub, R-Forge, or RForge.net), you can probably browse the source code online.

Compiled code in a base package

Certain packages are considered "base" packages. These packages ship with R and their version is locked to the version of R. Examples include base, compiler, stats, and utils. As such, they are not available as separate downloadable packages on CRAN as described above. Rather, they are part of the R source tree in individual package directories under /src/library/. How to access the R source is described in the next section.
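(For example, once you have the R sources, the stats package's R and compiled code lives under src/library/stats/R/ and src/library/stats/src/.)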

Compiled code built into the R interpreter

If you want to view the code built-in to the R interpreter, you will need to download/unpack the R sources; or you can view the sources online via the R Subversion repository or Winston Chang's github mirror.

Uwe Ligges's R news article (PDF) (p. 43) is a good general reference of how to view the source code for .Internal and .Primitive functions. The basic steps are to first look for the function name in src/main/names.c and then search for the "C-entry" name in the files in src/main/*.
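For example, to trace a primitive (the exact file layout can shift between R versions):

> sum
function (..., na.rm = FALSE)  .Primitive("sum")

Searching src/main/names.c for "sum" gives the name of its C entry point, and grepping the files under src/main/ for that name locates the C definition.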

Monday, September 15, 2014

Cluster Analysis in R

memory.limit()
memory.size(max = TRUE)
rm(list=ls(all=T))

library(MARSS)
library(ggplot2)
library(reshape2)
library(data.table)

#############################
## Hierarchical Clustering
#############################
## could generate more features;
### growth <- function(input)
### momentum <- function(input)
data = read.csv("ranking_growth_momentum.csv", header = T)
row.names(data) <- data$style

## transform data
data$logcnt = log(data$cnt)/log(max(data$cnt))

## compute distance
d <- dist(data[,c(5,6,7)])
m <- nrow(data)
as.matrix(d)[1:5, 1:5]

## cluster methods
hc_cluster <- hclust(d, method="complete")
plot(hc_cluster)
rect.hclust(hc_cluster, k=5, border="red")
data$segments <- cutree(hc_cluster, k=5)

## examine results
library(plyr)   # for ddply()
ddply(data, c('segments'), summarise,
      n=length(growth),
      growthmean=mean(growth),
      momentummean=mean(momentum))

rect <- data.table(ymin=round(min(data$growth),2),
                   ymax=round(max(data$growth),2),
                   xmin=round(min(data$momentum),2),
                   xmax=round(max(data$momentum),2))

pdf("~/Desktop/plot3.pdf", width=15, height=8.5)
library(ggrepel)   # for geom_text_repel()
ggplot(data, aes(y=growth, x=momentum, size=logcnt, label=style), guide=FALSE) +
  geom_point(position = "jitter", colour= "red", fill= "red", shape=21) +
  geom_text_repel(aes(y = growth, x = momentum, color=factor(segments)), size=5, angle = -15) +
  theme_bw() +
  ggtitle("Cluster Analysis") +
  xlab("Growth") + ylab("Momemtum") +
  scale_y_continuous(name="Growth", labels=scales::percent, limits=c(-max(abs(rect$ymin),abs(rect$ymax))*1.4, max(abs(rect$ymin),abs(rect$ymax))*1.4)) +
  scale_x_continuous(name="Momentum", labels=scales::percent, limits=c(-max(abs(rect$xmin),abs(rect$xmax))*1.4, max(abs(rect$xmin),abs(rect$xmax))*1.4)) +
  theme(plot.title = element_text(face = "bold", size = 16)) +
  theme(axis.text.x = element_text(face = "bold", size = 10)) +
  theme(axis.text.y = element_text(face = "bold", size = 10)) +
  theme(axis.title.x = element_text(face = "bold", size = 10)) +
  theme(axis.title.y = element_text(face = "bold", size = 10, angle = 90)) +
  theme(legend.position = 'bottom') +
  theme(legend.text = element_text(family = "sans", face = "bold", size = 10)) +
  theme(legend.title = element_text(family = "sans", face = "bold", size = 10)) +
  guides(size=guide_legend("Search Volume Index")) +
  geom_vline(xintercept=c(0), linetype="dotted") +
  geom_hline(yintercept=c(0), linetype="dotted")
dev.off()


#############################
## k-means Clustering
#############################
data = read.csv("ranking_growth_momentum.csv", header = T)
row.names(data) <- data$style

## transform data
data$logcnt = log(data$cnt)/log(max(data$cnt))
kmean_cluster <- kmeans(data[,c(5,6,7)], centers=5)
summary(kmean_cluster)
data$segments = kmean_cluster$cluster
boxplot(data$growth ~ data$segments, ylab="Growth", xlab="Segments")
library(cluster)   # for clusplot()
clusplot(data[,c(5,6,7)], data$segments, color=T, shade = T, labels=4, lines=0, main="K-Means")

## Example 2

```
data <-fread("shopper.txt")
setnames(data, c("id","lable","ageBucket","gender","country","dvs","pvs","adviews","adclicks",
"allnumevents","wk1numevents","ratio_wk1numevents_all","alldv","wk1dv","ratio_wk1dv_all",
"itemcnt","itemtot","item_cnt_30day","item_tot_30day","item_cnt_30dayPlus","item_tot_30dayPlus"));

> dim(data)
[1] 1083701      21

data$rowid <- 1:nrow(data)

```
## Model One - Item Count
## Drop Outliers

```
data2 <- data[, c(9,20,22), with = F]

> summary(data2)

> quantile(data2$adclicks, c(.99,.999,.9999))
   99%  99.9% 99.99%
  6.00  45.00 159.89
> sum(data$adclicks>160)
[1] 109

data2 <- data2[data2$adclicks<160]
> summary(data2)

> quantile(data2$item_cnt_30dayPlus, c(.99,.999,.9999))
   99%  99.9% 99.99%
    46    129    335
> sum(data2$item_cnt_30dayPlus>335)
[1] 108

data2 <- data2[data2$item_cnt_30dayPlus<335]

```

## Log - Transform data

Apply the log(x+1)/log(max(x)) transformation to generate the new data
```
func1 <- function(x){
return(log(x+1)/log(max(x)))
}

data3 <- data2[, c(1,2,3), with = F]
data3[, 1:2 := lapply(.SD, func1), .SDcols = 1:2]


```

## Scale Transformed data

```
func2 <- function(x){
return( (x-mean(x))/sqrt(sum((x-mean(x))^2)/(length(x)-1)))
}

 data4 <- data3[, c(1,2,3), with = F]
 data4[, 1:2 := lapply(.SD, func2), .SDcols = 1:2]

```

## Partition by k-means using scaled transformed data
```
data5 <- data4[,c(1,2), with = F]
results <- data5
for (i in c(2:2)){
  print('----------------------------------------------------------');
  print(i);
  set.seed(33);
  model <- kmeans(data5, i);
  print("model size");
  print(cbind(1:i,model$size));
  print("model centers")
  print(model$centers);
  results<- cbind(results, model$cluster);
}

[1] "----------------------------------------------------------"
[1] 2
[1] "model size"
     [,1]   [,2]
[1,]    1 667941
[2,]    2 415540
[1] "model centers"
     adclicks item_cnt_30dayPlus
1 -0.09521215         -0.6450849
2  0.15304447          1.0369126


data2$cluster = results$V2
data2[,list(users = .N,
                    group_adclicks=sum(adclicks)/.N,
                    group_item_cnt_30dayPlus=sum(item_cnt_30dayPlus)/.N), by='cluster'];
```


## Model Two - Item Total
## Drop Outliers

```
data_2 <- data[, c(9,21,22), with = F]
> head(data_2)
> summary(data_2)
    adclicks         item_tot_30dayPlus
 Min.   :   0.0000   Min.   :     0.00
 1st Qu.:   0.0000   1st Qu.:    23.94
 Median :   0.0000   Median :    65.70
 Mean   :   0.5131   Mean   :   172.36
 3rd Qu.:   0.0000   3rd Qu.:   176.10
 Max.   :1638.0000   Max.   :225247.28

> quantile(data_2$adclicks, c(.99,.999,.9999))
   99%  99.9% 99.99%
  6.00  45.00 159.89
> sum(data_2$adclicks>160)

data_2 <- data_2[data_2$adclicks<160]

> summary(data_2)

> quantile(data_2$item_tot_30dayPlus, c(.99,.999,.9999))
      99%     99.9%    99.99%
 1510.772  4389.105 16171.602

> sum(data_2$item_tot_30dayPlus>16171)
[1] 109

data_2 <- data_2[data_2$item_tot_30dayPlus<16171]

> summary(data_2)
```

## Log - Transform data

Apply the log(x+1)/log(max(x)) transformation to generate the new data
```

data_3 <- data_2[, c(1,2,3), with = F]
data_3[, 1:2 := lapply(.SD, func1), .SDcols = 1:2]

```

## Scale Transformed data

```
 data_4 <- data_3[, c(1,2,3), with = F]
 data_4[, 1:2 := lapply(.SD, func2), .SDcols = 1:2]

```

## Partition by k-means using scaled transformed data
```

data_5 <- data_4[,c(1,2), with = F]
results_ <- data_5

for (i in c(2:2)){
  print('----------------------------------------------------------');
  print(i);
  set.seed(33);
  model <- kmeans(data_5, i);
  print("model size");
  print(cbind(1:i,model$size));
  print("model centers")
  print(model$centers);
  results_<- cbind(results_, model$cluster);
}

[1] "----------------------------------------------------------"
[1] 2
[1] "model size"
     [,1]   [,2]
[1,]    1 545531
[2,]    2 537952
[1] "model centers"
    adclicks item_tot_30dayPlus
1  0.1732808          0.7845828
2 -0.1757221         -0.7956365


data_2$cluster = 3-results_$V2
data_2[,list(users = .N,
                    group_adclicks=sum(adclicks)/.N,
                    group_item_tot_30dayPlus=sum(item_tot_30dayPlus)/.N), by='cluster'];
                 
```

## Merge Two Results

```
setkey(data2, rowid);
setkey(data_2, rowid);

comb = merge(data2, data_2, by ="rowid", all = TRUE)
comb1 = comb[,c(4,7),with=F]
> m <- table(comb1,useNA = "always"); m
         cluster.y
cluster.x      1      2   <NA>
     1    464209 203732      0
     2     73742 341718     80
     <NA>      1     81      0
> round(prop.table(m),5)
         cluster.y
cluster.x       1       2    <NA>
     1    0.42841 0.18802 0.00000
     2    0.06806 0.31537 0.00007
     <NA> 0.00000 0.00007 0.00000
```

#############################
## Number of Clusters
#############################

# Important packages (libraries)

library(cluster) # Data set votes.repub, and cluster technique 
library(mclust) # Clustering by mixtures of Gaussians 
library(MASS) # misc. statistical utilities (kmeans itself ships with base R's stats package) 
library(kernlab) # Kernel kMeans, spectral clustering 

# Function for finding the within-cluster sum of squares when K=1

WCSS1 <- function(x) {
  if (ncol(x) > 1) {
    centers <- matrix(rep(apply(x, 2, mean), times = nrow(x)),
                      byrow = TRUE, ncol = ncol(x))
    sum(rowSums((x - centers)^2))
  } else {
    (nrow(x) - 1) * var(x)
  }
}

# Function for finding and plotting the number of clusters

nb.clust <- function(x) {
  K <- 7               # the maximum number of clusters
  n <- nrow(x)         # sample size
  wcss <- matrix(WCSS1(x), ncol=1, nrow=K)
  for (k in 2:K)
    wcss[k] <- sum(kmeans(x, centers=k)$withinss)

  plot(1:K, wcss, type="b", xlab="Number of Clusters",
       ylab="Within cluster sum of squares")

  # choose the smallest k at which the cumulative drop in WCSS exceeds 80%
  dw <- abs(diff(wcss[-1]))
  return(min(which(cumsum(dw/sum(dw)) > 0.8)))
}

> nb.clust(faithful) 
[1] 3 
> kmeans.faithful <- kmeans(faithful,3) 
> kmeans.faithful$centers 

> x11() 
> plot(faithful, col = kmeans.faithful$cluster) 
> points(kmeans.faithful$centers, col = 1:3, pch = 8, cex=6) 

# Imputing 

library(DMwR) 
xvotes <- centralImputation(votes.repub) 
kmeans(xvotes, 2) 
# Note that n remains unchanged 

# Exclusion 

votesx <- na.exclude(votes.repub) 
kmeans(votesx, 2) 
# Note that the sample size is hugely reduced 
# This can be very very bad 

# What's the difference? Imputation keeps the full sample size n,
# whereas exclusion can drastically reduce it.


# Note: kmeans can handle a large number of variables (q) even when n is small


#############################
## Gaussian Mixture Clustering
#############################
data = read.csv("ranking_growth_momentum.csv", header = T)
row.names(data) <- data$style

## transform data
data$logcnt = log(data$cnt)/log(max(data$cnt))
library(mclust)   # for mclustBIC() and Mclust()
X <- data[,c(5,6,7)]
BIC = mclustBIC(X)
plot(BIC)
m_cluster <- Mclust(X, x = BIC)
plot(m_cluster, what = "classification")
data$segments = m_cluster$classification
boxplot(data$growth ~ data$segments, ylab="Growth", xlab="Segments")
clusplot(X, data$segments, color=T, shade = T, labels=4, lines=0, main="Gaussian Mixture")


Classification Models in R

## assumes `all` is a data frame with a 0/1 response column y
n=nrow(all)
ntrain <- round(n*0.7)
set.seed(333)
tindex <- sample(n, ntrain)
all$y <- all$y==1
train <- all[tindex, ]
test <- all[-tindex, ]


##################################################################
## Naive Bayes
##################################################################
library(klaR)   # NaiveBayes()
library(ROCR)   # prediction(), performance()
train$y=factor(train$y, levels=c(TRUE, FALSE))
newdata <- data.frame(y=test$y,
                      is_verified=test$is_verified,
                      pageviews1=test$pageviews1,
                      follows1=test$follows1,
                      likes1=test$likes1,
                      reblogs1=test$reblogs1,
                      original_posts1=test$original_posts1,
                      searches1=test$searches1,
                      unfollows1=test$unfollows1,
                      received_engagments1=test$received_engagments1,
                      time1=test$time1,
                      regi_geo1=test$regi_geo1)
nb1 <- NaiveBayes(y ~ is_verified + pageviews1 + follows1 + likes1 + reblogs1 + original_posts1
                  + searches1 + unfollows1 + received_engagments1 + time1 + regi_geo1, data=train)
tmp <- predict(nb1, newdata = newdata)
table(tmp$class, test$y)
test$score = tmp$posterior[, "TRUE"]   # posterior probability of the positive class
#calculate auc;
pred = prediction(test$score, test$y);
auc.tmp <- performance(pred,"auc");
auc <- as.numeric(auc.tmp@y.values); auc;
perf <- performance(pred,"tpr","fpr");
plot(perf);
abline(0, 1, lty = 2);
# KS is the maximum difference between the cumulative true positive and cumulative false positive rate.
max(attr(perf,'y.values')[[1]]-attr(perf,'x.values')[[1]])


##################################################################
## Decision Tree
##################################################################
require(rpart)
## generic illustration: assumes an mtcars-like train/test split with outcome `gear`
ct <- rpart(factor(gear) ~ ., data = train, minsplit = 3)
summary(ct)
plot(ct)
text(ct)
pred <- predict(ct, newdata = test, type = 'class')
prop.table(table(pred, test$gear))

ct <- rpart(as.factor(y) ~ is_verified + pageviews1 + follows1 + likes1
            + reblogs1 + original_posts1 + searches1 + unfollows1 + received_engagments1,
            data = train, minsplit = 5)
summary(ct)
plot(ct)
text(ct)

test$score = predict(ct, newdata = test, type = "prob")[,2]
#calculate auc;
pred = prediction(test$score, test$y);
plot(performance(pred, "tpr", "fpr"))
abline(0, 1, lty = 2)
auc.tmp <- performance(pred,"auc");
auc <- as.numeric(auc.tmp@y.values); auc;
perf <- performance(pred,"tpr","fpr");
plot(perf);
abline(0, 1, lty = 2);
# KS is the maximum difference between the cumulative true positive and cumulative false positive rate.
max(attr(perf,'y.values')[[1]]-attr(perf,'x.values')[[1]])


##################################################################
## KNN;
##################################################################
library(class)   # knn()
train_x <- data.frame(is_verified=as.numeric(train$is_verified),
                      pageviews1=as.numeric(train$pageviews1),
                      follows1=as.numeric(train$follows1),
                      likes1=as.numeric(train$likes1),
                      reblogs1=as.numeric(train$reblogs1),
                      original_posts1=as.numeric(train$original_posts1),
                      searches1=as.numeric(train$searches1),
                      unfollows1=as.numeric(train$unfollows1),
                      received_engagments1=as.numeric(train$received_engagments1),
                      time1=as.numeric(train$time1),
                      regi_geo1=as.numeric(train$regi_geo1))
train_y=train$y
test_x <- data.frame(is_verified=as.numeric(test$is_verified),
                     pageviews1=as.numeric(test$pageviews1),
                     follows1=as.numeric(test$follows1),
                     likes1=as.numeric(test$likes1),
                     reblogs1=as.numeric(test$reblogs1),
                     original_posts1=as.numeric(test$original_posts1),
                     searches1=as.numeric(test$searches1),
                     unfollows1=as.numeric(test$unfollows1),
                     received_engagments1=as.numeric(test$received_engagments1),
                     time1=as.numeric(test$time1),
                     regi_geo1=as.numeric(test$regi_geo1))

knn_pred <- knn(train_x, test_x, train_y, k=5)   # renamed so it doesn't mask ROCR::prediction() below
table(knn_pred, test$y)

test$score = as.integer(knn_pred)
#calculate auc;
pred = prediction(test$score, test$y);
auc.tmp <- performance(pred,"auc");
auc <- as.numeric(auc.tmp@y.values); auc;
perf <- performance(pred,"tpr","fpr");
plot(perf);
abline(0, 1, lty = 2);
# KS is the maximum difference between the cumulative true positive and cumulative false positive rate.
max(attr(perf,'y.values')[[1]]-attr(perf,'x.values')[[1]])

p <- ggplot(test, aes(pageviews1, follows1))
p + geom_point(aes(color=y))


##################################################################
## Support Vector Machines
##################################################################
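The AUC and KS helpers used below are not defined above; a minimal sketch of hypothetical definitions built on ROCR, consistent with the manual calculations in the earlier sections:

AUC <- function(score, y) {
  pred <- prediction(as.numeric(score), y)
  as.numeric(performance(pred, "auc")@y.values)
}
KS <- function(score, y) {
  pred <- prediction(as.numeric(score), y)
  perf <- performance(pred, "tpr", "fpr")
  # maximum gap between cumulative true positive and false positive rates
  max(attr(perf, "y.values")[[1]] - attr(perf, "x.values")[[1]])
}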
## 0.6148956 0.2297911
library(e1071)   # svm()
svm <- svm(y ~ is_verified + pageviews1 + follows1 + likes1 + reblogs1 + original_posts1
           + searches1 + unfollows1 + received_engagments1 + time1 + regi_geo1, data=train,
           type = "C-classification", kernel = "radial", cost = 100, gamma = 1)
test$score <- predict(svm, test)
AUC(test$score, test$y)
KS(test$score, test$y)

## 0.621205 0.2424101
svm <- svm(y ~ is_verified + pageviews1 + follows1 + likes1 + reblogs1 + original_posts1
           + searches1 + unfollows1 + received_engagments1 + time1 + regi_geo1, data=train,
           method = "C-classification", kernel = "radial", cost = 1, gamma = 1)
test$score <- predict(svm, test)
AUC(test$score, test$y)
KS(test$score, test$y)


##################################################################
## Artificial Neural Networks
##################################################################
library(neuralnet)   # neuralnet(), compute()
train$y=as.integer(train$y)
train$y1=1-train$y
train$is_verified=as.integer(train$is_verified)
train$time1=as.integer(train$time1)
train$regi_geo1=as.integer(train$regi_geo1)

nn1 <- neuralnet(y ~ is_verified + pageviews1 + follows1 + likes1 + reblogs1 + original_posts1
                 + searches1 + unfollows1 + received_engagments1 + time1 + regi_geo1,
                 data=train, hidden=c(4))
plot(nn1)

test_x <- data.frame(is_verified=as.integer(test$is_verified),
                     pageviews1=as.integer(test$pageviews1),
                     follows1=as.integer(test$follows1),
                     likes1=as.integer(test$likes1),
                     reblogs1=as.integer(test$reblogs1),
                     original_posts1=as.integer(test$original_posts1),
                     searches1=as.integer(test$searches1),
                     unfollows1=as.integer(test$unfollows1),
                     received_engagments1=as.integer(test$received_engagments1),
                     time1=as.integer(test$time1),
                     regi_geo1=as.integer(test$regi_geo1))
prediction <- compute(nn1, test_x)
prediction <- prediction$net.result

test$score <- prediction[,1]
AUC(test$score, test$y)
KS(test$score, test$y)


##################################################################
## Random Forest
##################################################################
A random forest consists of a number of decision trees. Every node in a decision tree is a condition on a single feature, designed to split the dataset in two so that similar response values end up in the same subset. The measure by which the (locally) optimal condition is chosen is called impurity: for classification it is typically Gini impurity or information gain/entropy, and for regression trees it is variance. When training a tree, one can therefore compute how much each feature decreases the weighted impurity; for a forest, the impurity decrease from each feature can be averaged and the features ranked by this measure.

In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e. have low bias, very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance of the final model.

install.packages('randomForest')
require(randomForest)
(rf <- randomForest(factor(gear) ~ ., data = train, ntree = 10))
plot(rf)
pred <- predict(rf, newdata = test, type = 'class')
prop.table(table(pred, test$gear))  

## importance of variables
rf <- randomForest(factor(gear) ~ ., data = train, ntree = 100, importance=T)
varImpPlot(rf, main="Importance of Variables")        


##################################################################
## Gradient Boosted Decision Tree
##################################################################
require("gbm")
gbm_perf <- gbm.perf(gbm_model, method = "cv")
predictions_gbm <- predict(gbm_model, newdata = testdf[, -response_column],
                           n.trees = gbm_perf, type = "response")