
Tuesday, January 24, 2017

Topic Modeling using LDA

Topic Modeling

Gibbs sampling works by performing a random walk that reflects the characteristics of the desired distribution. Because the starting point of the walk is chosen at random, it is necessary to discard the first few steps of the walk (as these do not correctly reflect the properties of the distribution). This is referred to as the burn-in period; here the burn-in parameter is set to 400. Following the burn-in period, we perform 200 iterations, keeping only every 50th iteration (the thin parameter) for further use. The reason we do this is to avoid correlations between successive samples. We use 3 different starting points (nstart=3) – that is, three independent runs. Each starting point requires a seed integer (this also ensures reproducibility), so I have provided 3 integers in my seed list. Finally I’ve set best to TRUE (actually the default setting), which instructs the algorithm to return the results of the run with the highest posterior probability.

Some words of caution are in order here. It should be emphasised that the settings above do not guarantee convergence of the algorithm to a globally optimal solution. Indeed, Gibbs sampling will, at best, find only a locally optimal solution, and even this is hard to prove mathematically for specific practical problems such as the one we are dealing with here. The upshot is that it is best to do lots of runs with different parameter settings and check the stability of your results; one simple way to do this is sketched below. The bottom line is that our interest is purely practical, so it is good enough if the results make sense. We’ll leave issues of mathematical rigour to those better qualified to deal with them.
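A minimal version of such a check (just a sketch, assuming the document-term matrix dtm and the Gibbs settings defined further down in this post): rerun the sampler with a few different seeds and see whether documents keep landing in the same topics.

## Sketch: rerun the Gibbs sampler with different seeds and compare the
## resulting document-to-topic assignments (assumes the dtm and k = 3 used below)
require(topicmodels)
seeds <- c(2003, 5, 63)
runs <- lapply(seeds, function(s)
  topics(LDA(dtm, k = 3, method = "Gibbs",
             control = list(burnin = 400, iter = 200, thin = 50, seed = s))))
## share of documents assigned to the same topic in the first two runs
## (note that topic labels may be permuted between runs)
mean(runs[[1]] == runs[[2]])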

As mentioned earlier, there is an important parameter that must be specified upfront: k, the number of topics that the algorithm should use to classify documents. There are mathematical approaches to choosing it, but they often do not yield semantically meaningful values of k (see http://stackoverflow.com/questions/21355156/topic-models-cross-validation-with-loglikelihood-or-perplexity/21394092 for an example). From a practical point of view, one can simply run the algorithm for different values of k and make a choice by inspecting the results. This is what we’ll do.
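A minimal sketch of that approach (again assuming the dtm built below): fit the model for a few candidate values of k, compare the log-likelihoods, and inspect the top terms of the most plausible fits.

## Sketch: fit LDA for a few candidate values of k (assumes the dtm built below)
require(topicmodels)
candidate_k <- c(2, 3, 4, 5)
fits <- lapply(candidate_k, function(k)
  LDA(dtm, k, method = "Gibbs",
      control = list(burnin = 400, iter = 200, thin = 50, seed = 2003)))
sapply(fits, logLik)    # higher log-likelihood is better, but also eyeball the terms
lapply(fits, terms, 10) # top 10 terms per topic for each candidate k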



################################################################################
## https://eight2late.wordpress.com/2015/09/29/a-gentle-introduction-to-topic-modeling-using-r/
#################################################################################
require(tm)
require(topicmodels)    # LDA()
require(wordcloud)      # wordcloud()
require(RColorBrewer)   # brewer.pal()
setwd("/Users/tkmaemd/Desktop/R/KLAB/1_24_2017")

#load files into corpus
#get listing of .txt files in directory
#include facebook, instagram, pinterest, twitter
filenames <- list.files(getwd(),pattern="*.txt")
filenames

#read files into a character vector
files <- lapply(filenames, readLines)


#create corpus from vector
docs <- Corpus(VectorSource(files))
inspect(docs)

#inspect a particular document in corpus
writeLines(as.character(docs[[1]]))
writeLines(as.character(docs[[2]]))
writeLines(as.character(docs[[3]]))
writeLines(as.character(docs[[4]]))

#start preprocessing
#Transform to lower case
docs <-tm_map(docs, content_transformer(tolower))

#remove potentially problematic symbols
# toSpace <- content_transformer(function(x, pattern) { return (gsub(pattern, " ", x))})
# docs <- tm_map(docs, toSpace, "-")
# docs <- tm_map(docs, toSpace, "'")
# docs <- tm_map(docs, toSpace, ".")

#remove punctuation
docs <- tm_map(docs, removePunctuation)
#Strip digits
docs <- tm_map(docs, removeNumbers)
#remove stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
#remove whitespace
docs <- tm_map(docs, stripWhitespace)

#Good practice to check every now and then
writeLines(as.character(docs[[1]]))

#keep it as plaintextdocument
docs <- tm_map(docs, PlainTextDocument)
#Stem document
docs <- tm_map(docs,stemDocument)
#change to the character type
docs <- lapply(docs, as.character)
#create the object
docs <- Corpus(VectorSource(docs))
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
#convert rownames to filenames
rownames(dtm) <- filenames

##Create a WordCloud to Visualize the Text Data
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
#length should be total number of terms
length(freq)
head(freq, 20)

# Create the word cloud
wf <- data.frame(word=names(freq), freq=freq)
head(wf)
set.seed(123)
pal = brewer.pal(11,"Spectral")
wordcloud(words = wf$word,
          freq = wf$freq,
          scale = c(3.5, 1.2),
          random.order = F,
          colors = pal,
          max.words = 100)

#################################################################################
## Topic Models
#################################################################################
## What is the meaning of each topic?
## How prevalent is each topic?
## How do the topics relate to each other?
## How do the documents relate to each other?
#Set parameters for Gibbs sampling
burnin <- 400
iter <- 200
thin <- 50
seed <-list(2003,5,63)
nstart <- 3
best <- TRUE

#Number of topics
k <- 3

#Run LDA using Gibbs sampling
ldaOut <-LDA(dtm,k, method="Gibbs", control=list(nstart=nstart, seed = seed,
                                                 best=best, burnin = burnin, iter = iter, thin=thin))

## Show document-topic distribution
ldaOut.topics <- as.matrix(topics(ldaOut))
write.csv(ldaOut.topics, file=paste("LDAGibbs",k,"DocsToTopics.csv"))

## Show term-topic distribution
#top 20 terms in each topic
ldaOut.terms <- as.matrix(terms(ldaOut,20))
write.csv(ldaOut.terms,file=paste('LDAGibbs',k,'TopicsToTerms.csv'))


#probabilities associated with each topic assignment
topicProbabilities <- as.data.frame(ldaOut@gamma)
topicProbabilities
write.csv(topicProbabilities, file=paste("LDAGibbs",k,"TopicAssignmentProbabilities.csv"))

#Find relative importance of top 2 topics
topic1ToTopic2 <- lapply(1:nrow(dtm),function(x) sort(topicProbabilities[x,])[k]/sort(topicProbabilities[x,])[k-1])
topic1ToTopic2

#Find relative importance of second and third most important topics
topic2ToTopic3 <- lapply(1:nrow(dtm),function(x) sort(topicProbabilities[x,])[k-1]/sort(topicProbabilities[x,])[k-2])
topic2ToTopic3

Wednesday, January 18, 2017

Use American Community Survey (ACS) Data

#############################
## state.R
#############################
## require(choroplethr)

?df_pop_state
data(df_pop_state)
head(df_pop_state)
dat10 <- merge(dat9, df_pop_state, by.x="states", by.y="region")
dat10$perc <- dat10$cust/dat10$value
percent <- function(x, digits = 2, format = "f", ...) {
  paste0(formatC(100 * x, format = format, digits = digits, ...), "%")
}
dat10$statelabel <- paste(dat10$demand_state, "\n", percent(dat10$perc,2,"f"),  sep="")
head(dat10)

p9 <- ggplot() +
  geom_map(data=states, map=states,
           aes(x=long, y=lat, map_id=region),
           fill="#ffffff", color="#ffffff", size=0.15) +
  geom_map(data=dat10, map=states,
           aes(fill=perc, map_id=states),
           color="#ffffff", size=0.15) +
  coord_fixed(1.3) +
  scale_fill_continuous(low = "thistle2", high = "darkred", guide="colorbar") +
  geom_text(data=dat10, aes(x=long, y=lat, label=statelabel), colour="black", size=4 ) +
  ggtitle("Kohl's Customers from 11/1/2014 to 10/31/2016") + ylab("") + xlab("") +
  theme(plot.title = element_text(face = "bold", size = 20)) +
  theme(axis.text.x = element_text(face = "bold", size = 14)) +
  theme(axis.text.y = element_text(face = "bold", size = 14)) +
  theme(axis.title.x = element_text(face = "bold", size = 16)) +
  theme(strip.text.x = element_text(face = "bold", size = 16)) +
  theme(axis.title.y = element_text(face = "bold", size = 16, angle=90)) +
  guides(fill=FALSE)


## collect acs table names and column names
install.packages("acs")
require(acs)

acs.lookup(endyear=2013, span=5, table.name="Age by Sex")
lookup <- acs.lookup(endyear=2013, span=5, table.number="B01001")
write.csv(lookup@results, "tmp.csv")   # results slot of the acs.lookup object

#############################
## demo.py
#############################
# https://github.com/CommerceDataService/census-wrapper
# https://www.socialexplorer.com/data/ACS2013_5yr/documentation/12e2e690-0503-4fb9-a431-4ce385fc656a
# http://www.census.gov/geo/maps-data/data/relationship.html
import csv
import pandas as pd
from census import Census
from us import states
# Datasets
# acs5: ACS 5 Year Estimates (2013, 2012, 2011, 2010)
# acs1dp: ACS 1 Year Estimates, Data Profiles (2012)
# sf1: Census Summary File 1 (2010, 2000, 1990)
# sf3: Census Summary File 3 (2000, 1990)

API_KEY = '0c9d762f43fe6dd071ec0f5a3bfdd19b478c2381'
c = Census(API_KEY, year=2013)


# example: convert names and fips
#c.acs5.get(('NAME', 'B25034_010E'),{'for': 'state:{}'.format(states.MD.fips)})


# Total number for all states:
records=['B01001_031E', 'B01001_032E','B01001_033E','B01001_034E','B01001_035E', 'B01001_036E']
flag = 0
for record in records:
    if flag == 0:
        tmp = c.acs5.get(record, {'for': 'state:*'})
        df = pd.DataFrame(tmp)
        flag = 1
    else:
        tmp1 = c.acs5.get(record, {'for': 'state:*'})
        df1 = pd.DataFrame(tmp1)
        df = df.merge(df1, on='state', how='outer')

df.index=df['state']
df.sort_index(inplace=True)
del df['state']
df1=df.applymap(int)
df1['tot']=df1.apply(sum, axis=1)/2
df1['tot']=df1['tot'].astype(int)

# Get State label
statelabel=[]
for state in df1.index:
    statelabel.append(states.lookup(state).abbr)
df1['statelable'] = statelabel

df1.to_csv('~/Desktop/output.csv')
print 'Done'

Thursday, December 8, 2016

Visual 13 - Create Pie Chart

dat1 <- dbGetQuery(conn,"select channel, sum(a.sld_qty)
                  from eipdb_sandbox.ling_sls_brnd_demog a
                  where a.gma_nbr in (2,5)
                  and a.trn_sls_dte between '2016-11-14' AND '2016-12-16'
                  group by 1
                  order by 1;")
dat1$perc <- dat1$sld_qty/sum(dat1$sld_qty)
p1 <- ggplot(dat1, aes(x = factor(1), y =sld_qty, fill = channel)) +
  geom_bar(width = 1, stat = "identity") +
  scale_fill_manual(values = c("red", "blue")) +
  coord_polar(theta="y", start = pi / 3) +
  ##labs(title = "Kohl's Sold Items by Channel") +
  geom_text_repel(aes(label=scales::percent(perc)), size=4.5) + ylab("") + xlab("") +
  theme_void()


dat2 <-dbGetQuery(conn,"select new_ind, sum(a.sld_qty)
                  from eipdb_sandbox.ling_sls_brnd_demog a
                  where a.gma_nbr in (2,5)
                  and a.trn_sls_dte between '2016-11-14' AND '2016-12-16'
                  group by 1
                  order by 1;")
dat2$new_ind <- factor(dat2$new_ind)
dat2$perc <- dat2$sld_qty/sum(dat2$sld_qty)
p2 <- ggplot(dat2, aes(x = "", y =sld_qty, fill = new_ind)) +
  geom_bar(width = 1, stat = "identity") +
  scale_fill_manual(values = c("darkgreen", "orangered", "red")) +
  coord_polar("y", start = pi / 3) +
  ##labs(title = "Kohl's Sold Items by New/Existed Customer") +
  geom_text_repel(aes(label=scales::percent(perc)), size=4.5) + ylab("") + xlab("") +
  theme_void()


dat3 <-dbGetQuery(conn,"select sku_stat_desc, sum(a.sld_qty)
                  from eipdb_sandbox.ling_sls_brnd_demog a
                 where a.gma_nbr in (2,5)
                 and a.trn_sls_dte between '2016-11-14' AND '2016-12-16'
                 group by 1
                 order by 1;")
dat3$perc <- dat3$sld_qty/sum(dat3$sld_qty)
dat3 <- dat3[order(dat3$sld_qty),]
p3 <- ggplot(dat3, aes(x = "", y =sld_qty, fill = sku_stat_desc)) +
  geom_bar(width = 1, stat = "identity") +
  scale_fill_brewer(palette = "Spectral") +
  coord_polar("y", start = pi / 3) +
  ##labs(title = "Kohl's Sold Items by SKU Status") +
  geom_text_repel(aes(label=scales::percent(perc)), size=4.5) + ylab("") + xlab("") +
  theme_void()


grid.arrange(p1,p2,p3, nrow=3, ncol=1)

Wednesday, December 7, 2016

Visual12 - Create Maps


http://bcb.dfci.harvard.edu/~aedin/courses/R/CDC/maps.html

#########################################################
## Geographic Information of Customers
#########################################################
# Returns centroids
getLabelPoint <- function(county) {
  Polygon(county[c('long', 'lat')])@labpt}
df <- map_data("state")              
centroids <- by(df, df$region, getLabelPoint)     # Returns list
centroids <- do.call("rbind.data.frame", centroids)  # Convert to Data Frame
names(centroids) <- c('long', 'lat')                 # Appropriate Header
centroids$states <- rownames(centroids)

dat8<-dbGetQuery(conn,"select demand_state,
                count(distinct mstr_persona_key) cust
                 from eipdb_sandbox.ling_sls_brnd_demog
                 where new_ind=1 and trn_sls_dte between '2014-11-01' and '2016-10-31'
                 group by 1
                 order by 1;
                 ")
## Join with States
dat8$states <- tolower(state.name[match(dat8$demand_state,  state.abb)])

states <- map_data("state")
head(states)
dat9 <- merge(dat8, centroids, by="states")
dat9$statelabel <- paste(dat9$demand_state, "\n", format(dat9$cust, big.mark = ",", scientific = F),  sep="")

# ggplot(data = Total) +
#   geom_polygon(aes(x = long, y = lat, fill = region, group = group), color = "white") +
#   coord_fixed(1.3) +
#   guides(fill=FALSE) +
#   geom_text(data=statelable, aes(x=long, y=lat, label = demand_state), size=2)

ggplot() +
  geom_map(data=states, map=states,
           aes(x=long, y=lat, map_id=region),
           fill="#ffffff", color="#ffffff", size=0.15) +
  geom_map(data=dat8, map=states,
           aes(fill=cust, map_id=states),
           color="#ffffff", size=0.15) +
  coord_fixed(1.3) +
  scale_fill_continuous(low = "thistle2", high = "darkred", guide="colorbar") +
  #scale_fill_distiller(name="Customers", palette = "YlGn", breaks=pretty_breaks(n=5)) +
  #geom_text(data=dat9, hjust=0.5, vjust=-0.5, aes(x=long, y=lat, label=statelabel), colour="black", size=4 ) +
  geom_text(data=dat9, aes(x=long, y=lat, label=statelabel), colour="black", size=4 ) +
  ggtitle("Customers from 11/1/2014 to 10/31/2016") + ylab("") + xlab("") +
  theme(plot.title = element_text(face = "bold", size = 20)) +
  theme(axis.text.x = element_text(face = "bold", size = 14)) +
  theme(axis.text.y = element_text(face = "bold", size = 14)) +
  theme(axis.title.x = element_text(face = "bold", size = 16)) +
  theme(strip.text.x = element_text(face = "bold", size = 16)) +
  theme(axis.title.y = element_text(face = "bold", size = 16, angle=90)) +
  guides(fill=FALSE)

## Plot2
## American Community Survey (ACS) Data
## Join with States
## access population estimates for US States in 2012
?df_pop_state
data(df_pop_state)
head(df_pop_state)
dat10 <- merge(dat9, df_pop_state, by.x="states", by.y="region")
dat10$perc <- dat10$cust/dat10$value
percent <- function(x, digits = 2, format = "f", ...) {
  paste0(formatC(100 * x, format = format, digits = digits, ...), "%")
}
dat10$statelabel <- paste(dat10$demand_state, "\n", percent(dat10$perc,2,"f"),  sep="")
head(dat10)

p9 <- ggplot() +
  geom_map(data=states, map=states,
           aes(x=long, y=lat, map_id=region),
           fill="#ffffff", color="#ffffff", size=0.15) +
  geom_map(data=dat10, map=states,
           aes(fill=perc, map_id=states),
           color="#ffffff", size=0.15) +
  coord_fixed(1.3) +
  scale_fill_continuous(low = "thistle2", high = "darkred", guide="colorbar") +
  geom_text(data=dat10, aes(x=long, y=lat, label=statelabel), colour="black", size=4 ) +
  ggtitle("Customers from 11/1/2015 to 10/31/2016") + ylab("") + xlab("") +
  theme(plot.title = element_text(face = "bold", size = 20, hjust = 0.5)) +
  theme(axis.text.x = element_text(face = "bold", size = 14)) +
  theme(axis.text.y = element_text(face = "bold", size = 14)) +
  theme(axis.title.x = element_text(face = "bold", size = 16)) +
  theme(strip.text.x = element_text(face = "bold", size = 16)) +
  theme(axis.title.y = element_text(face = "bold", size = 16, angle=90)) +
  guides(fill=FALSE)


#########################################################
## World Maps
#########################################################
## regi_geo
require(plyr); require(rworldmap); require(classInt); require(RColorBrewer)
data <- ddply(user_regi, .(regi_geo), summarise, tot=length(user_id))
summary(data$tot)
tmp=joinCountryData2Map(data, joinCode = "ISO2"
                    , nameJoinColumn = "regi_geo"
                    , verbose='TRUE'
)
tmp$tot[is.na(tmp$tot)]=0
# catMethod='categorical'
mapCountryData(tmp, nameColumnToPlot="tot",catMethod="fixedWidth")

#getting class intervals
classInt <- classIntervals(tmp[["tot"]], n=5, style = "jenks")
catMethod = classInt[["brks"]]
#getting colours
colourPalette <- brewer.pal(5,'RdPu')
#plot map
mapParams <- mapCountryData(tmp
                            ,nameColumnToPlot="tot"
                            ,addLegend=FALSE
                            ,catMethod = catMethod
                            ,colourPalette=colourPalette )
#adding legend
do.call(addMapLegend
        ,c(mapParams
           ,legendLabels="all"
           ,legendWidth=0.5
           ,legendIntervals="data"
           ,legendMar = 2))

tmp2=data.frame(tmp[['regi_geo']], tmp[['tot']], tmp[['NAME']])
tmp2=tmp2[order(tmp2$tmp...tot...,decreasing = T),]
write.csv(tmp2, "junk.csv")

Sunday, November 20, 2016

Check Bivariate Distribution in R

data$index=1+(data$rankP-mean(data$rankP))/sd(data$rankP)
ggplot(data, aes(y=data$index, x=data$Avg...Viewed)) +
  geom_density_2d() +
  ggtitle("Contour Plot of Original APV and APV Index") +
  ylab("Original APV Index") +
  xlab("Origianl APV") +
  theme(plot.title = element_text(face = "bold", size = 20)) +
  theme(axis.text.x = element_text(face = "bold", size = 14)) +
  theme(axis.text.y = element_text(face = "bold", size = 14)) +
  theme(axis.title.x = element_text(face = "bold", size = 16)) +
  theme(strip.text.x = element_text(face = "bold", size = 16)) +
  theme(axis.title.y = element_text(face = "bold", size = 16, angle=90))

## Type
p1=ggplot(data, aes(y=data$Avg...Viewed, x=data$Type)) +
  geom_boxplot(aes(fill = data$Type), outlier.colour = "red", outlier.size = 2) +
  ggtitle("Boxplot of Original APV by Type") +
  ylab("Original APV") +
  xlab("Type") +
  theme(plot.title = element_text(face = "bold", size = 20)) +
  theme(axis.text.x = element_text(face = "bold", size = 14)) +
  theme(axis.text.y = element_text(face = "bold", size = 14)) +
  theme(axis.title.x = element_text(face = "bold", size = 16)) +
  theme(strip.text.x = element_text(face = "bold", size = 16)) +
  theme(axis.title.y = element_text(face = "bold", size = 16, angle=90)) +
  guides(fill=FALSE)
p2=ggplot(data, aes(y=data$index, x=data$Type)) +
  geom_boxplot(aes(fill = data$Type), outlier.colour = "red", outlier.size = 2) +
  ggtitle("Boxplot of APV Index by Type") +
  ylab("APV Index") +
  xlab("Type") +
  theme(plot.title = element_text(face = "bold", size = 20)) +
  theme(axis.text.x = element_text(face = "bold", size = 14)) +
  theme(axis.text.y = element_text(face = "bold", size = 14)) +
  theme(axis.title.x = element_text(face = "bold", size = 16)) +
  theme(strip.text.x = element_text(face = "bold", size = 16)) +
  theme(axis.title.y = element_text(face = "bold", size = 16, angle=90)) +
  guides(fill=FALSE)

#X..Eps
p3=ggplot(data, aes(y=data$Avg...Viewed, x=data$X..Eps)) +
  geom_density_2d() +
  ggtitle("Contor Plot of Original APV by Number of Episodes") +
  ylab("Original APV") +
  xlab("Number of Episodes") +
  scale_x_continuous(limits=c(0,25)) +
  theme(plot.title = element_text(face = "bold", size = 20)) +
  theme(axis.text.x = element_text(face = "bold", size = 14)) +
  theme(axis.text.y = element_text(face = "bold", size = 14)) +
  theme(axis.title.x = element_text(face = "bold", size = 16)) +
  theme(strip.text.x = element_text(face = "bold", size = 16)) +
  theme(axis.title.y = element_text(face = "bold", size = 16, angle=90))
p4=ggplot(data, aes(y=data$index, x=data$X..Eps)) +
  geom_density_2d() +
  ggtitle("Contor Plot of APV Index by Number of Episodes") +
  ylab("APV Index") +
  xlab("Number of Episodes") +
  scale_x_continuous(limits=c(0,30)) +
  theme(plot.title = element_text(face = "bold", size = 20)) +
  theme(axis.text.x = element_text(face = "bold", size = 14)) +
  theme(axis.text.y = element_text(face = "bold", size = 14)) +
  theme(axis.title.x = element_text(face = "bold", size = 16)) +
  theme(strip.text.x = element_text(face = "bold", size = 16)) +
  theme(axis.title.y = element_text(face = "bold", size = 16, angle=90))

## daypart
p5=ggplot(data, aes(y=data$Avg...Viewed, x=data$daypart)) +
  geom_boxplot(aes(fill = data$daypart), outlier.colour = "red", outlier.size = 2) +
  ggtitle("Boxplot of Original APV by Daypart") +
  ylab("Original APV") +
  xlab("Daypart") +
  theme(plot.title = element_text(face = "bold", size = 20)) +
  theme(axis.text.x = element_text(face = "bold", size = 14)) +
  theme(axis.text.y = element_text(face = "bold", size = 14)) +
  theme(axis.title.x = element_text(face = "bold", size = 16)) +
  theme(strip.text.x = element_text(face = "bold", size = 16)) +
  theme(axis.title.y = element_text(face = "bold", size = 16, angle=90)) +
  guides(fill=FALSE)
p6=ggplot(data, aes(y=data$index, x=data$daypart)) +
  geom_boxplot(aes(fill = data$daypart), outlier.colour = "red", outlier.size = 2) +
  ggtitle("Boxplot of APV Index by Type") +
  ylab("APV Index") +
  xlab("Daypart") +
  theme(plot.title = element_text(face = "bold", size = 20)) +
  theme(axis.text.x = element_text(face = "bold", size = 14)) +
  theme(axis.text.y = element_text(face = "bold", size = 14)) +
  theme(axis.title.x = element_text(face = "bold", size = 16)) +
  theme(strip.text.x = element_text(face = "bold", size = 16)) +
  theme(axis.title.y = element_text(face = "bold", size = 16, angle=90)) +
  guides(fill=FALSE)

pdf("result.pdf", width=15, height=8.5)
grid.arrange(p1,p2, ncol=2)
grid.arrange(p3,p4, ncol=2)
grid.arrange(p5,p6, ncol=2)
dev.off()

Saturday, November 19, 2016

Check Univariate Distribution in R

memory.limit()
memory.size(max = TRUE)
rm(list=ls(all=TRUE))
sessionInfo()

require(data.table)
require(stringr)
require(lubridate)
require(scales)
require(tigerstats)
require(ggplot2)
require(gridExtra)
require(ggthemes)

##################################################################
## Check Distribution
##################################################################
## Change data type
# series - actual TV series such as American Idol or Glee (names are masked for this exercise)
data$series = as.character(data$series)
sort(xtabs(~series,data), decreasing = T)[1:10]

# network - Networks such as ABC, HBO, FOX, etc. Names are masked.
data$network = as.character(data$network)
sort(xtabs(~network,data), decreasing = T)[1:10]

# Type - Type of TV network (broadcast or cable)
data$Type = as.character(data$Type)
sort(xtabs(~Type,data), decreasing = T)
rowPerc(xtabs(~Type,data))

# Eps 4 - Number of episodes in the given timeframe (assume a broadcast month = 4 weeks)
data$X..Eps = as.integer(data$X..Eps)
summary(data$X..Eps)

# Air Day - Day of episode airing (M, T, W, R, F, S, U)
data$Air.Day_M=unlist(lapply(data$Air.Day, function(x) grepl('M',x)))
data$Air.Day_T=unlist(lapply(data$Air.Day, function(x) grepl('T',x)))
data$Air.Day_W=unlist(lapply(data$Air.Day, function(x) grepl('W',x)))
data$Air.Day_R=unlist(lapply(data$Air.Day, function(x) grepl('R',x)))
data$Air.Day_F=unlist(lapply(data$Air.Day, function(x) grepl('F',x)))
data$Air.Day_S=unlist(lapply(data$Air.Day, function(x) grepl('S',x)))
data$Air.Day_U=unlist(lapply(data$Air.Day, function(x) grepl('U',x)))
rowPerc(xtabs(~data$Air.Day_M,data))
rowPerc(xtabs(~data$Air.Day_T,data))
rowPerc(xtabs(~data$Air.Day_W,data))
rowPerc(xtabs(~data$Air.Day_R,data))
rowPerc(xtabs(~data$Air.Day_F,data))
rowPerc(xtabs(~data$Air.Day_S,data))
rowPerc(xtabs(~data$Air.Day_U,data))

# National Time - 9:00 PM Airing start time
# tmp=as.POSIXct(as.character(data$National.Time), format="%H:%M %r")
# class(data$National.Time)
# rowPerc(xtabs(~data$National.Time,data))

# daypart prime Industry-standard time block (see side panel for details)
data$daypart=as.character(data$daypart)
xtabs(~daypart,data)
rowPerc(xtabs(~daypart,data))

#Run_time (min) 60 Series run time in minutes
summary(data$Run_time..min.)
data=data[order(data$Run_time..min., decreasing = T), ]

#Unique HHs 2,636,448 Number of unique Households tuned in to a given series within given time interval*
data$Unique.HHs=as.integer(data$Unique.HHs)
summary(data$Unique.HHs)

#Total Hrs Viewed 1,534,543 Sum of hours 'logged in' to the given program by all viewers in given time frame
data$Total.Hrs.Viewed=as.integer(data$Total.Hrs.Viewed)
summary(data$Total.Hrs.Viewed)

#Avg % Viewed 53.6% Average % of the program viewed**
tmp=str_replace_all(data$Avg...Viewed, "%", "")
data$Avg...Viewed=as.numeric(tmp)/100
summary(data$Avg...Viewed)

##################################################################
## Check Time Series Distribution
##################################################################

train$Date = as.Date(train$Date, "%Y-%m-%d")
ggplot(train, aes(x=Date, y=embroidered_top)) +
  geom_point(aes( x= Date, y=embroidered_top), col = "blue", size = 1) +
  scale_x_date(labels=date_format("%b %y")) +
  stat_smooth(color="red")

Grep Strs in R

Identify specific characters in columns
> data[1:20,]
       series   network      Type X..Eps Air.Day National.Time     daypart Run_time..min.  Unique.HHs Total.Hrs.Viewed
...
14  series314  network5 Broadcast      1       T       8:00 PM       prime             60    991,953          469,083
15  series838  network4 Broadcast      4       T       9:00 PM       prime             60  2,636,448        1,534,543
16  series230  network5 Broadcast      3       M      10:01 PM       prime             59  2,080,862        1,112,973
17   series39  network4 Broadcast      3       U       7:00 PM primeaccess             60  2,133,072        1,123,172
18  series193  network5 Broadcast      5     T U      10:01 PM       prime             59  3,351,520        1,795,463
# Air Day - Day of episode airing (M, T, W, R, F, S, U)
> unlist(lapply(data$Air.Day, function(x) grepl('M',x)))
   [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
...

Extract specific strings
  tmp=str_replace_all(topstr, " ", "_")
  tmp=paste("^",tmp,"$", sep="")
  tmp2=grep(tmp, totalSampleName)

Friday, August 26, 2016

1 - Download Google Trend Data

1 Use Python to Download

from pytrends import pyGTrends
server = pyGTrends.pyGTrends("js970008@gmail.com", "durham19")
styles=[
"Bomber jacket",
...
]
for idx, val in enumerate(styles):
    if idx >= 10:
        print("*************************")
        server.request_report(val)
        server.save_csv("", str(idx))
        print(idx)
 
2 Use Shell to Extract TS

cat 0.csv | awk 'NR>=5 && NR <=663 {print }' > result0.csv
cat 1.csv | awk 'NR>=5 && NR <=663 {print }' > result1.csv

3 Use R to Combine

Sample=fread("result0.csv")
dim(Sample)
str(Sample)
names(Sample)=str_replace_all(names(Sample), " ", "_")
head(Sample)
n=nrow(Sample)
for(i in 1:2){
  info = file.info(paste("result",i,".csv",sep = ""))
  if(!(info$size == 0)){
    tmp=fread(paste("result",i,".csv",sep = ""))
    names(tmp)=str_replace_all(names(tmp), " ", "_")

    setkey(Sample, "Week")
    setkey(tmp, "Week")
    Sample = merge(Sample, tmp, by="Week")
  }
  else print(paste("result",i,".csv is empty",sep = ""))
}

4 Use R to directly extract data
filename <- "0.csv"

con  <- file(filename, open = "r")
linecount <- 0
stringdata <- ""
while (length(oneLine <- readLines(con, n = 1, warn = FALSE)) > 0) {
  linecount <- linecount + 1

  if (linecount < 3) {
    filename <- paste0(filename,oneLine)  
  }

  # get headers at line 5
  if (linecount == 5) rowheaders = strsplit(oneLine, ",")[[1]]

  # skip first 5 lines
  if (linecount > 5) {
    # break when there is no more main data
    if (gsub(pattern=",", x=oneLine, replacement="") == "") break
 
    stringdata <- paste0(stringdata,oneLine,"\n")
  }
}
close(con)

newData <- read.table(textConnection(stringdata), sep=",", header=FALSE, stringsAsFactors = FALSE)
names(newData) <- rowheaders
newData$StartDate <- as.Date(sapply(strsplit(as.character(newData[,1]), " - "), `[`, 1))
newData$EndDate <- as.Date(sapply(strsplit(as.character(newData[,1]), " - "), `[`, 2))
newData$year <- sapply(strsplit(as.character(newData$StartDate), "-"), `[`, 1)

Sample=data.table(do.call(cbind.data.frame, newData))

5 Use R library

install.packages("gtrendsR")

library(gtrendsR)
user <- "***@gmail.com"
psw <- "***"
gconnect(user, psw)

trend <- gtrends(c(
  "Bomber jacket",
  "Satin bomber jacket",
  "Embroidered bomber jacket",
  "Flight jacket",
  "Slip dress"
))
plot(trend)

Friday, February 12, 2016

Master R 14 - Analyzing the R community

R Foundation members

library(XML)
page <- htmlParse('http://r-project.org/foundation/donors.html')
list <- unlist(xpathApply(page, "//h3[@id='supporting-members']/following-sibling::ul[1]/li", xmlValue))
str(list)

supporterlist <- sub(' \\([a-zA-Z ]*\\)$', '', list)
countrylist   <- substr(list, nchar(supporterlist) + 3, nchar(list) - 1)
tail(sort(prop.table(table(countrylist)) * 100), 5)

Visualizing supporting members around the world

countries <- as.data.frame(table(countrylist))
library(rworldmap)
joinCountryData2Map(countries, joinCode = 'NAME', nameJoinColumn = 'countrylist', verbose = TRUE)

library(ggmap)
for (fix in c('Brasil', 'CZ', 'Danmark', 'NL')) {
   countrylist[which(countrylist == fix)] <-
       geocode(fix, output = 'more')$country
}

countries <- as.data.frame(table(countrylist))
countries <- joinCountryData2Map(countries, joinCode = 'NAME', nameJoinColumn = 'countrylist')
mapCountryData(countries, 'Freq', catMethod = 'logFixedWidth', mapTitle = 'Number of R Foundation supporting members')


R package maintainers

packages <- readHTMLTable(paste0('http://cran.r-project.org', '/web/checks/check_summary.html'), which = 2)
maintainers <- sub('(.*) <(.*)>', '\\1', packages$' Maintainer')
maintainers <- gsub(' ', ' ', maintainers)
str(maintainers)
tail(sort(table(maintainers)), 8)


The number of packages per maintainer

N <- as.numeric(table(maintainers))
library(fitdistrplus)
plotdist(N)
descdist(N, boot = 1e3)
(gparams <- fitdist(N, 'gamma'))
gshape <- gparams$estimate[['shape']]
grate  <- gparams$estimate[['rate']]
sum(rgamma(1e5, shape = gshape, rate = grate))
hist(rgamma(1e5, shape = gshape, rate = grate))
pgamma(2, shape = gshape, rate = grate)
prop.table(table(N <= 2))
ploc <- min(N)
pshp <- length(N) / sum(log(N) - log(ploc))

library(actuar)
ppareto(2, pshp, ploc)
fg <- fitdist(N, 'gamma')
fw <- fitdist(N, 'weibull')
fl <- fitdist(N, 'lnorm')
fp <- fitdist(N, 'pareto', start = list(shape = 1, scale = 1))
par(mfrow = c(1, 2))
denscomp(list(fg, fw, fl, fp), addlegend = FALSE)
qqcomp(list(fg, fw, fl, fp),  legendtext = c('gamma', 'Weibull', 'Lognormal', 'Pareto')) 
length(unique(maintainers))


The R-help mailing list

library(RCurl)
url <- getURL('https://stat.ethz.ch/pipermail/r-help/')
R.help.toc <- htmlParse(url)
R.help.archives <- unlist(xpathApply(R.help.toc, "//table//td[3]/a", xmlAttrs), use.names = FALSE)
dir.create('help-r')
for (f in R.help.archives)
     download.file(url = paste0(url, f), file.path('help-r', f), method = 'curl')
lines <- system(paste0("zgrep -E '^From: .* at .*' ./help-r/*.txt.gz"), intern = TRUE)
length(lines)
length(unique(lines))
lines[26]
lines    <- sub('.*From: ', '', lines)
Rhelpers <- sub('.*\\((.*)\\)', '\\1', lines)
tail(sort(table(Rhelpers)), 6)
grep('Brian( D)? Ripley', names(table(Rhelpers)), value = TRUE)

Volume of the R-help mailing list

lines <- system(paste0(
"zgrep -E '^Date: [A-Za-z]{3}, [0-9]{1,2} [A-Za-z]{3} ",
"[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2} [-+]{1}[0-9]{4}' ",
"./help-r/*.txt.gz"), intern = TRUE)
length(lines)
head(sub('.*Date: ', '', lines[1]))
times <- strptime(sub('.*Date: ', '', lines), format = '%a, %d %b %Y %H:%M:%S %z')
plot(table(format(times, '%Y')), type = 'l')

library(data.table)
Rhelp <- data.table(time = times)
Rhelp[, H := hour(time)]
Rhelp[, D := wday(time)]
library(ggplot2)
ggplot(na.omit(Rhelp[, .N, by = .(H, D)]),
     aes(x = factor(H), y = factor(D), size = N)) + geom_point() +
     ylab('Day of the week') + xlab('Hour of the day') +
     ggtitle('Number of mails posted on [R-help]') +
     theme_bw() + theme('legend.position' = 'top')
tail(sort(table(sub('.*([+-][0-9]{4}).*', '\\1', lines))), 22)

Forecasting the e-mail volume in the future

Rhelp[, date := as.Date(time)]
Rdaily <- na.omit(Rhelp[, .N, by = date])
Rdaily <- zoo(Rdaily$N, Rdaily$date)
plot(Rdaily)

library(forecast)
fit <- ets(Rdaily)
predict(fit, 1)
plot(forecast(fit, 30), include = 365)


Analyzing overlaps between our lists of R users

lists <- rbindlist(list(
data.frame(name = unique(supporterlist), list = 'supporter'),
data.frame(name = unique(maintainers),   list = 'maintainer'),
data.frame(name = unique(Rhelpers),      list = 'R-help')))

t <- table(lists$name, lists$list)
table(rowSums(t))
library(Rcapture)
descriptive(t)
closedp(t)

Further ideas on extending the capture-recapture models


The number of R users in social media

library(fbRads)
fbad_init(FB account ID, FB API token)
fbad_get_search(q = 'rstats', type = 'adinterest')
fbad_get_search(fbacc = fbacc, q = 'SPSS', type = 'adinterest')
res <- fbad_get_search(fbacc = fbacc, q = 'programming language', type = 'adinterest')
res <- res[order(res$audience_size, decreasing = TRUE), ]
res[1:10, 1:3]


R-related posts in social media

library(twitteR)
setup_twitter_oauth(...)
str(searchTwitter("#rstats", n = 1, resultType = 'recent'))
tweets <- Rtweets(n = 500)
length(strip_retweets(tweets))
tweets <- twListToDF(tweets)

library(tm)
corpus <- Corpus(VectorSource(tweets$text))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, 'rstats')


library(wordcloud)
wordcloud(corpus)

Wednesday, February 10, 2016

Master R 13 - Data Around Us

Geocoding

library(hflights)
library(data.table)
dt <- data.table(hflights)[, list(
    N         = .N,
    Cancelled = sum(Cancelled),
    Distance  = Distance[1],
    TimeVar   = sd(ActualElapsedTime, na.rm = TRUE),
    ArrDelay  = mean(ArrDelay, na.rm = TRUE)) , by = Dest]

str(dt)

library(ggmap)
(h <- geocode('Houston, TX'))
dt[, c('lon', 'lat') := geocode(Dest)]

str(setDF(dt))


Visualizing point data in space

library(maps)
map('state')
title('Flight destinations from Houston,TX')
points(h$lon, h$lat, col = 'blue', pch = 13)
points(dt$lon, dt$lat, col = 'red', pch = 19)
text(dt$lon, dt$lat + 1, labels = dt$Dest, cex = 0.7)

map('state')
title('Frequent flight destinations from Houston,TX')
points(h$lon, h$lat, col = 'blue', pch = 13)
points(dt$lon, dt$lat, pch = 19, col = rgb(1, 0, 0, dt$N / max(dt$N)))
legend('bottomright', legend = round(quantile(dt$N)), pch = 19, col = rgb(1, 0, 0, quantile(dt$N) / max(dt$N)), box.col = NA)


Finding polygon overlays of point data

str(map_data <- map('state', plot = FALSE, fill = TRUE))
grep('^washington', map_data$names, value = TRUE)
states <- sapply(strsplit(map_data$names, ':'), '[[', 1)
library(maptools)
us <- map2SpatialPolygons(map_data, IDs = states, proj4string = CRS("+proj=longlat +datum=WGS84"))
plot(us)

library(sp)
dtp <- SpatialPointsDataFrame(dt[, c('lon', 'lat')], dt, proj4string = CRS("+proj=longlat +datum=WGS84"))
str(sp::over(us, dtp))
str(sapply(sp::over(us, dtp, returnList = TRUE), function(x) sum(x$Cancelled)))
## cancels: per-state cancellation counts derived from the over() result above
val <- cancels$Cancelled[match(states, row.names(cancels))]
val[is.na(val)] <- 0


Plotting thematic maps

map("state", col = rgb(1, 0, 0, sqrt(val/max(val))), fill = TRUE)
title('Number of cancelled flights from Houston to US states')
points(h$lon, h$lat, col = 'blue', pch = 13)
legend('bottomright', legend = round(quantile(val)), fill = rgb(1, 0, 0, sqrt(quantile(val)/max(val))), box.col = NA)

Rendering polygons around points

Contour lines

library(fields)
out <- as.image(dt$ArrDelay, x = dt[, c('lon', 'lat')], nrow = 256, ncol = 256)
table(is.na(out$z))
image(out)
image(out, xlim = base::range(map_data$x, na.rm = TRUE), ylim = base::range(map_data$y, na.rm = TRUE))
look <- image.smooth(out, theta = .5)
table(is.na(look$z))
image(look)

usa_data <- map('usa', plot = FALSE, region = 'main')
p <- expand.grid(look$x, look$y)
library(sp)
n <- which(point.in.polygon(p$Var1, p$Var2, usa_data$x, usa_data$y) == 0)
look$z[n] <- NA

map("usa")
image(look, add = TRUE)
map("state", lwd = 3, add = TRUE)
title('Arrival delays of flights from Houston')
points(dt$lon, dt$lat, pch = 19, cex = .5)
points(h$lon, h$lat, pch = 13)

Voronoi diagrams

library(deldir)
map("usa")
plot(deldir(dt$lon, dt$lat), wlines = "tess", lwd = 2, pch = 19, col = c('red', 'darkgray'), add = TRUE)


Satellite maps

library(OpenStreetMap)
map <- openmap(c(max(map_data$y, na.rm = TRUE),
                  min(map_data$x, na.rm = TRUE)),
                c(min(map_data$y, na.rm = TRUE),
                  max(map_data$x, na.rm = TRUE)),
                type = 'stamen-terrain')
map <- openproj(map, projection = '+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs')
plot(map)
plot(deldir(dt$lon, dt$lat), wlines = "tess", lwd = 2, col = c('red', 'black'), pch = 19, cex = 0.5, add = TRUE)


Interactive maps

Querying Google Maps

cancels$state <- rownames(cancels)
cancels$Cancelled[is.na(cancels$Cancelled)] <- 0

library(googleVis)
plot(gvisGeoChart(cancels, 'state', 'Cancelled',
                   options = list(
                       region      = 'US',
                       displayMode = 'regions', 
                       resolution  = 'provinces')))
dt$LatLong <- paste(dt$lat, dt$lon, sep = ':')
dt$tip <- apply(dt, 1, function(x) paste(names(dt), x, collapse = '<br/ >'))
plot(gvisMap(dt, 'LatLong', tipvar = 'tip'))


JavaScript mapping libraries

devtools::install_github('ramnathv/rCharts')
library(rCharts)
map <- Leaflet$new()
map$setView(as.numeric(dt[which(dt$Dest == 'MCI'),
   c('lat', 'lon')]), zoom = 4)
for (i in 1:nrow(dt))
   map$marker(c(dt$lat[i], dt$lon[i]), bindPopup = dt$tip[i])
map$show()

library(leaflet)
leaflet(us) %>%
   addProviderTiles("Acetate.terrain") %>%
   addPolygons() %>%
   addMarkers(lng = dt$lon, lat = dt$lat, popup = dt$tip)


Alternative map designs

dt <- dt[point.in.polygon(dt$lon, dt$lat, usa_data$x, usa_data$y) == 1, ]

library(diagram)
library(scales)
map("usa")
title('Number of flights, cancellations and delays from Houston')
image(look, add = TRUE)
map("state", lwd = 3, add = TRUE)
for (i in 1:nrow(dt)) {
   curvedarrow(
     from       = rev(as.numeric(h)),
     to         = as.numeric(dt[i, c('lon', 'lat')]),
     arr.pos    = 1,
     arr.type   = 'circle',
     curve      = 0.1,
     arr.col    = alpha('black', dt$N[i] / max(dt$N)),
     arr.length = dt$N[i] / max(dt$N),
     lwd        = dt$Cancelled[i] / max(dt$Cancelled) * 25,
     lcol       = alpha('black',
                    dt$Cancelled[i] / max(dt$Cancelled)))
 }


Spatial statistics

dm <- dist(dt[, c('lon', 'lat')])
dm <- as.matrix(dm)
idm <- 1 / dm
diag(idm) <- 0
str(idm)
dt$TimeVar[is.na(dt$TimeVar)] <- 0
library(ape)
Moran.I(dt$TimeVar, idm)

library(spdep)
idml <- mat2listw(idm)
moran.test(dt$TimeVar, idml)
idml <- mat2listw(idm, style = "W")

moran.test(dt$TimeVar, idml)

Master R 12 - Analyzing Time-series

Creating time-series objects

library(hflights)
library(data.table)
dt <- data.table(hflights)
dt[, date := ISOdate(Year, Month, DayofMonth)]

daily <- dt[, list(
     N         = .N,
     Delays    = sum(ArrDelay, na.rm = TRUE),
     Cancelled = sum(Cancelled),
     Distance  = mean(Distance)
 ), by = date]
str(daily)


Visualizing time-series

plot(ts(daily))
setorder(daily, date)
plot(ts(daily))
plot(ts(daily$N, start = 2011, frequency = 365), main = 'Number of flights from Houston in 2011')


Seasonal decomposition

plot(decompose(ts(daily$N, frequency = 7)))
setNames(decompose(ts(daily$N, frequency = 7))$figure, weekdays(daily$date[1:7]))
decompose(ts(daily$N, frequency = 365))


Holt-Winters filtering

nts <- ts(daily$N, frequency = 7)
fit <- HoltWinters(nts, beta = FALSE, gamma = FALSE)
plot(fit)
fit <- HoltWinters(nts)
plot(fit)
library(forecast)
forecast(fit)
plot(forecast(HoltWinters(nts), 31))


Autoregressive Integrated Moving Average models

auto.arima(nts)
auto.arima(nts, approximation = FALSE)
plot(forecast(auto.arima(nts, D = 1, approximation = FALSE), 31))


Outlier detection

cts <- ts(daily$Cancelled)
fit <- auto.arima(cts)
auto.arima(cts)

library(tsoutliers)
outliers <- tso(cts, tsmethod = 'arima',
   args.tsmethod  = list(order = c(1, 1, 2)))
plot(outliers)
plot(tso(ts(daily$Cancelled)))
dfc <- as.data.frame(daily[, c('date', 'Cancelled'), with = FALSE])

library(AnomalyDetection)
AnomalyDetectionTs(dfc, plot = TRUE)$plot


More complex time-series objects

library(zoo)
zd <- zoo(daily[, -1, with = FALSE], daily[[1]])
plot(zd)

plot(cumsum(zd))

Tuesday, February 9, 2016

Master R 11 - Social Network analysis of the R Ecosystem

Loading network data

library(tools)
pkgs <- available.packages()
str(pkgs)

head(package.dependencies(pkgs), 2)
library(plyr)
edges <- ldply(
   c('Depends', 'Imports', 'Suggests'), function(depLevel) {
     deps <- package.dependencies(pkgs, depLevel = depLevel)
     ldply(names(deps), function(pkg)
         if (!identical(deps[[pkg]], NA))
             data.frame(
                 src   = pkg,
                 dep   = deps[[pkg]][, 1],
                 label = depLevel,
                 stringsAsFactors = FALSE))
})

str(edges)


Centrality measures of networks
 compute centrality and density metrics
 identify bridges and simulate changes in the network

nrow(edges) / (nrow(pkgs) * (nrow(pkgs) - 1))
head(sort(table(edges$dep), decreasing = TRUE))
edges <- edges[edges$dep != 'R', ]

library(igraph)
g <- graph.data.frame(edges)
summary(g)
graph.density(g)
head(sort(degree(g), decreasing = TRUE))
head(sort(betweenness(g), decreasing = TRUE))


Visualizing network data

plot(degree(g), betweenness(g), type = 'n', main = 'Centrality of R package dependencies')
text(degree(g), betweenness(g), labels = V(g)$name)

edges <- edges[edges$label != 'Suggests', ]
deptree <- edges$dep[edges$src == 'igraph']
while (!all(edges$dep[edges$src %in% deptree] %in% deptree))
   deptree <- union(deptree, edges$dep[edges$src %in% deptree])
deptree

g <- graph.data.frame(edges[edges$src %in% c('igraph', deptree), ])
plot(g)

V(g)$label.color <- 'orange'
V(g)$label.color[V(g)$name == 'igraph'] <- 'darkred'
V(g)$label.color[V(g)$name %in% edges$dep[edges$src == 'igraph']] <- 'orangered'
E(g)$color <- c('blue', 'green')[factor(E(g)$label)]  # colour edges by dependency type
plot(g, vertex.shape = 'none', edge.label = NA)
tkplot(g, edge.label = NA)

library(visNetwork)
nodes <- get.data.frame(g, 'vertices')
names(nodes) <- c('id', 'color')
edges <- get.data.frame(g)
visNetwork(nodes, edges)

Custom plot layouts

g <- dominator.tree(g, root = "igraph")$domtree
plot(g, layout = layout.reingold.tilford(g, root = "igraph"), vertex.shape = 'none')

Analyzing R package dependencies with an R package

library(miniCRAN)
pkgs <- pkgAvail()
pkgDep('igraph', availPkgs = pkgs, suggests = FALSE, includeBasePkgs = TRUE)

plot(makeDepGraph('igraph', pkgs, suggests = FALSE, includeBasePkgs = TRUE))

Monday, February 8, 2016

Master R 10 - Classification and Clustering

Cluster analysis

# Hierarchical clustering

d <- dist(mtcars)
h <- hclust(d)
h
plot(h)
rect.hclust(h, k=3, border = "red")
cn <- cutree(h, k=3)
table(cn)
round(aggregate(mtcars, FUN = mean, by = list(cn)), 1)
round(aggregate(mtcars, FUN = sd, by = list(cn)), 1)
round(sapply(mtcars, sd), 1)
round(apply(aggregate(mtcars, FUN = mean, by = list(cn)),2, sd), 1)

# Determining the ideal number of clusters

install.packages('NbClust')
library(NbClust)
NbClust(mtcars, method = 'complete', index = 'dindex')
NbClust(mtcars, method = 'complete', index = 'hartigan')$Best.nc
NbClust(mtcars, method = 'complete', index = 'kl')$Best.nc
NbClust(iris[, -5], method = 'complete', index = 'all')$Best.nc[1,]

# K-means clustering

(k <- kmeans(mtcars, 3))
all(cn == k$cluster)

# Visualizing clusters
library(cluster) 
clusplot(mtcars, k$cluster, color = TRUE, shade = TRUE, labels = 2)


# Latent class models
factors <- c('cyl', 'vs', 'am', 'carb', 'gear')
mtcars[, factors] <- lapply(mtcars[, factors], factor)

# Latent Class Analysis
install.packages('poLCA')
library(poLCA)
p <- poLCA(cbind(cyl, vs, am, carb, gear) ~ 1,data = mtcars, graphs = TRUE, nclass = 3)
p$P

# Latent class regression
# Discriminant analysis
rm(mtcars)
mtcars$gear <- factor(mtcars$gear)
library(MASS)
d <- lda(gear ~ ., data = mtcars, CV =TRUE)
(tab <- table(mtcars$gear, d$class))
sum(diag(tab)) / sum(tab)
round(d$posterior, 4)
d <- lda(gear ~ ., data = mtcars)
plot(d)
plot(d, dimen = 1, type = "both" )


# Logistic regression
lr <- glm(am ~ hp + wt, data = mtcars, family = binomial)
summary(lr)
table(mtcars$am, round(predict(lr, type = 'response')))

install.packages('nnet')
library(nnet) 
(mlr <- multinom(factor(gear) ~ ., data = mtcars)) 
table(mtcars$gear, predict(mlr))
rm(mtcars)


# Machine learning algorithms
# The K-Nearest Neighbors algorithm

#K-Nearest Neighbors is a supervised classification algorithm that is mostly used in pattern recognition and business analytics. A big advantage of k-NN is that it is not sensitive to outliers, and its usage is extremely straightforward.

set.seed(42)
n     <- nrow(mtcars)
train <- mtcars[sample(n, n/2), ]

library(dplyr)
train <- sample_n(mtcars, n / 2)
test <- mtcars[setdiff(row.names(mtcars), row.names(train)), ]
library(class)
(cm <- knn(
  train = subset(train, select = -gear),
  test  = subset(test, select = -gear),
  cl    = train$gear,
  k     = 5))
cor(test$gear, as.numeric(as.character(cm)))
table(train$gear)

# Classification trees
library(rpart)
ct <- rpart(factor(gear) ~ ., data = train, minsplit = 3)
summary(ct)
plot(ct)
text(ct)
table(test$gear, predict(ct, newdata = test, type = 'class'))

install.packages('party')
library(party)
ct <- ctree(factor(gear) ~ drat, data = train, controls = ctree_control(minsplit = 3)) 
plot(ct, main = "Conditional Inference Tree")
table(test$gear, predict(ct, newdata = test, type = 'node'))

install.packages('randomForest')
library(randomForest)
(rf <- randomForest(factor(gear) ~ ., data = train, ntree = 250))
table(test$gear, predict(rf, test))   
plot(rf)
legend('topright', legend = colnames(rf$err.rate), col    = 1:4, fill   = 1:4, bty    = 'n')

# Other algorithms
install.packages(c('caret','c50'))
library(caret)
library(C50)
C50 <- train(factor(gear) ~ ., data = train, method = 'C5.0')
summary(C50)
table(test$gear, predict(C50, test))

Saturday, February 6, 2016

Master R 9 - From Big to Small Data

Adequacy tests

Normality

library(hflights)
JFK <- hflights[which(hflights$Dest == 'JFK'), c('TaxiIn', 'TaxiOut')]
JFK <- subset(hflights, Dest == 'JFK', select = c(TaxiIn, TaxiOut))

par(mfrow = c(1, 2))
qqnorm(JFK$TaxiIn, ylab = 'TaxiIn')
qqline(JFK$TaxiIn)
qqnorm(JFK$TaxiOut, ylab = 'TaxiOut')
qqline(JFK$TaxiOut)

shapiro.test(JFK$TaxiIn)

Multivariate normality

JFK <- na.omit(JFK)
library(MVN)
mardiaTest(JFK)
hzTest(JFK)
roystonTest(JFK)
par(mfrow = c(1, 2))
mvnPlot(mvt, type = "contour", default = TRUE)
mvnPlot(mvt, type = "persp", default = TRUE)

set.seed(42)
mvt <- roystonTest(MASS::mvrnorm(100, mu = c(0, 0), Sigma = matrix(c(10, 3, 3, 2), 2)))
mvnPlot(mvt, type = "contour", default = TRUE)
mvnPlot(mvt, type = "persp", default = TRUE)


Dependence of variables

hflights_numeric <- hflights[, which(sapply(hflights, is.numeric))]
cor(hflights_numeric, use = "pairwise.complete.obs")
str(cor(hflights_numeric, use = "pairwise.complete.obs"))
hflights_numeric <- hflights[,which(sapply(hflights, function(x) is.numeric(x) && var(x, na.rm = TRUE) != 0))]
table(is.na(cor(hflights_numeric, use = "pairwise.complete.obs")))

library(ellipse)
plotcorr(cor(hflights_numeric, use = "pairwise.complete.obs"))
plotcorr(cor(data.frame(
  1:10,
  1:10 + runif(10),
  1:10 + runif(10) * 5,
  runif(10),
  10:1,
  check.names = FALSE)))
cor(hflights$FlightNum, hflights$Month)

KMO and Bartlett's test

library(psych)
KMO(cor(hflights_numeric, use = "pairwise.complete.obs"))
cor(hflights_numeric[, c('Cancelled', 'AirTime')])
cancelled <- which(hflights_numeric$Cancelled == 1)
table(hflights_numeric$AirTime[cancelled], exclude = NULL)
table(hflights_numeric$Cancelled)
hflights_numeric <- subset(hflights_numeric, select = -Cancelled)
which(is.na(cor(hflights_numeric, use = "pairwise.complete.obs")), arr.ind = TRUE)
hflights_numeric <- subset(hflights_numeric, select = -Diverted)
KMO(cor(hflights_numeric[, -c(14)], use = "pairwise.complete.obs"))
KMO(mtcars)
cortest.bartlett(cor(mtcars))


Principal Component Analysis

PCA algorithms
prcomp(mtcars, scale = TRUE)
summary(prcomp(mtcars, scale = TRUE))
sum(prcomp(scale(mtcars))$sdev^2)

Determining the number of components

prcomp(scale(mtcars))$sdev^2
VSS.scree(cor(mtcars))
scree(cor(mtcars))
fa.parallel(mtcars)

Interpreting components

pc <- prcomp(mtcars, scale = TRUE)
head(pc$x[, 1:2])
head(scale(mtcars) %*% pc$rotation[, 1:2])
summary(pc$x[, 1:2])
apply(pc$x[, 1:2], 2, sd)
round(cor(pc$x))
pc$rotation[, 1:2]

biplot(pc, cex = c(0.8, 1.2))
abline(h = 0, v = 0, lty = 'dashed')
cor(mtcars, pc$x[, 1:2])

Rotation methods

varimax(pc$rotation[, 1:2])
pcv <- varimax(pc$rotation[, 1:2])$loadings
plot(scale(mtcars) %*% pcv, type = 'n', xlab = 'Transmission', ylab = 'Power')
text(scale(mtcars) %*% pcv, labels = rownames(mtcars))


library(GPArotation)
promax(pc$rotation[, 1:2])

Outlier-detection with PCA

library(jpeg)
t <- tempfile()
download.file('http://bit.ly/nasa-img', t)
img <- readJPEG(t)
str(img)
h <- dim(img)[1]
w <- dim(img)[2]
m <- matrix(img, h*w)
str(m)
pca <- prcomp(m)
summary(pca)
pca$rotation
extractColors <- function(x) rgb(x[1], x[2], x[3])
(colors <- apply(abs(pca$rotation), 2, extractColors))
pie(pca$sdev, col = colors, labels = colors)

par(mfrow = c(1, 2), mar = rep(0, 4))
image(matrix(pca$x[, 1], h), col = gray.colors(100))
image(matrix(pca$x[, 2], h), col = gray.colors(100), yaxt = 'n')


Factor analysis

m <- subset(mtcars, select = c(mpg, cyl, hp, carb))
(f <- fa(m))
fa.diagram(f)
cor(f$scores, mtcars$disp)


Principal Component Analysis versus Factor Analysis

PCA is used to reduce the number of variables by creating principal components that can then be used in further analyses instead of the original variables. This means that we try to extract the essence of the dataset by means of artificially created variables that best describe the variance of the data.

FA works the other way around, as it tries to identify unknown, latent variables that explain the original data. In plain English, we use the manifest variables from our empirical dataset to guess the internal structure of the data.
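A quick side-by-side sketch on mtcars, reusing the prcomp() and fa() calls that appear in this chapter (illustrative only):

## PCA: artificial components that capture as much variance as possible
pc <- prcomp(mtcars, scale = TRUE)
summary(pc)$importance[, 1:2]    # variance explained by PC1 and PC2
## FA: a latent factor assumed to generate the observed (manifest) variables
library(psych)
f <- fa(subset(mtcars, select = c(mpg, cyl, hp, carb)))
f$loadings                       # how the manifest variables load on the factor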


Multidimensional Scaling

as.matrix(eurodist)[1:5, 1:5]
(mds <- cmdscale(eurodist))
plot(mds)
plot(mds, type = 'n')
text(mds[, 1], mds[, 2], labels(eurodist))
mds <- cmdscale(dist(mtcars))
plot(mds, type = 'n')

text(mds[, 1], mds[, 2], rownames(mds))

Thursday, February 4, 2016

Master R 8 - Polishing data

The types and origins of missing data

Missing Completely at Random (MCAR) means that every value in the dataset has the same probability of being missing,
so no systematic error or distortion is to be expected due to the missing data, nor can we explain the pattern of missing values.

Missing at Random (MAR) means that the probability of a value being missing is known, or at least can be identified from other observed variables, although it has nothing to do with the actual missing value itself.

Missing Not at Random (MNAR) means that data is missing for a specific reason that is highly related to the actual question,
which classifies the missing values as non-ignorable non-response.
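A tiny simulated illustration (not from the chapter, just for intuition): under MCAR every value has the same chance of being dropped, while under MNAR the chance depends on the value itself, which biases naive summaries.

set.seed(42)
x <- rnorm(1000)
## MCAR: drop roughly 20% of the values completely at random
x_mcar <- ifelse(runif(1000) < 0.2, NA, x)
## MNAR: larger values are more likely to go missing
x_mnar <- ifelse(runif(1000) < plogis(x), NA, x)
## the MNAR mean is biased downwards; the MCAR mean is not (on average)
c(original = mean(x),
  mcar = mean(x_mcar, na.rm = TRUE),
  mnar = mean(x_mnar, na.rm = TRUE))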


Identifying missing data

library(hflights)
table(complete.cases(hflights))
prop.table(table(complete.cases(hflights))) * 100
sort(sapply(hflights, function(x) sum(is.na(x))))

By-passing missing values

mean(cor(apply(hflights, 2, function(x) as.numeric(is.na(x)))), na.rm = TRUE)
Funs <- Filter(is.function, sapply(ls(baseenv()), get, baseenv()))
names(Filter(function(x) any(names(formals(args(x))) %in% 'na.rm'), Funs))
names(Filter(function(x) any(names(formals(args(x))) %in% 'na.rm'), Filter(is.function,sapply(ls('package:stats'), get, 'package:stats'))))

Overriding the default arguments of a function

myMean <- function(...) mean(..., na.rm = TRUE)
mean(c(1:5, NA))
myMean(c(1:5, NA))

library(rapportools)
mean(c(1:5, NA))
detach('package:rapportools')
mean(c(1:5, NA))

library(Defaults)
setDefaults(mean.default, na.rm = TRUE)
mean(c(1:5, NA))
setDefaults(mean, na.rm = TRUE)
mean

Getting rid of missing data

na.omit(c(1:5, NA))
na.exclude(c(1:5, NA))

x <- rnorm(10); y <- rnorm(10)
x[1] <- NA; y[2] <- NA
exclude <- lm(y ~ x, na.action = "na.exclude")
omit <- lm(y ~ x, na.action = "na.omit")
residuals(exclude)

m <- matrix(1:9, 3)
m[which(m %% 4 == 0, arr.ind = TRUE)] <- NA
m
na.omit(m)

Filtering missing data before or during the actual analysis

mean(hflights$ActualElapsedTime)
mean(hflights$ActualElapsedTime, na.rm = TRUE)
mean(na.omit(hflights$ActualElapsedTime))

library(microbenchmark)
NA.RM   <- function() mean(hflights$ActualElapsedTime, na.rm = TRUE)
NA.OMIT <- function() mean(na.omit(hflights$ActualElapsedTime))
microbenchmark(NA.RM(), NA.OMIT())

Data imputation

m[which(is.na(m), arr.ind = TRUE)] <- 0
m

ActualElapsedTime <- hflights$ActualElapsedTime
mean(ActualElapsedTime, na.rm = TRUE)
ActualElapsedTime[which(is.na(ActualElapsedTime))] <- mean(ActualElapsedTime, na.rm = TRUE)
mean(ActualElapsedTime)


library(Hmisc)
mean(impute(hflights$ActualElapsedTime, mean))
sd(hflights$ActualElapsedTime, na.rm = TRUE)
sd(ActualElapsedTime)
summary(iris)

library(missForest)
set.seed(81)
miris <- prodNA(iris, noNA = 0.2)
summary(miris)
iiris <- missForest(miris, xtrue = iris, verbose = TRUE)
str(iiris)

Comparing different imputation methods

miris <- miris[, 1:4]
iris_mean <- impute(miris, fun = mean)
iris_forest <- missForest(miris)
diag(cor(iris[, -5], iris_mean))
diag(cor(iris[, -5], iris_forest$ximp))

Not imputing missing values

Multiple imputation

Extreme values and outliers

detach('package:missForest')
detach('package:randomForest')

library(outliers)
outlier(hflights$DepDelay)
summary(hflights$DepDelay)

library(lattice)
bwplot(hflights$DepDelay)
IQR(hflights$DepDelay, na.rm = TRUE)

Testing extreme values

set.seed(83)
dixon.test(c(runif(10), pi))
model <- lm(hflights$DepDelay ~ 1)
model$coefficients
mean(hflights$DepDelay, na.rm = TRUE)
a <- 0.1
(n <- length(hflights$DepDelay))
(F <- qf(1 - (a/n), 1, n-2, lower.tail = TRUE))
(L <- ((n - 1) * F / (n - 2 + F))^0.5)
sum(abs(rstandard(model)) > L)

Using robust method

summary(lm(Sepal.Length ~ Petal.Length, data = miris))
lm(Sepal.Length ~ Petal.Length, data = iris)$coefficients

library(MASS)
summary(rlm(Sepal.Length ~ Petal.Length, data = miris))
f <- formula(Sepal.Length ~ Petal.Length)
cbind(orig =  lm(f, data = iris)$coefficients, lm   =  lm(f, data = miris)$coefficients, rlm  = rlm(f, data = miris)$coefficients)

miris$Sepal.Length[1] <- 14

cbind(orig = lm(f, data = iris)$coefficients, lm   = lm(f, data = miris)$coefficients, rlm  = rlm(f, data = miris)$coefficients)

Wednesday, February 3, 2016

Master R 7 - Unstructured Data

library(tm)
getSources()
getReaders()

# Importing the corpus
res <- XML::readHTMLTable(paste0('http://cran.r-project.org/', 'web/packages/available_packages_by_name.html'), which = 1)
head(res)
v <- Corpus(VectorSource(res$V2)); v
inspect(head(v, 3))
meta(v[[1]])
writeLines(as.character(v[[1]]))
lapply(v[1:5], as.character)

# Cleaning the corpus
getTransformations()
stopwords("english")
removeWords('to be or not to be', stopwords("english"))
v <- tm_map(v, removeWords, stopwords("english"))
inspect(head(v, 3))

v <- tm_map(v, content_transformer(tolower))
v <- tm_map(v, removePunctuation)
v <- tm_map(v, stripWhitespace)
inspect(head(v, 3))

# Visualizing the most frequent words in the corpus
wordcloud::wordcloud(v)

# Further cleanup
v <- tm_map(v, removeNumbers)
tdm <- TermDocumentMatrix(v)
inspect(tdm[1:5, 1:20])
findFreqTerms(tdm, lowfreq = 100)

myStopwords <- c('package', 'based', 'using')
v <- tm_map(v, removeWords, myStopwords)

# Stemming words
library(SnowballC)
wordStem(c('cats', 'mastering', 'modelling', 'models', 'model'))
wordStem(c('are', 'analyst', 'analyze', 'analysis'))
d <- v
v <- tm_map(v, stemDocument, language = "english")
v <- tm_map(v, content_transformer(function(x, d) {
paste(stemCompletion(strsplit(stemDocument(x), ' ')[[1]],d),collapse = ' ')
}), d)

tdm <- TermDocumentMatrix(v)
findFreqTerms(tdm, lowfreq = 100)


# Lemmatisation
Stemming removes characters from the end of words in the hope of finding the stem, which is a heuristic process that often results in non-existent words. We tried to overcome this issue by completing the stems to the shortest meaningful words with the help of a dictionary, which might result in a deviation in the meaning of the term.
Another way to reduce the number of inflectional forms of different terms, instead of deconstructing and then trying to rebuild the words, is morphological analysis with the help of a dictionary. This process is called lemmatisation, which looks for the lemma (the canonical form of a word) instead of the stem.
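As an illustration, a minimal sketch of lemmatisation with the textstem package (an assumption; it is not used elsewhere in this post):
library(textstem)
# lemmas instead of stems: inflected forms are mapped to dictionary forms
lemmatize_words(c('are', 'analyst', 'analyze', 'analysis'))
lemmatize_strings('mastering models and modelling')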

# Analyzing the associations among terms
findAssocs(tdm, 'data', 0.1)

# Some other metrics
vnchar <- sapply(v, function(x) nchar(x$content))
summary(vnchar)
(vm <- which.min(vnchar))
v[[vm]]
hist(vnchar, main = 'Length of R package descriptions', xlab = 'Number of characters')

# The segmentation of documents
hadleyverse <- c('ggplot2', 'dplyr', 'reshape2', 'lubridate','stringr', 'devtools', 'roxygen2', 'tidyr')
(w <- which(res$V1 %in% hadleyverse))
plot(hclust(dist(DocumentTermMatrix(v[w]))), xlab = 'Hadleyverse packages')
sapply(v[w], function(x) structure(content(x), .Names = meta(x, 'id')))

Tuesday, February 2, 2016

Master R 6 - Beyond the linear trend line

The modeling workflow
1 First, fit the model with the main predictors and all the relevant confounders, and then reduce the number of confounders by dropping out the non-significant ones. 
2 Decide whether to use the continuous variables in their original or categorized form.
3 Try to achieve a better fit by testing for non-linear relationships, if they are pragmatically relevant.
4 Finally, check the model assumptions (see the sketch below).
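A minimal sketch of these steps on simulated data (x, conf1, conf2, y and df below are made up purely for illustration):
library(lmtest)
set.seed(42)
df <- data.frame(conf1 = rnorm(500), conf2 = rnorm(500))
df$x <- df$conf1 + rnorm(500)                                      # predictor associated with conf1
df$y <- rbinom(500, 1, plogis(0.8 * df$x + 0.5 * df$conf1))        # binary response
full <- glm(y ~ x + conf1 + conf2, data = df, family = binomial)   # step 1: main predictor + confounders
summary(full)
reduced <- update(full, . ~ . - conf2)                             # drop the non-significant confounder
# step 2 would compare x with a categorized version if that is more interpretable
lrtest(reduced, update(reduced, . ~ . + I(x^2)))                   # step 3: test a non-linear term
plot(reduced)                                                      # step 4: residual diagnostics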

Logistic regression
library(catdata)
data(deathpenalty)
library(vcdExtra)
deathpenalty.expand <- expand.dft(deathpenalty)
binom.model.0 <- glm(DeathPenalty ~ DefendantRace, data = deathpenalty.expand, family = binomial)
summary(binom.model.0)
exp(cbind(OR = coef(binom.model.0), confint(binom.model.0)))

binom.model.1 <- update(binom.model.0, . ~ . + VictimRace)
summary(binom.model.1)
exp(cbind(OR = coef(binom.model.1), confint(binom.model.1)))

prop.table(table(factor(deathpenalty.expand$VictimRace,labels = c("VictimRace=0", "VictimRace=1")),factor(deathpenalty.expand$DefendantRace, labels = c("DefendantRace=0", "DefendantRace=1"))), 1)

Data considerations
As a general rule of thumb, logistic regression models require at least 10 events per predictor.
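For example, on the death penalty data used above, the event count (the size of the rarer outcome category) caps the number of predictors we can reasonably include:
table(deathpenalty.expand$DeathPenalty)
min(table(deathpenalty.expand$DeathPenalty)) / 10   # rough upper bound on the number of predictors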

Goodness of model fit
library(lmtest)
lrtest(binom.model.1)
library(BaylorEdPsych)
PseudoR2(binom.model.1)

Model comparison
lrtest(binom.model.0, binom.model.1)

Models for count data
Poisson regression
dfa <- readRDS('SMART_2013.RData')
(ct <- xtabs(~model+failure, data=dfa))
dfa <- dfa[dfa$model %in% names(which(rowSums(ct) - ct[, 1] > 0)),]

library(ggplot2)
ggplot(rbind(dfa, data.frame(model='All', dfa[, -1] )), aes(failure)) + ylab("log(count)") + 
geom_histogram(binwidth = 1, drop=TRUE, origin = -0.5)  + 
scale_y_log10() + scale_x_continuous(breaks=c(0:10)) + 
facet_wrap( ~ model, ncol = 3) + ggtitle("Histograms by manufacturer") + theme_bw()

poiss.base <- glm(failure ~ model, offset = log(freq), family = 'poisson', data = dfa)
summary(poiss.base)
contrasts(dfa$model, sparse = TRUE)
lrtest(poiss.base)

Negative binomial regression
library(MASS)
model.negbin.0 <- glm.nb(failure ~ model, offset = log(freq), data = dfa)
lrtest(poiss.base,model.negbin.0)

Multivariate non-linear models
model.negbin.1 <- update(model.negbin.0, . ~ . + capacity_bytes + age_month + temperature)
model.negbin.2 <- update(model.negbin.1, . ~ . + PendingSector)
lrtest(model.negbin.0, model.negbin.1, model.negbin.2)
summary(model.negbin.2)
exp(data.frame(exp_coef = coef(model.negbin.2)))
dfa$model <- relevel(dfa$model, 'WDC')


model.negbin.3 <- update(model.negbin.2, data = dfa)
library(broom)
format(tidy(model.negbin.3), digits = 4)

library(data.table)
dfa <- data.table(dfa)
dfa[, temp6 := cut2(temperature, g = 6)]
temperature.weighted.mean <- dfa[, .(wfailure = weighted.mean(failure, freq)), by = temp6] 
ggplot(temperature.weighted.mean, aes(x = temp6, y = wfailure)) +  
geom_bar(stat = 'identity') + xlab('Categorized temperature') + ylab('Weighted mean of disk faults') + theme_bw()

model.negbin.4 <- update(model.negbin.0, .~. + capacity_bytes + age_month + temp6 + PendingSector, data = dfa)
AIC(model.negbin.3,model.negbin.4)

weighted.means <- rbind(dfa[, .(l = 'capacity', f = weighted.mean(failure, freq)), by = .(v = capacity_bytes)], dfa[, .(l = 'age', f = weighted.mean(failure, freq)), by = .(v = age_month)])
ggplot(weighted.means, aes(x = l, y = f)) + geom_step() +
facet_grid(. ~ v, scales = 'free_x') + theme_bw() +
ylab('Weighted mean of disk faults') + xlab('')

dfa[, capacity_bytes := as.factor(capacity_bytes)]
dfa[, age8 := cut2(age_month, g = 8)]
model.negbin.5 <- update(model.negbin.0, .~. + capacity_bytes + age8 + temp6 + PendingSector, data = dfa)
AIC(model.negbin.5, model.negbin.4)
format(tidy(model.negbin.5), digits = 3)

tmnb5 <- tidy(model.negbin.5)
str(terms <- tmnb5$term[tmnb5$p.value < 0.05][-1])

library(plyr)
ci <- ldply(terms, function(t) confint(model.negbin.5, t))
names(ci) <- c('min', 'max')
ci$term <- terms
ci$variable <- sub('[A-Z0-9\\]\\[,() ]*$', '', terms, perl = TRUE)

ggplot(ci, aes(x = factor(term), color = variable)) + 
geom_errorbar(aes(ymin = min, ymax = max)) + xlab('') +
ylab('Coefficients (95% conf.int)') + theme_bw() + 
theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = 'top')
PseudoR2(model.negbin.5)

Monday, February 1, 2016

Master R 5 - Building Models

Linear regression/Logistic regression/Poisson regression

The Motivation behind Multivariate Models
 A confounder is a third variable that biases (increases or decreases) the association we are interested in. The confounder is always associated with both the response and the predictor.
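A tiny simulation illustrates this (the variables below are made up for illustration): the predictor has no direct effect on the response, yet it appears to unless the confounder is included.
set.seed(1)
confounder <- rnorm(1000)
predictor  <- confounder + rnorm(1000)       # associated with the confounder
response   <- 2 * confounder + rnorm(1000)   # the predictor has no direct effect
coef(lm(response ~ predictor))               # biased: picks up the confounder's effect
coef(lm(response ~ predictor + confounder))  # adjusted: the predictor's coefficient is near zero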

Linear Regression with Continuous Predictors
install.packages("gamlss.data")
library(gamlss.data)
data(usair)

model.0 <- lm(y~x3, data=usair)
summary(model.0)
plot(y~x3, data=usair, cex.lab=1.5)
abline(model.0, col='red', lwd=2.5)
legend('bottomright', legend='y=x3', lty=1, col='red', lwd=2.5, title='Regression Line')

usair$prediction <- predict(model.0)
usair$residual <- resid(model.0)
plot(y~x3, data=usair, cex.lab=1.5)
abline(model.0, col='red', lwd=2.5)
segments(usair$x3, usair$y, usair$x3, usair$prediction, col='blue', lty=2)
legend('bottomright', legend=c('y~x3', 'residual'), lty=c(1,2), col=c('red', 'blue'), lwd=2.5, title='Regression Line')

Multiple predictors
model.1 <- update(model.0, .~.+x2)
summary(model.1)
as.numeric(predict(model.1, data.frame(x2=150, x3=400)))

install.packages('scatterplot3d')
library(scatterplot3d)
plot3d <- scatterplot3d(usair$x3, usair$x2, usair$y, pch=19, type='h', highlight.3d=TRUE, main='3-D Scatterplot')
plot3d$plane3d(model.1, lty='solid', col='red')

par(mfrow=c(1,2))
plot(y~x3, data=usair, main='2D projection for x3')
abline(model.1, col='red', lwd=2.5)
plot(y~x2, data=usair, main='2D projection for x2')
abline(lm(y~x2+x3, data=usair), col='red', lwd=2.5)


Model Assumptions
install.packages(c('Hmisc','gridExtra'))
library(Hmisc)
library(ggplot2)
library(gridExtra)
set.seed(7)
x  <- sort(rnorm(1000, 10, 100))[26:975]
y  <- x * 500 + rnorm(950, 5000, 20000)
df <- data.frame(x = x, y = y, cuts = factor(cut2(x, g = 5)),resid = resid(lm(y ~ x)))
scatterPl <- ggplot(df, aes(x = x, y = y)) + geom_point(aes(colour = cuts, fill = cuts), shape = 1, show.legend = FALSE) + geom_smooth(method = lm, level = 0.99)
plot_left <- ggplot(df, aes(x = y, fill = cuts)) + geom_density(alpha = .5) + coord_flip() + scale_y_reverse()
plot_right <- ggplot(data = df, aes(x = resid, fill = cuts)) + geom_density(alpha = .5) + coord_flip()
grid.arrange(plot_left, scatterPl, plot_right, ncol=3, nrow=1, widths=c(1, 3, 1))


Outlier
install.packages('gvlma')
library(gvlma)
gvlma(model.1)
model.2 <- update(model.1, data=usair[-31,])
gvlma(model.2)


How well does the line fit the data
model.0 <- update(model.0, data = usair[-31, ])
summary(model.0)[c('r.squared', 'adj.r.squared')]
summary(model.2)[c('r.squared', 'adj.r.squared')]
#Akaike Information Criterion(AIC)
summary(model.3 <- update(model.2, .~. -x2 + x1))$coefficients
summary(model.4 <- update(model.2, .~. -x3 + x1))$coefficients
AIC(model.3, model.4)

Discrete predictors
par(mfrow=c(1,1))
plot(y ~ x5, data = usair, cex.lab = 1.5)
abline(lm(y ~ x5, data = usair), col = 'red', lwd = 2.5, lty = 1)
abline(lm(y ~ x5, data = usair[usair$x5<=45,]), col = 'red', lwd = 2.5, lty = 3)
abline(lm(y ~ x5, data = usair[usair$x5 >=30, ]), col = 'red', lwd = 2.5, lty = 2)
abline(v = c(30, 45), col = 'blue', lwd = 2.5)
legend('topleft', lty = c(1, 3, 2, 1), lwd = rep(2.5, 4),
legend = c('y ~ x5', 'y ~ x5 | x5<=45','y ~ x5 | x5>=30','Critical zone'), col = c('red', 'red', 'red', 'blue'))

install.packages(c('partykit','rpart'))
library(partykit)
library(rpart)
plot(as.party(rpart(y ~ x5, data = usair)))

usair$x5_3 <- cut2(usair$x5, c(30,45))
plot(y ~ as.numeric(x5_3), data = usair, cex.lab = 1.5, xlab = 'Categorized annual rainfall(x5)', xaxt = 'n')
axis(1, at = 1:3, labels = levels(usair$x5_3))
lines(tapply(usair$y, usair$x5_3, mean), col='red', lwd=2.5, lty=1)
legend('topright', legend = 'Linear prediction', col = 'red')
summary(glmmodel.1 <- glm(y ~ x2+x3+x5_3, data=usair[-31,]))              

Thursday, December 17, 2015

Master R 4 - Restructuring Data

rm(list=ls())

# Transposing matrices
m <- matrix(1:9, 3)
t(m)

# Filtering data by string matching
library(plyr)
library(dplyr)
library(hflights)

str(select(hflights, ends_with("delay")))
str(select(hflights, contains("T", ignore.case = FALSE)))
str(select(hflights, matches("^[[:alpha:]]{5,6}$")))
table(nchar(names(hflights)))
str(select(hflights, -matches("^[[:alpha:]]{7,8}$")))

# Rearranging data
str(arrange(hflights, ActualElapsedTime))
install.packages("magrittr")
library(magrittr)
hflights %>% arrange(ActualElapsedTime) %>% str
hflights %>% arrange(ActualElapsedTime) %>%
  select(ActualElapsedTime, Dest) %>%
  subset(Dest != 'AUS') %>%
  head %>%
  str
library(data.table)
str(head(data.table(hflights, key='ActualElapsedTime')[Dest != 'AUS', c('ActualElapsedTime', 'Dest'), with = FALSE]))
str(head(na.omit(data.table(hflights, key='ActualElapsedTime'))[Dest != 'AUS', list(ActualElapsedTime, Dest)]))

# dplyr vs data.table
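# No code was captured for this comparison; a minimal sketch, reusing the packages
# loaded above (microbenchmark is an extra assumption):
library(microbenchmark)
hflights_dt <- data.table(hflights, key = 'ActualElapsedTime')
DPLYR <- function() hflights %>% arrange(ActualElapsedTime) %>% filter(Dest != 'AUS') %>% select(ActualElapsedTime, Dest)
DT    <- function() hflights_dt[Dest != 'AUS', list(ActualElapsedTime, Dest)]
microbenchmark(DPLYR(), DT(), times = 10)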

# Computing new variables
hflights_dt <- data.table(hflights)
hflights_dt[, DistanceKMs := Distance / 0.62137]
hflights_dt$DistanceKMs <- hflights_dt$Distance / 0.62137

# Memory profiling
install.packages("pryr")
library(pryr)
hflights_dt <- data.table(hflights)
address(hflights_dt)
hflights_dt$DistanceKMs <- hflights_dt$Distance / 0.62137

hflights_dt <- data.table(hflights)
address(hflights_dt)
hflights_dt[, DistanceKMs := Distance / 0.62137]
address(hflights_dt)
system.time(within(hflights_dt, DistanceKMs <- Distance / 0.62137))

# Creating multiple variables at a time
hflights_dt[, c('DistanceKMs', 'DistanceFeets') := list(Distance/0.62137, Distance * 5280)];
carriers <- unique(hflights_dt$UniqueCarrier)
hflights_dt[, paste('carrier', carriers, sep = '_') := lapply(carriers, function(x) as.numeric(UniqueCarrier == x))]
str(hflights_dt[, grep('^carrier', names(hflights_dt)), with = FALSE])
str(hflights_dt)

# Computing new variables with dplyr
hflights <- hflights %>% mutate(DistanceKMs = Distance / 0.62137)
hflights <- mutate(hflights, DistanceKMs = Distance/0.62137)

# Merging datasets
wdays <- data.frame(DayOfWeek = 1: 7,DayOfWeekString = c("Sunday", "Monday", "Tuesday","Wednesday", "Thursday", "Friday", "Saturday"))
system.time(merge(hflights, wdays))
system.time(merge(hflights_dt, wdays, by = 'DayOfWeek'))
hflights$wdays <- weekdays(as.Date(with(hflights, paste(Year, Month, DayofMonth, sep = '-'))))

# One of the most often used functions, along with the base rbind and cbind commands, is do.call, which can execute an rbind or cbind call on all elements of a list, thus enabling us to join a list of data frames.
# Similarly, rbindlist can be called to merge a list of data.table objects in a much faster way.
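# A minimal sketch, reusing hflights and data.table from above:
dfs <- list(head(hflights, 2), tail(hflights, 2))
do.call(rbind, dfs)    # base R: bind the list of data frames by rows
rbindlist(dfs)         # data.table equivalent, usually much faster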

# Reshaping data in a flexible way

# Converting wide tables to the long table format
library(reshape2)
head(melt(hflights))
hflights_melted <- melt(hflights, id.vars=0, measure.vars = c('ActualElapsedTime', 'AirTime'))
str(hflights_melted)

library(ggplot2)
ggplot(hflights_melted, aes(variable, y=value)) + geom_boxplot()

# Converting long tables to the wide table format
hflights_melted <- melt(hflights, id.vars='Month', measure.vars = c('ActualElapsedTime', 'AirTime'))
df <- dcast(hflights_melted, Month ~ variable, fun.aggregate = mean, na.rm = TRUE)
ggplot(melt(df, id.vars = 'Month')) +
  geom_line(aes(x=Month, y=value, color=variable)) +
  scale_x_continuous(breaks = 1:12) +
  theme_bw() +
  theme(legend.position = 'top')

hflights_melted <- melt(add_margins(hflights, 'Month'), id.vars = 'Month', measure.vars = c('ActualElapsedTime', 'AirTime'))
df <- dcast(hflights_melted, Month ~ variable, fun.aggregate = mean, na.rm = TRUE)