Tuesday, January 24, 2017

Topic Modeling using LDA

Topic Modeling

Gibbs sampling works by performing a random walk in such a way that it reflects the characteristics of the desired distribution. Because the starting point of the walk is chosen at random, it is necessary to discard the first few steps of the walk (as these do not correctly reflect the properties of the distribution). This is referred to as the burn-in period. We set the burn-in parameter to 400. Following the burn-in period, we perform 200 iterations, taking every 50th iteration for further use; this thinning helps avoid correlations between samples. We use 3 different starting points (nstart=3), that is, three independent runs. Each starting point requires a seed integer (this also ensures reproducibility), so I have provided 3 integers in my seed list. Finally, I've set best to TRUE (the default setting), which instructs the algorithm to return the results of the run with the highest posterior probability.

Some words of caution are in order here. It should be emphasised that the settings above do not guarantee convergence of the algorithm to a globally optimal solution. Indeed, Gibbs sampling will, at best, find only a locally optimal solution, and even this is hard to prove mathematically for specific practical problems such as the one we are dealing with here. The upshot is that it is best to do many runs with different parameter settings and check the stability of your results; a quick check along these lines is sketched below. The bottom line is that our interest is purely practical, so it is good enough if the results make sense. We'll leave issues of mathematical rigour to those better qualified to deal with them.
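
As a rough illustration of such a stability check (not part of the original script, and assuming the dtm, k and Gibbs settings defined further down), one can fit the model with two different seed lists and cross-tabulate the document-topic assignments; if each topic from one run lines up clearly with a single topic from the other, the solution is reasonably stable.

#stability check (sketch): two fits with different seed lists
runA <- LDA(dtm, k, method="Gibbs",
            control=list(nstart=nstart, seed=list(101, 102, 103), best=best,
                         burnin=burnin, iter=iter, thin=thin))
runB <- LDA(dtm, k, method="Gibbs",
            control=list(nstart=nstart, seed=list(201, 202, 203), best=best,
                         burnin=burnin, iter=iter, thin=thin))
#compare the most likely topic per document across the two runs
table(topics(runA), topics(runB))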

As mentioned earlier, there is an important parameter that must be specified upfront: k, the number of topics that the algorithm should use to classify documents. There are mathematical approaches to choosing k, but they often do not yield semantically meaningful results (see http://stackoverflow.com/questions/21355156/topic-models-cross-validation-with-loglikelihood-or-perplexity/21394092 for an example). From a practical point of view, one can simply run the algorithm for different values of k and make a choice by inspecting the results, as in the sketch below. This is what we'll do.
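
Here is a minimal sketch of that approach (again assuming the dtm and the Gibbs settings defined in the script below): fit a model for each candidate value of k, then compare log-likelihoods and inspect the top terms before settling on a value.

#try a few values of k and inspect the results (sketch)
candidate_k <- c(2, 3, 4, 5)
fits <- lapply(candidate_k, function(kk)
  LDA(dtm, kk, method="Gibbs",
      control=list(nstart=nstart, seed=seed, best=best,
                   burnin=burnin, iter=iter, thin=thin)))
#higher log-likelihood is better, but the top terms should also make semantic sense
sapply(fits, logLik)
lapply(fits, terms, 10)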



################################################################################
## https://eight2late.wordpress.com/2015/09/29/a-gentle-introduction-to-topic-modeling-using-r/
#################################################################################
#load required packages
library(tm)
library(topicmodels)
library(wordcloud)
library(RColorBrewer)
setwd("/Users/tkmaemd/Desktop/R/KLAB/1_24_2017")

#load files into corpus
#get listing of .txt files in directory
#include facebook, instagram, pinterest, twitter
filenames <- list.files(getwd(),pattern="*.txt")
filenames

#read files into a character vector
files <- lapply(filenames, readLines)


#create corpus from vector
docs <- Corpus(VectorSource(files))
inspect(docs)

#inspect a particular document in corpus
writeLines(as.character(docs[[1]]))
writeLines(as.character(docs[[2]]))
writeLines(as.character(docs[[3]]))
writeLines(as.character(docs[[4]]))

#start preprocessing
#Transform to lower case
docs <- tm_map(docs, content_transformer(tolower))

#remove potentially problematic symbols
# toSpace <- content_transformer(function(x, pattern) { return (gsub(pattern, " ", x))})
# docs <- tm_map(docs, toSpace, "-")
# docs <- tm_map(docs, toSpace, "'")
# docs <- tm_map(docs, toSpace, ".")

#remove punctuation
docs <- tm_map(docs, removePunctuation)
#Strip digits
docs <- tm_map(docs, removeNumbers)
#remove stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
#remove whitespace
docs <- tm_map(docs, stripWhitespace)

#Good practice to check every now and then
writeLines(as.character(docs[[1]]))

#ensure documents are stored as PlainTextDocument objects
docs <- tm_map(docs, PlainTextDocument)
#stem the documents
docs <- tm_map(docs, stemDocument)
#convert back to plain character vectors
docs <- lapply(docs, as.character)
#rebuild the corpus from the stemmed text
docs <- Corpus(VectorSource(docs))
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
#convert rownames to filenames
rownames(dtm) <- filenames

##Create a WordCloud to Visualize the Text Data
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
#length should be total number of terms
length(freq)
head(freq, 20)

# Create the word cloud
wf <- data.frame(word=names(freq), freq=freq)
head(wf)
set.seed(123)
pal <- brewer.pal(11, "Spectral")
wordcloud(words = wf$word,
          freq = wf$freq,
          scale = c(3.5, 1.2),
          random.order = FALSE,
          colors = pal,
          max.words = 100)

#################################################################################
## Topic Models
#################################################################################
## What is the meaning of each topic?
## How prevalent is each topic?
## How do the topics relate to each other?
## How do the documents relate to each other?
#Set parameters for Gibbs sampling
burnin <- 400
iter <- 200
thin <- 50
seed <- list(2003, 5, 63)
nstart <- 3
best <- TRUE

#Number of topics
k <- 3

#Run LDA using Gibbs sampling
ldaOut <- LDA(dtm, k, method="Gibbs",
              control=list(nstart=nstart, seed=seed, best=best,
                           burnin=burnin, iter=iter, thin=thin))

## Show document-topic distribution
ldaOut.topics <- as.matrix(topics(ldaOut))
write.csv(ldaOut.topics, file=paste("LDAGibbs",k,"DocsToTopics.csv"))

## Show term-topic distribution
#top 20 terms in each topic
ldaOut.terms <- as.matrix(terms(ldaOut,20))
write.csv(ldaOut.terms,file=paste('LDAGibbs',k,'TopicsToTerms.csv'))


#probabilities associated with each topic assignment
topicProbabilities <- as.data.frame(ldaOut@gamma)
topicProbabilities
write.csv(topicProbabilities, file=paste("LDAGibbs",k,"TopicAssignmentProbabilities.csv"))

#Find relative importance of top 2 topics
topic1ToTopic2 <- lapply(1:nrow(dtm),function(x) sort(topicProbabilities[x,])[k]/sort(topicProbabilities[x,])[k-1])
topic1ToTopic2

#Find relative importance of second and third most important topics
topic2ToTopic3 <- lapply(1:nrow(dtm),function(x) sort(topicProbabilities[x,])[k-1]/sort(topicProbabilities[x,])[k-2])
topic2ToTopic3
