Text Mining With LDA

WHAT YOU WILL NEED

topicmodels package - implementation of LDA

tm package - basic text mining

LDAvis package - visualization

mallet package - another topic modelling package

INTRODUCTION

Latent Dirichlet allocation (LDA) is a generative model that explains sets of observations through unobserved groups, which account for why some parts of the data are similar. In text mining, the observations are the words in documents and the unobserved groups are topics. It is one of the most popular models in natural language processing today.

 

READING IN DATA

The data has to be converted to a document-term matrix (documents as rows, terms as columns) for use by the topicmodels package.

To build this matrix we will use the tm package (alternative: RTextTools).

 

library(tm)
Txt_Corpus=Corpus(VectorSource(data$text))
Matrix=DocumentTermMatrix(Txt_Corpus)
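
To make the shape of this matrix concrete, here is a minimal sketch in base R that builds a small matrix of word counts by hand, with one row per document and one column per term. The three documents and the names docs, vocab and toy_dtm are made up for illustration; tm does the same job with more preprocessing options.

```r
# Toy corpus: three short made-up documents (illustration only).
docs <- c(d1 = "topic models find topics",
          d2 = "text mining with topic models",
          d3 = "mining text")

# Lowercase each document and split it into tokens on whitespace.
tokens <- strsplit(tolower(docs), "\\s+")

# Vocabulary: the sorted unique terms across all documents.
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each document.
# vapply returns a terms x documents matrix, so transpose it.
counts <- vapply(tokens,
                 function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
                 integer(length(vocab)))
toy_dtm <- t(counts)
colnames(toy_dtm) <- vocab

print(toy_dtm)
```

Each row of toy_dtm is one document and each cell is the number of times that term appears in it, which is the layout a document-term matrix uses.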

 

TO PERFORM LDA

Before running LDA on our document-term matrix, we have to choose the number of topics (k) in our data, since it is an argument to the LDA function.

One way to choose the number of topics is the harmonic mean method

(http://epub.wu.ac.at/3558/1/main.pdf)

Alternatively, you can use trial and error: fit models for several values of k and keep the one that gives the best results.
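
As a rough sketch of the harmonic mean idea, the base R function below takes a vector of log-likelihoods (assumed here to come from posterior samples of a model fitted with a given k) and returns the log of the harmonic mean of the likelihoods, computed in log space to avoid underflow. The function name log_harmonic_mean, the candidate values of k and the log-likelihood numbers are all made up for illustration; in practice you would collect the log-likelihoods from your own fitted models.

```r
# Log of the harmonic mean of likelihoods, given their logs.
# HM = n / sum(1 / L_i)  =>  log HM = log(n) - logsumexp(-log L_i)
log_harmonic_mean <- function(log_lik) {
  n <- length(log_lik)
  neg <- -log_lik
  m <- max(neg)                          # shift by the max for numerical stability
  log(n) - (m + log(sum(exp(neg - m))))
}

# Hypothetical log-likelihood samples for three candidate values of k.
samples_by_k <- list("5"  = c(-1520, -1515, -1518),
                     "10" = c(-1490, -1487, -1492),
                     "20" = c(-1505, -1500, -1503))

# Score each candidate and keep the k with the highest score.
scores <- sapply(samples_by_k, log_harmonic_mean)
best_k <- as.integer(names(scores)[which.max(scores)])

print(scores)
print(best_k)
```

The candidate with the largest score is the one this criterion prefers; it is still worth inspecting the resulting topics by eye before settling on a value.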

Once this is done, fit the model:

library(topicmodels)
lda=LDA(Matrix,k)

 

VIEW RESULTS

To view terms per topic

terms(lda)
 

To view topics per document

topics(lda)
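
To show what these two views summarise, here is a small base R sketch on made-up matrices: beta holds term probabilities per topic and gamma holds topic probabilities per document. Both matrices and their names are hypothetical stand-ins for what a fitted model contains, not output from topicmodels itself.

```r
# Hypothetical topic-term probabilities (each row sums to 1).
beta <- matrix(c(0.50, 0.30, 0.10, 0.10,
                 0.10, 0.10, 0.45, 0.35),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("Topic 1", "Topic 2"),
                               c("data", "model", "text", "word")))

# Hypothetical document-topic probabilities (each row sums to 1).
gamma <- matrix(c(0.8, 0.2,
                  0.3, 0.7,
                  0.6, 0.4),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("d1", "d2", "d3"),
                                c("Topic 1", "Topic 2")))

# Terms per topic: the two most probable terms in each row of beta.
top_terms <- apply(beta, 1, function(p) names(sort(p, decreasing = TRUE))[1:2])

# Topics per document: the index of the most probable topic in each row of gamma.
doc_topic <- apply(gamma, 1, which.max)

print(top_terms)
print(doc_topic)
```

Reading the most probable terms of each topic is usually the quickest way to judge whether the topics make sense.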

VISUALIZING LDA

To create the datasets that feed into LDAvis, we use the mallet library.

Create model
model = MalletLDA(k)

 

Import dataset and load it

instance = mallet.import(names(data$text),data$text)
model$loadDocuments(instance)

 

Train the model (n is the number of training iterations)

model$train(n)

 

Topic-Term distribution

phi = t(mallet.topic.words(model, smoothed = TRUE, normalized = TRUE))

Table of topics and terms

phi.count = t(mallet.topic.words(model, smoothed = TRUE, normalized = FALSE))

 

Number of tokens per topic

topic.words = mallet.topic.words(model, smoothed = TRUE, normalized = FALSE)
topic.counts = rowSums(topic.words)
topic.proportions = topic.counts/sum(topic.counts)
vocab = model$getVocabulary()

 

out = check.inputs(k, W = length(vocab), phi ,
term.frequency = apply(phi.count, 1, sum),
vocab , topic.proportions)

 

Create JSON file

json = with(out, createJSON(k, phi, term.frequency,
vocab, topic.proportion))

 

To Create Interactive Chart

serVis(json, out.dir = 'vis', open.browser = FALSE)

The serVis function creates a couple of files, including index.html, which can be opened in a browser.

LDAVIS

For further understanding, check out this example:

http://cpsievert.github.io/projects/615/xkcd/
