How to do basic text mining, create a terms matrix and a WordCloud?

Text mining refers to the process of parsing unstructured text in order to derive high-quality information.

What You Will Need?

install.packages(c("tm", "wordcloud"))  # install the 'tm' and 'wordcloud' packages
library(tm)
library(wordcloud)

How To Bring The Data Into R?

The standard approach to processing data for text mining is to convert your documents to a so-called corpus, a collection of text documents that tm can operate on.

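path <- "http://goo.gl/qo0OsZ"  # URL of sample processed tweets
data <- read.csv(path, stringsAsFactors = FALSE)  # plug in your dataset path here
data <- head(data, 15)  # first 15 rows
# VectorSource() treats each element of its input as one document, so pass the
# text column rather than the whole data frame (assumed here to be column 1).
txt_corpus <- Corpus(VectorSource(data[[1]]))  # create a corpus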

Basic Text Manipulation

To replace text, use gsub()

gsub(pattern, replacement, text)  # general syntax
gsub("/", " ", "my/Text")  # replaces every "/" with a space
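Because gsub() is vectorised and treats the pattern as a regular expression by default, one call can clean a whole character vector. The following is a minimal sketch in base R; the vector `tweets` and its contents are made up for illustration and are not part of the sample dataset.

```r
# Hypothetical example strings; not taken from the sample dataset.
tweets <- c("my/Text", "check http://example.com now", "too    many   spaces")

no_urls  <- gsub("http[^ ]*", "", tweets)          # drop anything that starts with "http"
no_slash <- gsub("/", " ", no_urls, fixed = TRUE)  # fixed = TRUE: match "/" literally, no regex
cleaned  <- trimws(gsub(" +", " ", no_slash))      # collapse repeated spaces, trim the ends

print(cleaned)
```

URLs are removed before the slashes are replaced; doing it the other way round would break the URL into pieces that the first pattern no longer matches.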

Use tm_map() for specialized replacement operations. It returns a modified copy of the corpus, so assign the result back:
txt_corpus <- tm_map(txt_corpus, removePunctuation)  # remove punctuation
txt_corpus <- tm_map(txt_corpus, removeNumbers)  # remove numbers
txt_corpus <- tm_map(txt_corpus, removeWords, stopwords("english"))  # remove stop words (like "as", "the", etc.)
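The individual tm_map() steps are usually chained into one cleaning pipeline. The sketch below assumes the tm package is installed and uses two made-up sentences in place of real data. It uses VCorpus() so that each document can be read back with as.character(), and it adds two steps not shown above, content_transformer(tolower) and stripWhitespace, both of which are part of tm.

```r
library(tm)

# Hypothetical input; not taken from the sample dataset.
raw <- c("Hello, World! 123 the cat", "R is GREAT for text mining in 2015")
corpus <- VCorpus(VectorSource(raw))

corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case first so stop words match
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)                    # collapse gaps left by removed words

# Pull the cleaned text back out as a plain character vector.
cleaned <- trimws(vapply(corpus, as.character, character(1)))
print(cleaned)
```

Lower-casing comes first because the English stop word list is lower-case, so "The" would otherwise survive the removeWords step.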

Some Basic Text Analysis

For most of our analysis we will be using the document-term matrix, which tm stores as a sparse (simple triplet) matrix and which records how often each term occurs in each document of a collection.

How To Convert The Corpus to a Document-Term-Matrix?

In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. A term-document matrix is its transpose: terms in rows, documents in columns.

Matrix <- TermDocumentMatrix(txt_corpus)  # terms in rows, documents in columns
DTM <- DocumentTermMatrix(txt_corpus)  # documents in rows, terms in columns
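To make the two layouts concrete, here is a small base R sketch that builds both matrices by hand for three made-up documents. It only illustrates the shape of the data; tm's own tokenisation and sparse storage are not reproduced, and the names `docs`, `dtm_toy` and `tdm_toy` are hypothetical.

```r
# Three hypothetical documents, already cleaned and space-separated.
docs <- c(d1 = "apple banana apple", d2 = "banana cherry", d3 = "apple cherry cherry")

tokens <- strsplit(docs, " ", fixed = TRUE)   # one character vector of words per document
vocab  <- sort(unique(unlist(tokens)))        # all distinct terms

# Count every vocabulary term in every document: one column per document.
counts <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
)
rownames(counts) <- vocab

tdm_toy <- counts      # term-document layout: terms in rows
dtm_toy <- t(counts)   # document-term layout: documents in rows

print(dtm_toy)
```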

Most frequent words

findFreqTerms(Matrix, lowfreq = 5)  # terms that occur at least 5 times
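The lower bound is inclusive, which is easy to check on a tiny example. The base R sketch below mimics the idea on a made-up term-document count matrix; it is an illustration of the threshold, not tm's implementation, and `tdm_toy` is a hypothetical name.

```r
# Hypothetical counts: terms in rows, three documents in columns.
tdm_toy <- rbind(
  apple  = c(3, 2, 0),
  banana = c(1, 1, 1),
  cherry = c(0, 4, 2)
)

term_freq <- rowSums(tdm_toy)   # total occurrences of each term across all documents
lowfreq   <- 5
frequent  <- names(term_freq)[term_freq >= lowfreq]   # a term with exactly 5 occurrences is kept

print(frequent)
```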

To find word associations

findAssocs(Matrix, "word", n)  # try n = 0.3 to start with

The value of n is a lower correlation limit: only terms whose association with the given word is at least n are shown (a value around 0.30 indicates a good association).
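The association score reported by findAssocs() is a correlation between term counts across documents. The base R sketch below reproduces that idea on a made-up term-document matrix using cor(); it is an approximation for illustration, and the terms, counts and the name `tdm_toy` are all hypothetical.

```r
# Hypothetical counts: terms in rows, five documents in columns.
tdm_toy <- rbind(
  rain     = c(1, 0, 2, 0, 1),
  umbrella = c(1, 0, 2, 0, 0),
  sun      = c(0, 2, 0, 1, 0)
)
colnames(tdm_toy) <- paste0("doc", 1:5)

# Correlate the "rain" row with every term row, then drop "rain" itself.
assoc <- cor(t(tdm_toy))["rain", ]
assoc <- assoc[names(assoc) != "rain"]

n <- 0.3
strong <- round(assoc[assoc >= n], 2)   # keep only associations of at least n
print(strong)
```

Here "umbrella" tends to appear in the same documents as "rain" and passes the threshold, while "sun" appears in the other documents, correlates negatively, and is filtered out.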

For example, word associations can be used for sentiment analysis on Twitter or Facebook posts.

Clustering Word Associations
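terms <- rownames(Matrix)  # Matrix is the term-document matrix created above
f <- matrix(0, nrow = length(terms), ncol = length(terms), dimnames = list(terms, terms))

for (i in terms) {
  # current tm versions return a named list; take the vector of correlations for term i
  ff <- findAssocs(Matrix, i, 0)[[i]]
  for (j in names(ff)) {
    f[j, i] <- ff[j]
  }
}

# associations are similarities, so use 1 - association as the distance
fd <- as.dist(1 - f)  # calc distance matrix
plot(hclust(fd, method = "ward.D"))  # plot dendrogram ("ward" was renamed "ward.D" in R 3.1.0)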
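The same clustering idea can be tried without a corpus. The base R sketch below uses a made-up term-document matrix, takes 1 minus the correlation as one common choice of distance, and cuts the tree into two groups. The terms, counts and the choice of k = 2 are assumptions for illustration only.

```r
# Hypothetical counts: terms in rows, six documents in columns.
tdm_toy <- rbind(
  rain     = c(1, 0, 2, 0, 1, 0),
  umbrella = c(1, 0, 2, 0, 0, 0),
  sun      = c(0, 2, 0, 1, 0, 1),
  beach    = c(0, 1, 0, 1, 0, 1)
)

sim <- cor(t(tdm_toy))               # term-by-term correlation (similarity)
d   <- as.dist(1 - sim)              # high correlation becomes small distance
hc  <- hclust(d, method = "ward.D")  # hierarchical clustering

groups <- cutree(hc, k = 2)          # assign each term to one of two clusters
print(groups)
# plot(hc) would draw the dendrogram for these four terms
```

Terms that occur in the same documents land in the same cluster: "rain" with "umbrella", and "sun" with "beach".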

Dendrogram of Associations

Create Word Cloud

Get term frequencies

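matrix_c <- as.matrix(Matrix)  # dense copy of the term-document matrix
freq <- sort(rowSums(matrix_c), decreasing = TRUE)  # term frequencies, most frequent first
tmdata <- data.frame(words = names(freq), freq, stringsAsFactors = FALSE)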

Displaying wordcloud

wordcloud (tmdata$words, tmdata$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))

wordcloud
