Text mining refers to the process of parsing unstructured text in order to derive high-quality information from it.
What Will You Need?
install.packages (c ("tm", "wordcloud")) # install the 'tm' and 'wordcloud' packages
library (tm)
library (wordcloud)
How To Bring The Data Into R?
The standard approach to preparing data for text mining is to convert your documents into a Corpus, the collection object that the tm package operates on.
path <- "http://goo.gl/qo0OsZ" # url of sample processed tweets
data <- read.csv (path) # plug in your dataset path here
data <- head (data, 15) # keep only the first 15 rows
txt_corpus <- Corpus (VectorSource (data$text)) # create a corpus; 'text' is an assumed column name, pass the column that holds your text
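To check that the corpus was built as expected, you can print it and inspect a couple of documents (a minimal sketch, assuming the corpus holds at least two documents):
print (txt_corpus) # summary: number of documents
inspect (txt_corpus[1:2]) # show the content of the first two documents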
Basic Text Manipulation
To replace text, use gsub()
gsub (pattern, replacement, text) # general syntax
gsub ("/", " ", "my/Text") # replaces all '/' with a space " "
Use tm_map() to apply cleaning transformations to every document in the corpus
txt_corpus <- tm_map (txt_corpus, removePunctuation) # remove punctuation
txt_corpus <- tm_map (txt_corpus, removeNumbers) # remove numbers
txt_corpus <- tm_map (txt_corpus, removeWords, stopwords('english')) # remove English stop words (e.g. 'as', 'the')
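Other common clean-up steps can be applied in the same way; a minimal sketch (stripWhitespace and content_transformer() are part of tm, tolower() is base R):
txt_corpus <- tm_map (txt_corpus, stripWhitespace) # collapse repeated spaces into one
txt_corpus <- tm_map (txt_corpus, content_transformer(tolower)) # convert all text to lower case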
Some Basic Text Analysis
For most of our analysis we will use a document-term matrix, a simple triplet matrix that records how often each term occurs in each document of the collection.
How To Convert The Corpus To A Document-Term Matrix?
In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
Matrix <- TermDocumentMatrix (txt_corpus) # terms in rows, documents in columns
DTM <- DocumentTermMatrix (txt_corpus) # documents in rows, terms in columns
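To get a feel for what the matrix holds, check its dimensions and peek at a few entries (a small sketch, assuming at least five documents and five terms):
dim (DTM) # number of documents x number of terms
inspect (DTM[1:5, 1:5]) # counts for the first 5 documents and first 5 terms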
Most frequent words
findFreqTerms (Matrix, lowfreq=5) # list words occurring at least 5 times
To find word associations
findAssocs (Matrix, 'word', n) # general syntax; try n = 0.3 to start with
The value of 'n' (the correlation limit) restricts the output to words whose correlation with the given term is at least n; values around 0.30 usually indicate a reasonable association.
e.g. word associations can be used for sentiment analysis of Twitter/Facebook posts
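For example, to list the terms associated with a particular word ('data' here is only a hypothetical term; substitute one that actually occurs in your matrix):
findAssocs (Matrix, 'data', 0.3) # terms whose correlation with 'data' is at least 0.3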
Clustering Word Associations
f <- matrix (0, ncol=nrow(Matrix), nrow=nrow(Matrix)) # square term-by-term matrix; Matrix is the term-document matrix created above
colnames (f) <- rownames (Matrix)
rownames (f) <- rownames (Matrix)
for (i in rownames (Matrix)) {
  ff <- findAssocs (Matrix, i, 0)[[1]] # named vector of correlations for term i (findAssocs returns a list in tm >= 0.6)
  for (j in names (ff)) {
    f[j, i] <- ff[j] # store the association of term j with term i
  }
}
fd <- as.dist (1 - f) # convert correlations to distances, so highly associated terms end up close together
hc <- hclust (fd, method="ward.D") # hierarchical clustering ("ward" was renamed "ward.D" in recent versions of R)
plot (hc) # plot dendrogram
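If you want hard cluster assignments rather than just the dendrogram, cutree() from base R can split the tree; a minimal sketch where the number of clusters (5) is an arbitrary choice:
groups <- cutree (hc, k=5) # assign each term to one of 5 clusters
table (groups) # number of terms in each cluster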
Create Word Cloud
Get term frequencies
matrix_c <- as.matrix (Matrix) # convert the term-document matrix to an ordinary matrix
freq <- sort (rowSums (matrix_c), decreasing=TRUE) # total frequency of each term, most frequent first
tmdata <- data.frame (words=names(freq), freq) # term / frequency table
Display the word cloud
wordcloud (tmdata$words, tmdata$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))
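wordcloud() also accepts parameters to limit and scale the output; a minimal sketch using min.freq and max.words:
wordcloud (tmdata$words, tmdata$freq, min.freq=2, max.words=50, random.order=FALSE, colors=brewer.pal(8, "Dark2")) # at most 50 terms, each appearing at least twice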