Text mining is the process by which information is retrieved from written documents. Many text mining methods use matrices and factorization techniques similar to those we have learned in CS 322. There are plenty of previous and current research projects in text mining, and there are numerous software packages already on the market (AeroText, for example). In this post, I will try to give a simplified overview of text mining.
First, to convert text documents into something more formally organized and mathematical (and so easier for a computer to process), a document-term matrix is created. Its columns correspond to the words in the documents, and its rows correspond to the documents themselves. Each entry in the matrix is the number of times a specific word occurs in a specific document. Here is a simple example of a document-term matrix (there are many ways to construct these matrices and weight the different terms; raw counts are just one very simple option):
Document 1: “You read blogs.”
Document 2: “You read books.”
Document 3: “You hate hate hate books”
|            | You | read | hate | blogs | books |
|------------|-----|------|------|-------|-------|
| Document 1 | 1   | 1    | 0    | 1     | 0     |
| Document 2 | 1   | 1    | 0    | 0     | 1     |
| Document 3 | 1   | 0    | 3    | 0     | 1     |
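To make the construction concrete, here is a minimal sketch in plain Python that rebuilds the counts for the three example documents. The function names (`tokenize`, `document_term_matrix`) and the tokenization rule (lowercase, letters only) are my own assumptions for illustration, not part of any particular text mining package.

```python
import re
from collections import Counter

documents = [
    "You read blogs.",
    "You read books.",
    "You hate hate hate books",
]

def tokenize(text):
    # Assumption: lowercase everything and keep only runs of letters.
    return re.findall(r"[a-z]+", text.lower())

def document_term_matrix(docs, vocabulary=None):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    if vocabulary is None:
        # Default column order: alphabetical over every term seen.
        vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for terms that do not occur in a document.
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

# Use the same column order as the example table.
columns = ["you", "read", "hate", "blogs", "books"]
vocabulary, matrix = document_term_matrix(documents, columns)
for name, row in zip(["Document 1", "Document 2", "Document 3"], matrix):
    print(name, row)
```

Real systems usually go further (removing very common words, stemming, or weighting counts), but the raw-count version is enough to follow the rest of this post.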
After this matrix is constructed, there are many different ways to transform or factor it to get at its most important features. For example, its transpose (the term-document matrix) could be approximately factored into the product of a term-feature matrix and a feature-document matrix. I discovered that this can be done using PCA or SVD, or with certain non-negative matrix factorization algorithms, but I couldn’t find any worked examples. A feature is an element of a document that is important to that document’s meaning or contents (e.g. a sports story at ESPN might have features related to baseball, steroids, and probably news). The feature-document matrix relates features to documents in the same way that the document-term matrix relates terms to documents, so it can be analyzed to determine the meaning or subject matter of a specific document.
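As a rough illustration of such a factorization, the sketch below applies a truncated SVD (via NumPy) to the transposed example counts. The helper name `svd_factor` and the choice of `k = 2` features are assumptions of mine. Note that SVD features can have negative entries and are harder to interpret than the ESPN-style topics described above; a non-negative factorization would behave differently.

```python
import numpy as np

# Example counts: rows are documents, columns are you, read, hate, blogs, books.
doc_term = np.array([
    [1, 1, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 3, 0, 1],
], dtype=float)

def svd_factor(term_doc, k):
    """Rank-k approximation term_doc ~ term_feature @ feature_doc (truncated SVD)."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    term_feature = U[:, :k]                # terms x k
    feature_doc = s[:k, None] * Vt[:k, :]  # k x documents, scaled by singular values
    return term_feature, feature_doc

term_doc = doc_term.T                      # terms x documents
term_feature, feature_doc = svd_factor(term_doc, k=2)

approx = term_feature @ feature_doc
error = np.linalg.norm(term_doc - approx)  # Frobenius norm of what was discarded
print(term_feature.shape, feature_doc.shape, round(float(error), 3))
```

Each column of `feature_doc` describes one document using only `k` numbers instead of one count per term, which is the point of the factorization.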
There are many ways to analyze a feature-document matrix. One is to use cluster analysis to find patterns in the data, for example groups of documents with similar feature weights. Once a pattern of features has been determined for a document, the document’s meaning can be inferred from that pattern.
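Here is a small sketch of what cluster analysis on a feature-document matrix could look like. It uses plain k-means on the columns of a rank-2 SVD feature-document matrix for the three example documents. The function name `kmeans_columns`, the farthest-point initialization, and the choice of two clusters are all assumptions made for illustration; real systems would typically use a library implementation and far more data.

```python
import numpy as np

# Example counts: rows are documents, columns are you, read, hate, blogs, books.
doc_term = np.array([
    [1, 1, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 3, 0, 1],
], dtype=float)

# Feature-document matrix from a rank-2 truncated SVD of the term-document matrix.
_, s, Vt = np.linalg.svd(doc_term.T, full_matrices=False)
feature_doc = s[:2, None] * Vt[:2, :]  # 2 features x 3 documents

def kmeans_columns(feature_doc, k, iterations=10):
    """Cluster the columns (documents) of a feature-document matrix with k-means."""
    points = feature_doc.T  # one row per document
    # Deterministic start: document 0, then the document farthest from chosen centers.
    centers = [points[0]]
    while len(centers) < k:
        nearest = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(nearest)])
    centers = np.array(centers)  # copies the rows, so points is never modified
    for _ in range(iterations):
        # Assignment step: each document joins its nearest center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # Update step: move each center to the mean of its documents.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

labels = kmeans_columns(feature_doc, k=2)
print(labels)
```

On this tiny example, Documents 1 and 2 (the two "read" sentences) end up in the same cluster and Document 3 (the "hate" sentence) ends up alone, which matches the intuition that the first two documents are about similar things.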
Text mining has many real world applications, including email spam filtering, search engine result relevance, and security. There is a lot of current research on text mining which can be found all over the web. This post was just a brief overview and did not do justice to the true complexity of sophisticated text mining, so see the links below if you are interested.
References:
http://en.wikipedia.org/wiki/Text_mining
http://people.ischool.berkeley.edu/~hearst/text-mining.html
http://www.springerlink.com/content/n910300t07621125/fulltext.pdf
http://www.siam.org/meetings/sdm06/workproceed/Text%20Mining/antonellis21.pdf





