Topic Analysis using Nonnegative Matrix Factorization

I found a paper by Michael W. Berry and Murray Browne on using non-negative matrix factorization for topic extraction and email surveillance.
According to the article, non-negative matrix factorization (NMF)
“is a vector space method used to obtain a
representation of data using non-negativity constraints.
These constraints can lead to a parts-based representation
because they allow only additive, not subtractive,
combinations of the original data. This is in contrast to
techniques for finding a reduced dimensional representation
based on singular value decomposition-type methods” (p. 2).

The reason it can work in this case is that term frequencies in emails are non-negative: a term appears in a message zero or more times, never a negative number of times.
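To make the non-negativity concrete, here is a minimal sketch of how a term-by-message count matrix could be built. The function name `term_message_matrix`, the vocabulary, and the two messages are all made up for illustration, and the whitespace tokenizer is an assumption; the paper's actual preprocessing may differ.

```python
import numpy as np

def term_message_matrix(messages, vocabulary):
    """Return an m x n count matrix: rows are terms, columns are messages."""
    index = {term: i for i, term in enumerate(vocabulary)}
    X = np.zeros((len(vocabulary), len(messages)))
    for j, text in enumerate(messages):
        # Naive tokenizer (an assumption): lowercase and split on whitespace.
        for word in text.lower().split():
            i = index.get(word)
            if i is not None:
                X[i, j] += 1  # counts only go up, so X stays non-negative
    return X

# Hypothetical vocabulary and messages, for illustration only.
vocabulary = ["meeting", "schedule", "gas", "pipeline"]
messages = [
    "meeting schedule for the gas meeting",
    "pipeline gas pipeline report",
]
X = term_message_matrix(messages, vocabulary)
```

Every entry of `X` is a count, so the non-negativity constraint holds without any extra work.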

In very general terms, they take a matrix X of dimensions m by n (m terms by n messages) and factor it into two non-negative matrices W (m by k) and H (k by n), so that X is approximately WH, where k is the chosen number of features. Such a factorization is not unique, so they aim for a specific solution: a pair W and H that minimizes the Frobenius norm of the difference X−WH.
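Here is a rough sketch of what minimizing that Frobenius norm can look like in code. It uses the well-known Lee and Seung multiplicative update rules, which I picked because they are short; this is not necessarily the algorithm the paper uses. The function name `nmf`, the toy matrix, and the parameters `k`, `n_iter`, and `eps` are my own choices.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate non-negative X (m x n) as W @ H, W is m x k, H is k x n."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        # Multiplicative updates: each factor is scaled by a non-negative
        # ratio, so W and H never pick up negative entries.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy term-by-message matrix: 4 terms (rows) x 5 messages (columns).
X = np.array([
    [3, 2, 0, 0, 1],
    [2, 3, 0, 0, 0],
    [0, 0, 4, 3, 0],
    [0, 0, 2, 4, 1],
], dtype=float)

W, H = nmf(X, k=2)
# Frobenius norm of the difference X - WH, the quantity being minimized.
error = np.linalg.norm(X - W @ H, "fro")
```

A different `seed` generally gives a different pair `W`, `H`, which is the non-uniqueness mentioned above showing up in practice.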

A bunch of computations later, they found that topics can be “found” by looking for large components in the same row of the matrix H: messages whose entries in a given row are large share that row's topic.
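As a small illustration of reading topics off the rows of H, the sketch below assigns each message to the row where its weight is largest and lists the strongest terms for that row from the matching column of W. The matrices, the term list, and the "largest entry wins" rule are all assumptions of mine for the example, not values or a procedure taken from the paper.

```python
import numpy as np

# Hypothetical factors, as if returned by an NMF of a
# 4-term x 5-message matrix with k = 2.
terms = ["meeting", "schedule", "gas", "pipeline"]
W = np.array([
    [0.9, 0.0],
    [0.8, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])
H = np.array([
    [2.1, 2.4, 0.0, 0.1, 0.6],
    [0.0, 0.1, 3.0, 3.2, 0.5],
])

# Assign each message (a column of H) to the row where its weight is largest.
topic_of_message = H.argmax(axis=0)

for topic in range(H.shape[0]):
    messages = np.flatnonzero(topic_of_message == topic)
    # The largest entries in column `topic` of W give the leading terms.
    top_terms = [terms[i] for i in np.argsort(W[:, topic])[::-1][:2]]
    print(f"topic {topic}: messages {messages.tolist()}, terms {top_terms}")
```

With these made-up numbers, messages 0, 1 and 4 land in the first topic and messages 2 and 3 in the second.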

They then analyzed the emails sent by Enron's top executives in the year leading up to its collapse. They picked a few hundred general terms that have nothing really to do with each other, factored all the emails in the inbox, then factored the emails labeled “private”, and compared the leading cluster terms to decide whether the emails in each group (inbox or private) had different topics.
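I don't know how the comparison of leading cluster terms was actually scored, so the following is only one simple way it could be done: treat each group's leading terms as a set and measure the overlap with the Jaccard index. The function name `jaccard` and both term lists are made up for illustration.

```python
def jaccard(a, b):
    """Share of distinct terms common to both lists (1.0 means identical sets)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical leading cluster terms for the two groups of emails.
inbox_terms = ["meeting", "schedule", "gas", "pipeline"]
private_terms = ["meeting", "dinner", "family", "schedule"]

overlap = jaccard(inbox_terms, private_terms)
```

A low overlap would suggest the two groups are about different things, a high one that they share topics.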

A graph of the result is on page 7 of 10 of the PDF, or page 51 of the document.

This is pretty cool, since we can now estimate the topics in a large quantity of emails without having to read through them one by one.

Sources:
http://www.cs.queensu.ca/~skill/proceedings/berrybrowne.pdf
