I’ve been curious about sentiment analysis, for some stuff I’m working on unrelated to my thesis or research. I came across a great paper by Andrew Maas and some colleagues at Stanford called Learning Word Vectors for Sentiment Analysis, published at ACL 2011. They introduce a dataset of IMDB reviews, and the goal is to classify a review as positive (ratings of 7-10 out of 10) or negative (ratings of 1-4). After some heavy preprocessing (stop-word removal, infrequent-word removal, etc.), they run the reviews through a fancy probabilistic model, represent them in a vector space model, do some tfidf trickery, and then train a classifier on the result, achieving 88.33% accuracy on held-out data. Using a whole heap of unlabeled training data on top of that, they could improve the accuracy to 88.89%.
Very cool stuff.
More recently, I came across a paper from George E. Dahl, Ryan Adams, and Hugo Larochelle, published at ICML 2012, called Training Restricted Boltzmann Machines on Word Observations. RBMs are a bit outside my area of expertise, but it seems they’ve come up with some fancy MCMC methods that make it possible to train RBMs on really large vocabularies (and on n-grams over those vocabularies, even 5-grams). They mentioned sentiment classification in the abstract, so I gave it a skim. It turns out they use a big RBM to build features from text, use those features to train a classifier to predict sentiment, and test it on the same IMDB dataset as the previous paper I mentioned. Again, with some heavy preprocessing, a heavy-duty RBM, and a lot of parameter fiddling, they manage to achieve even better accuracy: 89.23%.
So my question was: how hard is this task, and do we really need such heavy machinery to do well? It seems easy: movie reviews containing the word “awesome” are positive, and ones containing “terrible” are negative. So I fired up my favourite classifier, Vowpal Wabbit. VW learns a linear classifier and is optimized for big data with lots of features. Instead of preprocessing the data, I just hoped that a bit of regularization would do the trick (i.e., assign zero weight to uninformative words). Converting the data into a vw-friendly format took a couple of sed and awk commands, and I wrote a script to do a quick search over a reasonable parameter space for the regularization (l1 and l2) parameters. I just guessed at the number of passes and the value of the power_t parameter (which controls how fast the learning rate decreases).
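If you haven’t seen vw’s input format before: each line is a label, an optional tag, then a |namespace followed by feature:value pairs. Here’s a tiny sketch of the rating-to-label mapping, using a made-up two-line sample in the labeledBow.feat layout (the word ids and counts are invented for illustration):

```shell
# Hypothetical two-line sample in the labeledBow.feat layout the dataset
# ships with: "rating word_id:count word_id:count ...".
printf '8 0:9 1:4\n2 0:3 5:1\n' > sample.feat

# Ratings 7-10 become positive examples (label 1), ratings 1-4 become
# negative examples (label 0); same sed/awk logic as the full pipeline,
# minus the per-example tags.
sed -n 's/^\([7-9]\|10\)\s//p' sample.feat | awk '{print "1 |features " $0}'
sed -n 's/^[1-4]\s//p' sample.feat | awk '{print "0 |features " $0}'
```

The rating 8 line comes out as a positive example and the rating 2 line as a negative one, each with its raw word counts intact.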
The model takes about 30 seconds to learn on my fancy 2.8GHz Xeon desktop machine (parameter tuning took a couple of hours), and achieves an accuracy of 88.64%, plus or minus a few hundredths of a percent, since it’s an online algorithm and we randomize the order of the training data. No preprocessing, no stop-word removal, no vector space model, no tfidf magic, no nothing. Raw word counts as features, 30 seconds of vw, and we have a linear classifier that is within 1% of the best reported result on this data.
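I haven’t included the tuning script, but the idea is just a grid over l1/l2 pairs. A minimal sketch (the value grids below are invented, not the ones I actually searched; to keep it self-contained it only prints each vw invocation rather than running and scoring it on a held-out split):

```shell
# Hypothetical grid over vw's l1/l2 regularization strengths. In the real
# search each printed command would be run and its model scored on a
# held-out split, keeping the best (l1, l2) pair.
for l1 in 1e-8 5e-8 1e-7; do
  for l2 in 1e-8 5e-8 1e-7; do
    echo "shuf train.vw | vw --adaptive --power_t 0.2 -c --passes 200 --l1 $l1 --l2 $l2"
  done
done
```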
Go vw! It took me longer to write this post than it did to code the experiments. Don’t believe me, download the data and try it for yourself:

```shell
cat aclImdb/train/labeledBow.feat | \
  sed -n 's/^\([7-9]\|10\)\s/&/p' | \
  sed -e "s/^\([7-9]\|10\)\s//" | \
  awk '{ print "1 '"'"'pos_" (NR-1) " |features " $0}' > train.vw

cat aclImdb/train/labeledBow.feat | \
  sed -n 's/^[1-4]\s/&/p' | \
  sed -e "s/^[1-4]\s//" | \
  awk '{ print "0 '"'"'neg_" (NR-1) " |features " $0}' >> train.vw

cat aclImdb/test/labeledBow.feat | \
  sed -n 's/^\([7-9]\|10\)\s/&/p' | \
  sed -e "s/^\([7-9]\|10\)\s//" | \
  awk '{ print "1 '"'"'pos_" (NR-1) " |features " $0}' > test.vw

cat aclImdb/test/labeledBow.feat | \
  sed -n 's/^[1-4]\s/&/p' | \
  sed -e "s/^[1-4]\s//" | \
  awk '{ print "0 '"'"'neg_" (NR-1) " |features " $0}' >> test.vw

ruby -e 'File.open("audit.vw","w") do |f| f.puts "|features #{(0..89525).to_a.collect {|x| "#{x}:1"}.join(" ")}" end'

rm .cache
shuf train.vw | vw --adaptive --power_t 0.2 -c -f model.dat --passes 200 \
  --l1 5e-8 --l2 5e-8 --sort_features

cat test.vw | cut -d ' ' -f 1 > labels
cat test.vw | vw -t -i model.dat -p pred_out.tmp --quiet
cat audit.vw | vw -t -i model.dat -a --quiet > audit.log
cat pred_out.tmp | cut -d ' ' -f 1 > pred_out
rm pred_out.tmp

perf -files labels pred_out -easy
```
It’s here as a gist as well, if that’s more convenient.
What’s cool about a linear classifier is you can easily see what features have larger weights. The top keywords for positive and negative reviews are what you’d expect (only considering words that occur frequently in the reviews):
Top positive words, in order of importance: excellent, superb, wonderful, favorite, enjoyed, perfect, brilliant, enjoyable, amazing, great, fantastic, highly, best, unique, liked, loved, hilarious.
Top negative words, in order of importance: worst, waste, awful, boring, terrible, poorly, avoid, horrible, worse, dull, lame.
Here’s a script to print the 1,000 most frequently occurring words in decreasing order of absolute weight, if you’re interested. Obviously you have to run the above code first to generate audit.log.
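In case that script link ever rots, the idea is just to parse audit.log and sort features by absolute weight. A rough sketch, run here on a synthetic two-entry sample; the tab-separated namespace^feature:hash:value:weight layout is my reading of vw’s -a output, not gospel:

```shell
# Two hypothetical audit entries in (what I believe is) vw's -a format:
# tab-separated namespace^feature:hash:value:weight fields on one line.
printf 'features^12:3731:1:0.52\tfeatures^7:911:1:-0.81\n' > audit.sample

# Split entries onto their own lines, pull out the feature id and weight,
# and sort by absolute weight, largest first.
tr '\t' '\n' < audit.sample | \
  awk -F'[:^]' '$1 == "features" { a = ($5 < 0 ? -$5 : $5); print a, $2, $5 }' | \
  sort -rn
```

For the sample above, feature 7 (weight -0.81) sorts ahead of feature 12 (weight 0.52); on the real audit.log you’d also filter to frequent words and take the head of the list.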
So that’s sentiment analysis for you. It seems that on a very restricted domain (movie reviews), one can do really well with a linear classifier. I wonder why, especially in the Maas et al. paper, they didn’t even bother comparing against one of the simplest classifiers around*. Or maybe they did…(just kidding)
I want to be clear that my intention isn’t to criticize either of these papers. I think they’re great work, I’m a huge fan of many of the authors, and the techniques are general; neither is billed as a “sentiment analysis algorithm”. I only want to point out that maybe this isn’t the best task for demonstrating that the vector space representations they generate are useful, since in the raw feature space, simple methods already achieve very good performance.
—
* Of course I’m referring to the linear classifier, not the vw learning algorithm, which is not simple, but is fast and fantastic.