A Quick Sentiment Analysis Experiment

I’ve been curious about sentiment analysis for some stuff I’m working on, unrelated to my thesis or research. I came across a great paper by Andrew Maas and some colleagues at Stanford called Learning Word Vectors for Sentiment Analysis, published at ACL 2011. They introduce a dataset based on IMDB reviews; the goal is to classify a review as positive (rating of 7-10 out of 10) or negative (rating of 1-4). After some heavy preprocessing (stop-word removal, infrequent-word removal, etc.), they run the reviews through a fancy probabilistic model, represent them in a vector space, do some tfidf trickery, and then run a classifier on the result, achieving 88.33% accuracy on held-out data. Using a whole heap of unlabeled training data as well, they could improve the accuracy to 88.89%.

Very cool stuff.

More recently, I came across a paper from George E. Dahl, Ryan Adams, and Hugo Larochelle, published at ICML 2012, called Training Restricted Boltzmann Machines on Word Observations. RBMs are a bit out of my area of expertise, but it seems they’ve come up with some fancy MCMC methods that allow RBMs to be trained on really large vocabularies (and on n-grams from these vocabularies, even 5-grams). They mentioned sentiment classification in the abstract, so I gave it a skim. It turns out they use a big RBM to build features from text, use those features to train a classifier to predict sentiment, and test it on the same IMDB dataset as the previous paper. Again, with some heavy preprocessing, a heavy-duty RBM, and a lot of parameter fiddling, they manage to achieve an even better accuracy: 89.23%.

So my question was: how hard is this task, and do we really need such heavy machinery to do well? It seems easy: movie reviews that have the word “awesome” are positive, and ones with “terrible” are negative. So I fired up my favourite classifier, Vowpal Wabbit. VW learns a linear classifier and is optimized for big data with lots of features. Instead of preprocessing the data, I just hoped that a bit of regularization would do the trick (i.e., assign zero weight to uninformative words). Converting the data into a vw-friendly format took a couple of sed and awk commands, and I wrote a script to do a quick search over a reasonable parameter space for the regularization (l1 and l2) parameters. I just guessed at the number of passes and the value of the power_t parameter (which controls how fast the learning rate decreases).

The model takes about 30 seconds to learn on my fancy 2.8GHz Xeon desktop machine (parameter tuning took a couple of hours) and achieves an accuracy of 88.64%, plus or minus a few hundredths of a percent, since it’s an online algorithm and we randomize the order of the training data. No preprocessing, no stop-word removal, no vector space model, no tfidf magic, no nothing. Raw word counts as features, 30 seconds of vw, and we have a linear classifier that is within 1% of the best reported result on this data.

Go vw! It took me longer to write this post than it did to code the experiments. Don’t believe me? Download the data and try it for yourself:

cat aclImdb/train/labeledBow.feat | \
  sed -n 's/^\([7-9]\|10\)\s//p' | \
  awk '{ print "1 '"'"'pos_" (NR-1) " |features " $0}' > train.vw
cat aclImdb/train/labeledBow.feat | \
  sed -n 's/^[1-4]\s//p' | \
  awk '{ print "0 '"'"'neg_" (NR-1) " |features " $0}' >> train.vw
cat aclImdb/test/labeledBow.feat | \
  sed -n 's/^\([7-9]\|10\)\s//p' | \
  awk '{ print "1 '"'"'pos_" (NR-1) " |features " $0}' > test.vw
cat aclImdb/test/labeledBow.feat | \
  sed -n 's/^[1-4]\s//p' | \
  awk '{ print "0 '"'"'neg_" (NR-1) " |features " $0}' >> test.vw
ruby -e 'File.open("audit.vw","w") do |f| f.puts "|features #{(0..89525).to_a.collect {|x| "#{x}:1"}.join(" ")}" end'
rm -f .cache
shuf train.vw | vw --adaptive --power_t 0.2 -c -f model.dat --passes 200 --l1 5e-8 --l2 5e-8 --sort_features
cat test.vw | cut -d ' ' -f 1 > labels
cat test.vw | vw -t -i model.dat -p pred_out.tmp --quiet
cat audit.vw | vw -t -i model.dat -a --quiet  > audit.log

cat pred_out.tmp | cut -d ' ' -f 1 > pred_out
rm pred_out.tmp

perf -files labels pred_out -easy

It’s here as a gist as well, if that’s more convenient.
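The quick search over the l1 and l2 parameters isn’t shown above, so here is a guess at its shape (the 5e-2 through 5e-10 grid comes from a comment below; everything else is my assumption). The loop only prints each vw invocation, so it’s safe to inspect first; pipe its output to sh to actually run the training jobs, then score each resulting model on a holdout slice of train.vw to pick a winner.

```shell
# Hypothetical sketch of a grid search over the l1/l2 regularization
# strengths. It prints one vw command per setting rather than running
# it; pipe the output to sh to train for real.
cmds=$(for reg in 5e-2 5e-3 5e-4 5e-5 5e-6 5e-7 5e-8 5e-9 5e-10; do
  echo "vw --adaptive --power_t 0.2 -c -k --passes 200" \
       "--l1 $reg --l2 $reg -f model_$reg.dat train.vw"
done)
printf '%s\n' "$cmds"
```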

What’s cool about a linear classifier is that you can easily see which features get the largest weights. The top keywords for positive and negative reviews are what you’d expect (considering only words that occur frequently in the reviews):

Top positive words, in order of importance: excellent, superb, wonderful, favorite, enjoyed, perfect, brilliant, enjoyable, amazing, great, fantastic, highly, best, unique, liked, loved, hilarious.

Top negative words, in order of importance: worst, waste, awful, boring, terrible, poorly, avoid, horrible, worse, dull, lame.

Here’s a script to print the 1,000 most frequently occurring words in decreasing order of absolute weight, if you’re interested. Obviously you have to run the above code first to generate audit.log.
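That script isn’t reproduced here, but a minimal self-contained sketch might look like the following. Two assumptions are baked in: vw’s -a audit output lists each feature roughly as namespace^id:hash:value:weight, and aclImdb/imdb.vocab lists one word per line in decreasing order of frequency, so feature id N is the word on line N+1. The tiny imdb.vocab and audit.log written at the top are synthetic stand-ins so the sketch runs on its own; point it at the real files in practice.

```shell
# Synthetic stand-ins for imdb.vocab and audit.log (replace with the
# real files generated by the commands above).
printf 'great\nworst\nthe\n' > imdb.vocab
printf 'features^0:0:1:0.9\tfeatures^1:1:1:-1.4\tfeatures^2:2:1:0.1\n' > audit.log

# Split audit terms onto their own lines, keep |weight| and feature id
# for the most frequent words, sort by |weight|, then map ids to words.
top=$(tr '\t' '\n' < audit.log |
  awk -F: '/\^/ {
    split($1, a, "^"); id = a[2]; w = $4
    if (id < 1000)                     # only the 1,000 most frequent words
      printf "%s %s\n", (w < 0 ? -w : w), id
  }' |
  sort -rn |
  awk 'NR == FNR { word[FNR - 1] = $1; next } { print word[$2], $1 }' \
    imdb.vocab -)
printf '%s\n' "$top"
```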

So that’s sentiment analysis for you. It seems that on a very restricted domain (movie reviews), one can do really well with a linear classifier. I wonder why, especially in the Maas et al. paper, they didn’t even bother comparing against one of the simplest classifiers around*. Or maybe they did…(just kidding)

I want to be clear that my intention isn’t to criticize either of these papers. I think they’re great work, I’m a huge fan of many of the authors, and the techniques are general. Neither technique is billed as a “sentiment analysis algorithm”. I only want to point out that maybe this isn’t the best task for demonstrating that the vector space representations they generate are useful, since in the raw feature space, simple methods already achieve very good performance.

* Of course I’m referring to the linear classifier, not the vw learning algorithm, which is not simple, but is fast and fantastic.


12 thoughts on “A Quick Sentiment Analysis Experiment”

  1. Brian says:

    > --l1 5e-8 --l2 5e-8
    Did you select these regularization parameters based on performance in the training set, or performance in the test set? Did the parameters affect results significantly?
    Impressive regardless.

    • Jordan says:

      Hi Brian. Yeah, all the parameter tuning was done on the training set. I didn’t really look much at how the parameters affected the results; this was all thrown together very quickly, as I’m in the final sprint of writing my thesis. I tried 5e{-2..-10}.

  2. osdf says:

    Nice post! A recent ACL paper by Wang/Manning may be interesting to you: http://www.stanford.edu/~sidaw/cgi-bin/home/lib/exe/fetch.php?media=papers:compareacl.pdf

    • Jordan says:

      Thanks, that’s great. My hunch is that proper regularization should act as data-driven stopword removal, so it’s nice to see much more empirical evidence supporting that.

  3. Sam says:

    how do you produce audit.vw?

    • Jordan says:

      Whoops, forgot to include that. It’s just a one-liner in ruby, something like

      File.open("audit2.vw","w") do |f| f.puts "|features #{(0..89525).to_a.collect {|x| "#{x}:1"}.join(" ")}" end

      Basically just one example with every feature active with a value of 1.

  4. Sam says:

    you print not the top 1000 words, but the top words from the first 1000 lines of the lexicon.

    if you include the whole lexicon, you will find that the most negative words are “etta” and “bardwork” and the most positive words are “definitive”, “knockout”, “pasdar”,
    “creepiness-sniffing”, “rabin”, “oberon”, “doesnt” and “noooo”.

    • Jordan says:

      Hi Sam. You’re right, the description of what I was printing out wasn’t perfectly clear. I only considered the 1,000 most frequently occurring words when printing out the weights, as those are the words I’m most interested in. There are certainly words that occur in only one or two strongly polarized reviews and will get high-magnitude weights, but looking at those weights isn’t really all that informative.

  5. Charles Pritchard says:

    Is the “perf” script one of your utilities or included from another library?

  6. Igor says:

    Thanks for the post!
    For some reason the latest version of vw doesn’t seem to work as well. I am getting an 84.5-85.7% accuracy instead of 88.6% (so there is noise too). Do you know what might have changed? Do I need to adjust the learning parameters?
