About Me

As an explanation for why there have been no posts in a long time, I’m now working on the Growth and Engagement team at Facebook. I’m on a small subteam that is tasked with creating metrics to enable us to directly compare and contrast user experience across the various platforms, such as web and mobile. Facebook has over 1B active users, and relies heavily on machine learning for a wide variety of tasks. Thus, for someone interested in ubiquitous machine learning (e.g., me), I can’t think of a better place to be. It’s a lot of fun, but has also kept me extremely busy.

A Quick Sentiment Analysis Experiment

I’ve been curious about sentiment analysis for some stuff I’m working on unrelated to my thesis or research. I came across a great paper by Andrew Maas and some colleagues at Stanford called Learning Word Vectors for Sentiment Analysis, published at ACL 2011. They introduce a dataset based on IMDB reviews. The goal is to classify a review as positive (ratings of 7-10 out of 10) or negative (ratings of 1-4). After some heavy preprocessing (stop-word removal, infrequent-word removal, etc.), running it through a fancy probabilistic model, representing it as a vector space model, doing some tfidf trickery, and then running a classifier on it, they achieve 88.33% accuracy on held-out data. Using a whole heap of unlabeled training data as well, they improve the accuracy to 88.89%.

Very cool stuff.

More recently, I came across a paper from George E. Dahl, Ryan Adams, and Hugo Larochelle, published at ICML 2012, called Training Restricted Boltzmann Machines on Word Observations. RBMs are a bit out of my area of expertise, but it seems they’ve come up with some fancy MCMC methods that allow them to train RBMs on really large vocabularies (and n-grams from these vocabularies, 5-grams even). They mentioned sentiment classification in the abstract, so I gave it a skim. Turns out they use a big RBM to build features from text, use those features to train a classifier to predict sentiment, and test it on the same IMDB dataset as in the previous paper I mentioned. Again, with some heavy preprocessing, a heavy-duty RBM, and a lot of parameter fiddling, they manage to achieve even better accuracy: 89.23%.

So my question was: how hard is this task, and do we really need such heavy machinery to do well? It seems easy: movie reviews with the word “awesome” are positive, and ones with “terrible” are negative. So I fired up my favourite classifier, Vowpal Wabbit. VW learns a linear classifier, and is optimized for big data with lots of features. Instead of preprocessing the data, I just hoped that a bit of regularization would do the trick (i.e., assign 0 weight to uninformative words). Converting the data into vw-friendly format was a couple of sed and awk commands, and I wrote a script to do a quick search over a reasonable parameter space for the regularization (l1 and l2) parameters. I just guessed at a number of passes and a value for the power_t parameter (which controls how fast the learning rate decreases).

The model takes about 30 seconds to learn on my fancy 2.8GHz Xeon desktop machine (parameter tuning took a couple of hours), and achieves an accuracy of 88.64%, plus or minus a few hundredths of a percent since it’s an online algorithm and we randomize the order of the training data. No preprocessing, no stop-word removal, no vector space model, no tfidf magic, no nothing. Raw word counts as features, 30 seconds of vw, and we have a linear classifier that is within 1% of the best reported result on this data.

Go vw! It took me longer to write this post than it did to code the experiments. Don’t believe me? Download the data and try it for yourself:

# Convert the bag-of-words feature files to vw format:
# ratings 7-10 become positive examples (label 1), ratings 1-4 negative (label 0).
cat aclImdb/train/labeledBow.feat | \
  sed -n 's/^\([7-9]\|10\)\s/&/p' | \
  sed -e "s/^\([7-9]\|10\)\s//" | \
  awk '{ print "1 '"'"'pos_" (NR-1) " |features " $0}' > train.vw
cat aclImdb/train/labeledBow.feat | \
  sed -n 's/^[1-4]\s/&/p' | \
  sed -e "s/^[1-4]\s//" | \
  awk '{ print "0 '"'"'neg_" (NR-1) " |features " $0}' >> train.vw
cat aclImdb/test/labeledBow.feat | \
  sed -n 's/^\([7-9]\|10\)\s/&/p' | \
  sed -e "s/^\([7-9]\|10\)\s//" | \
  awk '{ print "1 '"'"'pos_" (NR-1) " |features " $0}' > test.vw
cat aclImdb/test/labeledBow.feat | \
  sed -n 's/^[1-4]\s/&/p' | \
  sed -e "s/^[1-4]\s//" | \
  awk '{ print "0 '"'"'neg_" (NR-1) " |features " $0}' >> test.vw
# Build a single "audit" example containing every feature once,
# used below to dump the learned per-feature weights.
ruby -e 'File.open("audit.vw","w") do |f| f.puts "|features #{(0..89525).to_a.collect {|x| "#{x}:1"}.join(" ")}" end'
# Remove any stale vw cache, then train: shuffle the examples (vw learns
# online, so order matters), 200 passes, a touch of l1/l2 regularization.
rm .cache
shuf train.vw | vw --adaptive --power_t 0.2 -c -f model.dat --passes 200 --l1 5e-8 --l2 5e-8 --sort_features
# Predict on the test set, and dump per-feature weights for the audit example.
cat test.vw | cut -d ' ' -f 1 > labels
cat test.vw | vw -t -i model.dat -p pred_out.tmp --quiet
cat audit.vw | vw -t -i model.dat -a --quiet  > audit.log

cat pred_out.tmp | cut -d ' ' -f 1 > pred_out
rm pred_out.tmp

# Compute accuracy with the perf tool.
perf -files labels pred_out -easy

It’s here as a gist as well, if that’s more convenient.

What’s cool about a linear classifier is you can easily see what features have larger weights. The top keywords for positive and negative reviews are what you’d expect (only considering words that occur frequently in the reviews):

Top positive words, in order of importance: excellent, superb, wonderful, favorite, enjoyed, perfect, brilliant, enjoyable, amazing, great, fantastic, highly, best, unique, liked, loved, hilarious.

Top negative words, in order of importance: worst, waste, awful, boring, terrible, poorly, avoid, horrible, worse, dull, lame.

Here’s a script to print the 1,000 most frequently occurring words in decreasing order of absolute weight, if you’re interested. Obviously you have to run the above code first to generate audit.log.
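If you’re curious what that script does, extracting the rankings from audit.log is just a matter of parsing vw’s audit entries and sorting by absolute weight. Here’s a rough sketch in python; the audit entry format I assume here (namespace^feature:hash:value:weight) can vary between vw versions, so treat the regex as an approximation rather than a spec:

```python
import re

# One audit entry looks roughly like namespace^feature:hash:value:weight;
# the exact format differs across vw versions, so this regex is an assumption.
AUDIT_RE = re.compile(r"(\S+)\^(\S+?):\d+:[^:\s]+:(-?[0-9.eE+]+)")

def top_features(audit_text, n=10):
    """Return the n features with the largest absolute learned weight."""
    weights = {}
    for _ns, feat, w in AUDIT_RE.findall(audit_text):
        weights[feat] = float(w)
    return sorted(weights, key=lambda f: abs(weights[f]), reverse=True)[:n]
```

You would feed it the contents of audit.log and intersect the result with a list of frequent words to reproduce the keyword lists above.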

So that’s sentiment analysis for you. It seems that on a very restricted domain (movie reviews), one can do really well with a linear classifier. I wonder why, especially in the Maas et al. paper, they didn’t even bother comparing against one of the simplest classifiers around*. Or maybe they did…(just kidding)

I want to be clear that my intention isn’t to criticize either of these papers. I think they’re great works, I’m a huge fan of many of the authors, and the techniques are general. Neither of the techniques are billed as “sentiment analysis algorithms”. I only want to point out that maybe this isn’t the best task to demonstrate that the vector space representations they generate are useful, since in the raw feature space, simple methods already achieve very good performance.

* Of course I’m referring to the linear classifier, not the vw learning algorithm, which is not simple, but is fast and fantastic.


Great paper: Unsupervised Indoor Location

I haven’t blogged in a long time, so I’m looking for any excuse. Coming across a really cool paper seems a good one.

The paper is from Romit Roy Choudhury’s group at Duke, and appears to be a collaboration with some researchers at EJUST in Egypt. The title is No Need to War-Drive: Unsupervised Indoor Localization, and it was published at MobiSys (one of that group’s 4 papers there), which is taking place right now.

The main idea is very clever. Basically, dead reckoning on a mobile phone sucks, but there are lots of ‘landmarks’ that can be established from various sensors (e.g., magnetometer, sound, ambient light, etc.) and used to ground the estimate. These landmarks are discovered automatically using clustering. If the goal is to place a user on a map, then they can also use certain preexisting landmarks. For instance, they are able to accurately detect when a user moves up or down an elevator, escalator, or staircase, and so they can use the known locations of these structures to anchor landmarks to a map. They did some thorough testing in two buildings on campus as well as in a mall, and the accuracy is exceptional. The paper is written somewhat informally, but tells a great story of how the system was developed. There isn’t a single strong novel contribution, but rather a fantastic combination of clever tricks that together build an indoor localization system requiring no calibration.

Kudos to the authors on a great paper. For anyone thinking about indoor localization, I recommend this work.

Are the sensors coming?

A recent post on the New York Times blog, titled The Sensors are Coming!, describes a whole list of sensors that are expected to be included in future smartphones and mobile devices. Most smartphones contain accelerometers, which, by detecting the constant downward acceleration due to gravity, can determine the orientation of the device. This allows the screen to orient itself when you rotate the device.
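To make the gravity trick concrete, here’s a rough sketch of how pitch and roll can be recovered from a static accelerometer reading. Axis conventions and signs differ across platforms, so treat this particular convention (z pointing out of the screen) as an assumption:

```python
import math

def pitch_roll(ax, ay, az):
    """Estimate device pitch and roll (degrees) from a static accelerometer
    reading, i.e., when gravity is the only acceleration present.
    Assumes x points right, y points up, z points out of the screen;
    real platforms vary, so the signs here are an assumption."""
    pitch = math.degrees(math.atan2(-ax, math.sqrt(ay * ay + az * az)))
    roll = math.degrees(math.atan2(ay, az))
    return pitch, roll
```

A phone lying flat reads roughly (0, 0, 9.8) and yields zero pitch and roll; held upright in portrait, gravity shifts onto the y axis and the roll approaches 90 degrees, which is the signal used to rotate the screen.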

Accelerometers are also used for gaming, allowing you to steer a car by tilting the phone, for example. Recent phones such as the Nexus S now are shipping with gyroscopes, which are much more accurate at detecting changes in the orientation of the phone than accelerometers alone, allowing for more precision in the gameplay.

The NYT blog mentions other sensors, such as heart-rate, skin temperatures and perspiration sensors for determining physiological conditions, and environmental sensors such as additional microphones, thermometers, altimeters, and humidity sensors.

I would guess that the altimeters would, in fact, be barometric pressure sensors. In my experience with the awesome Mobile Sensing Platform devices that Intel was kind enough to donate to our research group, these can fairly accurately discern between different floors of a building, for example, if properly calibrated. Calibration is no easy task, though, so it is more likely that these sensors will be used to detect relative changes in altitude. Even that can be useful, for example, to improve a phone’s estimates of calorie expenditure.

I believe that more environmental sensors may be included in future smartphones, mostly because they’re cheap and work well. I’m less certain about physiological sensors, for a number of reasons.

First, and most obvious in my mind, is that the measurement apparatus for heart rate, galvanic skin response, and the like needs to be carefully placed on the skin in order to produce any meaningful data. I attended the AAAI Spring Symposium meeting on Computational Physiology in March 2011, and heard numerous stories from sensor vendors such as BodyMedia about how hard it is to build these devices. They work pretty well, but they are fickle, and even at best the data is very noisy: good enough to roughly approximate heart rate and caloric expenditure, but not precise enough for much else. The notion that your heart rate could be monitored by a sensor in a phone carried in your pocket (not even touching your bare skin, let alone being held firmly against it) is incredibly unlikely. Moreover, I can’t imagine there is any reliable way to measure heart rate that is robust to how people hold and carry their phones. More likely is that devices like the ones produced by BodyMedia will become more prevalent.

The second reason is that if these physiological sensors were capable of being used in a medical diagnostic setting, they would need approval from organizations like the FDA. Since the idea is that they would be included in phones on which arbitrary applications can be installed, it is hard to argue that they could not be used for this purpose: anyone can write an app that performs this function (maybe not well, but that’s beside the point). This is certainly not my area of expertise, but I would imagine this means that any phone including these types of sensors would have to be approved by the FDA or an equivalent organization, and that isn’t something phone vendors are going to put themselves through.

So, in conclusion, I don’t think that fancy physiological sensors are going to be included in smartphones; I think they will remain relegated to special-purpose devices. These devices will (and already do) interface with your phone to facilitate interaction. It’s certainly fun to imagine a future where Angry Birds levels get easier or harder depending on the level of frustration the player is experiencing, as measured by GSR.

Friends, Romans, countrymen, lend me your consolidated.db

A big part of my thesis is going to be on extracting location information from GPS, cellular radio, and wifi signals. There’s a big brouhaha in the media right now because some researchers have rediscovered [1] that iOS (the OS on the iPhone and other Apple devices) logs location information to a file called consolidated.db, which is stored indefinitely on the phone (update below). Of course, privacy folks are going berserk because this could be described as “Apple secretly tracking locations of all iPhone users”, which to me is disingenuous and tends towards fear-mongering rather than informing. I don’t think that anyone would be surprised if you told them that their iPhone has the capability of determining its location, and caching data is how applications reduce latency and bandwidth usage for network services. There’s absolutely no evidence that this data is ever shared with anyone, so it seems like it’s hardly a privacy violation, or as big a deal as people are making of it. Also, it’s worth noting that Android logs this kind of data in the radio log and some cache files, it just doesn’t keep it around for quite as long. I read somewhere that Windows Phones cache this data as well.

The data stored in consolidated.db is very coarse location data. I have seen no evidence that the location of the phone is stored, only information about the locations of visible wifi base-stations and cell towers, which are used for triangulating the phone. Since it’s a cache, it will likely only contain a single entry for each base-station or tower, and so someone could figure out the first time you were in a specific location, but not much else (how long you were there, if and when you returned to that location, for example).
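For the curious, poking around a consolidated.db file is just sqlite. Here’s a rough sketch; the table and column names (CellLocation, WifiLocation, Timestamp, Latitude, Longitude) are based on what was being reported at the time and may well differ between iOS versions, so treat them as assumptions:

```python
import sqlite3

def dump_locations(path, table="CellLocation", limit=5):
    """Return the oldest cached tower/base-station sightings.

    The table and column names here come from reports circulating at the
    time and may differ between iOS versions; treat them as assumptions.
    """
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute(
            "SELECT Timestamp, Latitude, Longitude FROM %s "
            "ORDER BY Timestamp LIMIT %d" % (table, int(limit))).fetchall()
    finally:
        conn.close()
    return rows
```

Since each tower or base-station appears only once in the cache, the oldest timestamps roughly mark the first time the phone saw each location.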

This is exactly the type of data that I’ve been hoping to collect for my research. I have an Android application that does just this, but much more frequently, and have been logging my data for almost 9 months. However, what I really intend to study is how to combine data from many people to better understand human mobility patterns, and the challenge is to actually collect this kind of data from many people. The data that I seek would include information about all the places a person visits, how long they remain there, mode of transportation, etc., and so it’s no surprise that people are hesitant to just hand over that information.

But this data would be invaluable to my research. I have approval from the Ethics Board at McGill to do this kind of research, and I can ensure that any data I collect is stored in an encrypted (secure) form, and will not be shared with anyone, not even other researchers in our group. So I’m begging the world, please send me your consolidated.db files. You’ll be helping out a poor grad student!

If you have a Mac (or Windows with python installed), and you’ve backed up your iPhone, I have modified a script that will locate the consolidated.db file from the most recent backup and copy it to your desktop. Just run this script (it’s a python file, but the McGill web server tries to run files with .py extensions), and if it says that it found the consolidated.db file and copied it to your desktop, then please email me (jordan.frank@cs.mcgill.ca) that file. I will be forever indebted.

Update: It appears that Apple has released an update to shorten the time for which this data is cached, and also to prevent this database from being backed up in iTunes. Bummer for me, but definitely better for protecting user privacy…

[1] I first read about this in Feb. 2011, and researchers have known about this as far back as Sept. 2010 (link1, link2)


LDAing Twitter

Over the past few months I’ve been reading a lot about Bayesian nonparametrics. My current thoughts on the matter will make up a future post. A good starting point for understanding the hierarchical Dirichlet process mixture model is to understand the parametric form, an example of which is LDA. LDA, or latent Dirichlet allocation [1], is an approach to modeling the process by which natural-language documents are constructed. LDA uses the bag-of-words paradigm, in which documents are really just collections of words, ignoring the order in which they appear. Obviously a simplistic view, but a reasonable one. For the remainder of this post, I’m going to be intentionally vague about the details; rather than get bogged down in them, I’ll leave it to those who are interested to read one of the hundreds of existing tutorials.

So the LDA story goes like this. There are K possible topics that people can write about: technology, sports, pop culture, computer science, biology, etc. Associated with each topic is a distribution over words, so, for example, in the biology topic the probability associated with the word “cell” would be higher than the probability of the word “robot”, while the opposite would likely be true for the technology topic. Associated with each document is a distribution over topics, so, for example, an article in the New York Times might be about how technology is affecting the study of biological organisms, making it roughly half about the technology topic and half about biology. Finally, a document is generated, first by choosing its length N, and then N times choosing first a topic t from the topic distribution, then a word from the word distribution conditioned on t. So, in our imaginary article, approximately half of the words would be related to technology, and half to biology.
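The generative story above is short enough to write down directly. Here’s a sketch using numpy, where the K x V topic-word matrix and the symmetric Dirichlet parameter alpha are hypothetical inputs that, in real LDA, would be learned from data:

```python
import numpy as np

def generate_document(topic_word, alpha, length, rng=None):
    """Sample one bag-of-words document from the LDA generative story.

    topic_word: K x V matrix; each row is a topic's distribution over words.
    alpha: symmetric Dirichlet concentration for the document's topic mixture.
    """
    rng = rng or np.random.default_rng()
    K, V = topic_word.shape
    theta = rng.dirichlet([alpha] * K)   # this document's distribution over topics
    words = []
    for _ in range(length):
        t = rng.choice(K, p=theta)               # first choose a topic...
        words.append(rng.choice(V, p=topic_word[t]))  # ...then a word from it
    return words
```

With a small alpha, theta concentrates on a few topics, which is why most of a generated document ends up about one or two subjects rather than a uniform mix of all K.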

The hierarchical part of this is that we want topics to be shared among different documents, and so we need a way of tying the distributions for each document. The nonparametric extension does away with K, allowing the number of topics to depend on the data.

The trouble, then, is to estimate all of these distributions from data. That is, no one tells us what the topics are, just that there are K of them, and no one tells us what words are associated with what topics, or in what proportions. This is all learned; given a number, K, and a bunch of documents, learn everything else. Traditionally, this required having a huge batch of documents, and chugging away with a sampler or by optimising some variational bound on the posterior. More recently, Matt Hoffman developed a way to optimise the variational bound using an online stochastic gradient approach [2]. This lets us feed LDA documents incrementally in small batches, rather than all at once. Matt also released code. He also has an implementation in Vowpal Wabbit, but I can’t seem to figure out how to interpret the results, so I’m sticking with the python version for now.
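The heart of the online approach is a Robbins-Monro step-size schedule that blends each mini-batch’s estimate of the topic parameters into the running estimate. A toy sketch of just that blending step (the full variational E-step that produces the per-batch estimate is omitted, and the function and parameter names here are my own, not Matt’s API):

```python
def learning_rate(t, tau0=1.0, kappa=0.7):
    """Step size for mini-batch t, following Hoffman et al.: it decays
    slowly enough that the steps sum to infinity, but fast enough that
    their squares don't, which is what makes the updates converge."""
    return (tau0 + t) ** (-kappa)

def online_update(lam, lam_hat, t):
    """Blend current topic parameters lam with lam_hat, the estimate
    computed from mini-batch t alone (a stand-in for the full
    variational update over a batch of, say, 256 tweets)."""
    rho = learning_rate(t)
    return [(1 - rho) * a + rho * b for a, b in zip(lam, lam_hat)]
```

Early batches move the parameters a lot; later ones nudge them, which is what lets the scripts run 24/7 without ever revisiting old tweets.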

So, to get some experience with this online version of LDA, I wrote some scripts to listen to the Twitter streaming API, collect sets of 256 tweets from the live feed, and perform batch updates for LDA. I am using two dictionaries, a big one containing 38138 words and a smaller one containing 7702 words, and run separate processes, one for each dictionary. The scripts run non-stop, 24/7, parsing the Twitterverse. I used the default parameters that Matt used for analysing Wikipedia, but increased the batch size, since tweets are much shorter than Wikipedia articles. I think it’ll be really interesting to see how these topic models hold up on very short documents.

The results are updated every hour or so, and can be viewed here. On the results page, you first select a dictionary (I prefer the results on the smaller dictionary), and then you can scroll through the topics, viewing the 50 most likely words for that topic, by clicking on the arrows, or using the left and right arrow keys on your keyboard (much more convenient). It’s interesting to note that even with the python implementation of online LDA, it takes more time to parse 256 tweets from the streaming API than it does for the algorithm to process a batch of 256 tweets (although I think that says more about the performance of the tweet parser than it does about the LDA code).

The results are kind of cool. I’m not convinced that it’s not just my mind drawing relationships where there aren’t any, but some of the topics seem to make sense. At the time of writing this, the scripts had parsed 2.5 million random tweets. There aren’t any obviously identifiable topics, like technology or biology, but different words that are used in similar contexts seem to arise in the same topics. Remember, the words are just meaningless features in a document, so the system has no idea that words like woman and women are similar. Therefore, it’s nice to see them both appear in a common topic. Other examples are “didn’t, doesn’t, shouldn’t” all being in a topic, and “london, ireland, dublin” all appearing in another topic. There are a whole bunch more, but I don’t want to ruin all of the surprises.

So anyway, I thought this was a neat little project, and I’m looking forward to any feedback you have. It’s possible that LDA will just fail horribly on such small documents, and likely that I’m using awful parameters, but the results look a bit promising. Currently, we need a dictionary of acceptable words, and given the 140-character limit on tweets, people tend to use shortened versions of words that might not show up in the dictionary. Vowpal Wabbit doesn’t have this restriction (it hashes words rather than associating an index with every word in the dictionary), so if I can figure out how to interpret its results, I’ll switch over and see what happens. I’ll also probably switch to a C++ version, at least for the JSON parsing of the Twitter stream, because currently the performance is abysmal; I haven’t really profiled it to confirm that parsing is the bottleneck, but I’m fairly confident it is.

Now I’m on to thinking about how to visualize the output. I have in my mind some far-out interface where there is a graph of tweets with edges connecting those that are close in topic-space, and you can zoom around the graph. Of course you’d want to do this in a web browser, and so the trick is how to avoid having to send 5 million tweets to the client. I have a script that can take a tweet, or any sample sentence, and find the 10 nearest neighbours in topic-space (using the dot product between topic vectors as a measure of similarity), but the thing takes 10 minutes just to load and parse the 2.5 million tweets that I’ve collected. Not really suited for a web service.
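For the record, the nearest-neighbour search itself is a one-liner with numpy once the topic vectors are in a matrix; the ten minutes go into loading and parsing the tweets, not this:

```python
import numpy as np

def nearest_tweets(query_vec, tweet_vecs, k=10):
    """Indices of the k tweets whose topic vectors have the largest
    dot product with the query's topic vector.
    tweet_vecs: N x K matrix of per-tweet topic vectors."""
    sims = tweet_vecs @ np.asarray(query_vec)
    return list(np.argsort(-sims)[:k])
```

(Using the dot product as the similarity measure, as described above; swapping in cosine similarity would just mean normalizing the rows first.)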

I’m also (finally) getting used to talking about tweets without grinning foolishly every time I write or say the word.

tldr: Scraping twitter, handing batches to online LDA with 100 topics, results here.

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation, JMLR, 3(Jan):993-1022, 2003.
[2] M. Hoffman, D. Blei, and F. Bach, Online Learning for Latent Dirichlet Allocation, NIPS, 2010.


First Post

Welcome to my blog. I’ve thought about blogging for a long time, have even started a few, but this time I’m going to force myself to blog. I find that if I don’t write much for a while, I get much worse at writing. Makes sense. So to force myself to keep up my writing chops, I’m going to maintain this blog, and do my best to ensure that it doesn’t get stale.

The real spark that started me on this renewed path to blogging is a cool little project that I’ve been working on in the minimal spare time I’ve had during the IJCAI, ICML, and ECML submission blitz. So, without further ado, I’ll start on a post that might actually interest others.