Monthly Archives: April 2011

Friends, Romans, countrymen, lend me your consolidated.db

A big part of my thesis is going to be on extracting location information from gps, cellular radio, and wifi signals. There’s a big brou-haha in the media right now because some researchers have rediscovered [1] that iOS (the OS on the iPhone and other Apple devices) logs location information to a file called consolidated.db, which is stored indefinitely on the phone (update below). Of course, privacy folks are going berserk because this could be described as “Apple secretly tracking locations of all iPhone users”, which to me is disingenuous and tends towards fear-mongering rather than informing. I don’t think that anyone would be surprised if you told them that their iPhone has the capability of determining its location, and caching data is how applications reduce latency and bandwidth usage for network services. There’s absolutely no evidence that this data is ever shared with anyone, so it seems like it’s hardly a privacy violation, or as big a deal as people are making of it.  Also, it’s worth noting that Android logs this kind of data in the radio log and some cache files, it just doesn’t keep it around for quite as long. I read somewhere that Windows Phones cache this data as well.

The data stored in consolidated.db is very coarse location data. I have seen no evidence that the location of the phone is stored, only information about the locations of visible wifi base-stations and cell towers, which are used for triangulating the phone. Since it’s a cache, it will likely only contain a single entry for each base-station or tower, and so someone could figure out the first time you were in a specific location, but not much else (how long you were there, if and when you returned to that location, for example).

This is exactly the type of data that I’ve been hoping to collect for my research. I have an Android application that does just this, but much more frequently, and have been logging my data for almost 9 months. However, what I really intend to study is how to combine data from many people to better understand human mobility patterns, and the challenge is to actually collect this kind of data from many people. The data that I seek would include information about all the places a person visits, how long they remain there, mode of transportation, etc., and so it’s no surprise that people are hesitant to just hand over that information.

But this data would be invaluable to my research. I have approval from the Ethics Board at McGill to do this kind of research, and I can ensure that any data I collect is stored in an encrypted (secure) form, and will not be shared with anyone, not even other researchers in our group. So I’m begging the world, please send me your consolidated.db files. You’ll be helping out a poor grad student!

If you have a Mac (or Windows with python installed), and you’ve backed up your iPhone, I have a modified a script that will locate the consolidated.db file from the most recent backup and copy it to your desktop. Just run this script (it’s a python file, but the McGill web server tries to run files with .py extensions), and if it says that it found the consolidated.db file and copied it to your desktop, then please email me ( that file. I will be forever indebted.

Update: It appears that Apple has released an update to shorten the time for which this data is cached, and also to prevent this database from being backed up in iTunes. Bummer for me, but definitely better for protecting user privacy…

[1] I first read about this in Feb. 2011, and researchers have known about this as far back as Sept., 2010 (link1, link2)


LDAing Twitter

Over the past few months I’ve been reading a lot about Bayesian nonparametrics. My current thoughts on the matter will make up a future post. A good starting point for understanding the hierarchical Dirichlet process mixture model is to understand the parametric form, an example of which is LDA. LDA, or latent Dirichlet allocation [1] is an approach to modeling a process by which natural-language documents are constructed.  LDA uses the bag-of-words paradigm, where documents are really just collections of words, ignoring the order in which they appear. Obviously a simplistic view, but one that is reasonable. For the remainder of this post, I’m going to be intentionally vague about the details, as rather than get bogged down in them, I’ll leave it to those who are interested to read one of the hundreds of existing tutorials.

So the LDA story goes like this. There are K possible topics that people can write about: technology, sports, pop culture, computer science, biology, etc. Associated with each topic is distribution over words, so, for example, in the biology topic the probability associated with the word “cell” would be higher than the probability of the word “robot”, while the opposite would likely be true for the technology topic. Associated with each document is a distribution over topics, so, for example, an article in the New York Times might be about how technology is affecting the study of biological organisms, making it roughly half about the technology topic and half biology. Finally, a document is generated, first by choosing its length N, and then N times choosing first a topic t from the topic distribution, then a word from the word distribution, conditioned on t. So, in our imaginary article, approximately half of the words would be related to technology, and half to biology.

The hierarchical part of this is that we want topics to be shared among different documents, and so we need a way of tying the distributions for each document. The nonparametric extension does away with K, allowing the number of topics to depend on the data.

The trouble, then, is to estimate all of these distributions from data. That is, no one tells us what the topics are, just that there are K of them, and no one tells us what words are associated with what topics, or in what proportions. This is all learned; given a number, K, and a bunch of documents, learn everything else. Traditionally, this required having a huge batch of documents, and chugging away with a sampler or by optimising some variational bound on the posterior. More recently, Matt Hoffmann developed a way to optimise the variational bound using an online stochastic gradient approach [2]. This allows us to incrementally feed LDA documents in small batches, rather than all at once. Matt also released code. He also has an implementation in Vowpal Wabbit, but I can’t seem to figure out how to interpret the results, so I’m sticking with the python version for now.

So, to get some experience with this online version of LDA, I wrote some scripts to listen to the Twitter streaming API, collect sets of 256 Tweets from the live feed, and performing batch updates for LDA. I am using two dictionaries, a big one containing 38138 words, and a smaller one containing 7702 words, and run separate processes, one for each dictionary. The scripts run non-stop, 24/7, parsing the Twitterverse. I used the default parameters that Matt used for analysing Wikipedia, but increased the batch size, since Tweets are much shorter than Wikipedia articles. I think that it’ll be really interesting to see how these topic models hold up on very short documents.

The results are updated every hour or so, and can be viewed here. On the results page, you first select a dictionary (I prefer the results on the smaller dictionary), and then you can scroll through the topics, viewing the 50 most likely words for that topic, by clicking on the arrows, or using the left and right arrow keys on your keyboard (much more convenient). It’s interesting to note that even with the python implementation of online LDA, it takes more time to parse 256 tweets from the streaming API than it does for the algorithm to process a batch of 256 tweets (although I think that says more about the performance of the tweet parser than it does about the LDA code).

The results are kind of cool. I’m not convinced that it’s not just my mind drawing relationships where there aren’t any, but some of the topics seem to make sense. At the time of writing this, the scripts had parsed 2.5 million random tweets. There aren’t any obviously identifiable topics, like technology or biology, but different words that are used in similar contexts seem to arise in the same topics. Remember, the words are just meaningless features in a document, so the system has no idea that words like woman and women are similar. Therefore, it’s nice to see them both appear in a common topic. Other examples are “didn’t, doesn’t, shouldn’t” all being in a topic, and “london, ireland, dublin” all appearing in another topic. There are a whole bunch more, but I don’t want to ruin all of the surprises.

So anyway, I thought that this was a neat little project, and I’m looking forward to any feedback that you have. It’s possible that LDA is just going to fail horribly on such small documents, and likely that I’m using awful parameters, but the results look a bit promising. Currently, we need a dictionary of acceptable words, and given the 140 character limit on tweets, people tend to use shortened versions of words that might not show up in the dictionary. Vowpal Wabbit doesn’t have this restriction (it hashes words, rather than associating an index with every word in the dictionary), so if I can just figure out how to interpret the results, I’ll switch over and see what happens. I’ll also probably switch to a C++ version, at least for the JSON parsing of the twitter stream, because currently the performance is abysmal, and although I haven’t really profiled it to ensure it’s the parsing that is the bottleneck, I’m fairly confident that it is.

Now I’m on to thinking about how to visualize the output. I have in my mind some far-out interface where there is a graph of tweets with edges connecting those that are close in topic-space, and you can zoom around the graph. Of course you’d want to do this in a web browser, and so the trick is how to avoid having to send 5 million tweets to the client. I have a script that can take a tweet, or any sample sentence, and find the 10 nearest neighbours in topic-space (using the dot product between topic vectors as a measure of similarity), but the thing takes 10 minutes just to load and parse the 2.5 million tweets that I’ve collected. Not really suited for a web service.

I’m also (finally) getting used to talking about tweets without grinning foolishly every time I write or say the word.

tldr: Scraping twitter, handing batches to online LDA with 100 topics, results here.

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation, JMLR, 3(Jan):993-1022, 2003.
[2] M. Hoffman, D. Blei, and F. Bach, Online Learning for Latent Dirichlet Allocation, NIPS, 2010.

Tagged , ,

First Post

Welcome to my blog. I’ve thought about blogging for a long time, have even started a few, but this time I’m going to force myself to blog. I find that if I don’t write much for a while, I get much worse at writing. Makes sense. So to force myself to keep up my writing chops, I’m going to maintain this blog, and do my best to ensure that it doesn’t get stale.

The real spark that started me on this renewed path to blogging is a cool little project that I’ve been working on in the minimal spare time I’ve had during the IJCAI, ICML, and ECML submission blitz. So, without further ado, I’ll start on a post that might actually interest others.