Get your AOL data on

by greg on August 7, 2006

Now that you’ve downloaded the AOL data (if not, find a mirror), get the stuff in a database – you’re not going to be able to open these puppies in Excel. Josh uses PostgreSQL, and has posted how he imported the AOL data here. I’m more of a MySQL guy myself, so I plagarized him a bit to do my own import.

First, I got the ten unzipped files into one big ol’ file:

cat user-ct-test-collection-*.txt | grep -v “AnonID” | grep -v “\\\.” > aoldb.txt

Then I popped open the MySQL shell:

mysql -u root -p none-of-your-business

Then, the following:

create database aolsearch;
use aolsearch;
create table aolquery (anonid integer, query varchar(1024), querytime datetime, itemrank integer, clickurl varchar(1024));
load data local infile ‘aoldb.txt’ into table aolquery;

Bingo:

Query OK, 36389532 rows affected, 65535 warnings (7 min 52.56 sec)
Records: 36389532 Deleted: 0 Skipped: 0 Warnings: 16946940

(The warnings are from the records without click-through URLs.)

Thank god for the below:

mysql> select * from aolquery where query = ‘greg yardley’;
Empty set (1 min 40.79 sec)

I’m not going to really get a chance to dig in until tonight, but there’s a lot to learn about user behavior from the data set. For instance, this query tells me there’s 329 unique people in the data set who were looking to refinance their homes:

mysql> select count(distinct anonid) from query where query like ‘%refinance%’;

{ 6 comments… read them below or add one }

John K August 7, 2006 at 8:44 pm

Greg, how long does your mysql box take to add an index to a 36M row table?

Since I like being pedantic and didactic, fyi:
mysql -u root -p none-of-your-business

would start up using a database called none-of-your-business. -p’mypass’ is the way to specify the passwd on the cmd line

Greg Yardley August 7, 2006 at 9:48 pm

I didn’t bother creating any indexes – when I get a chance to play more with the data I’ll do that and let you know. I need to index at least the AnonID, I’m thinking.

Got to pick up my MacBook from the shop; it’s way faster than this PowerBook G4 and I’ll be playing with both it and this data tonight.

You get a pass on being a pedant as long as you’re a didact.

Justin August 11, 2006 at 4:38 am

Well i indexed the whole data set today (on all 36m rows) i used CREATE INDEX aolstuff ON aoldata2 (anonid, query); and it took 6 hours and 13 minutes…
machine specs are:

dual p3 750
512 meg Ram
debian linux sarge
mysql 4.1.11 (default debian apt-get package)

if anyone wants to play with the indexed data and doesnt wanna index themself left me know (recklessop@gmail.com or AIM ;) pauljusm

max August 13, 2006 at 8:44 pm

I use postgresql here at http://aol.zanoza.lv. Now building full text index using tsearch2.

Justin August 15, 2006 at 1:52 am

does anyone want a copy of the mysqldump of the indexed aoldata… full 2gig or whatever file… it would save you 6 hours of indexing… i can zip it if anyone wants a copy

ricky September 3, 2006 at 10:16 pm

what the hell is all this 6 hours about it takes less than a hour

Leave a Comment

Previous post:

Next post: