Now that you’ve downloaded the AOL data (if not, find a mirror), get the stuff in a database – you’re not going to be able to open these puppies in Excel. Josh uses PostgreSQL, and has posted how he imported the AOL data here. I’m more of a MySQL guy myself, so I plagarized him a bit to do my own import.
First, I got the ten unzipped files into one big ol’ file:
cat user-ct-test-collection-*.txt | grep -v “AnonID” | grep -v “\\\.” > aoldb.txt
Then I popped open the MySQL shell:
mysql -u root -p none-of-your-business
Then, the following:
create database aolsearch;
use aolsearch;
create table aolquery (anonid integer, query varchar(1024), querytime datetime, itemrank integer, clickurl varchar(1024));
load data local infile ‘aoldb.txt’ into table aolquery;
Bingo:
Query OK, 36389532 rows affected, 65535 warnings (7 min 52.56 sec)
Records: 36389532 Deleted: 0 Skipped: 0 Warnings: 16946940
(The warnings are from the records without click-through URLs.)
Thank god for the below:
mysql> select * from aolquery where query = ‘greg yardley’;
Empty set (1 min 40.79 sec)
I’m not going to really get a chance to dig in until tonight, but there’s a lot to learn about user behavior from the data set. For instance, this query tells me there’s 329 unique people in the data set who were looking to refinance their homes:
mysql> select count(distinct anonid) from query where query like ‘%refinance%’;
{ 6 comments… read them below or add one }
Greg, how long does your mysql box take to add an index to a 36M row table?
Since I like being pedantic and didactic, fyi:
mysql -u root -p none-of-your-business
would start up using a database called none-of-your-business. -p’mypass’ is the way to specify the passwd on the cmd line
I didn’t bother creating any indexes – when I get a chance to play more with the data I’ll do that and let you know. I need to index at least the AnonID, I’m thinking.
Got to pick up my MacBook from the shop; it’s way faster than this PowerBook G4 and I’ll be playing with both it and this data tonight.
You get a pass on being a pedant as long as you’re a didact.
Well i indexed the whole data set today (on all 36m rows) i used CREATE INDEX aolstuff ON aoldata2 (anonid, query); and it took 6 hours and 13 minutes…
machine specs are:
dual p3 750
512 meg Ram
debian linux sarge
mysql 4.1.11 (default debian apt-get package)
if anyone wants to play with the indexed data and doesnt wanna index themself left me know (recklessop@gmail.com or AIM
pauljusm
I use postgresql here at http://aol.zanoza.lv. Now building full text index using tsearch2.
does anyone want a copy of the mysqldump of the indexed aoldata… full 2gig or whatever file… it would save you 6 hours of indexing… i can zip it if anyone wants a copy
what the hell is all this 6 hours about it takes less than a hour