Any lessons from the AOL data?
08-11-2006, 04:26 AM
|
|
Posts: 118
|
What I find interesting is the keywords used to find particular sites. Have a look at this URL; it makes use of the AOL blunder:
http://www.askthebrain.com/aol/
|
|
|
|
08-11-2006, 11:32 AM
|
|
Posts: 4
|
Quote:
Originally Posted by august
my queries seem to be taking 1 to 3 seconds max, in over 15 million entries
|
Can you please give some info on how you managed to index this dataset?
What sort of queries are you running that are flying along so quickly?
Are you running COUNT queries?
Thanks in advance for any help
Rgds
RC
|
|
|
|
08-11-2006, 04:11 PM
|
|
Posts: 440
|
I forget where it was (seochat maybe), but I read a case study where the number one result dropped to number two for a few days. They went from 30,000 hits/day to 12,000.
The AOL data is pretty interesting. I'd love to see a web interface to mine it.
|
|
|
|
08-11-2006, 05:11 PM
|
|
Posts: 240
|
Queries on about 3.5million rows for me are taking about 45-60 seconds!
But then I just shunted it into MS Access without any indexes. Maybe I'll import it into the MySQL server I've got running locally and see if it's any better.
|
|
|
|
08-11-2006, 09:45 PM
|
|
Posts: 117
|
How are you guys loading these files?
I'm just trying to load 1/10 size pieces and I'm having trouble. (on a 900MB RAM machine)
Notepad and Wordpad can't handle them.
OpenOffice takes about 10 minutes and then locks up when the progress bar is full. (I'd have to use Base, since Calc only loads the first 65K lines and there are ~2M in just one of the pieces)
DOS Type and Find work, but I was looking for something a little more powerful.
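For anyone stuck at the same point: the files do not have to fit in an editor, or in RAM, to get numbers out of them. A minimal sketch that reads one line at a time, assuming tab-separated files with a header line and the column order AnonID, Query, QueryTime, ItemRank, ClickURL, and assuming that rows for searches without a click leave ItemRank empty or missing. The function name and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter

def count_clicks_by_rank(lines):
    """Count click records per ItemRank, reading one line at a time."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header line
    counts = Counter()
    for row in reader:
        # Assumed layout: AnonID, Query, QueryTime, ItemRank, ClickURL.
        # Rows without a click are assumed to have no ItemRank value.
        if len(row) >= 4 and row[3].strip():
            counts[int(row[3])] += 1
    return counts

# Invented sample; a real run would pass an open file object instead.
sample = io.StringIO(
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "1\tfoo\t2006-03-01 00:00:00\t1\thttp://example.com\n"
    "1\tbar\t2006-03-01 00:01:00\n"
    "2\tbaz\t2006-03-02 10:00:00\t2\thttp://example.org\n"
    "3\tfoo\t2006-03-03 12:00:00\t1\thttp://example.com\n"
)
counts = count_clicks_by_rank(sample)
print(sorted(counts.items()))
```

Because only the running totals are kept, memory use stays flat however large the file is.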
|
|
|
|
08-11-2006, 10:03 PM
|
|
Posts: 1
|
Anyone care to post data beyond the first 10 results? I'd be curious to see how #11 compares, seeing as how it's at the top of the second page.
Does it beat out spots #9 and #10?
|
|
|
|
08-12-2006, 04:47 AM
|
|
Posts: 328
|
Quote:
Originally Posted by gopher292
How are you guys loading these files?
I'm just trying to load 1/10 size pieces and I'm having trouble. (on a 900MB RAM machine)
Notepad and Wordpad can't handle them.
OpenOffice takes about 10 minutes and then locks up when the progress bar is full. (I'd have to use Base, since Calc only loads the first 65K lines and there are ~2M in just one of the pieces)
DOS Type and Find work, but I was looking for something a little more powerful.
|
Whaaa. What would you do with the data once you have it in an editor? You can't get any figures out of that; you really need to load it into a database.
I tried Access 2000 first, but that's really slow and I could only import 4 out of 10 files.
Now I have all 10 in MySQL 5 and that works quite OK.
|
|
|
|
08-12-2006, 04:48 AM
|
|
Posts: 328
|
Quote:
Originally Posted by NuiLoa
Anyone care to post data beyond the first 10 results? I'd be curious to see how #11 compares, seeing as how it's at the top of the second page.
Does it beat out spots #9 and #10?
|
No, it doesn't, not by a long shot. People just don't click through to the 2nd page.
Here, this is still from the sample I used previously:
Code:
Rank  Clicks   Share
1     3275637  42.25%
2     925507   11.94%
3     656290   8.47%
4     468703   6.05%
5     377877   4.87%
6     309669   3.99%
7     262140   3.38%
8     230981   2.98%
9     229909   2.97%
10    218748   2.82%
11    50615    0.65%
12    43030    0.56%
13    40246    0.52%
14    37455    0.48%
15    36177    0.47%
16    29800    0.38%
17    27769    0.36%
18    26025    0.34%
19    25738    0.33%
20    24573    0.32%
21    22868    0.29%
22    21961    0.28%
23    21738    0.28%
24    21018    0.27%
25    20935    0.27%
|
|
|
|
08-12-2006, 07:13 AM
|
|
Posts: 1
|
Quote:
Originally Posted by NuiLoa
Anyone care to post data beyond the first 10 results? I'd be curious to see how #11 compares, seeing as how it's at the top of the second page.
Does it beat out spots #9 and #10?
|
#11 does not beat out spots #9 and #10; it turns out page 1 is the almighty god of search results. However, what you suspected can be observed on later pages. I don't know if that's useful though, because all pages after the first and second get almost no hits in comparison to the first page.
I have posted some figures on my web page at u500k.erinye.com including clicks by rank for all 500 ranks in the data set. There are other interesting observations to be made, too, if you're willing to look at more than what's SEO relevant.
|
|
|
|
08-12-2006, 09:19 AM
|
|
Posts: 23
|
|
|
|
|
08-12-2006, 10:42 AM
|
|
Posts: 27
|
I'd like to see the top click getters along with their Alexa, PR, backlink count from Google/Yahoo, and total pages indexed from Google/Yahoo.
I know that's a lot to ask!
With proper indexing and temp tables, these queries should not exceed a second in execution. Get rid of the "www." from the URLs in the db.
Ty
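One way the "www." clean-up could look, as a rough sketch only. It assumes the click URLs are stored as full http:// URLs, and the function name strip_www is invented for illustration. Folding both forms of a host into one value means a single index entry (and a single GROUP BY bucket) per site.

```python
from urllib.parse import urlsplit

def strip_www(url):
    """Return the URL with a leading 'www.' removed from its host name."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    # SplitResult is a named tuple, so _replace returns a modified copy.
    return parts._replace(netloc=host).geturl()

cleaned = [strip_www(u) for u in (
    "http://www.example.com",
    "http://example.com",
    "http://WWW.example.org/page?id=1",
)]
print(cleaned)
```

In practice this would be applied once while loading, before the index on the URL column is built.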
|
|
|
|
08-12-2006, 12:41 PM
|
|
Posts: 4
|
Code:
Rank   # Clickthroughs   %         Delta #n-1   Delta #1
Total  19434540          100%
1      8220278           42.30%    n/a          n/a
2      2316738           11.92%    -71.82%      -71.82%
3      1640751           8.44%     -29.18%      -80.04%
4      1171642           6.03%     -28.59%      -85.75%
5      943667            4.86%     -19.46%      -88.52%
6      774718            3.99%     -17.90%      -90.58%
7      655914            3.37%     -15.34%      -92.02%
8      579206            2.98%     -11.69%      -92.95%
9      549196            2.83%     -5.18%       -93.32%
10     577325            2.97%     5.12%        -92.98%
11     127688            0.66%     -77.88%      -98.45%
12     108555            0.56%     -14.98%      -98.68%
13     101802            0.52%     -6.22%       -98.76%
14     94221             0.48%     -7.45%       -98.85%
15     91020             0.47%     -3.40%       -98.89%
16     75006             0.39%     -17.59%      -99.09%
17     70054             0.36%     -6.60%       -99.15%
18     65832             0.34%     -6.03%       -99.20%
19     62141             0.32%     -5.61%       -99.24%
20     58384             0.30%     -6.05%       -99.29%
21     55471             0.29%     -4.99%       -99.33%
31     23041             0.12%     -58.46%      -99.72%
41     14024             0.07%     -39.13%      -99.83%
Hope that displays properly.
I have some charts and the above table over on redcardinal.ie/seo/12-08-2006/clickthrough-analysis-of-aol-datatgz/. (I can't post URLs because I'm too new.)
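The two delta columns can be reproduced with a few lines. A minimal sketch, assuming the per-rank click counts are already in a dict; the function name is made up for illustration, and the inputs are the first three rows and the total from the table above. "Delta #n-1" is the change against the previous listed rank, "Delta #1" the change against rank 1.

```python
def clickthrough_deltas(clicks_by_rank, total_clicks):
    """Return (rank, clicks, share %, delta vs previous %, delta vs rank 1 %)."""
    rows = []
    first = prev = None
    for rank in sorted(clicks_by_rank):
        clicks = clicks_by_rank[rank]
        share = 100.0 * clicks / total_clicks
        if first is None:
            # The top rank has nothing to be compared against.
            first = clicks
            rows.append((rank, clicks, share, None, None))
        else:
            delta_prev = 100.0 * (clicks / prev - 1)
            delta_first = 100.0 * (clicks / first - 1)
            rows.append((rank, clicks, share, delta_prev, delta_first))
        prev = clicks
    return rows

rows = clickthrough_deltas({1: 8220278, 2: 2316738, 3: 1640751}, 19434540)
for rank, clicks, share, d_prev, d_first in rows:
    print(rank, clicks, f"{share:.2f}%", d_prev, d_first)
```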
For anyone trying to load the files:
1. Create the table without indices:
create database aol;
use aol;
create table searches (
id int not null primary key auto_increment,
anonid int,
query varchar(200),
querytime timestamp,
itemrank int,
clickurl varchar(200)
);
2. Use the MySQL LOAD DATA command:
LOAD DATA INFILE '/path/AOL-user-ct-collection/user-ct-test-collection-[file number].txt' INTO TABLE searches IGNORE 1 LINES (anonid, query, querytime, itemrank, clickurl);
I created a small PHP script and the data loaded in a couple of minutes.
Indexing will take hours on the full dataset (36m rows). Indexing itemrank doesn't take too long; query and clickurl are going to take a while though.
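To try the load, index, count sequence on a toy scale without a MySQL server, here is a rough sketch using Python's built-in sqlite3 module. SQLite only stands in for MySQL here, so the column types differ from the table definition above; the sample rows are invented, the index name idx_itemrank is hypothetical, and storing searches without a click as NULL itemrank is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE searches (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           anonid INTEGER,
           query TEXT,
           querytime TEXT,
           itemrank INTEGER,
           clickurl TEXT
       )"""
)

# Invented rows; None marks a search that produced no click.
sample_rows = [
    (1, "foo", "2006-03-01 00:00:00", 1, "http://example.com"),
    (1, "bar", "2006-03-01 00:01:00", None, None),
    (2, "baz", "2006-03-02 10:00:00", 2, "http://example.org"),
    (3, "foo", "2006-03-03 12:00:00", 1, "http://example.com"),
]
conn.executemany(
    "INSERT INTO searches (anonid, query, querytime, itemrank, clickurl) "
    "VALUES (?, ?, ?, ?, ?)",
    sample_rows,
)

# Build the index after the bulk load, as in the steps above.
conn.execute("CREATE INDEX idx_itemrank ON searches (itemrank)")

clicks_by_rank = conn.execute(
    "SELECT itemrank, COUNT(*) FROM searches "
    "WHERE itemrank IS NOT NULL "
    "GROUP BY itemrank ORDER BY itemrank"
).fetchall()
print(clicks_by_rank)
```

Creating the index after the inserts avoids updating it row by row during the load, which is the same reason for creating the table without indices first.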
|
|
|
|
08-12-2006, 12:44 PM
|
|
Posts: 4
|
Quote:
Originally Posted by tkroll
I'd like to see the top click getters along with their Alexa, PR, backlink count from Google/Yahoo, and total pages indexed from Google/Yahoo.
With proper indexing and temp tables, these queries should not exceed a second in execution.
Ty
|
Hi
Just noticed your post above mine. Can you give us some SQL to achieve this, please?
It's taken me a couple of days just to get to what I've posted below and I too would like to be able to interrogate the data a bit more than I have to date.
Thanks
|
|
|
|
08-13-2006, 03:56 AM
|
|
Posts: 64
|
Does anyone have a version of this in bite sized chunks that can be opened with excel?
Thanks,
Dave.
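A possible way to cut the files down yourself, sketched under a few assumptions: each file starts with a header line, and a spreadsheet that stops at about 65K rows (as mentioned for Calc earlier in the thread) needs chunks below that size. The function name is invented for illustration, and the tiny list stands in for an open file.

```python
from itertools import islice

def split_with_header(lines, max_data_rows):
    """Yield chunks of at most max_data_rows lines, each led by the header."""
    it = iter(lines)
    header = next(it, None)
    if header is None:
        return  # empty input, nothing to split
    while True:
        chunk = list(islice(it, max_data_rows))
        if not chunk:
            break
        yield [header] + chunk

# Tiny stand-in for a real file; with the real data something like
# max_data_rows=65000 would keep each chunk under the row limit.
demo = ["header", "row1", "row2", "row3"]
chunks = list(split_with_header(demo, 2))
print(chunks)
```

Each chunk can then be written to its own file and opened separately; only one chunk is held in memory at a time.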
|
|
|
|
08-13-2006, 04:58 AM
|
|
Posts: 12
|
Quote:
Originally Posted by gopher292
How are you guys loading these files?
I'm just trying to load 1/10 size pieces and I'm having trouble. (on a 900MB RAM machine)
Notepad and Wordpad can't handle them.
OpenOffice takes about 10 minutes and then locks up when the progress bar is full. (I'd have to use Base, since Calc only loads the first 65K lines and there are ~2M in just one of the pieces)
DOS Type and Find work, but I was looking for something a little more powerful.
|
I kept them divided into 10 files, then used a MySQL management tool to convert the tab files into SQL INSERT INTO statements. Then I zipped it up, put it on the server, unzipped it there, and imported it into MySQL.
Quote:
Originally Posted by mattd
Queries on about 3.5million rows for me are taking about 45-60 seconds!
But then I just shunted it into MS Access without any indexes. Maybe I'll import it into the MySQL server I've got running locally and see if it's any better.
|
I made indexes on all of the fields to make sure it runs quickly. The longest it has taken is about 20 seconds, for the most complex queries.
|
|
|
|
08-13-2006, 04:58 AM
|
|
Posts: 12
|
Quote:
Originally Posted by Hawaii SEO
Does anyone have a version of this in bite sized chunks that can be opened with excel?
Thanks,
Dave.
|
How small are you talking? What would you be able to deduce from such a small sample?
|
|
|
|
08-13-2006, 11:16 AM
|
|
Posts: 27
|
|
|
|
|
08-13-2006, 01:19 PM
|
|
Posts: 245
|
Quote:
Originally Posted by KevinJB
|
Thanks, but is there anywhere I can get, say, 1m rows for Excel or Access?
Also, about those URL searches: remember that AOL search will be the error page for AOL users, and maybe it's like the MSN error page.
|
|
|
|
08-13-2006, 01:23 PM
|
|
Posts: 1
|
portal, what query did you use to filter out the first 7,752,953 clicks? And did you mean clicks or records? You do realize that not all searches resulted in clicks, right?
-Michael
|
|
|
|
08-13-2006, 03:35 PM
|
|
Posts: 328
|
Quote:
Originally Posted by mvandemar
portal, what query did you use to filter out the first 7,752,953 clicks? And did you mean clicks or records? You do realize that not all searches resulted in clicks, right?
-Michael
|
I didn't use any query;
I simply loaded only the first 4 txt files into the database. Since I tried it in Access first, the database reached its maximum size after 7,752,953 clicks (14.4 million queries or so).
So yes, I know not every search resulted in a click. My sample was 7,752,953 clicks and over 14 million searches (records).
Right now I'm running some queries in MySQL on the whole dataset...
|
|
|
|
|