Crawling websites that DONT exist
Old 10-15-2005, 01:21 PM Crawling websites that DONT exist
camperjohn's Avatar
Ultra Talker

Posts: 268
Location: San Diego
Today's Final Jeopardy Question is:

How would you go about finding domain names / websites that DON'T exist?

For example,

www.ftt.com
www.your-domain-name-here.com

These domain names are registered, but there is nothing on them. Does anyone have ideas on how to find a list of websites that do not have anything on them?

Here are my initial thoughts:

- Somehow find a global list of all domain names (!)

- Download index.html / the root page

- If the header says "permanently moved", treat the site as nonexistent

- If there is no index.html / root page, query the Whois database to see whether the name is registered (this tests the original list); if it is not registered, abort and move on to the next name

- If a 404 is received, the request times out, or there is no index.html / root page, treat the site as nonexistent

- If the root page exists, scan it for "under construction" or "coming soon" (make a list of things people put on pages that are under construction)

- If the root page exists, test whether it is smaller than 1000 bytes (JavaScript redirects?)
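
The checks in the list above could be strung together roughly like this. It is only a sketch in Python, not a finished tool: the function name, the labels it returns, and the `is_registered` callback (standing in for a real Whois lookup) are all made up for illustration, and the download step is assumed to have happened already.

```python
from typing import Callable, Optional

# Assumed starter list; extend it with whatever placeholder pages tend to say.
PLACEHOLDER_PHRASES = ("under construction", "coming soon")
MIN_BYTES = 1000  # size threshold taken from the list above


def classify_domain(status: Optional[int], body: bytes,
                    is_registered: Callable[[], bool]) -> str:
    """Label a domain from the result of fetching its root page.

    status is the HTTP status code, or None if the request timed out
    or failed. is_registered is a hypothetical Whois check, called only
    when there is no usable root page.
    """
    if status == 301:
        return "moved"  # "permanently moved" header
    if status is None or status == 404:
        # No root page: confirm the name is registered before counting it.
        if not is_registered():
            return "unregistered"
        return "no-site"
    text = body.decode("utf-8", errors="replace").lower()
    if any(phrase in text for phrase in PLACEHOLDER_PHRASES):
        return "placeholder"
    if len(body) < MIN_BYTES:
        return "tiny-page"  # possibly just a JavaScript redirect
    return "in-use"
```

For example, `classify_domain(200, b"<h1>Coming Soon</h1>", lambda: True)` comes back as "placeholder". Keeping the Whois lookup as a callback means it is only run for names that actually need it.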

Those are my thoughts. Unfortunately, I don't know how to get a list of websites in the first place. I can't crawl Google because... those websites don't have anything on them and nobody will link to them!

Naturally I won't get ALL domain names because there are 85 billion of them, but it would help if there were a way to grab 1000 at a time or so: all names that match www.aaa*.com, then www.aab*.com, and so on down the line.
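
Walking the name space in order and cutting it into chunks of 1000 might look something like the Python sketch below. Note the limits: it only generates the fixed-length names themselves (aaa.com, aab.com, ...), it cannot expand the `*` part without a real list of registered names, and both function names are invented for this example.

```python
import itertools
import string


def candidate_names(prefix_len=3, tld="com", alphabet=string.ascii_lowercase):
    """Yield aaa.com, aab.com, ... in alphabetical order."""
    for letters in itertools.product(alphabet, repeat=prefix_len):
        yield "".join(letters) + "." + tld


def batches(names, size=1000):
    """Group any iterable of names into lists of at most `size` items."""
    it = iter(names)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk
```

Each batch from `batches(candidate_names())` can then be handed to whatever does the checking, so a run can be stopped and resumed at a batch boundary.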

My goal is to find unused domain names, or domain names that don't really have anything on them.

Any ideas?

John at mccarthy dot net

Last edited by camperjohn : 10-15-2005 at 01:25 PM.
 
Old 10-18-2005, 12:02 PM Determining what's out there...
danlefree's Avatar
Skilled Talker

Posts: 61
I think you'd be best advised to build your list sequentially - after all, there are several ways in which a domain name may not "exist".

Start by building an algorithm for domain name discovery or a dictionary of domain names you'd like to explore.

Note: Domain name purchasing services like Yahoo! Domains do not require a login (or a completed purchase!) and, if you enter a domain name that is already taken, will display a nice list of domain names that are available for sale.

1) Ping the domain - if there's no response to a ping, chances are the domain name is *not* being used.

2) If there is a response to the ping, use PHP's cURL or fgets() to grab the domain's homepage - if the returned file is less than 1 or 2 KB, there's a high likelihood the domain belongs to someone but really isn't being used for anything.

3) Check the headers on that reply - as you noted, you'll want to keep track of domains that return a 300-series code (and you might also want to check where they point to - more domains to cross off the list).
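
For anyone not using PHP, the three steps above might be sketched in Python along these lines. Two things are swapped in and should be read as assumptions: a DNS lookup stands in for the ping (a real ICMP ping usually needs raw sockets and extra privileges), and the 2 KB cut-off and all function names are just illustrative choices.

```python
import http.client
import socket


def resolves(host):
    """Stand-in for step 1: does the name resolve to an address at all?"""
    try:
        socket.gethostbyname(host)
        return True
    except OSError:
        return False


def fetch_root(host, port=80, timeout=10):
    """Step 2: GET the homepage. http.client does not follow redirects,
    so a 300-series status and its Location header stay visible."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/", headers={"User-Agent": "probe-sketch"})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location"), resp.read()
    finally:
        conn.close()


def describe(status, location, body, small=2048):
    """Step 3: summarise the reply (small = assumed 2 KB threshold)."""
    if 300 <= status < 400:
        return "redirects to " + (location or "unknown")
    if len(body) < small:
        return "small page"
    return "has content"
```

A typical call would be `describe(*fetch_root(name))` after `resolves(name)` returns True. `fetch_root` raises on timeouts and refused connections, so a real run would wrap it in a try/except and record those names as well.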

Sounds like an interesting project.
__________________
DAN LEFREE CONSULTING || HADEAN LLC

Last edited by danlefree : 10-18-2005 at 12:04 PM.
 
Old 10-18-2005, 12:58 PM
ExpressoDan's Avatar
Ultra Talker

Posts: 317
Name: This Space for Rent
Location: Georgia
Quote:
Originally Posted by danlefree
2) If there is a response to the ping, use PHP's cURL or fgets() to grab the domain's homepage -
What are cURL and fgets()? Are there any tools that will "grab the domain's homepage" for languages other than PHP?

Dan
 
Old 10-18-2005, 03:25 PM
camperjohn's Avatar
Ultra Talker

Posts: 268
Location: San Diego
Here is what I have so far:

I started with a list of "recently registered domain names" for 2003, 2004 and 2005, broken down by month and then by day:

http://www.namestead.com/new%2Ddomains%2Dlists/

There are about 800,000 domain names on this list going back 3 years. First I grabbed all those domain names into text files.

Then:
- I run a Whois query to find out if they are still registered
- I ping them to find out if they exist
- I download the page and check for the above items, including:
> status 404
> status 403
> status 300
> "under"
> "construction"
> "parked"
> "authorization required"
> "go away"
> page less than 1 KB in size
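
A rough Python version of the page check, reporting every reason a page was flagged, could look like the sketch below. Treating "300" as any 300-series status and matching keywords only as whole words (so "under" does not fire on "thunder") are assumptions added here, as are the function and constant names.

```python
import re

# Keywords and size limit copied from the list above.
KEYWORDS = ("under", "construction", "parked",
            "authorization required", "go away")
SMALL_PAGE_BYTES = 1024


def useless_reasons(status, body):
    """Return a list of reasons the page looks unused (empty if none)."""
    reasons = []
    # Assumption: "300" in the list means any 300-series redirect.
    if status in (403, 404) or 300 <= status < 400:
        reasons.append(f"status {status}")
    text = body.decode("utf-8", errors="replace").lower()
    for kw in KEYWORDS:
        # Whole-word match to cut down on false hits.
        if re.search(r"\b" + re.escape(kw) + r"\b", text):
            reasons.append(f"keyword {kw!r}")
    if len(body) < SMALL_PAGE_BYTES:
        reasons.append("small page")
    return reasons
```

Returning all the reasons rather than a yes/no makes it easier to see later which checks produce the false positives.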

So far it's working well, finding all kinds of "useless" websites.

JM
 
Old 10-18-2005, 03:27 PM
camperjohn's Avatar
Ultra Talker

Posts: 268
Location: San Diego
Quote:
Originally Posted by ExpressoDan
What are cURL and fgets()? Are there any tools that will "grab the domain's homepage" for languages other than PHP?

Dan
Borland Builder has the NMHTTP1 component, which lets you look at the headers, cookies and body of an HTML page. It is really easy to use: you can write a program in 5 minutes to crawl the internet, download pictures, and so on.

Microsoft MSVC++ has Winsock, which I use to download pages directly. I don't use this as much.

JM