06-10-2008, 02:52 PM
What do you think?
Posts: 61
What do you think of this guy's robots.txt file for blocking bad bots? http://www.firewallforums.com/forums/showthread.php?p=8 I guess it will at least block the bots that aren't bad enough to disregard robots.txt.
It is over 2 years old, so I guess some of the bots may have started behaving better and some may have gotten worse.

06-10-2008, 08:10 PM
Re: What do you think?
Posts: 9,465
Name: Steven Bradley
Location: Boulder, Colorado
Seems like a lot of overkill. The bots that will honor robots.txt will probably respond to User-agent: * and listing the robots that won't honor robots.txt is kind of pointless since they're not honoring the file.
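To illustrate the point about User-agent: *, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt content and the /private/ path are made up for the example; the bot names are taken from lists like the one being discussed. It only shows how a rule-following parser treats a wildcard record, and says nothing about bots that never read the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a single wildcard record instead of one record per bot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, path: str) -> bool:
    """Return True if a robots.txt-honoring bot may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # None of these names appear in the file, yet the wildcard record covers them all.
    for agent in ("WebZIP", "Wget", "SomeUnlistedBot"):
        print(agent, allowed(agent, "/private/page.html"), allowed(agent, "/index.html"))
```

Any bot that honors robots.txt falls under the wildcard record, so naming each one separately adds nothing unless the rules differ per bot.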
Last edited by vangogh : 06-12-2008 at 07:45 PM.

06-11-2008, 11:53 AM
Re: What do you think?
Posts: 61
There is another thread http://www.webmaster-talk.com/the-ot...tml#post736801 in which I was trying to get some answers about bots that disregard robots.txt.
In that post I explain why I feel the need to try to ban bots at all.
Ultimately, I went with a 5-year-old list I found on the Internet and added the bot that has caused me concern.
Note: If you can point me to a more up-to-date list of bots to block, I would appreciate it.
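# Block bad bots by user agent (added 061008)
# The rewrite engine must be on for the conditions below to take effect;
# repeating this line is harmless if it is already set earlier in .htaccess.
RewriteEngine On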
RewriteCond %{HTTP_USER_AGENT} ^AnotherBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]
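One detail of the list above that is easy to miss: most patterns start with ^, so they only match when the user agent string begins with that name, and they are case-sensitive. Only the HTTrack and Indy Library lines match anywhere in the string and ignore case ([NC]). The Python sketch below mimics that matching with the re module for a handful of the patterns. It is an approximation for checking patterns offline, not a replacement for testing on the server, and the sample user agent strings are invented.

```python
import re

# (pattern, ignore_case) pairs mirroring a few RewriteCond lines from the list.
# Apache needs "\ " for a space in the pattern; a plain space is enough here.
RULES = [
    (r"^Wget", False),
    (r"^WebZIP", False),
    (r"^Teleport Pro", False),
    (r"HTTrack", True),        # no ^ and [NC]: matches anywhere, any case
    (r"Indy Library", True),   # no ^ and [NC]: matches anywhere, any case
]


def is_blocked(user_agent: str) -> bool:
    """Return True if any rule matches, like a chain of [OR] conditions."""
    for pattern, ignore_case in RULES:
        flags = re.IGNORECASE if ignore_case else 0
        if re.search(pattern, user_agent, flags):
            return True
    return False


if __name__ == "__main__":
    samples = [
        "Wget/1.11.4",
        "wget/1.11.4",
        "Mozilla/5.0 (compatible; Wget)",
        "Mozilla/4.5 (compatible; httrack 3.0x; Windows 98)",
    ]
    for ua in samples:
        print(is_blocked(ua), ua)
```

A bot only has to change the first character of its user agent string to slip past an anchored pattern, which is one reason a list like this catches the careless bots and not the determined ones.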

06-11-2008, 07:09 PM
Re: What do you think?
Posts: 9,465
Name: Steven Bradley
Location: Boulder, Colorado
You can start with the Web Robots Pages. They won't have everything listed. Bots written for nefarious purposes usually don't announce themselves to the world.
I'd still say what you're doing is overkill. It sounds like you want to stop one or a few bots so why block them all? From what you describe in your other post I doubt it was a hacking attempt, but rather a bot looking to scrape your content.

06-11-2008, 09:32 PM
Re: What do you think?
Posts: 3,024
Name: Forrest Croce
Location: Seattle, WA
Quote:
Originally Posted by vangogh
You can start with the Web Robots Pages. They won't have everything listed. Bots written for nefarious purposes usually don't announce themselves to the world.
|
Nefarious bots have all kinds of 'detection avoidance' built in, from anonymizers and user agent cloaking to the speed and timing of their requests. But those are only the ones whose authors sat down and said "hey, this is kind of evil; people won't like it, so I better hide my tracks."
It's getting easier and easier to make a bot. I'd think most of the bots on the web don't know how to interact with a robots.txt file.

06-12-2008, 02:15 AM
Re: What do you think?
Posts: 61
Thanks For Replying.
Quote:
Originally Posted by vangogh
...why block them all?
|
The bad bots list I am using is mostly blocking bots that others have identified as bad. I found a number of lists via Google, but most of them were, at least, several years old. However, I did find the bots being blocked were, many times, the same from list to list. Finding the same bots being blocked in lists from several different sources, for me, was enough to trust that, at least when the lists were written, those were/are bots that mean my site no good at all. Copying and pasting the list into .htaccess was EASY and so, at first, I could not think of a convincing reason not to do just that.
After getting a good number of 500 errors thrown I did realize that one must be very careful with mod_rewrite, but I still think I should just block bots that "good" webmasters have found to be bad IF it is a simple matter to do so.
Quote:
Originally Posted by vangogh
...bot looking to scrape your content.
|
You are probably right. Is "scraping content" just an automated (bot) method of copying my content to use on their own website, etcetera, or is content scraping more complex than that?
Quote:
Originally Posted by ForrestCroce
Nefarious bots have all kinds of 'detection avoidance' built in...
|
I believe you. Although I am looking for a more up-to-date list, I sourced the list I am using from information that is at least a few years old. I am sure current bots, including updated versions of the ones I am blocking, are much BETTER at being bad than they were a few years ago. I just hope they are not so much better as to render my little .htaccess list GOOD for nothing.
wkitty42 (webmasterworld.com/forum92/205.htm) offers what, to me, looks like a more sophisticated blocking "list" that goes beyond the rudimentary, but I will need to read and understand more about wkitty42's methods before I feel comfortable using them. ...AND wkitty42's list is still a 5-year-old list. Further, I am not at all ready to do things like block ia_archiver. I don't even think I should be putting crawl delays in robots.txt for googlebot, msnbot, or others.
Quote:
Originally Posted by ForrestCroce
It's getting easier and easier to make a bot. I'd think most of the bots on the web don't know how to interact with a robots.txt file.
|
Is there any GOOD use that an average webmaster can put his/her own bot to?
Incidentally, I did use a "bot trap" suggestion I stumbled on:
I added Disallow: /email-addresses/ to robots.txt. There is no real folder by that name on the website in question, but, according to the information I learned this from, I should ban bots that "try" to go there anyway.
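For anyone wanting to act on the trap, here is a rough Python sketch that scans access log lines for requests to the trap folder and prints candidate deny lines. It assumes the common/combined Apache log format and the older "Deny from" syntax; the sample log lines and IP addresses are invented (documentation ranges), so adjust the pattern to whatever your host's logs actually look like.

```python
import re

TRAP_PREFIX = "/email-addresses/"  # the decoy path listed under Disallow in robots.txt

# Assumed common/combined log format:
# ip ident user [time] "METHOD path PROTOCOL" status size ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "\S+ (\S+)[^"]*" (\d{3})')


def trapped_ips(lines):
    """Return the sorted, de-duplicated IPs that requested anything under the trap path."""
    ips = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(2).startswith(TRAP_PREFIX):
            ips.add(match.group(1))
    return sorted(ips)


def deny_lines(ips):
    """Format the IPs as 'Deny from' lines for review before pasting into .htaccess."""
    return [f"Deny from {ip}" for ip in ips]


SAMPLE_LOG = [
    '192.0.2.10 - - [12/Jun/2008:18:06:44 -0600] "GET /email-addresses/ HTTP/1.0" 404 209 "-" "ExampleBot"',
    '198.51.100.7 - - [12/Jun/2008:18:07:01 -0600] "GET /index.html HTTP/1.1" 200 11975 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for line in deny_lines(trapped_ips(SAMPLE_LOG)):
        print(line)
```

The output is meant to be reviewed by hand. A curious human who reads robots.txt and types the path into a browser would be caught by the same check.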
Last edited by 052808 : 06-12-2008 at 06:19 PM.

06-12-2008, 07:51 PM
Re: What do you think?
Posts: 9,465
Name: Steven Bradley
Location: Boulder, Colorado
I understand why you're trying to block them all now and .htaccess would be the way to go since those bots probably aren't paying attention to robots.txt.
That's what scraping content is. Not something you want to see, though not quite as nefarious as hacking your site. I usually deal with scrapers by adding links in my content to other content on my site. That way if I get scraped at least I get some links back in the bargain. Low quality links, but still it's something.
There are good uses for bots. Think search engine for one. The average webmaster could use one simply for research. A bot will likely find the information faster than you will and they free up your time for something else. Bots aren't bad by definition. They just are. It's how they're used that's good or bad.

06-12-2008, 10:55 PM
Re: What do you think?
Posts: 61
Here is another suspicious visitor to my site from a RIPE Network Coordination Centre IP address:
Host: 82.161.231.16 /
Http Code: 200 Date: Jun 12 18:06:44 Http Version: HTTP/1.0 Size in Bytes: 11975
Referer: -
Agent: Mozilla/2.0 (compatible; MSIE 3.0B; Windows NT)
I did some Google searching for MSIE 3.0B and found that the results were predominantly website logs that, like mine, had been visited by MSIE 3.0B. One webpage seemed to mention something about crawlers, and I tend to believe this was NOT a human BECAUSE there are image preloads and other things that get accessed when a human enters my site through / using a traditional browser.
The only other time I see a lack of those preloads, etcetera, is when I visit my own site through a text only proxy. I am not the first one to say it, but all I ever seem to see in terms of traffic to my site, and sometimes attempts at my computer's firewall, from RIPE IP addresses is abuse.
Note: I copied and pasted that log entry. Looks like my logs/statistics people do not know how to spell Referer.
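The "no image preloads" observation can be turned into a rough check. The Python sketch below groups requests by IP and flags visitors that never fetched an image, stylesheet, or script. The input is assumed to be (ip, path) pairs already pulled out of a log, the extension list is a guess, and the sample data is invented. As noted above, a text-only proxy looks the same, so treat a hit as a hint and not as proof.

```python
from collections import defaultdict

# Assumed set of "asset" extensions a graphical browser would normally request.
ASSET_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")


def likely_bots(requests):
    """Given (ip, path) pairs, return the sorted IPs that never requested an asset."""
    fetched_asset = defaultdict(bool)
    for ip, path in requests:
        # Ignore any query string and compare extensions case-insensitively.
        clean_path = path.split("?", 1)[0].lower()
        is_asset = clean_path.endswith(ASSET_EXTENSIONS)
        fetched_asset[ip] = fetched_asset[ip] or is_asset
    return sorted(ip for ip, seen in fetched_asset.items() if not seen)


SAMPLE_REQUESTS = [
    ("192.0.2.10", "/"),                      # page only, no assets
    ("198.51.100.7", "/"),
    ("198.51.100.7", "/images/logo.gif"),     # followed up with an image
]

if __name__ == "__main__":
    print(likely_bots(SAMPLE_REQUESTS))
```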
Quote:
Originally Posted by vangogh
...I usually deal with scrapers by adding links in my content to other content on my site.
|
Thanks. I do have links within my content, but I did not know of the "benefit" you point out of course.
Quote:
Originally Posted by vangogh
There are good uses for bots. Think search engine for one. The average webmaster could use one simply for research. A bot will likely find the information faster than you will and they free up your time for something else. Bots aren't bad by definition. They just are. It's how they're used that's good or bad.
|
I did realize search engines, of course. So how do I make a bot? Must a bot be "run" from a web hosting account, OR would I dispatch a bot from my Dell or my Mac?
Last edited by 052808 : 06-12-2008 at 11:05 PM.

06-13-2008, 05:32 PM
Re: What do you think?
Posts: 9,465
Name: Steven Bradley
Location: Boulder, Colorado
You can run a bot from anywhere as long as you have internet access. Bots aren't that much different from a browser. They're just after different information.
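As a rough idea of how small a bot can be, here is a Python sketch that fetches one page using only the standard library, checking robots.txt first. The bot name ExampleResearchBot and the function names are made up for the example, and error handling is left out on purpose.

```python
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleResearchBot/0.1"  # hypothetical name; a real bot should identify itself


def http_get(url: str) -> str:
    """Fetch a URL the way a browser would, just with our own User-Agent header."""
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def polite_fetch(url: str, get=http_get):
    """Return the page body, or None if the site's robots.txt disallows the URL.

    Simplification: a missing robots.txt makes http_get raise an error here;
    a fuller bot would treat that case as "everything allowed".
    """
    parts = urlparse(url)
    robots_url = urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt")
    rules = RobotFileParser()
    rules.parse(get(robots_url).splitlines())
    if not rules.can_fetch(USER_AGENT, url):
        return None
    return get(url)
```

Calling polite_fetch("http://example.com/") from a Dell, a Mac, or a hosting account works the same way; the only requirement is an internet connection. The get parameter is there so the function can be tried against canned pages without touching the network.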
Do you know for a fact that all the robots you list are trying to do your site harm? Robots are normal. Most sites probably get more robotic traffic than human traffic. A robot is not by definition trying to cause your site harm.
Last edited by vangogh : 06-14-2008 at 04:37 PM.

06-13-2008, 10:46 PM
Re: What do you think?
Posts: 61
Quote:
Originally Posted by vangogh
Do you know for a fact that all the robots you list are trying to do your site harm?
|
No.
Quote:
Originally Posted by 052808
The bad bots list I am using is mostly blocking bots that others have identified as bad. I found a number of lists via Google, but most of them were, at least, several years old. However, I did find the bots being blocked were, many times, the same from list to list. Finding the same bots being blocked in lists from several different sources, for me, was enough to trust that, at least when the lists were written, those were/are bots that mean my site no good at all.
|
Quote:
Originally Posted by 052808
...I am not at all ready to do things like block ia_archiver. I don't even think I should be putting crawl delays, in robots.txt, for googlebot, msnbot, or others.
|
Quote:
Originally Posted by 052808
wkitty42 (webmasterworld.com/forum92/205.htm) offers what, to me, looks like a more sophisticated blocking "list" that goes beyond the rudimentary, but I will need to read and understand more about wkitty42's methods before I feel comfortable using them. ...AND wkitty42's list is still a 5-year-old list.
|
......

06-15-2008, 12:54 AM
Re: What do you think?
Posts: 30
Name: Tim
Location: Tennessee
I agree. I doubt most of the bad bots pay much attention to the robots.txt file.