Mailing List Archive



Re: [tlug] Blocking unknown and unclear bots



Dave M G writes:

 > > So why were people saying these bots were "bad"?
 > Short answer:
 > 
 > Crawling for emails or information to use for spam... maybe?

No, apparently they're just bad because they're on some list.  (See
below.)

 > Here, just by way of example, is a list of bad bots:
 > 
 > http://www.invision-graphics.com/robotstxt_badbots.html

"Mr. Foot, this is Mr. Bullet."  Do you really want to commit DoS on
your clients' users?  From that list:

User-agent: Wget
User-agent: asterias
User-agent: httplib
User-agent: Wget/1.6
User-agent: Wget/1.5.3

wget is the first or second most popular command-line web retrieval
tool (curl being the other).  httplib is Python's generic HTTP client
*library*, and is probably incorporated in a number of innocuous
applications.  asterias is probably based on
http://asterias.bioinfo.cnio.es/, which is a distributed tool for
analyzing DNA, IIUC.

These may very well have been observed to behave as "bad bots" (but
since all bots do the same thing, namely follow every link, I don't
see how you would determine that!).  But either their names are being
spoofed or, in the case of wget, the tool is multi-purpose: it can be
a spider or it can be an ordinary retrieval tool.
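
To illustrate the spoofing point: the User-Agent string is just a
header the client chooses to send.  A minimal sketch in Python (the
helper name build_request and the example.com URL are placeholders of
mine, and nothing is actually sent over the network):

```python
import urllib.request


def build_request(url, user_agent):
    """Build (but do not send) a request carrying an arbitrary User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Any client can claim to be any agent; the server sees only this string.
req = build_request("http://example.com/", "Wget/1.6")
print(req.get_header("User-agent"))
```

So a name appearing in your logs tells you what the client claimed to
be, not what it was.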

If you really want to do this kind of thing, you should decide which
bots you want to let in (I'm sure Google is high on your list, for
example), and then restrict to those user agents and also by domain
and/or IP block (or address if it's consistent).
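
A rough sketch of that allowlist check, in Python.  Everything named
here is an assumption for illustration: the ALLOWED_BOTS table, the
is_allowed_bot function, the "ExampleBot" entry, and especially the
networks, which are RFC 5737 documentation ranges and not any real
crawler's addresses.  You would have to fill in ranges you have
verified yourself.

```python
import ipaddress

# Hypothetical allowlist: user-agent substring -> networks it may come from.
# These are documentation-only ranges, NOT real crawler addresses.
ALLOWED_BOTS = {
    "Googlebot": [ipaddress.ip_network("192.0.2.0/24")],
    "ExampleBot": [ipaddress.ip_network("198.51.100.0/24")],
}


def is_allowed_bot(user_agent, remote_addr):
    """Return True only if the agent is on the list AND the address
    falls inside one of the networks registered for that agent."""
    addr = ipaddress.ip_address(remote_addr)
    for name, networks in ALLOWED_BOTS.items():
        if name.lower() in user_agent.lower():
            return any(addr in net for net in networks)
    return False
```

This only answers "is this a bot I know, coming from where I expect
it?"  A spoofed name from the wrong network fails the check, and what
to do with everything else is still a separate policy decision.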

