ExtortionLetterInfo Forums

ELI Forums => Getty Images Letter Forum => Topic started by: lucia on December 02, 2011, 02:03:17 PM

Title: Bot trap for image browsing
Post by: lucia on December 02, 2011, 02:03:17 PM
Hi all,

I got an extortion letter early this week. Oddly, I'd been working on banning bots all last month, but I hadn't been worried about images. My issue was CPU and memory, which bots were sucking up like crazy. Images are static, so they don't cause that problem. Needless to say, I now see the need to trap bots that are racing through images. I've been reading various concerns here and had some of my own. These include:

1) Not knowing the current IPs of things like PicScout for certain.
2) PicScout (or others in future) changing IPs.
3) Masking of user agents

etc.

So, I want to come up with a way that a blogger, web site or forum host can identify bots as they crawl and block them. No method is going to be 100% effective, but this morning I've been ginning up an idea based on the bot trap here:

http://danielwebb.us/software/bot-trap/

That bot trap will not catch some of the programmed image-crawling bots I've seen on my blog, because at least some of them aren't going to hit a PHP file on purpose. They are programmed to just crawl through images, leaving PHP files alone. They also don't make mistakes. (I know how to catch bots that make mistakes on a WordPress blog and would know how to do it here at the forum. More on that later.)

My idea for catching what I might name "pure image browsing bots" is to do this:

1) Add directory-specific .htaccess files to the directories I want to keep image bots from browsing. (These would be at least in my image directories. I could put them higher up -- but I need to be sure I know how to avoid screwing up a complicated .htaccess file in that case. Anyway, I really only want to block these guys from images.)
2) Add an image or multiple images that I *never* link on purpose to my site. These can be 1 pixel colored images or anything. For now, call that image 'honeyPotImage.jpg'.
3) In a top-level .htaccess, send any bot trying to load 'honeyPotImage.jpg' (or any of the other never-linked images) to a bot trap written in PHP. This bot trap is somewhat similar to the one above.
4) Add the IP of every bot sent to the trap to the appropriate .htaccess files.
5) After (4), the bots (or whoever gets trapped) will no longer be able to load images in the protected directories, even though they can still load text. Note: because they can load text, human visitors to my blog will be able to tell me that images vanished. This will let me unban them -- taking care to do this in a way that I think will still protect me from bots.
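
A minimal sketch of steps 3-5, written in Python purely for illustration (the trap described above is planned in PHP). The function names are hypothetical, and the "Order/Allow/Deny" lines assume Apache 2.2-style access rules; a server on a different version or configuration would need different directives.

```python
import ipaddress

HONEYPOT_NAME = "honeyPotImage.jpg"  # the never-linked image from step 2


def is_honeypot_request(path):
    """Return True if the requested path points at the honeypot image."""
    return path.rsplit("/", 1)[-1] == HONEYPOT_NAME


def record_hit(banned, ip):
    """Add a validated IP to the banned set; return True if it was new."""
    ip = str(ipaddress.ip_address(ip))  # raises ValueError on junk input
    if ip in banned:
        return False
    banned.add(ip)
    return True


def render_htaccess(banned):
    """Render deny rules (assumed Apache 2.2 syntax) for an image directory."""
    lines = ["Order Allow,Deny", "Allow from all"]
    lines += ["Deny from " + ip for ip in sorted(banned)]
    return "\n".join(lines) + "\n"
```

The trap would call record_hit() for each visitor that requests the honeypot and then rewrite the .htaccess file in each protected image directory with the output of render_htaccess(). Unbanning a human visitor is then just removing their IP from the set and rendering again.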

FWIW: I'll be adding some whitelisted hosts to the tool. My first draft has Google and bingbot whitelisted.
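
One way the whitelist check could look, again as a Python sketch rather than the actual tool. The suffix list and function name are assumptions. The hostname is assumed to come from a reverse DNS lookup on the visitor's IP that has also been confirmed with a forward lookup, because a user agent string alone is easy to fake.

```python
# Assumed hostname suffixes for crawlers that should never be banned.
WHITELISTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def is_whitelisted(hostname):
    """Return True if a verified reverse-DNS hostname belongs to a whitelisted crawler."""
    host = hostname.lower().rstrip(".")  # normalise case and any trailing dot
    return host.endswith(WHITELISTED_SUFFIXES)
```

The leading dot in each suffix matters: it keeps a look-alike domain that merely ends in the same letters from slipping through.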

I'm going to get this working for my blogs. I was wondering if others would be interested in using it once it's working? If yes, I might ask you questions to figure out how to make this user friendly. Also, if people do use it, at some point, we may want to share lists of user agents and IPs we are seeing racing through images. 

This sharing could be automated and  would help us identify any changes in IP ranges or host addresses and help people at sites 2-N ban the creepy bots as soon as they are detected at site 1.
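
A rough sketch of how that sharing might work, in Python. The data format (each site reporting the IPs it trapped) and both function names are hypothetical; nothing like this exists yet.

```python
def merge_reports(reports):
    """Combine per-site reports into a map of IP -> sorted list of sites that trapped it.

    reports: dict mapping a site name to an iterable of trapped IP strings.
    """
    seen = {}
    for site, ips in reports.items():
        for ip in ips:
            seen.setdefault(ip, set()).add(site)
    return {ip: sorted(sites) for ip, sites in seen.items()}


def new_bans_for(site_banned, merged):
    """Return the IPs trapped elsewhere that this site has not banned yet."""
    return sorted(ip for ip in merged if ip not in site_banned)
```

For example, if site 1 traps 203.0.113.7 and site 2 has never seen it, new_bans_for() on site 2's list would return that IP so it can be banned before it shows up there. A site might choose to act only on IPs reported by several sites, to limit the damage from one bad report.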

FWIW: Lots of people at web host forums are complaining about these bots for reasons other than concerns about getting a Getty letter. The bots just race through, suck bandwidth, clutter up server logs and are just a plain old nuisance. Because of the latter, if the system is made convenient, we might be able to get lots of people using it. But first I think I just need to  know if anyone would like to volunteer to try it in a week or two after I have it working. Actually, probably by Wed.
Title: Re: Bot trap for image browsing
Post by: Robert Krausankas (BuddhaPi) on December 02, 2011, 02:32:56 PM
This has been on my list to do as well, I just haven't had the time to try out the bot trap... that being said, I'd be willing to give it a shot when you get it completed... I need to digest this a bit more before I open my trap, but I might have some further ideas/suggestions.
Title: Re: Bot trap for image browsing
Post by: lucia on December 02, 2011, 03:12:26 PM
I saw this bot trap suggested a few times. I got to it quickly because, believe it or not, I was already working on something to bounce bots from WordPress, and I'd already been using some of the ideas behind that bot trap.

You'll see I discussed some fiddling at my blog:
http://rankexploits.com/musings/2011/sorry-bergen-norway/

I even set up a new blog to discuss the fiddling. (Though, very little is discussed at the new one.)
http://rankexploits.com/protect/

But up until this week, I didn't see any big reason to block the bots that do nothing but load images.  I thought they were obnoxious, but they didn't spike memory or cpu.


So, suggest away. The sooner the better. I'm not a programmer -- but I can program. One thing I find is that it helps to have a plan before coding, rather than coding away and then changing to suit a new plan. Plus, I'm perfectly capable of ignoring an idea if I think there is some reason it should be ignored. Also, if it gets too off topic, we can move the conversation to the "new" blog and then just post synopses here.