Hi all,
I got an extortion letter early this week. Oddly, I'd been working on banning bots all last month, but I hadn't been worried about images. My issue was CPU and memory, which bots were sucking up like crazy. Images are static, so they don't cause that problem. Needless to say, I now see the need to trap bots that are racing through images. I've been reading various concerns and had some of my own. They include:
1) Not knowing PicScout's current IPs (or those of similar bots) for certain.
2) PicScout (or others in the future) changing IPs.
3) Masked user agents.
etc.
So, I want to come up with a way that a blogger, web site, or forum host can use to identify bots as they crawl and block them. Nothing is going to be 100% effective, but this morning I've been ginning up an idea based on the bot trap here:
http://danielwebb.us/software/bot-trap/
That bot trap won't catch some of the programmed image-crawling bots I've seen crawling my blog, because at least some of them aren't going to hit a PHP file on purpose. They're programmed to crawl through images and leave PHP files alone. They also don't make mistakes. (I know how to catch bots that make mistakes on a WordPress blog and would know how to do it here at the forum. More on that later.)
My idea for catching what I might name "pure image browsing bots" is to do this:
1) Add directory-specific .htaccess files to the directories I want to protect from image bots. (These would be at least in my image directories. I could put them higher up-- but I'd need to be sure I know how to avoid screwing up a complicated .htaccess file in that case. Anyway, I really only want to block these guys from images.)
2) Add an image (or several) that I *never* link to on purpose from my site. These can be 1-pixel colored images or anything. For now, call that image 'honeyPotImage.jpg'.
3) In a top-level .htaccess, send anything requesting 'honeyPotImage.jpg' (or the other unlinked images) to a bot trap written in PHP. This bot trap is somewhat similar to the one linked above.
4) Add the IP of every bot sent to the trap to the appropriate .htaccess files.
5) After (4), the bots (or whoever gets trapped) will no longer be able to load images in the protected directories, even though they can still load text. Note: because they can load text, human visitors to my blog who get trapped by mistake will be able to tell me that images vanished. This lets me unban them-- taking care to do it in a way that I think will still protect me from bots.
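To make the steps above concrete, here's a rough sketch of what the two .htaccess pieces might look like. This is an assumption about the eventual implementation, not the finished tool; 'trap.php' and the 203.0.113.7 address are placeholders I've made up for illustration ('honeyPotImage.jpg' is the honeypot name from step 2).

```apache
# --- Top-level .htaccess ---
# Anything requesting the unlinked honeypot image gets handed to
# the PHP trap, which logs the request and bans the IP (step 3).
RewriteEngine On
RewriteRule ^honeyPotImage\.jpg$ /trap.php [L]

# --- images/.htaccess ---
# Deny trapped IPs access to images only (steps 1 and 4).
# The trap script would append one "Deny from" line per caught IP.
<FilesMatch "\.(jpe?g|png|gif)$">
  Order Allow,Deny
  Allow from all
  Deny from 203.0.113.7
</FilesMatch>
```

Because the deny rules live only in the image directories, a trapped visitor can still load the text of the site, which is what makes step (5) work.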
FWIW: I'll be adding some whitelisted hosts to the tool. My first draft has Googlebot and Bingbot whitelisted.
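Since user agents can be masked (concern 3), whitelisting by user agent alone is risky: an image bot could simply claim to be Googlebot. One way to hedge is a double DNS check: reverse-resolve the visitor's IP, check the hostname against the whitelist, then forward-resolve that hostname to confirm it really maps back to the same IP. A sketch in Python (the suffix list is my guess at what the whitelist might contain, and the resolver arguments exist so it can be tested without network access):

```python
import socket

# Assumed whitelist: domain suffixes used by the crawlers we never
# want to trap. Adjust to taste.
WHITELIST_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def is_whitelisted_crawler(ip, reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """Return True only if `ip` reverse-resolves to a whitelisted
    hostname AND that hostname forward-resolves back to `ip`,
    so a spoofed PTR record can't fool us."""
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False
    if not hostname.endswith(WHITELIST_SUFFIXES):
        return False
    try:
        _, _, addrs = forward(hostname)
    except OSError:
        return False
    return ip in addrs
```

The trap script could run this check before banning, so a spoofed user agent doesn't earn a pass and a real Googlebot doesn't end up in the deny list.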
I'm going to get this working for my blogs. Would others be interested in using it once it's working? If yes, I might ask you questions to figure out how to make it user-friendly. Also, if people do use it, at some point we may want to share lists of user agents and IPs we're seeing racing through images.
This sharing could be automated. It would help us identify changes in IP ranges or host addresses, and help people at sites 2-N ban the creepy bots as soon as they're detected at site 1.
FWIW: Lots of people at web host forums are complaining about these bots for reasons other than worrying about getting a Getty letter. The bots just race through, suck bandwidth, clutter up server logs, and are a plain old nuisance. Because of that, if the system is made convenient, we might be able to get lots of people using it. But first I just need to know if anyone would like to volunteer to try it in a week or two, after I have it working. Actually, probably by Wed.