You want some 'bots to come by. In particular, you probably want the Googlebot to come by, because Google runs a real search engine and you do want people who might be readers or customers to find your site.
What you generally want to discourage are 'bots that are:
- sucking utterly excessive amounts of bandwidth, CPU or memory doing entirely unreasonable searches to detect "copyright" violations,
- sucking utterly excessive amounts of bandwidth, CPU or memory so they can build their own database to sell SEO services,
- sucking utterly excessive amounts of bandwidth, CPU or memory because they just don't want to spend money programming a bot that doesn't misbehave, or
- showing any evidence of doing something dangerous, like trying to crack into your server, steal email addresses, or leave malware.
The copyright issue comes up here, but it's really not the main reason to try to control bots. Also, controlling bots can't protect you from a copyright owner suing you if you really are abusing copyright. What it can do is give you limited protection by making it more difficult for something like the PicScout bot to race quickly through your site downloading hundreds of pages at little expense to itself but at significant expense to you. If you selectively block these bots, you can at least make it so that Getty or PicScout have to spend their own human time and effort searching your site.
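To give you an idea of what "selectively block" means in practice, the simplest version is just refusing to serve pages to user agents you don't want, while leaving Googlebot and ordinary browsers alone. Here is a rough sketch in Python; the patterns are made-up examples rather than a vetted blocklist, and a determined scraper can fake its user agent, which is why the IP-based auto-banning I describe below matters more.

```python
# Sketch of user-agent based blocking. The patterns are illustrative only;
# they are not a vetted blocklist, and scrapers can fake their user agents.
import re

BLOCKED_AGENT_PATTERNS = [
    r"picscout",          # example: an image-rights scraper
    r"seo[-_ ]?crawler",  # example: an SEO data harvester
    r"libwww-perl",       # example: a generic script that rarely belongs to a real reader
]

def is_blocked_agent(user_agent):
    """Return True if the request's user agent matches a blocked pattern."""
    ua = (user_agent or "").lower()
    return any(re.search(pattern, ua) for pattern in BLOCKED_AGENT_PATTERNS)

# A request handler would check this before doing any real work, e.g.:
#   if is_blocked_agent(request.headers.get("User-Agent", "")):
#       return 403  # deny the page instead of spending resources rendering it
```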
To tell you how bad it can get: shortly before I got my Getty letter, I noticed my blog was crashing a lot. I don't know if this had anything to do with PicScout, but I could see spikes. After I got my second Getty letter, I noticed that an agent hosted on Bezeqint hammered my blog and caused it to crash. Oddly, they were not served pages. ZBblock was denying them pages, but the bot just kept requesting over and over and over and over and... you get the picture.
My blog has a decent readership with quite a few people who work in IT. I got a great deal of advice, and decided that on shared hosting the solution would be to a) put the site behind Cloudflare and b) write some code to auto-ban IPs that do certain things. The system seems to be working quite well, and I have greatly reduced image scraping, crack attempts, visits from bandwidth-sucking referrer loggers and so on.
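For those who are curious, the auto-ban part is conceptually very simple: count how often each IP hits the site, and if an IP does something no human reader would do (like requesting well over a hundred pages in a minute), stop serving it. The sketch below is a minimal illustration in Python, not my actual code; the window, the threshold and the ban-file path are made-up values you would tune for your own site.

```python
# Minimal sketch of rate-based auto-banning. The thresholds and the ban file
# are illustrative; a real setup would tune these and feed the ban list to the
# web server or firewall.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60          # look at the last minute of requests
MAX_REQUESTS = 120           # more than this per minute looks like a misbehaving bot
BAN_FILE = "banned_ips.txt"  # something the web server can be told to consult

recent_hits = defaultdict(deque)  # ip -> timestamps of its recent requests
banned = set()

def record_hit(ip, now=None):
    """Record one request from `ip`; return True if the IP is (now) banned."""
    if ip in banned:
        return True
    now = time.time() if now is None else now
    hits = recent_hits[ip]
    hits.append(now)
    # Drop timestamps that have fallen out of the window.
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    if len(hits) > MAX_REQUESTS:
        banned.add(ip)
        with open(BAN_FILE, "a") as f:
            f.write(ip + "\n")
        return True
    return False
```

The useful part is that once an IP lands in the ban list, the web server or firewall can refuse it outright instead of letting it chew through your blog software on every single request.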
I have no illusion that someone looking for copyright violations can't visit my site, but it is now somewhat more difficult for them to race through quickly using automated tools. This means it is more expensive for something like PicScout to run a scraper.
But returning to the issue with Google: I don't block Google. I let it visit, cache, and surf whatever it likes. It's a mostly well-behaved bot, and Google has actually created a service that benefits me. I'm willing to let it use all the bandwidth it uses.
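One caveat with rate-based banning, at least as I sketched it above, is making sure the real Googlebot never gets caught in it. Google publishes a way to verify its crawler: reverse-DNS the IP, check the hostname is under googlebot.com or google.com, then forward-confirm it. Here is a rough sketch of that check, again illustrative rather than my actual code:

```python
# Sketch of verifying that an IP really is Googlebot before whitelisting it,
# along the lines of Google's published reverse-DNS / forward-confirm advice.
import socket

def is_real_googlebot(ip):
    """Reverse-DNS the IP, check the domain, then forward-confirm it."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not (hostname.endswith(".googlebot.com") or hostname.endswith(".google.com")):
        return False
    try:
        # Forward-confirm: the hostname must resolve back to the same IP.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
```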
For what it's worth: if PicScout's method of looking for images ends up being to write a tool that does one bajillion searches at Google and afterwards visits the blog posts where they think a violation exists, that's fine with me. If they do it that way, they'll mostly be sucking up Google's bandwidth, not mine. Also, their bot will have to spend much more processing time to search. That's money from their pockets, not mine.
I have no objection to copyright owners trying to identify copyright infringers and enforce their rights. What I object to is services that operate in a way that deprives my blog readers of access, because my blog periodically crashes whenever the image-scraping service decides it wants to search, and that shift much of the cost of their searching onto me by increasing the resources I need to rent to keep my blog running. If they want to make money by searching for violations, that's fine. But I don't want to donate my server resources to their commercial enterprise.