Getty Images Letter Forum / Re: Picscout / DMCA question
« on: December 13, 2011, 03:15:00 PM »
OK, but now you might wonder, as a technical nuts-and-bolts matter: can anything lie about the user agent?
Easily! I could very easily program Firefox to send the user agent "Googlebot/2.1 ( http://www.googlebot.com/bot.html)", or I could program it to tell the server I am visiting using "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)". Doing so is called "spoofing".
Spoofing can be legitimate. For example: I have spoofed the user agent and hit my own site to see whether I have correctly programmed .htaccess to block the baidubot. After testing, I change my user agent back to the default. Should I ever mistakenly crawl as the baidubot, I'll probably find myself blocked all over the place!
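The same kind of self-test can be run from a script instead of a browser. What follows is only a minimal sketch in Python using the standard urllib module; the function names are made up for illustration, and the user agent string is the Baiduspider one quoted above.

```python
import urllib.error
import urllib.request

# User agent string quoted earlier in this post.
BAIDU_UA = ("Mozilla/5.0 (compatible; Baiduspider/2.0; "
            "+http://www.baidu.com/search/spider.html)")


def build_spoofed_request(url, user_agent):
    """Return a request that announces itself with the given user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(request):
    """Send the request and return the HTTP status code, error codes included."""
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the code is what we want to see here.
        return err.code
```

Pointed at your own site, something like fetch_status(build_spoofed_request("http://example.com/blog/", BAIDU_UA)) (the URL is a placeholder) should come back as 403 if the user agent block is working, and 200 if it is not.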
Next question: Do things spoof? Oh Yes! I could show examples of obvious spoofing, but I'm going to show one of suspected spoofing instead. Let me return to the example of something I saw in my server logs. This time I'm going to highlight something called the IP in bold:
180.175.7.236 - - [12/Dec/2011:01:17:22 -0800] "GET /blog/name_of_page/ HTTP/1.1" 403 521 "-" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.803.0 Safari/535.1"
The IP address is related to the machine through which the connection occurred. It tells me more about "who" might have connected than the useragent. In fact, I can look up 180.175.7.236 here:
http://whois.domaintools.com/180.175.7.236
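What this tells me is that whatever this is connects through "China Chinanet Shanghai Province Network". As far as I can tell, everything with IPs starting with 180. comes through "China Chinanet Shanghai Province Network", although strictly the lookup only vouches for the block containing this one address.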
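If you want to pull the IP and the user agent out of log entries in bulk rather than by eye, a short script can do it. This is a sketch that assumes the log uses the layout of the entry shown above (Apache's "combined" format); the pattern and function names are illustrative only.

```python
import re

# Layout of the entry above: IP, identity, user, [time], "request",
# status, size, "referrer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return the fields of one log entry as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    return entry


# The entry quoted in this post.
SAMPLE = (
    '180.175.7.236 - - [12/Dec/2011:01:17:22 -0800] '
    '"GET /blog/name_of_page/ HTTP/1.1" 403 521 "-" '
    '"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 '
    '(KHTML, like Gecko) Chrome/14.0.803.0 Safari/535.1"'
)

entry = parse_log_line(SAMPLE)
print(entry["ip"], entry["status"], entry["user_agent"])
```

The point is that the IP and the user agent are separate fields: the user agent is whatever the visitor chose to send, while the IP comes from the connection itself.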
If you do a little checking you will find that rumor has it that IPs near that value are the Baiduspider! My guess is this entry corresponds to an attempt by the baidubot to connect while spoofing the user agent. (Mind you, this is not necessarily so. But I suspect it.) Remember, if it leaves "Baidu" in the user agent, it is blocked. But it's possible for the person who programmed the bot to tell it to initially tell the truth, but change user agent if it gets back a "403" or "forbidden" message. (You can see this happen in server logs.) Now, I could maybe call a lawyer and try to assemble a case about Baidu, but that would likely be expensive. And anyway, I might have trouble proving it was Baidu lying. After all, for all I know "China Chinanet Shanghai Province Network" is the Chinese equivalent of Comcast and I'm blocking a real visitor. I doubt it, but I might be.
Since I don't have any particular need for traffic from China, I decided to deal with the huge number of spammy hits from this IP range by blocking by IP. It turns out my .htaccess also contains:
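The "tell the truth first, then change user agent after a 403" pattern can be searched for in a log. Here is a rough sketch, assuming the log entries have already been reduced to (ip, status, user_agent) tuples in time order. The function name and the sample entries are invented for illustration; they are not from my logs.

```python
def find_agent_switches(entries):
    """List (ip, blocked_agent, later_agent) for each IP that comes back
    with a different user agent after receiving a 403."""
    last_blocked = {}  # ip -> user agent that most recently got a 403
    switches = []
    for ip, status, user_agent in entries:
        blocked_agent = last_blocked.get(ip)
        if blocked_agent is not None and user_agent != blocked_agent:
            switches.append((ip, blocked_agent, user_agent))
            del last_blocked[ip]
        if status == 403:
            last_blocked[ip] = user_agent
    return switches


# Hypothetical entries, using documentation-only IP addresses.
EXAMPLE_ENTRIES = [
    ("203.0.113.7", 403, "Baiduspider/2.0"),
    ("203.0.113.7", 403, "Mozilla/5.0 Chrome/14.0.803.0"),
    ("198.51.100.9", 200, "Mozilla/5.0 Firefox/8.0"),
]

for ip, before, after in find_agent_switches(EXAMPLE_ENTRIES):
    print(ip, "was blocked as", before, "and returned as", after)
```

A hit from this is a reason for suspicion, not proof: several real people behind one shared IP would look the same.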
order allow,deny
# baidu spider various ranges
deny from 119.
deny from 123.125.71
deny from 124.114
deny from 124.115
deny from 180
deny from 220.181
deny from 183
# china China Fujian Chinanet Fujian Province Network
deny from 120.37.209.57
# copyscape
deny from 212.100.254.105
# block picscout
deny from bezeqint.net
deny from 82.80.249
deny from 82.80.252
deny from 62.0.8.
deny from gettyimages.com
deny from gettywan.com
deny from picscout.com
deny from istockphoto.com
allow from all
Once again, this is edited to keep from filling the entire screen. The bold 'deny from 180' line means I deny all IPs starting with 180, which means no one can visit my blog if they connect through "China Chinanet Shanghai Province Network".
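To check which visitors a set of partial-IP rules would catch before putting them live, the matching can be imitated in a few lines. This sketch assumes the rules match on whole leading octets, which is how I understand Apache treats a partial IP, and it covers only the IP rules, not hostname rules such as "deny from picscout.com". The function names are made up.

```python
# Partial-IP rules copied from the "baidu spider various ranges" lines above.
DENY_RULES = ["119.", "123.125.71", "124.114", "124.115", "180", "220.181", "183"]


def matches_partial_ip(ip, rule):
    """True if the leading octets of ip are exactly the octets given in rule."""
    rule_octets = [part for part in rule.split(".") if part]
    ip_octets = ip.split(".")
    return ip_octets[:len(rule_octets)] == rule_octets


def is_denied(ip, rules=DENY_RULES):
    """True if any partial-IP rule covers this address."""
    return any(matches_partial_ip(ip, rule) for rule in rules)


print(is_denied("180.175.7.236"))  # the address from the log entry
print(is_denied("18.0.0.1"))       # shares leading digits, but not the first octet
```

Comparing octets rather than raw text matters: a plain "starts with 18" test would also catch 180 through 189, which is not what a rule for 18 is meant to do.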
I'm sure you've also noticed "# block picscout", right? All the commands between that line and "allow from all" block everything I've found that is either known or rumored to be associated with picscout or getty surfing. I have other blocks in place too.
As I've said on another thread, I'm trying to put together a PHP script that will auto-install and implement a lot of these blocks for people. I thought I'd be done more quickly, but as I checked things out, I needed to make sure it really, really does what I want it to do, and fairly easily. Given that, I might charge a small amount for it ($10 or so). But it would put in blocks for things known or rumored to be getty/picscout, etc., and do a few more things to give people with a range of sites some protection. (Nothing will give perfect protection. But you can make your site much less vulnerable by making it harder for picscout to crawl!)