Author Topic: Latest image scraper – a troll by any other name (Read 14815 times)

jot · « **on:** January 10, 2013, 06:08:27 PM »

Today as I was looking over our server logs for the New Year to make sure my new security measures were keeping the bad bots at bay, I came across this little nugget of info in my browsers used log…

BPImageWalker/2.0 (www.bdbrandprotect.com)

Haha! A new image scraper. From the well-laid-out name of the browser (they are rarely this kind), I knew they were not scanning my server to help me out in any way. A little research showed that this Canadian-based company offers “brand protection” services, and one of those services is image scanning. Here is a link to a PDF that describes their capabilities…
http://www.brandprotect.com/files/BP_Services_Unique_Capabilities.pdf

More research turned up remarks on user agent forums about how this company’s bots ignore the robots.txt file (much like PicScout’s) and that most webmasters just block their domains and IP addresses. I was able to find out the domains they like to use…

bdbrandprotect.com
brandprotect.com
brandimensions.net
brandimensions.com

Here are the IP addresses I was able to verify so far….

72.14.164.103
72.14.163.101
72.14.163.107
72.14.170.60
216.183.93.163

And these addresses have been reported as some they have used in the past (I could not verify them with DNS records, other than that they are at the same hosting company)…

72.14.164.122
72.14.164.131
72.14.164.143
72.14.164.157
72.14.164.161
72.14.164.176
72.14.164.183

Tomorrow I will be going over the firewall logs to see if I can spot exactly what kind of traffic and what time they “accessed” my web server. Now, this may be a legitimate company doing legitimate business, but if they have ways to “bypass” the most basic of web server security settings, then in my opinion, they are no better than hackers and I would refuse to do business with such companies.

crazycatlady · « **Reply #1 on:** January 10, 2013, 07:03:58 PM »

The site claims you can add a robots.txt file to block them: http://www.brandprotect.com/disallow-brandprotect-robots.html

Robert Krausankas (BuddhaPi) · « **Reply #2 on:** January 10, 2013, 08:17:58 PM »

Quote from: crazycatlady on January 10, 2013, 07:03:58 PM

The site claims you can add a robots.txt file to block them: http://www.brandprotect.com/disallow-brandprotect-robots.html

Hi crazycatlady!! Welcome!

lucia · « **Reply #3 on:** January 10, 2013, 09:36:31 PM »

Many bots that say you can use robots.txt to block them don't obey robots.txt. It's best to take additional measures to block these sorts of things.

Oscar Michelen · « **Reply #4 on:** January 11, 2013, 07:09:05 PM »

Trust lucia - she is the bomb on this issue!

lucia · « **Reply #5 on:** January 11, 2013, 10:09:11 PM »

Speaking of which, I've gone to using ZBblock to ban things. (I do special "odd" things for images because ZBblock only protects '.php' files: some requests for images are redirected to a PHP file.) The 'custom' user agents I block are discernible in this bit of code:

Code:

<?php
# require_once("CookieChecks.php");
#echo("<b>useragent checks");
function CustomUserAgentCheck($useragent){
global $thishost, $ax, $whyblockout, $whyblockout2; 
global $requesturi, $lcrequesturi, $lcrequesturisws, $address; 
$ax_start=$ax;
$lcuseragent=strtolower($useragent);
$lcuseragentsws=preg_replace('/\s+/','',$lcuseragent);
$lcuseragentsws=preg_replace("/[^\x9\xA\xD\x20-\x7F]/",'',$lcuseragentsws);

$whyblockout2 .= "(check ua)";
	
$bad_UA="(psbot|picsearch|vlc|htmlparser|playstation|pixray|pix|picscout|pics|pict|phantom|copy|getty|tineye|wesee.|digimarc|bitvo|nsplayer|thumbnail|screenshot|snapshot|sindice|luminate|fyber|cydral|doubanbot|webcollage|rganalytics|shot|snappreviewbot|version: xxxx|muso.com|musobot|photon|brandprotect)"; # stray tab before brandprotect removed so the alternative can match
$reason_stub=" Image user agent.  "; #
pregMatchTest( $bad_UA, $lcuseragent, $reason_stub, $insta=" INSTA-BAN. ");
# -------- --------   -------- --------  -------- --------   -------- -------- 	

$bad_UA="(mShots|TraumaCadX|BPImageWalker|ImageProHD|WikioImagesBot|3\.01 PBWF \(Win95\)|Corp_Device_User|CoverScout|nsplayer|J-BRW)"; # parentheses escaped so they match literally; duplicate entries dropped
$reason_stub=" Image user agent.  "; #
pregMatchTest( $bad_UA, $useragent, $reason_stub, $insta=" INSTA-BAN. ");
# -------- --------   -------- --------  -------- --------   -------- -------- 	

$bad_UA="(crowsnest|grepnetstat|inagist.com|js-kit|scraper|seo|warebay|whowhere|www-mechanize|intelium|magpie|patchone.se|scanner|riverglass|parser|funnelback)"; # RiverglassScanner
$reason_stub=" Scraper, snoop, or seo user agent.  "; #
pregMatchTest( $bad_UA, $lcuseragent, $reason_stub, $insta=" INSTA-BAN. ");
# -------- --------   -------- --------  -------- --------   -------- -------- 	

$bad_UA="(DDDDDD|000000|Spinn3r|spinn3r|rcMQUxf|B55|Weiterleitung|Baurat.de)";
$reason_stub=" Suspected anonymizer user agent.  "; #
pregMatchTest( $bad_UA, $useragent, $reason_stub, $insta=" INSTA-BAN. ");

# maybe these could be permitted to leave pings-- if I can figure out how those are left. 
$bad_UA="(Synapse|coccoc.vn|metauri|q0\.com|lynx|libwww|EAK01AG9|crawlerj|parsijoo)";
$reason_stub=" Suspected bot user agent. Snoop/Scrape/useless search etc."; #
pregMatchTest( $bad_UA, $useragent, $reason_stub, $insta=" INSTA-BAN. ");

# -------- --------   -------- --------  -------- --------   -------- -------- 	

	# include voracious, seo, subscription in subscription reputation management,mystery etc.
	# try adding info FunWebProducts Http. break words in   ProxiNet links (check logs. does apache leave linx or links?) 
	# Experimental SNAPSHOT
	# Catalog UnwindFetchor akarta Commons-HttpClient/3.1 Peeplo Screenshot Bot Phantom.js bot PycURL/7.19.7 MonTools.com   artviper(tm) RankFlex.com  Jigsaw/2.2.5 W3C_CSS_Validator_JFouffa/2.0 Y!J; for robot study; keyoshid store tweet twitter archive , Links2Go Similarity Engine , WebCapture  W3C_Validator/1.3 bdcindexer_2.6.2 (research@bdc) 192.comAgent Aberja Checkomat ntelliseek HyperixScoop spam  , Exalead NG/MimeLive Client (convert/http/0.120),  Lotus-Notes/5.0 CloakDetect  DeepIndex Data DataFountains/DMOZ Downloader Robots.txt finder
	
	
	# I should change it to self identify. They hit knitting haiku. Might hit.... uhmmm bannasties? The test blog? Need to find that. 
	# lots of crap at google appid.  I suspect I don't want to permit anything from '
	# google appid other than hitting feed.
	
$bad_UA="( 008/|80legs|2dayhost.com|aboundex|acoon|ahrefs|aihit|\(alpha\)|baidu|binlar|bixo|ccbot|checker|chilkat|clipish|cmsworldmap|coomnet|crowsnest|cuasar|digital alphaserver|crack|dataprovider|daumoa|detect|download|fairshare|fark.com|find|freewebmonitoring|fyber|gomez|govid.mobi|inagist|indexer|ips-agent|lead|linkalarm|linkbutler|linksleuth|linkcheck|linkfluence|linkdex|lumin|mj12|majestic|metamoji|missing|mojeek|monitoring|mozilla/0.91 beta|netseer|null|openindex|panopta|panscient|peerindex|pipl|portalimage|postrank|proximic|radian6|reverseget.com|seek|seo|searchme|siclab|scraper|shop|sistrix|showsiteinf|scoop|sniffer|spinn3r|super-goo|trendiction|thunderstone|tweet|unisterbot|unmask-parasites|urlchecker|wasalive|whatweb|whitehat|webinator|whowhere|wotbox|wada.vn|yacy|whatweb|wholinks|wikimpress|wocodi|yanga|@somewhere|info.netcraft.com|dom2dom|tenderlove/mechanize|edition yx|socialayer|linkjumper|coldfusion|abonti|genieo.|getfavicon|mrchrome|unwind|lucene|solr|drup|webmastercoffee)";
## web-sniffer| 

$reason_stub=" Subscription crawler service or crawler user agent.  "; #
pregMatchTest( $bad_UA, $lcuseragent, $reason_stub, $insta=" INSTA-BAN. ");


if(!inmatch($lcuseragent,'pubsubhubbub',"") && !inmatch($lcuseragent,'s~feedly-social',"")  ){
	$reason_stub=" Unauthorized google app. If you would like approval for this google app, contact me so I can whitelist it.  "; #
	$bad_UA="(appid:)";
	pregMatchTest( $bad_UA, $lcuseragent, $reason_stub, $insta=" ");
}
# -------- --------   -------- --------  -------- --------   -------- -------- 	
	# my crons send this: Links (2.1pre37; Linux 3.1.9-vs2.3.2.5 x86_64)  unless I change it.

if($_SERVER['REMOTE_ADDR']!= $_SERVER['SERVER_ADDR']){  
	$bad_UA="(link)";
	$reason_stub=" Link hunting ua  "; #
	pregMatchTest( $bad_UA, $useragent, $reason_stub, $insta=" INSTA-BAN. ");
} 

if(!inmatch($useragent ,"The Incutio XML-RPC PHP Library","") && !inmatch($useragent ,"FirePHP/","") ){
	$bad_UA="(java|ruby|pear.php|http_|nutch|drupal|curl|start.exe|wget|dataprovider|php|metauri|simbar)";
	$reason_stub=" Programming language, utility or weird extension.  "; #
	pregMatchTest( $bad_UA, $lcuseragent, $reason_stub, $insta=" INSTA-BAN. ");
}
# -------- --------   -------- --------  -------- --------   -------- -------- 	

$bad_UA="(Butterfly|Curious George|DonkeyBot|EventGuruBot|EventMachine|Google/1.0|MaMa CaSpEr|IlTrovatore-Setaccio|Microsoft-WebDAV-MiniRedir|MSIECrawler|NerdByNature|PHP/SMF|Semantic|SymantecSpider|Searcharoo|SiteIntel|T-H-U-N-D-E-R-S-T-O-N-E|Trystero|Ukonline|Vagabondo|WWW-Mechanize|XML-RPC.NET|Voila|ELNSB50|bltformac| YE |boardreader|SIMBAR|Zend_Http_Client)";

$reason_stub=" User agent I just do not trust.   "; #
pregMatchTest( $bad_UA, $useragent, $reason_stub, $insta=" INSTA-BAN. ");
# -------- --------   -------- --------  -------- --------   -------- -------- 	

	# archaic or just badly behaved
	# I think I can add ^Mozilla/4.0$
$bad_UA="(Mozilla/0.6 Beta|Mozilla/4.0 \(compatible; ICS\)|^Mozilla/4.0$)"; 
$reason_stub=" Archaic user agent or obnoxious prefetcher. "; #
pregMatchTest( $bad_UA, $useragent, $reason_stub, $insta=" INSTA-BAN. ");


$bad_UA="(0000|DDDDD)";
$reason_stub=" Anonymizing user agent. "; #
pregMatchTest( $bad_UA, $useragent, $reason_stub, $insta=" INSTA-BAN. ");

# -------- --------   -------- --------  -------- --------   -------- -------- 	
$bad_UA="(@alexa.com|archive|heritrix|internetmemory|Svenska-webbsidor)"; # archive bot based on heritrix
$reason_stub=" Archivers like wayback, foreign wayback etc. "; #
pregMatchTest( $bad_UA, $useragent, $reason_stub, $insta=" INSTA-BAN. ");

# -------- --------   -------- --------  -------- --------   -------- -------- 	
 $bad_UA="(super-goo|siclab|yodao)";
$reason_stub=" Foreign language do not bring traffic user agent.  "; #
pregMatchTest( $bad_UA, $useragent, $reason_stub, $insta=" INSTA-BAN. ");

 $bad_UA="(\.asp|\.bbs|\.dll|\.exe|\.svn)";	#

$reason_stub=" Does not exist.  "; #
pregMatchTest( $bad_UA, $useragent, $reason_stub, $insta=" INSTA-BAN.  ua ");


	# echo("<center><br> Done with UA  checks. </center>");
unset ($bad_UA );

#if(strlen($useragent)<4 && !rmatch($requesturi,"/BanNasties/recieveZBBlock.php","") ){ $ax++; $whyblockout=" Blank user agent. INSTA-BAN. "; }

$ax += rmatch($useragent,"YI","Suspected hacktool (UA-142). "); //71
$ax += rmatch($useragent,"YE","Suspected hacktool (UA-142). "); //71


# ------------no idea.. but I don't trust. ------------------- '


$ax = $ax + (inmatch($useragent,"IlTrovatore-Setaccio","; not search engine bot?  Nasty. ")); #  

$ax = $ax + (inmatch($lcuseragent,"cis455crawler","; cis455crawler mystery bot with no web page.  Nasty. ")); //


$ax = $ax + (inmatch($useragent,"OpenCalaisSemanticProxy","; OpenCalaisSemanticProxy  Nasty.  ")); //
$ax = $ax + (inmatch($useragent,"Spider.exe","; DLE_Spider.exe   Nasty. ")); //

# gigablast spider

if($address != '64.22.106.82'){
	$ax = $ax + (inmatch($useragent,"gigablast.com","; spoofing gigablast spider. INSTA-BAN.  ")); // 

}

if($ax_start<$ax){$whyblockout2 .= "(ua_cust)";}

return;
}	
?>

For reference, $lcuseragent is the lower-case user agent and $useragent is the user agent as sent. pregMatchTest checks whether anything in that '|'-separated list is a match, and anything with $ax>1 is banned.
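
For anyone not running PHP, here is a minimal sketch of the same matching idea in Python. It assumes pregMatchTest simply runs the '|'-separated list as one regular expression against the user agent; that function is not shown above, so this is a guess at its behaviour, not a port. The two pattern lists are shortened, hypothetical samples taken from the longer lists in the code, and `ua_block_reason` is a made-up name.

```python
import re

# Shortened sample lists (hypothetical); the real lists above are much longer.
# The first is matched against the lower-cased user agent, the second
# against the user agent exactly as sent.
IMAGE_BOTS_LOWER = r"(picscout|tineye|pixray|brandprotect)"
IMAGE_BOTS_EXACT = r"(BPImageWalker|CoverScout|mShots)"


def ua_block_reason(useragent):
    """Return a reason string if the user agent matches a list, else None."""
    lc_useragent = useragent.lower()
    if re.search(IMAGE_BOTS_LOWER, lc_useragent):
        return "Image user agent. INSTA-BAN."
    if re.search(IMAGE_BOTS_EXACT, useragent):
        return "Image user agent. INSTA-BAN."
    return None


if __name__ == "__main__":
    for ua in (
        "BPImageWalker/2.0 (www.bdbrandprotect.com)",
        "Mozilla/5.0 (Windows NT 6.1; rv:18.0) Gecko/20100101 Firefox/18.0",
    ):
        print(ua, "->", ua_block_reason(ua))
```

Substring matching like this is deliberately broad: a short entry such as "pix" or "shot" will also catch any harmless user agent that happens to contain those letters, so short entries are worth checking against your own logs first.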

If you use htaccess, you can pick the user agents out of these lists and ban them there. I also block all agents that ZBblock bans by default. As you can see, BPImageWalker is listed under "Image user agent". There are more. . .

ws2001 · « **Reply #6 on:** July 24, 2013, 01:35:28 AM »

Quote from: jot on January 10, 2013, 06:08:27 PM

BPImageWalker/2.0 (www.bdbrandprotect.com)

bdbrandprotect.com
brandprotect.com
brandimensions.net
brandimensions.com

Here are the IP addresses I was able to verify so far….

72.14.164.103
72.14.163.101
72.14.163.107
72.14.170.60
216.183.93.163

And these addresses have been reported as some they have used in the past (I could not verify them with DNS records, other than that they are at the same hosting company)…

72.14.164.122
72.14.164.131
72.14.164.143
72.14.164.157
72.14.164.161
72.14.164.176
72.14.164.183

Got visited, again, by the bandwidth pirate.

BPImageWalker/2.0 (IPs to block)
72.14.164.0 - 72.14.164.255
71.14.170.48 - 71.14.170.63
216.183.91.16 - 216.183.91.31
216.183.93.161 - 216.183.93.175

I have blocked this bandwidth pirate for years. They came back on new IPs, so I added the new ranges today. The rogue bot intrusion proved they ignore robots.txt: they did not even bother to GET the robots.txt file.
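
A rough Python sketch of how ranges like these can be checked, for anyone filtering in a script rather than at the firewall. The ranges are copied exactly as posted above; `to_networks` and `is_blocked` are made-up helper names. It uses the standard `ipaddress` module to turn each first/last pair into CIDR blocks, which is also the form most firewall and .htaccess rules expect.

```python
import ipaddress

# Ranges copied as posted above: (first address, last address).
POSTED_RANGES = [
    ("72.14.164.0", "72.14.164.255"),
    ("71.14.170.48", "71.14.170.63"),
    ("216.183.91.16", "216.183.91.31"),
    ("216.183.93.161", "216.183.93.175"),
]


def to_networks(ranges):
    """Convert first/last address pairs into a list of CIDR networks."""
    networks = []
    for first, last in ranges:
        networks.extend(
            ipaddress.summarize_address_range(
                ipaddress.IPv4Address(first), ipaddress.IPv4Address(last)
            )
        )
    return networks


BLOCKED = to_networks(POSTED_RANGES)


def is_blocked(address):
    """True if the address falls inside any of the posted ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in BLOCKED)


if __name__ == "__main__":
    # Print the CIDR form of each range, then test one address.
    print([str(network) for network in BLOCKED])
    print(is_blocked("72.14.164.103"))
```

A range that does not start on a block boundary, such as the .161 to .175 one, comes out as several smaller CIDR blocks rather than one.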

Must have gotten the GI rogue bot fanboys agitated; there's nothing there! Been visited the past month by a dozen or two rogue bot relatives. Typical harassment protocol of technonoobs throwing a tantrum.

Hint to rogue bots - 'you can go about your business, move along'.

gotletter · « **Reply #7 on:** July 24, 2013, 09:06:38 PM »

I ran across a friend of mine today that suggested putting a 'Terms of Website Use' on my website(s). In essence it reads along the lines of:

"Limits on Use of the Sites and Services
You agree not to engage in any of the following: (a) use any automated means, including, without limitation, agents, robots, scripts, or spiders, to access, monitor, data scrape, copy or transfer any part of the Sites or Services (including without limitation any User data such as Member website usage, purchase history, or any Registration Data, whether individually or in the aggregate); (b) use any device, software or program to attempt to data mine, data scrape or image lift (including without limitation any files contained in published or non-published pages on the site) other than what is stored in a browser's cache or cookie recall;"

He could not vouch for its effectiveness but did mention that MANY big company websites have similar wording to prevent various actions. In essence it's in place as a buffer to stop anyone from cruising your website on purpose, looking for something to use to cause you grief.

Thoughts?

Robert Krausankas (BuddhaPi) · « **Reply #8 on:** July 24, 2013, 09:14:35 PM »

Just my 2 cents: bots aren't going to read those terms, so they will be ignored. Not to mention that most of these bots / spiders / scrapers come from other countries (PicScout = Israel), so they couldn't care less about any terms. I think the only way to make something like this effective would be to stop every bot and user and force them to agree to these terms, which isn't really practical from many standpoints.

lucia · « **Reply #9 on:** July 24, 2013, 11:25:49 PM »

The terms aren't going to help with most of the bots, especially not the bad bots. The things that can help:
1) robots.txt exclusion works for well behaved bots. (The well behaved bots are generally not a big problem.)
2) .htaccess rules can help with some bots provided you know user agents, IP ranges and so on.
3) ZBblock can help protect any of your .php resources. (This is a php program.)
4) Use a content delivery network to filter. (I use Cloudflare. I block all of China at Cloudflare.)

The difficulty is that you probably would like to block all 'bad' bots, especially image scrapers. But the various scraping companies use a wide variety of IPs. They are very motivated to scrape, and my impression is that they have opened accounts with a variety of ISPs, hosting companies and so on. They are not all in Israel. Some spoof user agents to look like a browser. So you will never be able to block all image scrapers. You can slow them down, but it's quite a bit of work.
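
To make point 1 concrete, here is a small Python sketch of the check a well-behaved bot performs before fetching anything, using the standard `urllib.robotparser` module. The robots.txt content is hypothetical, and using "BPImageWalker" as the User-agent token is an assumption: the token the company actually documents may differ. The point is that this check runs on the bot's side, so a bot that skips it is not stopped by robots.txt at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. The "BPImageWalker" token is an
# assumption, not taken from the company's own instructions.
ROBOTS_TXT = """\
User-agent: BPImageWalker
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(useragent, url):
    """What a compliant crawler would ask itself before requesting url."""
    return parser.can_fetch(useragent, url)


if __name__ == "__main__":
    url = "http://example.com/images/photo.jpg"
    print(may_fetch("BPImageWalker/2.0 (www.bdbrandprotect.com)", url))
    print(may_fetch("SomeOtherBot/1.0", url))
```

Nothing on the server enforces the answer, which is why the user agent, IP and CDN measures in points 2 to 4 are still needed for bots that never ask.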

gotletter · « **Reply #10 on:** July 25, 2013, 05:10:32 AM »

It was explained to me more along these lines: should a company (Getty, for example) contact me claiming infringement, I could in turn go after them for violating my site's terms and conditions of use.

Dunno, seems to me that if they are data mining and data scraping that that action alone is not 100% legal.

But then again if they are going about it from outside of the country where the laws are not the same...

Robert Krausankas (BuddhaPi) · « **Reply #11 on:** July 25, 2013, 07:08:45 AM »

Without them agreeing to those terms, they really mean nothing in the grand scheme of things.
