Speaking of which, I've switched to using ZBblock to ban things. (I do special "odd" things for images because ZBblock only protects '.php' files: some requests for images are redirected to a php file.) The 'custom' user agents I block are discernible in this bit of code:
<?php
# require_once("CookieChecks.php");
#echo("<b>useragent checks");
function CustomUserAgentCheck($useragent){
global $thishost, $ax, $whyblockout, $whyblockout2;
global $requesturi, $lcrequesturi, $lcrequesturisws, $address;
$ax_start=$ax;
$lcuseragent=strtolower($useragent);
$lcuseragentsws=preg_replace('/\s+/','',$lcuseragent);
$lcuseragentsws=preg_replace("/[^\x9\xA\xD\x20-\x7F]/",'',$lcuseragentsws);
$whyblockout2 .= "(check ua)";
$bad_UA="(psbot|picsearch|vlc|htmlparser|playstation|pixray|pix|picscout|pics|pict|phantom|copy|getty|tineye|wesee.|digimarc|bitvo|nsplayer|thumbnail|screenshot|snapshot|sindice|luminate|fyber|cydral|doubanbot|webcollage|rganalytics|shot|snappreviewbot|version: xxxx|muso.com|musobot|photon| brandprotect)";
$reason_stub=" Image user agent. "; #
pregMatchTest( $bad_UA, $lcuseragent, $reason_stub, $insta=" INSTA-BAN. ");
# -------- -------- -------- -------- -------- -------- -------- --------
$bad_UA="(mShots|TraumaCadX|BPImageWalker|ImageProHD|WikioImagesBot|3\.01 PBWF \(Win95\)|Corp_Device_User|CoverScout|nsplayer|J-BRW)"; # parentheses escaped so they match literally; duplicate entries removed
$reason_stub=" Image user agent. "; #
pregMatchTest( $bad_UA, $useragent, $reason_stub, $insta=" INSTA-BAN. ");
# -------- -------- -------- -------- -------- -------- -------- --------
$bad_UA="(crowsnest|grepnetstat|inagist.com|js-kit|scraper|seo|warebay|whowhere|www-mechanize|intelium|magpie|patchone.se|scanner|riverglass|parser|funnelback)"; # RiverglassScanner
$reason_stub=" Scraper, snoop, or seo user agent. "; #
pregMatchTest( $bad_UA, $lcuseragent, $reason_stub, $insta=" INSTA-BAN. ");
# -------- -------- -------- -------- -------- -------- -------- --------
$bad_UA="(DDDDDD|000000|Spinn3r|spinn3r|rcMQUxf|B55|Weiterleitung|Baurat.de)";
$reason_stub=" Suspected anonymizer user agent. "; #
pregMatchTest( $bad_UA, $useragent, $reason_stub, $insta=" INSTA-BAN. ");
# maybe these could be permitted to leave pings-- if I can figure out how those are left.
$bad_UA="(Synapse|coccoc.vn|metauri|q0\.com|lynx|libwww|EAK01AG9|crawlerj|parsijoo)";
$reason_stub=" Suspected bot user agent. Snoop/Scrape/useless search etc."; #
pregMatchTest( $bad_UA, $useragent, $reason_stub, $insta=" INSTA-BAN. ");
# -------- -------- -------- -------- -------- -------- -------- --------
# include voracious, seo, subscription in subscription reputation management,mystery etc.
# try adding info FunWebProducts Http. break words in ProxiNet links (check logs. does apache leave linx or links?)
# Experimental SNAPSHOT
# Catalog UnwindFetchor akarta Commons-HttpClient/3.1 Peeplo Screenshot Bot Phantom.js bot PycURL/7.19.7 MonTools.com artviper(tm) RankFlex.com Jigsaw/2.2.5 W3C_CSS_Validator_JFouffa/2.0 Y!J; for robot study; keyoshid store tweet twitter archive , Links2Go Similarity Engine , WebCapture W3C_Validator/1.3 bdcindexer_2.6.2 (research@bdc) 192.comAgent Aberja Checkomat ntelliseek HyperixScoop spam , Exalead NG/MimeLive Client (convert/http/0.120), Lotus-Notes/5.0 CloakDetect DeepIndex Data DataFountains/DMOZ Downloader Robots.txt finder
# I should change it to self identify. They hit knitting haiku. Might hit.... uhmmm bannasties? The test blog? Need to find that.
# lots of crap at google appid. I suspect I don't want to permit anything from '
# google appid other than hitting feed.
$bad_UA="( 008/|80legs|2dayhost.com|aboundex|acoon|ahrefs|aihit|\(alpha\)|baidu|binlar|bixo|ccbot|checker|chilkat|clipish|cmsworldmap|coomnet|crowsnest|cuasar|digital alphaserver|crack|dataprovider|daumoa|detect|download|fairshare|fark.com|find|freewebmonitoring|fyber|gomez|govid.mobi|inagist|indexer|ips-agent|lead|linkalarm|linkbutler|linksleuth|linkcheck|linkfluence|linkdex|lumin|mj12|majestic|metamoji|missing|mojeek|monitoring|mozilla/0.91 beta|netseer|null|openindex|panopta|panscient|peerindex|pipl|portalimage|postrank|proximic|radian6|reverseget.com|seek|seo|searchme|siclab|scraper|shop|sistrix|showsiteinf|scoop|sniffer|spinn3r|super-goo|trendiction|thunderstone|tweet|unisterbot|unmask-parasites|urlchecker|wasalive|whatweb|whitehat|webinator|whowhere|wotbox|wada.vn|yacy|whatweb|wholinks|wikimpress|wocodi|yanga|@somewhere|info.netcraft.com|dom2dom|tenderlove/mechanize|edition yx|socialayer|linkjumper|coldfusion|abonti|genieo.|getfavicon|mrchrome|unwind|lucene|solr|drup|webmastercoffee)";
## web-sniffer|
$reason_stub=" Subscription crawler service or crawler user agent. "; #
pregMatchTest( $bad_UA, $lcuseragent, $reason_stub, $insta=" INSTA-BAN. ");
if(!inmatch($lcuseragent,'pubsubhubbub',"") && !inmatch($lcuseragent,'s~feedly-social',"") ){
$reason_stub=" Unauthorized google app. If you would like approval for this google app, contact me so I can whitelist it. "; #
$bad_UA="(appid:)";
pregMatchTest( $bad_UA, $lcuseragent, $reason_stub, $insta=" ");
}
# -------- -------- -------- -------- -------- -------- -------- --------
# my crons send this: Links (2.1pre37; Linux 3.1.9-vs2.3.2.5 x86_64) unless I change it.
if($_SERVER['REMOTE_ADDR']!= $_SERVER['SERVER_ADDR']){
$bad_UA="(link)";
$reason_stub=" Link hunting ua "; #
pregMatchTest( $bad_UA, $useragent, $reason_stub, $insta=" INSTA-BAN. ");
}
if(!inmatch($useragent ,"The Incutio XML-RPC PHP Library","") && !inmatch($useragent ,"FirePHP/","") ){
$bad_UA="(java|ruby|pear.php|http_|nutch|drupal|curl|start.exe|wget|dataprovider|php|metauri|simbar)";
$reason_stub=" Programming language, utility or weird extension. "; #
pregMatchTest( $bad_UA, $lcuseragent, $reason_stub, $insta=" INSTA-BAN. ");
}
# -------- -------- -------- -------- -------- -------- -------- --------
$bad_UA="(Butterfly|Curious George|DonkeyBot|EventGuruBot|EventMachine|Google/1.0|MaMa CaSpEr|IlTrovatore-Setaccio|Microsoft-WebDAV-MiniRedir|MSIECrawler|NerdByNature|PHP/SMF|Semantic|SymantecSpider|Searcharoo|SiteIntel|T-H-U-N-D-E-R-S-T-O-N-E|Trystero|Ukonline|Vagabondo|WWW-Mechanize|XML-RPC.NET|Voila|ELNSB50|bltformac| YE |boardreader|SIMBAR|Zend_Http_Client)";
$reason_stub=" User agent I just do not trust. "; #
pregMatchTest( $bad_UA, $useragent, $reason_stub, $insta=" INSTA-BAN. ");
# -------- -------- -------- -------- -------- -------- -------- --------
# archaic or just badly behaved
# I think I can add ^Mozilla/4.0$
$bad_UA="(Mozilla/0.6 Beta|Mozilla/4.0 \(compatible; ICS\)|^Mozilla/4.0$)";
$reason_stub=" Archaic user agent or obnoxious prefetcher. "; #
pregMatchTest( $bad_UA, $useragent, $reason_stub, $insta=" INSTA-BAN. ");
$bad_UA="(0000|DDDDD)";
$reason_stub=" Anonymizing user agent. "; #
pregMatchTest( $bad_UA, $useragent, $reason_stub, $insta=" INSTA-BAN. ");
# -------- -------- -------- -------- -------- -------- -------- --------
$bad_UA="(@alexa.com|archive|heritrix|internetmemory|Svenska-webbsidor)"; # archive bot based on heritrix
$reason_stub=" Archivers like wayback, foreign wayback etc. "; #
pregMatchTest( $bad_UA, $useragent, $reason_stub, $insta=" INSTA-BAN. ");
# -------- -------- -------- -------- -------- -------- -------- --------
$bad_UA="(super-goo|siclab|yodao)";
$reason_stub=" Foreign-language user agent that does not bring traffic. "; #
pregMatchTest( $bad_UA, $useragent, $reason_stub, $insta=" INSTA-BAN. ");
$bad_UA="(\.asp|\.bbs|\.dll|\.exe|\.svn)"; #
$reason_stub=" Does not exist. "; #
pregMatchTest( $bad_UA, $useragent, $reason_stub, $insta=" INSTA-BAN. ua ");
# echo("<center><br> Done with UA checks. </center>");
unset ($bad_UA );
#if(strlen($useragent)<4 && !rmatch($requesturi,"/BanNasties/recieveZBBlock.php","") ){ $ax++; $whyblockout=" Blank user agent. INSTA-BAN. "; }
$ax += rmatch($useragent,"YI","Suspected hacktool (UA-142). "); //71
$ax += rmatch($useragent,"YE","Suspected hacktool (UA-142). "); //71
# ------------no idea.. but I don't trust. ------------------- '
$ax = $ax + (inmatch($useragent,"IlTrovatore-Setaccio","; not search engine bot? Nasty. ")); #
$ax = $ax + (inmatch($lcuseragent,"cis455crawler","; cis455crawler mystery bot with no web page. Nasty. ")); //
$ax = $ax + (inmatch($useragent,"OpenCalaisSemanticProxy","; OpenCalaisSemanticProxy Nasty. ")); //
$ax = $ax + (inmatch($useragent,"Spider.exe","; DLE_Spider.exe Nasty. ")); //
# gigablast spider
if($address != '64.22.106.82'){
$ax = $ax + (inmatch($useragent,"gigablast.com","; spoofing gigablast spider. INSTA-BAN. ")); //
}
if($ax_start<$ax){$whyblockout2 .= "(ua_cust)";}
return;
}
?>
For reference, $lcuseragent is the lower case user agent.
$useragent is the user agent as sent. pregMatchTest checks whether anything in that '|' separated list is a match, and anything with $ax>1 is banned.
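To make the matching step concrete, here is a minimal sketch of the kind of check described above. The real pregMatchTest is not shown in this post, so this is only a guess at its shape: the name uaListMatches, the '~' delimiter, the by-reference score and reason arguments (standing in for the globals), and the sample user agent are all assumptions for illustration.

```php
<?php
// Hypothetical stand-in for pregMatchTest(); the real helper is not shown here.
// Assumption: the '|' separated list is used as a regex alternation, and a hit
// adds 1 to the ban score and appends the reason text.
function uaListMatches(string $list, string $subject, string $reason, string $insta, int &$score, string &$why): bool
{
    // '~' is the delimiter because the lists contain '/' (e.g. "Mozilla/4.0").
    $pattern = '~' . str_replace('~', '\~', $list) . '~';
    if (preg_match($pattern, $subject, $m) === 1) {
        $score++;
        $why .= $reason . $insta . '(matched "' . $m[0] . '")';
        return true;
    }
    return false;
}

$score = 0;
$why = '';
// Made-up sample user agent, for illustration only.
$ua = 'ExampleBrowser/1.0 BPImageWalker/2.0';
uaListMatches('(mShots|BPImageWalker|CoverScout)', $ua, ' Image user agent. ', ' INSTA-BAN. ', $score, $why);
echo $score, $why, "\n";
```

Passing the score and reason by reference instead of through globals is just a choice that keeps the sketch self-contained; it is not a claim about how ZBblock or the real helper does it.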
If you use htaccess, you can pick out the same user agents and ban them there. I also block all agents that ZBblock bans by default. As you can see, BPImageWalker is identified as an "Image user agent". There are more...
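One thing to watch with hand-written lists like these: characters such as '.' and '(' are regex metacharacters, so an entry containing dots or parentheses only matches the literal text if those characters are escaped. A possible way around that, sketched below, is to keep the tokens as plain strings and let preg_quote do the escaping. The function name buildUaPattern, the '~' delimiter and the sample user agents are assumptions for illustration, not part of ZBblock.

```php
<?php
// Hypothetical helper (not part of ZBblock): build an alternation from literal
// tokens so that '.', '(' and ')' are matched as plain characters.
function buildUaPattern(array $tokens): string
{
    $quoted = array_map(
        function (string $t): string {
            // The second argument also escapes the chosen delimiter.
            return preg_quote($t, '~');
        },
        $tokens
    );
    return '~(' . implode('|', $quoted) . ')~';
}

$pattern = buildUaPattern(['3.01 PBWF (Win95)', 'muso.com', 'Mozilla/0.6 Beta']);

// Made-up sample user agents, for illustration only.
var_dump(preg_match($pattern, 'Mozilla/3.01 PBWF (Win95)')); // literal parentheses are matched
var_dump(preg_match($pattern, 'musoXcom crawler'));          // '.' no longer stands for any character
```

The trade-off is that entries which are meant to be regexes (anchors like ^ and $, for example) would need to stay in a separate, hand-escaped list.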