Click Official ELI Links
Get Help With Your Extortion Letter | ELI Phone Support | ELI Legal Representation Program
Show your support of the ELI website & ELI Forums through a PayPal Contribution. Thank you for supporting the ongoing fight and reporting of Extortion Settlement Demand Letters.

Author Topic: Another copyright bot: 80 legs.  (Read 10995 times)

lucia

  • Hero Member
  • *****
  • Posts: 767
Another copyright bot: 80 legs.
« on: July 08, 2012, 07:50:59 AM »
I've been blocking 80 legs for a long time because they crawl too aggressively. I read its blog -- it too is doing some copyright snooping (for fonts).

http://blog.80legs.com/2010/08/10/case-study-monotype-imaging

Their user agent is: Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620

You can try robots.txt first.  They claim to obey it.  (I can't say. I now have a dynamic robots.txt and just ban anything I would forbid.)
This is a distributed agent using a wide range of IPs. So you must block by user agent.
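A minimal sketch of blocking by user agent in Python, for a site served through a WSGI app rather than Apache. The blocked substrings come from the agent string quoted above; the middleware name and the plain 403 response are illustrative assumptions, not anything 80legs or a particular server documents.

```python
# Hypothetical WSGI middleware: refuse requests whose User-Agent contains
# a blocked substring. The substrings come from the 80legs agent string
# quoted above; everything else here is illustrative.
BLOCKED_UA_SUBSTRINGS = ("80legs", "008/")


def is_blocked_agent(user_agent):
    """Return True if the User-Agent header contains a blocked substring."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_UA_SUBSTRINGS)


class BlockBotsMiddleware:
    """Wrap a WSGI app and answer 403 Forbidden to blocked agents."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if is_blocked_agent(environ.get("HTTP_USER_AGENT")):
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everything else passes through to the wrapped application.
        return self.app(environ, start_response)
```

Substring matching is crude, so a short token such as "008/" could in principle catch an unrelated agent. Check your logs before relying on it.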

Robert Krausankas (BuddhaPi)

  • ELI Defense Team Member
  • Administrator
  • Hero Member
  • *****
  • Posts: 3354
    • ExtortionLetterInfo
Re: Another copyright bot: 80 legs.
« Reply #1 on: July 08, 2012, 10:59:18 AM »
Thanks for the heads up on this one. I'll block it via robots.txt as well as by user agent with .htaccess.

Most questions have already been addressed in the forums, get yourself educated before making decisions.

Any advice is strictly that, and anything I may state is based on my opinions, and observations.
Robert Krausankas

I have a few friends around here..

Moe Hacken

  • Sr. Member
  • ****
  • Posts: 465
  • We have not yet begun to hack
Re: Another copyright bot: 80 legs.
« Reply #2 on: July 08, 2012, 11:05:26 AM »
Trolling for font software has been happening for some time now. There are even large companies in the industry offering products to help businesses who use large collections of fonts protect themselves from unforced errors. Print shops, publishers and advertising agencies often share thousands of fonts on a LAN for their personnel to use and it can become very difficult to manage accidental misuse of a font.

Font licenses often include clauses that allow giving the printer a copy for the purpose of imagesetting the output to create offset print plates, but forbid the printer from using it for any other purpose. If a designer from the same print shop were to unwittingly (or knowingly) use the font to create a piece of artwork, the print shop would have committed an infringement, and the client who provided the font may also be liable for providing it and losing control of it. The innocent end-user client whose artwork was created by the printer may also be dragged into the mess.

It's important to note that typefaces are not copyrightable, but font software is. If you trace a font from just looking at it and create your own font that is similar, keep all your files to prove that you never used the provider's font software. If you're using Font Squirrel or something like it on your website, BE VERY CAREFUL. The fonts will be stored on your server and the scraper will find you.

80legs' claim that they respect robots.txt is fatuous. They don't legally have to, according to discussions we've had repeatedly, so they can claim they play nice all they want but the voice out of the other side of their mouth is that theirs is "the most powerful web crawler ever." Take that, PicScout!

So powerful, in fact, that Yelp filed a lawsuit against them in March for scraping the bejeezus out of Yelp's servers collecting data to sell to third parties for God knows what purpose, which could include font trolling. Here's the text of the lawsuit:

http://tinyurl.com/7dqtnkb

80legs and the font foundries have been hurting print shops and their clients for years now. This is incredibly short-sighted as they're killing their own client base. With serious offshore competition and the price of paper putting near-fatal financial strain on the whole of the U.S. printing industry, the last thing they need is for one of their providers to cruelly troll them for the last drop of blood left.

Here's some good legal advice about protecting yourself from font trolls:

http://intellectual-property.lawyers.com/intellectual-property-licensing/Company-Sues-Over-Unauthorized-Use-of-Its-Fonts.html
« Last Edit: July 08, 2012, 11:11:41 AM by Moe Hacken »
I'd rather die on my feet than live on my knees

SoylentGreen

  • Hero Member
  • *****
  • Posts: 1503
Re: Another copyright bot: 80 legs.
« Reply #3 on: July 08, 2012, 11:13:35 AM »
Great info here.
(Apparently, Riddick was quite the font thief.)

S.G.


Moe Hacken

  • Sr. Member
  • ****
  • Posts: 465
  • We have not yet begun to hack
Re: Another copyright bot: 80 legs.
« Reply #4 on: July 08, 2012, 01:08:31 PM »
Here's another part of the complaint that Yelp filed against 80legs which could be applied to PicScout as well:

Quote
13. Defendants have packaged and sold "crawl packages," or bundles of data that they gathered from websites through their crawlers. Defendants sell such packages despite the fact that they do not own any rights in the underlying data which they are selling and regardless of whether their access to the underlying data was authorized or prohibited.

In the complaints filed against porn trolls, another interesting point is raised. In California, you need to be licensed to act as a private investigator. PicScout and 80legs are indeed practicing as private investigators when they crawl people's servers looking for copyright infringements. It would be good to ask if they indeed have the proper licenses to act as such. This may not apply to other states, but the above argument should cover just about any state in the Union.
I'd rather die on my feet than live on my knees

Moe Hacken

  • Sr. Member
  • ****
  • Posts: 465
  • We have not yet begun to hack
Re: Another copyright bot: 80 legs.
« Reply #5 on: July 08, 2012, 01:20:22 PM »
Another interesting morsel of information from Yelp's complaint against 80legs:

Quote
17. Specifically, on or about August 14, 2009, Yelp added the user agent string "80bot" - the user agent string used by 80legs's crawler - to a list of "disallowed" robots on its robots.txt file. Yelp did so to ensure that neither 80legs, nor anyone affiliated with 80legs, would crawl the Yelp Site.

18. Unknown to Yelp, after Yelp instructed 80legs not to access the Yelp Site through revisions to its robots.txt file, Defendants began using a new user agent, "008", instead of the previous user agent "80bot". In November 2011, Yelp discovered that Defendants were packaging and selling Yelp's data that Defendants apparently were continuing to obtain from the Yelp Site. Specifically, Defendants were offering "Yelp Crawl Packages," or a "pre-configured live crawl" of the Yelp Site. Defendants described their product as a "crawl of listings and reviews on Yelp." Defendants were charging $700 per month to its customers for the "crawl packages." Defendants also offered to sell Yelp's archived data, at a price of $1,000 per million archived records.

It appears that 80legs' shape shifting is not limited to changing IP number blocks. That is one evil and mean-spirited troll-bot.
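Given the rename described in paragraph 18, a robots.txt that lists only one of the two names misses the other. Here is a rough Python sketch that uses the standard library's robots.txt parser to check that both names are covered. The file contents and the example.com URL are illustrative assumptions, and this only models a crawler that actually honors robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Disallow both the old ("80bot") and new ("008") agent names cited in the
# complaint. A crawler that renames itself again will not match either rule.
ROBOTS_TXT = """\
User-agent: 80bot
Disallow: /

User-agent: 008
Disallow: /

User-agent: *
Disallow:
"""


def allowed(agent, url="http://example.com/page.html"):
    """Ask Python's robots.txt parser whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for name in ("80bot", "008", "Googlebot"):
        print(name, "allowed" if allowed(name) else "disallowed")
```

Since robots.txt is only a request, it makes sense to pair it with a server-side block on the same names.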
I'd rather die on my feet than live on my knees

Greg Troy (KeepFighting)

  • ELI Defense Team Member
  • Administrator
  • Hero Member
  • *****
  • Posts: 1859
    • Yeah, We Do That.
Re: Another copyright bot: 80 legs.
« Reply #6 on: July 08, 2012, 03:14:37 PM »
Lucia, SG and/or Moe, would you guys be willing to consider doing something like Robert did with his definitive list of copyright trolls, but for user agents and IP addresses that need to be blocked? I'm searching through the different threads trying to find all the places where this was discussed and am finding it difficult to glean what I need out of the forum due to its size. From what I have seen, you all seem to be the most knowledgeable and probably already have a well-established list. If it were all contained in one thread, it would be relatively simple to search for that thread and get the information you need. It would also be nice, when we find anything new, to just be able to add it to the archive.

I'm not trying to put more work on anyone, but thought that you all might be the go-to people on this. Let me know what you think, and thanks!
Every situation is unique, any advice or opinions I offer are given for your consideration only. You must decide what is best for you and your particular situation. I am not a lawyer and do not offer legal advice.

--Greg Troy

Moe Hacken

  • Sr. Member
  • ****
  • Posts: 465
  • We have not yet begun to hack
Re: Another copyright bot: 80 legs.
« Reply #7 on: July 08, 2012, 03:33:26 PM »
lucia is the local expert on that topic. I'm just happy to be learning from her and Buddhapi, who also has some expertise to share.

I'm going to give lucia a little backlink nod and point to her site, where she posted some very valuable info about blocking PicScout. That should be included in the compendium Greg has suggested, which I think is a fine idea:

http://rankexploits.com/protect/2011/12/four-steps-to-slow-down-image-scrapers/

Can't wait for lucia's WordPress plugin to be ready. Hope this backlink helps the rankexploits.com rankings!  ;)
I'd rather die on my feet than live on my knees

lucia

  • Hero Member
  • *****
  • Posts: 767
Re: Another copyright bot: 80 legs.
« Reply #8 on: July 08, 2012, 04:02:35 PM »
Greg--
There will never be a "definitive list".  These guys move around. As you can see, 80 legs changed its user agent string name. They aren't the only one to do that. Also, the best method to block will depend on how your site is hosted. (Shared hosting? Dedicated server?)  It will also depend on other features.

I've come to the conclusion that you can't just try to block copyright bots. If you want to control bots, you have to think about broadening your goals and figuring out tradeoffs based on what sort of web site you are operating.  For example: If you are a veterinarian running a web page for the convenience of customers in a small town just outside of Omaha, Nebraska, your first step might be to block traffic from everything outside the US! This automatically catches Bezeqint in Israel.  After that, you could start worrying about crawlers etc.   When you are worrying about crawlers, you can do a pretty thorough job. But even so, you could never be sure that Picscout won't come by since companies will take out accounts on a variety of IPs.

My main solution has been to use ZBblock on my site hosted at Dreamhost. It writes the IPs it has blocked to a file, and I then have a script that reads the block list ZBblock creates and bans lots of nasties at Cloudflare (a free service which I now use for content delivery).
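A minimal Python sketch of the "read the block list" half of that workflow. The log format shown in the usage lines is hypothetical (the real ZBblock log layout is not given in this thread), and pushing the resulting list to Cloudflare is deliberately left out, since no API details appear here.

```python
import ipaddress
import re

# Loose pattern for dotted-quad candidates; ipaddress does the real validation.
IPV4_CANDIDATE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def extract_blocked_ips(log_lines, already_banned=()):
    """Return new, valid IPv4 addresses found in log_lines, in first-seen order."""
    seen = set(already_banned)
    new_ips = []
    for line in log_lines:
        for candidate in IPV4_CANDIDATE.findall(line):
            try:
                ip = str(ipaddress.IPv4Address(candidate))
            except ValueError:
                continue  # out-of-range octets, so not a real address
            if ip not in seen:
                seen.add(ip)
                new_ips.append(ip)
    return new_ips


if __name__ == "__main__":
    # Hypothetical log lines; adapt the parsing to whatever your blocker writes.
    sample = [
        "2012-07-08 blocked 203.0.113.7 UA=008/0.83",
        "2012-07-08 blocked 198.51.100.4 UA=unknown",
    ]
    print(extract_blocked_ips(sample))
```

Passing the addresses you have already banned as `already_banned` keeps repeat runs from resubmitting the same IPs.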

But really-- if you use PHP, one of the best things you can do is get ZBblock (http://www.spambotsecurity.com) and install that.  My setup is-- after all-- just my custom additions to the zillions of things Zaphod has designed his script to block.  Once you have ZBblock going, you still have to monitor your logs to notice suspicious things. BTW: He is adding many of the items in my custom signatures in his next update. He is very vigilant about finding new user agents-- and people report them to him.  Even if you can't use his script, downloading it and reading the signatures.inc file can give you lots of information on what to block.

I do need to get my plugin working-- but it's only for WordPress. It's mostly going to be useful for people on shared hosting who use PHP.  (The main difficulty is writing it so it's easy for people to install and use.) But I'm blocking all sorts of stuff.

Greg Troy (KeepFighting)

  • ELI Defense Team Member
  • Administrator
  • Hero Member
  • *****
  • Posts: 1859
    • Yeah, We Do That.
Re: Another copyright bot: 80 legs.
« Reply #9 on: July 08, 2012, 04:30:02 PM »
Lucia--

I do realize that there could never possibly be a definitive list. I was looking more for a single thread where we can place what we have already discussed, as well as anything new that develops, to make it easy to find. After reading your comments I see that trying to catch and/or stop all of these bots and crawlers is even more daunting than I imagined. Thank you for the link to ZBblock; I will download it and look at it this week.

That is awesome that Zaphod is including your custom signatures in his next update! You should be very proud! And I want you to know that we appreciate what you do here and in sharing your knowledge with us.

Every situation is unique, any advice or opinions I offer are given for your consideration only. You must decide what is best for you and your particular situation. I am not a lawyer and do not offer legal advice.

--Greg Troy

lucia

  • Hero Member
  • *****
  • Posts: 767
Re: Another copyright bot: 80 legs.
« Reply #10 on: July 08, 2012, 05:01:56 PM »
Quote
That is awesome that Zaphod is including your custom signatures in his next update! You should be very proud! And I want you to know that we appreciate what you do here and in sharing your knowledge with us.
Only in rare instances did he learn these from me. He's been finding user agents and adding them. So, he's finding some I found-- and more.  It's very rare that I find one before he does.

The forum over there is  not motivated by copyright issues. But they are motivated to prevent hacking and also save people bandwidth.    So lots of things get blocked merely because there is no benefit to a website owner to permit the bot.

For example: Think about 80 legs. It  might be visiting your site 10 times a second to find a copyright violation. It might be visiting because one of your competitors wants to learn something about your site -- to his advantage. It might be visiting because it lets people use it for free and someone who doesn't like you might decide to set it on you just to pester you. And so on.  So... why do you want to let this thing crawl?

Their blog suggests stupid things like: Maybe you are a blogger and you want advertisers to be able to serve your visitors better ads. Uhmm....My blog doesn't even have ads anyway. But suppose I did. If I had an account running banner ads with a particular ad agency, maybe that ad agency could tell me what spider he uses and I could let that spider visit.   Why should I want to let 80 legs crawl on the hypothetical theory that some ad agency somewhere in the world could make the ads I might deliver more responsive to my visitors?! 

There are all sorts of bots like that around. Plus-- the fact is-- no matter what they say they are doing, you don't really know what they are doing. But they are sucking your bandwidth. 

I'm at the point where if I can't figure out what a bot is, it's banned.  If they leave a link to a nonexistent web page? Banned. Web page is impossible to understand? Banned.  Web page says 'seo', 'reputation' etc? Banned.  Email to contact them? Banned until they answer the email.
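Those rules can be sketched as a small decision function in Python. The `BotInfo` fields and the keyword list are assumptions made up for illustration; they restate the rules above and are not anyone's actual blocking script.

```python
from dataclasses import dataclass
from typing import Optional

# Words on a bot's info page that trigger a ban, per the rules above.
SUSPECT_WORDS = ("seo", "reputation")


@dataclass
class BotInfo:
    # Text of the page the bot links to; None if the link is missing or dead.
    info_page_text: Optional[str] = None
    # True once you can tell what the bot is for after reading its page.
    purpose_understood: bool = False
    # None if no email was needed, False while waiting on a reply.
    contact_email_answered: Optional[bool] = None


def should_ban(bot):
    """Apply the 'when in doubt, ban' rules to what is known about a bot."""
    if bot.info_page_text is None:
        return True  # no page, or the link goes nowhere
    text = bot.info_page_text.lower()
    if any(word in text for word in SUSPECT_WORDS):
        return True  # naive substring match; it can over-match
    if not bot.purpose_understood:
        return True  # the page exists but does not explain the bot
    if bot.contact_email_answered is False:
        return True  # banned until they answer the email
    return False
```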

I have nothing against seo-- but lots of those bots are just voracious! 

But for an individual: When blocking, do think about what your business is. For lots of businesses, it might sometimes be useful to just block entire countries. For example: My hairdresser has a web site for convenience of customers. She could block everything outside the USA without harming herself at all!  On the other hand, as a blogger, I don't want to block everything outside the US. 

Greg Troy (KeepFighting)

  • ELI Defense Team Member
  • Administrator
  • Hero Member
  • *****
  • Posts: 1859
    • Yeah, We Do That.
Re: Another copyright bot: 80 legs.
« Reply #11 on: July 08, 2012, 06:36:15 PM »
Thank you, more good information.

Quote
I'm at the point where if I can't figure out what a bot is, it's banned.  If they leave a link to a nonexistent web page? Banned. Web page is impossible to understand? Banned.  Web page says 'seo', 'reputation' etc? Banned.  Email to contact them? Banned until they answer the email.

I am at the point where I am just starting to try and understand all this but I learn fast and should catch on before too long :)

I like the idea of blocking everything from outside the country. While the blog page on my website offers free DIY tips to homeowners, my actual business is limited to a three-county area, or about a 25-mile radius from my home. There's really no need for anything from outside the US to access my website.
Every situation is unique, any advice or opinions I offer are given for your consideration only. You must decide what is best for you and your particular situation. I am not a lawyer and do not offer legal advice.

--Greg Troy

lucia

  • Hero Member
  • *****
  • Posts: 767
Re: Another copyright bot: 80 legs.
« Reply #12 on: July 08, 2012, 08:02:52 PM »
Greg--
If you have a blog with DIY tips, the tips really are good, and you are hoping for links, you might want to limit to English-speaking countries.  It's a balance.

Or you could, say, block the countries that create the most hack/spam attempts and so on.  There are lots from the US-- but you aren't going to block the US. But you could get a list and go down:
Do you need anyone from the People's Republic of China visiting? No? Block.
Ukraine?
Thailand?
Israel?


I'd guess you might want to permit England, Canada, the US, Australia etc. Others: Block as you notice problems.
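A minimal Python sketch of a country allow-list along these lines. The IP ranges are documentation ranges with made-up country labels, purely for illustration. A real setup would look addresses up in a GeoIP database or let a service such as Cloudflare do it.

```python
import ipaddress

# Made-up mapping for illustration only: these are reserved documentation
# ranges, not real country allocations. Swap in a real GeoIP lookup.
EXAMPLE_RANGES = {
    "192.0.2.0/24": "US",
    "198.51.100.0/24": "CN",
    "203.0.113.0/24": "CA",
}

# Countries to permit; "GB" stands in for England here.
ALLOWED_COUNTRIES = {"US", "CA", "GB", "AU"}

_NETWORKS = [(ipaddress.ip_network(cidr), cc) for cidr, cc in EXAMPLE_RANGES.items()]


def country_of(ip):
    """Return the country code for ip from the example table, or None."""
    addr = ipaddress.ip_address(ip)
    for network, country in _NETWORKS:
        if addr in network:
            return country
    return None


def is_allowed(ip, block_unknown=False):
    """Allow visitors from ALLOWED_COUNTRIES; unknown IPs follow block_unknown."""
    country = country_of(ip)
    if country is None:
        return not block_unknown
    return country in ALLOWED_COUNTRIES
```

Whether to block addresses the lookup cannot place is the tradeoff discussed above. A local business might set `block_unknown=True`, while a blogger hoping for links probably would not.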

At Cloudflare, I moderate China... and believe it or not, Brazil! Agents faking Googlebot with Brazilian IPs hammer my blog. I don't know why, but they do!
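One common way to catch agents faking Googlebot is a reverse DNS lookup followed by a forward confirmation: the IP should resolve to a googlebot.com or google.com host name, and that name should resolve back to the same IP. A hedged Python sketch follows. The function name and the injectable lookup parameters are my own additions so the logic can be exercised without touching the network, and the default forward lookup handles IPv4 only.

```python
import socket

# Host name suffixes expected for genuine Googlebot addresses.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_real_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Check a claimed Googlebot IP by reverse DNS plus forward confirmation."""
    # Default to real DNS lookups; callers may inject fakes for testing.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
        if not host.lower().endswith(GOOGLE_SUFFIXES):
            return False  # the PTR record points somewhere else
        # The host name must resolve back to the same address.
        return ip in forward_lookup(host)
    except OSError:
        return False  # lookup failed, so treat the visitor as unverified
```

DNS lookups are slow, so in practice you would cache the verdict per IP rather than checking on every request.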

To some extent, you have to watch your logs.  But this is doable!

SoylentGreen

  • Hero Member
  • *****
  • Posts: 1503
Re: Another copyright bot: 80 legs.
« Reply #13 on: July 08, 2012, 08:06:03 PM »
I'm kind of hijacking the thread here.
But here's one that creeps me out a bit:

Cyveillance?
http://en.wikipedia.org/wiki/Cyveillance
http://secpriv.com/who-is-cyveillance-and-why-should-you-care

S.G.

Moe Hacken

  • Sr. Member
  • ****
  • Posts: 465
  • We have not yet begun to hack
Re: Another copyright bot: 80 legs.
« Reply #14 on: July 08, 2012, 08:53:12 PM »
S.G., that's not hijacking the thread at all. The thread is all about copyright bots and how to deal with them. Cyveillance has had people worried for some time now. I've read about them on a number of different civil rights advocacy websites.

The concern about bots, spiders and scrapers crashing into our servers to hose our bandwidth in order to collect content for third parties and whatever ends they may have in mind is widespread. It should be no surprise that an industry is emerging to fight them off. Here's one example of a company, which happens to be from Sweden, that offers software and services to block scraping:

http://blockscraping.com/

They don't mention a price structure, which usually means it's not cheap. This is basically an enterprise solution.

For the regular citizen web administrator, there's this tool:

http://antiscraper.com/faqs.aspx

This is along the lines of what lucia's plugin would do, but it's designed for a different kind of bad-bot. Since it can be customized, I guess one could make it block the crawlers we're interested in, such as PicScout and 80legs.

They do have a fee for using this, but it's a very reasonable $10 per year. This seems very useful for bloggers who have to deal with annoying content thieves. In the post-Panda-Penguin-Google rankings arena, content duplication has become very problematic and these hosers are putting the hurt on honest, hard-working bloggers by recklessly scraping and duplicating their original content.
« Last Edit: July 08, 2012, 09:19:25 PM by Moe Hacken »
I'd rather die on my feet than live on my knees

 
