ExtortionLetterInfo Forums

ELI Forums => Getty Images Letter Forum => Topic started by: Robert Krausankas (BuddhaPi) on April 11, 2013, 04:10:02 PM

Title: new image bot/spider/scraper
Post by: Robert Krausankas (BuddhaPi) on April 11, 2013, 04:10:02 PM
http://www.imagewitness.com/index.php

keep an eye out for a new scraper...

the domain belongs to:
Registrant:
   Matthew Johnson
   8-230 Clovelly Rd
   Clovelly
   Sydney, New South Wales 2031
   Australia

there is no info on the site regarding the bot that does the scanning, so I don't know if it abides by robots.txt. I'm doing more digging..probably will have to block them via ip address at the server level..just what we need, another bandwidth sucking asshat.
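
A minimal sketch of what "block them via ip address" amounts to, written as a plain Python check rather than as real server configuration. The blocklist entries are hypothetical (they are reserved documentation ranges, not the scraper's actual addresses), and `is_blocked` is an illustrative name.

```python
import ipaddress

# Hypothetical blocklist. These are reserved documentation ranges,
# standing in for whatever addresses the scraper turns out to use.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),    # a whole range
    ipaddress.ip_network("198.51.100.17/32"),  # a single address
]

def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # not a valid IP address, so nothing to match
    return any(addr in net for net in BLOCKED_NETWORKS)
```

A request handler would call `is_blocked(remote_addr)` and refuse the request when it returns True. In practice the same list usually lives in the web server or firewall configuration instead.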
Title: Re: new image bot/spider/scraper
Post by: Mulligan on April 11, 2013, 06:20:44 PM
That's pretty reasonable pricing. I wonder if this company's using Google's image search as the backbone? Is it possible to do that? With an API or something? See https://developers.google.com/image-search/
Title: Re: new image bot/spider/scraper
Post by: lucia on April 14, 2013, 05:14:27 PM
I'm sure it's technically feasible to jumpstart a search using Google. I don't know how much Google monitors use of its search engine by bots of any sort. But even if they do, someone with enough IPs and crawling slowly enough would likely be able to kickstart that way and gain a big advantage over human-powered search.

From the point of view of a person who pays for her own hosting, I would prefer the image scraper to use Google to kickstart a search rather than have them hunting around my site, looking for and loading any and every image back to eternity.
Title: Re: new image bot/spider/scraper
Post by: Robert Krausankas (BuddhaPi) on April 14, 2013, 05:42:36 PM
not to mention, that if they are using a google search, who the hell would block a user-agent with google in it??...could be a script that either uses google or maybe even tineye, and brings back results stripping that info out..
Title: Re: new image bot/spider/scraper
Post by: lucia on April 14, 2013, 09:57:05 PM
I check whether an IP is spoofing the Google user agent. If it is, I block it. It's easy to check because Google lets you run a reverse DNS lookup on its crawler IPs. But that's not the same as blocking Google.
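
The reverse DNS check mentioned above can be sketched roughly as follows. This is an illustration, not anyone's production code: it assumes genuine Google crawler hostnames end in `googlebot.com` or `google.com`, and the function name and injectable resolver parameters are hypothetical (they exist so the logic can be exercised without network access).

```python
import socket

# Assumption: genuine Google crawler hostnames end in one of these suffixes.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_real_googlebot(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for an IP claiming to be Google."""
    try:
        hostname = reverse(ip)[0]          # step 1: IP -> hostname (PTR lookup)
    except OSError:
        return False                       # no reverse record at all
    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False                       # hostname is not under a Google domain
    try:
        addresses = forward(hostname)[2]   # step 2: hostname -> list of IPs
    except OSError:
        return False
    return ip in addresses                 # step 3: must map back to the same IP
```

A visitor whose user agent says "Googlebot" but whose IP fails this check is spoofing. The forward lookup matters because anyone who controls the reverse zone for their own IP range can make it return any hostname they like.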

Blocking bots that spoof Google doesn't let me block anyone searching images cached by Google. Once Google has cached an image, the scraping is between Google and the person scraping Google.

Some people might block google's image bot. It's possible to do without blocking the crawler that does non-image pages. Blocking the image crawler is probably neutral for lots of web sites because google search doesn't bring that much traffic for most blogs or sites. But some blogs and sites do get traffic from google search, so they won't want to block it.
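
To illustrate blocking the image crawler while leaving the regular crawler alone, here is a small sketch that feeds a two-line robots.txt to Python's standard `urllib.robotparser`. `Googlebot-Image` is the user agent token Google uses for its image crawler. The `allowed` helper and the sample paths are hypothetical, and this only shows how a well-behaved crawler would read the rules. It does nothing to stop a bot that ignores robots.txt.

```python
from urllib.robotparser import RobotFileParser

# robots.txt that turns away only the image crawler.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /
"""

def allowed(user_agent, path, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

print(allowed("Googlebot-Image", "/images/photo.jpg"))
print(allowed("Googlebot", "/index.html"))
```

The first call reports that the image crawler is turned away everywhere, and the second that the regular crawler is still welcome, since no rule names it.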
Title: Re: new image bot/spider/scraper
Post by: Oscar Michelen on April 18, 2013, 08:08:55 PM
What the f*ck are you people talking about?  :o
Title: Re: new image bot/spider/scraper
Post by: Greg Troy (KeepFighting) on April 18, 2013, 08:12:39 PM
LOL,  thanks Oscar, that post gave me a much needed laugh! ;D  I'm mostly lost too.

Quote from: Oscar Michelen on April 18, 2013, 08:08:55 PM
What the f*ck are you people talking about?  :o
Title: Re: new image bot/spider/scraper
Post by: lucia on April 19, 2013, 08:58:15 AM
Ahh... Oscar. The conversation evolved.


Robert and I began by talking about methods an image search company could use to search for images that have been copied.

Two methods are being discussed. Here are the methods in a hypothetical.

Suppose I have 10 images on my site. Company X is hired by photographer Y to discover whether I am violating Y's copyright by hosting his image.

Method I: Company X can create a 'bot' (aka computer program or script) that comes to my site and tries to load every file on my server, possibly copying each to its own server. This is generally called 'scraping'. When it finds an image, it compares that image to photographer Y's image. If there is a match, it reports the URI for that image to photographer Y.

Method II: Company X can create a bot that goes to Google's image search page and scours all the images there. When it finds a match to photographer Y's image, the bot is programmed to click over to the offending site (in this case mine). After clicking over, it identifies the URI and reports that to photographer Y. This method will permit Company X to find images on my site provided I have been permitting Google to crawl my images and save them to Google's server. (Note: I have changed verbs from "scrape" to "crawl". The two actions could be described as being 'exactly the same'; scraping could be defined as "unwelcome crawling". Sort of like "Stop your pawing!", "Cara mia! That was a caress!")
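
Method I can be sketched in a few lines of Python. Everything here is a simplified illustration: `fetch` is a hypothetical callable that downloads a URL and returns its bytes, and matching is done by exact SHA-256 hash, which only catches byte-identical copies. A real image-matching service would presumably use some form of visual fingerprinting that survives resizing and recompression, and nothing in this thread says how any particular company does it.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImageLinkParser(HTMLParser):
    """Collect the src attribute of every <img> tag on a page."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)

def find_matches(page_url, page_html, fetch, reference_bytes):
    """Return URIs of images on the page whose bytes equal the reference image.

    fetch is a hypothetical callable: URL -> bytes.
    """
    target = hashlib.sha256(reference_bytes).hexdigest()
    parser = ImageLinkParser()
    parser.feed(page_html)
    matches = []
    for src in parser.sources:
        uri = urljoin(page_url, src)   # resolve relative links against the page
        data = fetch(uri)              # each image is downloaded before comparing
        if hashlib.sha256(data).hexdigest() == target:
            matches.append(uri)        # this is the URI reported to photographer Y
    return matches
```

Note that every image on the page costs the site owner one download, whether or not it matches. That is the server cost that makes Method I unattractive from the host's point of view.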

Because Method I consumes lots of my server resources (my $$), I prefer that company X use Method II. That would mean they consume lots of Google's server resources (Google's $$). As it happens, Method II might also consume fewer of company X's resources, because they know that everything at Google Images is an image and they don't have to crawl through lots of non-image material to find image links. So company X might want to do this. Method II is what Mulligan was suggesting when he kicked this off with:

Quote from: Mulligan on April 11, 2013, 06:20:44 PM
I wonder if this company's using Google's image search as the backbone? Is it possible to do that?

Method II would be "using Google's image search as the backbone" of company X's system. But now let's turn to Mulligan's second question: Is it possible?

I'm pretty sure Google wouldn't like company X to use Google's search in quite this way, both because (a) it consumes Google's server resources and (b) crawlers don't click advertising links and so don't make Google any money.

So, Google would likely be motivated to take steps to make it difficult for companies of this nature to use Google search in quite this way. These steps might involve having its legal eagles write a TOS that prohibits the behavior, or it might involve using technology to notice the behavior and prevent or throttle it. The first would involve people like you writing a TOS, but it would be toothless if programmers and coders didn't do something to notice the behavior or collect evidence. And if they can notice or identify the behavior, they are likely to try to prevent it or throttle it. And Google is chock-full of people who know how to code. So, I would bet they use code to inhibit Company X from using the image search as a backbone. That said: it's impossible to use technology to entirely prevent a determined party from scraping a public-facing resource while still permitting public access, so it's possible Company X does it nonetheless.
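
As a generic illustration of "notice the behavior and throttle it", here is a sketch of a per-IP sliding-window rate limiter. It is not a description of what Google actually does; the class name, the limits, and the injectable `clock` parameter are all hypothetical.

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock                   # injectable so it can be tested
        self.hits = defaultdict(deque)       # client IP -> recent request times

    def allow(self, client_ip):
        now = self.clock()
        recent = self.hits[client_ip]
        # Forget requests that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False                     # over the limit: throttle
        recent.append(now)
        return True
```

Because the counting is per IP, a scraper that spreads its requests over many addresses and crawls slowly stays under the limit on each one. That is why this kind of defense slows a determined party down without stopping it.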

So, that was the first part of the conversation.   

After that, Robert switched to: Who would block Google from crawling? And I told him some people might block Google from crawling images and explained why they might do so. Robert likely understands my fuller point because he already knows that Google names its crawlers and has different ones. Their image crawler is separate from their text crawler. So if I like, I can block Google's image crawler but permit the text crawler. One reason I might block the former and not the latter is that I get very little 'good' traffic from image searches. I get lots of 'good' traffic from the text searches. (I'm the one who defines what's good from my point of view.)

So.. I think that's what the F we were talking about. :)

Title: Re: new image bot/spider/scraper
Post by: Robert Krausankas (BuddhaPi) on April 19, 2013, 09:07:06 AM
as a rule of thumb I generally do block google's image crawler, for the exact reason Lucia states, I get no valuable traffic on my site due to images...copyright-trolls.com however is different, I do allow google to index those images...clearly they get tagged, captioned and have alt text...simply doing an image search for Timothy B McCormack yields some very good results....
Title: Re: new image bot/spider/scraper
Post by: Jerry Witt (mcfilms) on April 19, 2013, 11:35:52 AM
Although I salute the effort Robert and Lucia put into blocking these bots, I really don't see it worth the effort. I was convinced the first time I tried to visit Lucia's site and my IP was blocked. I can't afford to start blocking potential customers, so I knew this wasn't for me.

Besides, I know if I were designing some sort of "image scanning spider" I would make it possible for it to crawl the site under a proxy and perhaps spoof the user agent. And if that was rejected, I'd simply have a human go poke around.

To me the best course of action is to simply not use images that aren't yours to use.

Are both of you really losing money to bots that are hogging up bandwidth? Because the hosting I usually have includes far more available bandwidth than the tiny slice that is used by bots.
Title: Re: new image bot/spider/scraper
Post by: Robert Krausankas (BuddhaPi) on April 19, 2013, 11:43:07 AM
Quote from: Jerry Witt (mcfilms) on April 19, 2013, 11:35:52 AM
Although I salute the effort Robert and Lucia put into blocking these bots, I really don't see it worth the effort. I was convinced the first time I tried to visit Lucia's site and my IP was blocked. I can't afford to start blocking potential customers, so I knew this wasn't for me.

Besides, I know if I were designing some sort of "image scanning spider" I would make it possible for it to crawl the site under a proxy and perhaps spoof the user agent. And if that was rejected, I'd simply have a human go poke around.

To me the best course of action is to simply not use images that aren't yours to use.

Are both of you really losing money to bots that are hogging up bandwidth? Because the hosting I usually have includes far more available bandwidth than the tiny slice that is used by bots.

no Jerry it's not costing me anything worth discussing, and trust me I spend very little time blocking anything, I have better things to do..and trust me I use only "good" images. as a means of protecting my customers, I generally as a matter of course block the image bot. if it will save me a headache and them calling me saying "I got letter wut do" it's worth the 5 minutes it takes me...I've also resorted to having clients sign a hold harmless agreement if they supply images to me...I can't tell you how many times I've explained in very simple terms that they need to use licensed images or their own. I direct them to pond, and invariably they send me images from google searches..and I call them out on it, they then get sent the hold harmless and they are on their own..becomes frustrating sometimes to say the least.
Title: Re: new image bot/spider/scraper
Post by: stinger on April 19, 2013, 11:56:25 AM
Lucia, I have a question about method I that you describe.

If Getty comes to my site through picscout, and copies my copyrighted images onto their server so they can then scan them to determine if they are theirs, is that act of copying my images a copyright infringement by picscout?  Or do they have to publish something first?  It seems like they have taken something that is not theirs and made a copy of it.

If it is infringement, one would have to find a way to prove that this is happening.  I don't want to use the words S.G. hates, but could a poorly designed picscout open Getty up to some sort of large scale action?  They are certainly not teaching or commenting on my photos.

If someone steals a priceless painting and doesn't display it, they are still guilty of theft if caught.  Is that also true with digital images?
Title: Re: new image bot/spider/scraper
Post by: Robert Krausankas (BuddhaPi) on April 19, 2013, 12:26:08 PM
Quote from: stinger on April 19, 2013, 11:56:25 AM
Lucia, I have a question about method I that you describe.

If Getty comes to my site through picscout, and copies my copyrighted images onto their server so they can then scan them to determine if they are theirs, is that act of copying my images a copyright infringement by picscout?  Or do they have to publish something first?  It seems like they have taken something that is not theirs and made a copy of it.

If it is infringement, one would have to find a way to prove that this is happening.  I don't want to use the words S.G. hates, but could a poorly designed picscout open Getty up to some sort of large scale action?  They are certainly not teaching or commenting on my photos.

If someone steals a priceless painting and doesn't display it, they are still guilty of theft if caught.  Is that also true with digital images?

no it would not be infringement, and picscout operates out of Israel, they don't follow US law
Title: Re: new image bot/spider/scraper
Post by: stinger on April 19, 2013, 02:02:53 PM
Thanks, Robert.
Title: Re: new image bot/spider/scraper
Post by: lucia on April 19, 2013, 04:27:16 PM
Quote from: Jerry Witt (mcfilms) on April 19, 2013, 11:35:52 AM
Although I salute the effort Robert and Lucia put into blocking these bots, I really don't see it worth the effort. I was convinced the first time I tried to visit Lucia's site and my IP was blocked. I can't afford to start blocking potential customers, so I knew this wasn't for me.

I've gotten better at not blocking people. But you are correct that with respect to the copyright extortion issue, the correct practice is not to use images that don't belong to you. 

The main reason I block bots is that I see lots of bots that do nothing but raise my server costs. These aren't necessarily image bots; in fact most are not. This is a problem for a hobby blog that generates $0 by design and has absolutely, positively no revenue model. If I block someone, I block someone. It's not desirable, but I don't lose customers or revenue.

I've gotten better at figuring out how to block the non-image bots too. But once again: in my case, it's acceptable to sometimes block people. That likely is not the case for your business.

Title: Re: new image bot/spider/scraper
Post by: lucia on April 19, 2013, 04:34:26 PM
Quote from: stinger on April 19, 2013, 11:56:25 AM
Lucia, I have a question about method I that you describe.

If Getty comes to my site through picscout, and copies my copyrighted images onto their server so they can then scan them to determine if they are theirs, is that act of copying my images a copyright infringement by picscout?  Or do they have to publish something first?  It seems like they have taken something that is not theirs and made a copy of it.

First, that's a legal question, and my answer to the legal question is "I don't know". Oscar might know whether their copying onto their server violates copyright or, failing that, whether someone alleging copyright infringement would have a colorable case. I assume picscout would claim fair use for the copy. They might win. Oscar would be the one who could speculate intelligently about that.
Second: They may not load the images onto their server.
Third: This might be largely hypothetical, since you don't know whether a visit results in their making a copy to their server. They may merely compare.

Quote from: stinger on April 19, 2013, 11:56:25 AM
If it is infringement, one would have to find a way to prove that this is happening.

Subpoena? Discovery after they sue you? Those are the only legal methods I can think of. The illegal ones would be "cracking into their computer". I'm not sure how you would bring that evidence forward.

Quote from: stinger on April 19, 2013, 11:56:25 AM
I don't want to use the words S.G. hates, but could a poorly designed picscout open Getty up to some sort of large scale action?  They are certainly not teaching or commenting on my photos.

Once again: Legal question. Worse: a complicated fair use question about a transformative use. (Google gets to copy and cache. I think that's been deemed fair use.)

Quote from: stinger on April 19, 2013, 11:56:25 AM
If someone steals a priceless painting and doesn't display it, they are still guilty of theft if caught.  Is that also true with digital images?

Well.... Getty seems to say it applies to digital images. :)
Title: Re: new image bot/spider/scraper
Post by: stinger on April 20, 2013, 11:16:06 AM
Thanks for those thoughts, Lucia.
Title: Re: new image bot/spider/scraper
Post by: Oscar Michelen on April 27, 2013, 03:42:25 PM
To my knowledge, Getty does not download every picture it finds onto its servers. PicScout does the search without having to do that. Looking at images to determine if they are owned or licensed by Getty would not constitute infringement, as it is not really a "use", never mind whether it's a fair use.
Title: Re: new image bot/spider/scraper
Post by: Engel Nyst on September 21, 2016, 12:09:42 PM
I know it's an old topic, just a note about the technical matter here.

PicScout can't look at images without copying them on their servers. It has to download the images.

For example, the technical documentation of TinEye, a similar software, says:
https://services.tineye.com/developers/matchengine/methods/compare.html
Quote
If you are comparing by image or filepath then operations are performed in the order in which they are received. If you are comparing by URL then the images are downloaded before the operations go into the queue.

The "operations" may also involve (additional) copying into RAM, but we can ignore that. In order to operate however, all these bots/crawlers first download the images.

I'm not questioning that it's fair use, though; this is just about the technical aspect of it.
Title: Re: new image bot/spider/scraper
Post by: Matthew Chan on October 03, 2016, 07:10:38 AM
We rarely get into the technical side or detail of how Tineye or Picscout works mostly because it hasn't come up before nor has the discussion been at the forefront.  As such, I don't think most of us at ELI have done research on the matter.  However, it does make for an interesting discussion of how it all works in detail.

But as always, Engel Nyst provides good feedback on the matter.

In particular, I found this page helpful in describing how Tineye works, which I would extrapolate to how Picscout works.

https://services.tineye.com/developers/matchengine/what.html