Author Topic: new image bot/spider/scraper (Read 10837 times)

Robert Krausankas (BuddhaPi) · « **on:** April 11, 2013, 04:10:02 PM »

http://www.imagewitness.com/index.php

keep an eye out for a new scraper...

the domain belongs to:
Registrant:
Matthew Johnson
8-230 Clovelly Rd
Clovelly
Sydney, New South Wales 2031
Australia

There is no info on the site about the bot that does the scanning, so I don't know if they abide by robots.txt. I'm doing more digging... probably will have to block them by IP address at the server level. Just what we need, another bandwidth-sucking asshat.

Mulligan · « **Reply #1 on:** April 11, 2013, 06:20:44 PM »

That's pretty reasonable pricing. I wonder if this company's using Google's image search as the backbone? Is it possible to do that? With an API or something? See https://developers.google.com/image-search/

lucia · « **Reply #2 on:** April 14, 2013, 05:14:27 PM »

I'm sure it's technically feasible to jumpstart a search using Google. I don't know how much Google monitors use of its search engine by bots of any sort. But even if they do, someone with enough IPs and crawling slowly enough would likely be able to kickstart that way and gain a big advantage over human-powered search.

From the point of view of a person who pays for her own hosting, I would prefer the image scraper to use Google to kickstart a search rather than have them hunting around my site, looking for and loading any and every image back to eternity.

Robert Krausankas (BuddhaPi) · « **Reply #3 on:** April 14, 2013, 05:42:36 PM »

Not to mention that if they are using a Google search, who the hell would block a user-agent with Google in it? Could be a script that uses either Google or maybe even TinEye and brings back the results, stripping that info out.

lucia · « **Reply #4 on:** April 14, 2013, 09:57:05 PM »

I check whether an IP is spoofing the Google referrer. If it is, I block it. It's easy to check because Google lets you run a reverse DNS lookup. But that's not the same as blocking Google.
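For anyone wondering what that reverse DNS check looks like, here is a minimal sketch in Python. It is not lucia's actual setup, only one way to do it: look up the hostname for the IP, confirm it ends in a Google crawler domain, then resolve that hostname forward and confirm it points back to the same IP. The function name `is_real_google_crawler`, the injectable lookup parameters, and the suffix list are assumptions made for illustration. The forward lookup used here only covers IPv4.

```python
import socket

# Hostname suffixes assumed for Google's crawlers (an assumption for this sketch).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_real_google_crawler(ip, reverse_lookup=None, forward_lookup=None):
    """Return True only if `ip` reverse-resolves to a Google hostname
    and that hostname forward-resolves back to the same IP."""
    # The lookups are injectable so the logic can be exercised without a network.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
        if not host.endswith(GOOGLE_SUFFIXES):
            return False  # reverse DNS does not point at Google at all
        # Forward-confirm: anyone can fake a PTR record for their own IP range.
        return ip in forward_lookup(host)
    except OSError:
        # No DNS record or lookup failure: treat as not verified.
        return False
```

Called as `is_real_google_crawler(visitor_ip)`, it does live DNS lookups, so in practice the result would probably be cached per IP rather than checked on every request.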

Blocking bots that spoof Google doesn't let me block anyone searching images cached by Google. Once Google has cached them, the scraping is between Google and the person scraping Google.

Some people might block google's image bot. It's possible to do without blocking the crawler that does non-image pages. Blocking the image crawler is probably neutral for lots of web sites because google search doesn't bring that much traffic for most blogs or sites. But some blogs and sites do get traffic from google search, so they won't want to block it.

Oscar Michelen · « **Reply #5 on:** April 18, 2013, 08:08:55 PM »

What the f*ck are you people talking about?

Greg Troy (KeepFighting) · « **Reply #6 on:** April 18, 2013, 08:12:39 PM »

LOL, thanks Oscar, that post gave me a much needed laugh!

I'm mostly lost too.

Quote from: Oscar Michelen on April 18, 2013, 08:08:55 PM

What the f*ck are you people talking about?

lucia · « **Reply #7 on:** April 19, 2013, 08:58:15 AM »

Ahh... Oscar. The conversation evolved.

Robert and I began by talking about methods an image search company could use to search for images that have been copied.

Two methods are being discussed. Here are the methods in a hypothetical.

Suppose I have 10 images on my site. Company X is hired by photographer Y to discover whether I am violating Y's copyright by hosting his image.

Method I: Company X can create a 'bot' (aka a computer program or script) that comes to my site and tries to load every file on my server, possibly copying each to its own server. This is generally called 'scraping'. When it finds an image, it compares that image to photographer Y's image. If there is a match, it reports the URI for that image to photographer Y.

Method II: Company X can create a bot that goes to Google's image search page and scours all those images. When it finds a match to photographer Y's image, the bot is programmed to click over to the offending site (in this case mine). After clicking over, it identifies the URI and reports that to photographer Y. This method will permit Company X to find images on my site provided I have been permitting Google to crawl my images and save those to Google's server. (Note, I have changed verbs from "scrape" to "crawl". The two actions could be described as being 'exactly the same'; scraping could be defined as "unwelcome crawling". Sort of like "Stop your pawing!", "Cara mia! That was a caress!")
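Both methods end in the same step: compare a found image against photographer Y's image and report the URI on a match. Nothing in this thread says how any real company does that comparison. As a rough sketch, one common approach is an "average hash": reduce each image to a small grayscale grid, mark each pixel as above or below the mean, and count how many bits differ. The function names, the `max_distance` threshold, and the example URIs below are all hypothetical, and the grids stand in for images that have already been downscaled.

```python
def average_hash(pixels):
    """pixels: 2D list of grayscale values (0-255). Returns a tuple of 0/1 bits."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    # Each bit records whether a pixel is brighter than the image's mean.
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(hash_a, hash_b):
    """Number of positions where the two hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def find_matches(reference, candidates, max_distance=1):
    """candidates: dict mapping URI -> pixel grid.
    Returns the URIs whose hash is within max_distance bits of the reference."""
    ref_hash = average_hash(reference)
    return [
        uri
        for uri, pixels in candidates.items()
        if hamming_distance(ref_hash, average_hash(pixels)) <= max_distance
    ]


# Photographer Y's image, and two images found on a site (tiny 2x2 stand-ins).
reference = [[10, 200], [220, 30]]
found = {
    "http://example.com/copy.jpg": [[20, 210], [230, 40]],   # same picture, brightened
    "http://example.com/other.jpg": [[200, 10], [30, 220]],  # different picture
}
matches = find_matches(reference, found)
```

The brightened copy still matches because the hash depends on which pixels are above the mean, not on their absolute values. A real matcher would need to cope with cropping, resizing and watermarks, which this sketch does not attempt.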

Because Method I consumes lots of my server resources (my $$), I prefer company X use Method II. That would mean they consume lots of Google's server resources (Google's $$). As it happens, Method II might also consume fewer of company X's resources because they know that everything at Google images is an image and they don't have to crawl through lots of non-image material to find image links. So company X might want to do this. Method II is what Mulligan was suggesting when he kicked this off with:

I wonder if this company's using Google's image search as the backbone? Is it possible to do that?

Method II would be "using Google's image search as the backbone" of company X's system. But now let's turn to Mulligan's second question: Is it possible?

I'm pretty sure Google wouldn't like company X to use Google's search in quite this way, both because (a) it consumes Google's server resources and (b) crawlers don't click advertising links and so don't make Google any money.

So, Google would likely be motivated to take steps to make it difficult for companies of this nature to use Google search in quite this way. These steps might involve having its legal eagles write a TOS that prohibits the behavior, or it might involve using technology to notice the behavior and prevent or throttle it. The first would involve people like you writing a TOS, but it would be toothless if programmers and coders didn't do something to notice the behavior or collect evidence. And if they can notice or identify the behavior, they are likely to try to prevent it or throttle it. And Google is chock-full of people who know how to code. So, I would bet they use code to inhibit Company X from using the image search as a backbone. That said: it's impossible to use technology to entirely prevent a determined party from scraping a public-facing resource while still permitting public access, so it's possible Company X does it nonetheless.

So, that was the first part of the conversation.

After that, Robert switched to: Who would block Google from crawling? And I told him some people might block Google from crawling images and explained why they might do so. Robert likely understands my fuller point because he already knows that Google names its crawlers and has different ones. Their image crawler is separate from their text crawler. So if I like, I can block Google's image crawler but permit the text crawler. One reason I might block the former and not the latter is that I get very little 'good' traffic from image searches. I get lots of 'good' traffic from the text searches. (I'm the one who defines what's good from my point of view.)
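To make the "block the image crawler but permit the text crawler" idea concrete, here is a small sketch. The robots.txt content is an example, not the file from any site in this thread; `Googlebot-Image` and `Googlebot` are the user-agent tokens Google publishes for its image and main crawlers. Python's standard `urllib.robotparser` is used only to show how a rule-following crawler would read the file; the paths are made up.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: shut out the image crawler, leave everyone else alone.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /

User-agent: *
Disallow:
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The image crawler is refused everywhere...
image_allowed = parser.can_fetch("Googlebot-Image", "/images/photo.jpg")
# ...while the text crawler falls through to the catch-all group and is allowed.
text_allowed = parser.can_fetch("Googlebot", "/blog/post.html")
```

This only affects crawlers that choose to honor robots.txt. A scraper that ignores the file, or lies about its user agent, has to be dealt with some other way, such as blocking by IP at the server level.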

So.. I think that's what the F we were talking about.

Robert Krausankas (BuddhaPi) · « **Reply #8 on:** April 19, 2013, 09:07:06 AM »

As a general rule of thumb I do block Google's image crawler, for the exact reason Lucia states: I get no valuable traffic on my site from images. copyright-trolls.com, however, is different. I do allow Google to index those images. Clearly they get tagged, captioned and have alt text. Simply doing an image search for Timothy B McCormack yields some very good results...

Jerry Witt (mcfilms) · « **Reply #9 on:** April 19, 2013, 11:35:52 AM »

Although I salute the effort Robert and Lucia put into blocking these bots, I really don't see it worth the effort. I was convinced the first time I tried to visit Lucia's site and my IP was blocked. I can't afford to start blocking potential customers, so I knew this wasn't for me.

Besides, I know if I were designing some sort of "image scanning spider" I would make it possible for it to crawl the site under a proxy and perhaps spoof the user agent. And if that was rejected, I'd simply have a human go poke around.

To me the best course of action is to simply not use images that aren't yours to use.

Are both of you really losing money to bots that are hogging up bandwidth? Because the hosting I usually have includes far more available bandwidth than the tiny slice that is used by bots.

Robert Krausankas (BuddhaPi) · « **Reply #10 on:** April 19, 2013, 11:43:07 AM »

Quote from: Jerry Witt (mcfilms) on April 19, 2013, 11:35:52 AM

Although I salute the effort Robert and Lucia put into blocking these bots, I really don't see it worth the effort. I was convinced the first time I tried to visit Lucia's site and my IP was blocked. I can't afford to start blocking potential customers, so I knew this wasn't for me.

Besides, I know if I were designing some sort of "image scanning spider" I would make it possible for it to crawl the site under a proxy and perhaps spoof the user agent. And if that was rejected, I'd simply have a human go poke around.

To me the best course of action is to simply not use images that aren't yours to use.

Are both of you really losing money to bots that are hogging up bandwidth? Because the hosting I usually have includes far more available bandwidth than the tiny slice that is used by bots.

No Jerry, it's not costing me anything worth discussing, and trust me, I spend very little time blocking anything; I have better things to do. And trust me, I use only "good" images. As a means of protecting my customers, I generally block the image bot as a matter of course. If it will save me a headache and them calling me saying "I got letter wut do", it's worth the 5 minutes it takes me. I've also resorted to having clients sign a hold harmless agreement if they supply images to me. I can't tell you how many times I've explained in very simple terms that they need to use licensed images or their own. I direct them to pond, and invariably they send me images from Google searches. I call them out on it, they then get sent the hold harmless, and they are on their own. It becomes frustrating sometimes, to say the least.

stinger · « **Reply #11 on:** April 19, 2013, 11:56:25 AM »

Lucia, I have a question about method I that you describe.

If Getty comes to my site through picscout, and copies my copyrighted images onto their server so they can then scan them to determine if they are theirs, is that act of copying my images a copyright infringement by picscout? Or do they have to publish something first? It seems like they have taken something that is not theirs and made a copy of it.

If it is infringement, one would have to find a way to prove that this is happening. I don't want to use the words S.G. hates, but could a poorly designed picscout open Getty up to some sort of large scale action? They are certainly not teaching or commenting on my photos.

If someone steals a priceless painting and doesn't display it, they are still guilty of theft if caught. Is that also true with digital images?

Robert Krausankas (BuddhaPi) · « **Reply #12 on:** April 19, 2013, 12:26:08 PM »

Quote from: stinger on April 19, 2013, 11:56:25 AM

Lucia, I have a question about method I that you describe.

If Getty comes to my site through picscout, and copies my copyrighted images onto their server so they can then scan them to determine if they are theirs, is that act of copying my images a copyright infringement by picscout? Or do they have to publish something first? It seems like they have taken something that is not theirs and made a copy of it.

If it is infringement, one would have to find a way to prove that this is happening. I don't want to use the words S.G. hates, but could a poorly designed picscout open Getty up to some sort of large scale action? They are certainly not teaching or commenting on my photos.

If someone steals a priceless painting and doesn't display it, they are still guilty of theft if caught. Is that also true with digital images?

No, it would not be infringement, and PicScout operates out of Israel; they don't follow US law.

stinger · « **Reply #13 on:** April 19, 2013, 02:02:53 PM »

Thanks, Robert.

lucia · « **Reply #14 on:** April 19, 2013, 04:27:16 PM »

Quote from: Jerry Witt (mcfilms) on April 19, 2013, 11:35:52 AM

Although I salute the effort Robert and Lucia put into blocking these bots, I really don't see it worth the effort. I was convinced the first time I tried to visit Lucia's site and my IP was blocked. I can't afford to start blocking potential customers, so I knew this wasn't for me.

I've gotten better at not blocking people. But you are correct that with respect to the copyright extortion issue, the correct practice is not to use images that don't belong to you.

The main reason I block bots is that I see lots of bots that do nothing but raise my server costs. These aren't necessarily image bots; in fact most are not. This is a problem for a hobby blog that generates $0 by design and has absolutely positively no revenue model. If I block someone, I block someone. It's not desirable, but I don't lose customers or revenue.

I've gotten better at figuring out how to block the non-image bots too. But once again: in my case, it's acceptable to sometimes block people. That likely is not the case for your business.
