I want to stop a moment and explain how a trove of blog "snapshots" (i.e., full copies of a blog's HTML) displayed by a third party like Newsblur can be used by a PicScout / Getty / image scraper / copyright troll to efficiently find and inspect images. There are two ways I thought of relatively quickly, and there may be more. Both methods would be much easier to code and implement than anything a bot can do absent a trove of conveniently supplied "snapshots", and would save the bot a lot of time.
If I were a copyright troll running an image bot, I would do this right now:
Method 1) Write a crawler to progressively visit copies of sites at addresses like http://newsblur.com/reader/page/1. (Give it a try.)
I would compare each image at that site to my trove of images. That would essentially show my bot the header and "decoration" images that display in any blog plus any images that happen to be displayed on the top page on the day my bot visited.
After visiting page 1, now load http://newsblur.com/reader/page/2, and so on. (My site number is up above 1,000,000, so you'll want a bot. But it really is a trove.)
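To make Method 1 concrete, here is a minimal sketch in Python of the kind of crawler I mean. The KNOWN_IMAGE_HASHES set and the exact-hash comparison are stand-ins I made up for illustration; a real scraper like PicScout would use perceptual image matching, but the page-number enumeration is the point:

```python
import hashlib
import re
import urllib.request
from urllib.parse import urljoin

# Hypothetical stand-in for the troll's image library: hashes of the
# images they claim to own. A real scraper would use perceptual matching
# rather than exact byte hashes; this is only an illustration.
KNOWN_IMAGE_HASHES = set()

IMG_SRC = re.compile(r'<img[^>]+src="([^"]+)"', re.IGNORECASE)

def scan_snapshot(n):
    """Fetch one numbered snapshot page and check every image on it."""
    page_url = "http://newsblur.com/reader/page/%d" % n
    html = urllib.request.urlopen(page_url).read().decode("utf-8", "replace")
    for src in IMG_SRC.findall(html):
        img_url = urljoin(page_url, src)  # resolve relative image paths
        data = urllib.request.urlopen(img_url).read()
        if hashlib.sha256(data).hexdigest() in KNOWN_IMAGE_HASHES:
            print("match on page %d: %s" % (n, img_url))

# Site numbers run past 1,000,000, so just count upward through the trove.
for n in range(1, 1_000_001):
    scan_snapshot(n)
```

Notice there is no guessing involved: the addresses are sequential integers, so the bot never has to discover blogs on its own. The trove hands it an index.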
So, the existence of all these copies displayed publicly is a very handy thing for copyright trolls. The copy of my site is unauthorized. And this sort of thing is one of the reasons I am really pissed off.
Method 2) If I were a copyright troll, I would also do the following:
Program a bot to visit http://www.newsblur.com/site/1. Code whatever is required to make the site believe you have clicked "feed". Cause the scroll bar to scroll down... keep scrolling... keep scrolling. (Both could be coded, but it requires programming. I'm a poor programmer; it would likely take me a day. A good programmer could do it more quickly. I'm sure the guys at PicScout can do it as soon as they read this post. That is assuming they haven't already learned of the existence of Newsblur, in which case figuring out that they could do it, and how to do it, would take them... oh... 2 hours?)
Once this is done, the bot can then load every image in every blog post the blogger ever posted. It "sees" every single address for every image and loads them the way a browser would.
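A rough sketch of what that bot might look like, using Selenium to drive a real browser. The "Feed" link text is my guess at Newsblur's markup (a real bot would inspect the page first), but the click-and-scroll mechanics really are just a few lines of off-the-shelf code:

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://www.newsblur.com/site/1")

# Switch the reader into "feed" view. The selector is a guess at the
# site's markup; finding the right one takes a minute in dev tools.
driver.find_element(By.LINK_TEXT, "Feed").click()

# Keep scrolling so the reader loads older and older posts.
last_height = 0
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the page time to fetch the next batch of posts
    height = driver.execute_script("return document.body.scrollHeight")
    if height == last_height:
        break  # nothing new loaded: we've reached the oldest post
    last_height = height

# Every image URL in every post is now in the DOM, ready to fetch.
for img in driver.find_elements(By.TAG_NAME, "img"):
    print(img.get_attribute("src"))

driver.quit()
```

Because the browser executes the page's own JavaScript, the bot doesn't need to reverse-engineer anything; it just waits for the reader to serve up the entire archive.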
None of this should make us happy.
The fact that I could think of these methods very quickly (and did) contributes to why I am very, very upset by the possibility of losing control of how my site displays. To the extent that copying is involved, this is a copyright issue. But to the extent that it facilitates image scraping, it is a "Getty copyright troll" issue.