I want to stop a moment and explain how a trove of blog "snapshots" (i.e., full copies of a blog's HTML) displayed by a third party like Newsblur can be used by a PicScout / Getty / image scraper / copyright troll to efficiently find and inspect images. There are two ways I thought of relatively quickly, and there may be more. Both methods would be much easier to code and implement than anything a bot can do absent a trove of conveniently supplied "snapshots", and would save the bot a lot of time.
If I were a copyright troll running an image bot, I would do this right now:
Method 1) Write a crawler to progressively visit copies of sites at addresses like http://newsblur.com/reader/page/1. (Give it a try.)
I would compare each image at that site to my trove of images. That would essentially show my bot the header and "decoration" images that display in any blog plus any images that happen to be displayed on the top page on the day my bot visited.
After visiting page 1, now load http://newsblur.com/reader/page/2, and so on. (My site number is up above 1,000,000, so you'll want a bot. But it really is a trove.)
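To make Method 1 concrete, here is a minimal sketch in Python of the kind of crawler I mean. The KNOWN_IMAGE_HASHES set and the exact-hash comparison are stand-ins I made up for illustration; a real scraper like PicScout would use perceptual image matching, but the page-number enumeration is the point:

```python
import hashlib
import re
import urllib.request
from urllib.parse import urljoin

# Hypothetical stand-in for the troll's image library: hashes of the
# images they claim to own. A real scraper would use perceptual matching
# rather than exact byte hashes; this is only an illustration.
KNOWN_IMAGE_HASHES = set()

IMG_SRC = re.compile(r'<img[^>]+src="([^"]+)"', re.IGNORECASE)

def scan_snapshot(n):
    """Fetch one numbered snapshot page and check every image on it."""
    page_url = "http://newsblur.com/reader/page/%d" % n
    html = urllib.request.urlopen(page_url).read().decode("utf-8", "replace")
    for src in IMG_SRC.findall(html):
        img_url = urljoin(page_url, src)  # resolve relative image paths
        data = urllib.request.urlopen(img_url).read()
        if hashlib.sha256(data).hexdigest() in KNOWN_IMAGE_HASHES:
            print("match on page %d: %s" % (n, img_url))

# Site numbers run past 1,000,000, so just count upward through the trove.
for n in range(1, 1_000_001):
    scan_snapshot(n)
```

Notice there is no guessing involved: the addresses are sequential integers, so the bot never has to discover blogs on its own. The trove hands it an index.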
So, the existence of all these copies displayed publicly is a very handy thing for copyright trolls. The copy of my site is unauthorized. And this sort of thing is one of the reasons I am really pissed off.
Method 2) If I were a copyright troll, I would also do the following:
Program a bot to visit http://www.newsblur.com/site/1. Code whatever is required to make the site believe you have clicked "feed". Cause the scroll bar to scroll down... keep scrolling... keep scrolling. (Both could be coded, but it requires programming. I'm a poor programmer; it would likely take me a day. A good programmer could do it more quickly. I'm sure the guys at PicScout can do it as soon as they read this post. That is assuming they haven't already learned of the existence of Newsblur, in which case figuring out that they could do it, and how to do it, would take them... oh... 2 hours?)
Once this is done, the bot can then load every image in every blog post the blogger ever posted. It "sees" every single address for every image and loads them the way a browser would.
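A rough sketch of what that bot might look like, using Selenium to drive a real browser. The "Feed" link text is my guess at Newsblur's markup (a real bot would inspect the page first), but the click-and-scroll mechanics really are just a few lines of off-the-shelf code:

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://www.newsblur.com/site/1")

# Switch the reader into "feed" view. The selector is a guess at the
# site's markup; finding the right one takes a minute in dev tools.
driver.find_element(By.LINK_TEXT, "Feed").click()

# Keep scrolling so the reader loads older and older posts.
last_height = 0
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the page time to fetch the next batch of posts
    height = driver.execute_script("return document.body.scrollHeight")
    if height == last_height:
        break  # nothing new loaded: we've reached the oldest post
    last_height = height

# Every image URL in every post is now in the DOM, ready to fetch.
for img in driver.find_elements(By.TAG_NAME, "img"):
    print(img.get_attribute("src"))

driver.quit()
```

Because the browser executes the page's own JavaScript, the bot doesn't need to reverse-engineer anything; it just waits for the reader to serve up the entire archive.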
None of this should make us happy.
The fact that I could think of these methods very quickly (and did) contributes to why I am very, very upset by the possibility of losing control of how my site displays. To the extent that copying is involved, this is a copyright issue. But to the extent that it facilitates image scraping, it is a "Getty copyright troll" issue.