<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: 2006: scraper volume forces new solutions</title>
	<atom:link href="http://yardley.ca/2005/12/29/2006-scraper-volume-forces-new-solutions/feed/" rel="self" type="application/rss+xml" />
	<link>http://yardley.ca/2005/12/29/2006-scraper-volume-forces-new-solutions/</link>
	<description>greg yardley on online product management</description>
	<lastBuildDate>Wed, 04 Jan 2012 05:04:13 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: Dave McClure</title>
		<link>http://yardley.ca/2005/12/29/2006-scraper-volume-forces-new-solutions/comment-page-1/#comment-504</link>
		<dc:creator>Dave McClure</dc:creator>
		<pubDate>Thu, 05 Jan 2006 07:28:20 +0000</pubDate>
		<guid isPermaLink="false">http://yardley.ca/merge/?p=200#comment-504</guid>
		<description>while i won&#039;t disagree that such tools might become available, i think the motivation for most folks isn&#039;t there -- in the case of our search engine SimplyHired.com, when we crawl &amp; index data from other sites we send traffic *BACK* to the original source site.  in other words, just like Google, Yahoo, and other search engines we provide value through being a distribution channel.  most sites WANT their data to be listed, so that they get more USER TRAFFIC they&#039;d otherwise miss.

for a few larger sites that are destinations in their own right, it&#039;s possible they may decide the incremental traffic isn&#039;t so valuable (though i doubt that makes sense even for them).  however, certainly for the majority of &quot;long tail data sources&quot; out there, it does NOT make sense to block search engine crawlers that index their site... just like it wouldn&#039;t make sense to block the GoogleBot or slurp (Yahoo).

certainly the returned traffic needs to justify the resources used to crawl &amp; index those sites, but for most crawlers that don&#039;t hit sites more than once a day, that doesn&#039;t seem like a huge load.  (i&#039;d hazard a guess the Oodle-Craigslist story is the exception, not the rule)

finally, i think we need to define &quot;scraping&quot; vs &quot;indexing&quot; a little better... unless someone can explain to me why Google and Yahoo aren&#039;t also &quot;scraping&quot;?

my definition:

&quot;indexing&quot; (white hat) is when search engines:
 - display a summary of the source data
 - clearly note the source &amp; URL
 - send traffic back to the source

&quot;scraping&quot; (black hat) is when non-search engine crawlers:
 - display the *ENTIRE* data from the source
 - do NOT note the source or URL
 - do NOT send traffic back to the source

there&#039;s a big difference.  the former is symbiotic; the latter is parasitic.

in summary: unless you&#039;re Agence France-Presse or AOL, you probably don&#039;t choose to opt out of a search engine that&#039;s helping send users your way.

my .02,

- dave mcclure
  http://www.SimplyHired.com</description>
		<content:encoded><![CDATA[<p>while i won&#8217;t disagree that such tools might become available, i think the motivation for most folks isn&#8217;t there &#8212; in the case of our search engine SimplyHired.com, when we crawl &amp; index data from other sites we send traffic *BACK* to the original source site.  in other words, just like Google, Yahoo, and other search engines we provide value through being a distribution channel.  most sites WANT their data to be listed, so that they get more USER TRAFFIC they&#8217;d otherwise miss.</p>
<p>for a few larger sites that are destinations in their own right, it&#8217;s possible they may decide the incremental traffic isn&#8217;t so valuable (though i doubt that makes sense even for them).  however, certainly for the majority of &#8220;long tail data sources&#8221; out there, it does NOT make sense to block search engine crawlers that index their site&#8230; just like it wouldn&#8217;t make sense to block the GoogleBot or slurp (Yahoo).</p>
<p>certainly the returned traffic needs to justify the resources used to crawl &amp; index those sites, but for most crawlers that don&#8217;t hit sites more than once a day, that doesn&#8217;t seem like a huge load.  (i&#8217;d hazard a guess the Oodle-Craigslist story is the exception, not the rule)</p>
<p>finally, i think we need to define &#8220;scraping&#8221; vs &#8220;indexing&#8221; a little better&#8230; unless someone can explain to me why Google and Yahoo aren&#8217;t also &#8220;scraping&#8221;?</p>
<p>my definition:</p>
<p>&#8220;indexing&#8221; (white hat) is when search engines:<br />
 &#8211; display a summary of the source data<br />
 &#8211; clearly note the source &amp; URL<br />
 &#8211; send traffic back to the source</p>
<p>&#8220;scraping&#8221; (black hat) is when non-search engine crawlers:<br />
 &#8211; display the *ENTIRE* data from the source<br />
 &#8211; do NOT note the source or URL<br />
 &#8211; do NOT send traffic back to the source</p>
<p>there&#8217;s a big difference.  the former is symbiotic; the latter is parasitic.</p>
<p>in summary: unless you&#8217;re Agence France-Presse or AOL, you probably don&#8217;t choose to opt out of a search engine that&#8217;s helping send users your way.</p>
<p>my .02,</p>
<p>- dave mcclure<br />
  <a href="http://www.SimplyHired.com" rel="nofollow">http://www.SimplyHired.com</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John K</title>
		<link>http://yardley.ca/2005/12/29/2006-scraper-volume-forces-new-solutions/comment-page-1/#comment-503</link>
		<dc:creator>John K</dc:creator>
		<pubDate>Tue, 03 Jan 2006 18:11:02 +0000</pubDate>
		<guid isPermaLink="false">http://yardley.ca/merge/?p=200#comment-503</guid>
		<description>&lt;em&gt;
I was thinking more that other servers running scraper-blocking scripts like mine might pool their combined lists of blocked scrapers into a central server for processing.
&lt;/em&gt;

I agree with that concept, but I wonder why it hasn&#039;t happened for blog spam yet?  ( http://gotads.blogspot.com/2005/07/bayesian-blog-and-tag-spam-filtering.html )

Large content producers will produce APIs/Feeds that give access to their content.  Small content producers won&#039;t protect their content - it&#039;s not worth it - blogs today don&#039;t do it even though there is a big scraper problem there.

So I don&#039;t disagree with IncrediBill&#039;s main points, but I totally disagree with Greg&#039;s last point that services will cause scraping to decrease in volume any time soon.  Massive increases are what&#039;s really gonna happen.

It&#039;s just a case of the demand side being much more interested than the prevention side...

(See: rel=nofollow, credit card fraud, click fraud, etc. for similar situations).  Much easier to do than prevent, and vast numbers of people not interested in prevention.</description>
		<content:encoded><![CDATA[<p><em><br />
I was thinking more that other servers running scraper-blocking scripts like mine might pool their combined lists of blocked scrapers into a central server for processing.<br />
</em></p>
<p>I agree with that concept, but I wonder why it hasn&#8217;t happened for blog spam yet?  ( <a href="http://gotads.blogspot.com/2005/07/bayesian-blog-and-tag-spam-filtering.html" rel="nofollow">http://gotads.blogspot.com/2005/07/bayesian-blog-and-tag-spam-filtering.html</a> )</p>
<p>Large content producers will produce APIs/Feeds that give access to their content.  Small content producers won&#8217;t protect their content &#8211; it&#8217;s not worth it &#8211; blogs today don&#8217;t do it even though there is a big scraper problem there.</p>
<p>So I don&#8217;t disagree with IncrediBill&#8217;s main points, but I totally disagree with Greg&#8217;s last point that services will cause scraping to decrease in volume any time soon.  Massive increases are what&#8217;s really gonna happen.</p>
<p>It&#8217;s just a case of the demand side being much more interested than the prevention side&#8230;</p>
<p>(See: rel=nofollow, credit card fraud, click fraud, etc. for similar situations).  Much easier to do than prevent, and vast numbers of people not interested in prevention.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: IncrediBILL</title>
		<link>http://yardley.ca/2005/12/29/2006-scraper-volume-forces-new-solutions/comment-page-1/#comment-502</link>
		<dc:creator>IncrediBILL</dc:creator>
		<pubDate>Sat, 31 Dec 2005 22:28:34 +0000</pubDate>
		<guid isPermaLink="false">http://yardley.ca/merge/?p=200#comment-502</guid>
		<description>John,

This is a much more controllable situation compared to spam as email is an unknown source, if you want random email from legitimate sources you risk spammers as well.

The difference is that search engines have known blocks of IP addresses, so you let them thru and block the rest, whether they crawl rapidly or slowly.

The scrapers using legitimate crawler services like Alexa can easily be stopped with NOCACHE in all your pages, and if Alexa refuses to honor the NOCACHE there are civil actions that can be taken to make them comply.

As far as a service, I wasn&#039;t even thinking real-time. I was thinking more that other servers running scraper-blocking scripts like mine might pool their combined lists of blocked scrapers into a central server for processing.

This list would then be available for others to download hourly, daily, weekly, whatever as needed and block scrapers they&#039;ve never seen before they even get to their web sites.

Just an idea, we shall see as my motivation for such grandiose schemes tends to wane after I&#039;ve successfully stopped scraping on my site ;)</description>
		<content:encoded><![CDATA[<p>John,</p>
<p>This is a much more controllable situation compared to spam as email is an unknown source, if you want random email from legitimate sources you risk spammers as well.</p>
<p>The difference is that search engines have known blocks of IP addresses, so you let them thru and block the rest, whether they crawl rapidly or slowly.</p>
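<p>A minimal sketch of the allowlist idea described above, assuming the crawler blocks are kept as CIDR ranges. The ranges shown are documentation-only placeholders, not real search engine addresses, and the function names and the <code>looks_like_crawler</code> flag are hypothetical.</p>

```python
import ipaddress

# Hypothetical allowlist: documentation-only ranges (RFC 5737),
# standing in for the published IP blocks of known search engines.
ALLOWED_CRAWLER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_known_crawler(ip: str) -> bool:
    """Return True if the address falls inside an allowlisted crawler block."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in ALLOWED_CRAWLER_NETWORKS)


def should_block(ip: str, looks_like_crawler: bool) -> bool:
    """Block crawler-like traffic unless it comes from an allowlisted block.

    looks_like_crawler is an assumed signal produced elsewhere
    (for example by rate or pattern detection); it is not computed here.
    """
    return looks_like_crawler and not is_known_crawler(ip)
```

<p>Ordinary visitors pass through because they never trip the crawler signal; only crawler-like traffic from outside the allowlisted blocks is refused.</p>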
<p>The scrapers using legitimate crawler services like Alexa can easily be stopped with NOCACHE in all your pages, and if Alexa refuses to honor the NOCACHE there are civil actions that can be taken to make them comply.</p>
<p>As far as a service, I wasn&#8217;t even thinking real-time. I was thinking more that other servers running scraper-blocking scripts like mine might pool their combined lists of blocked scrapers into a central server for processing.</p>
<p>This list would then be available for others to download hourly, daily, weekly, whatever as needed and block scrapers they&#8217;ve never seen before they even get to their web sites.</p>
<p>Just an idea, we shall see as my motivation for such grandiose schemes tends to wane after I&#8217;ve successfully stopped scraping on my site <img src='http://yardley.ca/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John K</title>
		<link>http://yardley.ca/2005/12/29/2006-scraper-volume-forces-new-solutions/comment-page-1/#comment-501</link>
		<dc:creator>John K</dc:creator>
		<pubDate>Fri, 30 Dec 2005 07:53:16 +0000</pubDate>
		<guid isPermaLink="false">http://yardley.ca/merge/?p=200#comment-501</guid>
		<description>Good one.  I&#039;d buy the prediction of easy to use Apache modules first - since it seems that a service would be too slow, unless it was edge hosted or Google networked... maybe you connect to a service to get a whitelist.

But whatever the technology, it will no more put an end to scraping or aggregation than spam services have put an end to spam.  Moreover, public APIs like Yahoo and stuff like Alexa WSP will allow aggregators to outsource the scraping to crawlers that normal people won&#039;t want to block...</description>
		<content:encoded><![CDATA[<p>Good one.  I&#8217;d buy the prediction of easy to use Apache modules first &#8211; since it seems that a service would be too slow, unless it was edge hosted or Google networked&#8230; maybe you connect to a service to get a whitelist.</p>
<p>But whatever the technology, it will no more put an end to scraping or aggregation than spam services have put an end to spam.  Moreover, public APIs like Yahoo and stuff like Alexa WSP will allow aggregators to outsource the scraping to crawlers that normal people won&#8217;t want to block&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: IncrediBILL</title>
		<link>http://yardley.ca/2005/12/29/2006-scraper-volume-forces-new-solutions/comment-page-1/#comment-500</link>
		<dc:creator>IncrediBILL</dc:creator>
		<pubDate>Thu, 29 Dec 2005 08:59:00 +0000</pubDate>
		<guid isPermaLink="false">http://yardley.ca/merge/?p=200#comment-500</guid>
		<description>Thanks for the dig and I see you were previously in San Mateo, small world, I live in the adjacent town Foster City down on the bayshore.

Something I didn&#039;t mention in my blog YET was the fact that I&#039;ve noticed what appears to be either small hit and run scrapers that bail before scripts can nail them or distributed scrapers from multiple IPs.

The fact that scrapers MAY be taking a page from how the real search engines work is extremely disturbing and I&#039;ll post more about this topic if something becomes more obvious.

FYI, the idea has crossed my mind to start an anti-scraping service similar to the anti-spam RBL list that distributes lists in XML or .htaccess format from trusted sources to create a global blocked list.

Let&#039;s see if I&#039;m still motivated in Feb! ;)</description>
		<content:encoded><![CDATA[<p>Thanks for the dig and I see you were previously in San Mateo, small world, I live in the adjacent town Foster City down on the bayshore.</p>
<p>Something I didn&#8217;t mention in my blog YET was the fact that I&#8217;ve noticed what appears to be either small hit and run scrapers that bail before scripts can nail them or distributed scrapers from multiple IPs.</p>
<p>The fact that scrapers MAY be taking a page from how the real search engines work is extremely disturbing and I&#8217;ll post more about this topic if something becomes more obvious.</p>
<p>FYI, the idea has crossed my mind to start an anti-scraping service similar to the anti-spam RBL list that distributes lists in XML or .htaccess format from trusted sources to create a global blocked list.</p>
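<p>A rough sketch of how such a pooled list might be built and rendered in .htaccess form. The function names, the <code>min_reports</code> threshold, and the sample addresses (documentation ranges) are all assumptions for illustration, not part of any existing service. The output uses Apache 2.2-style <code>Deny from</code> directives.</p>

```python
import ipaddress
from collections import Counter


def pool_blocklists(reports, min_reports=2):
    """Merge per-source blocklists into one global list.

    reports maps a trusted source name to the IPs it has blocked.
    An address is kept only if at least min_reports sources list it
    (an assumed safeguard against one source's false positives).
    """
    counts = Counter()
    for source_ips in reports.values():
        # Count each address once per reporting source.
        for ip in set(source_ips):
            counts[ipaddress.ip_address(ip)] += 1
    kept = [ip for ip, n in counts.items() if n >= min_reports]
    # Sort by IP version, then numeric value, for stable output.
    return sorted(kept, key=lambda ip: (ip.version, int(ip)))


def to_htaccess(ips):
    """Render the pooled list as Apache 2.2-style access rules."""
    lines = ["Order Allow,Deny", "Allow from all"]
    lines += [f"Deny from {ip}" for ip in ips]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample_reports = {
        "site-a": ["203.0.113.7", "203.0.113.9"],
        "site-b": ["203.0.113.7"],
        "site-c": ["198.51.100.3"],
    }
    print(to_htaccess(pool_blocklists(sample_reports)), end="")
```

<p>With <code>Order Allow,Deny</code>, the <code>Deny</code> lines are evaluated after <code>Allow from all</code>, so listed addresses are refused while everyone else gets through.</p>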
<p>Let&#8217;s see if I&#8217;m still motivated in Feb! <img src='http://yardley.ca/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
</channel>
</rss>

