Continuing predictions / general trends for 2006 – there’s going to be more aggressive spidering and site-scraping, to the point where it puts a serious load on any site that depends on database queries. IncrediBILL’s been writing on this – see his ‘High Speed Scrapers Steal More Than Pages’ for an explanation of the problem (and links to a couple of ‘bot blocking scripts). These additional spiders will be a natural consequence of the entrepreneurial boomlet and the growing realization that aggregated content has real value; people looking for easy business models, combined with the spread of easy-to-scrape formats across the Web, are going to drive an explosion of vertical search engines and niche directories. The availability of better-structured data is only going to accelerate this – your site might not publish a feed, use microformats, or make use of structured blogging, but the mere knowledge that such content is out there somewhere is going to drive scrapers to your site to look for it.
Webmasters already combat increased scraping activity by manipulating the robots.txt file and blocking especially aggressive scrapers – see Craigslist’s relationship with Oodle or just browse through the archives at WebmasterWorld. But this approach isn’t going to scale well with additional crawler traffic; by the end of the year, I predict the rise of easy-to-use services that block all but the most cautious ‘bots automatically, giving webmasters the option of approving crawlers after they’re detected. These services will put an end to the scraping, but they’ll also make it very difficult for new aggregation-based businesses to get off the ground. Good news for the already-established players…
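For readers who haven’t touched robots.txt, here is a minimal Python sketch of the kind of rule involved, checked with the standard library’s `urllib.robotparser`. The crawler names (`GreedyBot`, `PoliteBot`), the paths, and the example.com URLs are made up for illustration, and of course a rule like this only stops crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one aggressive crawler entirely,
# and keep everyone else away from the expensive search pages.
ROBOTS_TXT = """\
User-agent: GreedyBot
Disallow: /

User-agent: *
Disallow: /search
"""

def build_parser(robots_txt):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# GreedyBot is blocked everywhere; other crawlers are only kept out of /search.
print(parser.can_fetch("GreedyBot", "http://example.com/listings/123"))    # False
print(parser.can_fetch("PoliteBot", "http://example.com/listings/123"))    # True
print(parser.can_fetch("PoliteBot", "http://example.com/search/results"))  # False
```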
{ 5 comments }
Thanks for the dig and I see you were previously in San Mateo, small world, I live in the adjacent town Foster City down on the bayshore.
Something I didn’t mention in my blog YET was the fact that I’ve noticed what appears to be either small hit and run scrapers that bail before scripts can nail them or distributed scrapers from multiple IPs.
The fact that scrapers MAY be taking a page from how the real search engines work is extremely disturbing and I’ll post more about this topic if something becomes more obvious.
FYI, the idea has crossed my mind to start an anti-scraping service similar to the anti-spam RBL list that distributes lists in XML or .htaccess format from trusted sources to create a global blocked list.
Let’s see if I’m still motivated in Feb!
Good one. I’d buy the prediction of easy to use Apache modules first – since it seems that a service would be too slow, unless it was edge hosted or Google networked… maybe you connect to a service to get a whitelist.
But whatever the technology, it will no more put an end to scraping or aggregation than anti-spam services have put an end to spam. Moreover, public APIs like Yahoo’s and stuff like Alexa WSP will allow aggregators to outsource the scraping to crawlers that normal people won’t want to block…
John,
This is a much more controllable situation than spam. Email comes from unknown sources: if you want random email from legitimate senders, you risk getting spammers as well.
The difference is that search engines have known blocks of IP addresses. You let those thru, and you block the rest of the rapid-crawling, or even slow-crawling, sources.
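A rough Python sketch of that allowlist idea, using the standard `ipaddress` module. The network ranges below are placeholders from the documentation-reserved blocks, not real search engine ranges, and the `looks_like_crawler` flag stands in for whatever detection a blocking script already does; both are assumptions for illustration.

```python
import ipaddress

# Placeholder ranges (reserved for documentation); a real list would have
# to come from the search engines themselves.
TRUSTED_CRAWLER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_trusted_crawler(remote_addr):
    """Return True if the address falls inside a known search engine block."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in TRUSTED_CRAWLER_NETWORKS)

def should_block(remote_addr, looks_like_crawler):
    """Block crawler-like traffic unless it comes from a trusted range."""
    return looks_like_crawler and not is_trusted_crawler(remote_addr)

print(should_block("192.0.2.44", looks_like_crawler=True))    # False
print(should_block("203.0.113.9", looks_like_crawler=True))   # True
print(should_block("203.0.113.9", looks_like_crawler=False))  # False
```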
The scrapers using legitimate crawler services like Alexa can easily be stopped with NOCACHE in all your pages, and if Alexa refuses to honor the NOCACHE there are civil actions that can be taken to make them comply.
As far as a service, I wasn’t even thinking real-time. I was thinking more that other servers running scraper blocking scripts like mine might pool their combined lists of blocked scrapers into a central server for processing.
This list would then be available for others to download hourly, daily, weekly, whatever as needed, so they could block scrapers they’ve never seen before those scrapers even get to their web sites.
Just an idea, we shall see, as my motivation for such grandiose schemes tends to wane after I’ve successfully stopped scraping on my site.
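To make the pooling idea concrete, here is a small Python sketch under stated assumptions: each participating site reports the IPs it has blocked, an IP only makes the shared list once a minimum number of independent sites have reported it, and the result is rendered as an Apache 2.4-style .htaccess fragment. The function names, the `min_reporters` threshold, and the sample addresses are all hypothetical; nothing here describes an actual service.

```python
from collections import defaultdict
import ipaddress

def pool_reports(reports, min_reporters=2):
    """Merge per-site block lists, keeping IPs reported by enough sites."""
    reporters_by_ip = defaultdict(set)
    for site, blocked_ips in reports.items():
        for ip in blocked_ips:
            # ip_address() validates and normalises each entry.
            reporters_by_ip[str(ipaddress.ip_address(ip))].add(site)
    return sorted(ip for ip, sites in reporters_by_ip.items()
                  if len(sites) >= min_reporters)

def render_htaccess(blocked_ips):
    """Render the pooled list as Apache 2.4 access rules."""
    lines = ["<RequireAll>", "    Require all granted"]
    lines += ["    Require not ip " + ip for ip in blocked_ips]
    lines.append("</RequireAll>")
    return "\n".join(lines)

# Hypothetical reports from three sites running blocking scripts.
reports = {
    "site-a": ["203.0.113.7", "203.0.113.9"],
    "site-b": ["203.0.113.7"],
    "site-c": ["203.0.113.7", "198.51.100.4"],
}
pooled = pool_reports(reports)
print(render_htaccess(pooled))
```

Requiring more than one reporter is one simple way to keep a single misconfigured site from getting an innocent address blocked everywhere.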
“I was thinking more like other servers could be running scraper blocking scripts like mine might pool the combined list of scrapers blocked into a central server for processing.”
I agree with that concept, but I wonder why it hasn’t happened for blog spam yet? ( http://gotads.blogspot.com/2005/07/bayesian-blog-and-tag-spam-filtering.html )
Large content producers will produce APIs/Feeds that give access to their content. Small content producers won’t protect their content – it’s not worth it – blogs today don’t do it even though there is a big scraper problem there.
So I don’t disagree with IncrediBill’s main points, but I totally disagree with Greg’s last point that services will cause scraping to decrease in volume any time soon. Massive increases are what’s really gonna happen.
It’s just a case of the demand side being much more interested than the prevention side…
(See: rel=nofollow, credit card fraud, click fraud, etc. for similar situations). Much easier to do than prevent, and vast numbers of people not interested in prevention.
while i won’t disagree that such tools might become available, i think the motivation for most folks isn’t there — in the case of our search engine SimplyHired.com, when we crawl & index data from other sites we send traffic *BACK* to the original source site. in other words, just like Google, Yahoo, and other search engines we provide value through being a distribution channel. most sites WANT their data to be listed, so that they get more USER TRAFFIC they’d otherwise miss.
for a few larger sites that are destinations in their own right, it’s possible they may decide the incremental traffic isn’t so valuable (though i doubt that makes sense even for them). however, certainly for the majority of “long tail data sources” out there, it does NOT make sense to block search engine crawlers that index their site… just like it wouldn’t make sense to block the GoogleBot or slurp (Yahoo).
certainly the returned traffic needs to justify the resources used to crawl & index those sites, but for most crawlers that don’t hit sites more than once a day, that doesn’t seem like a huge load. (i’d hazard a guess the Oodle-Craigslist story is the exception, not the rule)
finally, i think we need to define “scraping” vs “indexing” a little better… unless someone can explain to me why Google and Yahoo aren’t also “scraping”?
my definition:
“indexing” (white hat) is when search engines:
– display a summary of the source data
– clearly note the source & URL
– send traffic back to the source
“scraping” (black hat) is when non-search engine crawlers:
– display the *ENTIRE* data from the source
– do NOT note the source or URL
– do NOT send traffic back to the source
there’s a big difference. the former is symbiotic; the latter is parasitic.
in summary: unless you’re Agence France-Presse or AOL, you probably don’t choose to opt out of a search engine that’s helping send users your way.
my .02,
- dave mcclure
http://www.SimplyHired.com