Google’s Great Spam Quest

The search engine wants to weed out sites that create low-quality articles simply as a way of luring people to online ads.

Tom Simonite

February 2, 2011

Google is working on ways to rid its search results of “content farms”—sites that create many pages of very cheap content crafted to appear high up in Google’s results. Speaking this week at Farsight 2011, a one-day event in San Francisco on the future of search, the firm’s principal search engineer, Matt Cutts, said that Google is considering tweaks to the algorithms that guide its search results. It’s also considering more radical tactics, such as letting users blacklist certain sites from the results they see.

In recent months, Google has been criticized by tech industry insiders for allowing so-called “content farms” to occupy high rankings in results for common searches. The operators of such sites create articles containing common search keywords and phrases as a way of luring visitors to their online ads. Much of the content on such sites, for example those operated by Demand Media, is created by very low-paid freelancers.

Search engines are currently being bested by those tactics, said Vivek Wadhwa, a visiting researcher in technology and business at Berkeley, Duke, and Harvard universities, at Tuesday’s event. “Over the last 15 years, search has changed very little,” he said, “but the Web has changed and become pretty clogged by spam.” Wadhwa said he realized the scale of the problem after small-scale experiments with his students showed that Google’s shortcomings turned up frequently in common searches.

Cutts announced last week that Google’s algorithms had been altered to penalize sites that copied content from other sites as a way of climbing higher in search rankings. But he acknowledged that it was a challenge to identify and demote low-quality content. “Someone recently found five articles on how to tie shoes on one of these sites,” he said. “We want to find an algorithmic solution to this and are working on it.”

Some question whether an algorithmic approach can work. Startup search company Blekko uses a different approach: yesterday it announced that it had excluded 20 “spam” sites from its index entirely, based on which pages its users had marked as spam when they appeared in search results. The 20 sites include many often described as content farms, including Demand Media’s eHow site. Blekko, which launched last November, uses Wikipedia-like functionality to allow users to mark pages as spam, and to work together on filters (dubbed “slashtags”) that include or exclude sites from searches on particular topics.

“The Web has turned into a swamp,” said Blekko cofounder Rich Skrenta at the event, “because search engines gave URLs economic value.” Methods of ranking search results that rely mainly on which sites have the most links or keywords are no longer robust enough, he said. Instead, a more human touch is required.

Harry Shum, who leads development on Microsoft’s search engine, Bing, also appeared at the event and agreed that search companies need new approaches. “I think this is a big problem,” he said. “Google is overemphasizing the automatic approach. Maybe we need to take into account the authority of the authors of pages or other social information.” Bing has experimented with a feature that draws on information from a person’s Facebook friends to rank results.

Cutts claimed that it’s not Google’s style to make “editorial decisions” to block certain sites—the company would prefer to find purely automatic ways to filter out sites that don’t help users. “Using algorithms can work in German and Japanese as well as it does in English,” he pointed out. However, he also revealed that Google is experimenting, internally for now, with a Blekko-like strategy where users can wrest some control of their search results.

“I have a Chrome bar installed on my laptop that will let you block certain sites from results,” said Cutts. “If people want to send us direct feedback, that’s great.” However, he gave no indication of when the feature might be launched publicly.
