Thursday, December 03, 2009

Recrawling and keeping search results fresh

A paper by three Googlers, "Keeping a Search Engine Index Fresh: Risk and Optimality in Estimating Refresh Rates for Web Pages" (not available online), is one of several recent papers looking at "the cost of a page being stale versus the cost of [recrawling]."
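
The paper itself is not available, but the shape of the trade-off is easy to sketch. If we assume page changes arrive as a Poisson process (the classic freshness model from Cho and Garcia-Molina, not necessarily what these Googlers do), the expected fraction of time a page sits stale in the index depends only on its change rate and how often we recrawl it, and we can balance that against the cost of each crawl. The rates, costs, and candidate intervals below are made up for illustration:

    import math

    def expected_stale_fraction(change_rate, recrawl_interval):
        """Fraction of time a page is expected to be stale between recrawls,
        assuming changes arrive as a Poisson process with the given rate."""
        x = change_rate * recrawl_interval
        if x == 0:
            return 0.0
        return 1.0 - (1.0 - math.exp(-x)) / x

    def total_cost(change_rate, recrawl_interval, staleness_cost, crawl_cost):
        """Staleness cost per unit time plus crawl cost per unit time."""
        return (staleness_cost * expected_stale_fraction(change_rate, recrawl_interval)
                + crawl_cost / recrawl_interval)

    def best_interval(change_rate, staleness_cost, crawl_cost, candidates):
        """Pick the candidate recrawl interval with the lowest combined cost."""
        return min(candidates,
                   key=lambda d: total_cost(change_rate, d, staleness_cost, crawl_cost))

    if __name__ == "__main__":
        # Hypothetical numbers: a page that changes about once a day, with
        # staleness costing ten times what a single recrawl costs.
        candidates = [0.25, 0.5, 1.0, 2.0, 4.0, 7.0]  # days between recrawls
        print(best_interval(change_rate=1.0, staleness_cost=10.0,
                            crawl_cost=1.0, candidates=candidates))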

The core idea here is that people care a lot about some changes to web pages and don't care about others, and search engines need to respond to that to make search results relevant.

Unfortunately, our Googlers punt on the really interesting problem here: determining the cost of a page being stale. They simply assume that every stale page hurts relevance by the same amount.

That clearly is not true. Not only do some pages appear in search results more often than others, but some changes to a page matter more to people than others.

Getting at the cost of being stale is difficult, but a good start is "The Impact of Crawl Policy on Web Search Effectiveness" (PDF), recently presented at SIGIR 2009. It uses PageRank and in-degree as a rough estimate of which pages people will see and click on in search results, then explores the impact of recrawling the pages people want more frequently.
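
A minimal sketch of that idea (not the paper's actual crawl policy): order the crawl queue by something like PageRank times the probability the page has changed since we last fetched it, so that both how likely a page is to be seen and how likely it is to be stale feed the priority. The URLs and numbers below are hypothetical:

    import math

    def recrawl_priority(pagerank, change_rate, time_since_crawl):
        """Rank pages for recrawl by (rough chance the page is seen) times
        (chance it has changed since the last crawl). PageRank stands in for
        how often the page is shown and clicked; change_rate is an estimated
        Poisson change rate."""
        prob_changed = 1.0 - math.exp(-change_rate * time_since_crawl)
        return pagerank * prob_changed

    # Hypothetical frontier: (url, pagerank, est. changes/day, days since last crawl)
    frontier = [
        ("news-homepage", 0.9, 24.0, 0.1),
        ("popular-blog",  0.5,  1.0, 2.0),
        ("old-archive",   0.1,  0.01, 30.0),
    ]
    frontier.sort(key=lambda p: recrawl_priority(p[1], p[2], p[3]), reverse=True)
    print([url for url, *_ in frontier])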

But that still does not capture whether the change is something people care about. Is the change, for example, below the fold on the page, and so less likely to be seen? Is the change correcting a typo or swapping out an advertisement? In general, what is the cost of showing stale information for this page?

"Resonance on the Web: Web Dynamics and Revisitation Patterns" (PDF), recently presented at CHI, starts to explore that question, looking at the relationship between web content change and how much people want to revisit the pages, as well as thinking about the question of what is an interesting content change.

As it turns out, news is a case where change matters and people revisit frequently, and there have been several attempts to treat real-time content such as news differently in search results. One recent example, "Click-Through Prediction for News Queries" (PDF), presented at SIGIR 2009, describes one method of predicting when people will want to see news articles for a web search query.
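
The general shape of that idea, with invented features and weights rather than anything from the paper, is to turn signals about the query, such as a spike in query volume or overlap with recent headlines, into a predicted probability that the searcher wants news, and only blend news into the results when that probability is high enough:

    import math

    def news_click_probability(query_volume_spike, matches_recent_headlines,
                               historical_news_ctr):
        """Toy logistic model: turn a few query signals into an estimated
        probability that the searcher wants a news result right now.
        The features and weights are made up for illustration."""
        weights = {"spike": 1.5, "headline_overlap": 2.0, "past_ctr": 3.0, "bias": -3.0}
        z = (weights["bias"]
             + weights["spike"] * query_volume_spike
             + weights["headline_overlap"] * matches_recent_headlines
             + weights["past_ctr"] * historical_news_ctr)
        return 1.0 / (1.0 + math.exp(-z))

    # Only show the news block when the predicted click probability is high enough.
    if news_click_probability(query_volume_spike=0.8,
                              matches_recent_headlines=1.0,
                              historical_news_ctr=0.4) > 0.5:
        print("blend news results into the page")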

But, rather than coming up with rules for when content from various federated sources should be shown, I wonder if we cannot find a simpler solution. All of these works strive toward the same goal, understanding when people care about change. Relevance depends on what we want, what we see, and what we notice. Search results need only to appear fresh.

Recrawling high PageRank pages is a rough attempt at making results appear fresh, since a high PageRank page is more likely to be shown and noticed at the top of search results, but it is only a crude approximation. What we really want to know is: Who will see a change? If people see it, will they notice? If they notice, will they care?
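
Put another way, the expected cost of serving a stale version of a page is roughly the product of the answers to those three questions. A back-of-the-envelope version, with entirely hypothetical numbers:

    def cost_of_staleness(expected_impressions, prob_change_visible, prob_change_matters):
        """Rough expected cost of serving the stale version of a page: how many
        people will see the result, times the chance they would notice the change
        (e.g. it is above the fold or in the title or snippet), times the chance
        the change is one they care about (new content, not a fixed typo or a
        rotated ad)."""
        return expected_impressions * prob_change_visible * prob_change_matters

    # Hypothetical changes to three already-crawled pages:
    changes = {
        "breaking-news-update":      cost_of_staleness(50000, 0.9, 0.8),
        "typo-fix-on-popular-page":  cost_of_staleness(50000, 0.2, 0.05),
        "rewrite-of-obscure-page":   cost_of_staleness(30, 0.9, 0.9),
    }
    # Recrawl and re-index budget goes to the changes people would actually miss.
    for page, cost in sorted(changes.items(), key=lambda kv: -kv[1]):
        print(f"{cost:12.1f}  {page}")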

Interestingly, people's actions tell us a lot about what they care about. Our wants and needs, where our attention lies, all live in our movements across the Web. If we listen carefully, these voices may speak.

For more on that, please see also my older posts, "Google toolbar data and the actual surfer model" and "Cheap eyetracking using mouse tracking".

Update: One month later, an experiment showed that new content on the Web can generally appear in Google search results within 13 seconds.
