Tuesday, January 06, 2009

Blending the link and query-click graphs

A fun paper out of Yahoo Research, "Dr. Searcher and Mr. Browser: A Unified Hyperlink-Click Graph" (PDF), looks at the value of combining two graphs that search engines typically use as part of static and dynamic ranking: the query-click bipartite graph (which shows what pages people click on immediately after searching) and the link graph (which shows hyperlinks between pages on the Web).

The query-click graph is a bipartite graph with queries on one side and clicked documents on the other. A query (e.g. [greg linden]) is linked to a document (e.g. http://glinden.blogspot.com) if people who make the query click on that document. The links are usually weighted by the probability that someone clicks on the document given that search query. Search engines get the query-click graph by parsing their query logs. Random walks of the query-click graph have been a popular research topic for finding similar queries and similar documents.
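To make that concrete, here is a toy sketch of building the weighted bipartite graph from a simplified log of (query, clicked URL) pairs. The log format and function name are my own assumptions for illustration, not anything from the paper.

```python
from collections import defaultdict

def build_query_click_graph(log_entries):
    """Build a query-click bipartite graph from (query, clicked_url) pairs.

    Edge weights are estimated as P(click on url | query) from raw counts.
    The flat log format here is a simplifying assumption.
    """
    # Count clicks per (query, url) pair.
    clicks = defaultdict(lambda: defaultdict(int))
    for query, url in log_entries:
        clicks[query][url] += 1

    # Normalize counts into click probabilities per query.
    graph = {}
    for query, urls in clicks.items():
        total = sum(urls.values())
        graph[query] = {url: count / total for url, count in urls.items()}
    return graph

# Example: the query [greg linden] mostly leads to clicks on the blog.
log = [("greg linden", "http://glinden.blogspot.com"),
       ("greg linden", "http://glinden.blogspot.com"),
       ("greg linden", "http://en.wikipedia.org/wiki/Greg_Linden")]
g = build_query_click_graph(log)
# g["greg linden"] -> {".../glinden.blogspot.com": 0.67, ".../Greg_Linden": 0.33}
```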

The hyperlink graph is a graph where web pages are nodes and a link from one page to another is a directed edge. Search engines get a link graph by crawling the Web and parsing the HTML of the pages they find. Random walks of the link graph are used to find related documents and, in algorithms such as PageRank, to compute the authority of web pages.
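For readers who want the random-walk idea spelled out, here is a toy version of the standard PageRank power iteration over an adjacency-list link graph. The damping factor and iteration count are the conventional defaults, not anything specific to the paper.

```python
def pagerank(link_graph, damping=0.85, iterations=50):
    """Toy power-iteration PageRank.

    link_graph maps each page to the list of pages it links to; every
    linked page is assumed to also appear as a key.
    """
    pages = list(link_graph)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Start each page with the random-jump share of rank.
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in link_graph.items():
            if not outlinks:
                # Dangling page: spread its rank uniformly.
                share = damping * rank[page] / len(pages)
                for p in pages:
                    new_rank[p] += share
            else:
                # Pass rank along each outgoing hyperlink.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Tiny example: "a" links to "b" and "c"; both link back to "a".
links = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
print(pagerank(links))  # "a" ends up with the highest authority
```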

The authors of this Yahoo paper had the idea of combining the two graphs and doing random walks of the combined graph. In particular, they model searchers starting with a query, then clicking, then randomly walking the hyperlink graph, until they eventually search again. The goal is to more accurately model searcher behavior and more accurately find the pages people actually want when they search.
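A rough simulation of that search-then-browse behavior might look like the sketch below, reusing the click-graph and link-graph shapes from the earlier snippets. The restart probability (browse_prob) and the rule for returning to the query are my own simplifying assumptions, not the paper's actual formulation.

```python
import random
from collections import defaultdict

def unified_walk(query, click_graph, link_graph, browse_prob=0.7, steps=1000):
    """Simulate a walker who searches, clicks, browses, then searches again.

    click_graph maps a query to {url: click probability}; link_graph maps a
    page to its outgoing hyperlinks. Returns visit counts per page.
    """
    visits = defaultdict(int)
    urls, weights = zip(*click_graph[query].items())

    def search():
        # Return to the query and click a result by its click probability.
        return random.choices(urls, weights=weights)[0]

    page = search()
    for _ in range(steps):
        visits[page] += 1
        outlinks = link_graph.get(page, [])
        if outlinks and random.random() < browse_prob:
            page = random.choice(outlinks)   # keep browsing the hyperlink graph
        else:
            page = search()                  # give up browsing, search again
    return visits
```

Pages visited often in this combined walk are, in spirit, the ones the unified graph should rank highly for the query.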

I really like the idea and think it should be useful, but I cannot help but notice that the combined hyperlink and query-click graphs are trying to approximate how people actually move across the Web. If a search engine had good data on how people actually moved across the Web, it might not need this approximation.

For more on using data on how people actually move across the Web, please see my earlier posts, "Google Toolbar and the actual surfer model" and "Search trails and relevance".

1 comment:

Anonymous said...

The paper mentions in the conclusions that click-based ranking would be biased and should be used with other ranking methods, because searchers would obviously click on the top results, which would lead to self-reinforcing rankings.

But I think the top results are self-reinforcing (though to a lesser extent) even in link-based models and others. A large number of bloggers who want to link to some source will typically do a search for their keywords and link to the top results.

This happens a lot when people write those (still) famous '10 ways to do blah-blah-blah' posts. These same posts then get promoted on social media, Twitter, etc., so the combined effect would still be biased even if you use social media popularity, bookmarks, links, etc. for ranking purposes.

It would be interesting to see research measuring the average boost (per niche and ranking position) in additional links, click-throughs, bookmarks, etc. after a result reaches the first page, and then discounting that boost to surface new results.

As a side note, I think algorithms could also use (if they aren't already) the lack of a boost in links and bookmarks for first-page results to demote them. For example, if results #2 and #3 are being linked, bookmarked, and tweeted more than #1 over a given period, that could mean #1 should be demoted.