Saturday, February 27, 2010

Personalization and differential pricing

Google's Chief Economist Hal Varian has a new paper out, "Computer Mediated Transactions" (PDF). An excerpt of his predictions on personalization:
Instead of a "one size fits all" model, the web offers a "market of one" ... [powered by] suggestions of things to buy based on your previous purchases, or on purchases of customers like you.

Not only content, but prices may also be personalized, leading to various forms of differential pricing ... [But] the ability of firms to extract surplus [may be] quite limited when consumers are sophisticated ... [And] perfect price discrimination and free entry ... pushes profits to zero, conferring all benefits to the customers.

The same sort of personalization can occur in advertising ... Google and Yahoo ... [already] allow users to specify their areas of interest and then see ads related to those interests. It is also relatively common for advertisers ... to show ads based on previous responses of users to related ads.
Back in 2000, Amazon got slammed (e.g. [1]) for an experiment with differential pricing, but Hal appears to be predicting differential pricing will rise again.
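
To make the idea concrete, here is a toy sketch of differential pricing -- every name and number below is made up for illustration, not anything Amazon or Google has described -- where the quoted price depends on a crude estimate of the individual customer's willingness to pay:

```java
import java.util.Map;

// Purely illustrative sketch of differential pricing (all names and numbers
// hypothetical). The quoted price is the base price adjusted by a rough
// willingness-to-pay estimate derived from the customer's purchase history.
public class PersonalizedPricing {

    static double personalizedPrice(double basePrice, Map<String, Double> customerFeatures) {
        // Toy multiplier: frequent, price-insensitive buyers get quoted
        // closer to their reservation price.
        double purchaseFrequency = customerFeatures.getOrDefault("purchases_per_month", 0.0);
        double priceSensitivity  = customerFeatures.getOrDefault("price_sensitivity", 1.0);
        double multiplier = 1.0 + 0.05 * purchaseFrequency - 0.10 * (priceSensitivity - 1.0);
        return basePrice * Math.max(0.8, Math.min(1.2, multiplier)); // clamp to +/- 20%
    }

    public static void main(String[] args) {
        // A frequent, price-insensitive customer is quoted above the base price.
        System.out.println(personalizedPrice(20.0,
                Map.of("purchases_per_month", 4.0, "price_sensitivity", 0.5)));
    }
}
```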

The paper also talks briefly about how experimentation changes how companies make decisions ("when experiments are cheap, they are likely to provide more reliable answers than opinions"), data mining, online advertising, legal contracts that use computer monitoring to enforce their terms, and cloud computing. The paper is from the 2010 Ely Lecture at the American Economic Association, and video of the talk is available.

Tuesday, February 23, 2010

How we all teach Google to Google

Steven Levy at Wired just posted an article, "How Google's Algorithm Rules the Web", with some fun details on how Google uses constant experimentation, logs of searches and clicks, and many small tweaks to keep improving their search results.

Well worth reading. Some excerpts as a teaser:
[Google Fellow Amit] Singhal notes that the engineers in Building 43 are exploiting ... the hundreds of millions who search on Google. The data people generate when they search -- what results they click on, what words they replace in the query when they're unsatisfied, how their queries match with their physical locations -- turns out to be an invaluable resource in discovering new signals and improving the relevance of results.

"On most Google queries, you're actually in multiple control or experimental groups simultaneously," says search quality engineer Patrick Riley. Then he corrects himself. "Essentially," he says, "all the queries are involved in some test." In other words, just about every time you search on Google, you're a lab rat.

This flexibility -- the ability to add signals, tweak the underlying code, and instantly test the results -- is why Googlers say they can withstand any competition from Bing or Twitter or Facebook. Indeed, in the last six months, Google has [found and] made more than 200 improvements.
Even so, this raises the question of where the point of diminishing returns is with more data and more users. While startups lack Google's heft, Yahoo and Bing are big enough that -- if they continuously experiment, tweak, and learn from their data as much as Google does -- any remaining differences in search quality likely would be confined to an imperceptibly small slice of long tail queries.
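
Riley's point that nearly every query sits in multiple control or experimental groups at once is easiest to picture as layered bucketing: hash each request separately for each layer of experiments, so a single query gets exactly one arm per layer and the layers stay independent. A rough sketch of that idea -- everything here is hypothetical, not Google's actual setup:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of layered experiment assignment (all names hypothetical).
// Each layer holds mutually exclusive arms; hashing the request id with a
// per-layer salt puts every request into exactly one arm per layer, so a
// single query participates in several experiments simultaneously.
public class ExperimentLayers {

    static final String[][] LAYERS = {
        {"ranking-control", "ranking-new-signal"},   // layer 0: ranking experiments
        {"ui-control", "ui-bolder-titles"},          // layer 1: UI experiments
        {"snippets-control", "snippets-longer"},     // layer 2: snippet experiments
    };

    static List<String> assign(String requestId) {
        List<String> arms = new ArrayList<>();
        for (int layer = 0; layer < LAYERS.length; layer++) {
            String[] candidates = LAYERS[layer];
            // Salt the hash with the layer index so assignments are
            // independent across layers.
            int bucket = Math.floorMod((requestId + "#" + layer).hashCode(), candidates.length);
            arms.add(candidates[bucket]);
        }
        return arms;
    }

    public static void main(String[] args) {
        // The same request id always lands in the same arm of every layer.
        System.out.println(assign("query-12345"));
    }
}
```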

Google Reader recommends articles

In a post on the official Google Reader blog, "May we recommend...", Laurence Gonsalves describes a new recommendation feature for Google Reader that recommends articles based on what you have read in the past. An excerpt:
Many of you wanted to see even more personalized recommendations ... [Now], we've started inserting items selected just for you inside the Recommended items section. This is great if you've got interests that are less mainstream. If you love Lego robots, for example, then you should start to notice more of them in your Recommended items.
Sadly, no additional details appear to be available. In my usage, there were rare gems in the recommendations, but also a lot of randomness and a strong bias toward very popular items. The lack of explanation -- why was this item recommended? -- and the lack of any way to correct the recommendations will likely make people less forgiving of these problems. I also saw recommendations for items I had already read; items you have already seen should always be filtered from recommendations.
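
Google has not said how the recommender works, but the already-read complaint is the kind of thing a simple post-filter over candidate items would fix. A minimal sketch (types and fields here are hypothetical):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Minimal sketch of post-filtering a recommendation list (types and fields
// are hypothetical): drop anything the user has already read before showing
// the list, keeping the highest-scoring remaining items first.
public class RecommendationFilter {

    record Item(String id, String title, double score) {}

    static List<Item> filterSeen(List<Item> candidates, Set<String> readItemIds) {
        return candidates.stream()
                .filter(item -> !readItemIds.contains(item.id()))        // remove already-read items
                .sorted((a, b) -> Double.compare(b.score(), a.score()))  // best score first
                .collect(Collectors.toList());
    }
}
```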

For more on that, you might enjoy some of my previous posts on this topic, such as the Mar 2009 "What is a good recommendation algorithm?" and the much older Dec 2006 "The RSS beast".

Update: A couple weeks later, Google launched Google Reader Play, a StumbleUpon knock-off that recommends web content based on what you say you like. Googler Garrett Wu writes that "it uses the same technology as the Recommended Items feed in Reader." In my usage, it had the same problems too.

Tuesday, February 02, 2010

New details on LinkedIn architecture

Googler Daniel Tunkelang recently wrote a post, "LinkedIn Search: A Look Beneath the Hood", that has slides from a talk by LinkedIn engineers along with some commentary on LinkedIn's search architecture.

What makes LinkedIn search so interesting is that it combines real-time updates (the "time between when user updates a profile and being able to find him/herself by that update need to be near-instantaneous"), faceted search (">100 OR clauses", "NOT support", complex boolean logic, some facets are hierarchical, some are dynamic over time), and personalized relevance ranking of search results (ordered by distance in your LinkedIn social graph).
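
LinkedIn's Lucene modifications are only partly public, so as a rough illustration of what a ">100 OR clauses" facet filter with NOT support looks like, here is plain Lucene (3.x-era API), not their code:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Plain-Lucene sketch of the kind of facet filter described in the talk:
// many OR'd facet values, a NOT clause, combined with the user's keywords.
// (Lucene caps boolean clauses at 1024 by default; larger facet filters
// would need BooleanQuery.setMaxClauseCount to be raised.)
public class FacetQueryExample {

    static BooleanQuery buildQuery(String keyword, String[] companyIds, String excludedIndustry) {
        BooleanQuery facetClause = new BooleanQuery();
        for (String id : companyIds) {                  // easily >100 OR clauses
            facetClause.add(new TermQuery(new Term("company_id", id)), Occur.SHOULD);
        }

        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("profile_text", keyword)), Occur.MUST);
        query.add(facetClause, Occur.MUST);             // must match at least one facet value
        query.add(new TermQuery(new Term("industry", excludedIndustry)), Occur.MUST_NOT);
        return query;
    }
}
```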

LinkedIn appears to use a combination of aggressive partitioning, keeping data in memory, and a lot of custom code (mostly modifications to Lucene, some of which have been released as open source) to handle these challenges. One interesting tidbit: going against current conventional wisdom, LinkedIn appears to use caching only minimally, preferring to spend its effort and machine resources on making sure results can be recomputed quickly rather than on hiding poor performance behind caching layers.
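
The partition-and-recompute approach amounts to fanning every query out to in-memory shards and merging the results on the way back, with nothing cached between requests. A toy sketch of that structure (assumed for illustration, not LinkedIn's code):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.*;

// Toy scatter-gather over in-memory partitions (structure assumed, not
// LinkedIn's actual code). Every query fans out to all shards, each shard
// recomputes its top results from in-memory data, and the broker merges
// them; nothing is cached between queries.
public class ScatterGatherSearch {

    interface Shard {
        List<ScoredDoc> search(String query, int limit); // recomputed per request
    }

    record ScoredDoc(long memberId, double score) {}

    private final List<Shard> shards;
    private final ExecutorService pool;

    ScatterGatherSearch(List<Shard> shards) {
        this.shards = shards;
        this.pool = Executors.newFixedThreadPool(shards.size());
    }

    List<ScoredDoc> search(String query, int limit) throws Exception {
        List<Future<List<ScoredDoc>>> futures = new ArrayList<>();
        for (Shard shard : shards) {
            futures.add(pool.submit(() -> shard.search(query, limit)));  // scatter
        }
        List<ScoredDoc> merged = new ArrayList<>();
        for (Future<List<ScoredDoc>> f : futures) {
            merged.addAll(f.get());                                      // gather
        }
        merged.sort(Comparator.comparingDouble(ScoredDoc::score).reversed());
        return merged.subList(0, Math.min(limit, merged.size()));
    }
}
```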