Sunday, April 26, 2009

Leaving Microsoft

With the dissolution of much of Live Labs, I have decided to resign from Microsoft. I will be leaving at the end of April.

Working at Microsoft on search and advertising turned out to be a lot of fun. High-impact, useful, and interesting problems are everywhere.

For example, during my time there, I had a chance to work a bit on advertising relevance (for work with similar motivation, see [1] in Section 6.2 and [2]), search relevance (closest public example of something vaguely similar might be [1]), improving the quality of human judgments (vaguely similar to the ideas published in [1] and [2]), looking at new evaluation methods for search (motivated by [1]), ubiquitous online experimentation (same goals as ExP), personalized web search (like Findory and motivated by [1] and [2]), personalized advertising (see [1]), and large scale data analyses (see [1]).

And, as fun as the problems were, so were the people. I had a chance to talk with so many at Microsoft, from celebrated researchers to the hard-working talent pounding on the code. It was very enjoyable to work on such a breadth of problems and with so many different people. I did much, learned much, and I will miss it.

As for next steps, I will be taking some time before settling into anything new. I still hold the same passion for taming information overload, for personalizing the data streams of the Web to make them relevant, helpful, and useful. Whoever manages to change the nature of content display on the Web from a search problem to a recommender problem will reap tremendous rewards. I hope to play my part in that shift.

Saturday, April 25, 2009

Google server and data center details

At the Efficient Data Center Summit, Google and others discussed techniques for reducing the energy consumption of massive clusters. In the process, the Googlers offered some very fun peeks into how they designed some of their servers and data centers.

For example, Chris Malone and Ben Jai gave a talk, "Insights Into Google's PUE", that, starting on slide 8, describes how Google uses single-voltage (12V) power and an on-board uninterruptible power supply to raise efficiency at the motherboard from the norm of 65-85% to 99.99%. There is a picture of the board on slide 17.
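For a rough sense of what that efficiency gap means per machine, here is a back-of-the-envelope calculation; the 250W server draw is my own illustrative assumption, while the efficiency figures are the ones cited from the talk:

```python
# Back-of-the-envelope: wall power wasted at different conversion efficiencies.
# The 250W board load is an assumed, illustrative figure; the efficiencies are
# the 65-85% norm and the 99.99% figure from the talk.

def input_power(load_watts, efficiency):
    """Watts drawn from the wall to deliver load_watts to the board."""
    return load_watts / efficiency

load = 250.0  # assumed per-server load in watts

for eff in (0.65, 0.85, 0.9999):
    drawn = input_power(load, eff)
    wasted = drawn - load
    print(f"efficiency {eff:7.2%}: draws {drawn:6.1f} W, wastes {wasted:5.1f} W as heat")

# At 65% efficiency, roughly 135 W per server is lost before reaching the board;
# at 99.99%, the loss is negligible. The waste heat must also be cooled, so the
# savings compound across a data center.
```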

Amazon's James Hamilton attended the conference and elaborated on this:
The server design Google showed was clearly a previous generation ... a 2005 board ... [but] was a very nice design.

The board is a 12volt only design ... 12V only supplies are simpler, distributing on-board the single voltage is simpler and more efficient, and distribution losses are lower.

The most innovative aspect of the board design is the use of a distributed UPS. Each board has a 12V VRLA battery that can keep the server running for 2 to 3 minutes during power failures. This is plenty of time to ride through the vast majority of power failures ... [and] it avoids the expensive [and less efficient] central UPS system.

The server was designed to be rapidly serviced with the power supply, disk drives, and battery all being Velcro attached and easy to change quickly.
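As a sanity check on the battery sizing mentioned above, a quick calculation suggests the ride-through battery can indeed be small; the 250W server draw and the usable-capacity fraction below are my own illustrative assumptions, with only the 2-3 minute target taken from the description above:

```python
# Rough sizing of a per-server 12V battery for power-failure ride-through.
# The 250W draw and 80% usable capacity are illustrative assumptions; the
# 2-3 minute ride-through target comes from the description above.

server_watts = 250.0     # assumed server draw
ride_through_min = 3.0   # ride-through target
usable_fraction = 0.8    # assumed usable fraction of rated battery capacity

energy_wh = server_watts * ride_through_min / 60.0  # watt-hours needed
rated_wh = energy_wh / usable_fraction              # rated capacity needed
amp_hours = rated_wh / 12.0                         # at 12V

print(f"Energy needed: {energy_wh:.1f} Wh")
print(f"Rated capacity: {rated_wh:.1f} Wh, about {amp_hours:.1f} Ah at 12V")
# Around 12.5 Wh needed and roughly 1.3 Ah at 12V -- a small, cheap VRLA cell.
```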
Also fun is a video tour of one of Google's data centers. The video is short and worth watching for a look at how cooling and wiring are done these days, as well as at how they use servers in shipping containers.

For more details, Amazon's James Hamilton has additional notes ([1] [2] [3]) and CNet's Stephen Shankland links to all the videos (including video of the talks at the summit).

Friday, April 24, 2009

Serendipity, diversity, and personalized search

Aside from the amusing double entendre in its title, a recent paper out of Microsoft Research, "From X-Rays to Silly Putty via Uranus: Serendipity and its Role in Web Search" (PDF), is notable for its take on two topics that seem to be attracting increasing attention lately: personalized search and improving the diversity of search results.

Some excerpts:
Partially-relevant search results, identified as "containing multiple concepts, [or] on target but too narrow," play an important role in a user's information seeking process and problem definition.

By studying Web search query logs and the results people judge relevant and interesting, we find many of the queries people perform return interesting (potentially serendipitous) results that are not directly relevant .... More than a fifth of all search results were judged interesting but not highly relevant to the search task.

Serendipity was more likely to occur in diverse result sets .... Personalization scores correlate with both relevance and also with interestingness, suggesting that information about personal interests and behaviour may be used to support serendipity.
So, the paper suggests that there may be multiple benefits to personalized search. Not only do we get improved understanding of query intent and increased relevance, but we may also get better diversity and discovery.
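As an illustration of the diversity angle only, here is a minimal sketch of one common re-ranking approach, maximal marginal relevance (MMR); this is not the paper's method, and the relevance scores, similarity function, and lambda weight are all placeholders:

```python
# Minimal sketch of maximal marginal relevance (MMR) re-ranking.
# Not the paper's method -- just one common way to trade relevance off
# against redundancy. The scores and similarity function are placeholders.

def mmr_rerank(candidates, relevance, similarity, k=10, lam=0.7):
    """Greedily pick k results, balancing relevance against redundancy.

    candidates: list of result ids
    relevance:  dict mapping result id -> relevance score
    similarity: function (id, id) -> similarity in [0, 1]
    lam:        1.0 = pure relevance, 0.0 = pure diversity
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(doc):
            redundancy = max((similarity(doc, s) for s in selected), default=0.0)
            return lam * relevance[doc] - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Lowering lam pushes the ranking toward more diverse, potentially more serendipitous results at the cost of raw relevance, which is the kind of knob a personalized interestingness signal could help tune.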

For a discussion of yet another benefit, reducing the payoff to web spammers, please see also my July 2006 post, "Combating web spam with personalization".

Friday, April 10, 2009

MapReduce using Amazon's cluster and differential pricing

Amazon recently launched Elastic MapReduce, a web service that lets people run MapReduce jobs on Amazon's cluster.

Elastic MapReduce appears to handle almost all the details for you. You upload data to S3, then run a MapReduce job. All the work of firing up EC2 instances, getting Hadoop on them, getting the data out of S3, and putting the results back in S3 appears to be done for you. Pretty cool.
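To make the workflow concrete, here is a minimal sketch of the kind of job you might hand to it: a word-count mapper and reducer written in the Hadoop streaming style, reading lines from stdin and writing tab-separated key/value pairs to stdout. The script name and invocation are placeholders, not anything Elastic MapReduce requires.

```python
#!/usr/bin/env python
# Minimal Hadoop-streaming-style word count, as a sketch of the sort of job
# Elastic MapReduce runs. Both phases read stdin and write tab-separated
# key/value pairs to stdout; Hadoop sorts mapper output by key in between.
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

if __name__ == "__main__":
    # Invoke as "wordcount.py map" for the map phase and "wordcount.py reduce"
    # for the reduce phase (hypothetical invocation for this sketch).
    mapper() if sys.argv[1] == "map" else reducer()
```

In the Elastic MapReduce model, the input data and the scripts would sit in S3, the job would run on EC2 instances spun up for you, and the output would land back in S3 when the job finishes.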

Even so, I have a big problem with this new service: the pricing. MapReduce jobs are batch jobs that could run at idle times on the cluster, but there appears to be no effort to run them during idle times, nor is there any discount for doing so. In fact, you actually pay a premium for MapReduce jobs above the cost of the EC2 instances used during the job.

It is a huge missed opportunity. Smoothing out peaks and troughs in cluster load improves efficiency. Using the idle time of machines in Amazon's EC2 cluster should be essentially free. The hardware and infrastructure costs are all sunk. At off-peak times, the only true cost is the marginal cost of the additional electricity a busy box uses over an idle one.
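To put a rough number on that marginal cost, here is an illustrative calculation; the wattage figures and electricity price are my own assumptions, not Amazon's numbers:

```python
# Rough marginal cost of running a batch job on an otherwise idle box.
# The wattages and electricity price are illustrative assumptions,
# not Amazon's numbers.

idle_watts = 150.0     # assumed draw of an idle server
busy_watts = 250.0     # assumed draw of the same server under load
price_per_kwh = 0.07   # assumed industrial electricity price, $/kWh

extra_kw = (busy_watts - idle_watts) / 1000.0
marginal_cost_per_hour = extra_kw * price_per_kwh

print(f"Marginal electricity cost: ${marginal_cost_per_hour:.4f} per machine-hour")
# About $0.007 per machine-hour under these assumptions -- a small fraction of
# on-demand EC2 pricing, which is what makes a steep off-peak discount plausible.
```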

What Amazon should be doing is offering a steep discount on EC2 pricing for interruptible batch jobs like MapReduce jobs, then running those jobs only in the idle capacity at off-peak times. This would allow Amazon to smooth the load on their cluster and improve utilization while passing the savings on to others.
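As a sketch of what that might look like on the scheduling side, here is a toy admission policy that only admits discounted, interruptible jobs while the cluster is under a utilization threshold and suspends them as demand returns; the Job class, threshold, and everything else here are hypothetical, not an actual Amazon API:

```python
# Hypothetical admission control for discounted, interruptible batch jobs.
# The Job class, threshold, and numbers are illustrative only; nothing here
# corresponds to an actual Amazon API.
from dataclasses import dataclass

IDLE_THRESHOLD = 0.60  # only admit batch work while the cluster is under 60% busy

@dataclass
class Job:
    name: str
    expected_load: float    # fraction of cluster capacity the job will use
    state: str = "pending"  # pending -> running -> suspended

def schedule(utilization, pending, running):
    """Fill idle capacity with interruptible jobs; suspend them at peak."""
    if utilization < IDLE_THRESHOLD:
        # Off-peak: soak up idle capacity with discounted batch work.
        while pending and utilization + pending[0].expected_load < IDLE_THRESHOLD:
            job = pending.pop(0)
            job.state = "running"
            running.append(job)
            utilization += job.expected_load
    else:
        # Peak: suspend interruptible jobs to free capacity for on-demand load.
        for job in running:
            job.state = "suspended"
            utilization -= job.expected_load
        pending.extend(running)
        running.clear()
    return utilization

# Example: at 30% utilization, two small batch jobs get admitted.
pending = [Job("log-crunch", 0.10), Job("index-build", 0.15)]
running = []
print(schedule(0.30, pending, running))  # -> about 0.55, both jobs now running
```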

For more on this topic, please see also my Jan 2007 post, "I want differential pricing for Amazon EC2".

Please see also Amazon VP James Hamilton's recent post, "Resource Consumption Shaping", which also talks about smoothing load on a cluster. Note that James argues that the marginal cost of making an idle box busy is near zero because of the way power and network use is billed (at the 95th percentile).

For some history on past efforts to run Hadoop on EC2, please see my Nov 2006 post, "Hadoop on Amazon EC2".

Update: Eight months later, Amazon launches differential pricing for EC2.