Thursday, December 09, 2010

Papers on specialized databases at Google

Googlers have published two papers recently at academic conferences detailing new specialized databases that are heavily used within Google.

The first paper is a USENIX 2010 paper describing Percolator. Percolator is the database powering Caffeine, which is Google's new system to provide fresher search results by adding new documents and updates to documents to their search index in near real-time.

The Percolator paper is titled "Large-scale Incremental Processing Using Distributed Transactions and Notifications" (PDF). An excerpt:
We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day, while reducing the average age of documents in Google search results by 50%.

Percolator is built on top of the Bigtable distributed storage system .... Percolator was built specifically for incremental processing and is not intended to supplant existing solutions for most data processing tasks. Computations where the result can't be broken down into small updates (sorting a file, for example) are better handled by MapReduce. Also, the computation should have strong consistency requirements; otherwise, Bigtable is sufficient. Finally, the computation should be very large in some dimension (total data size, CPU required for transformation, etc.); smaller computations not suited to MapReduce or Bigtable can be handled by traditional DBMSs.
Percolator is a specialized database that adds new consistency guarantees (as well as triggers, which they call "observers") to Bigtable. One thing that is interesting is how specialized this system is. Percolator, for example, is in no way intended for online operations and can, in some cases, delay a transaction for tens of seconds due to stray locks. But, that is fine for the near real-time search index update task for which it is designed.

The second paper is a VLDB 2010 paper on Dremel. Dremel is a column store database designed to be orders of magnitude faster for some interactive database queries than MapReduce.

This paper is titled "Dremel: Interactive Analysis of Web-Scale Datasets" (PDF). An excerpt:
Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing.

Dremel can execute many queries over such data that would ordinarily require a sequence of MapReduce (MR) jobs, but at a fraction of the execution time. Dremel is not intended as a replacement for MR and is often used in conjunction with it to analyze outputs of MR pipelines or rapidly prototype larger computations .... Dremel provides a high-level, SQL-like language to express ad hoc queries. In contrast to layers such as Pig and Hive, it executes queries natively without translating them into MR jobs.
The paper includes some fun motivational examples describing how people use Dremel for rapid prototyping of new ideas. There is a huge advantage in spending just seconds rather than hours to examine the potential of a new feature for a classifier or a new signal for relevance rank. Dremel lets Googlers twiddle the feature multiple times to optimize it in just a few minutes, then run a big, multi-hour MapReduce job to get the final data, a huge advantage over rivals that might take days to do the same investigation.

Dremel, like all column stores, it works best when selecting just a few columns from the data, but that is a very common case well worth optimizing for. One fun aside briefly mentioned in the paper is that they see another order of magnitude or two speedup if they can stop after only looking at 98% of the data or so because waiting for straggler chunks causes big slow downs. So, if you are willing to have (unusually slightly) inaccurate results, you can get huge boosts in speed from stopping queries early. That is also not new, but again a very common case and worth thinking about.

In both of these papers, what I find so remarkable is how willing Google is to build specialized databases to speed up tasks. Percolator can only be used for tasks have huge data, strong consistency requirements, and can tolerate occasional latency of tens of seconds, but that is perfect for near real-time search index updates. Dremel can only be used for selecting a few columns of data, but it is extremely common that a MapReduce job wants to ignore almost all the data in each record. It reminds me of the specialized search index structures and kernel twiddles Google does, which are other interesting examples of the lengths Google is willing to go to maximize the usefulness of their internal tools and the performance of their systems.

Wednesday, December 01, 2010

Groupon is not Googly

It is being widely reported that Google is bidding as much as $6B for Groupon. From an article in the New York Times:
Google has offered Groupon $5.3 billion, with the promise of $700 million in performance bonuses for management.

It could ... give Google access to a large sales force ... Groupon has 3,100 employees ... [and] 35 million [Groupon] subscribers worldwide, with 17 million in North America.

11 months [ago] ... Groupon ... [had only] 200 employees ... [Now] about 1,000 people work in the Chicago office and some 2,000 more are spread across its sprawling worldwide network, which includes the employees of its recent international acquisitions, ClanDescuento and — group-buying sites in Chile and Germany. According to Groupon, the company is adding more than 200 employees a month.
It is unclear to me why Google is so interested in this company, so much so that it is willing to pay nearly twice what it paid for DoubleClick.

Google helps people find information. It's mission is "to organize the world's information and make it universally accessible and useful." In advertising, Google helps people find millions of interesting small businesses and sort through billions of products to get what they need.

Groupon is very different. Groupon runs one deal per day per market. There is no technology at Groupon, no information finding, no long tail, no massive data stream. What there is is a large (but very recently hired) sales force, significant revenue, and a decent subscriber base. Those are the kinds of things that looks good to MBAs building spreadsheets and empires, but are not deeply aligned with Google's mission of organizing information and making it useful.

It looks like this is a done deal at this point. But Groupon is not Googly. I fear these cultures are not going to blend well when mashed together. I wish these two luck, but this looks to be more like the kind of union where the bride walks away with the house and the dog in a few years than one where both sides thrive.

Update: It appears it was not a done deal after all.

Friday, November 12, 2010

An update on Google's infrastructure

Google Fellow Jeff Dean gave a talk at Stanford for the EE380 class with fascinating details on Google's systems and infrastructure. Krishna Sankar has a good summary along with a link to video (Windows Media) of the talk.

Some fun statistics in there. Most amazing are the improvements in data processing that have gotten them to the point that, in May 2010, 4.4M MapReduce jobs consumed 39k machine years of computation and processed nearly an exabyte (1k petabytes) of data that month. Remarkable amount of data munching going on at Google.

The talk is an updated version of Jeff's other recent talks such as his LADIS 2009 keynote, WSDM 2009 keynote, and 2008 UW CS lecture.

[HT, High Scalability, for the pointer to Krishna Sankar's notes]

Monday, November 08, 2010

More on why paywalls fail

Felix Salmon on why paywalls fail:
It’s not just that readers don’t see the value in paying for content when something “similar” can be found elsewhere. It’s also that there is positive extra value in reading free content, since it becomes much easier to share that content via email or blogs or Facebook or Twitter, you don’t need to worry about following links or running into paywalls, and in general you know that the site will play well with others on the open web.

If Newsday puts up a paywall and it fails, is that because readers can find content similar to Newsday’s elsewhere for free? Yes, in part. But it’s also because the people who would otherwise visit have lots of other things they also like to do. They like to spend time in Farmville, or they want to watch a video of a dog skateboarding, or they want to see their house on Google Earth, or they want to go walk their dog. These aren’t people who need certain information and are going to seek it out at the lowest cost; they’re just people who would visit Newsday’s website if it was free, but won’t if it isn’t.

That’s why gateways and paywalls are such problematic things, online: they’re a bit like that crappy VIP room in the back of the nightclub which is much less pleasant than the big main space. You might wander in there from time to time if it’s free, but if you need to buy an expensive bottle of Champagne to do so, forget it. There’s lots of other stuff to do, both online and off. And so the walled-off areas of the internet simply get ignored.
It is not just that the content has to be uniquely valuable to make the hurdle of a paywall worth it to readers. It is also that the experience of a paywall detracts from the value of the content because of the hassle to all readers, including when someone wants to share an article with a colleague but cannot because of the paywall.

I would add that, even ignoring the value of sharing, the hurdle of a paywall often seems to be underestimated. As described in Dan Ariely's Predictably Irrational ([1] [2]), among other places ([3]), free has no transaction costs, no risk of loss, and great appeal. Charging anything, anything at all, creates transaction costs and risk, to the point that the vast majority of people will not do it unless the perceived value is obvious and obviously high.

Friday, October 15, 2010

Round-up of latest reading

I've been active microblogging on Google Reader lately. Here's a summary of some of the topics that have caught my attention recently.
  • Microsoft's Bing finally launches a form of personalized search ([1] [2] [3])

  • Netflix is shifting entirely to Amazon Web Services, not just for streaming, but for almost everything ([1] [2])

  • Postmortems of startup failures ([1] [2])

  • It took four long years, but spiking Android sales and the promising early reviews of Windows Phone 7 probably signal the beginning of the end of the iPhone's reign ([1] [2] [3])

  • Microsoft cut benefits again. Their lack of corporate memory is just astounding. Look for new problems in morale and retention, just like after 2004, then for the benefits to be reinstated a couple years later, like they did in 2006. ([1] [2] [3])
More in my shared items microblogging stream or, if you use Google Reader, search for me and follow my shared items there.

Saturday, September 18, 2010

Eric Schmidt on automatic search

Google CEO Eric Schmidt talks up automatic search from mobile devices:
Ultimately, search is not just the web but literally all of your information - your email, the things you care about, with your permission - this is personal search, for you and only for you.

"The next step of search is doing this automatically. When I walk down the street, I want my smartphone to be doing searches constantly - 'did you know?', 'did you know?', 'did you know?', 'did you know?'.

This notion of autonomous search - to tell me things I didn't know but am probably interested in, is the next great stage - in my view - of search.
While I agree with the idea of heavily localized and personalized searches, especially on mobile devices, I think this autonomous search feature sounds really annoying. You don't want to get in people's way. You don't want to interrupt them with something unimportant, especially if you are interrupting someone who is trying to get something done.

Perhaps what might be desirable would be better described as recommendations and personalized advertising, not as some Googly version of Clippy popping up and chirping, "Did you know? Did you know?"

Update: Interesting discussion in the comments about whether what Google is building is really personalized advertising, not search.

Causing internal competition and low morale through compensation policy

Over at Mini-Microsoft, Microsoft employees are listing the details of their compensation changes after their performance reviews.

Reading through them, it is pretty clear that almost everyone is unhappy, both with their reviews and with their relative gains. Which is exactly what you would expect.

This is an instructive example of how forced rank and fine-grained compensation adjustments based on forced rank hurt morale and ends up competing people inside a company against each other.

What you want in an organization is people focused on working together as a team. But, when you use forced rank, a fixed compensation budget for the group, and compensation changes tied to rankings, success for everyone becomes a zero-sum game. You can do just as well by bringing down people on your team -- so you look relatively better -- as by helping people on your team. In fact, it is probably easier to do better by dragging down your colleagues because you have direct control over that.

Performance-based compensation sounds great in theory, but never works in practice, partly because managers lack the information and objectivity to implement it well, partly because people never remember what they do badly and so are almost always angered by the review and compensation adjustments.

For more on this topic, please see my earlier posts, "The problem with forced rank", "Management and total nonsense", and "Joel Spolsky on management methods".

Cuil is dead

Cuil is dead:
Cuil, the much maligned search engine that at one time had hopes of toppling Google, has gone offline ... It may be done for good. Those employees who are still with the company apparently weren't paid this week, and they're starting to say they’re looking for new jobs.
Flashback to the hype of July 2008 around Cuil:
Take yesterday's over-hyped launch of stealth search startup Cuil, which was quickly followed by a backlash when everyone realized that it was selling a bill of goods. This was entirely the company's own fault. It pre-briefed every blogger and tech journalist on the planet, but didn’t allow anyone to actually test the search engine before the launch.

The company's founders have a good pedigree ... But creating a big index is only half the battle. A good search engine has to bring back the best results from that haystack as well. Here Cuil falls short ... The results Cuil returns aren't particularly great, and sometimes completely off the mark.
And what happened soon after:
The launch of the search engine was nothing but a classic PR trainwreck, with much hype and little to show for. Cuil failed to deliver good enough results to drive anyone to change their search behavior, and quickly became the subject of backlash and criticism because of their poor performance and indexing methods that actually took websites down in the process.

I took a peek at how they're doing traffic-wise out of sheer curiosity. After all, with no less than $33 million in funding and a founding management team consisting of ex-Google search experts, something had to give, right? Well, no. Cuil isn't performing well any way you look at it ... search engine traffic is nearing rock bottom.

Tuesday, September 07, 2010

Machine learning on top of GFS at Google

Googler Tushar Chandra recently gave a talk at the LADIS 2010 workshop on "Sibyl: A system for large scale machine learning" (slides available as PDF), which discusses a system that Google has built to do large scale machine learning on top of MapReduce and GFS.

This system probably is just one of many inside the secretive search giant, but we don't often get these kinds of peeks inside. What I found most interesting was the frank discussion of the problems the Google team encountered and how they overcame them.

In particular, the Googlers talk about how they building on top of a system intended for batch log processing caused some difficulties, which they overcame by using a lot of local memory and being careful about arranging data and moving data around. Even so, the last few slides mention how they kept causing localized network and GFS master brownouts, impacting other work on the cluster.

That last problem seems to have been an issue again and again in cloud computing systems. That pesky network is a scarce, shared resource, and it often takes a network brownout to remind us that virtual machines are not all it takes to get everyone playing nice.

On a related topic, please see my earlier post, "GFS and its evolution", and its discussion of the pain Google hit when trying to put other interactive workloads on top of GFS. And, if you're interested in Google's work here, you might also be interested in the open source Mahout, which is a suite of machine learning algorithms mostly intended for running on top of Hadoop clusters.

Friday, September 03, 2010

Insights into the performance of Microsoft's big clusters

A recent article from Microsoft in IEEE Computing, "Server Engineering Insights for Large-Scale Online Services" (PDF), has surprisingly detailed information about the systems running Hotmail, Cosmos (Microsoft's MapReduce/Hadoop), and Bing.

For example, the article describes the scale of Hotmail's data as being "several petabytes ... [in] tens of thousands of servers" and the typical Hotmail server as "dual CPU ... two attached disks and an additional storage enclosure containing up to 40 SATA drives". The typical Cosmos server apparently is "dual CPU ... 16 to 24 Gbytes of memory and up to four SATA disks". Bing uses "several tens of thousands of servers" and "the main memory of thousands of servers" where a typical server is "dual CPU ... 2 to 3 Gbytes per core ... and two to four SATA disks".

Aside from disclosing what appear to be some previously undisclosed details about Microsoft's cluster, the article could be interesting because of insights into the performance of these clusters on the Hotmail, Bing, and Cosmos workloads. Unfortunately, the article suffers from taking too much as a given, not exploring the complexity of interactions between CPU, memory, flash memory, and disk in these clusters on these workloads, and not attempting to explain the many oddities in the data.

Those oddities are fun to think about though. To take a few that caught my attention:
  • Why are Bing servers CPU bound? Is it because, as the authors describe, Bing uses "data compression on memory and disk data ... causing extra processing"? Should Bing be doing so much data compression that it becomes CPU bound (when Google, by comparison, uses fast compression)? If something else is causing Bing servers to be CPU bound, what is it? In any case, does it make sense for the Bing "back-end tier servers used for index lookup" to be CPU bound?
  • Why do Bing servers have only 4-6G RAM each not have more memory when they mostly want to keep indexes in memory, appear to be hitting disk, and are "not bound by memory bandwidth"? Even if the boxes are CPU bound, even if it somehow makes sense for them to be CPU bound, would more memory across the cluster allow them to do things (like faster but weaker compression) that would relieve the pressure on the CPUs?
  • Why is Cosmos (the batch-based log processing system) CPU bound instead of I/O bound? Does that make sense?
  • Why do Cosmos boxes have more the same memory than as Bing boxes when Cosmos is designed for sequential data access? What is the reason that Cosmos "services maintain much of their data in [random access] memory" if they, like Hadoop and MapReduce, are intended for sequential log processing?
  • If Hotmail is mostly "random requests" with "insignificant" locality, why is it designed around sequential data access (many disks) rather than random access (DRAM + flash memory)? Perhaps the reason that Hotmail is "storage bound under peak loads" is that it uses sequential storage for its randomly accessed data?

Update: An anonymous commenter points out that the Bing servers probably are two quad core CPUs -- eight cores total -- so, although there is only 2-3G per core, there likely is a total of 16-24G of RAM per box. That makes more sense and would make them similar to the Cosmos boxes.

Even with the larger amount of memory per Bing box, the questions about the machines still hold. Why are the Bing boxes CPU bound and should they be? Should Cosmos boxes, which are intended for sequential log processing, have the same memory as Bing boxes and be holding much of their data in memory? Why are Cosmos machines CPU bound rather than I/O bound and should they be?

Update: Interesting discussion going on in the comments to this post.

Monday, August 30, 2010

What is the benefit of freaking customers out?

Miguel Helft and Tanzina Vega at the New York Times have a front page article today, "Retargeting Ads Follow Surfers to Other Sites", on a form of personalized web advertising now being called retargeting.

An excerpt:
People have grown accustomed to being tracked online and shown ads for categories of products they have shown interest in, be it tennis or bank loans.

Increasingly, however, the ads tailored to them are for specific products that they have perused online. While the technique, which the ad industry calls personalized retargeting or remarketing, is not new, it is becoming more pervasive as companies like Google and Microsoft have entered the field. And retargeting has reached a level of precision that is leaving consumers with the palpable feeling that they are being watched as they roam the virtual aisles of online stores.

In remarketing, when a person visits an e-commerce site and looks at say, an Etienne Aigner Athena satchel on, a cookie is placed into that person’s browser, linking it with the handbag. When that person, or someone using the same computer, visits another site, the advertising system creates an ad for that very purse.
The article later goes on to contrast this technique of following you around with products you looked at before with behavioral targeting like Google is doing, which learns your broader category interests and shows ads from those categories.

If the goal of the advertising is to be useful and relevant, though, I think both of these are missing the mark. What you want to do is help people discover something they want to buy. Since the item they looked at before obviously wasn't quite right -- they didn't buy it after all -- showing that again doesn't help. Showing closely related alternatives, items that people might buy after rejecting the first item, could be quite useful though.

As marketing exec Alan Pearlstein says at the end of the NYT article, "What is the benefit of freaking customers out?" Remarketing freaks people out. If we are going to do personalized advertising, the goal should be to have the advertising be useful, either by sharing value with consumers using coupons as Pearlstein suggests, or by helping consumers find something interesting that they wouldn't have discovered on their own.

But, publishers should be careful when working with these new ad startups. A startup has a huge incentive to maximize short-term revenue and little incentive to maximize relevance. For the startup, as long as it brings in more immediate revenue, it is perfectly fine to show annoying ads that freak customers out and drive many away. Publishers need to force the focus to be on the value of the ads to the consumer so their customers are happy, satisfied, and keep coming back.

Thursday, August 19, 2010

Measuring online brand advertising without experiments

A few Googlers recently published a paper with a terribly dull title, "Evaluating Online Ad Campaigns in a Pipeline: Causal Models at Scale" (abstract, PDF), at the KDD 2010 conference.

The paper turns out to be a quite interesting attempt to measure the impact of online display advertising -- a notoriously difficult problem -- by looking at how it changes people's searching and browsing online. That's hard enough, but these crazy Googlers also are trying to do this without using A/B testing. To do that last trick, they separate people into those the exposed who have seen the ad and the controls who have not seen the ad while carefully limiting the controls only to people who are similar to the exposed.

From the paper:
Traditionally, online campaign effectiveness has been measured by "clicks" ... However, many display ads are not click-able ... and some campaigns hope to build longer-term interest in the brand rather than drive immediate response. Counting clicks alone then misses much of the value of the campaign.

Better measures of campaign effectiveness are based on the change in online brand awareness ... [due] to the display ad campaign alone. We ... [find] the change in probability that a user searches for brand terms or navigates to brand sites that can be attributed to an online ad campaign.

Randomized experiments ... are the gold standard for estimating treatment effects ... [but it] requires an advertiser to forego showing ads to some users ... [which] advertisers are not keen to [do] ... Estimation without randomization is more difficult but not always impossible .... Simply put, the controls [we pick] were eligible to be served campaign ads but were not.

Our estimates require summary (not personally identifiable) data on exposed and controls. The summary data are obtained from several sources, including the advertiser's own campaign information, ad serving logs, and sampled data from users who have installed Google toolbar and opted in to enhanced features.
By the way, some have speculated in the past ([1] [2]) that Google toolbar data is being used for Google's advertising, but there was no public confirmation of that from Google. To my knowledge, this is the first public confirmation that data from Google's ubiquitous toolbar is being used by them in at least some way in their advertising.

For more on related topics, please see also my November 2008 post, "Measuring offline ads by their online impact", and my July 2008 post, "Google Toolbar data and the actual surfer model".

Monday, August 16, 2010

Human computation and lemons

NYU Professor Panos Ipeirotis has an insightful post, "Mechanical Turk, Low Wages, and the Market for Lemons", that looks at why wages are so low, usually well below minimum wage, on Amazon's MTurk.

His theory is that spammers and cheaters have turned MTurk into a market for lemons. The quality is now so bad that buyers demand a risk premium and require redundant work for quality checks, splitting what might be a risk-reduced fair wage three to five ways among the workers.

An excerpt from his post:
A market for lemons is a market where the sellers cannot evaluate beforehand the quality of the goods that they are buying. So, if you have two types of products (say good workers and low quality workers) and cannot tell who is whom, the price that the buyer is willing to pay will be proportional to the average quality of the worker.

So the offered price will be between the price of a good worker and a low quality worker. What a good worker would do? Given that good workers will not get enough payment for their true quality, they leave the market. This leads the buyer to lower the price even more towards the price for low quality workers. At the end, we only have low quality workers in the market (or workers willing to work for similar wages) and the offered price reflects that.

This is exactly what is happening on Mechanical Turk today. Requesters pay everyone as if they are low quality workers, assuming that extra quality assurance techniques will be required on top of Mechanical Turk.

So, how can someone resolve such issues? The basic solution is the concept of signalling. Good workers need a method to signal to the buyer their higher quality. In this way, they can differentiate themselves from low quality workers.

Unfortunately, Amazon has not implemented a good reputation mechanism. The "number of HITs worked" and the "acceptance percentage" are simply not sufficient signalling mechanisms.
If you like Panos' post, you might also be interested in GWAP guru and CMU Professor Luis von Ahn's recent post, "Work and the Internet", where Luis bemoans the low wages on MTurk and questions whether they amount to exploitation. Panos' post is a response to Luis'.

Please see also my 2005 post, "Amazon Mechanical Turk?", where I wrote, "If I scale up by doing cheaper answers, I won't be able to filter experts as carefully, and quality of the answers will be low. Many of the answers will be utter crap, just made up, quick bluffs in an attempt to earn money from little or no work. How will they deal with this?"

Friday, July 09, 2010

Big redesign at Google News

It has been widely reported that Google News has done a major redesign -- its first since 2002 apparently -- to more prominently feature personalization and customization.

Before I comment on it, in the interest of full disclosure, I should say that I am absurdly biased on this particular topic, having run Findory and talked at length over the years with Google, their partners, and their competitors about news personalization.

That said, I don't like what they've done. And I'm not the only one. Thomas Claburn at InformationWeek catalogs the complaints he is seeing at InformationWeek and elsewhere, summarizing it all by comparing it to the "New Coke" flop.

I think what the Google team has done is a lovely example of personalization done poorly, by people who really should know better. They change navigation links based on personalization even when confidence is low (one of my links in the left hand nav is for "Lindsay Lohan", which is hard to stomach). The article recommendations are often off, cannot be corrected, do not change in real time as you read articles, and there is no explanation of why something was recommended. There is no ability to see, edit, or rate your reading history. The ability to exclude or favor sources appears to be hacked on; the only way to do it is to manually type in the names of sources.

Under the surface, there appears still to be a lot of implicit personalization based on past behavior, but, from what someone using it sees, the focus is entirely on customization. I can "edit personalization" and "add sections" to put categories on my page. And that is about the limit of my control and the limit of the explanations of why articles are appearing. People like to be in control. They like to understand why something happens, especially if they don't agree with it. And Google News offers very little control or explanations.

Adding to the other problems, the design seems really busy and confused to me, like the Googlers can't decide what they are doing and -- in a fashion more typical of Microsoft -- just keep adding features. Hey, look, it's your fast flipping, clustered, personalized, customizable, widget-complete newspaper! Love it, it's Googly! C'mon, Google, what happened to keeping it clean, simple, and relevant?

Tuesday, June 22, 2010

Google to personalize metashopping

In an interview with CNet, Googler Sameer Samat talks about Google's future plans for shopping search, including personalization and recommendations.

An excerpt:
One thing Google doesn't do very well is provide the shopping-as-adventure experience ... You might go to the mall with a specific product in mind, but a well-designed mall ... forces you to discover -- and hopefully purchase -- other products that you might not have even known you wanted: the marketing types like to call this "serendipity." Google wants to be known as a destination for that kind of experience, said Sameer Samat, director of product management.

After years of trying and failing to reach that goal, Google plans to give it another go over the coming months. Don't expect Google to turn into a full-blown online retailer among the likes of or just yet. But the combination of personalized features for product search pages and what Samat thinks is "the largest database of products that has been created" could entice people to actually shop on Google.

Google's current approach works best for those who are on a mission when they shop, shoppers who already know what they want and are just looking for additional information before sealing the deal ... [But] there are millions of other people who treat shopping as leisure, rather than a simple transaction. These are people who ... prefer browsing to targeted shopping, knowing that every now and then they'll discover something totally unique or completely unexpected.

Google wants to serve more of those people ... [by making] recommendations based on that list of products and lists submitted by others to help you discover new products: sort of like Amazon's recommendations page meets Pandora's radio stations meets Google.

"Shopping is not just about search, it's not just about intent, it's about discovery," Samat said. "If we can do it, and do it well, we will have built something that's really amazing; it should be the most comprehensive experience for shopping you could ever find."
On a related note, Google is pushing aggressively to get retailers to use Google's commerce search engine to run their search experience. Each deal Google signs gives them more detailed information about another retailing vertical. It's all about the data.

Monday, June 14, 2010

Google on presentation bias in search

A few Googlers at WWW 2010 had a paper, "Beyond Position Bias: Examining Result Attractiveness as a Source of Presentation Bias in Clickthrough Data" (PDF), that explores how much people tend to click on eye-catching search results rather than seeking the most relevant search results.

The work itself was pretty simple -- just looking at how bolding title and abstract terms changes clickthough rates in A/B tests -- but I think the paper is worth a peek for two reasons. First, it is a decent survey of some of the current work on position and presentation bias. Second, it exposes some of Google's struggles with the difficulty of deriving searcher satisfaction from the noisy proxies that we have available like click data.

By the way, I love the fact, noted in the paper, that people tend to click on the last result much more than you would expect. The reason is that people don't linearly scan down a page, but often jump to the bottom and focus attention there. A decade ago at Amazon, the personalization team exploited this effect and seized the space at the bottom of most pages on the site for our features. You see, when we saw no one had built tools to track click and conversion data, we built them, and then we used them. No one else realized the value of the space at the bottom of the page, but we did.

For more on the struggle to evaluate search results from noisy click data, please see some of my older posts, "Modeling how searchers look at search results ", "Finding task boundaries in search logs", and "Testing rankers by interleaving search results".

Tuesday, June 08, 2010

Travel itineraries from Flickr photo trails

Every once in a while, you hit a paper that seems like a startup waiting to happen. A paper that will be presented next week at HT 2010 by Yahoo Research is one of these.

The paper, "Automatic Construction of Travel Itineraries using Social Breadcrumbs" (PDF), cleverly uses the data often embedded in Flickr photos (e.g. timestamp, tags, sometimes GPS) to produce trails of where people have been in their travels. Then, they combine all those past trails to generate high quality itineraries for future tourists that tell them what to see, where to go, how long to expect to spend at each sight, and how long to allow for travel times between the sights.

Some excerpts from the paper:
Shared photos can be seen as billions of geo-temporal breadcrumbs that can promisingly serve as a latent source reflecting the trips of millions of users ... [We] automatically construct travel itineraries at large scale from those breadcrumbs.

By analyzing these breadcrumbs associated with a person's photo stream, one can deduce the cities visited by a person, which Points of Interest (POI) that the person took photos at, how long that person spent at each POI, and what the transit time was between POIs visited in succession.

By aggregating such timed paths of many users, one can construct itineraries that reflect the "wisdom" of touring crowds. Each such itinerary is composed of a sequence of POIs, with recommended visit times and approximate transit times between them.

[In surveys] users perceive our automatically generated itineraries to be as good as (or even slightly better than) itineraries provided by professional tour companies.
This reminds me quite a bit of the work on using GPS trails from mobile devices like phones (e.g. [1] or [2]) or search histories on maps (e.g. [3]). But, the use of Flickr photos as the data source is clever, especially for this application where the photos are also useful in the final output and the gaps in the data stream are not important.

Fun idea, nicely implemented, and very convincing results. Definitely worth a read. Don't miss the thoughts at the end on expansions to the idea, such as changing how the trails are filtered and aggregated based on individual preferences to generate personalized itineraries.

Monday, June 07, 2010

A Findory buyout offer from Yahoo?

I just received this check from Yahoo payable to

I've never seen a check for $0.00 before. Is that a buyout bid, Yahoo? Your offer for Findory?

In all seriousness, I'm sure this is just an mistake over in Yahoo-land, perhaps leftovers from the experiments Findory did with layering its personalized advertising engine over ads pulled from the Yahoo Publisher Network.

Still, years after Findory ended, years after Findory talked to Yahoo, and seeing how Yahoo now struggles to personalize content, I can't help but laugh at this check for $0.00 from Yahoo to Findory.

Tuesday, June 01, 2010

How Bing predicts the CTR of ads

An upcoming ICML 2010 paper, "Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft’s Bing Search Engine", describes the algorithm actually used in the Bing search engine to predict the clickthrough rates of ads.

From the paper:
Recognising the importance of CTR estimation for online advertising ... Bing/adCenter decided to run a competition to entice people across the company to develop the most accurate and scalable CTR predictor.

The algorithm described in this publication tied for first place in the first competition and won the subsequent competition based on prediction accuracy. As a consequence, it was chosen to replace Bing's previous CTR prediction algorithm, a transition that was completed in the summer of 2009.
The paper goes on to describe why the problem is important, the algorithm used, and some of the nastiness of getting something that works in the lab to run on the live site.

Don't miss the tidbit at the end where they say that they are "investigating the use of more powerful models, such as the feature-based collaborative filtering method Matchbox (Stern, Herbrich, & Graepel, 2009) for latent feature discovery and personalisation."

Thursday, May 20, 2010

Geeking with Greg administrivia

This blog passed some fun milestones recently. It just had its six year anniversary, logging over 1500 posts in its lifetime.

When I started this blog in 2004, I wanted to bring personalization to information. Just as personalization and recommendations help people discover what they want in a massive product catalog, I thought personalization and recommendations could help tame the information streams that flood over us. Over the next five years, I wrote on this blog as I worked on personalized news, personalized search, and personalized advertising.

Regular readers may have noticed that posts on this blog have slowed down a fair bit over the last year. In large part, this is because I am no longer working on personalized search or personalized advertising, nor does my work still benefit from tracking what is going on in the information retrieval research community.

I try hard to keep this blog on topic. My plan is to have posts here continue to focus on personalized information, perhaps a bit on research papers, but mostly tracking the increasingly aggressive moves of the internet giants toward personalized search and advertising. But, that means posts will continue to be fairly infrequent.

If for some reason you can't get enough of geeking with me, if you really must have more, you might consider tracking what I am reading more broadly by following me in Google Reader. I use the shared items there as a link blog and comment broadly there on many topics.

Also, on this sixth anniversary, I welcome feedback on what you might enjoy seeing more of on this blog, especially from long-time readers. Do you mostly like the reviews of research papers (which, many have told me, are a great time-saver)? Comments on events in the industry (which might be snark, but perhaps useful snark)? Do you find the posts to be too long or too infrequent? Do you care if this blog stays on the topic of personalized information? If there is anything you want more of, please let me know in the comments!

Friday, May 14, 2010

Yahoo as an internet information filter

Yahoo CEO Carol Bartz is talking up personalization again, this time in an amusingly titled Esquire article:
When you talk about the Internet growing to 225 million sites, you've got to ask: Who's parsing all that? How do you make sense of all that stuff?

I mean, who has time to wander all over the Internet?

Tomorrow's Yahoo! is going to be really tailored. I'm not talking about organization — organizing means that you already know what you want and somebody's just putting it in shape for you. I'm talking about both smart science and people culling through masses of information on the fly and figuring out what people want to know.

We will be delivering your interests to you. For instance, if you're a sports fan but have no interest in tennis, we won't show you tennis. We would know that you do things in a certain sequence, so we'd say, "Here's your portfolio. Here's some news you might like. Oh, you went to this movie last week, here's some other movies you might want to check out."

I call it the Internet of One. I want it to be mine, and I don't want to work too hard to get what I need. In a way, I want it to be HAL. I want it to learn about me, to be me, and cull through the massive amount of information that's out there to find exactly what I want.
Please see also my June 2009 post, "Yahoo CEO Carol Bartz on personalization", for more on Yahoo's vision of recommending information.

[Esquire article found Nick Carr]

Tuesday, May 11, 2010

Google tries to save the news

James Fallows at The Atlantic has a new article out, "How to Save the News", on what Googlers think about the future of the news industry.

Some key excerpts:
"If you were starting from scratch, you could never possibly justify [the current] business model," [Google Chief Economist] Hal Varian said ... "Grow trees -- then grind them up, and truck big rolls of paper down from Canada? Then run them through enormously expensive machinery, hand-deliver them overnight to thousands of doorsteps, and leave more on newsstands, where the surplus is out of date immediately and must be thrown away? Who would say that made sense?" The old-tech wastefulness of the process is obvious, but Varian added a less familiar point. Burdened as they are with these "legacy" print costs, newspapers typically spend about 15 percent of their revenue on what, to the Internet world, are their only valuable assets: the people who report, analyze, and edit the news.

"Nothing that I see suggests the 'death of newspapers,'" [Google CEO] Eric Schmidt told me. The problem was the high cost and plummeting popularity of their print versions. "Today you have a subscription to a print newspaper," he said. "In the future model, you'll have subscriptions to information sources that will have advertisements embedded in them, like a newspaper. You'll just leave out the print part. I am quite sure that this will happen ... As print circulation falls, the growth of the online audience is dramatic ... Newspapers don't have a demand problem; they have a business-model problem." Many of his company’s efforts are attempts to solve this, so that newspaper companies can survive, as printed circulation withers away.

The three pillars of the new online business model, as I heard them invariably described, are distribution, engagement, and monetization. That is: getting news to more people, and more people to news-oriented sites; making the presentation of news more interesting, varied, and involving; and converting these larger and more strongly committed audiences into revenue, through both subscription fees and ads.

The best monetizing schemes are of course ones that people like -- ads they enjoy seeing, products for which they willingly pay. Online display ads should be better on these counts too, [Google VP Neal] Mohan said. "here are things we can do online that we simply can't do in print," he said. An ad is "intrusive" mainly if it is not related to what you care about at that time ... "The online world will be a lot more attuned to who you are and what you care about" ... Advertising has been around forever, Mohan said, "but until now it has always been a one-way conversation."
The entire article is well worth reading. It gives a great feel for how Googlers are thinking about the future of news (and is mostly in line with my own thoughts).

Please see also my Oct 2009 post, "Google CEO on personalized news".

Friday, April 30, 2010

Facebook's moves and personalized advertising

It has been widely reported that Facebook has launched Open Graph and Implicit Personalization, which, among other things, give Facebook information about people's movements and what they like on the web. The service was launched opt-out and, even if you do want to opt-out, requires diving into confusing privacy settings to opt-out.

The prolific discussion of this elsewhere has thoroughly exhausted most of what there is to say, but I wanted to emphasize two things about this launch.

First, the fact that Facebook is so aggressively seeking this "treasure trove" of browsing behavior data may signal a major shift in its revenue model. Prior claims aside, the company now may be realizing that it is hard to target advertising to profile information and status updates because there is no commercial intent. This new source of data -- the websites people are visiting and what they like -- contains the purchase intent that Facebook so desperately needs.

Second, as Steve Lohr at the NYT reported today, other companies considering heavy use of personalized advertising have been waiting for someone else to take the first step and bear the brunt of any privacy-related backlash. It will be interesting to see if Facebook's latest move -- which probably is aggressive enough to count as the first step everyone was waiting for -- will result in a backlash or will open the floodgates.

Wednesday, April 28, 2010

Google launches web search similarities

In another aggressive move toward personalized search, Google has added similarities to some web search results to the bottom of the page, as Mike Melanson at ReadWriteWeb reports.

For example, if I search for [engadget], at the bottom of the page, I see:
Pages similar to:
Gizmodo ...
Ubergizmo ...
Wired ...
Lifehacker ...
The similarity algorithm also appears to have changed, with noticeably better quality in the spot checks I did.

This is a fairly big deal. Similarities based aggregate behavior and targeted to the immediate context is a big step toward personalization. The next step is to tie the data to individual history and target similarities and recommendations both to the context and each person's past history.

Google has done a version of personalized search for some time, but the technique used mostly was based on biasing search results toward people's long-term interests. More recently, they also started boosting previously clicked search results to help support re-finding.

Google's latest move should let Google add fine-grained personalization based on current missions and short-term trends, which, in combination with their current search personalization, is likely to improve Google's ability to help people find what they need.

Update: Here is the announcement of the new feature on the Official Google Blog.

Sunday, April 25, 2010

Google News hybrid recommendations

Three Googlers published a paper, "Personalized News Recommendation Based on Click Behavior" (ACM), at the recent IUI 2010 conference that describes a hybrid recommender system combining user-based and content-based recommendations. This new hybrid recommender now appears to be deployed on Google News.

Some excerpts from the paper:
[The] previous Google News recommendation system was developed using a collaborative filtering method. It recommends news stories that were read by users with similar click history. This method has two major drawbacks ... First, the system cannot recommend stories that have not yet been read by other users ... Second ... news stories [that] are generally very popular ... are constantly recommended to most of the users, even for those users who never [are interested because] ... there are always enough clicks ... to make the recommendation.

A solution to these two problems would be to build profiles of user's genuine interests and use them to make news recommendations. The profiles ... filter out the stories that are not of interest ... [and recommend stories] even if [they have] not been clicked on by other users ... Based on a user's news reading history, the recommender predicts the topic categories of interest ... News articles in those categories are ranked higher in the candidate list.

On average, the hybrid method ... improves the CTR [of] the existing collaborative method by 30.9% ... [and increased] the frequency of website visits in the test group [by] 14.1%.
Hybrid recommenders are not that new. In the past, as in this paper, they usually were motivated by trying to deal with the sparsity and cold start problems that challenge collaborative filtering recommenders. Hybrid systems also have been used to deal with the so-called Harry Potter problem -- recommendations that focus too much on popular items -- by constraining the collaborative recommendations to the interests expressed in the profile, though that often can be better dealt with by tuning a collaborative recommender to discourage correlations between unpopular and popular items.

One thing that is surprising in this paper is the use of high-level topics rather than fine-grained topics. I would think that you would be better off getting as specific as possible on the profile, then branching out to related topics. The paper briefly addresses this, arguing that "specializing the user profile may limit the recommendations to news that the user already knew", but that seems like it would only happen if you rather foolishly only used read topics rather than including topics that appear to be related to read topics.

By the way, when you have as much data as Google should have, it is not at all clear you want to fall back on a content approach like they did in this paper here. Yehuda Koren, for example, has convincingly argued that, when you have big data, latent factor models extract these content-based relationships automatically in much more detail and much more accurately than you could hope to do with a manually constructed model.

Finally, I cannot quite let this one go by without mentioning that Findory was a hybrid news recommender, launched in January 2004, that dealt with the cold start and sparsity problems of a collaborative recommender, the same problems the Google News team apparently is still struggling with six years later. Findory is not mentioned in this paper in the related work, but I know the Google team is quite aware of Findory.

Wednesday, March 24, 2010

Asking questions of your social network

Google recently acquired the startup Aardvark for $50 million. The idea behind Aardvark is to provide a way to ask complicated or subjective questions of your friends and colleagues.

There are two papers that will be published at upcoming conferences that provide useful details on this idea. The first is actually by two members of the Aardvark team -- co-founder and Aardvark CTO Damon Horowitz and ex-Googler and Aardvark advisor Sep Kamvar -- and will be published at the upcoming WWW 2010 conference. The paper is called, "The Anatomy of a Large-Scale Social Search Engine" (PDF). An excerpt:
With Aardvark, users ask a question, either by instant message, email, web input, text message, or voice. Aardvark then routes the question to the person in the user's extended social network most likely to be able to answer that question.

Aardvark queries tend to be long, highly contextualized, and subjective -- in short, they tend to be the types of queries that are not well-serviced by traditional search engines. We also find that the vast majority of questions get answered promptly and satisfactorily, and that users are surprisingly active, both in asking and answering.
The paper is well written and convincing, establishing that this idea works reasonably well for a small (50k) group of enthusiastic early adopters. The paper does not answer whether this will work at large scale on a less motivated, lower quality mainstream audience. It also does not provide data to be able to evaluate the common criticism of asking questions of social networks, which is that, at large scale, the burden from a flood of often irrelevant incoming questions creates too much pain for too little benefit.

To get a bit more illumination on that question, turn to another upcoming paper, this one out of Microsoft Research and to be published at CHI 2010. The paper is "What Do People Ask Their Social Networks, and Why? A Survey Study of Status Message Q&A Behavior" (PDF). Some excerpts:
50.6% ... used their status messages to ask a question .... [on] sites like Facebook and Twitter .... 24.3% received a response in 30 minutes or less, 42.8% in one hour or less, and 90.1% within one day .... 69.3% ... who received responses reported they found the responses helpful.

The most common reason to search socially, rather than through a search engine, was that participants had more trust in the responses provided by their friends [24.8%]. A belief that social networks were better than search engines for subjective questions, such as seeking opinions or recommendations, was also a common explanation [21.5%].

The most common motivation given for responding to a question was altruism [37.0%]. Expertise was the next biggest factor [31.9%], with respondents being motivated because they felt they had special knowledge of the topic ... Nature of the relationship with the asker was an important motivator [13.7%], with closer friends more likely to get answers ... [as well as] the desire to connect socially [13.5%] ... free time [12.3%] ... [and] earning social capital [10.5%].

Many indicated they would prefer a face-to-face or personal request, and ignored questions directed broadly to the network-at-large .... [But] participants enjoyed the fun and social aspects of posing questions to their networks.
The key insight in the CHI paper is that people view asking questions of their social network as fun, entertaining, part of building relationships, and as a form of a gift exchange. The Aardvark paper focuses on a topical relevance rank of your social network, but maintaining relevance is going to be difficult at large scale when you have an unmotivated, lower quality, mainstream audience. The CHI paper might offer a path forward, suggesting we instead focus on game playing, entertainment, and the social rewards people enjoy when answering questions from their network.

Tuesday, March 23, 2010

Security advice is wrong

An insightful, funny, and thought-provoking paper by Cormac Herley at Microsoft Research, "So Long, And No Thanks for the Externalities: The Rational Rejection of Security Advice by Users" (PDF), looks at why people ignore security advice.

The surprising conclusion is that some security advice we give to people -- such as inspect URLs carefully, pay attention to https certificate warnings, and use complicated passwords that change frequently -- does more harm than good. It actually costs someone far more to follow the advice than the benefit that person should expect to get.

Extended excerpts from the paper:
It is often suggested that users are hopelessly lazy and unmotivated on security ... [This] is entirely rational from an economic perspective ... Most security advice simply offers a poor cost-benefit tradeoff to users and is rejected.

[Security] advice offers to shield [people] from the direct costs of attacks, but burdens them with increased indirect costs ... Since victimization is rare, and imposes a one-time cost, while security advice applies to everyone and is an ongoing cost, the burden ends up being larger than that caused by the ill it addresses.

To make this concrete, consider an exploit that affects 1% of users annually, and they waste 10 hours clearing up when they become victims. Any security advice should place a daily burden of no more than 10/(365 * 100) hours or 0.98 seconds per user in order to reduce rather than increase the amount of user time consumed. This generated the profound irony that much security advice ... does more harm than good.

We estimate US annual phishing losses at $60 million ... Even for minimum wage any advice that consumes more than ... 2.6 minutes per year to follow is unprofitable [for users] from a cost benefit point of view ... Banks [also] have more to fear from ... indirect losses such as support costs ... than direct losses. For example ... an agent-assisted password reset by 10% of their users would cost $48 million, easily dwarfing Wells Fargo's share of the overall $60 million in phishing losses.

Users are effectively trained to ignore certificate warnings by seeing them repeatedly when there is no real security threat .... As far as we can determine, there is no evidence of a single user being saved from harm by a certificate error, anywhere, ever ... The idea that certificate errors are a useful tool in protecting [users] from harm ... is entirely abstract and not evidence-based. The effort we ask of [users] is real, while the harm we warn them of is theoretical.

Advice almost always ignores the cost of user effort ... The main reason security advice is ignored is that it makes an enormous miscalculation: it treats as free a resource that is actually worth $2.6 billion an hour ... Advice-givers and policy-mandaters demand far more effort than any user can give .... User education is a cost borne by the whole population, while offering benefit only to the fraction that fall victim ... The cost of any security advice should be in proportion to the victimization rate .... [lest] in trying to defend against everything we end up defending nothing.

It is not users who need to be better educated on the risks of various attacks, but the security community. Security advice simply offers a bad cost-benefit tradeoff to users .... We must respect users' time and effort. Viewing the user's time as worth $2.6 billion an hour is a better starting point than valuing it at zero ... When we exaggerate all dangers we simply train users to ignore us.
The paper also has a great discussion of password policies and how they appear to be counter-productive. When system administrators require passwords with weird special characters than need to be changed regularly, they make passwords difficult to remember and impose a substantial burden on users, but the benefit from this policy appears to be minimal.

[Paper found via Bruce Schneier]

Sunday, March 21, 2010

World of Goo on Google

This is too funny not to post. The clever and tremendously fun puzzle game World of Goo, in one of its last levels, has a hilarious jab at the search engines, personalized search, and personalized advertising.

An excerpt from the dialogue with "MOM ... [the] Automated Search Engine Companion":
What would you like to ask MOM today?

Do I smell cookies? ... Yes! I am baking your personal information into each one.

What personal information is in your cookies? ... Cookies may contain:
your location
complete search history
whipped soy product
online purchases
medical records
telephone logs
web logs
email logs
streaming video from your current location
and more!
That's MOM's special recipe for offering free, convenient, relevant information to valued users like you.

Is my personal information safe with you? ... Your personal information is stored in a secure database and will never be shared with anyone*
* Unless they ask.
** Or if someone says they are you and takes your cookies
*** Or if the venture firm finds out it's profitable.
**** Or if unhappy employees release copies of your cookies to other online databases.
***** Or if outsourced data centers sell illegal copies of your cookies to other sites.
****** Or if my parent corporation is acquired and my data including your cookies becomes part of a larger aggregate system without your knowledge or consent.

Delete my cookies ... Are you sure? ... Yes ... Your cookies have been deleted*.
* Cookies may not actually be deleted.
** Cookies may be stored indefinitely for evaluation and training purposes to better serve you.
*** MOM knows best.

Everyone loves receiving special offers from MOM and the MOM's affiliate network of adver-bots.
Video is available as well as a full transcript (search for [Conversation with MOM transcript]).

Tuesday, March 16, 2010

Designing search for re-finding

A WSDM 2010 paper out of Microsoft Research, "Large Scale Query Log Analysis of Re-finding" (PDF), is notable not so much for the statistics on how much people search again for what they have searched for before, but for its fascinating list of suggestions (in Section 6, "Design Implications") of what search engines should do to support re-finding.

An extended excerpt from that section:
The most obvious way that a search tool can improve the user experience given the prevalence of re-finding is for the tool to explicitly remember and expose that user's search history.

Certain aspects of a person's history may be more useful to expose ... For example, results that are re-found often may be worth highlighting ... The result that is clicked first ... is more likely to be useful later, and thus should be emphasized, while results that are clicked in the middle may be worth forgetting ... to reduce clutter. Results found at the end of a query session are more likely to be re-found.

The query used to re-find a URL is often better than the query used initially to find it ... [because of] how the person has come to understand this result. [Emphasize] re-finding queries ... in the history ... The previous query may even be worth forgetting to reduce clutter.

When exposing previously found results, it is sometimes useful to label or name those results, particularly when those results are exposed as a set. Re-finding queries may make useful labels. A Web browser could even take these bookmark queries and make them into real bookmarks.

A previously found result ... may be what the person is looking for ... even when the result [normally] is not going to be returned ... For example, [if] the user's current query is a substring of a previous query, the search engine may want to suggest the results from the history that were clicked from the longer query. In contrast, queries that overlap with but are longer than previous queries may be intended to find new results.

[An] identical search [is] highly predictive of a repeat click ... [We] can treat the result specially and, for example, [take] additional screen real estate to try to meet the user's information need with that result ... [with] deep functionality like common paths and uses in [an expanded] snippet. For results that are re-found across sessions, it may make sense instead to provide the user with deep links to [some] new avenues within the result to explore.

At the beginning of a session, when people are more likely to be picking up a previous task, a search engine should provide access into history. In the middle of the session ... focus on providing access to new information or new ways to explore previously viewed results. At the end of a session ... suggest storing any valuable information that has been found for future use.
Great advice from Jaime Teevan at Microsoft Research. For more on this, please see my earlier post, "People often repeat web searches", which summarizes a 2007 paper by Teevan and others on the prevalence of re-finding.

Sunday, March 14, 2010

GFS and its evolution

A fascinating article, "GFS: Evolution on Fast-Forward", in the latest CACM magazine interviews Googler Sean Quinlan and exposes the problems Google has had with the legendary Google File System as the company has grown.

Some key excerpts:
The decision to build the original GFS around [a] single master really helped get something out the door ... more rapidly .... [But] problems started to occur ... going from a few hundred terabytes up to petabytes, and then up to tens of petabytes ... [because of] the amount of metadata the master had to maintain.

[Also] when you have thousands of clients all talking to the same master at the same time ... the average client isn't able to command all that many operations per second. There are applications such as MapReduce, where you might suddenly have a thousand tasks, each wanting to open a number of files. Obviously, it would take a long time to handle all those requests and the master would be under a fair amount of duress.

64MB [was] the standard chunk size ... As the application mix changed over time, however, ways had to be found to let the system deal efficiently with large numbers of files [of] far less than 64MB (think in terms of Gmail, for example). The problem was not so much with the number of files itself, but rather with the memory demands all those [small] files made on the centralized master .... There are only a finite number of files you can accommodate before the master runs out of memory.

Many times, the most natural design for some application just wouldn't fit into GFS -- even though at first glance you would think the file count would be perfectly acceptable, it would turn out to be a problem .... BigTable ... [is] one potential remedy ... [but] I'd say that the people who have been using BigTable purely to deal with the file-count problem probably haven't been terribly happy.

The GFS design model from the get-go was all about achieving throughput, not about the latency at which it might be achieved .... Generally speaking, a hiccup on the order of a minute over the course of an hour-long batch job doesn't really show up. If you are working on Gmail, however, and you're trying to write a mutation that represents some user action, then getting stuck for a minute is really going to mess you up. We had similar issues with our master failover. Initially, GFS had no provision for automatic master failure. It was a manual process ... Our initial [automated] master-failover implementation required on the order of minutes .... Trying to build an interactive database on top of a file system designed from the start to support more batch-oriented operations has certainly proved to be a pain point.

They basically try to hide that latency since they know the system underneath isn't really all that great. The guys who built Gmail went to a multihomed model, so if one instance of your GMail account got stuck, you would basically just get moved to another data center ... That capability was needed ... [both] to ensure availability ... [and] to hide the GFS [latency] problems.

The model in GFS is that the client just continues to push the write until it succeeds. If the client ends up crashing in the middle of an operation, things are left in a bit of an indeterminate state ... RecordAppend does not offer any replay protection either. You could end up getting the data multiple times in a file. There were even situations where you could get the data in a different order ... [and then] discover the records in different orders depending on which chunks you happened to be reading ... At the time, it must have seemed like a good idea, but in retrospect I think the consensus is that it proved to be more painful than it was worth. It just doesn't meet the expectations people have of a file system, so they end up getting surprised. Then they had to figure out work-arounds.
Interesting to see exposed the warts of Google File System and Bigtable. I remember when reading the Bigtable paper being surprised that it was layered on top of GFS. Those early decisions to use a file system designed for logs and batch processing of logs as the foundation for Google's interactive databases appear to have caused a lot of pain and workarounds over the years.

On a related topic, a recent paper out of Google, "Using a Market Economy to Provision Compute Resources Across Planet-wide Clusters" (PDF), looks at another problem Google is having, prioritizing all the MapReduce batch jobs at Google and trying to maximize utilization of their cluster. The paper only describes a test of one promising solution, auctioning off the cluster time to incent developers to move their jobs to non-peak times and idle compute resources, but still an interesting read.

Update: A year later, rumor has it that a new version of GFS (called Colossus, aka GFS2) resolves the problems with Bigtable I described here. Quoting: "[Bigtable] is [now] underpinned by Colossus, the distributed storage platform also known as GFS2. The original Google File System ... didn't scale as well as the company would like. Colossus is specifically designed for BigTable, and for this reason it's not as suited to 'general use' as GFS was."

Thursday, March 11, 2010

The Onion on Google's data

The Onion has a hilarious article, "Google Responds To Privacy Concerns With Unsettlingly Specific Apology", that should be enjoyable for this crowd. An excerpt as a teaster:
Acknowledging that Google hasn't always been open about how it mines the roughly 800 terabytes of personal data it has gathered since 1998, [CEO Eric] Schmidt apologized to users -- particularly the 1,237,948 who take daily medication to combat anxiety --for causing any unnecessary distress, and he expressed regret -- especially to Patricia Fort, a single mother taking care of Jordan, Sam, and Rebecca, ages 3, 7, and 9 -- for not doing more to ensure that private information remains private.

Monday's apology comes after the controversial launch of Google Buzz, a social networking platform that publicly linked Gmail users to their most e-mailed contacts by default.

"I'd like nothing more than to apologize in person to everyone we've let down, but as you can see, many of our users are rarely home at this hour," said Google cofounder and president Sergey Brin, pointing to several Google Map street-view shots of empty bedroom and living room windows on a projection screen behind him. "And, if last night's searches are any indication, Boston's Robert Hornick is probably out shopping right now for the spaghetti and clam sauce he'll be cooking tonight ... Either that, or hunting down that blond coworker of his, Samantha, whose Picasa photos he stares at every night."
[Article found via Bruce Schneier]

Saturday, February 27, 2010

Personalization and differential pricing

Google's Chief Economist Hal Varian has a new paper out, "Computer Mediated Transactions" (PDF). An excerpt of his predictions on personalization:
Instead of a "one size fits all" model, the web offers a "market of one" ... [powered by] suggestions of things to buy based on your previous purchases, or on purchases of customers like you.

Not only content, but prices may also be personalized, leading to various forms of differential pricing ... [But] the ability of firms to extract surplus [may be] quite limited when consumers are sophisticated ... [And] perfect price description and free entry ... pushes profits to zero, conferring all benefits to the customers.

The same sort of personalization can occur in advertising ... Google and Yahoo ... [already] allow users to specify their areas of interest and then see ads related to those interests. It is also relatively common for advertisers ... to show ads based on previous responses of users to related ads.
Back in 2000, Amazon got slammed (e.g. [1]) for an experiment with differential pricing, but Hal appears to be predicting differential pricing will rise again.

The paper also talks briefly about how experimentation changes how companies make decisions ("when experiments are cheap, they are likely provide more reliable answers than opinions"), data mining, online advertising, legal contracts that use computer monitoring to enforce their terms, and cloud computing. The paper is from the 2010 Ely Lecture at the American Economics Association and video of the talk is available.

Tuesday, February 23, 2010

How we all teach Google to Google

Steven Levy at Wired just posted an article, "How Google's Algorithm Rules the Web", with some fun details on how Google uses constant experimentation, logs of searches and clicks, and many small tweaks to keep improving their search results.

Well worth reading. Some excerpts as a teaser:
[Google Fellow Amit] Singhal notes that the engineers in Building 43 are exploiting ... the hundreds of millions who search on Google. The data people generate when they search -- what results they click on, what words they replace in the query when they're unsatisfied, how their queries match with their physical locations -- turns out to be an invaluable resource in discovering new signals and improving the relevance of results.

"On most Google queries, you're actually in multiple control or experimental groups simultaneously," says search quality engineer Patrick Riley. Then he corrects himself. "Essentially," he says, "all the queries are involved in some test." In other words, just about every time you search on Google, you're a lab rat.

This flexibility -- the ability to add signals, tweak the underlying code, and instantly test the results -- is why Googlers say they can withstand any competition from Bing or Twitter or Facebook. Indeed, in the last six months, Google has [found and] made more than 200 improvements.
Even so, this raises the question of where the point of diminishing returns is with more data and more users. While startups lack Google's heft, Yahoo and Bing are big enough that -- if they continuously experiment, tweak, and learn from their data as much as Google does -- search quality differences likely would be in an imperceptibly small chunk of long tail queries.

Google Reader recommends articles

In a post on the official Google Reader blog, "May we recommend...", Laurence Gonsalves describes a new recommendation feature for Google Reader that recommends articles based on what you have read in the past. An excerpt:
Many of you wanted to see even more personalized recommendations ... [Now], we've started inserting items selected just for you inside the Recommended items section. This is great if you've got interests that are less mainstream. If you love Lego robots, for example, then you should start to notice more of them in your Recommended items.
Sadly, no additional details appear to be available. In my usage, there were rare gems in the recommendations, but a lot of randomness, and a strong bias toward very popular items. The lack of explanation -- why was this item recommended? -- and lack of a way to correct the recommendations likely will make people less forgiving of these problems. I also saw recommendations for items I had already read; items you have already seen always should be filtered from recommendations.

For more on that, you might enjoy some of my previous posts on this topic, such as the Mar 2009 "What is a good recommendation algorithm?" and the much older Dec 2006 "The RSS beast".

Update: A couple weeks later, Google launches Google Reader Play, a StumbleUpon knock-off that recommends web content based on what you say you like. Googler Garrett Wu writes that "it uses the same technology as the Recommended Items feed in Reader." In my usage, it had the same problems too.

Tuesday, February 02, 2010

New details on LinkedIn architecture

Googler Daniel Tunkelang recently wrote a post, "LinkedIn Search: A Look Beneath the Hood", that has slides from a talk by LinkedIn engineers along with some commentary on LinkedIn's search architecture.

What makes LinkedIn search so interesting is that the search does real-time updates (the "time between when user updates a profile and being able to find him/herself by that update need to be near-instantaneous"), faceted search (">100 OR clauses", "NOT support", complex boolean logic, some facets are hierarchical, some are dynamic over time), and personalized relevance ranking of search results (ordered by distance in your LinkedIn social graph).

LinkedIn appears to use a combination of aggressive partitioning, keeping data in-memory, and a lot of custom code (mostly modifications to Lucene, some of which have been released open source) to handle these challenges. One interesting tidbit is, going against current conventional wisdom, LinkedIn appears to only use caching minimally, preferring to spend their efforts and machine resources on making sure they can recompute computations quickly than on hiding poor performance behind caching layers.

Friday, January 29, 2010

Gmail launches personalized ads

Google's popular mail service, GMail, has launched advertising targeted not just to the particular e-mail message you are reading, but to other e-mails you might have read recently. An excerpt:
Sometimes, there aren't any good ads to match to a particular message. From now on, you'll sometimes see ads matched to another recent email instead.

For example, let's say you're looking at a message from a friend wishing you a happy birthday. If there aren't any good ads for birthdays, you might see the Chicago flight ads related to your last email instead.
It is a significant move toward personalized advertising and, as the Google post notes, is a big change for Google, as they previously "had specified that ads alongside an email were related only to the text of the current message." For example, here, Google says, "Ads and links to related pages only appear alongside the message that they are targeted to, and are only shown when the Google Mail user ... is viewing that particular message."

For more on personalized ads that target not only the current content, but also to previously viewed content that has strong purchase intent, please see my July 2007 post, "What to advertise when there is no commercial intent?"

Wednesday, January 27, 2010

Yahoo on personalizing content and ads

Yahoo CEO Carol Bartz had a few tidbits on personalized relevance for content and advertising in the recent Yahoo Q4 2009 earnings call. Some excerpts:
We generate value ... [through] the vast amount of data we gather and use to deliver a better, more personal experience for users and a better, more targeted audience for our advertisers.

Since we began paring our content optimization technology with editorial expertise we have seen click through rates in the Today module more than double ... We are making additional improvements to the technology that will make the user experience even more personally relevant.

Truth be told, no one has uncovered the holy grail of making advertising as relevant as content is 100% of the time. Beyond just offering advertisers a specific bucket, say women aged 35-45 and have children, we instead need to deliver many more specific attributes of scale. For example, women aged 35-45 with kids under three who are shopping for a minivan, and on and on and on and on. If we can do this we can create a better experience for both the user and the advertiser.

We have been letting great data about the consumers, data that is very attractive to advertisers fall to the floor ... We simply aren't even close to maximizing the value of our massive audience for advertisers.
Sounds like the goal is right, but the pace is slow. For more on that, please see also my June 2009 post, "Yahoo CEO Carol Bartz on personalization".

Sunday, January 24, 2010

Hybrid, not artificial, intelligence

Google VP Alfred Spector gave a talk last week at University of Washington Computer Science on "Research at Google". Archived video is available.

What was unusual about Al's talk was his focus on cooperation between computers and humans to allow both to solve harder problems than they might be able to otherwise.

Starting at 8:30 in the talk, Al describes this as a "virtuous cycle" of improvement using people's interactions with an application, allowing optimizations and features like like learning to rank, personalization, and recommendations that might not be possible otherwise.

Later, around 33:20, he elaborates, saying we need "hybrid, not artificial, intelligence." Al explains, "It sure seems a lot easier ... when computers aren't trying to replace people but to help us in what we do. Seems like an easier problem .... [to] extend the capabilities of people."

Al goes on to say the most progress on very challenging problems (e.g. image recognition, voice-to-text, personalized education) will come from combining several independent, massive data sets with a feedback loop from people interacting with the system. It is an "increasingly fluid partnership between people and computation" that will help both solve problems neither could solve on their own.

This being a Google Research talk, there was much else covered, including the usual list of research papers out of Google, solicitation of students and faculty, pumping of Google as the best place to access big data and do research on big data, and a list of research challenges. The most interesting of the research challenges were robust, high performance, transparent data migration in response to load in massive clusters, ultra-low power computing (e.g. powered only by ambient light), personalized education where computers learn and model the needs of their students, and getting outside researchers access to the big data they need to help build hybrid, not artificial, intelligence.

Wednesday, January 20, 2010

Predictions for 2010

It's that time of year again. Many are making their predictions for the tech industry for 2010.

It's been a while since I played this game -- last time was my dark prediction for a dot-com crash in 2008 ([1] [2]) -- but I thought I'd try again this year.

I wrote up my predictions in a post over at blog@CACM, "What Will 2010 Bring?"

Because it is for the CACM, the predictions focus more on computing in general than on startups, recommendations, or search. And, they are phrased as questions than as predictions.

I think the answer to some of the questions I posed may be no. For example, I doubt tablets will succeed this time around, don't think enterprises will move to the public cloud as much as expected, and am not sure that personalized advertising will always be used to benefit consumers. I do think netbooks are a dead market, mobile devices will become standardized and more like computers, and that 2010 will see big advances in local search and augmented reality on mobile devices.

If you have any thoughts on these predictions or some of your own to add, please comment either here or at blog@CACM.

Update: Another prediction, not in that list, that might be worth including here, "Who Needs Massively Multi-Core?"

Monday, January 04, 2010

Lectures on Computational Advertising

Slides from all the lectures of Andrei Broder's recent Computational Advertising class at Stanford University now are available online. Andrei is a VP and Chief Scientist at Yahoo and leads their Advertising Technology Group.

Lecture 6 (PDF) is particularly interesting with its coverage of learning to rank. Lecture 8 (PDF) has a tidbit on behavioral advertising and using recommender systems for advertising, but it is very brief. The first few lectures are introductory; don't miss lecture 3 (PDF) if you are new to sponsored search and want a good dive into the issues and techniques.