Wednesday, December 28, 2005

Attensa funded with $12M

Attensa, a company that is "developing an end-to-end RSS Network that automatically and intelligently delivers prioritized, relevant RSS information," recently received a second round of funding, bringing its total backing to $12M. Wowsers.

Here's what Attensa says it will be doing with all that cash:
Getting to Less is More with Article Level Intelligence

By intelligently analyzing information about RSS articles and how readers are interacting with the articles, the Attensa RSS network can deliver more relevant, timely information ...

Using Attensa network attention streams that accommodate the Attention.xml standard, metadata is ... triangulated through collaborative filtering to deliver the most relevant information.

By sharing, aggregating and triangulating the attention streams (anonymously and in near real-time) generated by the millions of people using RSS feeds ... [Attensa will] create privacy protected anonymous user profiles, based on permission, that can recommend content, refine blog and Website searching, and enhance the experience of tracking the news that matters ...
Perhaps the personalization bubble has already started.

But, from what I can tell, this company seems to have cast its net far and wide, saying they'll be doing RSS for enterprise (like Newsgator), an Outlook-based feed reader (like Newsgator and soon Microsoft), metrics for publishers (like FeedBurner), popular posts (like Digg), article clustering (like Memeorandum), tagging (like del.icio.us), and recommended articles (like Findory). Who knows, with $12M in funding, perhaps they'll succeed in tackling this laundry list.

If you're interested, Attensa's VP of R&D, Eric Hayes, has a blog. A couple months ago, Eric and I had a short thread about the difficulty of building scalable recommendation systems.
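
For context on what a basic recommender like the ones discussed in that thread involves, here is a minimal sketch of item-to-item collaborative filtering over made-up attention data. All the names and numbers are illustrative; this is not Attensa's or Findory's actual code.

    # Minimal sketch of item-to-item collaborative filtering over toy
    # attention data. Illustrative only, not any company's actual code.
    from collections import defaultdict
    from itertools import combinations
    from math import sqrt

    # Toy attention data: which readers clicked which articles.
    reads = {
        "alice": {"a1", "a2", "a3"},
        "bob":   {"a2", "a3", "a4"},
        "carol": {"a1", "a3", "a4"},
    }

    def item_similarities(reads):
        """Cosine similarity between articles based on co-reads."""
        readers_of = defaultdict(set)
        for reader, articles in reads.items():
            for article in articles:
                readers_of[article].add(reader)
        sims = defaultdict(dict)
        for a, b in combinations(readers_of, 2):
            overlap = len(readers_of[a] & readers_of[b])
            if overlap:
                score = overlap / sqrt(len(readers_of[a]) * len(readers_of[b]))
                sims[a][b] = sims[b][a] = score
        return sims

    def recommend(reader, reads, sims, k=3):
        """Score unread articles by similarity to what the reader has read."""
        scores = defaultdict(float)
        for read_article in reads[reader]:
            for other, score in sims.get(read_article, {}).items():
                if other not in reads[reader]:
                    scores[other] += score
        return sorted(scores, key=scores.get, reverse=True)[:k]

    sims = item_similarities(reads)
    print(recommend("alice", reads, sims))  # ['a4']

The hard part, as that thread discussed, is not this toy math but doing it over millions of readers and articles in real time.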

Monday, December 26, 2005

A political lens on information

Mark Cuban posts a truly frightening prediction, a world where people use different tools to find information because they don't want to see any information that conflicts with their preconceived opinions:
I have zero doubt that in the future there will be sliders or some equivalent that represent "the [political] flavor" of search that users will look for.

Looking for information about the war in Iraq... push the slide rule to the right till you reach Bill O'Reilly flavored search, or slide it to the left for the Al Franken flavor. The results are then influenced by the brand you prefer to associate with.

The news is no longer just the news ... A search result will no longer just be a search result.

The Web 3.0 - You stay on your side of the web and I will stay on mine.
Can this possibly be true? Are people so afraid of being wrong that they will ignore conflicting information?

Unfortunately, I've seen some of this myself at Findory. Especially around the 2004 elections, Findory received a few pretty remarkable hate mails. This one, from someone clearly deep in the bowels of the right wing, is among the most extreme:
To: corporate@findory.com
Subject: lefty

YOUR TOO LEFT WING!!!!!!!!!!!!!!!!!!!!!!!!!!

DROP DEAD AND THEN BECOME AN AMERICAN AGAIN. OR JUST MOVE AND LIVE IN THE GLORIOUS WORLD OF MAKE BELIEVE IN EUROPE.

GOOD-BYE ENJOY YOUR VOYAGE.
We also get accusations of bias coming from the left:
To: suggestions@findory.com
Subject: Foo...

Articles history gets arbitrarily flushed from time to time.
No clustering, articles related to the very same topic are repeated and clutter the page space "real estate".
Categorisation is sometimes hapazard, articles showing under the wrong heading and "personalized" topics gathering disconnected subjects.

Yet, you manage to introduce a right-wing bias!

I suggest you use this effort and cleverness to improve the basic product instead...
Since Findory crawls thousands of sources around the world -- some considered to be conservative, some considered to be liberal, most considered to be moderate -- I've been a bit surprised by these comments.

There is a temptation to dismiss these as ravings from the lunatic fringe, but I've been curious about where these people see bias. Even with the most hateful of these e-mails, a calm response usually works well, and I've often been able to discover why a few customers feel so strongly that there is bias one way or the other.

The answer is disturbing. Findory is specifically designed to ignore political biases when recommending articles. If you read a right- or left-leaning opinion article on the Iraq War, you will be recommended other articles on the war and issues surrounding the war, some right-leaning, some left-leaning.

The idea is to avoid pigeonholing, to show people views from across the spectrum, to give people the information they need to make an informed judgment.
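
As a toy illustration of that design choice (this is not Findory's actual code, just a sketch of the idea), imagine each article labeled with topics and a political lean; the matching deliberately uses only the topics:

    # Toy illustration (not Findory's actual code): recommend by topic
    # overlap while deliberately ignoring each article's political lean.
    articles = [
        {"id": 1, "topics": {"iraq", "war"}, "lean": "left"},
        {"id": 2, "topics": {"iraq", "reconstruction"}, "lean": "right"},
        {"id": 3, "topics": {"baseball"}, "lean": "center"},
    ]

    def related(article, candidates):
        """Rank candidates by shared topics; 'lean' never enters the score."""
        def overlap(other):
            return len(article["topics"] & other["topics"])
        return sorted(
            (c for c in candidates if c["id"] != article["id"] and overlap(c)),
            key=overlap,
            reverse=True,
        )

    # Reading a left-leaning Iraq article surfaces the right-leaning one too.
    print([a["id"] for a in related(articles[0], articles)])  # [2]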

For some, that is exactly the problem. They don't want to see both sides. They want a filter, a political lens. As they see it, reading an opinion article on the left should only give them other opinion articles on the left (or vice versa), reinforcing the opinion they already have.

They don't want discovery. They don't want new information. They don't want to learn. They want to be pigeonholed.

And this is why I find Mark Cuban's post so frightening. If he is correct, what I've seen as a radical fringe, a few people way outside the mainstream, is actually the majority view. Mark sees a world where information is not true or false, but left or right:
This process of continuous alteration was applied not only to newspapers, but to books, periodicals, pamphlets, posters, leaflets, films, sound-tracks, cartoons, photographs -- to every kind of literature or documentation which might conceivably hold any political or ideological significance. Day by day and almost minute by minute the past was brought up to date.

In this way every prediction made by the Party could be shown by documentary evidence to have been correct; nor was any item of news, or any expression of opinion, which conflicted with the needs of the moment, ever allowed to remain on record. All history was a palimpsest, scraped clean and reinscribed exactly as often as was necessary.


- 1984 by George Orwell
That world must not be allowed to come to pass. Information must be free.

Saturday, December 24, 2005

Recommended research papers

I keep getting requests to recommend papers in personalization, recommendations, and information retrieval, mostly from students.

I've been responding to these individually, but that seems inefficient, so I decided to go ahead and post a list of a few of my favorites.

The focus in this list is on breadth, mostly surveys that provide a good introduction, mostly work that used very large data sets. Follow citations on Citeseer if you want to explore in more depth. Enjoy! I hope it makes for interesting reading over the holidays.

Friday, December 23, 2005

My 2006 predictions

I've seen several good posts ([1] [2] [3] [4]) already with predictions for 2006. I thought I'd throw my thoughts out there too.

My predictions for 2006:

After putting Google on a pedestal, the press will start knocking it down. A firestorm of bad press will undermine the pillars of hype that support Google's lofty stock price, but the negativity will not be justified by any noticeable weakness in Google's business.

Yahoo will double down on their bets in community and social networking, including buying at least two more startups working in the area. Results of their efforts will be mixed, popular among early adopters, but largely a dud for the mainstream.

Microsoft will launch an AdSense-like advertising product in the hopes of undermining Google's business, but the product will fail to attract a large network in 2006 due to relatively weak ad targeting and low clickthrough rates.

MSN Search will increase market share, but only modestly in 2006. Other search engines will not move noticeably. Searchers will continue to view Google as having the best search results, whether or not that perception is accurate.

Microsoft will abandon Windows Live.

Tagging documents (My Web 2.0, del.icio.us, tag search of documents) will fail to attract mainstream interest. Tagging will continue to be popular for photos, videos, and other items with poor metadata.

Flickr, Technorati, del.icio.us, and other popular tagging sites will find themselves under assault by spammers. Like with splogs, efforts to battle the influx of crap will be only partially successful.

Wikipedia will be sabotaged by a spam robot coming over a botnet. The spam robot will make millions of subtle, small changes to the articles, many of which will go undetected for long periods of time. Unable to keep up, Wikipedia will be forced to shut off anonymous edits and place other controls on changes.

Yahoo and MSN will finally launch blog search. Google Blog Search will grab majority market share anyway. Technorati, Feedster, and other pure-play blog search startups will struggle.

The massive power of Google's cluster will be demonstrated in a much more ambitious version of Google Q&A (currently a modest experiment with automated knowledge extraction of answers from the Web). It will be well received. The launch will send the other search giants, who have been favoring simpler canned shortcuts instead, into a panic.

Interest in attention and personalization of information will grow as searchers become increasingly desperate for an easy way to surface the good stuff from all the crap out there. We'll see many new startups offering personalization products, most of which will be peddling junk. The hype will attract VCs. They will follow each other on in, bleating joyfully as they shower investment capital indiscriminately on good and bad alike.

Google will add an experiment with personalized news to Google News and expand on their personalized search. MSN and Yahoo will experiment with personalization and recommendations in news, search, and shopping. All three will experiment with highly targeted advertising using your search and browsing clickstream.

The hype about mashups and APIs will fade as more and more developers are frustrated by crippled APIs, lack of service quality guarantees, and lack of bargaining power in negotiations for commercial use of the APIs.

As their own business slows, eBay will make other large acquisitions in an effort to buy growth.

Update: Some good discussion in the comments on this post, especially on the Windows Live prediction.

Wednesday, December 21, 2005

Making the impossible possible

Google Earth CTO Michael Jones spoke at UCSD recently. A massive 150MB video of the talk is available.

The talk is mostly a demo of Google Earth, focused on showing how all kinds of user-contributed geographically tagged data can be integrated into Google Earth.

But one part of the talk I found particularly insightful was when Michael mentioned Nobel prize winner Tjalling Koopmans and commented on Tjalling's view that new tools enable new problems to be solved:
Your perception of a thing that is a viable problem to think about is shaped by the tool you can use.

If I wanted to build a swimming pool and I had a spoon, I wouldn't think about doing it. If I had a backhoe...

If we look at tools, we discover they have a life of their own. People are shaped by their tools.

Sometimes the solution to important problems ... [is] just waiting for the tool. Once this tool comes, everyone just flips in their head.
Michael was applying this to Google Earth -- that Google Earth is a tool that enables things that were not easily possible to do before -- but I think this is an insightful point about a lot of Google's work.

The goal is to build tools that enable people to find and analyze information orders of magnitude faster than before. This opens the door to attacking problems that before were prohibitively difficult to solve.

This is true of the Google search itself, the first tool many people turn to when they have a question about anything. This is true of the Google, Yahoo, MSN, and Amazon APIs, which allow people to rapidly prototype clever mashups demonstrating new ways to solve problems. This is true of Google's internal tools Sawzall and MapReduce, tools that are "major force multipliers" by allowing parallel data processing at an unprecedented scale on the Google cluster.
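
As a rough illustration of the programming model behind those internal tools (a toy sketch of the map/reduce idea, not Google's implementation), here is word counting split into a map step over records and a reduce step over grouped keys, which is exactly the shape that lets a framework run both phases in parallel across a cluster:

    # A toy sketch of the MapReduce programming model (not Google's
    # implementation): a map step over records and a reduce step over
    # grouped intermediate keys, each of which can run in parallel.
    from collections import defaultdict

    def map_phase(document):
        # Emit (word, 1) for every word in the document.
        for word in document.split():
            yield word.lower(), 1

    def reduce_phase(word, counts):
        # Sum all the counts emitted for one word.
        return word, sum(counts)

    def mapreduce(documents):
        grouped = defaultdict(list)
        for doc in documents:  # each map task could run on its own machine
            for key, value in map_phase(doc):
                grouped[key].append(value)
        return dict(reduce_phase(k, v) for k, v in grouped.items())

    print(mapreduce(["the quick fox", "the lazy dog"]))
    # {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}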

Problems that were difficult or impossible to solve before are becoming practical as new tools are created for processing information. It is an exciting time. Vast opportunities lie before us.

[video via Paul Kedrosky]

Monday, December 19, 2005

The folly of ignoring scaling

David Heinemeier Hansson at 37signals (and creator of Ruby on Rails) wrote what I thought was a pretty extreme post two weeks ago, "Don't scale", arguing that startups should ignore scaling and performance.

Ironically, in the following two weeks, many popular Web 2.0 startups have had problems, including a multi-day outage at del.icio.us, an 18+ hour outage at SixApart's blogging service Typepad, performance that has "sucked eggs" at Bloglines, and, as GrabPerf reports, slowness and outages at Technorati, Feedster, BlogPulse, BlogDigger, and Digg.

Stepping back for a second, a toned down version of David's argument is clearly correct. A company should focus on users first and infrastructure second. The architecture, the software, the hardware cluster, these are just tools. They serve a purpose, to help users, and have little value on their own.

But this extreme argument that scaling and performance don't matter is clearly wrong. People don't like to wait and they don't like outages. Getting what people need quickly and reliably is an important part of the user experience. Scaling does matter.

See also Om Malik's post, "The Web 2.0 hit by outages".

Update: Several months later, one of the first blog search engines, Daypop, goes offline because of scaling issues. The site says, "Daypop no longer has enough memory to calculate the Top 40 and other Top pages ... Daypop won't be back up until a new search/analysis engine is in place." Daypop has been down for a few months since this message was posted.

Update: Sixteen months later, in an interview, Twitter Developer Alex Payne says:
Twitter is the biggest Rails site on the net right now. Running on Rails has forced us to deal with scaling issues - issues that any growing site eventually contends with - far sooner than I think we would on another framework.

The common wisdom in the Rails community at this time is that scaling Rails is a matter of cost: just throw more CPUs at it. The problem is that more instances of Rails (running as part of a Mongrel cluster, in our case) means more requests to your database. At this point in time there's no facility in Rails to talk to more than one database at a time.

The solutions to this are caching the hell out of everything and setting up multiple read-only slave databases, neither of which are quick fixes to implement. So it's not just cost, it's time, and time is that much more precious when people can['t] reach your site.

None of these scaling approaches are as fun and easy as developing for Rails. All the convenience methods and syntactical sugar that makes Rails such a pleasure for coders ends up being absolutely punishing, performance-wise. Once you hit a certain threshold of traffic, either you need to strip out all the costly neat stuff that Rails does for you (RJS, ActiveRecord, ActiveSupport, etc.) or move the slow parts of your application out of Rails, or both.

Thursday, December 15, 2005

People are lazy

I love Paul Kedrosky's recent post about the three reasons trying to "change the world on the back of altered user behavior" will fail:
1. People are lazy
2. People are lazy
3. People are lazy
Paul goes on to say that "intelligence belongs in the network and in the algorithms" and "relying on users to do the heavy lifting -- however intellectually appealing -- is not going to work in the real world of lazy users who see little in it for them."

People are lazy, appropriately so. If you ask them to do work, most of them won't do it. From their point of view, you're only of value to them if you save them time.

If any work is going to be done, it's going to have to be done by a computer, not a person. People expect you to just make the right thing happen.

This is why Findory works the way that it does. No login, no configuration. Just read articles. The site learns from the articles you read and recommends other articles. The computer does all the work. It is simple, easy, and helpful.

See also my previous post, "Personalized search at PC Forum", where I describe the debate between A9 CEO Udi Manber, who claims searchers need to learn how to use more powerful tools, and Google's Marissa Mayer, who says people just want to quickly and easily get the information they need.

Wednesday, December 14, 2005

The money in the long tail

David Hornik at VentureBlog posts his conclusions about where to find value in the long tail.

Some excerpts:
There are essentially two general classes of technology that will benefit economically from the Long Tail -- aggregators and filterers.

The aggregators are those web businesses that seek to collect up as much of the Long Tail content as is possible, so as to make their "stores" a one stop shop for content no matter how popular or obscure.

The filterers are those businesses that make it easier to find the content in which we are interested ... The beneficiary of the filtering is the end user and the filterer, not the content owner per se.

I believe that it is difficult to be an aggregator without also being a filterer ... Aggregators ... [must] come up with their own clever filtering mechanisms to help consumers fully appreciate and navigate the breadth of the content they have to offer.

I think it is helpful for venture capitalists and entrepreneurs alike to focus on where the money is in the Tail. The real money is in aggregation and filtering and those will continue to be interesting businesses for the foreseeable future.
Gather up the long tail content, then filter. Help people find what they need.

Massive selection isn't enough. To make the long tail accessible, irrelevant items should be hidden. Interesting items should be emphasized. Millions of poor choices should be reduced to tens of good ones. The value is in surfacing the gems from the sea of noise.

See also my earlier posts, "Personalization and the long tail" and "Profiting from the long tail".

Tuesday, December 13, 2005

Kill Google, Vol. 2

According to a NYT article, Bill Gates was asked last month, "Will you do to Google what you did to Netscape?":
Mr. Gates, the Microsoft co-founder and chairman, paused, looked down at his folded hands and smiled broadly, as if enjoying a private joke. "Nah," he replied, "we'll do something different."
And what would that something different be? The article goes on to suggest that it will be web services, but I think it will be going after Google's lifeblood, advertising.

Geeks like me think of Google as a search company, but most biz folks I talk to view Google as an advertising company. It is the ads that generate the revenue. It is the ads that allow everything else to happen.

The AdSense revenues -- revenues from ads placed on other sites -- may be particularly vulnerable to attack. This was 43% of Google's revenue in Q3 2005. With these ads, the owner of the site gets roughly 70% of the revenue from the ad. Google takes the other 30%.

It seems like Microsoft could do a fair amount of damage here by trying to drive the share the advertising engine takes in this deal to near zero. To do that, it just needs to launch its own AdSense-like product and be willing to set its take to its breakeven point.

There are some indications that Microsoft may be planning to do this. Bill Gates pointed out that Google makes a lot of money from advertising and then scolded Google for keeping all of the advertising money for itself. Nicholas Carr recently wrote that "the wide profit margins Google enjoys on internet advertising are unsustainable" and "competition, from Yahoo and Microsoft as well as others, can be expected to reduce the profits." And MSN just announced a pilot of an AdSense-like product called AdCenter.

However, there is a big assumption here, that other advertising engines can generate the same revenue as AdSense. As long as Google's clickthrough rates are roughly 30% higher, it will be impossible for anyone else to drive Google's share of the revenue to zero.
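
To make that arithmetic concrete (the numbers below are illustrative, not Google's actual figures), suppose Google's ads earn 30% more per thousand impressions than a rival's. Even if the rival hands publishers 100% of its revenue, Google can match that payout and still keep a slice:

    # Illustrative arithmetic only (made-up numbers): why better ad
    # monetization keeps Google's revenue share from being driven to zero.
    google_revenue_per_1k = 1.30   # assumes ~30% higher monetization than a rival
    rival_revenue_per_1k = 1.00

    publisher_gets_google = 0.70 * google_revenue_per_1k   # ~$0.91 at a 70% payout
    publisher_gets_rival = 1.00 * rival_revenue_per_1k      # $1.00 even at a 100% payout

    # Google can match the rival's best possible payout and still keep a margin:
    matching_share = rival_revenue_per_1k / google_revenue_per_1k   # pay out ~77%
    print(round(publisher_gets_google, 2),   # 0.91
          round(publisher_gets_rival, 2),    # 1.0
          round(1 - matching_share, 2))      # 0.23 kept by Google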

And that is Google's defense. Focus on relevance. If Google can maintain its lead on relevance, if it can maintain higher clickthrough rates, if it can continue to generate more revenue for sites using AdSense, it is not vulnerable.

See also "Kill Google, Vol. 1" where I focus on the dangers of from growing too fast and from failing to innovate quickly.

See also "Kill eBay, Vol. 1" and "Kill eBay, Vol. 2".

Update: After the March 2006 analyst day, Google posted slides that said in the notes:
AdSense margins will be squeezed in 2006 and beyond. Y! and MSN will do un-economic things to grow share.
Google expects Yahoo and Microsoft to attack them using this strategy.

AWSP offers shell access at Alexa?

I thought Amazon Mechanical Turk was one of the strangest things I've seen in a while, but Amazon is weirding me out again with their new Amazon Web Search Platform (AWSP).

AWSP is supposed to be a developer framework for innovating on top of the crawl and index data available from Alexa. As part of this package, it appears that AWSP offers ssh access to the Alexa cluster, where you can write arbitrary C code.

This is either incredibly bold or absurdly foolish. On the one hand, this could be a useful platform for some developers, a utility computing server farm where you can rent machines by the CPU hour and access the incredible Web data available from Alexa. On the other hand, arbitrary C code can do arbitrary things, nicely accessing the data it is supposed to or evilly cracking the machine, fondling other people's data, and launching attacks on other servers.

You have to hand it to Amazon. They've been doing an amazing job thinking outside the box lately. But, sometimes, the box is there for a reason.

Update: In the comments, a couple people are arguing that these accounts appear to be isolated in virtual machines and that I may be overstating the risk. They might be right, perhaps I am being too paranoid, especially given that there are easier targets out there.

Friday, December 09, 2005

Yahoo buys del.icio.us

Just eight months after taking funding, the popular social bookmark website del.icio.us gets acquired by Yahoo.

Jeremy Zawodny has the announcement on the Yahoo Search blog and Joshua Schachter announces on the del.icio.us blog.

Yahoo seems to be making quite a push on tagging and social software. It will be interesting to see how this plays out.

See also my previous posts, "Questioning tags" and "Yahoo gets social with MyWeb".

Update: Greg Yardley claims the deal is rumored to be for $30M and computes that that works out to roughly $100 per user. Yowsers. Greg goes on to say, "Yahoo didn't buy del.icio.us' technology; it bought our bookmarks and tags - and for quite a price."

Update: John Battelle says his sources also put the deal at around $30M.

Update: Paul Kedrosky's sno.oker.ed post is pretty amusing. Some good points from Paul and in the comments on the post. [via Om Malik]

Thursday, December 08, 2005

Survey paper "Deeper Inside PageRank"

Ho John Lee pointed to a long but truly excellent survey paper on PageRank, "Deeper Inside PageRank" (PDF) by Langville and Meyer.

The 46-page paper not only describes PageRank and twiddles of PageRank in detail, but also covers research on optimizing the PageRank computation and on generating personalized versions of PageRank. It's a thick, dense paper, a lot of work to plow through, but I found a lot of juicy food for thought in there.

I ended the paper buzzing with questions, primarily around the probabilities of link transitions and the personalization (aka teleportation) vector. If you've got a good understanding of PageRank, I'd appreciate it if you could comment on my thoughts below and let me know if I've gone astray.

On the probabilities of transitioning across a link in the link graph, the paper's example on p. 338 assumes that surfers are equally likely to click on links anywhere in the page, clearly a questionable assumption. However, at the end of that page, they briefly state that "any suitable probability distribution" can be used instead, including one derived from "web usage logs".

Similarly, section 6.2 describes the personalization vector -- the probabilities of jumping to an unconnected page in the graph rather than following a link -- and briefly suggests that this personalization vector could be determined from actual usage data.

In fact, at least to my reading, the paper seems to imply that it would be ideal for both of these -- the probability of following a link and the personalization vector's probability of jumping to a page -- to be based on actual usage data. They seem to suggest that this would yield a PageRank that would be the best estimate of searcher interest in a page.

But, if I have enough usage data to do this, can't I calculate the equivalent PageRank directly? Let's say I have a toolbar (like Alexa, Yahoo, or Google) or ISP logs (like MSN or AOL) that give me data on everything people visit. Instead of weighting the links in the link graph using the usage data, can I ignore the link graph and rank pages by their likelihood of being visited?

What's the difference between these two calculations? In one, I'm summing over the probability that surfers come over inbound links to find the probability that people will visit the page. In the other, I'm computing that probability directly from who actually visited the page. The link graph would seem to be something to fall back on only if you don't have the usage data, an indirect estimate of the relevance of a page that you could calculate directly given enough data.
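
To make the comparison concrete, here is a toy sketch of the two calculations as I understand them (my own simplification with made-up numbers, not the paper's code): a PageRank power iteration where both the link probabilities and the personalization (teleportation) vector come from usage data, versus simply normalizing observed visit counts:

    # Toy sketch (my own simplification, made-up numbers): (1) PageRank with
    # usage-weighted link probabilities and a usage-based personalization
    # vector, versus (2) ranking pages directly by observed visit frequency.
    import numpy as np

    pages = ["a", "b", "c"]

    # Fraction of clicks out of each page that went to each other page,
    # estimated from usage logs (each row sums to 1).
    transition = np.array([
        [0.0, 0.8, 0.2],
        [0.5, 0.0, 0.5],
        [1.0, 0.0, 0.0],
    ])

    # Probability of jumping (teleporting) to each page, also from usage logs.
    personalization = np.array([0.5, 0.3, 0.2])

    def pagerank(transition, personalization, damping=0.85, iters=100):
        rank = personalization.copy()
        for _ in range(iters):
            rank = damping * rank @ transition + (1 - damping) * personalization
        return rank / rank.sum()

    # Calculation 1: propagate usage-derived probabilities through the link graph.
    print(dict(zip(pages, pagerank(transition, personalization).round(3))))

    # Calculation 2: skip the graph and rank by observed visit frequency.
    visits = np.array([500, 300, 200])
    print(dict(zip(pages, (visits / visits.sum()).round(3))))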

Now, I am assuming here that the value of a link is entirely determined by how much that link is used. If an unused link on a page does have meaning and should influence the relevance of the linked page, then this all falls apart.

But is this otherwise accurate? With enough data on what pages people visit, could the calculation over the link graph be eliminated or at least reduced?

By the way, don't miss the interesting discussion in the paper on using the personalization vector for personalized search by using a different vector for different groups of people. There are severe scaling issues with this method of search personalization, which can be partially addressed by some of the ideas from the Google Kaltix folks. For more on that, see my previous post, "More on Google personalized search".

[Ho John Lee post via Brian Dennis]

Update: Ho John Lee responded with some comments. Definitely worth reading.

Yahoo Answers and wisdom of the crowd

Jeremy Zawodny has the post on the Yahoo Search blog announcing a new product, Yahoo Answers.

Jeremy describes it as "a place to tap the collective wisdom of the Internet for advice, recommendations, theories, jokes, ... whatever."

Both Gary Price and Michael Bazeley talk about the similarity of the new Yahoo Answers with existing forums and message boards. I think this comparison is pretty accurate.

There are already moderated discussion forums where people can rate the quality of posts. Yahoo Answers would appear to be essentially the same thing, a user-moderated forum for people to talk about whatever.

Both Gary and Michael also contrast Yahoo's offering with Google Answers. Yahoo Answers is a free service where anyone can answer a question. Google Answers is a paid service where expert researchers answer questions. This difference is important.

Google Answers keeps quality high by charging a fee and restricting who can answer a question. Yahoo Answers hopes to keep quality high with a rating and reputation system.

Unless Yahoo Answers' reputation system includes something novel that does a better job of ferreting out experts, the site will have the same problem all user-moderated forums have. A popularity contest isn't the best way of getting to the truth.

People don't know what they don't know. Majority vote doesn't work if people don't have the information they need to have an informed opinion.

There was a case in the news a couple years ago of a legal advice site that had user-moderated forums. The idea was that lawyers would come on to the site, give short opinions, and use the goodwill gained to drum up future business.

A teenage kid with no legal training whatsoever hopped on to the system and started answering hundreds of legal questions with common sense answers. Despite the fact that some of his advice was wrong, badly wrong in some cases, he had the highest ratings on the site.

There is wisdom in the crowd. There is also a lot of noise. Separating the wisdom from the noise is the challenge.

Update: Looking at the Yahoo Answers point system, it appears to me that there is an incentive to answer as many questions as possible as quickly as possible without worrying about accuracy. I think that's going to need some tuning.

Update: Gary Price points out that Ask Jeeves had a very similar system to Yahoo Answers called AnswerPoint that they shut down in 2002. Why did they shut it down? Ask Jeeves SVP Jim Lanzone told Gary that the user base was very small, that "as a free service, there was little incentive for people to answer other people's questions," and that "it was usually just faster and easier for people to search normally ... than to submit a question to the community and wait for an answer."

Update: Nine months later, Philipp Lenssen posts an interview with a frequent contributor to Yahoo Answers named Michael. Michael said that, on Yahoo Answers, "the signal-to-noise ratio is astounding ... it's very difficult to sort the wheat from the chaff." He also disliked the Yahoo Answers point system, saying that they "encourage people to just give one-liner spammed responses to questions instead of actually putting in some thought."

Wednesday, December 07, 2005

Organizing chaos and information overload

In his recent post "Organizing Chaos", Peter Rip talks about the value of targeting content:
Targeting equates to value. Targeting specificity increases as volume increases, lifting the value of the entire inventory. It is a virtuous cycle .... More users generate and attract more content. Content expansion increases the value of targeting. Value is extracted by making the content more searchable, and ultimately, reusable.
This reminds me of what Bill Joy said about information overload:
Our lives are overwhelmed by all the information coming at us in a very disorganized way. We're going to hunger for something that will make sense of all the chaos--that will look at all the things happening in the world and filter and order them in a way that's personalized to us. That will be the next great revolution--that is something that doesn't take an index of the dead information on the Net, but the live information of things as they are occurring and as they are relevant to us.
Or what John Doerr said:
Maybe we'll get to 3 billion people on the web and say that what matters to all of us is information, and products, and more. Which is we live in time and we're assaulted by events. And, so, let's just say there's 3 billion events going on at any given time. And if you wanted to compute the cross product of the 3 billion people and the 3 billion events -- 'cause you need to filter very carefully the information that's going to get to this device -- I don't want to be assaulted by anything but the most relevant information ...
Or what Bill Gates said:
Workers are increasingly deluged with ... scads of information ... But finding just what they need when they need it is tough. "The software challenges that lie ahead are less about getting access to the information people need, and more about making sense of the information they have."
Or what John Battelle said:
Through the actions we take in the digital world, we leave traces of our intent, and the more those traces become trails, the more strongly an engine might infer our intent given any particular query ... I expect those trails ... to turn into relevance gold .... Clickstreams are the seeds that will grow into our culture's own memex -- a new ecology of potential knowledge -- and search will be the spade that turns the Internet's soil.
Or what I have said ([1] [2] [3]):
The urgent scaling problem for our users ... is scaling attention. Readers have limited time ... It will become harder and harder to find and discover the gems buried in all the noise. We need to help readers focus, filter, and prioritize.

There is tremendous potential in this flood of data, an opportunity to extract knowledge from the noise. ... There is wisdom in that crowd. All we need to do is find it.

Show me what matters. Help me find what I need .... Where before there was an undifferentiated glut of information, now there is focus. Where before there was noise, now there is knowledge.

Monday, December 05, 2005

Google's rules of management

In a Newsweek article, Google CEO Eric Schmidt says that Google's management philosophy gives them a competitive advantage over other firms.

A few highlights from the article:
Cater to their every need ... The goal is to "strip away everything that gets in their way." We provide a standard package of fringe benefits, but on top of that are first-class dining facilities, gyms, laundry rooms, massage rooms, haircuts, carwashes, dry cleaning, commuting buses -- just about anything a hardworking engineer might want. Let's face it: programmers want to program, they don't want to do their laundry. So we make it easy for them.

Data drive decisions. At Google, almost every decision is based on quantitative analysis. We've built systems to manage information, not only on the Internet at large, but also internally ... We have a raft of online "dashboards" for every business we work in that provide up-to-the-minute snapshots of where we are.

We adhere to the view that the "many are smarter than the few" ... At Google, the role of the manager is that of an aggregator of viewpoints, not the dictator of decisions. Building a consensus ... always produces a more committed team and better decisions.

Hire by committee. Virtually every person who interviews at Google talks to at least half-a-dozen interviewers ... Everyone's opinion counts, making the hiring process more fair and pushing standards higher ... If you hire great people and involve them intensively in the hiring process, you'll get more great people ... [a] positive feedback loop ... [with] a huge payoff.

A trusted work force is a loyal work force.
See also my previous posts ([1] [2] [3] [4]) on Google's exceptional benefits and the advantages it gives them.

See also my previous post, "The Human Equation", where I discuss a book that argues that investing heavily in your people pays off not just for knowledge workers, but for all workers.

[Newsweek article via Niall Kennedy]

Update: Some additional insight in a Business 2.0 interview of Eric Schmidt by John Battelle. [via Gary Price]

Sunday, December 04, 2005

Advanced search, PostScript, and improving search

Last night, I was trying to find something pretty specific, a PostScript program that generates random mazes when you send it down to your printer. I was having a hard time finding it with quick searches for "postscript maze" and the like, so I switched to advanced search.

What I decided to do was a search limited by filetype to Postscript (.ps) files with the word "maze" in the filename (or URL).

I was surprised to find that only Google supports this query (e.g. [allinurl: maze filetype:ps]). I think AltaVista used to be able to do it, but can't now that it is owned by Yahoo. Yahoo, MSN Search, Ask, none of the other engines can do this particular query.

There is a debate right now about whether search can be improved by giving people more powerful tools (advanced search, MSN Search's "Search Builder", Clusty's clustering, A9's "columns") or whether search just needs to do the right thing (question answering, personalized search).

While I'm not a huge believer in improving search with more powerful tools -- I don't think the mainstream will bother with them -- I'm surprised that advanced search isn't getting more attention. I was amazed that only Google supported this particular search.

By the way, PostScript is a full programming language, though a rather bizarre one, so it really is possible to write very short programs that generate mazes, fractals, and other goodies when you send them down to your printer.

I did a few of these back when I was in undergrad, but I've misplaced the files now, so I went searching to see what other people had done. If you want to check out what I found, here are two ([1] [2]) of my favorites. They're PostScript files, so you'll need a PostScript printer or GSView to see them.

How is Yahoo My Web 2.0 doing?

It's been several months since Yahoo My Web 2.0 launched. How is it doing now?

Yahoo My Web 2.0 was announced as a way to overcome the "limits of web search" using "social search". Yahoo My Web 2.0 allows tagging of bookmarked web pages (like del.icio.us) and throws in some nice search and social networking features.

At the time it launched, I criticized Yahoo My Web 2.0 as being too much work for too little gain and doubted it would get mainstream adoption. Danny Sullivan at Search Engine Watch also was skeptical.

It's nearly six months later now, so let's go back and see how Yahoo My Web 2.0 is doing. Unfortunately, there's no easy way to get traffic numbers for the site, but Yahoo does post some metrics for My Web 2.0 on their home page. They say they have 407,819 pages and 99,800 tags.

This doesn't seem like a lot to me. "My Community" of 20 contacts has 2,821 saved pages and 1,411 tags. This would suggest that the entire community is only a couple orders of magnitude larger than my community. It seems that there may be just a few thousand people using Yahoo My Web 2.0.
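
Assuming my community's activity is roughly typical (which it may well not be), the back-of-the-envelope estimate looks like this:

    # Back-of-the-envelope estimate using only the numbers above; it assumes
    # my community's activity is typical of My Web 2.0 users overall.
    total_pages = 407819          # pages saved across all of My Web 2.0
    my_community_pages = 2821     # pages saved by my community of 20 contacts
    my_community_size = 20

    ratio = total_pages / my_community_pages      # ~145x my community
    estimated_users = my_community_size * ratio   # ~2,900 active users
    print(round(ratio), round(estimated_users))   # 145 2891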

What do you think? Are you using Yahoo My Web 2.0? If so, what do you like about it? If not, why not?

Update: Yahoo just acquired del.icio.us. My Web 2.0 was viewed by some as a del.icio.us knock-off. I wonder if Yahoo decided 8-9 months ago to build My Web 2.0 instead of buying del.icio.us, failed to get traction, and is now going ahead with the buy.

So, was the Orkut code really stolen?

About a year and a half ago, Wired reported that Google was being sued by a small company called Affinity Engines because of the code used for Orkut.

The suit alleged that Orkut Buyukkokten -- the engineer who wrote Google's Orkut social networking site and named it after himself -- reused Affinity Engines code when developing the Orkut site. Orkut Buyukkokten used to work at Affinity Engines.

What ever happened with this? Did Orkut steal the source code? I've heard rumors that Google is utterly in the wrong here, but the court case still is slowly grinding its way through. If there is evil to be found here, it seems to be buried under enough legal slime that it may be years before it is fully exposed.

But, looking at how Google's social networking site has languished, I can't help but think that this controversy is at least somewhat responsible. If Google was concerned about liability, they'd have a big incentive to discontinue development of Orkut.

What do you think? Was the code stolen? If so, is it a violation of Google's "do no evil" mantra? If not, why does Orkut seem to have been abandoned by Google?

Update: A few months later, the court case appears to have ended. When I inquired about it, I got the following statement from Michael Kwun at Google: "The parties have resolved their differences in this matter and have agreed not to share the terms of the agreement. We are very pleased with the outcome."

Friday, December 02, 2005

E-mail overload, social sorting, and EmailRank

Many of us know what it is like to be overwhelmed by e-mail. We fear our inboxes as a never-ending, poorly differentiated barrage, requiring laborious effort to manually skim, sort, and prioritize.

I look at this mess and think to myself, why? Why do I have to do this myself? Can't the computer help me here?

What I would like to see is a TrustRank-like system of propagating importance and reputation through a network of my e-mail contacts.

Here's how it would work:
Analyze who I e-mail, giving each person an implicit rank of importance based on my e-mail history.

Add in any people I explicitly indicate are important.

Propagate this importance through the network. That is, for each person I think is important, look at the people that important person thinks are important, and say those people must be at least somewhat important to me, then rinse and repeat.
So, now I have a large list of which contacts are important, important to me and to the community that surrounds me.

On any incoming mail, combine the importance of the contact with attributes of the specific mail message to mark the importance of the mail. Call it EmailRank, a relevance rank for incoming e-mail.

Seems like this wouldn't be too bad to implement at a large web-based e-mail site like GMail, Yahoo Mail, or Hotmail. They already have all the contact data right there. Build the graph, propagate, add analysis of the e-mail. I'm surprised it hasn't happened already.
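
Here's a rough sketch of what I have in mind (toy data, all names made up): seed contact importance from my own mail history, then propagate it through the contact graph, much in the spirit of a personalized PageRank or TrustRank.

    # A rough sketch of EmailRank (toy data, all names made up): seed contact
    # importance from my own e-mail history, then propagate it through the
    # contact graph, in the spirit of a personalized PageRank / TrustRank.
    from collections import defaultdict

    # Who e-mails whom, with counts (my own history seeds the importance).
    mail_counts = {
        "me":    {"alice": 40, "bob": 10},
        "alice": {"carol": 30, "me": 20},
        "bob":   {"dave": 5},
    }

    def propagate_importance(mail_counts, me="me", damping=0.85, iters=20):
        # Seed: how often I mail each contact, normalized.
        my_total = sum(mail_counts[me].values())
        seed = {p: n / my_total for p, n in mail_counts[me].items()}
        importance = defaultdict(float, seed)
        for _ in range(iters):
            nxt = defaultdict(float, {p: (1 - damping) * s for p, s in seed.items()})
            for sender, outbox in mail_counts.items():
                if sender == me or not outbox:
                    continue
                total = sum(outbox.values())
                for recipient, n in outbox.items():
                    if recipient == me:
                        continue
                    # People my important contacts mail become somewhat important to me.
                    nxt[recipient] += damping * importance[sender] * n / total
            importance = nxt
        return dict(importance)

    scores = propagate_importance(mail_counts)
    # Combine the sender's score with per-message features to rank new mail.
    print(sorted(scores.items(), key=lambda kv: -kv[1]))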

See also a couple interesting Microsoft Research papers around this idea, "Attention-Sensitive Alerting" (PDF) and "The Social Network and Relationship Finder" (PDF). Neither discusses propagation of importance through a contact network, but both have a lot of ideas on methods for ranking e-mails using other data.

See also "Better e-mail prioritization going mainstream?" on TechDirt.

Update: There are also recent articles in PCWorld, CNet, and other sites on SNARF, the second of the two Microsoft Research projects I mentioned.

Microsoft Fremont is impressive

Kurt Weber (GPM, Microsoft Fremont) and Brady Forrest (PM, MSN Search) were kind enough to give me a demo this morning of Microsoft Fremont, an upcoming online, community-driven marketplace roughly similar to Craigslist.

Fremont emphasizes small scale selling to friends and acquaintances. These are easy transactions with people who have to see you again, so it makes for a friendly exchange with less risk of problems. I think it is quite likely I would prefer it over eBay, Craigslist, or selling used items at Amazon.

The goal is selling, not building up a network of friends. The social network is built implicitly from your list of IM buddies or from e-mail address groups, no fuss, no effort. Listing items is straightforward. The UI is clean and easy. It's a system you could see your grandma using.

It is impressive. They definitely have got the idea of social networking with a purpose.

I have to say, I'm surprised to see this from Microsoft. I would have expected to see this first from Yahoo as an outgrowth of Yahoo 360. Or from Amazon as some clever combination of Amazon's community features and Amazon's selling features. Or maybe from Google as an attempt to actually make Orkut useful.

Instead, Microsoft steps up to the plate. From what I've seen so far, it looks like they'll hit it out of the park.

See also my previous post, "Microsoft Fremont vs. Google Base".

Update: Three months later, Microsoft renamed Fremont to Windows Live Expo and launched it. It has no payment mechanism, so it basically feels like Craigslist with a couple social networking features thrown in. Microsoft Passport registration is required for use, something that I found to be an annoying hurdle.

In short, not as impressive as I had expected.

VCs and investing in "me-too" companies

John Cook at the Seattle PI talks about how VCs seem to follow trends, throwing money at startups with many existing venture-backed competitors:
VCs are investing in startup companies that already have four or five venture-backed competitors -- something I saw during the dot com boom of the late 90s and something that is occurring once again ...

Rustic Canyon Partners' Jon Staenberg said he is concerned about the "me-too" companies being formed in certain sectors. He said the world didn't need five or six online pet food stores during the late 1990s and it probably doesn't need five or six social networking companies today.

Some carnage will occur.
From my experience talking with VCs over the last couple of years, I think VCs invest in "me-too" companies because there is less personal risk in doing that than in finding and investing in new ideas.

Look at it from a VC's point of view. One company is entering a market space that several other VC firms have evaluated and blessed. Another company has a new product with no market history, requiring a lot of effort to evaluate its potential.

Arguing for investing in the first company is easy. Just point at all the other interest and the due diligence the other firms presumably already did. If the company fails in the end, you can say, "Well, everyone thought this was a good idea. Not my fault."

Championing the second company is personally risky and hard. Due diligence on a new technology requires a lot of expertise, time, and work. In the end, a couple people at the firm will have to stick their necks out on the investment to get the other partners to go along, something that is personally risky for them if the investment fails.

While this may not be the best thing for the investors in the fund, the VCs are just behaving as rational actors, minimizing their personal risk. "Me-too" trend following is the result.

[I first posted a version of this as a comment on John's post. I later decided to cross-post it here.]