Wednesday, November 05, 2008

Data is good, code is a liability

Googler Peter Norvig gave a talk at industry day at CIKM 2008 that, despite my fascination with all things Peter Norvig, almost frightened me off by including the phrase "the Ultimate Agile Development Tool" in its title.

The talk redeemed itself in the first couple minutes, citing Steve Yegge's "Good Agile, Bad Agile" and making it clear that Peter more meant being agile than Agile.

His core point was that "code is a liability". Relying on data over code as much as possible allows simpler code that is more flexible, adaptive, and robust.

In one of several examples, Peter put up a slide showing an excerpt for a rule-based spelling corrector. The snippet of code, that was just part of a much larger program, contained a nearly impossible to understand let alone verify set of case and if statements that represented rules for spelling correction in English. He then put up a slide containing a few line Python program for a statistical spelling correction program that, given a large data file of documents, learns the likelihood of seeing words and corrects misspellings to their most likely alternative. This version, he said, not only has the benefit of being simple, but also easily can be used in different languages.

For another example, Peter pulled from Jing et al, "Canonical Image Selection from the Web" (ACM), which uses a clever representation of the features of an image, a huge image database, and clustering of images with similar features to find the most representative image of, for example, the Mona Lisa on a search for [mona lisa].

Peter went on to say to say that more data seems to help in many problems more than complicated algorithms. More data can hit diminishing returns at some point, but the point seems to be fairly far out for many problems, so keeping it simple while processing as much data as possible often seems to work best. Google's work in statistical machine translation works this way, he said, primarily using the correlations discovered between the words in different languages in a training set of 18B documents.

The talk was one of the ones recorded by videolectures.net and should appear there in a week. If you cannot wait, the CIKM talk was similar to Peter's startup school talk from a few months ago, so you could use that as a substitute.

7 comments:

Anonymous said...

Whatever. Code is Data.

Anonymous said...

cjh: *whoosh*

Greg: might the spelling corrector he showed be this one? http://norvig.com/spell-correct.html

Anonymous said...

sort of like using look-up tables for trig functions

Unknown said...

The thing that always scares me with data drive "stuff" is when it doesn't work the way you want/expect. Then what? These types of systems are very difficult to debug. That crazy pile of case and IF statements at least is debuggable.

Anyone making a data driven system must consider building in some sort of debugging system otherwise you're back in the code anyway.

Anonymous said...

Debuggers allow you to temporarily play with time and space. But they are a real pain when the bug is rare.

Data on the otherhand is fully observable. Data driven programming, the glorified lookup table, allows on one to see the entire table, to map from the result to the input. It is inherently reversible for many problems.

This is a parallel to functional programming model where you do not mutate the original version, you copy and transform at each layer.

To a find bugs, bugs that are transforming data incorrectly, I don't say corrupt because you still have your source material. To find these bugs one simply has to look at the mappings from the input to the output. Much simpler.

Jeff Kubina said...

Greg, one of the greatest points I took away from Peter's talk was to continue testing your program against ever larger training and test sets until the scores asymptote; if the scores don't asymptote, then it may be worth your time to get more data. In one of his slides he showed how the performance and ranking of various machine translation programs change significantly as they were applied to more data; the initial lowest scoring method out performed them all when applied to more data. During the talk I also thought that by training on the data you are essentially building another program and that is where the complexity should lie. However, sometimes when I apply such approaches (like neural nets) I don't get a an understanding of the data or why program works. A similar talk by Peter Norvig is here. Also, thanks for all the CIKM write-ups.

Jeffrey Davidson said...

After a pretty significant delay, the video lecture is finally posted. You can access it at http://videolectures.net/peter_norvig/.