Our submission to the [ICFP 2001 programming contest] won the first prize. Our program is named "LALR(5000000)", for the reason explained in the README file:
The problem is equivalent to finding the most likely parse (or rather,
a pretty good parse) of a given string in a probabilistic context-free
grammar. More precisely, the string to be parsed is the sequence of
decorated characters (or rather, contiguous chunks thereof); the PCFG
describes the stack transitions and costs of the markup language; each
parse tree is an acceptable way to describe the given sequence of
decorated characters using the markup language. Fortunately, the
grammar is highly regular; unfortunately, the grammar is also highly
recursive and connected.
Standard parsing algorithms exist for PCFG parsing. Our program is
roughly based on one such algorithm, probabilistic CYK parsing.
Seldom are parsing algorithms designed to handle multi-megabyte
sentences, however...
Dylan Thurston <dpt@math.harvard.edu>
Chung-chieh Shan <ken@digitas.harvard.edu>
We are currently working on a write-up to describe our entry. A draft is available (dated September 20, 2001; 06:25:33; revision 1.23):
You can also [download the code we submitted here] (570,941 bytes).
--KenShan