|
2041e03cf27ddc605b76a9d6dfecba05 |
An ongoing project; joint work with / help from Stuart Shieber and Carl Gunter. --KenShanIntroductionIn many cryptographic applications, two humans each have a document and need to check that they are identical. For example, Alice may have sent her public key to Bob over insecure email; Bob then needs to confirm that he has received Alice's public key rather than some other public key that an adversary injected into the network. To do this, Bob typically contacts Alice in person or over the phone, chats a bit to make sure that it is really Alice who he is talking to, then reads off to Alice, in hexadecimal, some cryptographic hash of the public key he received. Alice computes the same cryptographic hash of her public key and checks that the result is equal to what she hears. (The cryptographic hash of the key is known as its fingerprint.) Typical public keys are between 1000 and 2000 bits in length; typical key fingerprints are between 100 and 200 bits in length, or a few dozen hexadecimal digits. Assume that the adversary cannot fool Bob during his chat with Alice, and that the cryptographic hash function used by Alice and Bob are identical and secure. The key fingerprinting scheme described above still has disadvantages, because humans are not used to exchanging hexadecimal strings in speech. Some digits sound similar (for example "B", "D", and "E" in English), and others are cumbersome to pronounce in sequence. As a result, people exchanging fingerprints over the phone often repeat themselves ("that is a B, not a D"), use a pilot alphabet ("alpha", "beta", "Charlie"), or both. Pilot alphabets are based on the idea that people can more easily transmit natural language words over speech than alphanumeric strings. We can describe the more general concept of mnemonic codes as follows. To send an alphanumeric string to Alice, Bob expands the string into an utterance that may be longer but also more natural. Bob speaks this utterance to Alice, who then contracts what she heard to recover the intended string. In applications such as key fingerprinting, where Alice and Bob are confirming rather than transmitting the string, Alice may simply check that what she heard is a/the valid expansion of the expected string, without performing any contraction; this comparison task is simpler than transmission, for the obvious reason that comparison can be reduced to transmission. Pilot alphabets are a special case of mnemonic codes where expansion maps each character into a word beginning with it. We can also view reading off hexadecimal digits as a special case of mnemonic codes where expansion maps each character into its standalone spoken form. Both examples expand each character separately, but that is not necessary. We can imagine mnemonic codes in which each pair of digits is expanded together, or entire strings are assigned expansions all at once. The rest of this document describes and discusses a mnemonic code based on arithmetic coding. {It would be more appropriate to start by giving a formal definition of mnemonic coding and defining a utility function to pin down what a good mnemonic code is, but we are not quite that formal yet.} A mnemonic code based on arithmetic codingArithmetic coding is a pair of algorithms that encodes text streams into bit streams and decodes bit streams into text streams. Both encoding and decoding: * rely on the implementor to provide a language model, i.e., a probability distribution over text streams; * run online, i.e., produce partial output as input continues to arrive; and * use constant time per input token, and constant space. In the ideal situation, where the language model provided perfectly matches the actual distribution of incoming text, the encoder produces a uniformly distributed stream of bits with the same asymptotic entropy as the input text, thereby making optimal use of the output bandwidth. Conversely, when the decoder is fed a uniformly distributed stream of bits, it produces an output text stream with the same asymptotic entropy as the input bits, and distributed according to the language model. The decoder can thus be thought of as an optimal sampling algorithm that uses the fewest random bits to produce the most text distributed according to any given model. Our basic idea is to use this decoder as the expansion algorithm in a mnemonic code. For more details on arithmetic coding, see Chapter 5 of the book Text Compression by Timothy Bell, John Cleary, and Ian Witten. Assume that Alice and Bob both speak some natural language L. Denote by W the set of words in L; we then have the set of (finite) strings W^* and the set of (infinite) streams W^\omega. Denote the n-th word in a stream or string y by y_n. We can think of L itself as an infinite stream of random text, and each string in W^* as the prefix of an infinite number of possible streams in W^\omega. Hence, we let Y be a random stream (a random variable ranging over streams y \in W^\omega). Each word Y_n in Y is a random variable ranging over W. Suppose that we have a probabilistic model of L. Our model may be a simple bigram or n-gram model or a more sophisticated one involving longer-distance correlations. In any case, we can practically assume that our model can be expressed as a family of efficiently computable conditional probability distributions : P(Y_1), : P(Y_2 | Y_1), : P(Y_3 | Y_1, Y_2), : P(Y_4 | Y_1, Y_2, Y_3), : .... These distributions specify the likelihood of each word given its preceding context. A bigram model, for instance, would specify an initial probability vector : P(Y_1 = w) = A_w and a transition probability matrix : P(Y_{i+1} = w' | Y_i = w, Y_1, Y_2, ..., Y_{i-1}) = B_{w'|w}. {Formally, we define a (probabilistic) model of L to be a probability distribution over W^\omega that assigns a probability to each prefix yadda yadda.} Let P be any given language model. Our goal is to expand a string of m bits : x = x_1 ... x_m into a string in W^*. In practical applications such as key fingerprinting, the length m is known (e.g., it is the output length of our cryptographic hash function) and x is a uniform random sample from W^m (e.g., we assume our hash function is good). Conceptually, our mnemonic coding scheme simply feeds x to the decoder, instructing it to use P as the language model. The output from the decoder is then our expansion of x. One complication arises because the decoder takes infinite bit streams as input, not finite bit strings. Beyond the m-th bit, we pad the input bit string with an infinite stream of zeroes. Any other padding would work just as well, as long as the same padding stream is appended to all length-m strings for each m. Another complication arises because the decoder produces infinite text streams as output, not finite text strings. Given a length-m bit string x as input, we want to produce just enough output to ensure that we expand x differently from any other length-m bit string x' \ne x. Because of the way arithmetic coding works, we need only check that x is expanded differently from x+1 and x-1, where we treat each length-m bit string as a m-bit binary number. Therefore, to produce just enough output, we run three decoders in parallel on x, x-1, and x+1, and stop as soon as the three output text streams diverge. ImplementationWe have implemented the mnemonic code described above in C++ code that is reasonably clean and modular but mostly templated and uncommented. In particular, while we have only implemented a simple n-gram language model, it is relatively easy to swap in alternative language models or to implement alternative data structures for storing n-gram frequency counts. It is also easy to reuse the code to perform tasks other than expansion. Our implementation can be tested over the Web at http://www.digitas.harvard.edu/cgi-bin/ken/coding The Web interface is a simple CGI script written in Perl that invokes compiled C++ code in turn. It allows the user to input bit strings of any length in hexadecimal, octal or binary format. The user specifies the (n-gram) language model to use by selecting: * the number of words to use for context, i.e., n-1 for an n-gram language model; * the corpus to use for estimating probability distributions by maximum likelihood; * whether punctuation in the corpus is to be ignored, considered as part of the preceding word, or considered as separate words; * whether to ignore case in the corpus. A summary of our C++ source files: * coding.cpp -- the main driver code. * freq.h -- frequency tables: a simple, array-based implementation (array_freq_table) and a more sophisticated, binary-tree-based implementation (tree_freq_table); both are described in Chapter 5 of Text Compression. * model.h -- language models: a fixed unigram model (fixed_model), an adaptive unigram model (adaptive_model), and a fixed n-gram model (ngram_context and context_model); all three are described in Chapters 5 and 6 of Text Compression. * coder.h -- arithmetic coding: a templated implementation of both encoding and decoding: can be instantiated for any language model type. * bits.h -- utilities for manipulating bit strings: parsing bit string input in bases 2, 8, and 16; incrementing and decrementing bit strings viewed as integers. * copy.h -- the utility function copy_until_distinct: compares and copies three input streams to one output stream. * empty.h -- the empty class: useful for instantiating templates. * hash.h -- a header file to encapsulate away whether we use map<> in standard C++ or hash_map<> in SGI STL. * symbol.h -- symbol tables: instead of schlepping C++ string objects around, we use a two-way table to map words to integers on input and integers to words on output. * coding.h -- typedefs to instantiate all our templates with reasonable parameters. The compactness-naturalness tradeoffSo far, we have focused on arithmetic coding with an n-gram language model. The most straightforward method to create this n-gram language model is to estimate its probabilities from a corpus by maximum likelihood. This is the only method we have implemented. We did not implement any smoothing in our language model, except we back off to a lower-order context when the current n-gram context does not occur anywhere within the corpus. Typically, the perplexity of an n-gram language model decreases as n increases. This is because additional context at the end of a text stream helps distinguish between its possible continuations. As a result, smaller n results in shorter and less natural expansions, while larger n results in longer and more natural expansions. We can thus strike different points of balance on the compactness-naturalness spectrum of expansion functions by adjusting n. Playing with our implementation, we found that the subjectively "best" setting of n is somewhere between 2 and 3. Unfortunately, while we would like to make the compactness-naturalness tradeoff continuously, the parameter n is required to be an integer. An obvious idea that we have yet to try is to approximate non-integral values of n by replacing the "hard" backoff strategy we currently use with a "soft" backoff strategy, essentially using a weighted average across multiple context lengths. Chapter 6 of Text Compression describes several smoothing/backoff techniques, all of which can be plugged into our language model. Besides context backoff, several other smoothing techniques are commonly applied to language modeling. These include smoothing across words using part-of-speech and other tagging, clustering, and general maximum entropy approaches. The effects of these techniques on mnemonic coding remain to be investigated. In any case, as a general statement, these techniques are of interest to us because they may: * introduce additional parameters for the model, thereby letting us make the compactness-naturalness tradeoff continuously as discussed above; as well as * increase the perplexity of the model, thereby generating shorter expansions, while maintaining subjective naturalness. On the second point above, consider the trigram model built from Alice in Wonderland, ignoring punctuation and case. The bigram "of her" occurs in the book 17 times, followed by the words: * age * childhood * ever * favourite * going * head (2 times) * hedgehog * knowledge * little * or * own * sharp * sister * skirt * voice (2 times) Subjectively, it seems to us that: * A large number of nouns, adjectives, and even gerunds following "of her" would be reasonably natural-sounding, yet a decoder using this language model would not be allowed to discharge entropy by choosing any word other than the 15 listed above. * The trigram "of her voice" is not much more natural-sounding than the trigram "of her age", even though the former appears a whopping twice as many times as the latter. Mnemonic coding is intended to make bit strings easier to transmit (and, in potential applications other than key fingerprinting, to memorize). In general, maximum likelihood models tend to overfit data. In this example, overfitting results in a language model generating text that would be found significantly easier to transmit or remember than text generated from a smoothed model only by a devoted fan of Alice in Wonderland who has memorized the corpus. More generally, we expect a smoothed model to have higher perplexity yet comparable naturalness. Hence, by smoothing a parameterized family of models on the compactness-naturalness spectrum, we expect to produce more natural text with comparable length. Another idea towards increasing perplexity is to use corpus with higher perplexity, such as Alice in Wonderland or Gilbert and Sullivan lyrics {the latter suggested to me by Dylan Thurston} rather than Sherlock Holmes. Using corpus with high perplexity may be advantageous not just because the resulting language model may have high perplexity as well, but also because the resulting expansions may acquire a poetic or fantastic flavor that fits better with text generated from n-gram models in general. A more principled approach to all these attempts at improving our expansion function / language model would be to start by defining (quantifying) what it means for an expansion function / language model to be "better". Two criteria introduced above are compactness and naturalness. We can quantified compactness as the entropy of the language model. Here, by "entropy" we mean the information rate of text streams generated by the model, i.e., E_P(-\log P), not the description length required to encode the model itself. As for naturalness: Let P_0 denote Bob and Alice's model of L; in other words, the probability distribution P_0 measures how much Bob and Alice expect to encounter (and are optimized to process) each given text stream in L. We can quantify naturalness as the negative cross-entropy E_P(\log P_0). The general optimization problem we want to solve (exactly or approximately) is: Suppose that we have a language model P_0, in which we did our best estimating Bob and Alice's model of L, possibly using smoothing techniques discussed above. Our mnemonic coding scheme is to perform arithmetic decoding using some language model P, which may or may not be equal to P_0, because we want to choose P to maximize the objective function : Z ("log posterior plus normalizing constant") : = E_P(\log P_0 - \lambda \log P), where \lambda is a non-negative balance parameter set by the user, continuously. {I haven't quite worked out how the fact that streams are infinite affects the formal math here.} {Digression: Note that we *really* want to optimized based on the expected number of output words per input bit, not the expected number of input bits per output word as formulated above. I don't think it matters, but I have yet to prove it.} To maximize Z, we take the partial derivative of Z with respect to each element of the transition probability matrix of P. The outgoing probabilities in each context are constrained to add up to 1. Hence (by the Lagrangian multipler), the partial derivatives of Z with respect to the outgoing probabilities in each context must be equal. To a zeroth approximation, we can assume that only the steady state of the Markov process matters, and ignore (i.e., approximate by zero) the partial derivatives of the steady state probability vector with respect to the transition probability matrix. The result is that we can simply take the conditional probability distribution at each context in P_0 to a fractional power between 0 and 1, and renormalize, to get a new set of conditional probability distributions for P. To a first approximation, it may well work out if we run a sort of EM algorithm to optimize P iteratively. To be more exact than that, it would be nice if we could take the partial derivatives of the stable state distribution of a Markov model with respect to the transition probabilities of said model. Unfortunately, I was told that this is hard. I also faintly recall being told previously that even computing the exact entropy of a given Markov model is hard. {Related work: Some guy at IBM, who I can look up and whose paper I have with me, trained speech HMMs using EM in conjunction with a perplexity-minimizing prior. Among other things, he has a program for determining the words most likely to be confused with any given English word.} PunctuationIt is easier to read punctuated (and capitalized) text than unpunctuated text. Therefore, we would like expansion to produce punctuated text. If we were to generate punctuated text by performing decoding with a language model that directly incorporates punctuation, then some bit strings would be expanded into text strings that differ only in punctuation. In other words, the effective bandwidth of the mnemonic coding channel is reduced. Alice, listening to Bob's voice over the phone, may not be able to distinguish between the two text strings. This means that: * If we are using mnemonic coding to compare bit strings such as key fingerprints, then the security of the procedure is weakened. * If we are using mnemonic coding to transmit rather than compare bit strings, then Alice would not know what punctuation to type. Therefore, we should first generate unpunctuated text from the input bit string, then punctuate the expansion for the sole purpose of helping Bob pronounce it. We have yet to implement this postprocessing step of restoring punctuation. To this end, there is relevant existing work in restoring punctuation as well as accents to text. An interesting side question is: How well can a machine restore punctuation, anyway? An ideal punctuation restoration algorithm would make use of all information available in unpunctuated input text; however, there may be genuine ambiguities that even a perfect system would not be able to resolve. A practical upper bound on the quality of automatic punctuation restoration is the quality of manual punctuation restoration. To determine the latter, we stripped punctuation marks from some Wall Street Journal text and implemented a Web-based system where we challenge our friends to restore them. To gain access to this challenge and try it out yourself, email ken@digitas.harvard.edu. {The WSJ text tends to be literary and dense; we should try something like the Brown corpus as well to see how it compares.} SecurityBut wait-- Punctuation is not the only thing that Alice may have trouble distinguishing over the phone! In English, "to", "two" and "too" sound the same, as do "cat's cares" and "cat scares". In principle, the expansion process should: * first expand the input bit string into an equivalence class of acoustically indistinguishable phoneme sequences; then * render the phoneme sequence into text. Each of the two steps should make use of a separate language model. {Some more thoughts on the security of comparing key fingerprints using mnemonic coding:} Obviously, for every way we come up with to encode data, there's always a hypothetical English accent/dialect where two different pieces of data encodes to the same string using said way. Perhaps we can randomize, i.e., find a large set S whose elements are encoding schemes, such that : for every English accent/dialect, and for every bit string B, if E is an encoding scheme randomly chosen from S, then -- with high probability -- E encodes B differently from E encodes any other bit string (I think this is an appropriate definition for "security against malicious adversary"), even though it is not true that : if E is an encoding scheme randomly chosen from S, then -- with high probability -- for every English accent/dialect, E encodes different bit strings differently and it may also be not true that : for every English accent/dialect, if E is an encoding scheme randomly chosen from S, then -- with high probability -- E encodes different bit strings differently I'm not sure how to rigorously define "English accent/dialect" though. Obviously, "any mapping from text to speech" is not restrictive enough, because then I can just map everything to silence. Recent workRecently I integrated the CMU-Cambridge Statistical Language Modeling Toolkit (SLM) into my C++ implementation. Said toolkit takes care of smoothing n-gram models in various standard ways, performing probability table lookups quickly, etc. I added command-line options to my program to switch between using my language modeling code and the Toolkit's. The arithmetic (de)coding part of my program works fine with either. I also implemented a simple "knob" that takes all conditional probability values to a given power (then renormalizes). This knob doesn't seem to work very well -- it doesn't shorten the text by very much even when the exponent is zero. I used default Good-Turing smoothing options in SLM to train a bunch of larger models from the WSJ text in the North American News Text Corpus. Short summary of results: Larger corpus => Better language model => Quality output. Big questions: * In general, to build a better language model, does one simply throw larger corpora at the Good-Turing estimate? * Given that the Good-Turing estimate is pretty automatic (no parameters to set), how would we add a knob to the system? --KenShan |
In many cryptographic applications, two humans each have a document and need to check that they are identical. For example, Alice may have sent her public key to Bob over insecure email; Bob then needs to confirm that he has received Alice's public key rather than some other public key that an adversary injected into the network. To do this, Bob typically contacts Alice in person or over the phone, chats a bit to make sure that it is really Alice who he is talking to, then reads off to Alice, in hexadecimal, some cryptographic hash of the public key he received. Alice computes the same cryptographic hash of her public key and checks that the result is equal to what she hears. (The cryptographic hash of the key is known as its fingerprint.)
Typical public keys are between 1000 and 2000 bits in length; typical key fingerprints are between 100 and 200 bits in length, or a few dozen hexadecimal digits.
Assume that the adversary cannot fool Bob during his chat with Alice, and that the cryptographic hash function used by Alice and Bob are identical and secure. The key fingerprinting scheme described above still has disadvantages, because humans are not used to exchanging hexadecimal strings in speech. Some digits sound similar (for example "B", "D", and "E" in English), and others are cumbersome to pronounce in sequence. As a result, people exchanging fingerprints over the phone often repeat themselves ("that is a B, not a D"), use a pilot alphabet ("alpha", "beta", "Charlie"), or both.
Pilot alphabets are based on the idea that people can more easily transmit natural language words over speech than alphanumeric strings. We can describe the more general concept of mnemonic codes as follows. To send an alphanumeric string to Alice, Bob expands the string into an utterance that may be longer but also more natural. Bob speaks this utterance to Alice, who then contracts what she heard to recover the intended string. In applications such as key fingerprinting, where Alice and Bob are confirming rather than transmitting the string, Alice may simply check that what she heard is a/the valid expansion of the expected string, without performing any contraction; this comparison task is simpler than transmission, for the obvious reason that comparison can be reduced to transmission.
Pilot alphabets are a special case of mnemonic codes where expansion maps each character into a word beginning with it. We can also view reading off hexadecimal digits as a special case of mnemonic codes where expansion maps each character into its standalone spoken form. Both examples expand each character separately, but that is not necessary. We can imagine mnemonic codes in which each pair of digits is expanded together, or entire strings are assigned expansions all at once.
The rest of this document describes and discusses a mnemonic code based on arithmetic coding. {It would be more appropriate to start by giving a formal definition of mnemonic coding and defining a utility function to pin down what a good mnemonic code is, but we are not quite that formal yet.}
Arithmetic coding is a pair of algorithms that encodes text streams into bit streams and decodes bit streams into text streams. Both encoding and decoding:
In the ideal situation, where the language model provided perfectly matches the actual distribution of incoming text, the encoder produces a uniformly distributed stream of bits with the same asymptotic entropy as the input text, thereby making optimal use of the output bandwidth. Conversely, when the decoder is fed a uniformly distributed stream of bits, it produces an output text stream with the same asymptotic entropy as the input bits, and distributed according to the language model. The decoder can thus be thought of as an optimal sampling algorithm that uses the fewest random bits to produce the most text distributed according to any given model. Our basic idea is to use this decoder as the expansion algorithm in a mnemonic code. For more details on arithmetic coding, see Chapter 5 of the book Text Compression by Timothy Bell, John Cleary, and Ian Witten.
Assume that Alice and Bob both speak some natural language L. Denote by W the set of words in L; we then have the set of (finite) strings W^* and the set of (infinite) streams W^\omega. Denote the n-th word in a stream or string y by y_n. We can think of L itself as an infinite stream of random text, and each string in W^* as the prefix of an infinite number of possible streams in W^\omega. Hence, we let Y be a random stream (a random variable ranging over streams y \in W^\omega). Each word Y_n in Y is a random variable ranging over W.
Suppose that we have a probabilistic model of L. Our model may be a simple bigram or n-gram model or a more sophisticated one involving longer-distance correlations. In any case, we can practically assume that our model can be expressed as a family of efficiently computable conditional probability distributions
These distributions specify the likelihood of each word given its preceding context. A bigram model, for instance, would specify an initial probability vector
and a transition probability matrix
{Formally, we define a (probabilistic) model of L to be a probability distribution over W^\omega that assigns a probability to each prefix yadda yadda.}
Let P be any given language model. Our goal is to expand a string of m bits
into a string in W^*. In practical applications such as key fingerprinting, the length m is known (e.g., it is the output length of our cryptographic hash function) and x is a uniform random sample from W^m (e.g., we assume our hash function is good).
Conceptually, our mnemonic coding scheme simply feeds x to the decoder, instructing it to use P as the language model. The output from the decoder is then our expansion of x.
One complication arises because the decoder takes infinite bit streams as input, not finite bit strings. Beyond the m-th bit, we pad the input bit string with an infinite stream of zeroes. Any other padding would work just as well, as long as the same padding stream is appended to all length-m strings for each m.
Another complication arises because the decoder produces infinite text streams as output, not finite text strings. Given a length-m bit string x as input, we want to produce just enough output to ensure that we expand x differently from any other length-m bit string x' \ne x. Because of the way arithmetic coding works, we need only check that x is expanded differently from x+1 and x-1, where we treat each length-m bit string as a m-bit binary number. Therefore, to produce just enough output, we run three decoders in parallel on x, x-1, and x+1, and stop as soon as the three output text streams diverge.
We have implemented the mnemonic code described above in C++ code that is reasonably clean and modular but mostly templated and uncommented. In particular, while we have only implemented a simple n-gram language model, it is relatively easy to swap in alternative language models or to implement alternative data structures for storing n-gram frequency counts. It is also easy to reuse the code to perform tasks other than expansion.
Our implementation can be tested over the Web at http://www.digitas.harvard.edu/cgi-bin/ken/coding
The Web interface is a simple CGI script written in Perl that invokes compiled C++ code in turn. It allows the user to input bit strings of any length in hexadecimal, octal or binary format. The user specifies the (n-gram) language model to use by selecting:
A summary of our C++ source files:
So far, we have focused on arithmetic coding with an n-gram language model. The most straightforward method to create this n-gram language model is to estimate its probabilities from a corpus by maximum likelihood. This is the only method we have implemented. We did not implement any smoothing in our language model, except we back off to a lower-order context when the current n-gram context does not occur anywhere within the corpus.
Typically, the perplexity of an n-gram language model decreases as n increases. This is because additional context at the end of a text stream helps distinguish between its possible continuations. As a result, smaller n results in shorter and less natural expansions, while larger n results in longer and more natural expansions. We can thus strike different points of balance on the compactness-naturalness spectrum of expansion functions by adjusting n.
Playing with our implementation, we found that the subjectively "best" setting of n is somewhere between 2 and 3. Unfortunately, while we would like to make the compactness-naturalness tradeoff continuously, the parameter n is required to be an integer. An obvious idea that we have yet to try is to approximate non-integral values of n by replacing the "hard" backoff strategy we currently use with a "soft" backoff strategy, essentially using a weighted average across multiple context lengths. Chapter 6 of Text Compression describes several smoothing/backoff techniques, all of which can be plugged into our language model.
Besides context backoff, several other smoothing techniques are commonly applied to language modeling. These include smoothing across words using part-of-speech and other tagging, clustering, and general maximum entropy approaches. The effects of these techniques on mnemonic coding remain to be investigated. In any case, as a general statement, these techniques are of interest to us because they may:
On the second point above, consider the trigram model built from Alice in Wonderland, ignoring punctuation and case. The bigram "of her" occurs in the book 17 times, followed by the words:
Subjectively, it seems to us that:
Mnemonic coding is intended to make bit strings easier to transmit (and, in potential applications other than key fingerprinting, to memorize). In general, maximum likelihood models tend to overfit data. In this example, overfitting results in a language model generating text that would be found significantly easier to transmit or remember than text generated from a smoothed model only by a devoted fan of Alice in Wonderland who has memorized the corpus. More generally, we expect a smoothed model to have higher perplexity yet comparable naturalness. Hence, by smoothing a parameterized family of models on the compactness-naturalness spectrum, we expect to produce more natural text with comparable length.
Another idea towards increasing perplexity is to use corpus with higher perplexity, such as Alice in Wonderland or Gilbert and Sullivan lyrics {the latter suggested to me by Dylan Thurston} rather than Sherlock Holmes. Using corpus with high perplexity may be advantageous not just because the resulting language model may have high perplexity as well, but also because the resulting expansions may acquire a poetic or fantastic flavor that fits better with text generated from n-gram models in general.
A more principled approach to all these attempts at improving our expansion function / language model would be to start by defining (quantifying) what it means for an expansion function / language model to be "better". Two criteria introduced above are compactness and naturalness.
We can quantified compactness as the entropy of the language model. Here, by "entropy" we mean the information rate of text streams generated by the model, i.e., E_P(-\log P), not the description length required to encode the model itself. As for naturalness: Let P_0 denote Bob and Alice's model of L; in other words, the probability distribution P_0 measures how much Bob and Alice expect to encounter (and are optimized to process) each given text stream in L. We can quantify naturalness as the negative cross-entropy E_P(\log P_0).
The general optimization problem we want to solve (exactly or approximately) is: Suppose that we have a language model P_0, in which we did our best estimating Bob and Alice's model of L, possibly using smoothing techniques discussed above. Our mnemonic coding scheme is to perform arithmetic decoding using some language model P, which may or may not be equal to P_0, because we want to choose P to maximize the objective function
where \lambda is a non-negative balance parameter set by the user, continuously. {I haven't quite worked out how the fact that streams are infinite affects the formal math here.}
{Digression: Note that we *really* want to optimized based on the expected number of output words per input bit, not the expected number of input bits per output word as formulated above. I don't think it matters, but I have yet to prove it.}
To maximize Z, we take the partial derivative of Z with respect to each element of the transition probability matrix of P. The outgoing probabilities in each context are constrained to add up to 1. Hence (by the Lagrangian multipler), the partial derivatives of Z with respect to the outgoing probabilities in each context must be equal.
To a zeroth approximation, we can assume that only the steady state of the Markov process matters, and ignore (i.e., approximate by zero) the partial derivatives of the steady state probability vector with respect to the transition probability matrix. The result is that we can simply take the conditional probability distribution at each context in P_0 to a fractional power between 0 and 1, and renormalize, to get a new set of conditional probability distributions for P.
To a first approximation, it may well work out if we run a sort of EM algorithm to optimize P iteratively.
To be more exact than that, it would be nice if we could take the partial derivatives of the stable state distribution of a Markov model with respect to the transition probabilities of said model. Unfortunately, I was told that this is hard. I also faintly recall being told previously that even computing the exact entropy of a given Markov model is hard.
{Related work: Some guy at IBM, who I can look up and whose paper I have with me, trained speech HMMs using EM in conjunction with a perplexity-minimizing prior. Among other things, he has a program for determining the words most likely to be confused with any given English word.}
It is easier to read punctuated (and capitalized) text than unpunctuated text. Therefore, we would like expansion to produce punctuated text.
If we were to generate punctuated text by performing decoding with a language model that directly incorporates punctuation, then some bit strings would be expanded into text strings that differ only in punctuation. In other words, the effective bandwidth of the mnemonic coding channel is reduced. Alice, listening to Bob's voice over the phone, may not be able to distinguish between the two text strings. This means that:
Therefore, we should first generate unpunctuated text from the input bit string, then punctuate the expansion for the sole purpose of helping Bob pronounce it. We have yet to implement this postprocessing step of restoring punctuation. To this end, there is relevant existing work in restoring punctuation as well as accents to text.
An interesting side question is: How well can a machine restore punctuation, anyway? An ideal punctuation restoration algorithm would make use of all information available in unpunctuated input text; however, there may be genuine ambiguities that even a perfect system would not be able to resolve. A practical upper bound on the quality of automatic punctuation restoration is the quality of manual punctuation restoration. To determine the latter, we stripped punctuation marks from some Wall Street Journal text and implemented a Web-based system where we challenge our friends to restore them. To gain access to this challenge and try it out yourself, email ken@digitas.harvard.edu. {The WSJ text tends to be literary and dense; we should try something like the Brown corpus as well to see how it compares.}
But wait-- Punctuation is not the only thing that Alice may have trouble distinguishing over the phone! In English, "to", "two" and "too" sound the same, as do "cat's cares" and "cat scares". In principle, the expansion process should:
Each of the two steps should make use of a separate language model.
{Some more thoughts on the security of comparing key fingerprints using mnemonic coding:}
Obviously, for every way we come up with to encode data, there's always a hypothetical English accent/dialect where two different pieces of data encodes to the same string using said way. Perhaps we can randomize, i.e., find a large set S whose elements are encoding schemes, such that
(I think this is an appropriate definition for "security against malicious adversary"), even though it is not true that
and it may also be not true that
I'm not sure how to rigorously define "English accent/dialect" though. Obviously, "any mapping from text to speech" is not restrictive enough, because then I can just map everything to silence.
Recently I integrated the CMU-Cambridge Statistical Language Modeling Toolkit (SLM) into my C++ implementation. Said toolkit takes care of smoothing n-gram models in various standard ways, performing probability table lookups quickly, etc. I added command-line options to my program to switch between using my language modeling code and the Toolkit's. The arithmetic (de)coding part of my program works fine with either.
I also implemented a simple "knob" that takes all conditional probability values to a given power (then renormalizes). This knob doesn't seem to work very well -- it doesn't shorten the text by very much even when the exponent is zero.
I used default Good-Turing smoothing options in SLM to train a bunch of larger models from the WSJ text in the North American News Text Corpus. Short summary of results: Larger corpus => Better language model => Quality output.
Big questions:
--KenShan