MnemonicCoding

ProperTreatment | RecentChanges | Preferences
Edit this page | View other revisions | Last edited 2006-02-02 (changes)

Changes (from prior major revision) (no other diffs)

Changed: 1c1,449
2041e03cf27ddc605b76a9d6dfecba05
An ongoing project; joint work with / help from Stuart Shieber and Carl Gunter. --KenShan

Introduction




In many cryptographic applications, two humans each have a document
and need to check that they are identical. For example, Alice may
have sent her public key to Bob over insecure email; Bob then needs to
confirm that he has received Alice's public key rather than some other
public key that an adversary injected into the network. To do this,
Bob typically contacts Alice in person or over the phone, chats a bit
to make sure that it is really Alice who he is talking to, then reads
off to Alice, in hexadecimal, some cryptographic hash of the public
key he received. Alice computes the same cryptographic hash of her
public key and checks that the result is equal to what she hears.
(The cryptographic hash of the key is known as its fingerprint.)

Typical public keys are between 1000 and 2000 bits in length; typical
key fingerprints are between 100 and 200 bits in length, or a few
dozen hexadecimal digits.

Assume that the adversary cannot fool Bob during his chat with Alice,
and that the cryptographic hash function used by Alice and Bob are
identical and secure. The key fingerprinting scheme described above
still has disadvantages, because humans are not used to exchanging
hexadecimal strings in speech. Some digits sound similar (for example
"B", "D", and "E" in English), and others are cumbersome to pronounce
in sequence. As a result, people exchanging fingerprints over the
phone often repeat themselves ("that is a B, not a D"), use a pilot
alphabet ("alpha", "beta", "Charlie"), or both.

Pilot alphabets are based on the idea that people can more easily
transmit natural language words over speech than alphanumeric strings.
We can describe the more general concept of mnemonic codes as
follows. To send an alphanumeric string to Alice, Bob expands the
string into an utterance that may be longer but also more natural.
Bob speaks this utterance to Alice, who then contracts what she
heard to recover the intended string. In applications such as key
fingerprinting, where Alice and Bob are confirming rather than
transmitting the string, Alice may simply check that what she heard is
a/the valid expansion of the expected string, without performing any
contraction; this comparison task is simpler than transmission, for
the obvious reason that comparison can be reduced to transmission.

Pilot alphabets are a special case of mnemonic codes where expansion
maps each character into a word beginning with it. We can also view
reading off hexadecimal digits as a special case of mnemonic codes
where expansion maps each character into its standalone spoken form.
Both examples expand each character separately, but that is not
necessary. We can imagine mnemonic codes in which each pair of digits
is expanded together, or entire strings are assigned expansions all at
once.

The rest of this document describes and discusses a mnemonic code
based on arithmetic coding. {It would be more appropriate to start by
giving a formal definition of mnemonic coding and defining a utility
function to pin down what a good mnemonic code is, but we are not
quite that formal yet.}

A mnemonic code based on arithmetic coding




Arithmetic coding is a pair of algorithms that encodes text streams
into bit streams and decodes bit streams into text streams. Both
encoding and decoding:

* rely on the implementor to provide a language model, i.e., a probability distribution over text streams;

* run online, i.e., produce partial output as input continues to arrive; and

* use constant time per input token, and constant space.

In the ideal situation, where the language model provided perfectly
matches the actual distribution of incoming text, the encoder produces
a uniformly distributed stream of bits with the same asymptotic
entropy as the input text, thereby making optimal use of the output
bandwidth. Conversely, when the decoder is fed a uniformly
distributed stream of bits, it produces an output text stream with the
same asymptotic entropy as the input bits, and distributed according
to the language model. The decoder can thus be thought of as an
optimal sampling algorithm that uses the fewest random bits to produce
the most text distributed according to any given model. Our basic
idea is to use this decoder as the expansion algorithm in a mnemonic
code. For more details on arithmetic coding, see Chapter 5 of the
book Text Compression by Timothy Bell, John Cleary, and Ian Witten.

Assume that Alice and Bob both speak some natural language L. Denote
by W the set of words in L; we then have the set of (finite)
strings W^* and the set of (infinite) streams W^\omega. Denote
the n-th word in a stream or string y by y_n. We can think of L
itself as an infinite stream of random text, and each string in W^* as
the prefix of an infinite number of possible streams in W^\omega.
Hence, we let Y be a random stream (a random variable ranging over
streams y \in W^\omega). Each word Y_n in Y is a random variable
ranging over W.

Suppose that we have a probabilistic model of L. Our model may be a
simple bigram or n-gram model or a more sophisticated one involving
longer-distance correlations. In any case, we can practically assume
that our model can be expressed as a family of efficiently computable
conditional probability distributions

: P(Y_1),
: P(Y_2 | Y_1),
: P(Y_3 | Y_1, Y_2),
: P(Y_4 | Y_1, Y_2, Y_3),
: ....

These distributions specify the likelihood of each word given its
preceding context. A bigram model, for instance, would specify an
initial probability vector

: P(Y_1 = w) = A_w

and a transition probability matrix

: P(Y_{i+1} = w' | Y_i = w, Y_1, Y_2, ..., Y_{i-1}) = B_{w'|w}.

{Formally, we define a (probabilistic) model of L to be a
probability distribution over W^\omega that assigns a probability to
each prefix yadda yadda.}

Let P be any given language model. Our goal is to expand a string of
m bits

: x = x_1 ... x_m

into a string in W^*. In practical applications such as key
fingerprinting, the length m is known (e.g., it is the output length
of our cryptographic hash function) and x is a uniform random sample
from W^m (e.g., we assume our hash function is good).

Conceptually, our mnemonic coding scheme simply feeds x to the
decoder, instructing it to use P as the language model. The output
from the decoder is then our expansion of x.

One complication arises because the decoder takes infinite bit streams
as input, not finite bit strings. Beyond the m-th bit, we pad the
input bit string with an infinite stream of zeroes. Any other padding
would work just as well, as long as the same padding stream is
appended to all length-m strings for each m.

Another complication arises because the decoder produces infinite text
streams as output, not finite text strings. Given a length-m bit
string x as input, we want to produce just enough output to ensure
that we expand x differently from any other length-m bit string x' \ne
x. Because of the way arithmetic coding works, we need only check
that x is expanded differently from x+1 and x-1, where we treat each
length-m bit string as a m-bit binary number. Therefore, to produce
just enough output, we run three decoders in parallel on x, x-1, and
x+1, and stop as soon as the three output text streams diverge.

Implementation




We have implemented the mnemonic code described above in C++ code that
is reasonably clean and modular but mostly templated and uncommented.
In particular, while we have only implemented a simple n-gram language
model, it is relatively easy to swap in alternative language models or
to implement alternative data structures for storing n-gram frequency
counts. It is also easy to reuse the code to perform tasks other than
expansion.

Our implementation can be tested over the Web at
http://www.digitas.harvard.edu/cgi-bin/ken/coding

The Web interface is a simple CGI script written in Perl that invokes
compiled C++ code in turn. It allows the user to input bit strings of
any length in hexadecimal, octal or binary format. The user specifies
the (n-gram) language model to use by selecting:

* the number of words to use for context, i.e., n-1 for an n-gram language model;

* the corpus to use for estimating probability distributions by maximum likelihood;

* whether punctuation in the corpus is to be ignored, considered as part of the preceding word, or considered as separate words;

* whether to ignore case in the corpus.

A summary of our C++ source files:

* coding.cpp -- the main driver code.

* freq.h -- frequency tables: a simple, array-based implementation (array_freq_table) and a more sophisticated, binary-tree-based implementation (tree_freq_table); both are described in Chapter 5 of Text Compression.

* model.h -- language models: a fixed unigram model (fixed_model), an adaptive unigram model (adaptive_model), and a fixed n-gram model (ngram_context and context_model); all three are described in Chapters 5 and 6 of Text Compression.

* coder.h -- arithmetic coding: a templated implementation of both encoding and decoding: can be instantiated for any language model type.

* bits.h -- utilities for manipulating bit strings: parsing bit string input in bases 2, 8, and 16; incrementing and decrementing bit strings viewed as integers.

* copy.h -- the utility function copy_until_distinct: compares and copies three input streams to one output stream.

* empty.h -- the empty class: useful for instantiating templates.

* hash.h -- a header file to encapsulate away whether we use map<> in standard C++ or hash_map<> in SGI STL.

* symbol.h -- symbol tables: instead of schlepping C++ string objects around, we use a two-way table to map words to integers on input and integers to words on output.

* coding.h -- typedefs to instantiate all our templates with reasonable parameters.

The compactness-naturalness tradeoff




So far, we have focused on arithmetic coding with an n-gram language
model. The most straightforward method to create this n-gram language
model is to estimate its probabilities from a corpus by maximum
likelihood. This is the only method we have implemented. We did not
implement any smoothing in our language model, except we back off to a
lower-order context when the current n-gram context does not occur
anywhere within the corpus.

Typically, the perplexity of an n-gram language model decreases as n
increases. This is because additional context at the end of a text
stream helps distinguish between its possible continuations. As a
result, smaller n results in shorter and less natural expansions,
while larger n results in longer and more natural expansions. We can
thus strike different points of balance on the compactness-naturalness
spectrum of expansion functions by adjusting n.

Playing with our implementation, we found that the subjectively "best"
setting of n is somewhere between 2 and 3. Unfortunately, while we
would like to make the compactness-naturalness tradeoff continuously,
the parameter n is required to be an integer. An obvious idea that we
have yet to try is to approximate non-integral values of n by
replacing the "hard" backoff strategy we currently use with a "soft"
backoff strategy, essentially using a weighted average across multiple
context lengths. Chapter 6 of Text Compression describes several
smoothing/backoff techniques, all of which can be plugged into our
language model.

Besides context backoff, several other smoothing techniques are
commonly applied to language modeling. These include smoothing across
words using part-of-speech and other tagging, clustering, and general
maximum entropy approaches. The effects of these techniques on
mnemonic coding remain to be investigated. In any case, as a general
statement, these techniques are of interest to us because they may:

* introduce additional parameters for the model, thereby letting us make the compactness-naturalness tradeoff continuously as discussed above; as well as

* increase the perplexity of the model, thereby generating shorter expansions, while maintaining subjective naturalness.

On the second point above, consider the trigram model built from
Alice in Wonderland, ignoring punctuation and case. The bigram "of
her" occurs in the book 17 times, followed by the words:

* age
* childhood
* ever
* favourite
* going
* head (2 times)
* hedgehog
* knowledge
* little
* or
* own
* sharp
* sister
* skirt
* voice (2 times)

Subjectively, it seems to us that:

* A large number of nouns, adjectives, and even gerunds following "of her" would be reasonably natural-sounding, yet a decoder using this language model would not be allowed to discharge entropy by choosing any word other than the 15 listed above.

* The trigram "of her voice" is not much more natural-sounding than the trigram "of her age", even though the former appears a whopping twice as many times as the latter.

Mnemonic coding is intended to make bit strings easier to transmit
(and, in potential applications other than key fingerprinting, to
memorize). In general, maximum likelihood models tend to overfit
data. In this example, overfitting results in a language model
generating text that would be found significantly easier to transmit
or remember than text generated from a smoothed model only by a
devoted fan of Alice in Wonderland who has memorized the corpus.
More generally, we expect a smoothed model to have higher perplexity
yet comparable naturalness. Hence, by smoothing a parameterized
family of models on the compactness-naturalness spectrum, we expect to
produce more natural text with comparable length.

Another idea towards increasing perplexity is to use corpus with
higher perplexity, such as Alice in Wonderland or Gilbert and
Sullivan lyrics {the latter suggested to me by Dylan Thurston} rather
than Sherlock Holmes. Using corpus with high perplexity may be
advantageous not just because the resulting language model may have
high perplexity as well, but also because the resulting expansions may
acquire a poetic or fantastic flavor that fits better with text
generated from n-gram models in general.

A more principled approach to all these attempts at improving our
expansion function / language model would be to start by defining
(quantifying) what it means for an expansion function / language model
to be "better". Two criteria introduced above are compactness and
naturalness.

We can quantified compactness as the entropy of the language model.
Here, by "entropy" we mean the information rate of text streams
generated by the model, i.e., E_P(-\log P), not the description length
required to encode the model itself. As for naturalness: Let P_0
denote Bob and Alice's model of L; in other words, the probability
distribution P_0 measures how much Bob and Alice expect to encounter
(and are optimized to process) each given text stream in L. We can
quantify naturalness as the negative cross-entropy E_P(\log P_0).

The general optimization problem we want to solve (exactly or
approximately) is: Suppose that we have a language model P_0, in which
we did our best estimating Bob and Alice's model of L, possibly using
smoothing techniques discussed above. Our mnemonic coding scheme is
to perform arithmetic decoding using some language model P, which may
or may not be equal to P_0, because we want to choose P to maximize
the objective function

: Z ("log posterior plus normalizing constant")
: = E_P(\log P_0 - \lambda \log P),

where \lambda is a non-negative balance parameter set by the user,
continuously. {I haven't quite worked out how the fact that streams
are infinite affects the formal math here.}

{Digression: Note that we *really* want to optimized based on the
expected number of output words per input bit, not the expected number
of input bits per output word as formulated above. I don't think it
matters, but I have yet to prove it.}

To maximize Z, we take the partial derivative of Z with respect to
each element of the transition probability matrix of P. The outgoing
probabilities in each context are constrained to add up to 1. Hence
(by the Lagrangian multipler), the partial derivatives of Z with
respect to the outgoing probabilities in each context must be equal.

To a zeroth approximation, we can assume that only the steady state of
the Markov process matters, and ignore (i.e., approximate by zero) the
partial derivatives of the steady state probability vector with
respect to the transition probability matrix. The result is that we
can simply take the conditional probability distribution at each
context in P_0 to a fractional power between 0 and 1, and renormalize,
to get a new set of conditional probability distributions for P.

To a first approximation, it may well work out if we run a sort of EM
algorithm to optimize P iteratively.

To be more exact than that, it would be nice if we could take the
partial derivatives of the stable state distribution of a Markov model
with respect to the transition probabilities of said model.
Unfortunately, I was told that this is hard. I also faintly recall
being told previously that even computing the exact entropy of a given
Markov model is hard.

{Related work: Some guy at IBM, who I can look up and whose paper I
have with me, trained speech HMMs using EM in conjunction with a
perplexity-minimizing prior. Among other things, he has a program for
determining the words most likely to be confused with any given
English word.}

Punctuation




It is easier to read punctuated (and capitalized) text than
unpunctuated text. Therefore, we would like expansion to produce
punctuated text.

If we were to generate punctuated text by performing decoding with a
language model that directly incorporates punctuation, then some bit
strings would be expanded into text strings that differ only in
punctuation. In other words, the effective bandwidth of the mnemonic
coding channel is reduced. Alice, listening to Bob's voice over the
phone, may not be able to distinguish between the two text strings.
This means that:

* If we are using mnemonic coding to compare bit strings such as key fingerprints, then the security of the procedure is weakened.

* If we are using mnemonic coding to transmit rather than compare bit strings, then Alice would not know what punctuation to type.

Therefore, we should first generate unpunctuated text from the input
bit string, then punctuate the expansion for the sole purpose of
helping Bob pronounce it. We have yet to implement this
postprocessing step of restoring punctuation. To this end, there is
relevant existing work in restoring punctuation as well as accents to
text.

An interesting side question is: How well can a machine restore
punctuation, anyway? An ideal punctuation restoration algorithm would
make use of all information available in unpunctuated input text;
however, there may be genuine ambiguities that even a perfect system
would not be able to resolve. A practical upper bound on the quality
of automatic punctuation restoration is the quality of manual
punctuation restoration. To determine the latter, we stripped
punctuation marks from some Wall Street Journal text and implemented a
Web-based system where we challenge our friends to restore them. To
gain access to this challenge and try it out yourself, email
ken@digitas.harvard.edu. {The WSJ text tends to be literary and
dense; we should try something like the Brown corpus as well to see
how it compares.}

Security




But wait-- Punctuation is not the only thing that Alice may have
trouble distinguishing over the phone! In English, "to", "two" and
"too" sound the same, as do "cat's cares" and "cat scares". In
principle, the expansion process should:

* first expand the input bit string into an equivalence class of acoustically indistinguishable phoneme sequences; then

* render the phoneme sequence into text.

Each of the two steps should make use of a separate language model.

{Some more thoughts on the security of comparing key fingerprints
using mnemonic coding:}

Obviously, for every way we come up with to encode data, there's
always a hypothetical English accent/dialect where two different
pieces of data encodes to the same string using said way. Perhaps we
can randomize, i.e., find a large set S whose elements are encoding
schemes, such that

: for every English accent/dialect, and for every bit string B, if E is an encoding scheme randomly chosen from S, then -- with high probability -- E encodes B differently from E encodes any other bit string

(I think this is an appropriate definition for "security against
malicious adversary"), even though it is not true that

: if E is an encoding scheme randomly chosen from S, then -- with high probability -- for every English accent/dialect, E encodes different bit strings differently

and it may also be not true that

: for every English accent/dialect, if E is an encoding scheme randomly chosen from S, then -- with high probability -- E encodes different bit strings differently

I'm not sure how to rigorously define "English accent/dialect" though.
Obviously, "any mapping from text to speech" is not restrictive
enough, because then I can just map everything to silence.

Recent work




Recently I integrated the CMU-Cambridge Statistical Language Modeling Toolkit
(SLM) into my C++ implementation. Said toolkit takes care of smoothing n-gram
models in various standard ways, performing probability table lookups quickly,
etc. I added command-line options to my program to switch between using my
language modeling code and the Toolkit's. The arithmetic (de)coding part
of my program works fine with either.

I also implemented a simple "knob" that takes all conditional probability values
to a given power (then renormalizes). This knob doesn't seem to work very well --
it doesn't shorten the text by very much even when the exponent is zero.

I used default Good-Turing smoothing options in SLM to train a bunch of larger
models from the WSJ text in the North American News Text Corpus. Short summary
of results: Larger corpus => Better language model => Quality output.

Big questions:
* In general, to build a better language model, does one simply throw larger corpora at the Good-Turing estimate?
* Given that the Good-Turing estimate is pretty automatic (no parameters to set), how would we add a knob to the system?

--KenShan


An ongoing project; joint work with / help from Stuart Shieber and Carl Gunter. --KenShan

Introduction

In many cryptographic applications, two humans each have a document and need to check that they are identical. For example, Alice may have sent her public key to Bob over insecure email; Bob then needs to confirm that he has received Alice's public key rather than some other public key that an adversary injected into the network. To do this, Bob typically contacts Alice in person or over the phone, chats a bit to make sure that it is really Alice who he is talking to, then reads off to Alice, in hexadecimal, some cryptographic hash of the public key he received. Alice computes the same cryptographic hash of her public key and checks that the result is equal to what she hears. (The cryptographic hash of the key is known as its fingerprint.)

Typical public keys are between 1000 and 2000 bits in length; typical key fingerprints are between 100 and 200 bits in length, or a few dozen hexadecimal digits.

Assume that the adversary cannot fool Bob during his chat with Alice, and that the cryptographic hash function used by Alice and Bob are identical and secure. The key fingerprinting scheme described above still has disadvantages, because humans are not used to exchanging hexadecimal strings in speech. Some digits sound similar (for example "B", "D", and "E" in English), and others are cumbersome to pronounce in sequence. As a result, people exchanging fingerprints over the phone often repeat themselves ("that is a B, not a D"), use a pilot alphabet ("alpha", "beta", "Charlie"), or both.

Pilot alphabets are based on the idea that people can more easily transmit natural language words over speech than alphanumeric strings. We can describe the more general concept of mnemonic codes as follows. To send an alphanumeric string to Alice, Bob expands the string into an utterance that may be longer but also more natural. Bob speaks this utterance to Alice, who then contracts what she heard to recover the intended string. In applications such as key fingerprinting, where Alice and Bob are confirming rather than transmitting the string, Alice may simply check that what she heard is a/the valid expansion of the expected string, without performing any contraction; this comparison task is simpler than transmission, for the obvious reason that comparison can be reduced to transmission.

Pilot alphabets are a special case of mnemonic codes where expansion maps each character into a word beginning with it. We can also view reading off hexadecimal digits as a special case of mnemonic codes where expansion maps each character into its standalone spoken form. Both examples expand each character separately, but that is not necessary. We can imagine mnemonic codes in which each pair of digits is expanded together, or entire strings are assigned expansions all at once.

The rest of this document describes and discusses a mnemonic code based on arithmetic coding. {It would be more appropriate to start by giving a formal definition of mnemonic coding and defining a utility function to pin down what a good mnemonic code is, but we are not quite that formal yet.}

A mnemonic code based on arithmetic coding

Arithmetic coding is a pair of algorithms that encodes text streams into bit streams and decodes bit streams into text streams. Both encoding and decoding:

In the ideal situation, where the language model provided perfectly matches the actual distribution of incoming text, the encoder produces a uniformly distributed stream of bits with the same asymptotic entropy as the input text, thereby making optimal use of the output bandwidth. Conversely, when the decoder is fed a uniformly distributed stream of bits, it produces an output text stream with the same asymptotic entropy as the input bits, and distributed according to the language model. The decoder can thus be thought of as an optimal sampling algorithm that uses the fewest random bits to produce the most text distributed according to any given model. Our basic idea is to use this decoder as the expansion algorithm in a mnemonic code. For more details on arithmetic coding, see Chapter 5 of the book Text Compression by Timothy Bell, John Cleary, and Ian Witten.

Assume that Alice and Bob both speak some natural language L. Denote by W the set of words in L; we then have the set of (finite) strings W^* and the set of (infinite) streams W^\omega. Denote the n-th word in a stream or string y by y_n. We can think of L itself as an infinite stream of random text, and each string in W^* as the prefix of an infinite number of possible streams in W^\omega. Hence, we let Y be a random stream (a random variable ranging over streams y \in W^\omega). Each word Y_n in Y is a random variable ranging over W.

Suppose that we have a probabilistic model of L. Our model may be a simple bigram or n-gram model or a more sophisticated one involving longer-distance correlations. In any case, we can practically assume that our model can be expressed as a family of efficiently computable conditional probability distributions

P(Y_1),
P(Y_2 | Y_1),
P(Y_3 | Y_1, Y_2),
P(Y_4 | Y_1, Y_2, Y_3),
....

These distributions specify the likelihood of each word given its preceding context. A bigram model, for instance, would specify an initial probability vector

P(Y_1 = w) = A_w

and a transition probability matrix

P(Y_{i+1} = w' | Y_i = w, Y_1, Y_2, ..., Y_{i-1}) = B_{w'|w}.

{Formally, we define a (probabilistic) model of L to be a probability distribution over W^\omega that assigns a probability to each prefix yadda yadda.}

Let P be any given language model. Our goal is to expand a string of m bits

x = x_1 ... x_m

into a string in W^*. In practical applications such as key fingerprinting, the length m is known (e.g., it is the output length of our cryptographic hash function) and x is a uniform random sample from W^m (e.g., we assume our hash function is good).

Conceptually, our mnemonic coding scheme simply feeds x to the decoder, instructing it to use P as the language model. The output from the decoder is then our expansion of x.

One complication arises because the decoder takes infinite bit streams as input, not finite bit strings. Beyond the m-th bit, we pad the input bit string with an infinite stream of zeroes. Any other padding would work just as well, as long as the same padding stream is appended to all length-m strings for each m.

Another complication arises because the decoder produces infinite text streams as output, not finite text strings. Given a length-m bit string x as input, we want to produce just enough output to ensure that we expand x differently from any other length-m bit string x' \ne x. Because of the way arithmetic coding works, we need only check that x is expanded differently from x+1 and x-1, where we treat each length-m bit string as a m-bit binary number. Therefore, to produce just enough output, we run three decoders in parallel on x, x-1, and x+1, and stop as soon as the three output text streams diverge.

Implementation

We have implemented the mnemonic code described above in C++ code that is reasonably clean and modular but mostly templated and uncommented. In particular, while we have only implemented a simple n-gram language model, it is relatively easy to swap in alternative language models or to implement alternative data structures for storing n-gram frequency counts. It is also easy to reuse the code to perform tasks other than expansion.

Our implementation can be tested over the Web at http://www.digitas.harvard.edu/cgi-bin/ken/coding

The Web interface is a simple CGI script written in Perl that invokes compiled C++ code in turn. It allows the user to input bit strings of any length in hexadecimal, octal or binary format. The user specifies the (n-gram) language model to use by selecting:

A summary of our C++ source files:

The compactness-naturalness tradeoff

So far, we have focused on arithmetic coding with an n-gram language model. The most straightforward method to create this n-gram language model is to estimate its probabilities from a corpus by maximum likelihood. This is the only method we have implemented. We did not implement any smoothing in our language model, except we back off to a lower-order context when the current n-gram context does not occur anywhere within the corpus.

Typically, the perplexity of an n-gram language model decreases as n increases. This is because additional context at the end of a text stream helps distinguish between its possible continuations. As a result, smaller n results in shorter and less natural expansions, while larger n results in longer and more natural expansions. We can thus strike different points of balance on the compactness-naturalness spectrum of expansion functions by adjusting n.

Playing with our implementation, we found that the subjectively "best" setting of n is somewhere between 2 and 3. Unfortunately, while we would like to make the compactness-naturalness tradeoff continuously, the parameter n is required to be an integer. An obvious idea that we have yet to try is to approximate non-integral values of n by replacing the "hard" backoff strategy we currently use with a "soft" backoff strategy, essentially using a weighted average across multiple context lengths. Chapter 6 of Text Compression describes several smoothing/backoff techniques, all of which can be plugged into our language model.

Besides context backoff, several other smoothing techniques are commonly applied to language modeling. These include smoothing across words using part-of-speech and other tagging, clustering, and general maximum entropy approaches. The effects of these techniques on mnemonic coding remain to be investigated. In any case, as a general statement, these techniques are of interest to us because they may:

On the second point above, consider the trigram model built from Alice in Wonderland, ignoring punctuation and case. The bigram "of her" occurs in the book 17 times, followed by the words:

Subjectively, it seems to us that:

Mnemonic coding is intended to make bit strings easier to transmit (and, in potential applications other than key fingerprinting, to memorize). In general, maximum likelihood models tend to overfit data. In this example, overfitting results in a language model generating text that would be found significantly easier to transmit or remember than text generated from a smoothed model only by a devoted fan of Alice in Wonderland who has memorized the corpus. More generally, we expect a smoothed model to have higher perplexity yet comparable naturalness. Hence, by smoothing a parameterized family of models on the compactness-naturalness spectrum, we expect to produce more natural text with comparable length.

Another idea towards increasing perplexity is to use corpus with higher perplexity, such as Alice in Wonderland or Gilbert and Sullivan lyrics {the latter suggested to me by Dylan Thurston} rather than Sherlock Holmes. Using corpus with high perplexity may be advantageous not just because the resulting language model may have high perplexity as well, but also because the resulting expansions may acquire a poetic or fantastic flavor that fits better with text generated from n-gram models in general.

A more principled approach to all these attempts at improving our expansion function / language model would be to start by defining (quantifying) what it means for an expansion function / language model to be "better". Two criteria introduced above are compactness and naturalness.

We can quantified compactness as the entropy of the language model. Here, by "entropy" we mean the information rate of text streams generated by the model, i.e., E_P(-\log P), not the description length required to encode the model itself. As for naturalness: Let P_0 denote Bob and Alice's model of L; in other words, the probability distribution P_0 measures how much Bob and Alice expect to encounter (and are optimized to process) each given text stream in L. We can quantify naturalness as the negative cross-entropy E_P(\log P_0).

The general optimization problem we want to solve (exactly or approximately) is: Suppose that we have a language model P_0, in which we did our best estimating Bob and Alice's model of L, possibly using smoothing techniques discussed above. Our mnemonic coding scheme is to perform arithmetic decoding using some language model P, which may or may not be equal to P_0, because we want to choose P to maximize the objective function

Z ("log posterior plus normalizing constant")
= E_P(\log P_0 - \lambda \log P),

where \lambda is a non-negative balance parameter set by the user, continuously. {I haven't quite worked out how the fact that streams are infinite affects the formal math here.}

{Digression: Note that we *really* want to optimized based on the expected number of output words per input bit, not the expected number of input bits per output word as formulated above. I don't think it matters, but I have yet to prove it.}

To maximize Z, we take the partial derivative of Z with respect to each element of the transition probability matrix of P. The outgoing probabilities in each context are constrained to add up to 1. Hence (by the Lagrangian multipler), the partial derivatives of Z with respect to the outgoing probabilities in each context must be equal.

To a zeroth approximation, we can assume that only the steady state of the Markov process matters, and ignore (i.e., approximate by zero) the partial derivatives of the steady state probability vector with respect to the transition probability matrix. The result is that we can simply take the conditional probability distribution at each context in P_0 to a fractional power between 0 and 1, and renormalize, to get a new set of conditional probability distributions for P.

To a first approximation, it may well work out if we run a sort of EM algorithm to optimize P iteratively.

To be more exact than that, it would be nice if we could take the partial derivatives of the stable state distribution of a Markov model with respect to the transition probabilities of said model. Unfortunately, I was told that this is hard. I also faintly recall being told previously that even computing the exact entropy of a given Markov model is hard.

{Related work: Some guy at IBM, who I can look up and whose paper I have with me, trained speech HMMs using EM in conjunction with a perplexity-minimizing prior. Among other things, he has a program for determining the words most likely to be confused with any given English word.}

Punctuation

It is easier to read punctuated (and capitalized) text than unpunctuated text. Therefore, we would like expansion to produce punctuated text.

If we were to generate punctuated text by performing decoding with a language model that directly incorporates punctuation, then some bit strings would be expanded into text strings that differ only in punctuation. In other words, the effective bandwidth of the mnemonic coding channel is reduced. Alice, listening to Bob's voice over the phone, may not be able to distinguish between the two text strings. This means that:

Therefore, we should first generate unpunctuated text from the input bit string, then punctuate the expansion for the sole purpose of helping Bob pronounce it. We have yet to implement this postprocessing step of restoring punctuation. To this end, there is relevant existing work in restoring punctuation as well as accents to text.

An interesting side question is: How well can a machine restore punctuation, anyway? An ideal punctuation restoration algorithm would make use of all information available in unpunctuated input text; however, there may be genuine ambiguities that even a perfect system would not be able to resolve. A practical upper bound on the quality of automatic punctuation restoration is the quality of manual punctuation restoration. To determine the latter, we stripped punctuation marks from some Wall Street Journal text and implemented a Web-based system where we challenge our friends to restore them. To gain access to this challenge and try it out yourself, email ken@digitas.harvard.edu. {The WSJ text tends to be literary and dense; we should try something like the Brown corpus as well to see how it compares.}

Security

But wait-- Punctuation is not the only thing that Alice may have trouble distinguishing over the phone! In English, "to", "two" and "too" sound the same, as do "cat's cares" and "cat scares". In principle, the expansion process should:

Each of the two steps should make use of a separate language model.

{Some more thoughts on the security of comparing key fingerprints using mnemonic coding:}

Obviously, for every way we come up with to encode data, there's always a hypothetical English accent/dialect where two different pieces of data encodes to the same string using said way. Perhaps we can randomize, i.e., find a large set S whose elements are encoding schemes, such that

for every English accent/dialect, and for every bit string B, if E is an encoding scheme randomly chosen from S, then -- with high probability -- E encodes B differently from E encodes any other bit string

(I think this is an appropriate definition for "security against malicious adversary"), even though it is not true that

if E is an encoding scheme randomly chosen from S, then -- with high probability -- for every English accent/dialect, E encodes different bit strings differently

and it may also be not true that

for every English accent/dialect, if E is an encoding scheme randomly chosen from S, then -- with high probability -- E encodes different bit strings differently

I'm not sure how to rigorously define "English accent/dialect" though. Obviously, "any mapping from text to speech" is not restrictive enough, because then I can just map everything to silence.

Recent work

Recently I integrated the CMU-Cambridge Statistical Language Modeling Toolkit (SLM) into my C++ implementation. Said toolkit takes care of smoothing n-gram models in various standard ways, performing probability table lookups quickly, etc. I added command-line options to my program to switch between using my language modeling code and the Toolkit's. The arithmetic (de)coding part of my program works fine with either.

I also implemented a simple "knob" that takes all conditional probability values to a given power (then renormalizes). This knob doesn't seem to work very well -- it doesn't shorten the text by very much even when the exponent is zero.

I used default Good-Turing smoothing options in SLM to train a bunch of larger models from the WSJ text in the North American News Text Corpus. Short summary of results: Larger corpus => Better language model => Quality output.

Big questions:

--KenShan


ProperTreatment | RecentChanges | Preferences
Edit this page | View other revisions | Last edited 2006-02-02 (changes)