Markov Chains
Basics
A Markov chain is a random process that changes from one state to another state with a probability that is determined entirely by the current state.
A simple example would be if the odds of it raining tomorrow were determined only by the weather today, as in the following table (called the transition matrix):
| Today's weather | Chance of rain tomorrow | Chance of sun tomorrow |
|-----------------|-------------------------|------------------------|
| rain            | 80%                     | 20%                    |
| sun             | 10%                     | 90%                    |
If we initially set a state of 'rain', then we could use a random number generator to produce a prediction for one week's worth of weather:
rain > rain > rain > rain > sun > sun > sun
Of course, we would likely get a different sequence of events if we try to make a second prediction:
rain > rain > rain > sun > sun > rain > rain
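Forecasts like the two above are easy to produce in a few lines of code. Here is a minimal sketch (Python for illustration; the `TRANSITIONS` table encodes the matrix above, and the function names are mine) that walks the chain one day at a time:

```python
import random

# Transition matrix from the table above: P(tomorrow's weather | today's).
TRANSITIONS = {
    'rain': {'rain': 0.8, 'sun': 0.2},
    'sun':  {'rain': 0.1, 'sun': 0.9},
}

def forecast(start, days, rng=random):
    """Walk the chain for `days` days, starting from the given state."""
    states = [start]
    for _ in range(days - 1):
        row = TRANSITIONS[states[-1]]
        # Pick the next state with the probabilities in the current row.
        r = rng.random()
        cumulative = 0.0
        for state, p in row.items():
            cumulative += p
            if r < cumulative:
                states.append(state)
                break
    return states

print(' > '.join(forecast('rain', 7)))
```

Each run draws fresh random numbers, so repeated calls give different sequences, just like the two predictions above.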
This obviously isn't a great system for extended weather forecasts. But if we had a more elaborate matrix, we could perhaps use it for something else.
One possibility is to build a matrix from many observations, and then use that to generate potential observations that are hopefully similar. For example, if we treat the letters of a name as a sequence of states, we can generate a transition matrix from a list of names, and then make up new names.
Using US census data from the past hundred years or so (full of names like "John", "William", and "Mary") to build a transition matrix results in generated names like this:
 Orex
 Mak
 Celis
 Mayes
 Leen
 Parmoses
Close enough for government work. But I left out an important part: while each letter in a name is considered a separate state, to generate such (relatively) good names the transition matrix must track two states at a time. So while a "normal" Markov chain for names might have a transition matrix like this:
| current | Odds of 'a' | Odds of 'b' | Odds of 'c' | Odds of 'd' | ... |
|---------|-------------|-------------|-------------|-------------|-----|
| 'A'     | 3.01%       | 1.02%       | 3.46%       | 9.22%       | ... |
| 'B'     | 4.21%       | 0.00%       | 0.00%       | 0.00%       | ... |
| 'C'     | 2.91%       | 0.00%       | 0.00%       | 0.00%       | ... |
| ...     | ...         | ...         | ...         | ...         | ... |
The matrix used to generate the names above would look more like this:
| current | Odds of 'a' | Odds of 'b' | Odds of 'c' | Odds of 'd' | ... |
|---------|-------------|-------------|-------------|-------------|-----|
| 'Aa'    | 0.01%       | 2.01%       | 0.46%       | 2.22%       | ... |
| 'Ab'    | 4.21%       | 3.65%       | 0.02%       | 0.87%       | ... |
| 'Ac'    | 1.91%       | 0.03%       | 0.73%       | 0.02%       | ... |
| ...     | ...         | ...         | ...         | ...         | ... |
This is called a Markov chain of order 2, since it tracks two states at a time.
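To make the order-2 idea concrete, here is a rough sketch (Python for illustration, not the module's actual code; the function names and the sentinel characters marking the start and end of a name are my own choices) of building a transition matrix from a list of names and sampling new names from it:

```python
import random
from collections import defaultdict

START, END = '\x02', '\x03'  # sentinel markers for name boundaries

def build_matrix(names, order=2):
    """Count transitions from each `order`-letter state to the next letter."""
    counts = defaultdict(lambda: defaultdict(int))
    for name in names:
        seq = START * order + name + END
        for i in range(len(seq) - order):
            state = seq[i:i + order]
            counts[state][seq[i + order]] += 1
    return counts

def generate(counts, order=2, rng=random):
    """Walk the chain from the start marker until the end marker is drawn."""
    state = START * order
    out = []
    while True:
        options = counts[state]
        r = rng.random() * sum(options.values())
        for ch, n in options.items():
            r -= n
            if r < 0:
                break
        if ch == END:
            return ''.join(out)
        out.append(ch)
        state = state[1:] + ch  # slide the two-letter window forward

matrix = build_matrix(['John', 'William', 'Mary', 'James', 'Margaret'])
print(generate(matrix))
```

With a real census-sized list instead of five names, this is essentially the setup that produced "Orex" and friends above.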
String::Markov
Since playing with Markov chains is fun, I wrote a CPAN module to make it easy to create finite, text-oriented Markov chains in Perl: String::Markov.
The finite part means that it is meant for sequences with a defined beginning and end (it would not work well with the weather example), and the text-oriented part means that it's designed to work on Unicode strings (like names). There are some fancy bits in the module (like Unicode normal forms), but it is simple enough to have 100% test coverage without too much effort.
The latest updates have been to make the code faster. The original implementation used a hash reference to store the transition matrix, and iterated through the hash with each(). Iterating through hashes is slow. Simply replacing the hash with two arrays and iterating through one of them to find the correct index for the other resulted in a big boost: slow test cases ran over twice as fast, while already fast cases got at least 20% faster.
Testing on my cellphone showed the new version ran up to 168% faster.
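The hash-to-arrays change can be sketched like this (Python standing in for the Perl internals; the data and names are illustrative, not the module's actual structures). Instead of walking key/value pairs of a dict, we keep states and weights in two parallel lists and scan one to find an index into the other:

```python
import random

# Before: a transition row stored as a hash/dict, sampled by iterating pairs.
row_dict = {'a': 3, 'b': 1, 'c': 6}

# After: the same row as two parallel arrays.
states  = ['a', 'b', 'c']
weights = [3, 1, 6]

def sample_row(states, weights, rng=random):
    """Scan the weights to find an index, then use it to pick the state."""
    r = rng.random() * sum(weights)
    for i, w in enumerate(weights):
        r -= w
        if r < 0:
            return states[i]
    return states[-1]  # guard against floating-point edge cases

print(sample_row(states, weights))
```

The scan is still linear, but array traversal is much cheaper than hash iteration.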
Transforming the probabilities into a cumulative distribution suitable for binary search had mixed results, and was always a net loss if the appropriate XS module was unavailable. Since the main win for this approach was in unusual situations (giant state spaces), I decided to leave out the extra complexity (and the extra dependency).
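The cumulative-distribution variant would look something like this (again an illustrative Python sketch, not the module's code): precompute running totals once per row, then answer each draw with a binary search.

```python
import bisect
import itertools
import random

states  = ['a', 'b', 'c']
weights = [3, 1, 6]

# Precompute the cumulative distribution once per row...
cumulative = list(itertools.accumulate(weights))  # [3, 4, 10]

def sample_cdf(states, cumulative, rng=random):
    """...then each draw is an O(log n) binary search, not a linear scan."""
    r = rng.random() * cumulative[-1]
    return states[bisect.bisect_right(cumulative, r)]

print(sample_cdf(states, cumulative))
```

For the small per-state alphabets typical of names, the precomputation and extra indirection can easily eat the win, which matches the mixed results described above.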
Markov Server
To show off what the module could do, I put together a small PSGI app to generate random names: http://markov.gmathews.com/names. Of course, once I started, I couldn't stop myself from adding more features and sources, eventually ending up with the Markov Server, which supports many different chains.
The multitude of chain types is not without cost, and the single-core VPS I'm using is not resource rich. To mitigate this problem, I run all the chains "simultaneously" as async blocks with Coro. Each chain writes to a dedicated Coro::Channel, so that data is ready before a request is made. Hopefully, the VPS has enough time between servicing requests to refill the channels with data and keep response times low. This soaks up some RAM, but modern machines (even tiny VMs) have plenty of space. The much bigger downside to this approach is that specific results can't be recalled, since all the data that users see is pre-generated and must wait in the queue.
With a faster VPS, I could skip the queued approach and provide each result set with a "permalink" that stores the state of the PRNG in the query string. Then results could be regenerated by seeding the PRNG appropriately, and a full page of lines could be "saved" in 48 bits or less.
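The permalink idea can be sketched as follows (hypothetical names, with a trivial stand-in generator; the point is only that the seed alone reproduces a whole batch):

```python
import random

LETTERS = 'abcdefghijklmnopqrstuvwxyz'

def gen_name(rng):
    # Stand-in for a real chain-based generator; it only draws from
    # the passed-in rng, so its output is a pure function of the seed.
    return ''.join(rng.choice(LETTERS) for _ in range(5)).capitalize()

def results_for(seed, count=5):
    """Seeding the PRNG makes the whole batch reproducible from the seed."""
    rng = random.Random(seed)
    return [gen_name(rng) for _ in range(count)]

# The same seed always regenerates the same page of results.
assert results_for(0xBEEF) == results_for(0xBEEF)
```

A 48-bit seed is a short hex string in a query parameter, which is what makes the "permalink" cheap to store and share.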
[UPDATE]
The latest version of the Markov Server allows for saving results with a simple link, but still allows pre-generated results from a queue to fulfill most requests. This was accomplished with five changes:

1. Instead of generating results line-by-line, results are generated in full batches. The per-chain queues now operate atomically on batches of lines, and cannot send sequential lines to different coroutines.
2. The String::Markov module now generates stable chains, so that the memory layout of the hash used to store the state vectors does not affect the order in which states are tested against the chosen threshold. Recent Perl versions randomize this hash, which causes the same internal state and the same seed to produce different results on different runs. Fixing this was essential for allowing results to be recalled across server restarts.
3. A queue of just random seeds is used when generating batches of lines, so that each batch is generated from a known seed.
4. The seed for a batch is used to generate the permalink. If a request comes in with a specific seed value, then that value is used to generate results instead of pulling results from a queue.
5. LRU caches were added for results pulled out of a queue and for results specified by seed value. A hit in the former cache moves data to the latter cache, creating something resembling a two-level cache system.
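The stable-chain fix in change 2 comes down to iterating states in a deterministic order before applying the threshold. An illustrative Python sketch (String::Markov itself is Perl, and these names are mine): the same seed picks the same state no matter how the row's hash happens to be laid out in memory.

```python
import random

def pick_next(transitions, rng):
    """Iterate states in sorted order so the result does not depend on
    hash/dict memory layout; with a fixed seed, the choice is stable."""
    total = sum(transitions.values())
    r = rng.random() * total
    for state in sorted(transitions):
        r -= transitions[state]
        if r < 0:
            return state
    return max(transitions)  # float-edge guard; sorted, so stable too

row_a = {'x': 2, 'y': 5, 'z': 3}
row_b = {'z': 3, 'x': 2, 'y': 5}  # same row, different insertion order

# Same seed + same row contents => same result, regardless of ordering.
assert pick_next(row_a, random.Random(42)) == pick_next(row_b, random.Random(42))
```

Without the sort, an order-randomized hash could subtract the weights in a different order on each run, sending the same random draw to a different state.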