
Friday, March 21, 2008

Surprise and Coincidence

Some years ago, I wrote a simple paper, Accurate Methods for the Statistics of Surprise and Coincidence, that has since seen quite a history of citation and use by others. The basic method was applied to get highly accurate language identification, species identification, author identification, document classification, musical recommendations, insurance risk assessment and video recommendations. It was eventually the core of my doctoral dissertation (ask me for a copy) and has been cited hundreds of times.

What hasn't been well described in the past is just how simple the method really is. In most of the applications above, it is important to find useful features in large amounts of data. The method at the heart of all of these algorithms is to use a score to analyze counts of events, particularly counts of when events occur together. The counts that you usually have in these situations are the number of times two events have occurred together, the number of times each has occurred without the other, and the number of times anything at all has occurred.

To compute the score, it is handy to restate these counts slightly as the number of times the events occurred together (let's call this k_11), the number of times each has occurred without the other (let's call these k_12 and k_21) and the number of times something has been observed that was neither of these events (let's call this k_22). In this notation, I am using row or column 1 to be the event of interest and row or column 2 to be everything else.

Here is the table we need:

                   Event A                    Everything but A
Event B            A and B together (k_11)    B, but not A (k_12)
Everything but B   A without B (k_21)         Neither A nor B (k_22)

Once you have these counts, the hard part is over. Computing the log-likelihood ratio score (also known as G²) is very simple,

LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))

where H is Shannon's entropy, computed as the sum of (k_ij / sum(k)) log (k_ij / sum(k)) (written here without the usual leading minus sign; the formula above assumes this convention, which is why the score comes out positive). In R, this function is defined as

H = function(k) {N = sum(k); return(sum(k/N * log(k/N + (k == 0))))}

The trick with adding k==0 handles the case where some element of k is zero. Since that term is multiplied by the same zero, it contributes nothing to the sum, but the trick keeps us from trying to compute log(0). The Java or C versions of this (ask me for a copy) are almost as simple, but not quite as pretty.
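If you want to run it end to end, here is the same thing as a short, self-contained snippet. The counts in the example are made up, purely to show the call, and the name llr is just what I call the score here.

H = function(k) {
  N = sum(k)
  # The (k == 0) term keeps log(0) out of the sum for empty cells.
  sum(k/N * log(k/N + (k == 0)))
}

llr = function(k) {
  2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))
}

# Hypothetical counts, laid out as in the table above:
#                  A        not A
#   B              110      134
#   not B          2442     572464
k = matrix(c(110, 2442, 134, 572464), nrow = 2)
llr(k)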

So there you have it. Two lines of code and the world is your oyster.

8 comments:

Dustin Boswell said...

Hi Ted, I'm using your likelihood test to find collocations in text. I see how to apply it to find bigrams, but what about trigrams, etc.?

Ted Dunning ... apparently Bayesian said...

There are several approaches. Probably the simplest is a multi-pass approach where you first find good bigrams, then re-interpret those as single tokens and find interesting bigrams where one component is itself composite. You can obviously repeat that any number of times.

The other approach is to build a log-likelihood ratio test based on an order-2 Markov model versus an order-1 Markov model. The Markov models in question predict a word based on the preceding two words or the preceding one word of context. In my dissertation, I do a complete derivation of this, but I now think that it would be easier to use the mutual information trick to see how much information gain there is for using the larger context.

I can send you a copy of my dissertation if you like. My gmail address is ted.dunning.
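Here is a rough sketch in R of the multi-pass approach described above. It assumes the llr() helper from the post; the name find_phrases, the threshold, the "_" joining and the greedy merge are all arbitrary choices made just for illustration.

find_phrases = function(tokens, threshold = 10, passes = 2) {
  for (pass in seq_len(passes)) {
    first = tokens[-length(tokens)]
    second = tokens[-1]
    pairs = paste(first, second)

    n = length(pairs)
    pair_count = table(pairs)
    first_ct = table(first)
    second_ct = table(second)

    # Score every observed adjacent pair with the LLR from the post.
    score = sapply(names(pair_count), function(p) {
      w = strsplit(p, " ")[[1]]
      k_11 = pair_count[[p]]
      k_12 = first_ct[[w[1]]] - k_11    # first word without the second
      k_21 = second_ct[[w[2]]] - k_11   # second word without the first
      k_22 = n - k_11 - k_12 - k_21     # everything else
      llr(matrix(c(k_11, k_21, k_12, k_22), nrow = 2))
    })
    good = names(score)[score > threshold]

    # Greedy left-to-right merge of adjacent tokens that form a good pair.
    # Merged tokens act as single words in the next pass, so a later pass
    # can pick up trigrams (a composite token next to an ordinary one).
    merged = character(0)
    i = 1
    while (i <= length(tokens)) {
      if (i < length(tokens) && paste(tokens[i], tokens[i + 1]) %in% good) {
        merged = c(merged, paste(tokens[i], tokens[i + 1], sep = "_"))
        i = i + 2
      } else {
        merged = c(merged, tokens[i])
        i = i + 1
      }
    }
    tokens = merged
  }
  tokens
}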

wataru kasai said...

Hi,
I tried your score in Mahout and it works very well in practice for many recommendation tasks.
However, I would like to discuss an irregular case that can occur.
In the case k1=9, k2=3000, n1=3000, n2=1000000 (p1=p2=p=0.003), twoLogLambda converges to zero.
And around that point, the score increases as k1 (the cross-view count) decreases.
This behavior does not seem good. Is there any solution for such a case?

Ted Dunning ... apparently Bayesian said...

Wataru

In that case the score should be zero since the frequencies are identical. Any deviations will cause a positive score.

If you want the score to go above or below zero according to whether k1 is above or below the value expected from k2, then you can use the variant of the score that I call the signed root log-likelihood ratio. The basic idea is that you take the square root of the LLR and attach a sign according to whether k1 is above or below the expected value. Mahout has an implementation of this.
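As a rough sketch (my own illustration, not Mahout's code, reusing the llr() function from the post; laying your counts out as a 2x2 table this way is my assumption):

signed_root_llr = function(k) {
  expected = sum(k[1, ]) * sum(k[, 1]) / sum(k)    # expected k_11 under independence
  # max() guards against tiny negative values from floating-point rounding.
  sign(k[1, 1] - expected) * sqrt(max(llr(k), 0))
}

# Your case: k1 = 9 of n1 = 3000 versus k2 = 3000 of n2 = 1000000.
k = matrix(c(9, 3000, 2991, 997000), nrow = 2)
llr(k)              # essentially zero: both rates are 0.003
signed_root_llr(k)  # also about zero; negative or positive as k_11 falls below or rises above expectation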

wataru kasai said...

Thank you for your reply!
I would like to study the theory of your signed-root LLR, since I can't grasp its meaning at a glance. But it seems to work well and I will try it.

Ted Dunning ... apparently Bayesian said...

The theory is actually fairly simple. A random variable distributed as $\chi^2$ will have a square root that is normally distributed except for sign. The restoration of the sign here is symmetrical only in the asymptotic case, but for practical applications that doesn't matter.

Mat Kelcey said...

Hi Ted,

Can you elaborate a bit on the trigram comment? I don't quite get it..

For bigrams I've used in the past

mutual_info(t1,t2) = (p(t1)*p(t2)) / p(t1,t2)

Are you saying for trigrams it's

mutual_info(t1,t2,t3) = (p(t1,t2)*p(t2,t3)) / p(t1,t2,t3)

I think I don't quite get the "one component is itself composite." comment.

Cheers!
Mat

Ted Dunning ... apparently Bayesian said...

Mat,

Your definition of mutual information is a bit off and is very subject to errors with small counts. I would worry a bit. If you use the correct definition then I think that multiplying by the number of samples is a good thing since that gives you a measure of anomaly rather than association. For a threshold, anomaly is really what you are after ... other mechanisms are usually used for weighting anyway.
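To make that concrete, here is a small illustration of my own (reusing H and llr from the post, with made-up counts): the usual pointwise form is log p(t1,t2) / (p(t1) p(t2)), while the full mutual information of the 2x2 table, multiplied by 2N, is exactly the LLR score, which is what turns a measure of association into a measure of anomaly.

# Reusing H() and llr() from the post; the counts are hypothetical.
pmi = function(k) {                 # pointwise mutual information of the "together" cell
  N = sum(k)
  log((k[1, 1] / N) / ((sum(k[1, ]) / N) * (sum(k[, 1]) / N)))
}
mi = function(k) H(k) - H(rowSums(k)) - H(colSums(k))   # full mutual information, in nats

k = matrix(c(110, 2442, 134, 572464), nrow = 2)
pmi(k)              # association strength alone; says nothing about how much data backs it
2 * sum(k) * mi(k)  # identical to llr(k): association scaled by the amount of evidence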

Regarding trigrams, one of the easiest implementations is to simply make a pass looking for significant bigrams. Then, considering those bigrams as single words, make another pass looking for bigrams that include bigrams from the first pass. These composite bigrams are really trigrams. You can repeat this process as you like.

Mathematically, this is a bit off, but the ease of implementation really outweighs the value of orthodoxy in this case.