What hasn't been well described in the past is just how simple the method really is. In most of the applications above, it is important to find useful features in large amounts of data. The method at the heart of all of these algorithms is to use a score to analyze counts of events, particularly counts of when events occur together. The counts that you usually have in these situations are the number of times two events have occurred together, the number of times each has occurred without the other, and the number of times anything at all has been observed.
To compute the score, it is handy to restate these counts slightly as the number of times the events occurred together (let's call this k_11), the number of times each has occurred without the other (let's call these k_12 and k_21) and the number of times something has been observed that was neither of these events (let's call this k_22). In this notation, I am using row or column 1 to be the event of interest and row or column 2 to be everything else.
Here is the table we need:

                 | Event A                  | Everything but A
Event B          | A and B together (k_11)  | B, but not A (k_12)
Everything but B | A without B (k_21)       | Neither A nor B (k_22)
Once you have these counts, the hard part is over. Computing the log-likelihood ratio score (also known as G²) is very simple:
LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))
where H is Shannon's entropy, computed here as the sum of (k_ij / sum(k)) log (k_ij / sum(k)). (Note that the customary minus sign is dropped; as long as H is used with this sign convention throughout, the formula above comes out right.) In R, this function is defined as
H = function(k) {N = sum(k); return(sum(k / N * log(k / N + (k == 0))))}
The trick of adding k == 0 handles the case where some element of k is zero. Since that term is multiplied by the same zero, we can drop it, but it helps not to try to compute log(0). The Java and C versions of this (ask me for a copy) are almost as simple, but not quite as pretty.
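For concreteness, here is one way to wrap the score up as a function, together with a made-up example (the counts below are hypothetical, chosen only to show the shape of the call; the function reuses H from above):

LLR = function(k) {2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))}

# Hypothetical counts: 110 of A and B together, 2442 of B without A,
# 1383 of A without B, and 250000 observations with neither.
k = matrix(c(110, 2442, 1383, 250000), nrow = 2, byrow = TRUE)
LLR(k)   # a large positive score indicates a strong association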
So there you have it. Two lines of code and the world is your oyster.
8 comments:
Hi Ted, I'm using your likelihood test to find collocations in text. I see how to apply it to find bigrams, but what about trigrams, etc.?
There are several approaches. Probably the simplest is a multi-pass approach where you first find good bigrams, then re-interpret those as single tokens and find interesting bigrams where one component is itself composite. You can obviously repeat that any number of times.
The other approach is to build a log-likelihood ratio test based on an order-2 Markov model versus an order-1 Markov model. The Markov models in question predict a word based on the preceding two words or the preceding one word of context. In my dissertation, I do a complete derivation of this, but I now think that it would be easier to use the mutual information trick to see how much information gain there is for using the larger context.
I can send you a copy of my dissertation if you like. My gmail address is ted.dunning.
Hi,
I tried your score in Mahout and it works very well in practice for many recommendation tasks.
Still, I would like to discuss an irregular case that can potentially occur.
In the case k1 = 9, k2 = 3000, n1 = 3000, n2 = 1000000 (p1 = p2 = p = 0.003), twoLogLambda converges to zero.
Around that point, the score increases as k1 (the cross-view count) decreases.
This behavior does not seem so good. Is there any solution for such a case?
Wataru
In that case the score should be zero since the frequencies are identical. Any deviations will cause a positive score.
If you want the score to go above or below zero according to whether the value of k1 is above or below the expected value relative to k2 then you can use the variant of the score that I call the signed root log likelihood ratio. The basic idea is that you take the square root and apply a sign according to whether k1 is above or below the expected value. Mahout has an implementation of this.
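Here is a sketch of that idea in R (this is not Mahout's implementation; it reuses the H function from the post, and the guard against tiny negative rounding errors is just for safety):

# Signed root LLR for k1 successes out of n1 versus k2 out of n2. Sketch only.
signedRootLLR = function(k1, n1, k2, n2) {
  k = matrix(c(k1, n1 - k1, k2, n2 - k2), nrow = 2, byrow = TRUE)
  llr = 2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))
  expected = n1 * (k1 + k2) / (n1 + n2)   # expected k1 under the pooled rate
  sign(k1 - expected) * sqrt(max(llr, 0))
}

# Wataru's case: 9/3000 and 3000/1000000 are the same rate, so the score is 0.
signedRootLLR(9, 3000, 3000, 1000000)
# Fewer co-occurrences than expected push the score below zero.
signedRootLLR(4, 3000, 3000, 1000000)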
Thank you for your reply!
I would like to study the theory behind your signed-root LLR, since I can't grasp its meaning at a glance. But it seems to work well, and I will try it.
The theory is actually fairly simple. A random variable distributed as $\chi^2$ has a square root that is normally distributed except for sign. The restoration of the sign here is symmetrical only in the asymptotic case, but for practical applications that doesn't matter.
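In symbols (just restating this): under the null hypothesis the LLR is asymptotically distributed as $\chi^2_1$, so

$$\operatorname{sgn}(k_1 - \hat{k}_1)\,\sqrt{\mathrm{LLR}} \;\approx\; N(0, 1), \qquad \hat{k}_1 = \frac{n_1 (k_1 + k_2)}{n_1 + n_2},$$

where $\hat{k}_1$ is the count you would expect if both samples shared the pooled rate.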
Hi Ted,
Can you elaborate a bit on the trigram comment? I don't quite get it.
For bigrams I've used in the past
mutual_info(t1,t2) = (p(t1)*p(t2)) / p(t1,t2)
Are you saying for trigrams it's
mutual_info(t1,t2,t3) = (p(t1,t2)*p(t2,t3)) / p(t1,t2,t3)
I think I don't quite get the "one component is itself composite." comment.
Cheers!
Mat
Mat,
Your definition of mutual information is a bit off and is very subject to errors with small counts. I would worry a bit. If you use the correct definition then I think that multiplying by the number of samples is a good thing since that gives you a measure of anomaly rather than association. For a threshold, anomaly is really what you are after ... other mechanisms are usually used for weighting anyway.
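For reference, the standard definition, written with the notation from the post ($p_{ij} = k_{ij}/N$, row and column sums $p_{i\cdot}$ and $p_{\cdot j}$, natural logarithms), is

$$I(A;B) = \sum_{i,j} p_{ij} \log \frac{p_{ij}}{p_{i\cdot}\, p_{\cdot j}}, \qquad \mathrm{LLR} = 2N \, I(A;B),$$

so the LLR is mutual information scaled by the number of samples, which is exactly what turns a measure of association into a measure of anomaly.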
Regarding trigrams, one of the easiest implementations is to simply make a pass looking for significant bigrams. Then, considering those bigrams as single words, make another pass looking for bigrams that include bigrams from the first pass. These composite bigrams are really trigrams. You can repeat this process as you like.
Mathematically, this is a bit off, but the ease of implementation really outweighs the value of orthodoxy in this case.
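A rough sketch of that merging step in R (the list of good bigrams and its format are made up for illustration; picking them by thresholding the LLR score is left out):

# Merge previously-found "good" bigrams into single composite tokens so that
# a second bigram pass effectively finds trigrams. Sketch only.
mergeBigrams = function(tokens, goodBigrams) {
  out = character(0)
  i = 1
  while (i <= length(tokens)) {
    pair = if (i < length(tokens)) paste(tokens[i], tokens[i + 1], sep = "_") else ""
    if (pair %in% goodBigrams) {
      out = c(out, pair)       # join the two words into one composite token
      i = i + 2
    } else {
      out = c(out, tokens[i])
      i = i + 1
    }
  }
  out
}

# If "new_york" scored well in the first pass, a second pass over the merged
# stream can then find "new_york city", which is really a trigram.
mergeBigrams(c("the", "new", "york", "city", "skyline"), c("new_york"))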