TF*IDF Weighting

What is this used for?

You have a document, such as an email, and you need to turn each word, or token, into a numeric value for that document so it can be meaningfully classified or processed. This is typically known as a ‘bag of words’ approach, where the meaning of a document is based purely upon the words it contains, in no particular order. To improve the accuracy of the method, but only in cases where you have lots of data in your corpus (collection of documents), you can extend this to a bigram or trigram model, where you combine adjacent tokens into sequences two or three words long. This is useful because it captures each word in context, but on the other hand, if there isn’t much data available there won’t be enough bigrams or trigrams to match against new data. Another approach is to use Porter’s stemming algorithm, but the meanings of individual tokens can be lost during the stemming process, so this approach isn’t always favoured.
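As a rough sketch of the bigram idea (the function name is my own, for illustration only), adjacent tokens can be paired up like this:

```php
<?php
// Build bigrams from an ordered list of tokens (illustrative sketch only).
function bigrams(array $tokens): array
{
    $pairs = [];
    for ($i = 0; $i < count($tokens) - 1; $i++) {
        $pairs[] = $tokens[$i] . ' ' . $tokens[$i + 1];
    }
    return $pairs;
}

// bigrams(["don't", "panic", "please"]) returns ["don't panic", "panic please"]
```

With a small corpus most of these pairs will be seen only once, which is why the bag-of-words approach above is the safer default.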

The Wikipedia article also explains this principle quite well. Evidently, though, you first need to get the tokens out of the document, and I would suggest using some PHP like this, which surrounds exclamation marks, question marks, full stops, commas, asterisks and speech marks with spaces so they become tokens of their own, but leaves words like ‘don’t’ as one token, without splitting them.

// Surround punctuation with spaces so each mark becomes its own token.
$message = preg_replace("/([!?.,*\"])/", " \\1 ", $message);
// Split on runs of whitespace, discarding any empty strings.
$tokens = preg_split("/\s+/", $message, -1, PREG_SPLIT_NO_EMPTY);

Term Frequency (or TF)

There are a few alternatives to TF*IDF weighting, though I’ve used it frequently in the past and it usually does a good job. To start off with, what does ‘term frequency’ (TF) refer to? For each token in the tokens array, we calculate the number of times that token appears in the current email. To normalise this against the email’s length (a longer email has more tokens, and consequently a higher probability of the same token occurring more often), we divide the count by the number of tokens in the current email. So that’s term frequency, or TF.
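A minimal sketch of that calculation (the function name is mine, not from any library), working on the tokens array produced above:

```php
<?php
// Term frequency: how often the term appears in this email, divided by
// the email's length in tokens to normalise for document length.
function term_frequency(string $term, array $tokens): float
{
    $count = 0;
    foreach ($tokens as $token) {
        if ($token === $term) {
            $count++;
        }
    }
    return $count / count($tokens);
}

// For an email of 100 tokens containing 'hello' 3 times,
// term_frequency('hello', $tokens) returns 0.03.
```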

Inverse Document Frequency (or IDF)

Now, inverse document frequency. This sounds a bit scarier, and is a little more work to calculate, but is also reasonably simple; it measures the relative importance of the token (for disambiguation from other documents). For each token, we need to get the number of documents that the token appears in, then divide the total size of the collection (how many documents we have) by that count and take the natural log. Take a single token, say ‘hello’, which appears in an email 3 times. The email is 100 tokens long, so 3 / 100 = 0.03 is the term frequency. Our collection of documents, or corpus, is 300 documents large, and ‘hello’ appears in 120 of these, so we have ln(300/120) = 0.916. The log is just used as a scaling factor, and to stop a certain amount of saturation toward 0.

So multiplying tf * idf = 0.03 * 0.916 ≈ 0.027, which gives the value of the individual token as a disambiguator. That isn’t very high, because the token occurs in a lot of other documents.
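Putting the two halves together, here is a sketch of the weighting using the numbers from the worked example (the function name is my own; PHP’s log() is the natural log by default):

```php
<?php
// Inverse document frequency: natural log of the corpus size divided by
// the number of documents containing the term.
function inverse_document_frequency(int $corpusSize, int $docsWithTerm): float
{
    return log($corpusSize / $docsWithTerm);
}

// 'hello' appears 3 times in a 100-token email; the corpus holds 300
// documents, 120 of which contain 'hello'.
$tf     = 3 / 100;                               // 0.03
$idf    = inverse_document_frequency(300, 120);  // ln(2.5) ≈ 0.916
$weight = $tf * $idf;                            // ≈ 0.027
```

A rare term would get a much larger idf, so its occurrences would count for more in the final weight.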


~ by coffeemist on January 26, 2010.

