Distance between documents

A ‘distance’ between documents. Granted, an odd concept, however one that works reasonably well. After you have converted your document into a bag of words, assigning a numerical value to each token, you can then calculate the distance between two of these documents.

Imagine a 3D vector space, which is where a single document resides. This vector space also contains all your other documents, at a particular distance from one another. But this would be the case if your document only contained 3 tokens. If it contains 20 tokens, then you would have to consider a 20D vector space. So, a vector space is usually referred to as n-dimensional. Furthermore, you can only calculate the distance between the same token in two different documents.

The most common is Euclidean distance, and is the one that will be discussed here. Mahalanobis distance is another measure, which has the advantage of being unaffected by scale (scale invariant), which is not an issue here as tf*idf values are normalised, but is also a bit more computationally intensive.

function euclidean(array $objectX, array $objectY) {
	// Assuming we're dealing with two associative arrays.
	// $objectX[token] = tfidf
	$missingValue = 0.0001;
	$dist = 0;
	$tokens = array_keys(array_merge($objectX, $objectY));
	foreach($tokens as $token) {
		if(!isset($objectY[$token])) { $objectY[$token] = $missingValue; }
		if(!isset($objectX[$token])) { $objectY[$token] = $missingValue; }
		$dist += pow(($objectX[$token] - $objectY[$token]), 2);
	return $dist;

The most important point to make is the thinking behind the $missingValue. If it’s missing in one document, but not in the other, then we assign a small missing value to the document it’s missing in (e.g. the token has little bearing upon the meaning of that document). If it’s missing in both, the distance is 0, so we have found a perfect match (which is appropriate). You can either use pow(,2) or abs(), to make the value an unsigned scalar.

Consequently, the distance between two documents is ‘the sum of their parts’.

About these ads

~ by coffeemist on January 26, 2010.

2 Responses to “Distance between documents”

  1. [quote] If it’s missing in both, the distance is 0, so we have found a perfect match. [/quote]

    I presume you mean, if it’s present in both, the distance is 0?

  2. Interestingly, and I suppose slightly counter intuitively, it is actually the case that if it’s missing in both, for that one token, then it is a perfect match (as the token is not there in either document). If for example I had one document that is talking about carpets, and another document that didn’t talk about carpets, then the two documents could be considered different (for carpets at least). If neither document talks about carpets then they are the same, for that token.

    I should also say that you are correct in saying that if the token is present in both then that is also a perfect match. Remember that the meaning of a document is also indicated by the words that aren’t there as well as those that are!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Get every new post delivered to your Inbox.

%d bloggers like this: