n-gram sentence similarity with cosine similarity measurement

I have been working on a project about sentence similarity. I know it has been asked many times in SO, but I just want to know if my problem can be accomplished by the method I use by the way that I am doing it, or I should change my approach to the problem. Roughly speaking, the system is supposed to split all sentences of an article and find similar sentences among other articles that are fed to the system.

I am using cosine similarity with tf-idf weights and that is how I did it.

1- First, I split all the articles into sentences, then I generate trigrams for each sentence and sort them(should I?).

2- I compute the tf-idf weights of trigrams and create vectors for all sentences.

3- I calculate the dot product and magnitude of original sentence and of the sentence to be compared. Then calculate the cosine similarity.

However, the system does not work as I expected. Here, I have some questions in my mind.

As far as I have read about tf-idf weights, I guess they are more useful for finding similar "documents". Since I am working on sentences, I modified the algorithm a little by changing some variables of the formula of tf and idf definitions(instead of document I tried to come up with sentence based definition).

tf = number of occurrences of trigram in sentence / number of all trigrams in sentence

idf = number of all sentences in all articles / number of sentences where trigram appears

Do you think it is ok to use such a definition for this problem?

Another one is that I saw the normalization is mentioned many times when calculating the cosine similarity. I am guessing that this is important because the trigrams vectors might not be the same size(which they rarely are in my case). If a trigram vector is size of x and the other is x+1, then I treat the first vector as it was the size of x+1 with the last value is being 0. Is this what it is meant by normalization? If not, how do I do the normalization?

Besides these, if I have chosen the wrong algorithm what else can be used for such problem(preferably with n-gram approach)?

Thank you in advance.

5
задан Ahmet Keskin 27 October 2010 в 21:01
поделиться