Calculating the perplexity of a language model for email classification

I have a feature set of the 500 most frequently occurring unigrams from a corpus of emails. I have been using it to classify emails with C5.0, based on the presence or absence of each of the words in any test email.

Now I need to calculate the perplexity of the terms in the feature set and use this to classify emails. Does anyone have experience with language modelling and know how I would go about calculating the perplexity of the model? Any help would be great!

I should add that I am aware of tools that can do this for me automatically, SRILM/CMU-LMtoolkit for instance, but I would rather build this myself from the ground up, as it's part of my final year project! I just need a hint on how to get started... perhaps a link to "The idiot's guide to perplexity calculation and classification using perplexity"!!
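
For context, perplexity over a test sequence of N tokens is usually defined as exp(-(1/N) * sum of log p(w_i)), so a lower value means the model finds the text less surprising. Below is a minimal sketch of what a from-scratch unigram version might look like. It assumes whitespace tokenisation, add-one smoothing, an `<unk>` bucket for words outside the feature set, and one model trained per email class; the function names and the toy data are hypothetical and not part of my actual setup.

```python
import math
from collections import Counter

UNK = "<unk>"  # assumed bucket for words outside the feature set


def train_unigram(tokens, vocab):
    """Estimate add-one-smoothed unigram probabilities over vocab plus UNK."""
    counts = Counter(t if t in vocab else UNK for t in tokens)
    total = sum(counts.values())
    size = len(vocab) + 1  # +1 for the UNK bucket
    probs = {w: (counts[w] + 1) / (total + size) for w in vocab}
    probs[UNK] = (counts[UNK] + 1) / (total + size)
    return probs


def perplexity(tokens, probs):
    """Return exp of the average negative log-probability per token."""
    if not tokens:
        raise ValueError("cannot compute perplexity of an empty sequence")
    log_sum = sum(math.log(probs.get(t, probs[UNK])) for t in tokens)
    return math.exp(-log_sum / len(tokens))


def classify(tokens, models):
    """Pick the class whose model gives the test tokens the lowest perplexity."""
    return min(models, key=lambda label: perplexity(tokens, models[label]))


# Toy data, purely illustrative.
spam_tokens = "buy cheap pills buy now".split()
ham_tokens = "meeting agenda attached see agenda".split()
vocab = set(spam_tokens) | set(ham_tokens)

models = {
    "spam": train_unigram(spam_tokens, vocab),
    "ham": train_unigram(ham_tokens, vocab),
}
label = classify("buy pills now".split(), models)
```

With real data, `vocab` would be the 500-word feature set and each class model would be trained on that class's emails; whether add-one smoothing is appropriate for this corpus is an open question.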

Thanks a lot!!

asked by B. Bowles, 23 March 2011 at 09:36