/**
*
* @param sentence should has at least one string
* @param maxGramSize should be 1 at least
* @return set of continuous word n-grams up to maxGramSize from the sentence
*/
public static List<String> generateNgramsUpto(String str, int maxGramSize) {
List<String> sentence = Arrays.asList(str.split("[\\W+]"));
List<String> ngrams = new ArrayList<String>();
int ngramSize = 0;
StringBuilder sb = null;
//sentence becomes ngrams
for (ListIterator<String> it = sentence.listIterator(); it.hasNext();) {
String word = (String) it.next();
//1- add the word itself
sb = new StringBuilder(word);
ngrams.add(word);
ngramSize=1;
it.previous();
//2- insert prevs of the word and add those too
while(it.hasPrevious() && ngramSize<maxGramSize){
sb.insert(0,' ');
sb.insert(0,it.previous());
ngrams.add(sb.toString());
ngramSize++;
}
//go back to initial position
while(ngramSize>0){
ngramSize--;
it.next();
}
}
return ngrams;
}
Вызов:
long startTime = System.currentTimeMillis();
ngrams = ToolSet.generateNgramsUpto("This is my car.", 3);
long stopTime = System.currentTimeMillis();
System.out.println("My time = "+(stopTime-startTime)+" ms with ngramsize = "+ngrams.size());
System.out.println(ngrams.toString());
Выход:
Мое время = 1 мс с ngramsize = 9 [This, is, This is, my, is мой, это мой, автомобиль, моя машина, мой автомобиль]
blockquote>