Looking for a faster way to perform string searches

I need to classify a large list of URLs (a few million lines) as belonging to a particular category or not. I have a second list of sub-strings; if any of them is present in a URL, that URL belongs to the category. Call it Category A.

The sub-string list has around 10k entries. What I did was simply go through the sub-string file line by line and, for each URL, look for a match; if one is found, the URL belongs to Category A. In tests this turned out to be rather time-consuming.
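To make the question concrete, here is a minimal sketch of the naive approach described above, in Java (one of the preferred languages). The class name, method name, and sample data are hypothetical; the real lists would be read from files. The cost is roughly O(number of URLs × number of sub-strings × URL length), which is why it gets slow at these sizes.

```java
import java.util.Arrays;
import java.util.List;

public class NaiveMatcher {
    // The inner loop of the naive approach: scan every sub-string
    // against the URL and stop at the first hit.
    static boolean inCategoryA(String url, List<String> substrings) {
        for (String s : substrings) {
            if (url.contains(s)) {
                return true; // first match is enough to classify the URL
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical sample data for illustration only.
        List<String> substrings = Arrays.asList("shop", "cart", "checkout");
        System.out.println(inCategoryA("http://example.com/shop/item", substrings));
        System.out.println(inCategoryA("http://example.com/news", substrings));
    }
}
```

For a few million URLs against 10k sub-strings, this loop runs tens of billions of `contains` scans, which matches the slowness observed.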

I'm not a computer science student, so I don't have much knowledge of algorithm optimization. Is there a way to make this faster? Simple ideas would be welcome. The programming language is not a big issue, but Java or Perl would be preferable.

The list of sub-strings to match will not change much. However, I will receive different lists of URLs, so the matching has to run each time a new list arrives. The bottleneck seems to be the URLs, as they can get very long.

asked by sfactor, 13 April 2011 at 07:50