Как я могу оптимизировать этот код Python для генерации всех слов с расстоянием слова 1?

Question

Как я могу оптимизировать этот код Python для генерации всех слов с расстоянием слова 1?

Я столкнулся с подобной проблемой и исправил ее, настроив значение

redirect_url в ' https://test.instamojo.com/integrations/android/redirect/ '

Поскольку sdk инициирует обратный вызов (), как только достигнет своего URL перенаправления.

32

levenshtein-distance python optimization python-2.x word-diff

задан Community 23 May 2017 в 10:27

11 ответов

Первое, что мне пришло в голову:

from operator import ne

def distance(word1, word2):
    return sum(map(ne, word1, word2))

, у которого есть приличный шанс продвинуться быстрее, чем другие опубликованные функции, потому что у него нет интерпретируемых циклов, просто вызовы примитивов Python , И он достаточно короткий, чтобы вы могли разумно вставить его в вызывающую функцию.

Для вашей задачи более высокого уровня я бы рассмотрел структуры данных, разработанные для поиска сходства в метрических пространствах, например, , в этой статье или эта книга , ни одну из которых я не читал (они пришли в поисках статьи, которую я прочитал, но не могу вспомнить).

0

ответ дан 27 November 2019 в 20:51

Я не знаю, сильно ли это повлияет на вашу скорость, но вы могли бы начать с превращения понимания списка в выражение генератора. Он по-прежнему повторяем, поэтому его использование не должно сильно отличаться:

def getchildren(word, wordlist):
    return [ w for w in wordlist if distance(word, w) == 1 ]

-

def getchildren(word, wordlist):
    return ( w for w in wordlist if distance(word, w) == 1 )

. Основная проблема заключается в том, что понимание списка будет создаваться в памяти и занимать довольно много места, тогда как генератор создаст ваш список на лету, так что нет необходимости хранить все это.

Кроме того, следуя неизвестному ответу, это может быть более "питонический" способ записи расстояния ():

def distance(word1, word2):
    difference = 0
    for x,y in zip (word1, word2):
        if x == y:
            difference += 1
    return difference

Но это сбивает с толку то, что предназначено, когда len (word1)! = len (word2), в случае zip он вернет столько символов, сколько самого короткого слова. (Что может оказаться оптимизацией ...)

0

ответ дан 27 November 2019 в 20:51

Вы также можете сделать это в одной строке для кода, подобного этому.

$('#answer' + $.url.attr('anchor').match(/question(\d+)|(.*)/)[1]).show();

Теперь этот метод может потребоваться некоторое объяснение. По сути, здесь вы пытаетесь найти строку, содержащую вопрос (n), где где 'question (n)' - первое совпадение, а (n) - второе совпадение. Теперь, если это не найдено, то " | (. *) " говорит "

1

ответ дан 27 November 2019 в 20:51

Попробуйте это:

def distance(word1, word2):
  return sum([not c1 == c2 for c1, c2 in zip(word1,word2)])

Кроме того, у вас есть ссылка на вашу игру? Мне нравится быть уничтоженным играми в слова

0

ответ дан 27 November 2019 в 20:51

Как часто функция расстояния вызывается с одинаковыми аргументами? Простой для реализации оптимизации будет использовать памятка .

Возможно, вы могли бы также создать какой-то словарь с заморозками букв и списками слов, которые отличаются на единицу, и искать значения в этом. Эта структура данных может быть либо сохранена и загружена с помощью маринада, либо сгенерирована с нуля при запуске.

Короткое замыкание оценки даст вам выигрыш только в том случае, если используемые вами слова очень длинные, поскольку используемый вами алгоритм расстояния Хэмминга в основном O (n) где n - длина слова.

Я провел несколько экспериментов со временем для некоторых альтернативных подходов, которые могут быть иллюстративными.

Результаты Timeit

Ваше решение

d = """\
def distance(word1, word2):
    difference = 0
    for i in range(len(word1)):
        if word1[i] != word2[i]:
            difference += 1
    return difference
"""
t1 = timeit.Timer('distance("hello", "belko")', d)
print t1.timeit() # prints 6.502113536776391

One Liner

d = """\
from itertools import izip
def hamdist(s1, s2):
    return sum(ch1 != ch2 for ch1, ch2 in izip(s1,s2))
"""
t2 = timeit.Timer('hamdist("hello", "belko")', d)
print t2.timeit() # prints 10.985101179

Оценка ярлыка

d = """\
def distance_is_one(word1, word2):
    diff = 0
    for i in xrange(len(word1)):
        if word1[i] != word2[i]:
            diff += 1
        if diff > 1:
            return False
    return diff == 1
"""
t3 = timeit.Timer('hamdist("hello", "belko")', d)
print t2.timeit() # prints 6.63337

3

ответ дан 27 November 2019 в 20:51

Люди в основном пытаются это сделать, пытаясь написать более быструю функцию, но может быть и другой способ.

«расстояние» вызывается более 5 миллионов раз

Почему это ? Возможно, лучшим способом оптимизации является попытка уменьшить количество вызовов до расстояния , а не сбрасывать миллисекунды с времени выполнения расстояния . Невозможно сказать, не видя полного сценария, но оптимизация конкретной функции обычно не требуется.

Если это невозможно, возможно, вы могли бы написать его как модуль C?

4

ответ дан 27 November 2019 в 20:51

from itertools import izip

def is_neighbors(word1,word2):
    different = False
    for c1,c2 in izip(word1,word2):
        if c1 != c2:
            if different:
                return False
            different = True
    return different

Или, может быть, встроенный код izip :

def is_neighbors(word1,word2):
    different = False
    next1 = iter(word1).next
    next2 = iter(word2).next
    try:
        while 1:
            if next1() != next2():
                if different:
                    return False
                different = True
    except StopIteration:
        pass
    return different

И переписанный getchildren :

def iterchildren(word, wordlist):
    return ( w for w in wordlist if is_neighbors(word, w) )

izip (a, b) возвращает итератор для пар значений из a и b .
zip (a, b) возвращает список пар из a и b .

6

ответ дан 27 November 2019 в 20:51

Ваша функция расстояние вычисляет общее расстояние, когда вы действительно заботитесь только о расстоянии = 1. В большинстве случаев вы будете знать, что> 1 в нескольких символах, так что вы можете вернуться раньше и сэкономить много времени.

Помимо этого, может быть лучший алгоритм, но я не могу об этом думать.

Редактировать: Другая идея.

Вы можете сделать 2 случая, в зависимости от того, соответствует ли первый символ. Если оно не совпадает, остальное слово должно точно соответствовать, и вы можете проверить это за один выстрел. В противном случае, сделайте это аналогично тому, что вы делали. Вы могли бы даже сделать это рекурсивно, но я не думаю, что это будет быстрее.

def DifferentByOne(word1, word2):
    if word1[0] != word2[0]:
        return word1[1:] == word2[1:]
    same = True
    for i in range(1, len(word1)):
        if word1[i] != word2[i]:
            if same:
                same = False
            else:
                return False
    return not same

Редактировать 2: Я удалил проверку, чтобы увидеть, имеют ли строки одинаковую длину, так как вы говорите, что это избыточно. Бег Райан В тестах на моем собственном коде и на функции is_neighbors , предоставленной MizardX , я получаю следующее:

Исходное расстояние (): 3,7 секунды
My DifferentByOne (): 1,1 секунды
is_neighbors (): MizardX is_neighbors (): 3,7 секунды.

Редактировать 3: (Возможно, сюда можно попасть на вики-страницу сообщества, но ...)

Попробовать окончательное определение is_neighbors () с помощью izip вместо zip: 2.9 секунд.

Вот моя последняя версия, которая все еще раз в 1.1 секунды:

def DifferentByOne(word1, word2):
    if word1[0] != word2[0]:
        return word1[1:] == word2[1:]
    different = False
    for i in range(1, len(word1)):
        if word1[i] != word2[i]:
            if different:
                return False
            different = True
    return different

10

ответ дан 27 November 2019 в 20:51

Well you can start by having your loop break if the difference is 2 or more.

Also you can change

for i in range(len(word1)):

to

for i in xrange(len(word1)):

Because xrange generates sequences on demand instead of generating the whole range of numbers at once.

You can also try comparing word lengths which would be quicker. Also note that your code doesn't work if word1 is greater than word2

There's not much else you can do algorithmically after that, which is to say you'll probably find more of a speedup by porting that section to C.

Edit 2

Attempting to explain my analysis of Sumudu's algorithm compared to verifying differences char by char.

When you have a word of length L, the number of "differs-by-one" words you will generate will be 25L. We know from implementations of sets on modern computers, that the search speed is approximately log(n) base 2, where n is the number of elements to search for.

Seeing that most of the 5 million words you test against is not in the set, most of the time, you will be traversing the entire set, which means that it really becomes log(25L) instead of only log(25L)/2. (and this is assuming best case scenario for sets that comparing string by string is equivalent to comparing char by char)

Now we take a look at the time complexity for determining a "differs-by-one". If we assume that you have to check the entire word, then the number of operations per word becomes L. We know that most words differ by 2 very quickly. And knowing that most prefixes take up a small portion of the word, we can logically assume that you will break most of the time by L/2, or half the word (and this is a conservative estimate).

So now we plot the time complexities of the two searches, L/2 and log(25L), and keeping in mind that this is even considering string matching the same speed as char matching (highly in favor of sets). You have the equation log(25*L) > L/2, which can be simplified down to log(25) > L/2 - log(L). As you can see from the graph, it should be quicker to use the char matching algorithm until you reach very large numbers of L.

alt text

Also, I don't know if you're counting breaking on difference of 2 or more in your optimization, but from Mark's answer I already break on a difference of 2 or more, and actually, if the difference in the first letter, it breaks after the first letter, and even in spite of all those optimizations, changing to using sets just blew them out of the water. I'm interested in trying your idea though

I was the first person in this question to suggest breaking on a difference of 2 or more. The thing is, that Mark's idea of string slicing (if word1[0] != word2[0]: return word1[1:] == word2[1:]) is simply putting what we are doing into C. How do you think word1[1:] == word2[1:] is calculated? The same way that we are doing.

I read your explanation a few times but I didn't quite follow it, would you mind explaining it a little more indepth? Also I'm not terribly familiar with C and I've been working in high-level languages for the past few years (closest has been learning C++ in high school 6 years ago

As for producing the C code, I am a bit busy. I am sure you will be able to do it since you have written in C before. You could also try C#, which probably has similar performance characteristics.

More Explanation

Here is a more indepth explanation to Davy8

def getchildren(word, wordlist):
    oneoff = one_letter_off_strings(word)
    return set(oneoff) & set(wordlist)

Your one_letter_off_strings function will create a set of 25L strings(where L is the number of letters).

Creating a set from the wordlist will create a set of D strings (where D is the length of your dictionary). By creating an intersection from this, you MUST iterate over each oneoff and see if it exists in wordlist.

The time complexity for this operation is detailed above. This operation is less efficient than comparing the word you want with each word in wordlist. Sumudu's method is an optimization in C rather than in algorithm.

More Explanation 2

There's only 4500 total words (because the wordlist is pre-filtered for 5 letter words before even being passed to the algorithm), being intersected with 125 one-letter-off words. It seemed that you were saying intersection is log(smaller) or in otherwords log(125, 2). Compare this to again assuming what you said, where comparing a word breaks in L/2 letters, I'll round this down to 2, even though for a 5 letter word it's more likely to be 3. This comparison is done 4500 times, so 9000. log(125,2) is about 6.9, and log(4500,2) is about 12. Lemme know if I misinterpreted your numbers.

To create the intersection of 125 one-letter-off words with a dictionary of 4500, you need to make 125 * 4500 comparisons. This is not log(125,2). It is at best 125 * log(4500, 2) assuming that the dictionary is presorted. There is no magic shortcut to sets. You are also doing a string by string instead of char by char comparison here.

2

ответ дан 27 November 2019 в 20:51

для этого фрагмента:

for x,y in zip (word1, word2):
    if x != y:
        difference += 1
return difference

я бы использовал этот:

return sum(1 for i in xrange(len(word1)) if word1[i] == word2[i])

тот же шаблон будет следовать по всему предоставленному коду ...

0

ответ дан 27 November 2019 в 20:51

Другие вопросы по тегам:

levenshtein-distance python optimization python-2.x word-diff

Как я могу оптимизировать этот код Python для генерации всех слов с расстоянием слова 1?

11 ответов

Результаты Timeit

Ваше решение

One Liner

Оценка ярлыка

Похожие вопросы: