Что такое реализация хеш-таблицы/словаря для Python, который не хранит ключи?

Question

Что такое реализация хеш-таблицы/словаря для Python, который не хранит ключи?

Я храню миллионы, возможно миллиарды 4-байтовых значений в хеш-таблице, и я не хочу хранить любой из ключей. Я ожидаю, что только хеши ключей и значений должны будут быть сохранены. Это должно быть быстро и все сохраненные в RAM. Записи все еще искались бы с ключом, в отличие от набора ().

Что такое реализация этого для Python? Существует ли название этого?

Да, коллизии позволяются и могут быть проигнорированы.

(Я могу сделать исключение для коллизий, ключ может быть сохранен для тех. С другой стороны, коллизии могут просто перезаписать ранее хранимую сумму.)

6

python dictionary map hashtable data-structures

задан user213060 16 December 2009 в 23:28

10 ответов

Другие вопросы по тегам:

python dictionary map hashtable data-structures

Bloomier filters - space-efficient associative array

From the Wikipedia:

Chazelle et al. (2004) designed a generalization of Bloom filters that could associate a value with each element that had been inserted, implementing an associative array. Like Bloom filters, these structures achieve a small space overhead by accepting a small probability of false positives. In the case of "Bloomier filters", a false positive is defined as returning a result when the key is not in the map. The map will never return the wrong value for a key that is in the map.

score 2 · Answer 2

Although python dictionaries are very efficient, I think that if you're going to store billions of items, you may want to create your own C extension with data structures, optimized for the way you are actually using it (sequential access? completely random? etc).

In order to create a C extension, you may want to use SWIG, or something like Pyrex (which I've never used).

score 3 · Answer 3

Как насчет того, чтобы использовать обычный словарь и вместо этого:

d[x]=y

использовать:

d[hash(x)]=y

Для поиска:

d[hash(foo)]

Конечно, если есть конфликт хеша, вы можете получить неверное значение обратно.

score 3 · Answer 4

Its the good old space vs runtime tradeoff: You can have constant time with linear space usage for the keys in a hastable. Or you can store the key implicitly and use log n time by using a binary tree. The (binary) hash of a value gives you the path in the tree where it will be stored.

score 2 · Answer 5

Build your own b-tree in RAM.

Memory use:

(4 bytes) comparison hash value

(4 bytes) index of next leaf if hash <= comparison OR if negative index of value

(4 bytes) index of next leaf if hash > comparison OR if negative index of value

12 bytes per b-tree node for the b-tree. More overhead for the values (see below).

How you structure this in Python - aren't there "native arrays" of 32bit integers upported with almost no extra memory overhead...? what are they called... anyway those.

Separate ordered array of subarrays each containing one or more values. The "indexes of value" above are indexes into this big array, allowing retrieval of all values matching the hash.

This assumes a 32bit hash. You will need more bytes per b-tree node if you have greater than 2^31-1 entries or a larger hash.

BUT Spanner in the works perhaps: Note that you will not be able, if you are not storing the key values, to verify that a hash value looked up corresponds only to your key unless through some algorithmic or organisational mechanism you have guaranteed that no two keys will have the same hash. Quite a serious issue here. Have you considered it? :)

score 1 · Answer 6

Hash table has to store keys, unless you provide a hash function that gives absolutely no collisions, which is nearly impossible.

There is, however, if your keys are string-like, there is a very space-efficient data structure - directed acyclic word graph (DAWG). I don't know any Python implementation though.

score 1 · Answer 7

It's not what you asked for buy why not consider Tokyo Cabinet or BerkleyDB for this job? It won't be in memory but you are trading performance for greater storage capacity. You could still keep your list in memory and use the database only to check existence.

score 1 · Answer 8

If you're actually storing millions of unique values, why not use a dictionary?
Store: d[hash(key)/32] |= 2**(hash(key)%32)
Check: (d[hash(key)/32] | 2**(hash(key)%32))

If you have billions of entries, use a numpy array of size (2**32)/32, instead. (Because, after all, you only have 4 billion possible values to store, anyway).

score 0 · Answer 9

Почему не словарь + хэш-библиотека?

>>> import hashlib
>>> hashtable = {}
>>> def myHash(obj):
    return hashlib.sha224(obj).hexdigest()

>>> hashtable[myHash("foo")] = 'bar'
>>> hashtable
{'0808f64e60d58979fcb676c96ec938270dea42445aeefcd3a4e6f8db': 'bar'}

score 1 · Answer 10

Не могли бы вы рассказать нам больше о ключах? Мне интересно, есть ли какие-то закономерности в ключах, которые мы могли бы использовать.

Если ключи представляют собой строки в небольшом алфавите (пример: строки цифр, такие как телефонные номера), вы можете использовать структуру данных trie:

] http://en.wikipedia.org/wiki/Trie

Что такое реализация хеш-таблицы/словаря для Python, который не хранит ключи?

10 ответов

Bloomier filters - space-efficient associative array

Похожие вопросы: