Tokyo Cabinet - Slower inserts after hitting 1 million records

Another option is to use itertools.product(), which gives the Cartesian product of the input iterables.

import itertools
import pandas as pd

# brand_name and object_raw are assumed to be pre-existing DataFrames with
# the columns 'brand_name' and 'object_name'/'category_id' respectively.
df = pd.DataFrame(list(itertools.product(brand_name.brand_name, object_raw.object_name)),
                  columns=['brand_name', 'object_name'])
df['category_id'] = df['object_name'].map(object_raw.set_index('object_name')['category_id'])
print(df)
  brand_name object_name  category_id
0       Nike     T-shirt           24
1       Nike      Shorts           45
2       Nike       Dress           32
3    Lacoste     T-shirt           24
4    Lacoste      Shorts           45
5    Lacoste       Dress           32
6     Adidas     T-shirt           24
7     Adidas      Shorts           45
8     Adidas       Dress           32

5
asked by Bharani on 3 March 2009 at 14:50

3 answers

I just set the cache option, and it is now significantly faster.
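
The answer does not say which cache option that was or through which API it was set, so here is a minimal sketch of what it could look like with Tokyo Cabinet's C hash-database API; the record-cache and mapped-memory sizes below are illustrative assumptions, not the values actually used:

#include <tchdb.h>
#include <stdio.h>

int main(void) {
  TCHDB *hdb = tchdbnew();

  /* Illustrative values: cache up to 1,000,000 records in memory and map
   * 256 MB of the database file.  Both must be set before tchdbopen(). */
  tchdbsetcache(hdb, 1000000);
  tchdbsetxmsiz(hdb, 256LL * 1024 * 1024);

  if (!tchdbopen(hdb, "casket.tch", HDBOWRITER | HDBOCREAT)) {
    fprintf(stderr, "open error: %s\n", tchdberrmsg(tchdbecode(hdb)));
    return 1;
  }

  tchdbput2(hdb, "key", "value");  /* example insert */

  tchdbclose(hdb);
  tchdbdel(hdb);
  return 0;
}

Link against libtokyocabinet (e.g. gcc example.c -ltokyocabinet); both tuning calls have to happen before tchdbopen() to have any effect.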

2
answered 14 December 2019 at 09:01

I think that changing the bnum parameter in the dbtune function will also give a significant speed improvement.
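
Assuming "dbtune" maps to the tune call of Tokyo Cabinet's hash database (tchdbtune() in the C API), a minimal sketch could look like the following; the bucket count of 4,000,000 (a few times the number of records you expect to store) is an illustrative assumption:

#include <tchdb.h>
#include <stdio.h>

int main(void) {
  TCHDB *hdb = tchdbnew();

  /* Illustrative tuning: 4,000,000 buckets for a few million records,
   * default apow/fpow (-1), and HDBTLARGE so the file can grow past 2 GB.
   * The bucket number only takes effect when the database file is created. */
  tchdbtune(hdb, 4000000, -1, -1, HDBTLARGE);

  if (!tchdbopen(hdb, "casket.tch", HDBOWRITER | HDBOCREAT)) {
    fprintf(stderr, "open error: %s\n", tchdberrmsg(tchdbecode(hdb)));
    return 1;
  }

  /* ... bulk inserts ... */

  tchdbclose(hdb);
  tchdbdel(hdb);
  return 0;
}

If the database is served through ttserver (Tokyo Tyrant), the same parameter can be appended to the database name on the command line, e.g. ttserver casket.tch#bnum=4000000.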

1
answered 14 December 2019 at 09:01

I hit a brick wall around 1 million records per shard as well (sharding on the client side, nothing fancy). I tried various ttserver options and they seemed to make no difference, so I looked at the kernel side and found that

echo 80 > /proc/sys/vm/dirty_ratio

(previous value was 10) gave a big improvement - the following is the total size of the data (on 8 shards, each on its own node) printed every minute:

total:  14238792  records,  27.5881 GB size
total:  14263546  records,  27.6415 GB size
total:  14288997  records,  27.6824 GB size
total:  14309739  records,  27.7144 GB size
total:  14323563  records,  27.7438 GB size
(here I changed the dirty_ratio setting for all shards)
total:  14394007  records,  27.8996 GB size
total:  14486489  records,  28.0758 GB size
total:  14571409  records,  28.2898 GB size
total:  14663636  records,  28.4929 GB size
total:  14802109  records,  28.7366 GB size

So you can see that the improvement was on the order of 7-8 times. The database size was around 4.5 GB per node at that point (including indexes) and the nodes have 8 GB of RAM (so a dirty_ratio of 10 meant that the kernel tried to keep less than ca. 800 MB of dirty pages).

Next thing I'll try is ext2 (currently: ext3) and noatime and also keeping everything on a ramdisk (that would probably waste twice the amount of memory, but might be worth it).

4
answered 14 December 2019 at 09:01