How can I speed up filling a numpy array in Python?

I'm trying to fill a preallocated NumPy buffer using the following code:

# preallocate a block array
dt = numpy.dtype('u8')
in_memory_blocks = numpy.zeros(_AVAIL_IN_MEMORY_BLOCKS, dt)

...

# write all the blocks out, flushing only as desired
blocks_per_flush_xrange = xrange(0, blocks_per_flush)
for _ in xrange(0, num_flushes):
    for block_index in blocks_per_flush_xrange:
        in_memory_blocks[block_index] = random.randint(0, _BLOCK_MAX)

    print('flushing bytes stored in memory...')

    # commented out for SO; exists in actual code
    # removing this doesn't make an order-of-magnitude difference in time
    # m.update(in_memory_blocks[:blocks_per_flush])

    in_memory_blocks[:blocks_per_flush].tofile(f)

Some points:

  • num_flushes is low, at around 4 - 10
  • blocks_per_flush is a large number, on the order of millions
  • in_memory_blocks can be a fairly large buffer (I've set it as low as 1MB and as high as 100MB) but the timing is very consistent...
  • _BLOCK_MAX is the max for an 8-byte unsigned int
  • m is a hashlib.md5()
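For comparison, the per-element Python loop above can be replaced with a single vectorized draw from NumPy's random generator. A minimal sketch (the `default_rng` API assumes NumPy >= 1.17; on older versions, `numpy.random.randint` with a `dtype` argument plays the same role):

```python
import numpy

rng = numpy.random.default_rng()

# Illustrative size; the question uses values on the order of millions.
blocks_per_flush = 1_000_000

# Fill the whole buffer in one call instead of a Python-level loop.
# high=2**64 is exclusive, so this covers the full uint64 range
# (the equivalent of random.randint(0, _BLOCK_MAX) per element).
in_memory_blocks = rng.integers(0, 2**64, size=blocks_per_flush,
                                dtype=numpy.uint64)
```

Each flush could then write the slice out with `in_memory_blocks[:blocks_per_flush].tofile(f)` as in the original code; the key change is that the random values are generated in C inside NumPy rather than one `random.randint` call at a time.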

Generating 1MB using the above code takes ~1s; 500MB takes ~376s. By comparison, my simple C program that uses rand() can create a 500MB file in 8s.

How can I improve the performance in the above loop? I'm pretty sure I'm ignoring something obvious that's causing this massive difference in runtime.

asked by Allen George, 15 April 2011 at 23:06