Do a mass db.delete on App Engine, without eating CPU

We've got a reasonably-sized database on Google App Engine - just over 50,000 entities - that we want to clear out stale data from. The plan was to write a deferred task to iterate over the entities we no longer wanted, and delete them in batches.

One complication is that our entities also have child entities that we also want to purge -- no problem, we thought; we'd just query the datastore for those entities, and drop them at the same time as the parent:

query = ParentKind.all()
query.count(100)
query.filter('bar =', 'foo')
to_delete = []
for entity in enumerate(query):
    to_delete.append(entity)
    to_delete.extend(ChildKindA.all().ancestor(entity).fetch(100))
    to_delete.extend(ChildKindB.all().ancestor(entity).fetch(100))
db.delete(to_delete)

We limited ourselves to deleting 100 ParentKind entities at a time; each ParentKind had around 40 child ChildKindA and ChildKindB entities total - perhaps 4000 entities.

This seemed reasonable at the time, but we ran one batch as a test, and the resulting query took 9 seconds to run -- and spent 1933 seconds in billable CPU time accessing the datastore.

This seems pretty harsh -- 0.5 billable seconds per entity! -- but we're not entirely sure what we're doing wrong. Is it simply the size of the batch? Are ancestor queries particularly slow? Or, are deletes (and indeed, all datastore accesses) simply slow as molasses?

Update

We changed our queries to be keys_only, and while that reduced the time to run one batch to 4.5 real seconds, it still cost ~1900 seconds in CPU time.

Next, we installed Appstats to our app (thanks, kevpie) and ran a smaller sized batch -- 10 parent entities, which would amount to ~450 entities total. Here's the updated code:

query = ParentKind.all(keys_only=True)
query.count(10)
query.filter('bar =', 'foo')
to_delete = []
for entity in enumerate(query):
    to_delete.append(entity)
    to_delete.extend(ChildKindA.all(keys_only=True).ancestor(entity).fetch(100))
    to_delete.extend(ChildKindB.all(keys_only=True).ancestor(entity).fetch(100))
db.delete(to_delete)

The results from Appstats:

service.call           #RPCs  real time  api time
datastore_v3.RunQuery  22     352ms      555ms
datastore_v3.Delete    1      366ms      132825ms
taskqueue.BulkAdd      1      7ms        0ms

The Delete call is the single most expensive part of the operation!

Is there a way around this? Nick Johnson mentioned that using the bulk delete handler is the fastest way to delete at present, but ideally we don't want to delete all entities of a kind, just the ones that match, and are children of, our initial bar = foo query.

7
задан Brian Tompsett - 汤莱恩 11 July 2015 в 23:31
поделиться