We've got a reasonably-sized database on Google App Engine - just over 50,000 entities - that we want to clear out stale data from. The plan was to write a deferred task to iterate over the entities we no longer wanted, and delete them in batches.
One complication is that our entities also have child entities that we also want to purge -- no problem, we thought; we'd just query the datastore for those entities, and drop them at the same time as the parent:
query = ParentKind.all()
query.count(100)
query.filter('bar =', 'foo')
to_delete = []
for entity in enumerate(query):
to_delete.append(entity)
to_delete.extend(ChildKindA.all().ancestor(entity).fetch(100))
to_delete.extend(ChildKindB.all().ancestor(entity).fetch(100))
db.delete(to_delete)
We limited ourselves to deleting 100 ParentKind
entities at a time; each ParentKind
had around 40 child ChildKindA
and ChildKindB
entities total - perhaps 4000 entities.
This seemed reasonable at the time, but we ran one batch as a test, and the resulting query took 9 seconds to run -- and spent 1933 seconds in billable CPU time accessing the datastore.
This seems pretty harsh -- 0.5 billable seconds per entity! -- but we're not entirely sure what we're doing wrong. Is it simply the size of the batch? Are ancestor queries particularly slow? Or, are deletes (and indeed, all datastore accesses) simply slow as molasses?
We changed our queries to be keys_only
, and while that reduced the time to run one batch to 4.5 real seconds, it still cost ~1900 seconds in CPU time.
Next, we installed Appstats to our app (thanks, kevpie) and ran a smaller sized batch -- 10 parent entities, which would amount to ~450 entities total. Here's the updated code:
query = ParentKind.all(keys_only=True)
query.count(10)
query.filter('bar =', 'foo')
to_delete = []
for entity in enumerate(query):
to_delete.append(entity)
to_delete.extend(ChildKindA.all(keys_only=True).ancestor(entity).fetch(100))
to_delete.extend(ChildKindB.all(keys_only=True).ancestor(entity).fetch(100))
db.delete(to_delete)
The results from Appstats:
service.call #RPCs real time api time
datastore_v3.RunQuery 22 352ms 555ms
datastore_v3.Delete 1 366ms 132825ms
taskqueue.BulkAdd 1 7ms 0ms
The Delete
call is the single most expensive part of the operation!
Is there a way around this? Nick Johnson mentioned that using the bulk delete handler is the fastest way to delete at present, but ideally we don't want to delete all entities of a kind, just the ones that match, and are children of, our initial bar = foo
query.