Most efficient way to match a certain number of items in a db.Model ListProperty
-
20-09-2019 - |
Question
In reference to this different but not unrelated question I will borrow the example models.
class Foo(db.Model): bars = db.ListProperty(db.Key)
class Bar(db.Model): pass
If I have a certain Foo entity and I want to get all of the other foo entities also containing a certain bar Key in its bars ListProperty, I would use the following query:
related_foos = Foo.all().filter('bars', bar_entity).fetch(fetch_count)
What about if I want to find all other entities of model kind Foo that have at least N number of matching bar entities? The obvious way to do this with a for-loop would involve drastic inefficiencies, and it might be best to actually change the model itself to make this easier, but it doesn't seem obvious how to do so.
Solution
Given a foo record that has 10 bar_entities and looking for all foo records that have at least 2 of these 10 entities would result in 45 possible equality values 10!/(2!*(10-2)!)=45.
This can be deduced in 10_C_(2-1)=10 reads.
SELECT * from table WHERE bar="1" AND bar in ["2", "3", "4", "5", "6", "7", "8", "9", "0"]
SELECT * from table WHERE bar="2" AND bar in ["3", "4", "5", "6", "7", "8", "9", "0"]
SELECT * from table WHERE bar="3" AND bar in ["4", "5", "6", "7", "8", "9", "0"]
etc.
To reduce this to one read would require that when a foo record is added you populate a separate table that had all the 2 combinations for a given record.
Say you had
foo_table
foo1 [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
foo2 [1, 3, 4]
foo3 [1, 2, a]
foo4 [b, 6, c]
foo_combo_2_table
Parent Combination
foo1 12
foo1 13
... and all 45 foo1 combinations each in its own row
foo2 13
foo2 14
foo2 34
foo3 12
foo3 1a
foo3 2a
etc.
Now you can do a
indexes = SELECT __KEY__ from foo_combo_2_table WHERE combination IN [12, 13, 14, 15, ... all 45]
keys = [k.parent() for k in indexes] # you would need to filter for duplicates
This way you wont get into any exploding index issues.
If you also wanted to do any 3 or any 4 entities than for each of these you would need to create a foo_combo_n_table or do a 10_C_(n-1) number of reads.
OTHER TIPS
You can simply apply the same filter repeatedly:
related_foos = Foo.all().filter('bars', bar_entity).filter('bars', bar_entity_2).fetch(fetch_count)
Or, data-driven:
q = Foo.all()
for bar in bar_entities:
q.filter('bars', bar)
related_foos = q.fetch(fetch_count)
If you don't apply any inequalities or sort orders to the query, the datastore will be able to execute the queries using the built in indexes and the merge join strategy, regardless of how many filters you apply. If you need an inequality or sort order, however, you'll need to have an index for each number of bars you might want to filter on, which leads to exploding indexes (and so is best avoided!)