If you want an efficient system, you'll need to break down the n-grams ahead of time and index them. When I wrote the 5-Gram Experiment (unfortunately the backend is offline now, as I had to give back the hardware), I created a map of word => integer id and then stored each n-gram as a hex-encoded id sequence in the document key (_id) field of a MongoDB collection (for example, [10, 2] => "a:2"). Randomly distributing the ~350 million 5-grams across 10 machines running MongoDB gave sub-second query times over the whole data set.
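A minimal sketch of that id-mapping and key-encoding step, in Python (the names are illustrative, not the original code):

    word_ids = {}  # word -> integer id, assigned in order of first appearance

    def word_id(word):
        return word_ids.setdefault(word, len(word_ids))

    def ngram_key(words):
        # Encode each word's integer id as lowercase hex and join with ':',
        # e.g. ids [10, 2] -> "a:2".
        return ":".join(format(word_id(w), "x") for w in words)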
You can use a similar scheme. With a document such as:
{_id: "a:2", seen: [docId1, docId2, ...]}
you'll be able to look up every document in which a given n-gram appears.
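Roughly like this with pymongo, reusing ngram_key from the sketch above (collection and field names are assumptions, not taken from the original system):

    from pymongo import MongoClient

    coll = MongoClient()["ngrams"]["fivegrams"]

    def index_ngram(words, doc_id):
        # Upsert: add doc_id to the n-gram's "seen" list, creating the
        # document if this n-gram hasn't been indexed yet.
        coll.update_one({"_id": ngram_key(words)},
                        {"$addToSet": {"seen": doc_id}},
                        upsert=True)

    def lookup(words):
        doc = coll.find_one({"_id": ngram_key(words)})
        return doc["seen"] if doc else []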
Update: a small correction: in the system that went live I ended up using the same scheme, but encoded the n-gram keys in a binary format for space efficiency (~350M is a lot of 5-grams!); otherwise the mechanics were all the same.
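The original binary format isn't something I can reproduce here, but one possible encoding is simply to pack the integer ids as fixed-width integers, for example:

    import struct

    def binary_key(words):
        # Pack the five ids as 32-bit big-endian unsigned integers,
        # i.e. 20 bytes per 5-gram instead of a hex string.
        ids = [word_id(w) for w in words]
        return struct.pack(">%dI" % len(ids), *ids)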