Question

I have a cube column named embedding in a documents table. It stores a vectorized (TF-IDF) representation of a text field, which I converted to dense format. I created a GiST index on the column, but I am having trouble with query performance. This query takes ~20 seconds (~5MM rows on a 32GB machine):

select id 
from documents 
where embedding <-> cube('(0.08470847,...,0.06106149)') < 0.25 
order by embedding <-> cube('(0.08470847,...,0.06106149)') asc 
limit 25

The same query without the order by performs within milliseconds.

I am not sure how to improve the ordering performance.

I ran explain analyze on the query and this is the result:

Limit  (cost=0.54..323.63 rows=25 width=12) (actual time=18032.104..18704.827 rows=25 loops=1)
  ->  Index Scan using ix_100 on documents  (cost=0.54..22895274.16 rows=1771566 width=12) (actual time=18032.101..18704.797 rows=25 loops=1)
        Order By: (embedding <-> '(0.084708469999999994,... , 0.061061490000000003)'::cube)
        Filter: ((embedding <-> '(0.084708469999999994,... , 0.061061490000000003)'::cube) < '0.25'::double precision)
Planning Time: 1.575 ms
Execution Time: 18728.073 ms

I am at a loss as to how to proceed from here. I want to avoid sorting the results in the application layer after fetching them; ideally, this should work within the database.

Any ideas?

Edit: adding the explain (analyze, buffers) output for the query with limit and without the order by

query:

explain (analyze, buffers) 
select id 
from documents 
where embedding <-> cube('(0.08470847,..,0.06106149)') < 0.25 
limit 10;

with this output:

Limit  (cost=0.00..7.73 rows=10 width=4) (actual time=0.036..0.076 rows=10 loops=1)
  Buffers: shared hit=5
  ->  Seq Scan on documents  (cost=0.00..1370989.16 rows=1772915 width=4) (actual time=0.034..0.072 rows=10 loops=1)
        Filter: ((embedding <-> '(0.084708469999999994..., 0.061061490000000003)'::cube) < '0.25'::double precision)
        Rows Removed by Filter: 10
        Buffers: shared hit=5
Planning Time: 0.107 ms
Execution Time: 0.098 ms 

Edit 2:

I modified the query per the last update, and results are back to ~20 seconds per query:

Limit  (cost=0.54..323.56 rows=25 width=12) (actual time=727.488..21603.571 rows=25 loops=1)
  Buffers: shared read=1352076
  ->  Index Scan using ix_100 on documents  (cost=0.54..22910761.65 rows=1773159 width=12) (actual time=727.485..21603.535 rows=25 loops=1)
        Order By: (embedding <-> '(0.0665496899999999947, ... 0.063358020000000001)'::cube)
        Filter: ((embedding <-> '(0.0665496899999999947, ... 0.063358020000000001)'::cube) < '0.25'::double precision)
        Buffers: shared read=1352076
Planning Time: 0.164 ms
Execution Time: 21644.516 ms

Solution

Sorting is applied to the values the query returns.

Here you have an index on the embedding column, but you are sorting on the result of embedding <-> cube('(0.08470847,...,0.06106149)'), which is not indexed.

So first retrieve the required rows with a sub-query, then sort them.

select id, EDistance
from (
    select id, embedding <-> cube('(0.08470847,...,0.06106149)') as EDistance
    from documents
    where embedding <-> cube('(0.08470847,...,0.06106149)') < 0.25
    limit 25
) t
order by EDistance asc
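
One caveat with limiting inside the sub-query: the inner limit keeps whichever 25 matching rows the scan reaches first, so the outer sort orders those 25 rows but they are not necessarily the 25 nearest. If the true nearest rows are needed, a possible variant is to collect every row under the threshold first and apply the limit only after sorting. The sketch below assumes PostgreSQL 12 or later (for the MATERIALIZED keyword); the CTE name candidates is made up for illustration, and the vector literal is elided exactly as in the question. It computes the distance for every row, so whether it beats the ~20 second index scan has to be checked with explain (analyze, buffers).

```sql
-- Sketch only: filter first, then sort and limit the filtered set.
-- MATERIALIZED (PostgreSQL 12+) keeps the planner from folding the CTE
-- back into the outer query, so the outer ORDER BY is not pushed down
-- into the scan of documents.
with candidates as materialized (
    select id,
           embedding <-> cube('(0.08470847,...,0.06106149)') as edistance
    from documents
    where embedding <-> cube('(0.08470847,...,0.06106149)') < 0.25
)
select id, edistance
from candidates
order by edistance asc   -- sort only the rows that passed the filter
limit 25;                -- limit after sorting, so these are the 25 nearest
```

The planner estimates in the question put roughly 1.77 million rows under the 0.25 threshold, so the sort here works on a large intermediate set; tightening the threshold would shrink it.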

Thanks!

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange