Can PostgreSQL exploit “accidental” clustering in plans?
11-03-2021
Question
Suppose I have an append-only table with columns customer_id (a randomly generated string) and x, and lookups are always done on customer_id.
Say the data looks like the example below: we get a batch of rows when a customer first signs up for something, and then never again for that customer.
customer_id=XCVFY0001, x=...
customer_id=XCVFY0001, x=...
(continues for ~1 page with same customer_id)
customer_id=HUMBN0001, x=...
customer_id=HUMBN0001, x=...
(continues for ~1 page with same customer_id)
(and so on...)
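To make the pattern concrete, here is a sketch that builds a toy table with this "accidental" clustering. The table and column names (signup_events, etc.) and the batch size are invented for illustration:

```sql
-- Hypothetical toy table reproducing the layout above: rows arrive in
-- batches per customer, so each customer's rows land on adjacent heap
-- pages even though the customer_id values themselves are random.
CREATE TABLE signup_events (
    customer_id text NOT NULL,
    x           integer
);

-- A few dozen rows per customer: roughly one 8 kB heap page each.
-- ORDER BY keeps each customer's batch together in insertion order.
INSERT INTO signup_events (customer_id, x)
SELECT md5(c::text), r
FROM generate_series(1, 10000) AS c,
     generate_series(1, 50)    AS r
ORDER BY c;
```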
So while customer_id's alphabetical ordering isn't correlated with the physical row order, we could make statements like:
- there are few distinct customer IDs per page
- there are few pages per ID
- there are long "runs" of IDs; if you need one customer_id, you're likely to find it on a few contiguous pages
- in terms of information theory, I think they'd say there's no correlation, but there is high "mutual information"
Can the query planner use information like this in estimates, if one doesn't explicitly run CLUSTER? I'd assume that if there's low correlation as reported in pg_stats, it would guess that rows are uniformly distributed throughout the pages, and might be pessimistic about various plans.
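One can inspect what the planner has actually recorded for the column. A sketch, assuming the hypothetical table name signup_events:

```sql
-- correlation near 0 tells the planner the values are scattered
-- randomly across the heap; n_distinct is its distinct-value estimate
-- (negative values are a fraction of the row count).
SELECT attname, correlation, n_distinct
FROM pg_stats
WHERE tablename = 'signup_events'
  AND attname = 'customer_id';
```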
(In my real-world analogue, a plain non-clustered index made things nice and fast anyway; I just got curious when I noticed the pattern in the data.)
Answer
The planner is unaware of this type of clustering, so it can't make decisions based on it.
The two-step sampling method used by ANALYZE can generate skewed samples in this situation, possibly leading to a drastic underestimate of n_distinct. It is hard to predict what the consequences of that might be without digging into the details of individual queries.
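To see whether ANALYZE's sample was skewed on a table like this, one can compare its n_distinct estimate with the true distinct count. A sketch, again assuming the hypothetical signup_events table:

```sql
ANALYZE signup_events;

-- A negative n_distinct is stored as a fraction of reltuples, so
-- convert it to an absolute estimate before comparing.
SELECT
    CASE WHEN s.n_distinct >= 0 THEN s.n_distinct
         ELSE -s.n_distinct * c.reltuples
    END                                    AS estimated_distinct,
    (SELECT count(DISTINCT customer_id)
     FROM signup_events)                   AS true_distinct
FROM pg_stats s
JOIN pg_class c ON c.relname = s.tablename
WHERE s.tablename = 'signup_events'
  AND s.attname   = 'customer_id';
```

If the estimate comes out far below the true count, raising the column's statistics target (ALTER TABLE ... ALTER COLUMN ... SET STATISTICS) before re-running ANALYZE enlarges the sample and usually narrows the gap.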