Can PostgreSQL exploit “accidental” clustering in plans?
11-03-2021
Question
Suppose I have an append-only table with columns customer_id (a randomly generated string) and x, and lookups are always done on customer_id.
Say the data looks like the example below: we get a batch of rows when a customer first signs up for something, and then never again for that customer.
customer_id=XCVFY0001, x=...
customer_id=XCVFY0001, x=...
(continues for ~1 page with same customer_id)
customer_id=HUMBN0001, x=...
customer_id=HUMBN0001, x=...
(continues for ~1 page with same customer_id)
(and so on...)
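To make the pattern concrete, here is a sketch that builds a toy table with this "accidental" clustering. The table and column names (signup_events, etc.) and the batch size are invented for illustration:

```sql
-- Hypothetical toy table reproducing the layout above: rows arrive in
-- batches per customer, so each customer's rows land on adjacent heap
-- pages even though the customer_id values themselves are random.
CREATE TABLE signup_events (
    customer_id text NOT NULL,
    x           integer
);

-- A few dozen rows per customer: roughly one 8 kB heap page each.
-- ORDER BY keeps each customer's batch together in insertion order.
INSERT INTO signup_events (customer_id, x)
SELECT md5(c::text), r
FROM generate_series(1, 10000) AS c,
     generate_series(1, 50)    AS r
ORDER BY c;
```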
So while customer_id's alphabetical ordering isn't correlated with the physical row order, we could make statements like:
- there are few distinct customer IDs per page
- there are few pages per ID
- there are long "runs" of IDs; if you need one customer_id, you're likely to find it on a few contiguous pages
- in terms of information theory, I think they'd say there's no correlation, but there is high "mutual information"
Can the query planner use information like this in estimates, if one doesn't explicitly run CLUSTER? I'd assume that if there's low correlation as reported in pg_stats, it would guess that rows are uniformly distributed throughout the pages, and might be pessimistic about various plans.
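One can inspect what the planner has actually recorded for the column. A sketch, assuming the hypothetical table name signup_events:

```sql
-- correlation near 0 tells the planner the values are scattered
-- randomly across the heap; n_distinct is its distinct-value estimate
-- (negative values are a fraction of the row count).
SELECT attname, correlation, n_distinct
FROM pg_stats
WHERE tablename = 'signup_events'
  AND attname = 'customer_id';
```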
(In my real-world analogue, a plain non-clustered index made things nice and fast anyway; I just got curious when I noticed the pattern in the data.)
Answer
The planner is unaware of this type of clustering, so it can't make decisions based on it.
The two-step sampling method used by ANALYZE can generate skewed samples in this situation, possibly leading to a drastic underestimate of n_distinct. It is hard to predict what the consequences of that might be without digging into the details of individual queries.
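To see whether ANALYZE's sample was skewed on a table like this, one can compare its n_distinct estimate with the true distinct count. A sketch, again assuming the hypothetical signup_events table:

```sql
ANALYZE signup_events;

-- A negative n_distinct is stored as a fraction of reltuples, so
-- convert it to an absolute estimate before comparing.
SELECT
    CASE WHEN s.n_distinct >= 0 THEN s.n_distinct
         ELSE -s.n_distinct * c.reltuples
    END                                    AS estimated_distinct,
    (SELECT count(DISTINCT customer_id)
     FROM signup_events)                   AS true_distinct
FROM pg_stats s
JOIN pg_class c ON c.relname = s.tablename
WHERE s.tablename = 'signup_events'
  AND s.attname   = 'customer_id';
```

If the estimate comes out far below the true count, raising the column's statistics target (ALTER TABLE ... ALTER COLUMN ... SET STATISTICS) before re-running ANALYZE enlarges the sample and usually narrows the gap.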