What is the overhead for varchar(n)?

https://dba.stackexchange.com/questions/125499

29-09-2020

Question

I would like to understand this passage from the Postgres documentation on the varchar(n) type:

The storage requirement for a short string (up to 126 bytes) is 1 byte plus the actual string, which includes the space padding in the case of character. Longer strings have 4 bytes of overhead instead of 1.

Let's assume I have a varchar(255) field, and consider the following statements:

If the field holds a string of 10 bytes, the overhead is 1 byte, so the string uses 11 bytes.
If the field holds a string of 140 bytes, the overhead is 4 bytes, so the string uses 144 bytes.

Are these statements true? One answer reads the documentation the same way I do, but another states that the overhead is always 4 bytes.

Solution

Unsurprisingly, the manual is right. But there is more to it.

For one, the size on disk (in any table, even when not actually stored on disk) can differ from the size in memory. On disk, the overhead for short varchar values of up to 126 bytes is reduced to 1 byte, as stated in the manual. But the overhead in memory is always 4 bytes (once individual values are extracted).
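The rule from the manual can be written down as a small calculation. The Python sketch below is only an illustration: the function name `varlena_size` and its `on_disk` flag are made up here, and it ignores compression and out-of-line (TOAST) storage, which can change the stored size of larger values.

```python
def varlena_size(data_bytes: int, on_disk: bool = True) -> int:
    """Estimate the size of a text/varchar value in bytes.

    Hypothetical helper, not a PostgreSQL API. Assumes the value is
    neither compressed nor moved out of line (TOAST).
    """
    SHORT_LIMIT = 126  # longest string that still gets the 1-byte header
    if on_disk and data_bytes <= SHORT_LIMIT:
        return data_bytes + 1   # short 1-byte header
    return data_bytes + 4       # regular 4-byte header


print(varlena_size(10))                 # 10-byte string in a table: 11
print(varlena_size(140))                # 140-byte string in a table: 144
print(varlena_size(3, on_disk=False))   # 3-byte string in memory: 7
```

Under these assumptions, both statements in the question hold for values stored in a table.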

The same is true for text, varchar, varchar(n) or char(n) - except that char(n) is blank-padded to n characters and you normally don't want to use it. Its effective size can still vary in multi-byte encodings because n denotes a maximum number of characters, not bytes:

strings up to n characters (not bytes) in length.

All of them use varlena internally.
"char" (with double-quotes) is a different creature and always occupies a single byte.
Untyped string literals ('foo') have a single byte overhead. Not to be confused with typed values!
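To make the distinction between characters and bytes for char(n) concrete, here is a minimal Python sketch. It assumes a UTF8 database encoding and imitates the blank-padding with `str.ljust`. The helper name `char_n_payload` is made up for this illustration, and the result excludes the varlena header.

```python
def char_n_payload(value: str, n: int) -> int:
    """Bytes of string data for a char(n) value, assuming UTF-8.

    Hypothetical helper: pads with blanks to n characters, as char(n)
    does, then counts the encoded bytes (header not included).
    """
    if len(value) > n:
        raise ValueError("value longer than n characters")
    return len(value.ljust(n).encode("utf-8"))


print(char_n_payload("abc", 5))                   # 5 characters, 5 bytes
print(char_n_payload("\u00e4\u00f6\u00fc", 5))    # 5 characters, 8 bytes
```

The second value has the same number of characters, but each umlaut takes 2 bytes in UTF-8, so the same char(5) column holds 8 bytes of data.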

Test with pg_column_size().

CREATE TEMP TABLE t (id int, v_small varchar, v_big varchar);
INSERT INTO t VALUES (1, 'foo', '12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890');

SELECT pg_column_size(id)        AS id
     , pg_column_size(v_small)   AS v_small
     , pg_column_size(v_big)     AS v_big
     , pg_column_size(t)         AS t
FROM   t
UNION ALL  -- 2nd row measuring values in RAM
SELECT pg_column_size(1)
     , pg_column_size('foo'::varchar)
     , pg_column_size('12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890'::varchar)
     , pg_column_size(ROW(1, 'foo'::varchar, '12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890'::varchar));

 id | v_small | v_big |  t
----+---------+-------+-----
  4 |       4 |   144 | 176
  4 |       7 |   144 | 176

As you can see:

The 3-byte string 'foo' occupies 4 bytes on disk and 7 bytes in RAM (so 1 byte vs. 4 bytes of overhead).
The 140-byte string '123...' occupies 144 bytes both on disk and in RAM (so always 4 bytes of overhead).
Storage of integer has no overhead (but it has alignment requirements that can impose padding).
The row has an additional overhead of 24 bytes for the tuple header (plus 4 bytes per tuple for the item identifier in the page, which is stored outside the tuple).
And last but not least: The overhead of the small varchar is still just 1 byte while it has not been extracted from the row - as can be seen from the row size. (That's why it's sometimes a bit faster to select whole rows.)
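The row size of 176 bytes in the result can be reproduced by hand. The Python sketch below adds up the pieces under stated assumptions: a 24-byte tuple header, 4-byte alignment for integer and for varchar values with a 4-byte header, and no alignment for varchar values with a 1-byte header. The helper `align` is made up for this illustration.

```python
def align(offset: int, boundary: int) -> int:
    """Round offset up to the next multiple of boundary."""
    return (offset + boundary - 1) // boundary * boundary


offset = 24                      # tuple header
offset = align(offset, 4) + 4    # id: integer, 4 bytes
offset += 1 + 3                  # v_small: 1-byte header + 'foo'
offset = align(offset, 4)        # v_big has a 4-byte header, so align it
offset += 4 + 140                # v_big: 4-byte header + 140 bytes
print(offset)                    # 176
```

With a 4-byte header on v_small the total would no longer match, which is consistent with the small varchar keeping its 1-byte header inside the row.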

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange