Indexing of a PostgreSQL bitstring (up to 20,000 bits each)

https://dba.stackexchange.com/questions/198711

25-12-2020
|

سؤال

I am building a table that contains chemical compounds (many millions of rows) and of those compounds certain predetermined features/fragments are flagged in a fixed length bitstring. This bitstring will have between 2000 and 20000 bits, further research needs to be done to determine more precise figures there. When searching for compounds that have certain specific features, or lack specific features, a search is done on a select subset of this bitstring. This can be a different subset every time.

Is there an index type that can make these searches efficient in PostgreSQL (9.6 or 10)?

Insertions are infrequent and done in a batch process, whereas searches are the mostly used operation and should preferably be quick and not have false positives or false negatives.

To me this vaguely sounds like work for a GIN index, but my understanding of this index type is not enough to say for sure if that is realy the case.

Actually there could be another solution, and that is to create a separate 'fragment_index' table, with fragment identifiers (as they have a fixed position in the bitstring they also have a numerical identifier) + compound-id pairs. My worry there is that that table can become enormous (20M compounds with avg 50 hits on fragments = 1G rows) and multiple joins against it (one for each fragment) where that join can also return up to 80% of matches with the compounds table (in some cases this is likely) will not perform at all.

I would be gratefull to get any suggestions towards a direction to get this on the road.

update: i tried a GIN index using the trigram module on a varchar array with codified shortcodes, it gave mixed results, mainly depending on the amount of data left over after the filter operations.

For the sake of giving examples that have any meaning let's assume the table looks like this:

CREATE TABLE testcompounds (
 id serial primary key,
 cd_structure text,
 features_as_text varchar(128),
 features_as_bits bit varying(32)
);

CREATE INDEX flags_testcompounds on testcompounds using gin (features_as_text gin_trgm_ops);


CREATE TABLE fragments (
 id serial primary key,
 smarts text,
 keystring varchar(4),
 frequency int 
);


insert into fragments (keystring,smarts) values('AAA', '*=O');
insert into fragments (keystring,smarts) values('AAB', '[#6]=O');
insert into fragments (keystring,smarts) values('AAC', '[#7]=O');
...
insert into fragments (keystring,smarts) values('AAN', '[#6]-F');
insert into fragments (keystring,smarts) values('AAO', '[#6]-Cl');
insert into fragments (keystring,smarts) values('AAP', '[#6]-Br');
...
etc.

and the features_as_text and features_as_bits fields have been fully populated.

An example of a query that can be ran on this would be:

select id, cd_structure from testcompounds 
where 
    (features_as_bits & (B'00000000000000000000000000000001' << (2-1)) = (B'00000000000000000000000000000001' << (2-1))) 

AND (features_as_bits & (B'00000000000000000000000000000001' << (18-1)) = (B'00000000000000000000000000000000')) 
AND (features_as_bits & (B'00000000000000000000000000000001' << (19-1)) = (B'00000000000000000000000000000000')) 
AND (features_as_bits & (B'00000000000000000000000000000001' << (5-1)) = (B'00000000000000000000000000000000')) 
AND (features_as_bits & (B'00000000000000000000000000000001' << (6-1)) = (B'00000000000000000000000000000000')) 
AND (features_as_bits & (B'00000000000000000000000000000001' << (7-1)) = (B'00000000000000000000000000000000')) 
AND (features_as_bits & (B'00000000000000000000000000000001' << (8-1)) = (B'00000000000000000000000000000000')) 
AND (features_as_bits & (B'00000000000000000000000000000001' << (9-1)) = (B'00000000000000000000000000000000'))

In words: get everything that has feature 2, and does not have any of features 18,19,5,6,7,8,9

المحلول

I did test some index strategies for bitwise operation. A little long procedure, sorry.

First of all,

This is just a test on my RDS environment. Your database performance depends on your environment and data (CPU, Memory, Disk, number of rows in your table, cardinality of your column and so on)
I am not familiar with PostgreSQL. Especially, I am not sure of which bit length of bloom index is best. (Could someone help me ?)
Generally, small index size is preferred. So, table partitioning may be a good solution.

Environment

version: PostgreSQL 10.3
host: A new instance of Amazon RDS db.m4.large (vCPU:2, Memory:8GB, Disk:GP2-SSD 100GB)

table/index definition

Create extensions

CREATE EXTENSION IF NOT EXISTS intarray;
CREATE EXTENSION IF NOT EXISTS bloom;

Create table

CREATE TABLE IF NOT EXISTS t_bitwise (
  id BIGINT
  ,c_int INTEGER
  ,c_bit BIT(31)
  ,c_int_arr _int4
  ,c_bool_0 BOOLEAN ,c_bool_1 BOOLEAN ,c_bool_2 BOOLEAN ,c_bool_3 BOOLEAN ,c_bool_4 BOOLEAN
  ,c_bool_5 BOOLEAN ,c_bool_6 BOOLEAN ,c_bool_7 BOOLEAN ,c_bool_8 BOOLEAN ,c_bool_9 BOOLEAN
  ,c_bool_10 BOOLEAN ,c_bool_11 BOOLEAN ,c_bool_12 BOOLEAN ,c_bool_13 BOOLEAN ,c_bool_14 BOOLEAN
  ,c_bool_15 BOOLEAN ,c_bool_16 BOOLEAN ,c_bool_17 BOOLEAN ,c_bool_18 BOOLEAN ,c_bool_19 BOOLEAN
  ,c_bool_20 BOOLEAN ,c_bool_21 BOOLEAN ,c_bool_22 BOOLEAN ,c_bool_23 BOOLEAN ,c_bool_24 BOOLEAN
  ,c_bool_25 BOOLEAN ,c_bool_26 BOOLEAN ,c_bool_27 BOOLEAN ,c_bool_28 BOOLEAN ,c_bool_29 BOOLEAN
  ,c_bool_30 BOOLEAN
  ,PRIMARY KEY (id)
);

Create one btree index on each boolean column

CREATE INDEX IF NOT EXISTS idx_btree_on_bool_1 ON t_bitwise (c_bool_1);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_2 ON t_bitwise (c_bool_2);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_3 ON t_bitwise (c_bool_3);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_4 ON t_bitwise (c_bool_4);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_5 ON t_bitwise (c_bool_5);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_6 ON t_bitwise (c_bool_6);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_7 ON t_bitwise (c_bool_7);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_8 ON t_bitwise (c_bool_8);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_9 ON t_bitwise (c_bool_9);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_10 ON t_bitwise (c_bool_10);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_11 ON t_bitwise (c_bool_11);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_12 ON t_bitwise (c_bool_12);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_13 ON t_bitwise (c_bool_13);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_14 ON t_bitwise (c_bool_14);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_15 ON t_bitwise (c_bool_15);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_16 ON t_bitwise (c_bool_16);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_17 ON t_bitwise (c_bool_17);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_18 ON t_bitwise (c_bool_18);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_19 ON t_bitwise (c_bool_19);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_20 ON t_bitwise (c_bool_20);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_21 ON t_bitwise (c_bool_21);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_22 ON t_bitwise (c_bool_22);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_23 ON t_bitwise (c_bool_23);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_24 ON t_bitwise (c_bool_24);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_25 ON t_bitwise (c_bool_25);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_26 ON t_bitwise (c_bool_26);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_27 ON t_bitwise (c_bool_27);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_28 ON t_bitwise (c_bool_28);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_29 ON t_bitwise (c_bool_29);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_30 ON t_bitwise (c_bool_30);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_31 ON t_bitwise (c_bool_31);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_32 ON t_bitwise (c_bool_32);

Create one composite btree index on bool columns

CREATE INDEX IF NOT EXISTS idx_btree_on_composite_bool ON t_bitwise (
  c_bool_32 ,c_bool_31
  ,c_bool_30 ,c_bool_29 ,c_bool_28 ,c_bool_27 ,c_bool_26 ,c_bool_25 ,c_bool_24 ,c_bool_23 ,c_bool_22 ,c_bool_21
  ,c_bool_20 ,c_bool_19 ,c_bool_18 ,c_bool_17 ,c_bool_16 ,c_bool_15 ,c_bool_14 ,c_bool_13 ,c_bool_12 ,c_bool_11
  ,c_bool_10 ,c_bool_9 ,c_bool_8 ,c_bool_7 ,c_bool_6 ,c_bool_5 ,c_bool_4 ,c_bool_3 ,c_bool_2 ,c_bool_1
)
;

Create one bloom index on bool columns

CREATE INDEX IF NOT EXISTS idx_bloom_on_bool ON t_bitwise
  USING bloom (
    CAST(c_bool_1 AS INTEGER) ,CAST(c_bool_2 AS INTEGER) ,CAST(c_bool_3 AS INTEGER) ,CAST(c_bool_4 AS INTEGER) ,CAST(c_bool_5 AS INTEGER)
    ,CAST(c_bool_6 AS INTEGER) ,CAST(c_bool_7 AS INTEGER) ,CAST(c_bool_8 AS INTEGER) ,CAST(c_bool_9 AS INTEGER) ,CAST(c_bool_10 AS INTEGER)
    ,CAST(c_bool_11 AS INTEGER) ,CAST(c_bool_12 AS INTEGER) ,CAST(c_bool_13 AS INTEGER) ,CAST(c_bool_14 AS INTEGER) ,CAST(c_bool_15 AS INTEGER)
    ,CAST(c_bool_16 AS INTEGER) ,CAST(c_bool_17 AS INTEGER) ,CAST(c_bool_18 AS INTEGER) ,CAST(c_bool_19 AS INTEGER) ,CAST(c_bool_20 AS INTEGER)
    ,CAST(c_bool_21 AS INTEGER) ,CAST(c_bool_22 AS INTEGER) ,CAST(c_bool_23 AS INTEGER) ,CAST(c_bool_24 AS INTEGER) ,CAST(c_bool_25 AS INTEGER)
    ,CAST(c_bool_26 AS INTEGER) ,CAST(c_bool_27 AS INTEGER) ,CAST(c_bool_28 AS INTEGER) ,CAST(c_bool_29 AS INTEGER) ,CAST(c_bool_30 AS INTEGER)
    ,CAST(c_bool_31 AS INTEGER) ,CAST(c_bool_32 AS INTEGER)
  ) WITH (length=128)
;

Create one bloom index on int column

CREATE OR REPLACE FUNCTION bitcheck_int(INTEGER, INTEGER)
RETURNS INTEGER
LANGUAGE SQL
IMMUTABLE
SECURITY INVOKER
PARALLEL SAFE
COST 0.01
AS $$
  SELECT CASE
    WHEN $1 & (1 << ($2 - 1)) > 0 THEN $2
    ELSE -$2
  END
$$
;
CREATE INDEX IF NOT EXISTS idx_bloom_on_int ON t_bitwise
  USING bloom (
    bitcheck_int(c_int, 1)
    ,bitcheck_int(c_int, 2)
    ,bitcheck_int(c_int, 3)
    ,bitcheck_int(c_int, 4)
    ,bitcheck_int(c_int, 5)
    ,bitcheck_int(c_int, 6)
    ,bitcheck_int(c_int, 7)
    ,bitcheck_int(c_int, 8)
    ,bitcheck_int(c_int, 9)
    ,bitcheck_int(c_int, 10)
    ,bitcheck_int(c_int, 11)
    ,bitcheck_int(c_int, 12)
    ,bitcheck_int(c_int, 13)
    ,bitcheck_int(c_int, 14)
    ,bitcheck_int(c_int, 15)
    ,bitcheck_int(c_int, 16)
    ,bitcheck_int(c_int, 17)
    ,bitcheck_int(c_int, 18)
    ,bitcheck_int(c_int, 19)
    ,bitcheck_int(c_int, 20)
    ,bitcheck_int(c_int, 21)
    ,bitcheck_int(c_int, 22)
    ,bitcheck_int(c_int, 23)
    ,bitcheck_int(c_int, 24)
    ,bitcheck_int(c_int, 25)
    ,bitcheck_int(c_int, 26)
    ,bitcheck_int(c_int, 27)
    ,bitcheck_int(c_int, 28)
    ,bitcheck_int(c_int, 29)
    ,bitcheck_int(c_int, 20)
    ,bitcheck_int(c_int, 31)
    ,bitcheck_int(c_int, 32)
  ) WITH (length=128)
;

Create one bloom index on bit column

CREATE OR REPLACE FUNCTION bitcheck_bit(BIT(32), INTEGER)
RETURNS INTEGER
LANGUAGE SQL
IMMUTABLE
SECURITY INVOKER
PARALLEL SAFE
COST 0.01
AS $$
  SELECT CASE
    WHEN $1::INTEGER & (1 << ($2 - 1)) > 0 THEN $2
    ELSE -$2
  END
$$
;
CREATE INDEX IF NOT EXISTS idx_bloom_on_bit ON t_bitwise
  USING bloom (
    bitcheck_bit(c_bit, 1)
    ,bitcheck_bit(c_bit, 2)
    ,bitcheck_bit(c_bit, 3)
    ,bitcheck_bit(c_bit, 4)
    ,bitcheck_bit(c_bit, 5)
    ,bitcheck_bit(c_bit, 6)
    ,bitcheck_bit(c_bit, 7)
    ,bitcheck_bit(c_bit, 8)
    ,bitcheck_bit(c_bit, 9)
    ,bitcheck_bit(c_bit, 10)
    ,bitcheck_bit(c_bit, 11)
    ,bitcheck_bit(c_bit, 12)
    ,bitcheck_bit(c_bit, 13)
    ,bitcheck_bit(c_bit, 14)
    ,bitcheck_bit(c_bit, 15)
    ,bitcheck_bit(c_bit, 16)
    ,bitcheck_bit(c_bit, 17)
    ,bitcheck_bit(c_bit, 18)
    ,bitcheck_bit(c_bit, 19)
    ,bitcheck_bit(c_bit, 20)
    ,bitcheck_bit(c_bit, 21)
    ,bitcheck_bit(c_bit, 22)
    ,bitcheck_bit(c_bit, 23)
    ,bitcheck_bit(c_bit, 24)
    ,bitcheck_bit(c_bit, 25)
    ,bitcheck_bit(c_bit, 26)
    ,bitcheck_bit(c_bit, 27)
    ,bitcheck_bit(c_bit, 28)
    ,bitcheck_bit(c_bit, 29)
    ,bitcheck_bit(c_bit, 20)
    ,bitcheck_bit(c_bit, 31)
    ,bitcheck_bit(c_bit, 32)
  ) WITH (length=128)
;

Create one GIN index on int array column

CREATE INDEX IF NOT EXISTS idx_gin_on_int_arr ON t_bitwise
  USING GIN(c_int_arr gin__int_ops)
;

Create one GIN int index on int column

CREATE OR REPLACE FUNCTION convert_int_to_int_arr(INTEGER)
RETURNS _int4
LANGUAGE SQL
IMMUTABLE
STRICT
SECURITY INVOKER
PARALLEL SAFE
COST 0.01
AS $$
  SELECT ARRAY(
    SELECT
      CASE
        WHEN (1 << (i - 1)) & $1 > 0 THEN i
        ELSE -i
      END
    FROM generate_series(1, 32) AS i
  )
$$
;
CREATE INDEX IF NOT EXISTS idx_gin_int_on_int ON t_bitwise
  USING GIN(convert_int_to_int_arr(c_int) gin__int_ops)
;

Create one GIN int index on bit column

CREATE OR REPLACE FUNCTION convert_bit_to_int_arr(BIT(32))
RETURNS _int4
LANGUAGE SQL
IMMUTABLE
STRICT
SECURITY INVOKER
PARALLEL SAFE
COST 0.01
AS $$
  SELECT ARRAY(
    SELECT
      CASE
        WHEN (1 << (i - 1)) & $1::INTEGER > 0 THEN i
        ELSE -i
      END
    FROM generate_series(1, 32) AS i
  )
$$
;
CREATE INDEX IF NOT EXISTS idx_gin_int_on_bit ON t_bitwise
  USING GIN(convert_bit_to_int_arr(c_bit) gin__int_ops)
;

Insert

It takes long time.

INSERT INTO t_bitwise (
  id ,c_int ,c_bit ,c_int_arr
  ,c_bool_1 ,c_bool_2 ,c_bool_3 ,c_bool_4 ,c_bool_5 ,c_bool_6 ,c_bool_7 ,c_bool_8 ,c_bool_9 ,c_bool_10
  ,c_bool_11 ,c_bool_12 ,c_bool_13 ,c_bool_14 ,c_bool_15 ,c_bool_16 ,c_bool_17 ,c_bool_18 ,c_bool_19 ,c_bool_20
  ,c_bool_21 ,c_bool_22 ,c_bool_23 ,c_bool_24 ,c_bool_25 ,c_bool_26 ,c_bool_27 ,c_bool_28 ,c_bool_29 ,c_bool_30
  ,c_bool_31 ,c_bool_32
) SELECT
  id
  ,t.rand_int
  ,t.rand_int::BIT(32)
  /*
  ,ARRAY(
    SELECT
        (1 << (i - 1)) & t.rand_int
    FROM generate_series(1, 32) AS i
    WHERE
      ((1 << (i - 1)) & t.rand_int > 0)
  )
  */
  ,ARRAY(
    SELECT
        CASE
        WHEN (1 << (i - 1)) & t.rand_int > 0 THEN i
        ELSE -i
      END
    FROM generate_series(1, 32) AS i
  )
  ,(1 << 0) & t.rand_int > 0 ,(1 << 1) & t.rand_int > 0 ,(1 << 2) & t.rand_int > 0 ,(1 << 3) & t.rand_int > 0
  ,(1 << 4) & t.rand_int > 0 ,(1 << 5) & t.rand_int > 0 ,(1 << 6) & t.rand_int > 0 ,(1 << 7) & t.rand_int > 0
  ,(1 << 8) & t.rand_int > 0 ,(1 << 9) & t.rand_int > 0 ,(1 << 10) & t.rand_int > 0 ,(1 << 11) & t.rand_int > 0
  ,(1 << 12) & t.rand_int > 0 ,(1 << 13) & t.rand_int > 0 ,(1 << 14) & t.rand_int > 0 ,(1 << 15) & t.rand_int > 0
  ,(1 << 16) & t.rand_int > 0 ,(1 << 17) & t.rand_int > 0 ,(1 << 18) & t.rand_int > 0 ,(1 << 19) & t.rand_int > 0
  ,(1 << 20) & t.rand_int > 0 ,(1 << 21) & t.rand_int > 0 ,(1 << 22) & t.rand_int > 0 ,(1 << 23) & t.rand_int > 0
  ,(1 << 24) & t.rand_int > 0 ,(1 << 25) & t.rand_int > 0 ,(1 << 26) & t.rand_int > 0 ,(1 << 27) & t.rand_int > 0
  ,(1 << 28) & t.rand_int > 0 ,(1 << 29) & t.rand_int > 0 ,(1 << 30) & t.rand_int > 0 ,(1 << 31) & t.rand_int > 0
FROM (
  SELECT
    id
    ,(random() * 4294967295)::BIGINT::BIT(32)::INTEGER AS rand_int
  FROM generate_series(1, 30000000) AS id
) AS t
;

analyze

SELECT
  gin_clean_pending_list('idx_gin_on_int_arr')
  ,gin_clean_pending_list('idx_gin_int_on_int')
  ,gin_clean_pending_list('idx_gin_int_on_bit')
;
VACUUM FULL
;
ANALYZE t_bitwise
;
SELECT COUNT(*) FROM t_bitwise
;

index size

SELECT
    indexname,
    pg_size_pretty(pg_relation_size(quote_ident(t.tablename)::text)) AS table_size,
    pg_size_pretty(pg_relation_size(quote_ident(indexrelname)::text)) AS index_size
FROM pg_tables t
LEFT OUTER JOIN pg_class c ON t.tablename = c.relname
LEFT OUTER JOIN (
  SELECT
    c.relname AS ctablename, ipg.relname AS indexname, indexrelname
  FROM pg_index x
  JOIN pg_class c ON c.oid = x.indrelid
  JOIN pg_class ipg ON ipg.oid = x.indexrelid
  JOIN pg_stat_all_indexes psai ON x.indexrelid = psai.indexrelid
) AS foo ON t.tablename = foo.ctablename
WHERE t.schemaname = 'public' AND t.tablename = 't_bitwise' AND indexname != 't_bitwise_pkey'
;

result

          indexname          | table_size | index_size 
-----------------------------+------------+------------
 idx_btree_on_bool_1         | 6893 MB    | 643 MB
 idx_btree_on_bool_2         | 6893 MB    | 643 MB
 idx_btree_on_bool_3         | 6893 MB    | 643 MB
 idx_btree_on_bool_4         | 6893 MB    | 643 MB
 idx_btree_on_bool_5         | 6893 MB    | 643 MB
 idx_btree_on_bool_6         | 6893 MB    | 643 MB
 idx_btree_on_bool_7         | 6893 MB    | 643 MB
 idx_btree_on_bool_8         | 6893 MB    | 643 MB
 idx_btree_on_bool_9         | 6893 MB    | 643 MB
 idx_btree_on_bool_10        | 6893 MB    | 643 MB
 idx_btree_on_bool_11        | 6893 MB    | 643 MB
 idx_btree_on_bool_12        | 6893 MB    | 643 MB
 idx_btree_on_bool_13        | 6893 MB    | 643 MB
 idx_btree_on_bool_14        | 6893 MB    | 643 MB
 idx_btree_on_bool_15        | 6893 MB    | 643 MB
 idx_btree_on_bool_16        | 6893 MB    | 643 MB
 idx_btree_on_bool_17        | 6893 MB    | 643 MB
 idx_btree_on_bool_18        | 6893 MB    | 643 MB
 idx_btree_on_bool_19        | 6893 MB    | 643 MB
 idx_btree_on_bool_20        | 6893 MB    | 643 MB
 idx_btree_on_bool_21        | 6893 MB    | 643 MB
 idx_btree_on_bool_22        | 6893 MB    | 643 MB
 idx_btree_on_bool_23        | 6893 MB    | 643 MB
 idx_btree_on_bool_24        | 6893 MB    | 643 MB
 idx_btree_on_bool_25        | 6893 MB    | 643 MB
 idx_btree_on_bool_26        | 6893 MB    | 643 MB
 idx_btree_on_bool_27        | 6893 MB    | 643 MB
 idx_btree_on_bool_28        | 6893 MB    | 643 MB
 idx_btree_on_bool_29        | 6893 MB    | 643 MB
 idx_btree_on_bool_30        | 6893 MB    | 643 MB
 idx_btree_on_bool_31        | 6893 MB    | 643 MB
 idx_btree_on_bool_32        | 6893 MB    | 643 MB
 idx_btree_on_composite_bool | 6893 MB    | 1423 MB
 idx_bloom_on_bool           | 6893 MB    | 633 MB
 idx_bloom_on_int            | 6893 MB    | 633 MB
 idx_bloom_on_bit            | 6893 MB    | 633 MB
 idx_gin_on_int_arr          | 6893 MB    | 1030 MB
 idx_gin_int_on_int          | 6893 MB    | 1030 MB
 idx_gin_int_on_bit          | 6893 MB    | 1030 MB

test sql

All Execution time result is taken from second query result since first query usually takes some time to load index into memory.

full-scan on int column

1 bit filtering(Parallel Seq Scan, Execution time: 122930.731 ms)

EXPLAIN ANALYZE SELECT 1 FROM t_bitwise
WHERE
  c_int & 1 = 1
;

16 bit filtering(Parallel Seq Scan, 122896.131 ms)

EXPLAIN ANALYZE SELECT 1 FROM t_bitwise
WHERE
  c_int & 255 = 255
  AND (c_int >> 8) & 255 = 0
;

btree index on bool column

1 bit filtering(Seq Scan, Execution time: 122853.069 ms)

EXPLAIN ANALYZE SELECT 1 FROM t_bitwise
WHERE
  c_bool_1 IS TRUE
;

16 bit filtering(Parallel Seq, Execution time: 122834.960 ms)

EXPLAIN ANALYZE SELECT 1 FROM t_bitwise
WHERE
  c_bool_1 IS TRUE
  AND c_bool_2 IS TRUE
  AND c_bool_3 IS TRUE
  AND c_bool_4 IS TRUE
  AND c_bool_5 IS TRUE
  AND c_bool_6 IS TRUE
  AND c_bool_7 IS TRUE
  AND c_bool_8 IS TRUE
  AND c_bool_9 IS FALSE
  AND c_bool_10 IS FALSE
  AND c_bool_11 IS FALSE
  AND c_bool_12 IS FALSE
  AND c_bool_13 IS FALSE
  AND c_bool_14 IS FALSE
  AND c_bool_15 IS FALSE
  AND c_bool_16 IS FALSE
;

composite btree index on bool columns

16 bit filtering(idx_btree_on_composite_bool is used, Execution time: 293.317 ms)

EXPLAIN ANALYZE SELECT 1 FROM t_bitwise
WHERE
  c_bool_32 IN (TRUE, FALSE)
  AND c_bool_31 IN (TRUE, FALSE)
  AND c_bool_30 IN (TRUE, FALSE)
  AND c_bool_29 IN (TRUE, FALSE)
  AND c_bool_28 IN (TRUE, FALSE)
  AND c_bool_27 IN (TRUE, FALSE)
  AND c_bool_26 IN (TRUE, FALSE)
  AND c_bool_25 IN (TRUE, FALSE)
  AND c_bool_24 IN (TRUE, FALSE)
  AND c_bool_23 IN (TRUE, FALSE)
  AND c_bool_22 IN (TRUE, FALSE)
  AND c_bool_21 IN (TRUE, FALSE)
  AND c_bool_20 IN (TRUE, FALSE)
  AND c_bool_19 IN (TRUE, FALSE)
  AND c_bool_18 IN (TRUE, FALSE)
  AND c_bool_17 IN (TRUE, FALSE)
  AND c_bool_16 IS FALSE
  AND c_bool_15 IS FALSE
  AND c_bool_14 IS FALSE
  AND c_bool_13 IS FALSE
  AND c_bool_12 IS FALSE
  AND c_bool_11 IS FALSE
  AND c_bool_10 IS FALSE
  AND c_bool_9 IS FALSE
  AND c_bool_8 IS TRUE
  AND c_bool_7 IS TRUE
  AND c_bool_6 IS TRUE
  AND c_bool_5 IS TRUE
  AND c_bool_4 IS TRUE
  AND c_bool_3 IS TRUE
  AND c_bool_2 IS TRUE
  AND c_bool_1 IS TRUE
;

bloom index on bool columns

1 bit filtering(Seq Scan, Execution time: 122726.850 ms)

EXPLAIN ANALYZE SELECT 1 FROM t_bitwise
WHERE
  CAST(c_bool_1 AS INTEGER) = 1
;

16 bit filtering(idx_bloom_on_bool is used, Execution time: 373.581 ms)

EXPLAIN ANALYZE SELECT 1 FROM t_bitwise
WHERE
  CAST(c_bool_1 AS INTEGER) = 1
  AND CAST(c_bool_2 AS INTEGER) = 1
  AND CAST(c_bool_3 AS INTEGER) = 1
  AND CAST(c_bool_4 AS INTEGER) = 1
  AND CAST(c_bool_5 AS INTEGER) = 1
  AND CAST(c_bool_6 AS INTEGER) = 1
  AND CAST(c_bool_7 AS INTEGER) = 1
  AND CAST(c_bool_8 AS INTEGER) = 1
  AND CAST(c_bool_9 AS INTEGER) = 0
  AND CAST(c_bool_10 AS INTEGER) = 0
  AND CAST(c_bool_11 AS INTEGER) = 0
  AND CAST(c_bool_12 AS INTEGER) = 0
  AND CAST(c_bool_13 AS INTEGER) = 0
  AND CAST(c_bool_14 AS INTEGER) = 0
  AND CAST(c_bool_15 AS INTEGER) = 0
  AND CAST(c_bool_16 AS INTEGER) = 0
;

bloom index on int column

1 bit filtering(Seq Scan, Execution time: 122660.620 ms)

EXPLAIN ANALYZE SELECT 1 FROM t_bitwise
WHERE
  bitcheck_int(c_int, 1) = 1
;

16 bit filtering(idx_bloom_on_int is used, Execution time: 391.335 ms)

EXPLAIN ANALYZE SELECT 1 FROM t_bitwise
WHERE
  bitcheck_int(c_int, 1) = 1
  AND bitcheck_int(c_int, 2) = 2
  AND bitcheck_int(c_int, 3) = 3
  AND bitcheck_int(c_int, 4) = 4
  AND bitcheck_int(c_int, 5) = 5
  AND bitcheck_int(c_int, 6) = 6
  AND bitcheck_int(c_int, 7) = 7
  AND bitcheck_int(c_int, 8) = 8
  AND bitcheck_int(c_int, 9) = -9
  AND bitcheck_int(c_int, 10) = -10
  AND bitcheck_int(c_int, 11) = -11
  AND bitcheck_int(c_int, 12) = -12
  AND bitcheck_int(c_int, 13) = -13
  AND bitcheck_int(c_int, 14) = -14
  AND bitcheck_int(c_int, 15) = -15
  AND bitcheck_int(c_int, 16) = -16
;

bloom index on bit column

1 bit filtering(Seq Scan, Execution time: 122434.644 ms)

EXPLAIN ANALYZE SELECT 1 FROM t_bitwise
WHERE
  bitcheck_bit(c_bit, 1) = 1
;

16 bit filtering(idx_bloom_on_bit is used, Execution time: 397.157 ms)

EXPLAIN ANALYZE SELECT 1 FROM t_bitwise
WHERE
  bitcheck_bit(c_bit, 1) = 1
  AND bitcheck_bit(c_bit, 2) = 2
  AND bitcheck_bit(c_bit, 3) = 3
  AND bitcheck_bit(c_bit, 4) = 4
  AND bitcheck_bit(c_bit, 5) = 5
  AND bitcheck_bit(c_bit, 6) = 6
  AND bitcheck_bit(c_bit, 7) = 7
  AND bitcheck_bit(c_bit, 8) = 8
  AND bitcheck_bit(c_bit, 9) = -9
  AND bitcheck_bit(c_bit, 10) = -10
  AND bitcheck_bit(c_bit, 11) = -11
  AND bitcheck_bit(c_bit, 12) = -12
  AND bitcheck_bit(c_bit, 13) = -13
  AND bitcheck_bit(c_bit, 14) = -14
  AND bitcheck_bit(c_bit, 15) = -15
  AND bitcheck_bit(c_bit, 16) = -16
;

GIN index on int array

1 bit filtering(idx_gin_on_int_arr is used, Execution time: 440942.779 ms)

EXPLAIN ANALYZE SELECT 1 FROM t_bitwise
WHERE
  c_int_arr @> (ARRAY[1])::_int4
;

16 bit filtering(idx_gin_on_int_arr is used, Execution time: 15322.953 ms)

EXPLAIN ANALYZE SELECT 1 FROM t_bitwise
WHERE
  c_int_arr @> (ARRAY[1,2,3,4,5,6,7,8,-9,-10,-11,-12,-13,-14,-15,-16])::_int4
;

GIN int index on int

1 bit filtering(idx_gin_int_on_int is used, Execution time: 489909.621 ms)

EXPLAIN ANALYZE SELECT 1 FROM t_bitwise
WHERE
  convert_int_to_int_arr(c_int) @> (ARRAY[1])::_int4
;

16 bit filtering(idx_gin_int_on_int is used, Execution time: 15259.772 ms)

EXPLAIN ANALYZE SELECT 1 FROM t_bitwise
WHERE
  convert_int_to_int_arr(c_int) @> (ARRAY[1,2,3,4,5,6,7,8,-9,-10,-11,-12,-13,-14,-15,-16])::_int4
;

GIN int index on bit

1 bit filtering(idx_gin_int_on_bit is used, Execution time: 506071.625 ms)

EXPLAIN ANALYZE SELECT 1 FROM t_bitwise
WHERE
  convert_bit_to_int_arr(c_bit) @> (ARRAY[1])::_int4
;

16 bit filtering(idx_gin_int_on_bit is used, Execution time: 15292.945 ms)

EXPLAIN ANALYZE SELECT 1 FROM t_bitwise
WHERE
  convert_bit_to_int_arr(c_bit) @> (ARRAY[1,2,3,4,5,6,7,8,-9,-10,-11,-12,-13,-14,-15,-16])::_int4
;

My thoughts

Bloom index seems to be better than the others since it has smaller index size and is not so bad execution time. (But anyway, in case of low cardinality, any query plan results in full-scan)
Your column is 20000 bits ! I can not test it. If you want to use bloom index, you have to split it because bloom index can have up to 32 normal or expression columns.

نصائح أخرى

Famous problem in cheminformatics. You will have a hard time coming up with an index that will beat a full-table scan for this purpose. The most performant solutions keep the bitstrings outside of the database, where they don't have to deal with transactional overhead, and are optimized for efficient use of popcnt hardware support.

You can take some shortcuts, by for example keeping the bitstring data sorted by the number of on bits. If the query says "find me things with each of these 45 different bits turned on" then you can quickly skip all the records with less than 45 bits turned on, as they can't possibly qualify. The problem is that queries are usually sparser in bits than whole compounds are, so the number of records skipped by this manner is pretty small. Another optimization is to use something like a genetic algorithm to select out certain bits as being more important than others (trained on the nature of the queries you run), and sorting on the population count of just those select bits rather than the entire bitstring.

Anyway, I don't think you will find what you want in an in-core PostgreSQL index, so be prepared to do some heavy hacking. You might want to look to Indigo source code for some inspiration. (Or maybe you can just use Indigo for your intended purpose directly)

I am building a table that contains chemical compounds (many millions of rows) and of those compounds certain predetermined features/fragments are flagged in a fixed length bitstring. This bitstring will have between 2000 and 20000 bits, further research needs to be done to determine more precise figures there. When searching for compounds that have certain specific features, or lack specific features, a search is done on a select subset of this bitstring. This can be a different subset every time.

I think you can use a bloom index here.

CREATE EXTENSION bloom;

Do the voodoo to return a boolean for the argument.

// (Bitwise AND it against the bit flag)
CREATE FUNCTION myT1 (int)
RETURNS bool AS $$
  SELECT true -- YOU WRITE CODE HERE
$$ LANGUAGE SQL
IMMUTABLE;

// (Bitwise AND it against the bit flag)
CREATE FUNCTION myT2 (int)
RETURNS bool AS $$
  SELECT $1 & b'10'::bit(32) -- YOU WRITE CODE HERE
$$ LANGUAGE SQL
IMMUTABLE;

// (Bitwise AND it against the bit flag)
CREATE FUNCTION myTn (int)
RETURNS bool AS $$
  SELECT $1 & b'01'::bit(32) -- YOU WRITE CODE HERE
$$ LANGUAGE SQL
IMMUTABLE;

CREATE INDEX b ON f
  USING bloom (myT1(id), myT2, myTn(id))
  WITH (length=80, col1=1);

That may work to give you indexed access to big search strings with arbitrary conditions while getting around the column limit of PostgreSQL, but it's messy. The index returns false-positives. But the query will recheck, so you don't have to worry about it.

مرخصة بموجب: CC-BY-SA مع الإسناد

لا تنتمي إلى dba.stackexchange