PostgreSQL - Return unique combinations of columns based on where clause

https://dba.stackexchange.com/questions/60158

02-11-2019
|

Вопрос

This is quite a long question, please bear with me.

So I'd like to first explain I have a database of firewall logs created using the following command:

CREATE TABLE firewall_logs_mapped
(
  log_time text,
  log_time_mapped double precision,
  syslog_priority text,
  syslog_priority_mapped double precision,
  operation text,
  operation_mapped double precision,
  message_code text,
  message_code_mapped double precision,
  protocol text,
  protocol_mapped double precision,
  source_ip text,
  source_ip_mapped double precision,
  destination_ip text,
  destination_ip_mapped double precision,
  source_port text,
  source_port_mapped double precision,
  destination_port text,
  destination_port_mapped double precision,
  destination_service text,
  destination_service_mapped double precision,
  direction text,
  direction_mapped double precision,
  connections_built text,
  connections_built_mapped double precision,
  connections_torn_down text,
  connections_torn_down_mapped double precision,
  hourofday text,
  hourofday_mapped double precision,
  meridiem text,
  meridiem_mapped double precision
)

So basically for every value in the database there is also a mapped value which is a float representing the y co-ordinate for the value such that it can be plotted on a graph which I have set-up.

I also have an index on each of the un-mapped columns in the database.

Now I have a an interface whereby the user can search the database by specifying the column name and value, and the software will convert this to a query and query the database. I.e. they can input something like:

operation Built

and the software creates and runs the following query:

SELECT DISTINCT log_time_mapped, syslog_priority_mapped, 
       operation_mapped, message_code_mapped, protocol_mapped, 
       source_ip_mapped, destination_ip_mapped, 
       source_port_mapped, destination_port_mapped, 
       destination_service_mapped, direction_mapped, 
       connections_built_mapped, connections_torn_down_mapped, 
       hourofday_mapped, meridiem_mapped 
FROM firewall_logs_mapped 
WHERE operation = 'Built'

Now this particular query doesn't actually use the index (presumably because it returns ~48% of the database rows) and returns 11,426,373 rows in 126,775ms. Here is the explain analyze for the query:

"Seq Scan on public.firewall_logs_mapped  (cost=0.00..1234282.95 rows=11295161 width=120) (actual time=0.357..16139.005 rows=11426373 loops=1)"
"  Output: log_time_mapped, syslog_priority_mapped, operation_mapped, message_code_mapped, protocol_mapped, source_ip_mapped, destination_ip_mapped, source_port_mapped, destination_port_mapped, destination_service_mapped, direction_mapped, connections_built (...)"
"  Filter: (firewall_logs_mapped.operation = 'Built'::text)"
"  Rows Removed by Filter: 12049756"
"Total runtime: 16751.255 ms"

Now if i understand this right it seems the time is spent filtering the data by the where clause. Can this time be reduced?

So as I am plotting a parallel co-ordinate graph I only require the unique combinations of the column values returned through the where clause, as otherwise there will just be a lot of lines overdrawn on each other.

How would I go about selecting only the unique combinations of adjacent columns? Is this too complicated a task to perform through a database query? Would it speed up the query?

SELECT DISTINCT will select the unique rows (combinations of all values in a row), I'm talking about the unique combinations of any two adjacent columns, as in parallel co-ordinates a line is drawn between adjacent columns, for each column in a row, i.e. from log_time to syslog_priority, the next line for this row will be from syslog_priority to operation etc. Check out this link: http://en.wikipedia.org/wiki/File:ParCorFisherIris.png, in my case each axes would be one column in the database.

Currently I do this in Python by using the results to create an array of co-ordinates for the lines to be drawn, and then I loop through the array adding the line co-ordinates to a set if not seen before in the loop.

EDIT:

Indexes have been created as follows:

CREATE INDEX unique_log_time_index ON unique_firewall_logs_mapped (log_time);
CREATE INDEX unique_syslog_priority_index ON unique_firewall_logs_mapped (syslog_priority);
CREATE INDEX unique_operation_index ON unique_firewall_logs_mapped (operation);
CREATE INDEX unique_message_code_index ON unique_firewall_logs_mapped (message_code);
CREATE INDEX unique_protocol_index ON unique_firewall_logs_mapped (protocol);
CREATE INDEX unique_source_ip_index ON unique_firewall_logs_mapped (source_ip);
CREATE INDEX unique_destination_ip_index ON unique_firewall_logs_mapped (destination_ip);
CREATE INDEX unique_source_port_index ON unique_firewall_logs_mapped (source_port);
CREATE INDEX unique_destination_port_index ON unique_firewall_logs_mapped (destination_port);
CREATE INDEX unique_destination_service_index ON unique_firewall_logs_mapped (destination_service);
CREATE INDEX unique_direction_index ON unique_firewall_logs_mapped (direction);
CREATE INDEX unique_connections_built_index ON unique_firewall_logs_mapped (connections_built);
CREATE INDEX unique_connections_torn_down_index ON unique_firewall_logs_mapped (connections_torn_down);
CREATE INDEX unique_hourofday_index ON unique_firewall_logs_mapped (hourofday);
CREATE INDEX unique_meridiem_index ON unique_firewall_logs_mapped (meridiem);

My PostgreSQL version is 9.3, random_page_cost is 4.0, seq_page_cost is 1.0.

Also I'd like to say I tried forcing the use of the index and the query took 2 seconds or so longer than without the index.

Нет правильного решения

Лицензировано под: CC-BY-SA с атрибуция

Не связан с dba.stackexchange