fastest way to put together 100+ columns in a single table

Hello!

I have a sparse table with 100+ columns. Each column is calculated as the last value taken from one of 100+ tables (one attribute per table, 6NF, each table consisting of item_id, value, actual_date), so that the target table holds the current values of 100+ attributes (item_id, value1, value2, ...). I use 6NF to track the changes of each attribute separately and save space, but in the end I need to combine the current values of all attributes per item to make the analysts' lives easier. To combine the attributes I have to join 100+ tables (sketched below), and that is rather slow. Conceptually, all I need is to lay the files containing the data blocks for the attributes side by side, and I can make the tables have the same cardinality. But there seems to be no low-level utility in Vertica that makes it that easy.

Can you advise me on the cheapest way to combine 100+ columns into a single table?
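
For illustration, a minimal sketch of the kind of join I mean, with just two hypothetical attribute tables, attr_color and attr_weight; the real query repeats this pattern for 100+ tables:

    -- Hypothetical example: two of the 100+ single-attribute tables,
    -- attr_color(item_id, value, actual_date) and attr_weight(item_id, value, actual_date).
    -- Take the latest value per item from each table, then join on item_id.
    -- (In practice a driving item list or FULL OUTER JOINs cover items
    --  that are missing from some of the attribute tables.)
    SELECT
      c.item_id
    , c.value AS color
    , w.value AS weight   -- ... and ~100 more joins like this one
    FROM (
      SELECT item_id, value
      FROM (
        SELECT item_id, value
             , ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY actual_date DESC) AS rn
        FROM attr_color
      ) t
      WHERE rn = 1
    ) c
    LEFT JOIN (
      SELECT item_id, value
      FROM (
        SELECT item_id, value
             , ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY actual_date DESC) AS rn
        FROM attr_weight
      ) t
      WHERE rn = 1
    ) w
      ON w.item_id = c.item_id
    ;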

Comments

  • Thanks, I'll give it a try

  • marcothesane - Administrator

    Hi phil2,

    This could just as well go into Vertica Tips ...

    This comment prepares the example data, and my next one contains the actual query. You can copy-paste the code snippets below into VSQL, test their behaviour, and see if they satisfy your needs:

    If I understand you right, you use 6th normal form, which is very much the on-change-only storage approach used for sensor data: instead of one column per type of measure, you store a measure-type indicator and a value column, and you keep a measure-type dimension table (often named "device", "measure_type", "signal" or similar) like this:

    DROP TABLE IF EXISTS d_devices CASCADE;
    CREATE TABLE d_devices (
      dev_id 
    , dev_unit
    , dev_shname
    , dev_name
    )
    AS
              SELECT 1, 'g/sec'       , 'airmass' ,'airmass'
    UNION ALL SELECT 2, 'rpm'         , 'rpm'     ,'revolutions per minute'
    UNION ALL SELECT 3, 'deg. Celsius', 'out_temp','outside temperature'
    ;
    

    And the sensor data table would look like this:

    DROP TABLE IF EXISTS sdata;
    CREATE TABLE sdata(veh_id,drv_id,dev_id,ts,val) AS
              SELECT 3,27,1,TIME '00:00:00',0
    UNION ALL SELECT 3,27,1,TIME '00:00:10',5
    UNION ALL SELECT 3,27,1,TIME '00:00:20',6
    UNION ALL SELECT 3,27,1,TIME '00:00:30',8
    UNION ALL SELECT 3,27,1,TIME '00:00:40',12
    UNION ALL SELECT 3,27,1,TIME '00:00:50',7
    UNION ALL SELECT 3,27,1,TIME '00:01:00',4
    UNION ALL SELECT 3,27,1,TIME '00:01:12',1
    UNION ALL SELECT 3,27,1,TIME '00:01:23',0
    UNION ALL SELECT 3,27,1,TIME '00:05:10',5
    UNION ALL SELECT 3,27,1,TIME '00:05:20',6
    UNION ALL SELECT 3,27,1,TIME '00:05:50',7
    UNION ALL SELECT 3,27,1,TIME '00:06:00',4
    UNION ALL SELECT 3,27,1,TIME '00:06:12',1
    UNION ALL SELECT 3,27,1,TIME '00:06:23',0
    UNION ALL SELECT 3,27,2,TIME '00:00:00',0
    UNION ALL SELECT 3,27,2,TIME '00:00:10',300
    UNION ALL SELECT 3,27,2,TIME '00:00:20',400
    UNION ALL SELECT 3,27,2,TIME '00:00:30',520
    UNION ALL SELECT 3,27,2,TIME '00:00:40',780
    UNION ALL SELECT 3,27,2,TIME '00:00:50',430
    UNION ALL SELECT 3,27,2,TIME '00:01:00',340
    UNION ALL SELECT 3,27,2,TIME '00:01:12',40
    UNION ALL SELECT 3,27,2,TIME '00:01:23',0
    UNION ALL SELECT 3,27,2,TIME '00:05:10',300
    UNION ALL SELECT 3,27,2,TIME '00:05:20',400
    UNION ALL SELECT 3,27,2,TIME '00:05:50',430
    UNION ALL SELECT 3,27,2,TIME '00:06:00',340
    UNION ALL SELECT 3,27,2,TIME '00:06:12',40
    UNION ALL SELECT 3,27,2,TIME '00:06:23',0
    UNION ALL SELECT 3,27,3,TIME '00:00:00',15
    UNION ALL SELECT 3,27,3,TIME '00:03:45',15.5
    UNION ALL SELECT 3,27,3,TIME '00:06:23',16
    UNION ALL SELECT 3,29,1,TIME '00:00:08',0
    UNION ALL SELECT 3,29,1,TIME '00:00:17',5
    UNION ALL SELECT 3,29,1,TIME '00:00:24',6
    UNION ALL SELECT 3,29,1,TIME '00:00:37',7
    UNION ALL SELECT 3,29,1,TIME '00:00:58',4
    UNION ALL SELECT 3,29,1,TIME '00:01:12',1
    UNION ALL SELECT 3,29,1,TIME '00:01:23',0
    UNION ALL SELECT 3,29,1,TIME '00:02:51',5
    UNION ALL SELECT 3,29,1,TIME '00:03:01',6
    UNION ALL SELECT 3,29,1,TIME '00:03:17',7
    UNION ALL SELECT 3,29,1,TIME '00:03:25',4
    UNION ALL SELECT 3,29,1,TIME '00:03:43',1
    UNION ALL SELECT 3,29,1,TIME '00:03:57',0
    UNION ALL SELECT 3,29,2,TIME '00:00:08',0
    UNION ALL SELECT 3,29,2,TIME '00:00:17',300
    UNION ALL SELECT 3,29,2,TIME '00:00:24',400
    UNION ALL SELECT 3,29,2,TIME '00:00:37',430
    UNION ALL SELECT 3,29,2,TIME '00:00:58',340
    UNION ALL SELECT 3,29,2,TIME '00:01:12',40
    UNION ALL SELECT 3,29,2,TIME '00:01:23',0
    UNION ALL SELECT 3,29,2,TIME '00:02:51',300
    UNION ALL SELECT 3,29,2,TIME '00:03:01',400
    UNION ALL SELECT 3,29,2,TIME '00:03:17',430
    UNION ALL SELECT 3,29,2,TIME '00:03:25',340
    UNION ALL SELECT 3,29,2,TIME '00:03:43',40
    UNION ALL SELECT 3,29,2,TIME '00:03:57',0
    UNION ALL SELECT 3,29,3,TIME '00:00:00',15
    UNION ALL SELECT 3,29,3,TIME '00:03:45',15.5
    UNION ALL SELECT 3,29,3,TIME '00:06:23',16
    ;
    

    This is the data preparation phase for this post.
    My next comment will show how to query this data.

    Bear with me ....

    Marco

  • marcothesane - Administrator

    Hi again, Phil2.

    In the data prepared in my previous post above, note that, for each test drive (drv_id), we only have 3 measures for outside temperature, as it changes only sporadically, while airmass and rpm change much more frequently.

    The quickest way to get the measures side by side, as a colleague of mine, Maurizio, and I found out, is not Vertica's event series join, but the more "classical" horizontal pivoting approach, where we collapse the vertical structure, in our case by vehicle id, test drive id and timestamp. We won't like it at first, as the resulting table comes out sparse / ragged:

    WITH
    piv_ragged (veh_id,drv_id,ts,airmass,rpm,out_temp) AS (
      SELECT
        veh_id
      , drv_id
      , ts
      , SUM(CASE dev_id WHEN 1 THEN val  END) AS airmass
      , SUM(CASE dev_id WHEN 2 THEN val  END) AS rpm
      , SUM(CASE dev_id WHEN 3 THEN val  END) AS out_temp
      FROM sdata
      GROUP BY
        veh_id
      , drv_id
      , ts
    )
    SELECT * FROM piv_ragged ORDER BY veh_id, drv_id, ts;
    

    And the result is:

    veh_id|drv_id|ts      |airmass|rpm   |out_temp
         3|    27|00:00:00|    0.0|   0.0|    15.0
         3|    27|00:00:10|    5.0| 300.0|(null)
         3|    27|00:00:20|    6.0| 400.0|(null)
         3|    27|00:00:30|    8.0| 520.0|(null)
         3|    27|00:00:40|   12.0| 780.0|(null)
         3|    27|00:00:50|    7.0| 430.0|(null)
         3|    27|00:01:00|    4.0| 340.0|(null)
         3|    27|00:01:12|    1.0|  40.0|(null)
         3|    27|00:01:23|    0.0|   0.0|(null)
         3|    27|00:03:45|(null) |(null)|    15.5
         3|    27|00:05:10|    5.0| 300.0|(null)
         3|    27|00:05:20|    6.0| 400.0|(null)
         3|    27|00:05:50|    7.0| 430.0|(null)
         3|    27|00:06:00|    4.0| 340.0|(null)
         3|    27|00:06:12|    1.0|  40.0|(null)
         3|    27|00:06:23|    0.0|   0.0|    16.0
         3|    29|00:00:00|(null) |(null)|    15.0
         3|    29|00:00:08|    0.0|   0.0|(null)
         3|    29|00:00:17|    5.0| 300.0|(null)
         3|    29|00:00:24|    6.0| 400.0|(null)
         3|    29|00:00:37|    7.0| 430.0|(null)
         3|    29|00:00:58|    4.0| 340.0|(null)
         3|    29|00:01:12|    1.0|  40.0|(null)
         3|    29|00:01:23|    0.0|   0.0|(null)
         3|    29|00:02:51|    5.0| 300.0|(null)
         3|    29|00:03:01|    6.0| 400.0|(null)
         3|    29|00:03:17|    7.0| 430.0|(null)
         3|    29|00:03:25|    4.0| 340.0|(null)
         3|    29|00:03:43|    1.0|  40.0|(null)
         3|    29|00:03:45|(null) |(null)|    15.5
         3|    29|00:03:57|    0.0|   0.0|(null)
         3|    29|00:06:23|(null) |(null)|    16.0
    

    As the storage is on-change-only, it's correct to assume that each NULL value should be filled with the last known value for that measure, per vehicle id and test drive id. To do that, we can define one named OLAP window "w", shared by all three measures, and apply LAST_VALUE(... IGNORE NULLS) OVER (w) to each of them. So we add that bit to the query (that's why I formulated the main query above as a common table expression in the first place):

    WITH
    piv_ragged (veh_id,drv_id,ts,airmass,rpm,out_temp) AS (
      SELECT
        veh_id
      , drv_id
      , ts
      , SUM(CASE dev_id WHEN 1 THEN val  END) AS airmass
      , SUM(CASE dev_id WHEN 2 THEN val  END) AS rpm
      , SUM(CASE dev_id WHEN 3 THEN val  END) AS out_temp
      FROM sdata
      GROUP BY
        veh_id
      , drv_id
      , ts
    )
    SELECT
      veh_id
    , drv_id
    , ts
    , LAST_VALUE(airmass  IGNORE NULLS) OVER(w) AS airmass
    , LAST_VALUE(rpm      IGNORE NULLS) OVER(w) AS rpm
    , LAST_VALUE(out_temp IGNORE NULLS) OVER(w) AS out_temp
    FROM piv_ragged
    WINDOW w AS (PARTITION BY veh_id,drv_id ORDER BY ts)
    ;
    

    And the result this time is:

    veh_id|drv_id|ts      |airmass|rpm   |out_temp
         3|    29|00:00:00|(null) |(null)|    15.0
         3|    29|00:00:08|    0.0|   0.0|    15.0
         3|    29|00:00:17|    5.0| 300.0|    15.0
         3|    29|00:00:24|    6.0| 400.0|    15.0
         3|    29|00:00:37|    7.0| 430.0|    15.0
         3|    29|00:00:58|    4.0| 340.0|    15.0
         3|    29|00:01:12|    1.0|  40.0|    15.0
         3|    29|00:01:23|    0.0|   0.0|    15.0
         3|    29|00:02:51|    5.0| 300.0|    15.0
         3|    29|00:03:01|    6.0| 400.0|    15.0
         3|    29|00:03:17|    7.0| 430.0|    15.0
         3|    29|00:03:25|    4.0| 340.0|    15.0
         3|    29|00:03:43|    1.0|  40.0|    15.0
         3|    29|00:03:45|    1.0|  40.0|    15.5
         3|    29|00:03:57|    0.0|   0.0|    15.5
         3|    29|00:06:23|    0.0|   0.0|    16.0
         3|    27|00:00:00|    0.0|   0.0|    15.0
         3|    27|00:00:10|    5.0| 300.0|    15.0
         3|    27|00:00:20|    6.0| 400.0|    15.0
         3|    27|00:00:30|    8.0| 520.0|    15.0
         3|    27|00:00:40|   12.0| 780.0|    15.0
         3|    27|00:00:50|    7.0| 430.0|    15.0
         3|    27|00:01:00|    4.0| 340.0|    15.0
         3|    27|00:01:12|    1.0|  40.0|    15.0
         3|    27|00:01:23|    0.0|   0.0|    15.0
         3|    27|00:03:45|    0.0|   0.0|    15.5
         3|    27|00:05:10|    5.0| 300.0|    15.5
         3|    27|00:05:20|    6.0| 400.0|    15.5
         3|    27|00:05:50|    7.0| 430.0|    15.5
         3|    27|00:06:00|    4.0| 340.0|    15.5
         3|    27|00:06:12|    1.0|  40.0|    15.5
         3|    27|00:06:23|    0.0|   0.0|    16.0

    If this covers what you need, it does not matter if you have 3 or 100 different measure types ...
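
    With 100+ measure types you would not want to type all those pivot lines by hand. As a sketch, plain string concatenation against d_devices can generate the SELECT-list lines, which you then paste into the two queries above (the piv_ragged part and the LAST_VALUE part):

    -- Generate one 'SUM(CASE ...)' line per measure type for the piv_ragged CTE.
    SELECT ', SUM(CASE dev_id WHEN ' || dev_id::VARCHAR
        || ' THEN val END) AS ' || dev_shname AS piv_line
    FROM d_devices
    ORDER BY dev_id;

    -- Generate one 'LAST_VALUE(... IGNORE NULLS)' line per measure type
    -- for the outer SELECT.
    SELECT ', LAST_VALUE(' || dev_shname || ' IGNORE NULLS) OVER(w) AS '
        || dev_shname AS last_value_line
    FROM d_devices
    ORDER BY dev_id;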

    Happy Playing ...

    Marco

  • Thanks for your advice. It is too dramatic a change for me: there are too many places referencing those tables, and the whole DWH is built on them, so it is not an option. Moreover, those tables have 100M+ to 1B+ rows each, so I'm not sure it is a good idea to store all of them in a single EAV super-table.

    In the end I just group the attributes by update frequency, join the single-attribute tables within each group in parallel, and then join the groups together (see the sketch below). That looks like the best I can get out of it. The slowness of the joins seems to come from early materialization of the grouped tables once they become inner tables in the final join, plus all the NULLs in the sparse columns. Grouping by frequency saves the day.

    It would have been much easier if there were a function to put together tables of the same cardinality as-is, or a join optimization for tables sharing exactly the same join key column, so that no materialization would be needed during the join at all. I believe such an operation would be extremely fast.
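
    For illustration, a minimal sketch of this grouping-by-frequency approach, where cur_a, cur_b, cur_x and cur_y stand for hypothetical per-attribute "current value" relations (item_id, value), grouped by how often they change:

    -- Sketch only: real groups would contain many more attributes.
    -- The two CREATE TABLE AS statements can run in parallel sessions;
    -- the final join then only combines a handful of wide group tables.
    CREATE TABLE grp_fast AS            -- frequently changing attributes
    SELECT a.item_id, a.value AS attr_a, b.value AS attr_b
    FROM cur_a a
    LEFT JOIN cur_b b ON b.item_id = a.item_id;

    CREATE TABLE grp_slow AS            -- rarely changing attributes
    SELECT x.item_id, x.value AS attr_x, y.value AS attr_y
    FROM cur_x x
    LEFT JOIN cur_y y ON y.item_id = x.item_id;

    SELECT f.item_id, f.attr_a, f.attr_b, s.attr_x, s.attr_y
    FROM grp_fast f
    LEFT JOIN grp_slow s ON s.item_id = f.item_id;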
