Can one build a live table or projection out of TIMESERIES?
BHIFP
CREATE PROJECTION prices_sync.ticks_mins /*+basename(ticks),createtype(L)*/
(
    distribution_id  ENCODING RLE,
    base_asset_id    ENCODING RLE,
    counter_asset_id ENCODING RLE,
    business_time1   ENCODING DELTAVAL,
    price            ENCODING AUTO
)
AS SELECT
    distribution_id,
    base_asset_id,
    counter_asset_id,
    business_time1,
    TS_LAST_VALUE(price IGNORE NULLS) AS price
FROM prices_sync.ticks
TIMESERIES business_time1 AS '1 minute'
    OVER (PARTITION BY distribution_id, base_asset_id, counter_asset_id
          ORDER BY business_time)
SEGMENTED BY HASH(base_asset_id, counter_asset_id) ALL NODES OFFSET 0
KSAFE 1;
It complains:
ERROR 6145: Analytic functions are not allowed in projections.
What is the right way to have a "live" version of this statement?
Thanks
Answers
@BHIFP: You cannot use TS_LAST_VALUE in aggregate projections; we support only 4 functions for aggregate projections. Please find the link below. Are you facing performance issues when you run the SELECT part of the query directly?
https://www.vertica.com/docs/11.0.x/HTML/Content/Authoring/AnalyzingData/AggregatedData/AggregateFunctionsSupported.htm
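For contrast, here is a minimal sketch (not from the thread) of the form an aggregate projection must take, using only a supported aggregate such as MAX; it reuses the table and columns from the question, but it does not solve the per-minute last-value requirement by itself:

-- Aggregate projection sketch: only supported aggregates (e.g. MAX) are allowed,
-- which is why the TS_LAST_VALUE version above is rejected.
CREATE PROJECTION prices_sync.ticks_agg
AS SELECT
    distribution_id,
    base_asset_id,
    counter_asset_id,
    MAX(price) AS max_price
FROM prices_sync.ticks
GROUP BY distribution_id, base_asset_id, counter_asset_id;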
What I need is to join prices with balances, say, at a specific timestamp, something like this:
prices_sync.ticks is quite large, so the above really does not fly.
Somewhat related: given the above prices_sync.ticks table, what would be the best way to pivot it by distribution_id, with business_time aligned to '1 minute'?
The result I am looking for looks like this:
where business_time is per minute.
Thanks
I'm not sure you need a timeseries statement for this. Aren't you just wanting the most recent non-null price for a given set of filters? MAX() should be able to give you that.
Thanks, Curtis, I don't follow it though. Do you mind writing the statement in full?
I guess not a max() - that would give something different. But it seems like this would work, or at least something similar.
-- most recent non-null price at or before the given timestamp,
-- for the asset pairs present in balances
select price
from prices_sync.ticks p
join balances b
  on  b.base_asset_id    = p.base_asset_id
  and b.counter_asset_id = p.counter_asset_id
where p.business_time1 <= '2018-01-13T21:20:00Z'
  and distribution_id = 1000
  and price is not null
order by p.business_time1 desc   -- newest first, so "limit 1" returns the latest tick
limit 1;
These kinds of joins are quite slow on our system, though. I was hoping that with a gap-filled table, you could just do p.business_time1 = '2018-01-13T21:20:00Z', which would be much faster.
Well, speed aside, timeseries isn't supported in what you're trying to accomplish.
I suppose this is not possible in a performant way either then?
Personally, I think you need to look at your design and find a more elegant way than populating a column via a timeseries function. That's very costly, regardless of the system. What you're describing strikes me as a more operational use case, rather than something one would find in a robust, scalable analytics system. What you're trying to do is NOT going to scale to any magnitude, and even if you get it working reasonably well today, a year from now, when you have 2x the data, the performance is going to make you sad.
Vertica can do the timeseries relatively easily, and you could run this kind of thing at query time, but baking it into your table design is simply not a good idea.
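For reference, the SELECT part of the projection in the question runs fine as an ordinary query; a sketch of the query-time version:

-- Gap-fill to 1-minute buckets at query time instead of inside a projection.
SELECT distribution_id,
       base_asset_id,
       counter_asset_id,
       business_time1,
       TS_LAST_VALUE(price IGNORE NULLS) AS price
FROM prices_sync.ticks
TIMESERIES business_time1 AS '1 minute'
    OVER (PARTITION BY distribution_id, base_asset_id, counter_asset_id
          ORDER BY business_time);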
You are right, Curtis, we don't have the best design for this use case. But I would like to hear your thoughts on a better design.
Assuming we only have 4 columns:
Exchange, Asset, Time, Price
The original time series from exchanges are at the tick level, so not necessarily at a fixed time interval. I would like to normalize the time series from all exchanges to fixed-interval time series and put all exchanges side by side, so I can take an average across them.
I am doing this for production streaming purposes.
Thanks
You might not need timeseries for that. You could just aggregate the time by using time_slice().
https://www.vertica.com/docs/11.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Date-Time/TIME_SLICE.htm?zoom_highlight=time slice
That would give you a specific time interval that's consistent across records. See if time_slice() can be useful here. It's not an analytic function, so it might be helpful.
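A hedged sketch of what that could look like against the ticks table; MAX(price) is only a placeholder aggregate, and note that TIME_SLICE buckets existing rows but does not fill empty minutes:

-- Bucket raw ticks into consistent 1-minute slices.
SELECT TIME_SLICE(business_time, 1, 'MINUTE') AS slice_minute,
       distribution_id,
       base_asset_id,
       counter_asset_id,
       MAX(price) AS price
FROM prices_sync.ticks
GROUP BY TIME_SLICE(business_time, 1, 'MINUTE'),
         distribution_id,
         base_asset_id,
         counter_asset_id;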
Thanks, Curtis. TIME_SLICE may not work if there are gaps at the minute level.
I think I found a fairly generic way to do it with a multiphase UDTF; I need to test how performant it is, though.
I have a dataset of sensor data (which is time series, too), and a pivoting example that could fit your purpose.
My technique is actually not to keep a 'stroboscopic' picture of the time series, but to store it on-change-only, and to apply the TIMESERIES clause only immediately before I run a Machine Learning function that needs evenly spaced time series as input.
Imagine a table with this data cutout: my msr would be your Asset; my veri_device_id will probably be your exchange. I only have a new row when the sensor value changes.
Now, I want a data cutout that pivots the three voltage measures, and only puts out a row when any of the three values changes. For that, I first pivot using, for example, MAX(CASE msr WHEN 'voltage1' THEN voltage1 END) for all three measures and GROUP BY the timestamp, and then I apply a LAST_VALUE(voltage1 IGNORE NULLS) to fill the resulting holes of the GROUP BY. (Error in the query, sorry - better now ...)
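A sketch of that two-step pattern; the sensor table and column names (sensor_readings, veri_device_id, ts, msr, reading) are assumed here, since the actual table definition was not shown in the thread:

-- Step 1: pivot the on-change-only rows into one row per device and timestamp.
-- Step 2: LAST_VALUE ... IGNORE NULLS carries the last known reading forward over
--         the NULL holes the pivot leaves (the default frame ends at the current row).
WITH pivoted AS (
    SELECT veri_device_id,
           ts,
           MAX(CASE msr WHEN 'voltage1' THEN reading END) AS voltage1,
           MAX(CASE msr WHEN 'voltage2' THEN reading END) AS voltage2,
           MAX(CASE msr WHEN 'voltage3' THEN reading END) AS voltage3
    FROM sensor_readings
    GROUP BY veri_device_id, ts
)
SELECT veri_device_id,
       ts,
       LAST_VALUE(voltage1 IGNORE NULLS) OVER (PARTITION BY veri_device_id ORDER BY ts) AS voltage1,
       LAST_VALUE(voltage2 IGNORE NULLS) OVER (PARTITION BY veri_device_id ORDER BY ts) AS voltage2,
       LAST_VALUE(voltage3 IGNORE NULLS) OVER (PARTITION BY veri_device_id ORDER BY ts) AS voltage3
FROM pivoted;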
Thanks, marcothesane, this is really helpful. Just wondering if this is a "live" table/projection which can update itself as new data comes in.
If not, my UDTF might be a more flexible and repeatable solution. I am thinking something like this:
AvgPricePerPair is a multiphase UDTF, where
the first phase gathers data across different distribution_ids, and
the second phase aggregates across distribution_ids (rough invocation sketched below).
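Purely as an illustration, the invocation might look like this; AvgPricePerPair and its argument list are hypothetical (it is the UDTF described above, not a built-in), only the OVER clause syntax for calling a transform function is standard Vertica:

-- Hypothetical call: phase one gathers prices per distribution_id,
-- phase two averages across distribution_ids for each asset pair.
SELECT AvgPricePerPair(distribution_id, business_time, price)
       OVER (PARTITION BY base_asset_id, counter_asset_id ORDER BY business_time)
FROM prices_sync.ticks;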
Would like to hear your thoughts.
Thanks
I think I'd need a significant cutout of your input data - and how you calculate AvgPricePerPair - to really come up with a solution. If my approach with the sensor data here is helpful to you, consider a Live Aggregate Projection on the base table (of course, with a MAX(CASE msr WHEN ... for each measure that you have in the table). You can't build a Live Aggregate Projection using OLAP functions like LAST_VALUE() OVER().
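A hedged sketch of such a Live Aggregate Projection, reusing the assumed sensor names from earlier; it maintains only the pivot step, and the LAST_VALUE gap fill still has to happen at query time, per the note above:

-- Live aggregate projection: one pivoted row per device and timestamp,
-- maintained automatically as new sensor rows arrive.
CREATE PROJECTION sensor_readings_pivot
AS SELECT
    veri_device_id,
    ts,
    MAX(CASE msr WHEN 'voltage1' THEN reading END) AS voltage1,
    MAX(CASE msr WHEN 'voltage2' THEN reading END) AS voltage2,
    MAX(CASE msr WHEN 'voltage3' THEN reading END) AS voltage3
FROM sensor_readings
GROUP BY veri_device_id, ts;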
Your dataset was exactly on point. My AvgPricePerPair function was literally taking an average - in my case across distribution_id, and in your case across different msr values.
MAX will not work, I need LAST_VALUE per time interval.
Since your solution cannot be live, it seems my UDTF is the better solution, then. I am able to run it and get results back in under 500 ms.