Can one build a live table or projection out of TIMESERIES?

BHIFP Vertica Customer
CREATE PROJECTION prices_sync.ticks_mins /*+basename(ticks),createtype(L)*/
    (
     distribution_id ENCODING RLE,
     base_asset_id ENCODING RLE,
     counter_asset_id ENCODING RLE,
     business_time1 ENCODING DELTAVAL,
     price ENCODING AUTO
        )
    AS
        SELECT distribution_id,
               base_asset_id,
               counter_asset_id,
               business_time1,
               TS_LAST_VALUE(price IGNORE NULLS) AS price
        FROM prices_sync.ticks TIMESERIES business_time1 AS '1 minute'
            OVER (PARTITION BY distribution_id, base_asset_id, counter_asset_id ORDER BY business_time)
    SEGMENTED BY HASH(base_asset_id, counter_asset_id) ALL NODES OFFSET 0
    KSAFE 1;

Vertica complains:
ERROR 6145: Analytic functions are not allowed in projections.

What is the right way to have a "live" version of this statement?

Thanks

Answers

  • SruthiA Vertica Employee Administrator

    @BHIFP: You cannot use TS_LAST_VALUE in aggregate projections. We support only 4 functions for aggregate projections; please find the link below. Are you facing performance issues when you run the SELECT part of the query directly?

    https://www.vertica.com/docs/11.0.x/HTML/Content/Authoring/AnalyzingData/AggregatedData/AggregateFunctionsSupported.htm
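
    For illustration, a minimal sketch of an aggregate projection that IS allowed (hypothetical projection name, columns taken from the question) sticks to the supported plain aggregates such as MIN/MAX/SUM/COUNT:

    -- Hypothetical sketch: a live aggregate projection using supported
    -- aggregates only; no analytic or timeseries functions.
    CREATE PROJECTION prices_sync.ticks_minmax AS
        SELECT distribution_id,
               base_asset_id,
               counter_asset_id,
               MIN(price) AS min_price,
               MAX(price) AS max_price
        FROM prices_sync.ticks
        GROUP BY distribution_id,
                 base_asset_id,
                 counter_asset_id;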

  • BHIFP Vertica Customer

    What I need is to join prices with balances, say, at a specific timestamp, something like this:

    SELECT b.*, p.price
    FROM balances b
    JOIN (
        SELECT distribution_id,
               base_asset_id,
               counter_asset_id,
               business_time1,
               TS_LAST_VALUE(price IGNORE NULLS) AS price
        FROM prices_sync.ticks TIMESERIES business_time1 AS '1 minute'
            OVER (PARTITION BY distribution_id, base_asset_id, counter_asset_id ORDER BY business_time)
    ) p ON
        b.base_asset_id = p.base_asset_id AND b.counter_asset_id = p.counter_asset_id
    WHERE p.business_time1 = '2018-01-13T21:20:00Z' AND p.distribution_id = 1000;
    

    prices_sync.ticks is quite large, so the above does not really fly.

  • BHIFP Vertica Customer
    edited September 2021

    Somewhat related: given the above prices_sync.ticks table, what would be the best way to pivot it by distribution_id, with business_time aligned to '1 minute'?

    Result I am looking for looks like this:

    business_time, base_asset_id, counter_asset_id, distribution_id_1, distribution_id_2, distribution_id_3, distribution_id_N

    where business_time is per minute.

    Thanks

  • I'm not sure you need a timeseries statement for this. Aren't you just wanting the most recent non-null price for a given set of filters? MAX() should be able to give you that.

  • BHIFP Vertica Customer

    Thanks, Curtis, I don't quite follow it, though. Do you mind writing the statement in full?

  • I guess not a max() - that would give something different. But it seems like this would work, or at least something similar.

    select price
    from prices_sync.ticks p
    join balances b on b.base_asset_id=p.base_asset_id and b.counter_asset_id=p.counter_asset_id
    WHERE p.business_time <= '2018-01-13T21:20:00Z'
    and distribution_id=1000
    and price is not null
    order by p.business_time desc
    limit 1 ;
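
    A related hedged sketch, using Vertica's Top-K syntax (LIMIT ... OVER) to pull the most recent non-null price per asset pair in a single pass, with the same column names assumed:

    -- Sketch: Top-K per partition; one row per (base, counter) pair.
    SELECT base_asset_id, counter_asset_id, business_time, price
    FROM prices_sync.ticks
    WHERE business_time <= '2018-01-13T21:20:00Z'
      AND distribution_id = 1000
      AND price IS NOT NULL
    LIMIT 1 OVER (PARTITION BY base_asset_id, counter_asset_id
                  ORDER BY business_time DESC);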

  • BHIFP Vertica Customer

    This type of join is quite slow on our system, though. I was hoping that with a gap-filled table you could just do p.business_time1 = '2018-01-13T21:20:00Z', which would be much faster.
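
    One hedged workaround sketch: keep the gap-fill at query time, but bound the scan with a short lookback window before the target timestamp (the 20-minute window below is an assumption), so TS_LAST_VALUE only has to process a small slice of prices_sync.ticks:

    SELECT b.*, p.price
    FROM balances b
    JOIN (
        SELECT distribution_id,
               base_asset_id,
               counter_asset_id,
               business_time1,
               TS_LAST_VALUE(price IGNORE NULLS) AS price
        FROM prices_sync.ticks
        -- assumed lookback window; widen it if prices can be staler
        WHERE business_time BETWEEN '2018-01-13T21:00:00Z' AND '2018-01-13T21:20:00Z'
        TIMESERIES business_time1 AS '1 minute'
            OVER (PARTITION BY distribution_id, base_asset_id, counter_asset_id
                  ORDER BY business_time)
    ) p ON b.base_asset_id = p.base_asset_id
       AND b.counter_asset_id = p.counter_asset_id
    WHERE p.business_time1 = '2018-01-13T21:20:00Z'
      AND p.distribution_id = 1000;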

  • Well, speed aside, timeseries isn't supported in what you're trying to accomplish.

  • BHIFP Vertica Customer

    I suppose this is not possible in a performant way either then?

    Somewhat related: given the above prices_sync.ticks table, what would be the best way to pivot it by distribution_id, with business_time aligned to '1 minute'?

    Result I am looking for looks like this:

    business_time, base_asset_id, counter_asset_id, distribution_id_1, distribution_id_2, distribution_id_3, distribution_id_N

    where business_time is per minute.

  • Personally, I think you need to look at your design and find a more elegant way than populating a column via a timeseries function. That's very costly, regardless of system. What you're describing strikes me as a more operational use case, rather than something one would find in a robust, scalable analytics system. What you're trying to do is NOT going to scale to any magnitude, and even if you get it working reasonably well today, a year from now when you have 2x the data, the performance is going to make you sad.

    Vertica can do the timeseries relatively easily, and you could run this kind of thing at query time, but baking it into your table design is simply not a good idea.

  • BHIFP Vertica Customer

    You are right, Curtis, we don't have the best design for this use case. But I would like to hear your thoughts on a better design.

    Assuming we only have 4 columns:
    Exchange, Asset, Time, Price

    The original time series from the exchanges are at tick level, so not necessarily at a fixed time interval. I would like to normalize the time series from all exchanges to a fixed-interval time series and put all exchanges side by side, so I can take an average across them (a sketch follows below).

    I am doing this for production streaming purposes.

    Thanks
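
    For concreteness, a hedged sketch of that normalize-then-average step (exchange, asset, tick_time, and price are stand-in names for the four columns above):

    -- Gap-fill each exchange's ticks to 1-minute slices with TIMESERIES,
    -- then average the per-slice prices across exchanges.
    SELECT asset,
           slice_time,
           AVG(price_1m) AS avg_price
    FROM (
        SELECT exchange,
               asset,
               slice_time,
               TS_LAST_VALUE(price IGNORE NULLS) AS price_1m
        FROM ticks
        TIMESERIES slice_time AS '1 minute'
            OVER (PARTITION BY exchange, asset ORDER BY tick_time)
    ) q
    GROUP BY asset, slice_time;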

  • You might not need timeseries for that. You could just aggregate the time by using time_slice().
    https://www.vertica.com/docs/11.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Date-Time/TIME_SLICE.htm?zoom_highlight=time%20slice

    That would give you a specific time interval that's consistent across records. See if time_slice() can be useful here. It's not an analytic function, so it might be helpful.
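
    A minimal sketch of that approach, with the same stand-in column names as above:

    -- Bucket raw ticks into fixed 1-minute slices with TIME_SLICE();
    -- this is a plain GROUP BY, so no analytic functions are involved.
    SELECT exchange,
           asset,
           TIME_SLICE(tick_time, 1, 'MINUTE') AS slice_time,
           AVG(price) AS avg_price
    FROM ticks
    GROUP BY exchange,
             asset,
             TIME_SLICE(tick_time, 1, 'MINUTE');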

  • BHIFP Vertica Customer

    Thanks, Curtis. TIME_SLICE may not work if there are gaps at the minute level, though.

    I think I found a fairly generic way to do it with a multiphase UDTF, need to test how performant it is though.

  • marcothesane Administrator

    I have a dataset of sensor data (which is time series, too), and a pivoting example that could fit your purpose.
    My technique is actually to not keep a 'stroboscopic' picture of the time series, but to store it as on-change-only.

    And to apply the TIMESERIES clause only immediately before I run a Machine Learning function that needs an evenly spaced time series as input.

    Imagine a table with this data cutout: my msr would be your Asset; my veri_device_id would probably be your exchange.
    I only have a new row when a sensor value changes.

     veri_device_id |      data_date      |     msr      |  val   
    ----------------+---------------------+--------------+--------
               1538 | 2021-01-08 14:00:01 | frequency2   |  51.00
               1538 | 2021-01-08 14:00:01 | voltage1     | 227.00
               1538 | 2021-01-08 14:00:01 | voltage2     | 227.00
               1538 | 2021-01-08 14:00:01 | voltageGN    |   0.90
               1538 | 2021-01-08 14:01:01 | frequency1   |  51.00
               1538 | 2021-01-08 14:01:01 | frequency2   |  50.00
               1538 | 2021-01-08 14:01:01 | voltage1     | 228.00
               1538 | 2021-01-08 14:01:01 | voltage2     | 226.00
               1538 | 2021-01-08 14:02:01 | frequency1   |  50.00
               1538 | 2021-01-08 14:02:01 | humidity2    |  45.90
               1538 | 2021-01-08 14:02:01 | voltage1     | 227.00
               1538 | 2021-01-08 14:03:01 | humidity2    |  46.00
               1538 | 2021-01-08 14:04:00 | frequency1   |  51.00
               1538 | 2021-01-08 14:04:00 | frequency2   |  51.00
               1538 | 2021-01-08 14:05:01 | frequency1   |  50.00
               1538 | 2021-01-08 14:05:01 | frequency2   |  50.00
               1538 | 2021-01-08 14:05:01 | humidity2    |  45.90
               1538 | 2021-01-08 14:06:00 | voltage2     | 227.00
               1538 | 2021-01-08 14:07:02 | voltage2     | 226.00
               1538 | 2021-01-08 14:08:01 | voltage1     | 226.00
               1538 | 2021-01-08 14:09:01 | humidity2    |  45.80
               1538 | 2021-01-08 14:09:01 | temperature2 |  24.90
               1538 | 2021-01-08 14:10:02 | temperature2 |  25.00
    

    Now, I want a data cutout that pivots the three voltage measures, and only puts out a row when any of the three values changes. For that, I first pivot using, for example, MAX(CASE msr WHEN 'voltage1' THEN val END) for all three measures, with a GROUP BY on the timestamp, and then I apply LAST_VALUE(voltage1 IGNORE NULLS) to fill the resulting holes of the GROUP BY.

    WITH on_change_only AS (
    -- just a little cut out from my big table for the 3 measures ...
      SELECT * 
      FROM public.oco 
      WHERE data_date >= '2021-01-08 14:00' 
        AND msr IN('voltage1','voltage2','voltageGN')
      ORDER BY data_date
      LIMIT 32
    )
    ,
    grp AS (
      SELECT
        veri_device_id
      , data_date
      , MAX(CASE msr WHEN 'voltage1'  THEN val END) AS voltage1
      , MAX(CASE msr WHEN 'voltage2'  THEN val END) AS voltage2
      , MAX(CASE msr WHEN 'voltageGN' THEN val END) AS voltageGN
      FROM on_change_only
      GROUP BY 
        veri_device_id
      , data_date
      ORDER BY 
        veri_device_id
      , data_date
    )
    SELECT
      veri_device_id
    , data_date
    , LAST_VALUE(voltage1  IGNORE NULLS) OVER w AS voltage1
    , LAST_VALUE(voltage2  IGNORE NULLS) OVER w AS voltage2
    , LAST_VALUE(voltageGN IGNORE NULLS) OVER w AS voltageGN
    FROM grp
    WINDOW w AS (PARTITION BY veri_device_id) ORDER BY data_date;
     veri_device_id |      data_date      | voltage1 | voltage2 | voltageGN 
    ----------------+---------------------+----------+----------+-----------
               1538 | 2021-01-08 14:00:01 |   226.00 |   226.00 |      1.00
               1538 | 2021-01-08 14:01:01 |   226.00 |   226.00 |      1.00
               1538 | 2021-01-08 14:02:01 |   226.00 |   226.00 |      1.00
               1538 | 2021-01-08 14:06:00 |   226.00 |   226.00 |      1.00
               1538 | 2021-01-08 14:07:02 |   226.00 |   226.00 |      1.00
               1538 | 2021-01-08 14:08:01 |   226.00 |   226.00 |      1.00
               1538 | 2021-01-08 14:10:02 |   226.00 |   226.00 |      1.00
               1538 | 2021-01-08 14:13:02 |   226.00 |   226.00 |      1.00
               1538 | 2021-01-08 14:14:01 |   226.00 |   226.00 |      1.00
               1538 | 2021-01-08 14:15:02 |   226.00 |   226.00 |      1.00
               1538 | 2021-01-08 14:16:01 |   226.00 |   226.00 |      1.00
               1538 | 2021-01-08 14:17:01 |   226.00 |   226.00 |      1.00
               1538 | 2021-01-08 14:18:01 |   226.00 |   226.00 |      1.00
               1538 | 2021-01-08 14:19:00 |   226.00 |   226.00 |      1.00
               1538 | 2021-01-08 14:20:01 |   226.00 |   226.00 |      1.00
               1538 | 2021-01-08 14:21:00 |   226.00 |   226.00 |      1.00
               1538 | 2021-01-08 14:22:00 |   226.00 |   226.00 |      1.00
               1538 | 2021-01-08 14:23:00 |   226.00 |   226.00 |      1.00
               1538 | 2021-01-08 14:24:01 |   226.00 |   226.00 |      1.00
               1538 | 2021-01-08 14:25:03 |   226.00 |   226.00 |      1.00
               1538 | 2021-01-08 14:26:02 |   226.00 |   226.00 |      1.00
    
  • marcothesane Administrator

    Error in the query, sorry - better now ...

    WITH on_change_only AS (
    -- just a little cut out from my big table for the 3 measures ...
      SELECT * 
      FROM public.oco 
      WHERE data_date >= '2021-01-08 14:00' 
        AND msr IN('voltage1','voltage2','voltageGN')
      ORDER BY data_date
      LIMIT 32
    )
    ,
    grp AS (
      SELECT
        veri_device_id
      , data_date
      , MAX(CASE msr WHEN 'voltage1'  THEN val END) AS voltage1
      , MAX(CASE msr WHEN 'voltage2'  THEN val END) AS voltage2
      , MAX(CASE msr WHEN 'voltageGN' THEN val END) AS voltageGN
      FROM on_change_only
      GROUP BY 
        veri_device_id
      , data_date
      ORDER BY 
        veri_device_id
      , data_date
    )
    SELECT
      veri_device_id
    , data_date
    , LAST_VALUE(voltage1  IGNORE NULLS) OVER w AS voltage1
    , LAST_VALUE(voltage2  IGNORE NULLS) OVER w AS voltage2
    , LAST_VALUE(voltageGN IGNORE NULLS) OVER w AS voltageGN
    FROM grp
    WINDOW w AS (PARTITION BY veri_device_id ORDER BY data_date);
     veri_device_id |      data_date      | voltage1 | voltage2 | voltageGN 
    ----------------+---------------------+----------+----------+-----------
               1538 | 2021-01-08 14:00:01 |   227.00 |   227.00 |      0.90
               1538 | 2021-01-08 14:01:01 |   228.00 |   226.00 |      0.90
               1538 | 2021-01-08 14:02:01 |   227.00 |   226.00 |      0.90
               1538 | 2021-01-08 14:06:00 |   227.00 |   227.00 |      0.90
               1538 | 2021-01-08 14:07:02 |   227.00 |   226.00 |      0.90
               1538 | 2021-01-08 14:08:01 |   226.00 |   226.00 |      0.90
               1538 | 2021-01-08 14:10:02 |   227.00 |   226.00 |      0.90
               1538 | 2021-01-08 14:13:02 |   226.00 |   227.00 |      0.90
               1538 | 2021-01-08 14:14:01 |   226.00 |   226.00 |      0.90
               1538 | 2021-01-08 14:15:02 |   226.00 |   225.00 |      1.00
               1538 | 2021-01-08 14:16:01 |   227.00 |   226.00 |      0.90
               1538 | 2021-01-08 14:17:01 |   226.00 |   226.00 |      0.90
               1538 | 2021-01-08 14:18:01 |   227.00 |   226.00 |      0.90
               1538 | 2021-01-08 14:19:00 |   226.00 |   226.00 |      1.00
               1538 | 2021-01-08 14:20:01 |   226.00 |   226.00 |      0.90
               1538 | 2021-01-08 14:21:00 |   227.00 |   226.00 |      0.90
               1538 | 2021-01-08 14:22:00 |   226.00 |   226.00 |      0.90
               1538 | 2021-01-08 14:23:00 |   226.00 |   226.00 |      1.00
               1538 | 2021-01-08 14:24:01 |   227.00 |   225.00 |      1.00
               1538 | 2021-01-08 14:25:03 |   226.00 |   226.00 |      0.90
               1538 | 2021-01-08 14:26:02 |   226.00 |   226.00 |      1.00
    
  • BHIFP Vertica Customer

    Thanks, marcothesane, this is really helpful. I just wonder if this is a "live" table/projection which can update itself as new data comes in.

    If not, my UDTF might be a more flexible and repeatable solution. I am thinking of something like this:

        SELECT AvgPricePerPair(distribution_id, base_asset_id, counter_asset_id, business_time, price USING PARAMETERS lookback=10, period='sec')
        OVER (PARTITION BY distribution_id, base_asset_id, counter_asset_id)
        FROM prices_sync.ticks
        WHERE base_asset_id=10007 AND counter_asset_id=840 ORDER BY business_time;
    

    AvgPricePerPair is a multiphase UDTF, where
    the first phase gathers data across different distribution_ids, and
    the second phase aggregates across distribution_ids.

    Would like to hear your thoughts.

    Thanks

  • marcothesane Administrator

    I think I'd need a significant cutout of your input data - and how you calculate AvgPricePerPair - to really come up with a solution.

    If my approach with the sensor data here is helpful to you - consider a Live Aggregate Projection on the base table:

    CREATE PROJECTION base_tb_agg AS
      SELECT
        veri_device_id
      , data_date
      , MAX(CASE msr WHEN 'voltage1'  THEN val END) AS voltage1
      , MAX(CASE msr WHEN 'voltage2'  THEN val END) AS voltage2
      , MAX(CASE msr WHEN 'voltageGN' THEN val END) AS voltageGN
      FROM base_tb
      GROUP BY 
        veri_device_id
      , data_date
    ;
    

    (Of course, with a MAX(CASE msr WHEN ... END) for each measure that you have in the table.)
    You can't build a Live Aggregate Projection using OLAP functions like LAST_VALUE() OVER().
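
    A hedged sketch of how the pieces would combine: write the aggregate part of the query so it matches the projection's GROUP BY (letting the optimizer answer it from base_tb_agg), and layer the hole-filling LAST_VALUE() on top at query time:

    -- The inner query matches the LAP's shape, so Vertica can satisfy it
    -- from base_tb_agg; the OLAP hole-filling runs at query time only.
    SELECT veri_device_id,
           data_date,
           LAST_VALUE(voltage1  IGNORE NULLS) OVER w AS voltage1,
           LAST_VALUE(voltage2  IGNORE NULLS) OVER w AS voltage2,
           LAST_VALUE(voltageGN IGNORE NULLS) OVER w AS voltageGN
    FROM (
        SELECT veri_device_id,
               data_date,
               MAX(CASE msr WHEN 'voltage1'  THEN val END) AS voltage1,
               MAX(CASE msr WHEN 'voltage2'  THEN val END) AS voltage2,
               MAX(CASE msr WHEN 'voltageGN' THEN val END) AS voltageGN
        FROM base_tb
        GROUP BY veri_device_id, data_date
    ) agg
    WINDOW w AS (PARTITION BY veri_device_id ORDER BY data_date);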

  • BHIFP Vertica Customer

    Your dataset was exactly on point. My AvgPricePerPair function literally takes an average, in my case across distribution_id, and in your case across different msr values.
    MAX will not work, though; I need LAST_VALUE per time interval.

    Since your solution cannot be live, it seems my UDTF is the better solution, then. I am able to run it and get results back in under 500 ms.
