Can one build a live table or projection out of TIMESERIES?
BHIFP
CREATE PROJECTION prices_sync.ticks_mins /*+basename(ticks),createtype(L)*/
(
    distribution_id  ENCODING RLE,
    base_asset_id    ENCODING RLE,
    counter_asset_id ENCODING RLE,
    business_time1   ENCODING DELTAVAL,
    price            ENCODING AUTO
)
AS SELECT
    distribution_id,
    base_asset_id,
    counter_asset_id,
    business_time1,
    TS_LAST_VALUE(price IGNORE NULLS) AS price
FROM prices_sync.ticks
TIMESERIES business_time1 AS '1 minute'
    OVER (PARTITION BY distribution_id, base_asset_id, counter_asset_id
          ORDER BY business_time)
SEGMENTED BY HASH(base_asset_id, counter_asset_id) ALL NODES OFFSET 0
KSAFE 1;
It complains:
ERROR 6145: Analytic functions are not allowed in projections.
What is the right way to have a "live" version of this statement?
Thanks
Answers
@BHIFP: You cannot use TS_LAST_VALUE in aggregate projections; we support only 4 functions for aggregate projections. Please find the link below. Are you facing performance issues when you run the SELECT part of the query directly?
https://www.vertica.com/docs/11.0.x/HTML/Content/Authoring/AnalyzingData/AggregatedData/AggregateFunctionsSupported.htm
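For contrast, here is a minimal sketch (not from the thread) of the form an aggregate projection must take, using only a supported aggregate such as MAX; it reuses the table and columns from the question, but it does not solve the per-minute last-value requirement by itself:

-- Aggregate projection sketch: only supported aggregates (e.g. MAX) are allowed,
-- which is why the TS_LAST_VALUE version above is rejected.
CREATE PROJECTION prices_sync.ticks_agg
AS SELECT
    distribution_id,
    base_asset_id,
    counter_asset_id,
    MAX(price) AS max_price
FROM prices_sync.ticks
GROUP BY distribution_id, base_asset_id, counter_asset_id;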
What I need is to join prices with balances, say, at a specific timestamp, something like this:
prices_sync.ticks is quite large, so the above really does not fly.
Somewhat related: given the above prices_sync.ticks table, what would be the best way to pivot it by distribution_id, with business_time aligned to '1 minute'?
The result I am looking for looks like this:
where business_time is per minute.
Thanks
I'm not sure you need a timeseries statement for this. Aren't you just wanting the most recent non-null price for a given set of filters? MAX() should be able to give you that.
Thanks, Curtis, I don't follow it though. Do you mind writing the statement in full?
I guess not a max() - that would give something different. But it seems like this would work, or at least something similar.
-- most recent non-null price at or before the given timestamp,
-- for the asset pairs present in balances
select price
from prices_sync.ticks p
join balances b
  on  b.base_asset_id    = p.base_asset_id
  and b.counter_asset_id = p.counter_asset_id
where p.business_time1 <= '2018-01-13T21:20:00Z'
  and distribution_id = 1000
  and price is not null
order by p.business_time1 desc   -- newest first, so "limit 1" returns the latest tick
limit 1;
These kinds of joins are quite slow on our system, though. I was hoping that with a gap-filled table, you could just do p.business_time1 = '2018-01-13T21:20:00Z', which would be much faster.
Well, speed aside, timeseries isn't supported in what you're trying to accomplish.
I suppose this is not possible in a performant way either then?
Personally, I think you need to look at your design and find a more elegant way than populating a column via a timeseries function. That's very costly, regardless of the system. What you're describing strikes me as a more operational use case, rather than something one would find in a robust, scalable analytics system. What you're trying to do is NOT going to scale to any magnitude, and even if you get it working reasonably well today, a year from now, when you have 2x the data, the performance is going to make you sad.
Vertica can do the timeseries relatively easily, and you could run this kind of thing at query time, but baking it into your table design is simply not a good idea.
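For reference, the SELECT part of the projection in the question runs fine as an ordinary query; a sketch of the query-time version:

-- Gap-fill to 1-minute buckets at query time instead of inside a projection.
SELECT distribution_id,
       base_asset_id,
       counter_asset_id,
       business_time1,
       TS_LAST_VALUE(price IGNORE NULLS) AS price
FROM prices_sync.ticks
TIMESERIES business_time1 AS '1 minute'
    OVER (PARTITION BY distribution_id, base_asset_id, counter_asset_id
          ORDER BY business_time);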
You are right, Curtis, we don't have the best design for this use case. But I would like to hear your thoughts on a better design.
Assuming we only have 4 columns:
Exchange, Asset, Time, Price
The original time series from exchanges are at the tick level, so not necessarily at a fixed time interval. I would like to normalize the time series from all exchanges to fixed-interval time series and put all exchanges side by side, so I can take an average across them.
I am doing this for production streaming purposes.
Thanks
You might not need timeseries for that. You could just aggregate the time by using time_slice().
https://www.vertica.com/docs/11.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Date-Time/TIME_SLICE.htm?zoom_highlight=time slice
That would give you a specific time interval that's consistent across records. See if time_slice() can be useful here. It's not an analytic function, so it might be helpful.
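A hedged sketch of what that could look like against the ticks table; MAX(price) is only a placeholder aggregate, and note that TIME_SLICE buckets existing rows but does not fill empty minutes:

-- Bucket raw ticks into consistent 1-minute slices.
SELECT TIME_SLICE(business_time, 1, 'MINUTE') AS slice_minute,
       distribution_id,
       base_asset_id,
       counter_asset_id,
       MAX(price) AS price
FROM prices_sync.ticks
GROUP BY TIME_SLICE(business_time, 1, 'MINUTE'),
         distribution_id,
         base_asset_id,
         counter_asset_id;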
Thanks, Curtis. TIME_SLICE may not work if there are gaps at the minute level.
I think I found a fairly generic way to do it with a multiphase UDTF; I need to test how performant it is, though.
I have a dataset of sensor data (which is time series, too), and a pivoting example that could fit your purpose.
My technique is actually not to keep a 'stroboscopic' picture of the time series, but to store it on-change-only, and to apply the TIMESERIES clause only immediately before I run a Machine Learning function that needs evenly spaced time series as input.
Imagine a table with this data cutout: my msr would be your Asset; my veri_device_id will probably be your exchange. I only have a new row when the sensor value changes.
Now, I want a data cutout that pivots the three voltage measures, and only puts out a row when any of the three values changes. For that, I first pivot using, for example, MAX(CASE msr WHEN 'voltage1' THEN voltage1 END) for all three measures and GROUP BY the timestamp, and then I apply a LAST_VALUE(voltage1 IGNORE NULLS) to fill the resulting holes of the GROUP BY. (Error in the query, sorry - better now ...)
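A sketch of that two-step pattern; the sensor table and column names (sensor_readings, veri_device_id, ts, msr, reading) are assumed here, since the actual table definition was not shown in the thread:

-- Step 1: pivot the on-change-only rows into one row per device and timestamp.
-- Step 2: LAST_VALUE ... IGNORE NULLS carries the last known reading forward over
--         the NULL holes the pivot leaves (the default frame ends at the current row).
WITH pivoted AS (
    SELECT veri_device_id,
           ts,
           MAX(CASE msr WHEN 'voltage1' THEN reading END) AS voltage1,
           MAX(CASE msr WHEN 'voltage2' THEN reading END) AS voltage2,
           MAX(CASE msr WHEN 'voltage3' THEN reading END) AS voltage3
    FROM sensor_readings
    GROUP BY veri_device_id, ts
)
SELECT veri_device_id,
       ts,
       LAST_VALUE(voltage1 IGNORE NULLS) OVER (PARTITION BY veri_device_id ORDER BY ts) AS voltage1,
       LAST_VALUE(voltage2 IGNORE NULLS) OVER (PARTITION BY veri_device_id ORDER BY ts) AS voltage2,
       LAST_VALUE(voltage3 IGNORE NULLS) OVER (PARTITION BY veri_device_id ORDER BY ts) AS voltage3
FROM pivoted;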
Thanks, marcothesane, this is really helpful. Just wondering if this is a "live" table/projection which can update itself as new data comes in.
If not, my UDTF might be a more flexible and repeatable solution. I am thinking something like this:
AvgPricePerPair is a multiphase UDTF, where
the first phase gathers data across different distribution_ids, and
the second phase aggregates across distribution_ids (rough invocation sketched below).
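Purely as an illustration, the invocation might look like this; AvgPricePerPair and its argument list are hypothetical (it is the UDTF described above, not a built-in), only the OVER clause syntax for calling a transform function is standard Vertica:

-- Hypothetical call: phase one gathers prices per distribution_id,
-- phase two averages across distribution_ids for each asset pair.
SELECT AvgPricePerPair(distribution_id, business_time, price)
       OVER (PARTITION BY base_asset_id, counter_asset_id ORDER BY business_time)
FROM prices_sync.ticks;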
Would like to hear your thoughts.
Thanks
I think I'd need a significant cutout of your input data - and how you calculate AvgPricePerPair - to really come up with a solution. If my approach with the sensor data here is helpful to you, consider a Live Aggregate Projection on the base table (of course, with a MAX(CASE msr WHEN ... for each measure that you have in the table). You can't build a Live Aggregate Projection using OLAP functions like LAST_VALUE() OVER().
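A hedged sketch of such a Live Aggregate Projection, reusing the assumed sensor names from earlier; it maintains only the pivot step, and the LAST_VALUE gap fill still has to happen at query time, per the note above:

-- Live aggregate projection: one pivoted row per device and timestamp,
-- maintained automatically as new sensor rows arrive.
CREATE PROJECTION sensor_readings_pivot
AS SELECT
    veri_device_id,
    ts,
    MAX(CASE msr WHEN 'voltage1' THEN reading END) AS voltage1,
    MAX(CASE msr WHEN 'voltage2' THEN reading END) AS voltage2,
    MAX(CASE msr WHEN 'voltage3' THEN reading END) AS voltage3
FROM sensor_readings
GROUP BY veri_device_id, ts;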
Your dataset was exactly on point. My AvgPricePerPair function was literally taking an average - in my case across distribution_id, and in your case across different msr values.
MAX will not work, I need LAST_VALUE per time interval.
Since your solution cannot be live, it seems my UDTF is the better solution, then. I am able to run it and get results back in under 500 ms.