Inconsistency in vertica query results
I have a relatively complicated query that is supposed to calculate standard deviation of BidPrice differences within each hour of a day:
with ticks as (
select *, TickTime at time zone 'Europe/London' at time zone 'utc' as TickTimeGMT from TickHistory.Futures
where (TickType = 'Quote' and BidPrice < AskPrice)
and RIC in ('FGBLM0') and TickTime > '2020-03-19 00:00:00 UTC' and TickTime < '2020-03-20 00:00:00 UTC'
),
bars as (
SELECT TimeSlice, TS_LAST_VALUE(BidPrice, 'const') as BidPrice
from ticks
TIMESERIES TimeSlice AS '1 seconds' over (ORDER BY TickTimeGMT)
),
differences as (
SELECT TimeSlice, (LEAD(BidPrice) over (order by TimeSlice) - BidPrice) as BidPriceDiff
from bars
)
SELECT DISTINCT TIME_SLICE(TimeSlice, 3600, 'SECOND') at time zone 'Europe/London' at time zone 'utc' as BarTime,
STDDEV(BidPriceDiff) OVER (PARTITION BY TIME_SLICE(TimeSlice, 3600, 'SECOND')) as Volatility
from differences;
Everything in this query seems deterministic. Nevertheless, every time I execute this query I get slightly but noticeably different results. What could be the reason for it and is there a way to improve the query to get stable results?
Answers
Can you post the DDL for the TickHistory.Futures table?
Thanks Jim, here it is:
CREATE TABLE TickHistory.futures
(
RIC varchar(50) NOT NULL,
AliasUnderlyingRIC varchar(50),
TickDomain varchar(25),
TickTime timestamp NOT NULL,
GMTOffset int,
TickType varchar(25) NOT NULL,
Price float,
Volume int,
BidPrice float,
BidSize int,
AskPrice float,
AskSize int,
Qualifiers varchar(255),
DateTimeTicks int NOT NULL,
InsertDate int NOT NULL,
InsertCount int NOT NULL
)
PARTITION BY (((date_part('year', futures.TickTime) * 100) + date_part('month', futures.TickTime)));
Hi Jim,
If a hard table is used rather than a temp the result set remains constant:-
-- Create hard table with the data
select * into TickHistory.Futures_STDDEV from TickHistory.Futures
where (TickType = 'Quote' and BidPrice < AskPrice)
and RIC in ('FGBLM0') and TickTime > '2020-03-19 00:00:00 UTC' and TickTime < '2020-03-20 00:00:00 UTC';
select count(*) from TickHistory.Futures_STDDEV;
1,705,710
-- Then run STDDEV against the hard table, the numbers remain unchanged.
with ticks as (
select *, TickTime at time zone 'Europe/London' at time zone 'utc' as TickTimeGMT from TickHistory.Futures_STDDEV
),
bars as (
SELECT TimeSlice, TS_LAST_VALUE(BidPrice, 'const') as BidPrice
from ticks
TIMESERIES TimeSlice AS '1 seconds' over (ORDER BY TickTimeGMT)
),
differences as (
SELECT TimeSlice, (LEAD(BidPrice) over (order by TimeSlice) - BidPrice) as BidPriceDiff
from bars
)
SELECT DISTINCT TIME_SLICE(TimeSlice, 3600, 'SECOND') at time zone 'Europe/London' at time zone 'utc' as BarTime,
STDDEV(BidPriceDiff) OVER (PARTITION BY TIME_SLICE(TimeSlice, 3600, 'SECOND')) as Volatility
from differences;
Hi Tim,
The simple design for stddev on float just isn't that stable, because float has rounding to it and so depending on the order the rows are processed you get different answers. This is very noticeable when the magitude of the deviation is small compared to the magnitude of the values going in... to the point where stddev(8+ digit constant) is not always 0.
Understood, thank you for the answer!
Or you can apply below trick to get a constant STDDEV value for your query.
The only downside is that you need to work on adding 1 more layer of subquery.
This may help you get consistent value of STDDEV.
SELECT col1::int
,SQRT(SUM(POWER(col2-avg_analytic,2)) / (cnt_analytic-1) ) AS "STDDEV_n-1(STDDEV)"
FROM (
SELECT
aaa.*
,AVG(col2) OVER (PARTITION BY col1) as avg_analytic
,COUNT(col2) OVER (PARTITION BY col1) as cnt_analytic
from aaa
) bbb
GROUP BY col1 ,cnt_analytic
ORDER BY 1;