Ordering by non-time data within a timeseries clause

eric_the_read · February 2016

I have a vertica table that looks something like this:

CREATE_TABLE sensor_data (

device_id varchar(255),

occurred_at timestamp NOT NULL,

sequence_id int,

data_1 int,

data_2 int

)

The problem comes when trying to use a TIMESERIES clause to build a running state for the device. Since the data fields are sparsely populated, an apparently valid query might look something like this:

SELECT slice_time, device_id, TS_LAST_VALUE(data_1 IGNORE NULLS), TS_LAST_VALUE(field2 IGNORE NULLS)

FROM (SELECT occurred_at, device_id, data_1, data_2, FROM sensor_data)

TIMESERIES slice_time AS '1 minute' OVER (ORDER BY occurred_at)

The problem here is the devices providing us with the data are not reliable. In particular, a device may report two or more different samples at the same timestamp, so we must use the sequence number (which is reliably monotonically increasing) to identify the most recent (and as it turns out, correct) sample. In this case, the value I want from data_1 is 3, and for data_2 is 4.

INSERT INTO sensor_data VALUES ("device_1", '2016-02-10 12:00:00', 1000, 1, 2);

INSERT INTO sensor_data VALUES ("device_1", '2016-02-10 12:00:00', 1001, 3, 4);

With the above TIMESERIES clause, it seems that Vertica is apparently choosing the first sample at a given occurred_at to be the one reported from TS_LAST_VALUE. If this were not in a TIMESERIES clause, I'd just ORDER BY occurred_at, sequence_id and be done with it, but given that it is, that option does not appear to be available.

So far, I've come up with the following ideas:

Don't insert the overwritten data in the first place! The problem here is that the data is ETL'd from another database, and at the time we do that (by generating files that we COPY FROM), we may not have the previous sample available.
Delete the superseded samples. On the whole, I'd like to keep them around, if for no other reason that to be able to provide the device makers with data proving the trouble they're causing me.
Write a DB crawler that periodically goes through the data, marking superseded events, allowing me to reject those samples from my inner query.

I don't like any of them, to be honest; even #3 is going to produce wrong answers for some period of time (until the crawler runs again). Is there a way to coerce Vertica into ordering apparently identical timestamps by a non-time column, or some other solution that allows me to generate the timeseries using the most recent sequence_id for a particular timestamp?

Thanks for any suggestions.

eli_revach · March 2016

Hi ,

You can simply overcome this limitation by enriching your timestamp sample value .

Timestamp data type , can include milliseconds parts , you can enrich your sample value with the sequence _id value , the value of the sequence id will be added to the sample as milliseconds part .

Using that method your data in will look like

INSERT INTO sensor_data VALUES ("device_1", '2016-02-10 12:00:00.1000', 1000, 1, 2);
INSERT INTO sensor_data VALUES ("device_1", '2016-02-10 12:00:00.1000', 1001, 3, 4);

Example from the documentations :

CREATE TABLE temp (datecol TIMESTAMP);
INSERT INTO temp VALUES (TIMESTAMP '2010-03-25 12:47:32.62');
INSERT INTO temp VALUES (TIMESTAMP '2010-03-25 12:55:49.123456');
INSERT INTO temp VALUES (TIMESTAMP '2010-03-25 01:08:15.12374578');
SELECT * FROM temp;
datecol
----------------------------
2010-03-25 12:47:32.62
2010-03-25 12:55:49.123456
2010-03-25 01:08:15.123746
(3 rows)

I hope you will find it useful

Thanks

We're Moving!

Create My New Community Account Now

Ordering by non-time data within a timeseries clause

Comments

Leave a Comment