Ordering by non-time data within a timeseries clause
I have a vertica table that looks something like this:
CREATE_TABLE sensor_data (
occurred_at timestamp NOT NULL,
The problem comes when trying to use a TIMESERIES clause to build a running state for the device. Since the data fields are sparsely populated, an apparently valid query might look something like this:
SELECT slice_time, device_id, TS_LAST_VALUE(data_1 IGNORE NULLS), TS_LAST_VALUE(field2 IGNORE NULLS)
FROM (SELECT occurred_at, device_id, data_1, data_2, FROM sensor_data)
TIMESERIES slice_time AS '1 minute' OVER (ORDER BY occurred_at)
The problem here is the devices providing us with the data are not reliable. In particular, a device may report two or more different samples at the same timestamp, so we must use the sequence number (which is reliably monotonically increasing) to identify the most recent (and as it turns out, correct) sample. In this case, the value I want from data_1 is 3, and for data_2 is 4.
INSERT INTO sensor_data VALUES ("device_1", '2016-02-10 12:00:00', 1000, 1, 2);
INSERT INTO sensor_data VALUES ("device_1", '2016-02-10 12:00:00', 1001, 3, 4);
With the above TIMESERIES clause, it seems that Vertica is apparently choosing the first sample at a given occurred_at to be the one reported from TS_LAST_VALUE. If this were not in a TIMESERIES clause, I'd just ORDER BY occurred_at, sequence_id and be done with it, but given that it is, that option does not appear to be available.
So far, I've come up with the following ideas:
- Don't insert the overwritten data in the first place! The problem here is that the data is ETL'd from another database, and at the time we do that (by generating files that we COPY FROM), we may not have the previous sample available.
- Delete the superseded samples. On the whole, I'd like to keep them around, if for no other reason that to be able to provide the device makers with data proving the trouble they're causing me.
- Write a DB crawler that periodically goes through the data, marking superseded events, allowing me to reject those samples from my inner query.
I don't like any of them, to be honest; even #3 is going to produce wrong answers for some period of time (until the crawler runs again). Is there a way to coerce Vertica into ordering apparently identical timestamps by a non-time column, or some other solution that allows me to generate the timeseries using the most recent sequence_id for a particular timestamp?
Thanks for any suggestions.