Why does nth_value only return a result for a few lines of data?
I have a query that is designed to give a due date ~3 business days after the order was received. I have 1 table that has the order information in it and another table that lists out all the days, the products, the place produced, and a flag to indicate if it is a business day for that combination. I have asked for help with this before and I was directed to using nth_value, however out of 34k lines for 1 days worth of data, it returns 7 lines where the due date is not null (and 3 of those lines have the incorrect value in them). I am not understanding what I have done wrong, so I am hoping that someone might be able to help me understand this function better or give me another idea of how to accurately give a due date that varies on a day, product, and place produced basis.
Best Answer
-
Vertica_Curtis Employee
I think your partition by clause is a bit out of control. I'm not entirely certain that matters here, but it's probably very redundant. Seems like this only needs to be partitioned by vendorId. But see if this gets you what you're looking for. It might be that the window frame clause is missing and that's causing it to do weird things.
select t.vendorid, n.InvoicedDate, sladays
, nth_Value(n.InvoicedDate, SlaDays) over (partition by n.VendorID order by n.InvoicedDate asc rows between current row and unbounded following) as due_Date
from temp t
join networkdays n on n.InvoicedDate > t.ReceivedTime::date and t.VendorId = n.VendorId
where n.DateFlag = 'Y'
order by t.VendorId, n.InvoicedDate ;5
Answers
@Elemist11 - Please post some sample data and the query that you tried.
Attached is the full export from the networkdays table that is used below and also the results from the queries below.
create table temp (shipbinid numeric(19,0), accountid numeric(19,0), skuid varchar(64), optionid varchar(64), lineitemoptionid numeric(19,0), quantity numeric(19,0), productcategory varchar(64), vendorid varchar(64), fulfillmentlocation varchar(64), contextname varchar(64), projectid numeric(19,0), invoicedtime timestamp, shippedtime timestamp, receivedtime timestamp, fulfilledtime timestamp, deliveredtime timestamp, shipbinstatus varchar(64), originalshipbin varchar(64), deliveryoption varchar(64), cartedd date, maxedd date, shipper varchar(64), trackingid varchar(64), reorderreasoncode varchar(256), reorderagentid varchar(128), eddperformance varchar(64), sladays integer)
insert into temp values
(1,100,'CP_1','NA',-1,49,'p','a','a','au',50,'12/24/2019 19:15','12/30/2019 13:07','12/24/2019 20:22','12/30/2019 13:07',null,'Complete',null,'sd','1/9/2020','1/9/2020','POST','NA',null, null,'N/A',3);
insert into temp values
(2,101,'CP_2','NA',-1,2,'p','b','b','us',51,'12/11/2019 15:14','12/15/2019 4:53','12/11/2019 16:20','12/15/2019 4:53','12/15/2019 4:53','Delivered',null,'ec','12/24/2019','12/23/2019','MI','75',null, null,'Before',3);
insert into temp values
(3,102,'CP_3','NA',-1,5,'c','d','d','us',52,'11/14/2019 10:14','11/15/2019 15:47','11/14/2019 11:25','11/15/2019 15:47','11/27/2019 18:52','Delivered',null,'sd','11/28/2019','11/27/2019','AU','KL23',null, null,'Before',3);
insert into temp values
(4,103,'CP_4','CO_17',17,4,'pb','j',null,'uk',53,'5/10/2019 8:49','5/11/2019 22:15','5/10/2019 10:07','5/11/2019 22:15','6/27/2019 12:09','Delivered',null,'ec','5/19/2019','6/27/2019','ECOM','123',null, null,'Past',3);
insert into temp values
(5,104,'CP_1','NA',-1,9,'hd','e','e','us',54,'7/2/2019 16:34','7/9/2019 3:35','7/2/2019 17:42','7/9/2019 3:35','7/12/2019 13:35','Delivered',null,'sd','7/11/2019','7/10/2019','US','NA',null, null,'Past',3);
That is the foundation of the data.
select
seven.*,
case
when FulfilledTime is null and current_date() - DueDate > 0 then 'Late'
when FulfilledTime::date - DueDate > 0 then 'Late'
when FulfilledTime::date - DueDate <= 0 then 'On Time'
else 'In Production' end SLAPerformance
from (
select
ShipBinId,
AccountId,
SkuId,
OptionId,
LineItemOptionId,
Quantity,
ProductCategory,
VendorId,
fulfillmentLocation,
ContextName,
ProjectId,
InvoicedTime,
ShippedTime,
ReceivedTime,
FulfilledTime,
DeliveredTime,
ShipBinStatus,
OriginalShipbin,
DeliveryOption,
CartEDD,
maxEDD,
Shipper,
TrackingId,
ReorderReasonCode,
ReorderAgentId,
EDDPerformance,
SlaDays,
max(DueDate) DueDate
-- six.*
from (
select
a.*,
nth_value(e.InvoicedDate, SlaDays) over
(partition by ShipBinId,
AccountId,
SkuId,
-- OptionId,
-- LineItemOptionId,
-- Quantity,
ProductCategory,
a.VendorId,
fulfillmentLocation,
ContextName,
ProjectId,
InvoicedTime,
ShippedTime,
ReceivedTime,
FulfilledTime,
DeliveredTime,
ShipBinStatus,
OriginalShipbin,
DeliveryOption,
CartEDD,
maxEDD,
Shipper,
TrackingId,
ReorderReasonCode,
ReorderAgentId,
EDDPerformance,
SlaDays
order by e.InvoicedDate) DueDate
from temp a join networkdays e
on
e.InvoicedDate > a.ReceivedTime::date and
e.VendorId = a.VendorId
where
e.DateFlag = 'Y'
) six
--limit 1 over (partition by ShipBinId,
-- AccountId,
-- SkuId,
-- OptionId,
-- LineItemOptionId,
-- Quantity,
-- ProductCategory,
-- six.VendorId,
-- fulfillmentLocation,
-- ContextName,
-- ProjectId,
-- InvoicedTime,
-- ShippedTime,
-- ReceivedTime,
-- FulfilledTime,
-- DeliveredTime,
-- ShipBinStatus,
-- OriginalShipbin,
-- DeliveryOption,
-- CartEDD,
-- maxEDD,
-- Shipper,
-- TrackingId,
-- ReorderReasonCode,
-- ReorderAgentId,
-- EDDPerformance,
-- SlaDays
-- order by DueDate desc)
group by
ShipBinId,
AccountId,
SkuId,
OptionId,
LineItemOptionId,
Quantity,
ProductCategory,
VendorId,
fulfillmentLocation,
ContextName,
ProjectId,
InvoicedTime,
ShippedTime,
ReceivedTime,
FulfilledTime,
DeliveredTime,
ShipBinStatus,
OriginalShipbin,
DeliveryOption,
CartEDD,
maxEDD,
Shipper,
TrackingId,
ReorderReasonCode,
ReorderAgentId,
EDDPerformance,
SlaDays
) seven
Normally when I run this, for every line, I will have a [null] value for DueDate, but if I were to single out one of the lines and run it for that shipbinid, it would give me a value DueDate, however sometimes not the correct one. In this example, it gives a duedate for 3 out of the 5 lines, however on a larger scale the results are a lot lower.
Hello Curtis,
I think it works, I just need to validate, but I am getting results for every line that at first glance make sense. Thank you so much for your help. You are awesome and you helped to finish my other question I had asked that you were helping me with.
Hey, great! Glad to help. Feel free to reach out any time.
Hello Curtis,
Unfortunately I have found some instances where things are not calculating properly.
The two that I have highlighted are examples of that. Anything Received on 1/1/19 should be due 1/4/19 and anything received on 1/2/19 should be due 1/7/19 per the networkdays table. Do you know why nth_value would be working correctly on some rows of data, but not on others?
Thank you again so much for your help.
It seems nth_value returns a prior value if it exceeds the given range. I don't have data representative of this in my dataset, but I'd need to see what values in NetworkDays exist for this combination of dates/vendor. The "between current row and unbounded following" is inclusive of current row. So, it's counting the start date (receivedTime in your example) as one of the days. You can verify this by looking at "nth_value(InvoicedDate, 1), for example, or 2, or 3, or 4 - instead of "SlaDays" as a parameter. It will show you that specific date. The "unbounded following" should include all days past these dates, but maybe there are no dates past these dates? So, if I had to guess, the nth_value is returning the last date it has available to it (it can't make up data!). In other words, if the data only has data up to 1-03, and the dateflag for 1-4 and 1-5 are 'N' - and then the data stops (There are no more days after that), then nth_value is going to return 1-03, since there is no value beyond that one that satisfies the function.
Think of it this way. If I had 3 children, Gertrude, Mary and Bill, and I ask nth_value to return the name of my 4th child, it's going to return Bill. Bill doesn't make sense as an answer, but neither did me asking for the name of my 4th child.
But that's a guess - I don't know what the specific details are for those records.
And because current row is inclusive, it might satisfy your query to put in "slaDays+1" into nth_value, so that it doesn't count the starting date as one of it's dates.
Hello Curtis,
Your explanation definitely makes sense, but for every vendor I have every day accounted for from 1/1/18 - 12/31/20. Every day has a 'Y' or 'N' flag and for the exact same combination I get two different results. In the image I provided above, these two are the exact same vendor, product, etc.
Yet the bottom one gives the correct duedate of 1/4/19 and the top one does not.
So these two records are the same, the only difference being the timestamp associated with both dates?
There are other minor details that differ like quantity, but the way the networkdays table works, it only focuses on receiveddate and vendor and uses the SLADays to count. Given that those all match, so should the due date.
And what do the incremental nth_values return? Run your query, but instead of sladays as an offset, us 1, 2, and 3. I'm curious what those offsets look like.
1:
2019-01-01 20:14:00 2019-01-02
2019-01-01 19:52:00 2019-01-02
2:
2019-01-01 20:14:00 2019-01-03
2019-01-01 19:52:00 2019-01-02
3:
2019-01-01 20:14:00 2019-01-04
2019-01-01 19:52:00 2019-01-02
4:
2019-01-01 20:14:00 2019-01-07
2019-01-01 19:52:00 2019-01-02
So the 20:14:00 works correctly every time, the other does not.
So, if I'm reading that right, the entry for 20:14 goes to the 2nd, the 3rd, the 4th and then (for nth_value=4) the 7th? And the other record never advances beyond the 2nd? These are the same vendorId? That doesn't make a lot of sense to me. If they're pulling in the same set of values, why would nth_value return a different answer for two different records? What happens if you truncate the time on this? Is it possible that the timestamp is somehow throwing it off? I couldn't imagine that's the case, but it's worth eliminating. Your nth_value partition is just on vendorId, right? Anything else might confuse it.
Yes, you are reading that correctly, 20:14 works in all ways, and the other does not.
Here is the portion of the query using nth_value:
select
five.*,
nth_value(e.InvoicedDate, 4) over
(partition by e.VendorId, shipbinid order by e.invoiceddate asc rows between current row and unbounded following) DueDate
Truncating the time didn't help:
2019-01-01 00:00:00 2019-01-07
2019-01-01 00:00:00 2019-01-02
Here is a snapshot of the data
But nothing that varies should have any effect on the Due Date calculation. I am just as confused as you regarding why this is happening. Below is the full query again to give color behind how it is compiled:
select
seven.*,
case
when FulfilledTime is null and current_date() - DueDate > 0 then 'Late'
when FulfilledTime::date - DueDate > 0 then 'Late'
when FulfilledTime::date - DueDate <= 0 then 'On Time'
else 'In Production' end SLAPerformance
from (
select
six.*
from (
select
a.*,
nth_value(e.InvoicedDate, 4) over
(partition by e.VendorId, shipbinid order by e.invoiceddate asc rows between current row and unbounded following) DueDate
) temp a join networkdays e
on
e.InvoicedDate > a.ReceivedTime and
e.VendorId = a.VendorId
where
e.DateFlag = 'Y'
) six
limit 1 over (partition by VendorId, ShipBinId order by ReceivedTime)
) seven
Drop "shipbinid" in your partition by clause.
This:
select
seven.*,
case
when FulfilledTime is null and current_date() - DueDate > 0 then 'Late'
when FulfilledTime::date - DueDate > 0 then 'Late'
when FulfilledTime::date - DueDate <= 0 then 'On Time'
else 'In Production' end SLAPerformance
from (
select
six.*
from (
select
a.*,
nth_value(e.InvoicedDate, 4) over
(partition by e.VendorId order by e.invoiceddate asc rows between current row and unbounded following) DueDate
) temp a join networkdays e
on
e.InvoicedDate > a.ReceivedTime and
e.VendorId = a.VendorId
where
e.DateFlag = 'Y'
) six
limit 1 over (partition by VendorId order by ReceivedTime)
) seven
Gives this result:
This:
select
seven.*,
case
when FulfilledTime is null and current_date() - DueDate > 0 then 'Late'
when FulfilledTime::date - DueDate > 0 then 'Late'
when FulfilledTime::date - DueDate <= 0 then 'On Time'
else 'In Production' end SLAPerformance
from (
select
six.*
from (
select
a.*,
nth_value(e.InvoicedDate, 4) over
(partition by e.VendorId, ShipbinId order by e.invoiceddate asc rows between current row and unbounded following) DueDate
) temp a join networkdays e
on
e.InvoicedDate > a.ReceivedTime and
e.VendorId = a.VendorId
where
e.DateFlag = 'Y'
) six
limit 1 over (partition by VendorId order by ReceivedTime)
) seven
Gives this result:
This:
select
seven.*,
case
when FulfilledTime is null and current_date() - DueDate > 0 then 'Late'
when FulfilledTime::date - DueDate > 0 then 'Late'
when FulfilledTime::date - DueDate <= 0 then 'On Time'
else 'In Production' end SLAPerformance
from (
select
six.*
from (
select
a.*,
nth_value(e.InvoicedDate, 4) over
(partition by e.VendorId order by e.invoiceddate asc rows between current row and unbounded following) DueDate
) temp a join networkdays e
on
e.InvoicedDate > a.ReceivedTime and
e.VendorId = a.VendorId
where
e.DateFlag = 'Y'
) six
limit 1 over (partition by VendorId, ShipbinId order by ReceivedTime)
Gives this result:
This:
select
seven.*,
case
when FulfilledTime is null and current_date() - DueDate > 0 then 'Late'
when FulfilledTime::date - DueDate > 0 then 'Late'
when FulfilledTime::date - DueDate <= 0 then 'On Time'
else 'In Production' end SLAPerformance
from (
select
six.*
from (
select
a.*,
nth_value(e.InvoicedDate, 4) over
(partition by e.VendorId, ShipbinId order by e.invoiceddate asc rows between current row and unbounded following) DueDate
) temp a join networkdays e
on
e.InvoicedDate > a.ReceivedTime and
e.VendorId = a.VendorId
where
e.DateFlag = 'Y'
) six
limit 1 over (partition by VendorId, ShipbinId order by ReceivedTime)
Gives this result:
So only when Shipbinid is present in both partitions does it give the correct result for one of the lines of data. Otherwise it either loses one line of data or they are both wrong.
ShipBinId doesn't make sense here. Vendors work specific days, they don't work specific days with regard to shipBinId. Let's remove it. It's just complicating this issue. I'm also wondering why you're doing a limit 1. Why are you getting multiple rows such that you're applying a limit here?
So trying this:
select
seven.*,
case
when FulfilledTime is null and current_date() - DueDate > 0 then 'Late'
when FulfilledTime::date - DueDate > 0 then 'Late'
when FulfilledTime::date - DueDate <= 0 then 'On Time'
else 'In Production' end SLAPerformance
from (
select
a.*,
nth_value(e.InvoicedDate, 4) over
(partition by e.VendorId order by e.invoiceddate asc rows between current row and unbounded following) DueDate
) temp a join networkdays e
on
e.InvoicedDate > a.ReceivedTime and
e.VendorId = a.VendorId
where
e.DateFlag = 'Y'
) seven
Yields 2,520 rows of output for the 2 shipbins. I had added the limit 1 per your recommendation in the other question I had asked:
https://forum.vertica.com/discussion/241220/how-do-i-find-the-3rd-occurrence-in-a-column-and-get-the-corresponding-data#latest
What is a shipbin? And how does it relate to a vendor? I think that had the limit because we were partitioning by orderId as well as Vendor. Though it's not clear to me whether that makes sense or not. I don't understand this data well enough.
This solution might work, too. nth_value just makes it a little simpler.
SELECT OrderNum, InvoicedDate, Vendor, DueDate
FROM (SELECT a.OrderNum, a.InvoicedDate, a.Vendor, b.Date DueDate
FROM test a
JOIN networkdays b ON b.Date > a.InvoicedDate
WHERE b.DateFlag = 'Y'
LIMIT 3 OVER (PARTITION BY b.Vendor, a.OrderNum ORDER BY b.Date)) foo
LIMIT 1 OVER (PARTITION BY Vendor, OrderNum ORDER BY DueDate DESC);
My coworker Jim had put that one together. It limits the internal set to 3 rows (limit 3), and then gets the first one from that set (limit 1).
A shipbin is like an order id, a unique identifier for a customer order. Vendors produce the shipbins and 1 shipbin can only go to 1 vendor.
Trying that method, I still get the same results. 20:14 gives me the correct DueDate of 1/4 and the 19:52 stays at 1/2
So, my question back to you is why? I don't know why. I can't know why. It doesn't make sense for nth_value to return one set of values for some rows and a different value for other rows. It just doesn't work like that. There has to be a reason why it's doing that. You'll need to break up that query into a simpler form, and analyze the set of data that it's looking. Essentially, all nth_value is doing is return a specific row within a set of rows in a 1:m relationship. For example, I'm joining the parents to all the kids, and I'm ordering the kids in birth order, and I want the name of the 3rd kid. If there is no 3rd kid, it's going to return the 2nd kid. nth_value is an extension of first_value and last_value (which return the two extremes of the window). There has to be something fundamentally different between the records for 20:14 and for 19:52. I don't have that data, so it's not possible for me to see why on my side.