computing percentiles from binned data
Greetings Vertica experts,
My tables have binned data..
Ex:
Bin_Name | Bin_Count
"56" | 20000
"57" | 23344
Is there inbuilt support to compute percentiles from this? I am looking for something analogous to numpy support around this.. https://stackoverflow.com/questions/22638557/using-numpy-percentile-on-binned-data
GIven the volume of the data, I would prefer these computations be handled by Vertica as opposed to client side processing in python.
Thanks!
Analytics Noob
0
Best Answer
-
Jim_Knicely - Select Field - Administrator
Hi,
Using the logic provided here:
https://stats.stackexchange.com/questions/65710/derive-percentiles-from-binned-data
Maybe this could help?
dbadmin=> select * from binned_data; bin_name | bin_count ----------+----------- 1 | 2 2 | 4 3 | 2 4 | 3 (4 rows) dbadmin=> select bin_name as percentile_30th from (select bin_name, bin_count, sum(bin_count) over(order by bin_name rows between unbounded preceding and current row) bin_count_rolling, sum(bin_count) over() bin_count_total from binned_data) foo where 0.3*bin_count_total <= bin_count_rolling order by bin_name limit 1; percentile_30th ----------------- 2 (1 row)
Your data:
dbadmin=> select * from binned_data; bin_name | bin_count ----------+----------- 56 | 20000 57 | 23344 (2 rows) dbadmin=> select bin_name as percentile_30th from (select bin_name, bin_count, sum(bin_count) over(order by bin_name rows between unbounded preceding and current row) bin_count_rolling, sum(bin_count) over() bin_count_total from binned_data) foo where 0.3*bin_count_total <= bin_count_rolling order by bin_name limit 1; percentile_30th ----------------- 56 (1 row)
1
Answers
Hi,
Vertica has the PERCENTILE_CONT and PERCENTILE_DISC Analytic functions.
See:
https://my.vertica.com/docs/9.0.x/HTML/index.htm#Authoring/SQLReferenceManual/Functions/Analytic/PERCENTILE_CONTAnalytic.htm
https://my.vertica.com/docs/9.0.x/HTML/index.htm#Authoring/SQLReferenceManual/Functions/Analytic/PERCENTILE_DISCAnalytic.htm
Thanks for the pointers. The documentations seem to be for unbinned data..I dont get expected results for binned data. I verified by having a running total of the bin counts and manually comparing the results from the query to hand calculations..
That seems to work! I am going to do some more validation with more data but it passed the initial checks! Thanks a bunch & Happy Holidays!