We're Moving!

The Vertica Forum is moving to a new OpenText Analytics Database (Vertica) Community.

Join us there to post discussion topics, learn about

product releases, share tips, access the blog, and much more.

Create My New Community Account Now


computing percentiles from binned data — Vertica Forum

computing percentiles from binned data

Greetings Vertica experts,
My tables have binned data..
Ex:
Bin_Name | Bin_Count
"56" | 20000
"57" | 23344

Is there inbuilt support to compute percentiles from this? I am looking for something analogous to numpy support around this.. https://stackoverflow.com/questions/22638557/using-numpy-percentile-on-binned-data
GIven the volume of the data, I would prefer these computations be handled by Vertica as opposed to client side processing in python.

Thanks!

Analytics Noob

Best Answer

  • Jim_KnicelyJim_Knicely - Select Field - Administrator
    Answer ✓

    Hi,

    Using the logic provided here:

    https://stats.stackexchange.com/questions/65710/derive-percentiles-from-binned-data

    Maybe this could help?

    dbadmin=> select * from binned_data;
     bin_name | bin_count
    ----------+-----------
            1 |         2
            2 |         4
            3 |         2
            4 |         3
    (4 rows)
    
    dbadmin=> select bin_name as percentile_30th from (select bin_name, bin_count, sum(bin_count) over(order by bin_name rows between unbounded preceding and current row) bin_count_rolling, sum(bin_count) over() bin_count_total from binned_data) foo where 0.3*bin_count_total <= bin_count_rolling order by bin_name limit 1;
     percentile_30th
    -----------------
                   2
    (1 row)
    

    Your data:

    dbadmin=> select * from binned_data;
     bin_name | bin_count
    ----------+-----------
           56 |     20000
           57 |     23344
    (2 rows)
    
    dbadmin=> select bin_name as percentile_30th from (select bin_name, bin_count, sum(bin_count) over(order by bin_name rows between unbounded preceding and current row) bin_count_rolling, sum(bin_count) over() bin_count_total from binned_data) foo where 0.3*bin_count_total <= bin_count_rolling order by bin_name limit 1;
     percentile_30th
    -----------------
                  56
    (1 row)
    

Answers

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file