How to display big FLOAT numbers correctly ?

This has been driving me mad today and I can't figure out what's happening. I was looking through the machine learning sample data and there are a couple of rows in the baseball data (id's 89 & 73) which have huge salary values which I can’t get to display properly in Dbeaver or vsql. You can easily replicate it like this.

create table sample_table(bignumber float);
insert into sample_table values (99999999934147135.84);
select bignumber from sample_table;

In Dbeaver you get 99999999934147104
In vsql you get 9.99999999341471e+16

Or you could just do
select 99999999934147135.84::float;

The output is the wrong value. If you cast into a NUMERIC it returns 99999999934147136 but I can’t get it to stop rounding

select bignumber::numeric from sample_table;

The number is being stored properly because if you put a predicate it works fine

select bignumber from sample_table where bignumber=99999999934147135.84;

So, it looks like some kind of display/formatting issue that I’ve no idea how to fix. Any idea what’s going on and how to get the correct number displayed ?

This is especially dangerous when you export to csv to import somewhere else as the number is wrong in the output.

It also makes the machine learning examples wrong since this will return 444444444448280000 as the MAX salary value when in fact it should be 444444444448279921.75 (id 89 in the data)

SELECT summarize_numcoL(hits, salary) OVER() FROM baseball WHERE dob > '1975-7-1'::DATE;

Thanks
Dave

Best Answer

  • dave_wooddave_wood
    Answer ✓

    Thanks for the reply.

    Indeed you're right, the number isn't stored correctly. I didn't think of trying to query with other values.
    I didn't actually declare the values as FLOAT, I never use it. The person who uploaded the ML examples to GitHub did, if you look in load_ml_data.sql you'll see everything declared as FLOAT.
    If you look at the SQL I gave earlier where I made the value NUMERIC you'll see that the summarize_numcol function gets every statistic wrong for salary so maybe that should be reviewed.

    Sure 99999999934147135.84 is a very big number but it's within the realms of science

    Out of curiosity, I tried it with C and it's wrong in the same way (the include statement needs a hash in front of it, the line is formatted strangely here)

    $ more bignumber.c

    include <stdio.h>

    int main() {
    double bignumber;
    bignumber = 99999999934147135.84;
    printf("bignumber = %.2lf", bignumber);
    return 0;
    }

    $ gcc -o bignumber bignumber.c
    $ ./bignumber
    bignumber = 99999999934147136.00

    So this is just something to remember I guess.

    Regards
    Dave

Answers

  • badroualibadrouali Vertica Employee

    hi Dave,
    Try to store it as a Numeric(20, 2).
    Floats are not really precise (8 Precision). See the types precision here:
    https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/DataTypes/SQLDataTypes.htm
    When you will retrieve it using Python for example, you can get the exact number if you're using Decimals or if you cast it to varchar.

  • Thanks for your reply.

    Accuracy is not the problem here. The value is stored correctly, if it wasn’t the query below using the predicate wouldn’t work. Create my sample table then do this, you’ll see that what you get back is not what you asked for.

    select bignumber from sample_table where bignumber=99999999934147135.84;

    If you cast into NUMERIC(20,2) it will round the value to 99999999934147136.00 and I see no way of influencing it

    select bignumber::NUMERIC(20,2) from sample_table where bignumber=99999999934147135.84;

    I had already thought about declaring the value columns as NUMERIC and had edited the machine learning example tables to change all FLOAT to NUMERIC.

    However, this just pushes the problem down the road since the statistical/machine learning functions use FLOAT internally and display the wrong amounts. See the documentation for SUMMARIZE_NUMCOL “All summary values are FLOAT data types, except INTEGER for count.”

    https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/MachineLearning/SUMMARIZE_NUMCOL.htm

    The following is run on the sample baseball table where it returns a MAX of 444444444448280000 and it should be 444444444448279921.75 (id 89 in the data see below)

    https://github.com/vertica/Machine-Learning-Examples

    SELECT summarize_numcoL(hits, salary) OVER() FROM baseball;

    The function will return 444444444448280000 regardless of whether salary is declared as FLOAT or NUMERIC.

    89,Jacqueline,Richards,10/6/1975,Pink,273333,4490260,0.324,444444444448279921.75

    It’s some kind of display issue.

    Regards
    Dave

  • badroualibadrouali Vertica Employee

    That's why in ML in general, we normalize the data. Computation errors are higher when dealing with Big Floats...
    I recommend in that case data normalization using MinMax.
    Which Vertica version do you have? I think in the new ones, ML functions support numeric data types.

  • I have Vertica Analytic Database v10.0.0-0.
    Do the following and you'll see that the function returns the wrong MAX value
    CREATE TABLE baseball2
    (
    id int,
    first_name varchar(50),
    last_name varchar(50),
    dob date,
    team varchar(20),
    hr int,
    hits int,
    avg numeric,
    salary numeric
    );
    insert into baseball2 values (89,'Jacqueline','Richards','1975-10-06','Pink',273333,4490260,0.324,99999999934147135.84)

    SELECT summarize_numcoL(hits, salary) OVER() FROM baseball2;

  • LenoyJLenoyJ - Select Field - Employee
    edited May 2021

    Just to reiterate what @badrouali was saying earlier - store it as NUMERIC(20,2) and not FLOAT. FLOAT as per the docs:

    Double precision is an inexact, variable-precision numeric type. In other words, some values cannot be represented exactly and are stored as approximations. Thus, input and output operations involving double precision might show slight discrepancies.

    and

    Comparing two floating-point values for equality might not work as expected.

    So when you say "The value is stored correctly", they actually aren't - they're stored as approximations. Using your example:

    dbadmin=> create table sample_table(bignumber float);
    CREATE TABLE
    
    dbadmin=> insert into sample_table values (99999999934147135.84);
     OUTPUT
    --------
          1
    (1 row)
    
    dbadmin=> select bignumber from sample_table;
          bignumber
    ----------------------
     9.99999999341471e+16
    (1 row)
    

    You can see that all the following match the predicate:

    dbadmin=> select bignumber from sample_table where bignumber=99999999934147135.84;
          bignumber
    ----------------------
     9.99999999341471e+16
    (1 row)
    
    dbadmin=> select bignumber from sample_table where bignumber=99999999934147136;
          bignumber
    ----------------------
     9.99999999341471e+16
    (1 row)
    
    dbadmin=> select bignumber from sample_table where bignumber=99999999934147137;
          bignumber
    ----------------------
     9.99999999341471e+16
    (1 row)
    

    So if you store it (not cast) as NUMERIC(20,2) instead:

    dbadmin=> drop table sample_table;
    DROP TABLE
    
    dbadmin=> create table sample_table(bignumber numeric(20,2));
    CREATE TABLE
    
    dbadmin=> insert into sample_table values (99999999934147135.84);
     OUTPUT
    --------
          1
    (1 row)
    
    dbadmin=> select bignumber from sample_table;
          bignumber
    ----------------------
     99999999934147135.84
    (1 row)
    

    Now, it's stored as an exact value. Trying our predicates from before, you see that only one matches:

    dbadmin=> select bignumber from sample_table where bignumber=99999999934147135.84;
          bignumber
    ----------------------
     99999999934147135.84
    (1 row)
    
    dbadmin=> select bignumber from sample_table where bignumber=99999999934147136;
     bignumber
    -----------
    (0 rows)
    
    dbadmin=> select bignumber from sample_table where bignumber=99999999934147137;
     bignumber
    -----------
    (0 rows)
    

    And for MAX:

    dbadmin=> select max(bignumber) from sample_table;
             max
    ----------------------
     99999999934147135.84
    (1 row)
    

     
    RE: DBeaver. All the above was on VSQL. I'm sure DBeaver as a client application has its own way of formatting/rounding numeric types. A Google search reveals things that can be changed in DBeaver settings.
     
    RE: ML functions using/casting to FLOAT. It looks like that is by design and noted as such in the docs.

    All machine learning functions automatically cast NUMERIC arguments to FLOAT.

    Maybe someone else can comment on why that is so. My guess is that such highly precise values do not add much to a model (hence we think of normalizing), and it's probably way faster to cast to float than to deal with the numeric type - but that's just a guess and I may be wrong.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file