How to display big FLOAT numbers correctly?
This has been driving me mad today and I can't figure out what's happening. I was looking through the machine learning sample data and there are a couple of rows in the baseball data (ids 89 & 73) that have huge salary values I can't get to display properly in DBeaver or vsql. You can easily replicate it like this:
create table sample_table(bignumber float);
insert into sample_table values (99999999934147135.84);
select bignumber from sample_table;
In DBeaver you get 99999999934147104
In vsql you get 9.99999999341471e+16
Or you could just do
select 99999999934147135.84::float;
The output is the wrong value. If you cast it to NUMERIC it returns 99999999934147136, but I can't get it to stop rounding:
select bignumber::numeric from sample_table;
The number is being stored properly, because if you add a predicate it works fine:
select bignumber from sample_table where bignumber=99999999934147135.84;
So it looks like some kind of display/formatting issue that I've no idea how to fix. Any idea what's going on and how to get the correct number displayed?
This is especially dangerous when you export to CSV to import somewhere else, as the number is wrong in the output.
It also makes the machine learning examples wrong, since this will return 444444444448280000 as the MAX salary value when in fact it should be 444444444448279921.75 (id 89 in the data):
SELECT SUMMARIZE_NUMCOL(hits, salary) OVER() FROM baseball WHERE dob > '1975-7-1'::DATE;
Thanks
Dave
Best Answer
Thanks for the reply.
Indeed you're right, the number isn't stored correctly. I didn't think of trying to query with other values.
I didn't actually declare the values as FLOAT; I never use it. The person who uploaded the ML examples to GitHub did: if you look in load_ml_data.sql you'll see everything declared as FLOAT.
If you look at the SQL I gave earlier where I made the value NUMERIC, you'll see that the SUMMARIZE_NUMCOL function gets every statistic wrong for salary, so maybe that should be reviewed. Sure, 99999999934147135.84 is a very big number, but it's within the realms of science.
Out of curiosity, I tried it in C and it's wrong in the same way:
$ more bignumber.c
#include <stdio.h>

int main() {
    double bignumber;
    /* consecutive doubles near 1e17 are 16 apart, so the .84 can't be kept */
    bignumber = 99999999934147135.84;
    printf("bignumber = %.2lf\n", bignumber);
    return 0;
}
$ gcc -o bignumber bignumber.c
$ ./bignumber
bignumber = 99999999934147136.00
So this is just something to remember, I guess.
Regards
Dave
Answers
Hi Dave,
Try to store it as a NUMERIC(20,2).
Floats are not exact; they are 8-byte approximate values, good for about 15 significant decimal digits. See the type precisions here:
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/DataTypes/SQLDataTypes.htm
When you retrieve it later, with Python for example, you can get the exact number if you use Decimal or cast the value to varchar.
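For instance, something like this shows the stored value as text (a sketch against the sample_table above; the exact string format may vary by version and client):
select bignumber::varchar from sample_table;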
Thanks for your reply.
Accuracy is not the problem here. The value is stored correctly; if it weren't, the query below using the predicate wouldn't work. Create my sample table, then run this, and you'll see that what you get back is not what you asked for:
select bignumber from sample_table where bignumber=99999999934147135.84;
If you cast to NUMERIC(20,2) it will round the value to 99999999934147136.00, and I see no way of influencing that:
select bignumber::NUMERIC(20,2) from sample_table where bignumber=99999999934147135.84;
I had already thought about declaring the value columns as NUMERIC and had edited the machine learning example tables to change all FLOAT to NUMERIC.
However, this just pushes the problem down the road, since the statistical/machine learning functions use FLOAT internally and display the wrong amounts. See the documentation for SUMMARIZE_NUMCOL: “All summary values are FLOAT data types, except INTEGER for count.”
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/MachineLearning/SUMMARIZE_NUMCOL.htm
The following is run on the sample baseball table, where it returns a MAX of 444444444448280000 when it should be 444444444448279921.75 (id 89 in the data, see below):
https://github.com/vertica/Machine-Learning-Examples
SELECT SUMMARIZE_NUMCOL(hits, salary) OVER() FROM baseball;
The function will return 444444444448280000 regardless of whether salary is declared as FLOAT or NUMERIC.
89,Jacqueline,Richards,10/6/1975,Pink,273333,4490260,0.324,444444444448279921.75
It’s some kind of display issue.
Regards
Dave
That's why, in ML in general, we normalize the data: computation errors get larger when dealing with big floats.
In that case I recommend min-max normalization, as sketched below.
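For illustration, min-max scaling can be written directly with window functions (a generic SQL sketch; Vertica also ships ML normalization functions, so check the docs for your version):

select id,
       (salary - min(salary) over ())
       / (max(salary) over () - min(salary) over ()) as salary_minmax
from baseball;
-- scales salary into [0,1], so later float math stays well inside
-- the range where doubles are precise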
Which Vertica version do you have? I think in newer versions the ML functions support numeric data types.
I have Vertica Analytic Database v10.0.0-0.
Do the following and you'll see that the function returns the wrong MAX value:
CREATE TABLE baseball2
(
id int,
first_name varchar(50),
last_name varchar(50),
dob date,
team varchar(20),
hr int,
hits int,
avg numeric,
salary numeric
);
insert into baseball2 values (89,'Jacqueline','Richards','1975-10-06','Pink',273333,4490260,0.324,99999999934147135.84);
SELECT SUMMARIZE_NUMCOL(hits, salary) OVER() FROM baseball2;
Just to reiterate what @badrouali was saying earlier: store it as NUMERIC(20,2) and not FLOAT. FLOAT, per the docs, is an approximate numeric type (64-bit IEEE 754), good for roughly 15 significant decimal digits.
So when you say "The value is stored correctly", it actually isn't; it's stored as an approximation. Using your example:
You can see that all the following match the predicate:
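For instance, these literals are all different, yet each one rounds to the same 64-bit float (99999999934147136, since consecutive doubles near 1e17 are 16 apart), so every query returns the row:

select bignumber from sample_table where bignumber = 99999999934147135.84; -- matches
select bignumber from sample_table where bignumber = 99999999934147136;    -- matches
select bignumber from sample_table where bignumber = 99999999934147130;    -- matches
select bignumber from sample_table where bignumber = 99999999934147140.99; -- matches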
So if you store it (not cast) as NUMERIC(20,2) instead:
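-- illustrative sketch; the table name is arbitrary
create table sample_table2(bignumber numeric(20,2));
insert into sample_table2 values (99999999934147135.84);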
Now, it's stored as an exact value. Trying our predicates from before, you see that only one matches:
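-- the same illustrative predicates as above
select bignumber from sample_table2 where bignumber = 99999999934147135.84; -- matches
select bignumber from sample_table2 where bignumber = 99999999934147136;    -- no rows
select bignumber from sample_table2 where bignumber = 99999999934147130;    -- no rows
select bignumber from sample_table2 where bignumber = 99999999934147140.99; -- no rows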
And for MAX:
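select max(bignumber) from sample_table2;
-- returns 99999999934147135.84; MAX on a NUMERIC column keeps the
-- NUMERIC type, so nothing is pushed through float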
RE: DBeaver. All the above was in vsql. I'm sure DBeaver, as a client application, has its own way of formatting/rounding numeric types; a Google search reveals settings in DBeaver that can be changed.
RE: ML functions using/casting to FLOAT. It looks like that is by design and noted as such in the docs.
Maybe someone else can comment on why that is so. My guess is that such highly precise values do not add much to a model (hence the idea of normalizing), and it's probably much faster to cast to float than to deal with the numeric type, but that's just a guess and I may be wrong.