There are some tables in Vertica where primary key, unique key, or composite key is not defined and duplicates are possible. How to identify each record uniquely to push the data to the Kafka pipeline
Vertica, in this respect, is a real relational database: If there is no declared identifier, there is no identifier. And it is columnar.
Row based DBMSs have a way to physically identify the location of a certain row. That's where, for example, the ROWID of Oracle comes from.
Vertica being columnar, you can not even locate the actual position of the first column in the sort order:
If, for example, you have gender as the first column in the sort order, and the column is encoded as Run-Length-Encoding (RLE), then, you have, in that file that contains the column, for example, the value 'F', an integer of 502, the value 'M' and an integer of 498 for a table containing 1000 rows.
You could calculate the hash of all columns (with a small risk of hash collisions), but if you have two rows like this:
42 | Arthur Dent | 2022-01-25
42 | Arthur Dent | 2022-01-25
There is now way of discerning one row from the other.
Even if you apply a ROW_NUMBER() OVER(PARTITION BY <all_columns_of_the_table> ) - which would lead one of the above rows to get a 1 and the other to get a 2- there will be no way of determining which was assigned to which row.
These are the two, not completely satisfactory, ways of working around this behaviour. Which is not a problem, but a behavioural feature.
Is there any way to get the number of distinct values in a column without running a select query. I want to fetch the details using jdbc connection and java program
I don't think there is. But if you run this type of query often, then create a projection with the column in question as the first in the sort order; encode this column RLE (run-length encoding) in that projection. Then, a SELECT DISTINCT foo FROM bar; will run as fast as a query of previously accumulated histograms.
Answers
Vertica, in this respect, is a real relational database: If there is no declared identifier, there is no identifier. And it is columnar.
Row based DBMSs have a way to physically identify the location of a certain row. That's where, for example, the ROWID of Oracle comes from.
Vertica being columnar, you can not even locate the actual position of the first column in the sort order:
If, for example, you have
gender
as the first column in the sort order, and the column is encoded as Run-Length-Encoding (RLE), then, you have, in that file that contains the column, for example, the value 'F', an integer of 502, the value 'M' and an integer of 498 for a table containing 1000 rows.You could calculate the hash of all columns (with a small risk of hash collisions), but if you have two rows like this:
There is now way of discerning one row from the other.
Even if you apply a
ROW_NUMBER() OVER(PARTITION BY <all_columns_of_the_table> )
- which would lead one of the above rows to get a 1 and the other to get a 2- there will be no way of determining which was assigned to which row.These are the two, not completely satisfactory, ways of working around this behaviour. Which is not a problem, but a behavioural feature.
The very easy way if you can't fix your model, is to perform a SELECT DISTINCT, but you'll pay the fee each time you run the query.
The better thing to do is to recreate your table with a PK / UK and insert select distinct from your current table, you pay only once.
I don't think there is. But if you run this type of query often, then create a projection with the column in question as the first in the sort order; encode this column RLE (run-length encoding) in that projection. Then, a
SELECT DISTINCT foo FROM bar;
will run as fast as a query of previously accumulated histograms.