In that case I would counsel against it. Vertica's strength is that it is a relational database platform built for huge amounts of data.
In a relational database, what you do is filter, group by, join on, and take the sum, the average, or the standard deviation of columns of tables. You don't do any of that with a PDF file, a DOC file, let alone an executable. Storing the files themselves would only increase your license size dramatically, for data you merely store and retrieve, and then process in a front-end application. Any metadata about the files, yes. The file data itself I would store outside of Vertica. My two Swiss Francs ...
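For illustration, a minimal sketch of what such a metadata table could look like (all names here are mine, not from your use case); the file itself lives outside of Vertica, and the table carries only the attributes you can actually filter, group and join on:

CREATE TABLE file_meta (
  file_id     INT NOT NULL
, file_name   VARCHAR(256)
, mime_type   VARCHAR(64)
, size_bytes  INT
, uploaded_at TIMESTAMP
, storage_url VARCHAR(1024) -- pointer to the file on a file server or object store
, CONSTRAINT pk_file_meta PRIMARY KEY(file_id)
);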
In addition, if you want to strip that character from all values in f1, consider something like the following - chr(174) is the '®' sign:
UPDATE public.t1 SET f1=REPLACE(f1,chr(174),'');
SELECT * FROM t1;
     f1      |     f2      |     f3
-------------+-------------+-------------
 solid works | solid works | solid works
 solidworks  | solidworks  | solidworks
Try:
/opt/vertica/bin/scrutinize -d OSSA -U dbadmin -P dbadminpass
Note: to see the options available, run:
/opt/vertica/bin/scrutinize -h
P.S. You are running Vertica 8.1.1? Time to upgrade!
The least tiresome way I can think of is my SQL-based profiler on GitHub: https://github.com/marco-the-sane/sqlprofile . It fills a sort of control table with the schema and table names of the tables to be profiled, then runs SQL-generating SQL for a two-pass, full-table profiling exercise, coming up with a CREATE TABLE statement using type-wise optimal data types ...
I would not leave it at that, @joergschaber .
To illustrate the question: an 'à' uses 2 bytes in a Vertica UTF-8 string. An 'a' uses 1 byte. The euro sign, '€', uses 3 bytes. Do you now create a VARCHAR(12) for a string of at most 4 characters? 'Maja' uses 4 bytes. 'Künz' uses 5 bytes.
But '€€€€' does use 12 bytes. How likely is it that you find that string in your table?
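You can check the byte counts yourself; a quick sketch (LENGTH() counts characters, OCTET_LENGTH() counts bytes):

SELECT LENGTH('Künz') AS chars, OCTET_LENGTH('Künz') AS bytes;
-- out  chars | bytes
-- out -------+-------
-- out      4 |     5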
Over-sizing does not hurt storage: Vertica encodes and compresses column data, and stores only the bytes actually used.
But if you ever sort, hash-group-by or hash-join on an over-sized column, you will, in the worst case, have to allocate 4 times as much memory as actually needed for the hash table, and the query will spill to disk.
The clean way to go about this is to over-size by a factor of 4, as @SergeB suggested. But then, run a SELECT MAX(OCTET_LENGTH(<the_big_varchar_column>)) to find out how many bytes you actually need, and re-size the column with an ALTER TABLE statement accordingly.
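A minimal sketch of that workflow, with a made-up table public.t1 and column f1, and assuming your Vertica version lets you reduce a VARCHAR's width when all stored values fit:

-- find the longest value actually stored, in bytes
SELECT MAX(OCTET_LENGTH(f1)) FROM public.t1;
-- out  max
-- out -----
-- out   25
-- re-size the column to what you actually need (plus headroom if you like)
ALTER TABLE public.t1 ALTER COLUMN f1 SET DATA TYPE VARCHAR(25);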
It seems that HANA's NVARCHAR(X) means up to X characters, each character being up to 4 bytes long (UTF-32).
In Vertica, VARCHAR(Y) means up to Y bytes.
So Y = X * 4 would be a safe number when converting HANA's NVARCHAR to Vertica's VARCHAR.
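For example, with a hypothetical column, just to show the arithmetic:

-- HANA:    customer_name NVARCHAR(100) -- up to 100 characters
-- Vertica: customer_name VARCHAR(400)  -- 100 characters * 4 bytes covers the worst case
CREATE TABLE customer (
  customer_name VARCHAR(400)
);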
There is a brief global lock while Vertica prepares a snapshot of the database. That snapshot is then what copycluster copies.
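For reference, copycluster is driven by vbr; the call looks like the line below (the config file name is mine, and its contents - node mapping and so on - depend on your cluster):

/opt/vertica/bin/vbr --task copycluster --config-file copycluster.ini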
I suppose you created the table as I do below, or in a similar way - that is, specifying the primary key at CREATE time.
The default projection that comes out of that is sorted by the primary key and segmented by the hash of the primary key:
CREATE TABLE foo (
  foo_id   INT NOT NULL
, foo_name VARCHAR(32)
, CONSTRAINT pk_foo PRIMARY KEY(foo_id)
);
-- Insert a row to force the creation of a projection
INSERT INTO foo VALUES(1,'Arthur'); COMMIT;
-- let Vertica show what it created:
SELECT EXPORT_OBJECTS('','foo');
-- out CREATE TABLE dbadmin.foo
-- out (
-- out foo_id int NOT NULL,
-- out foo_name varchar(32),
-- out CONSTRAINT pk_foo PRIMARY KEY (foo_id) DISABLED
-- out );
-- out CREATE PROJECTION dbadmin.foo_super /*+basename(foo),createtype(L)*/
-- out (
-- out foo_id,
-- out foo_name
-- out )
-- out AS
-- out SELECT foo.foo_id,
-- out foo.foo_name
-- out FROM dbadmin.foo
-- out ORDER BY foo.foo_id
-- out SEGMENTED BY hash(foo.foo_id) ALL NODES OFFSET 0;
So the projection you seem to intend to build would be a complete waste of time and resources...
FYI, I have a client hitting this issue. I opened Jira VER-72952.
See my answer here: https://forum.vertica.com/discussion/comment/245075/#Comment_245075