Hm... took a look on tablefunc... its not so hard to implement it with UDF. Just for fun I implemented an easiest function - normal_rand(numrows, mean, stddev)
Example
dev=> select * from user_functions where function_name ilike '%normal%'; -[ RECORD 1 ]----------+----------------------------------------------------- schema_name | public function_name | normal_rand procedure_type | User Defined Transform function_return_type | Float function_argument_type | function_definition | Class 'NormRandFactory' in Library 'public.Gaussian' volatility | is_strict | f is_fenced | t comment |
I can share a code if you are interesting in this function. May be (also just for fun) I will implement others functions too. Can you tell me witch function is more important for you? And I will start from this function.
As I got to say, it would be important for people like me who uses the database as a data source for simulations and statistical models of machine learning, have a function that creates horizontal metrics that were calculated, I thank you and I appreciate your help, I hope also that hp-vertica may notice your work and adopt it as soon as helping you develop it.
cardinality of PIVOT'ed column (its dynamic and mutable and its a problem)
The only way I can implement it - via external procedures or UDF. Unfortunately UDF is out of scope:
Vertica will try to parallelize UDF
I need to know a cardinality of pivoted data. I need it to define a pivoted table columns. Of cause I can use in ODBC/JDBC and create a Flex table and so columns will be dynamic, but Vertica will parallelize it and you will get a garbage.
CREATE LIBRARY Gaussian AS '/tmp/Gaussian.so'; CREATE TRANSFORM FUNCTION normal_rand AS LANGUAGE 'C++' NAME 'NormalDistributionFactory' LIBRARY Gaussian;
Q: How can you query a database from UDF for pivot? A: Only via ODBC or JDBC connection. So I have to create a connection to fetch data for pivot. Vertica isn't released a native connector and UDF doesn't support "data fetch" from database.
Q: And is it a problem? A: Yes, Vertica parallelizes UDF and so many connections are opened. And Vertica raises exception, because each connection tries to create pivot table and to insert a pivot data to it. First tread succeeded but others throws an exception. I can't use in "OVER (PARTITION AUTO)" because i need "OVER ()" to define a pivot column.
Q: Does Oracle approach can be implemented with UDF? A: Hm... interesting. I will try. Its more suitable, since I don't need to know a cardinality. Cardinality defined in query. Nice.
Q: But you still need to fetch data from database. How you will limit a query to a single execution? A:
I can insert data to temporary table(but Im not sure it will solve a problem with "many connections")
I can define "CREATE TABLE IF NOT EXISTS", so a second thread will trow an warning only.
If these method will fail so I have a last option to limit a query to a single thread execution. I can define witch projections to use
I have no more options. May be someone will suggest a something?
Q: And what if a projection is segmented? A: So far I can't answer, I need to investigate it, but Im afraid that without a help from Vertica Support I can't implement it with UDF if table is segmented.
May be Vertica Support will provide a hint that limits a query to a single thread?
PS Hm... and what about CPU Affinity? I will try, looks like "CREATE TABLE IF NOT EXISTS" solves a problem with exceptions. I will update.
Yes, it is possible with UDF Multi-Phase Transform function, but I compared a performance with DECODE function - performance is under any critics. So far my code have some assumptions(read bugs), because I simplified a problem, just wanted to understand - can I do it?
***************************
When I will fix a bugs I will publish it. Feel free to remind.
***************************
But I recommend you to compare a performance of some UDF function against a built-in function. For example, Vertica provides an UDF Aggregate function - AVG:
Comments
http://vertica-forums.com/viewtopic.php?f=48&t=849&p=2648
http://www.postgresql.org/docs/9.1/static/tablefunc.html
You can try to open a feature request. Post under topic IDEAS
Regards
Example Bench 1MIL rows: 2,5MIL rows: 5MIL rows: Nice linear scalability I can share a code if you are interesting in this function. May be (also just for fun) I will implement others functions too. Can you tell me witch function is more important for you? And I will start from this function.
best
crosstab !
see you soon :- )
I will try, but syntax will be differ from PG implementation.
I will update you in any case - success or failed.
Regards.
I failed. I will explain where I failed.
Implementation of PIVOT require:
- table data
- cardinality of PIVOT'ed column (its dynamic and mutable and its a problem)
The only way I can implement it - via external procedures or UDF.Unfortunately UDF is out of scope:
- Vertica will try to parallelize UDF
- I need to know a cardinality of pivoted data. I need it to define a pivoted table columns. Of cause I can use in ODBC/JDBC and create a Flex table and so columns will be dynamic, but Vertica will parallelize it and you will get a garbage.
How I see it with EP? where- src_table - an original data for pivoting
- pivot_column - pivoted column
- dest_table - EP will insert results to this table
Or suggest your syntax(but take in mind - I must know how many columns will be in a new table and so I have to query a pivoted column for cardinality)@massimo
Will you accept solution with EP?
PS
The main problem - is cardinality of pivoted column. I can limit query execution on one node only, but I can't create table with dynamic columns.
http://pastebin.com/gDPcQTE5
Compile: Deploy: Regards.
do you think this could be a possible solution (from oracle 11g) ?
select * from( select deptno, job, sal from emp ) e
pivot( sum (sal ) for job in ( 'CLERK', 'SALESMAN', 'MANAGER', 'ANALYST', 'PRESIDENT' ) )
http://www.oracle.com/technetwork/issue-archive/2008/08-mar/o28asktom-087592.htmlCiao
Q: How can you query a database from UDF for pivot?
A: Only via ODBC or JDBC connection. So I have to create a connection to fetch data for pivot. Vertica isn't released a native connector and UDF doesn't support "data fetch" from database.
Q: And is it a problem?
A: Yes, Vertica parallelizes UDF and so many connections are opened. And Vertica raises exception, because each connection tries to create pivot table and to insert a pivot data to it. First tread succeeded but others throws an exception. I can't use in "OVER (PARTITION AUTO)" because i need "OVER ()" to define a pivot column.
Q: Does Oracle approach can be implemented with UDF?
A: Hm... interesting. I will try. Its more suitable, since I don't need to know a cardinality. Cardinality defined in query. Nice.
Q: But you still need to fetch data from database. How you will limit a query to a single execution?
A:
- I can insert data to temporary table(but Im not sure it will solve a problem with "many connections")
- I can define "CREATE TABLE IF NOT EXISTS", so a second thread will trow an warning only.
- If these method will fail so I have a last option to limit a query to a single thread execution. I can define witch projections to use
Example: I have no more options. May be someone will suggest a something?Q: And what if a projection is segmented?
A: So far I can't answer, I need to investigate it, but Im afraid that without a help from Vertica Support I can't implement it with UDF if table is segmented.
May be Vertica Support will provide a hint that limits a query to a single thread?
PS
Hm... and what about CPU Affinity? I will try, looks like "CREATE TABLE IF NOT EXISTS" solves a problem with exceptions.
I will update.
Regards.
Yes, it is possible with UDF Multi-Phase Transform function, but I compared a performance with DECODE function - performance is under any critics. So far my code have some assumptions(read bugs), because I simplified a problem, just wanted to understand - can I do it?
***************************
When I will fix a bugs I will publish it. Feel free to remind.
***************************
But I recommend you to compare a performance of some UDF function against a built-in function. For example, Vertica provides an UDF Aggregate function - AVG: and take in mind that Transform Multi-Phase will be x100(at least) slower than doing it with DECODE.
Regards.