Are there SQL commands to pivot data?

Glenn_Sontheime · November 2013

Daniel_Leybovic · November 2013

http://vertica-forums.com/viewtopic.php?f=63&t=474&p=1564
http://vertica-forums.com/viewtopic.php?f=48&t=849&p=2648

Massimo_Loporch · January 2015

I prefer an implementation like the one below if possible:
http://www.postgresql.org/docs/9.1/static/tablefunc.html

pviennea · January 2015

+1

id10t · January 2015

Hi!

You can try opening a feature request. Post it under the IDEAS topic.

Regards

id10t · January 2015

Hm... I took a look at tablefunc. It's not so hard to implement with a UDF. Just for fun I implemented the easiest function: normal_rand(numrows, mean, stddev)

Example

dev=> select * from user_functions where function_name ilike '%normal%';
-[ RECORD 1 ]----------+-----------------------------------------------------
schema_name            | public
function_name          | normal_rand
procedure_type         | User Defined Transform
function_return_type   | Float
function_argument_type | 
function_definition    | Class 'NormRandFactory' in Library 'public.Gaussian'
volatility             | 
is_strict              | f
is_fenced              | t
comment                | 


dev=> select normal_rand(USING PARAMETERS rows=3, mean=3.3, stddev=6.3) over (PARTITION AUTO);
    RAND_VALUE     
-------------------
  2.53161555990794
 -3.54695367884655
   7.6110266459183
(3 rows)

Bench

dev=> \! lscpu | grep -P '(^CPU.s.)|(MHz)'
CPU(s):                8
CPU MHz:               1200.000
dev=> \o /dev/null
dev=> \timing 
Timing is on.

1 million rows:

dev=> select normal_rand(USING PARAMETERS rows=1000000, mean=3.3, stddev=6.3) over (PARTITION AUTO);
Time: First fetch (1000 rows): 72.207 ms. All rows formatted: 1405.582 ms

2.5 million rows:

dev=> select normal_rand(USING PARAMETERS rows=2500000, mean=3.3, stddev=6.3) over (PARTITION AUTO);
Time: First fetch (1000 rows): 68.825 ms. All rows formatted: 3521.669 ms

5 million rows:

dev=> select normal_rand(USING PARAMETERS rows=5000000, mean=3.3, stddev=6.3) over (PARTITION AUTO);
Time: First fetch (1000 rows): 72.705 ms. All rows formatted: 7135.159 ms

Nice linear scalability

+=========+========+
| ROWS    | TIME(s)|
+=========+========+
| 1000000 | 1.4    |
| 2500000 | 3.5    |
| 5000000 | 7.1    |
+---------+--------+

I can share the code if you are interested in this function. Maybe (also just for fun) I will implement the other functions too. Can you tell me which function is most important to you? I will start with that one.

best
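For readers who only want to see what normal_rand(numrows, mean, stddev) computes, here is a minimal Python sketch. It is not the C++ UDF from this thread: the function name and the rows/mean/stddev parameters mirror the post, while the seed argument is a hypothetical addition so that runs are reproducible.

```python
import random


def normal_rand(rows, mean, stddev, seed=None):
    """Return `rows` samples drawn from a normal distribution N(mean, stddev**2).

    `seed` is a hypothetical extra parameter (not part of the UDF in the
    thread); passing the same seed yields the same sequence.
    """
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.gauss(mean, stddev) for _ in range(rows)]


if __name__ == "__main__":
    # Same parameters as the vsql example above; values differ on every run.
    for value in normal_rand(rows=3, mean=3.3, stddev=6.3):
        print(value)
```

With a large number of rows, the sample mean should land close to the requested mean, which is a quick sanity check for any implementation of this function.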

Massimo_Loporch · January 2015

Great!!

crosstab!

See you soon :-)

id10t · January 2015

Hi!

I will try, but the syntax will differ from the PG implementation.
I will update you either way, success or failure.

Regards.

Massimo_Loporch · January 2015

As I was saying, for people like me who use the database as a data source for simulations and for statistical and machine-learning models, it would be important to have a function that lays out the calculated metrics horizontally. Thank you, I appreciate your help. I also hope that HP Vertica notices your work and adopts it soon, helping you develop it.

id10t · January 2015

Hi!

I failed. Let me explain where.

Implementing PIVOT requires:

the table data
the cardinality of the pivoted column (it is dynamic and mutable, and that is the problem)

The only way I can implement it is via an external procedure (EP) or a UDF.
Unfortunately, a UDF is out of scope:

Vertica will try to parallelize the UDF.
I need to know the cardinality of the pivoted data in order to define the columns of the pivoted table. Of course I could use ODBC/JDBC and create a Flex table so that the columns are dynamic, but Vertica will parallelize it and you will get garbage.

How do I see it working with an EP?

vsql=> select pivot(src_table=<Table>, dest_table=<Table>, pivot_column=<column>);

where

src_table - the original data to pivot
pivot_column - the column to pivot
dest_table - the table into which the EP inserts the results

Or suggest your own syntax (but keep in mind that I must know how many columns the new table will have, so I have to query the pivoted column for its cardinality).

@massimo
Will you accept a solution with an EP?

PS
The main problem is the cardinality of the pivoted column. I can limit query execution to a single node, but I can't create a table with dynamic columns.
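The two-step shape described above (first query the pivoted column for its distinct values, then create the destination table with one column per value) can be sketched on the client side. This is only an illustration under stated assumptions, not the external procedure discussed in the thread: it uses Python with SQLite so that it runs standalone, the key_column and value_column arguments are hypothetical additions to the proposed src_table/dest_table/pivot_column signature, and the pivoted column is assumed to hold text values.

```python
import sqlite3


def quote_ident(name):
    """Quote an SQL identifier (table or column name)."""
    return '"' + str(name).replace('"', '""') + '"'


def quote_literal(value):
    """Quote a text value as an SQL string literal."""
    return "'" + str(value).replace("'", "''") + "'"


def build_pivot_sql(conn, src_table, dest_table, key_column, pivot_column, value_column):
    src, dest = quote_ident(src_table), quote_ident(dest_table)
    key, piv, val = (quote_ident(c) for c in (key_column, pivot_column, value_column))

    # Step 1: query the pivoted column to learn its cardinality.
    distinct_sql = f"SELECT DISTINCT {piv} FROM {src} WHERE {piv} IS NOT NULL ORDER BY 1"
    values = [row[0] for row in conn.execute(distinct_sql)]

    # Step 2: emit one conditional aggregate per distinct value.
    columns = [
        f"SUM(CASE WHEN {piv} = {quote_literal(v)} THEN {val} END) AS {quote_ident(v)}"
        for v in values
    ]
    select_list = ",\n       ".join([f"{key} AS {key}"] + columns)
    return f"CREATE TABLE {dest} AS\nSELECT {select_list}\nFROM {src}\nGROUP BY {key}"


# Hypothetical sample data, only to exercise the sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (id INTEGER, name TEXT, val REAL)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?)",
    [(1, "a", 1.0), (1, "b", 2.0), (2, "a", 3.0), (2, "c", 4.0)],
)
pivot_sql = build_pivot_sql(conn, "metrics", "metrics_wide", "id", "name", "val")
conn.execute(pivot_sql)
print(pivot_sql)
```

Because the column list is computed immediately before the statement is generated, the destination table is only as current as that first query. If new values appear in the pivoted column later, the table has to be rebuilt.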

id10t · January 2015

Gaussian Distribution source code:
http://pastebin.com/gDPcQTE5

Compile:

g++ -std=c++11 -D HAVE_LONG_INT_64  -I /opt/vertica/sdk/include -Wall -shared -Wno-unused-value -fPIC -o Gaussian.so NormalDistribution.cpp /opt/vertica/sdk/include/Vertica.cpp

Deploy:

CREATE LIBRARY Gaussian AS '/tmp/Gaussian.so';
CREATE TRANSFORM FUNCTION normal_rand AS LANGUAGE 'C++' NAME 'NormalDistributionFactory' LIBRARY Gaussian;

Regards.

Massimo_Loporch · January 2015

Hello Genius! The name "id10t" doesn't seem appropriate for you.
Do you think this could be a possible solution (from Oracle 11g)?

select * from ( select deptno, job, sal from emp ) e
pivot( sum(sal) for job in ( 'CLERK', 'SALESMAN', 'MANAGER', 'ANALYST', 'PRESIDENT' ) )
order by deptno

    DEPTNO    'CLERK'  'SALESMAN'  'MANAGER'  'ANALYST'  'PRESIDENT'
----------  ---------  ----------  ---------  ---------  -----------
        10       1300                   2450                    5000
        20       1900                   2975       6000
        30        950        5600       2850

http://www.oracle.com/technetwork/issue-archive/2008/08-mar/o28asktom-087592.html

Ciao
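When the list of pivot values is written into the query, as in the Oracle example above, the same result can be produced with plain conditional aggregation, with no PIVOT keyword required. The sketch below runs against SQLite from Python purely so that it is self-contained. The emp rows are illustrative sample data chosen so that the totals match the output quoted above. In Vertica, the CASE expressions could presumably be written with DECODE instead, e.g. SUM(DECODE(job, 'CLERK', sal)).

```python
import sqlite3

# One conditional aggregate per value listed in the Oracle IN (...) clause.
PIVOT_SQL = """
SELECT deptno,
       SUM(CASE WHEN job = 'CLERK'     THEN sal END) AS clerk,
       SUM(CASE WHEN job = 'SALESMAN'  THEN sal END) AS salesman,
       SUM(CASE WHEN job = 'MANAGER'   THEN sal END) AS manager,
       SUM(CASE WHEN job = 'ANALYST'   THEN sal END) AS analyst,
       SUM(CASE WHEN job = 'PRESIDENT' THEN sal END) AS president
FROM emp
GROUP BY deptno
ORDER BY deptno
"""

# Illustrative rows (deptno, job, sal); totals match the quoted output.
EMP_ROWS = [
    (10, "CLERK", 1300), (10, "MANAGER", 2450), (10, "PRESIDENT", 5000),
    (20, "CLERK", 800), (20, "CLERK", 1100), (20, "MANAGER", 2975),
    (20, "ANALYST", 3000), (20, "ANALYST", 3000),
    (30, "CLERK", 950), (30, "MANAGER", 2850),
    (30, "SALESMAN", 1600), (30, "SALESMAN", 1250),
    (30, "SALESMAN", 1250), (30, "SALESMAN", 1500),
]


def pivot_emp():
    """Load the sample rows into an in-memory table and return the pivoted rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (deptno INTEGER, job TEXT, sal INTEGER)")
    conn.executemany("INSERT INTO emp VALUES (?, ?, ?)", EMP_ROWS)
    return conn.execute(PIVOT_SQL).fetchall()


if __name__ == "__main__":
    for row in pivot_emp():
        print(row)
```

Empty cells come back as NULL (None in Python), because a CASE without an ELSE branch yields NULL and SUM ignores NULLs.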

id10t · January 2015

Hi!

Q: How can you query the database from a UDF to pivot?
A: Only via an ODBC or JDBC connection, so I have to create a connection to fetch the data to pivot. Vertica hasn't released a native connector, and a UDF doesn't support fetching data from the database.

Q: And is that a problem?
A: Yes. Vertica parallelizes the UDF, so many connections are opened, and Vertica raises an exception because each connection tries to create the pivot table and insert the pivoted data into it. The first thread succeeds but the others throw an exception. I can't use "OVER (PARTITION AUTO)" because I need "OVER ()" to define the pivot column.

Q: Can the Oracle approach be implemented with a UDF?
A: Hm... interesting. I will try. It's more suitable, since I don't need to know the cardinality: it is defined in the query. Nice.

Q: But you still need to fetch data from the database. How will you limit the query to a single execution?
A:

I can insert the data into a temporary table (but I'm not sure that will solve the "many connections" problem).
I can use "CREATE TABLE IF NOT EXISTS", so a second thread will only raise a warning.
If these methods fail, my last option is to limit the query to single-threaded execution. I can define which projections to use.

Example:

SELECT set_optimizer_directives('AvoidUsingProjections=prj_sup,prj_rep');

I have no more options. Maybe someone can suggest something?

Q: And what if the projection is segmented?
A: I can't answer that yet; I need to investigate. But I'm afraid that without help from Vertica Support I can't implement it with a UDF if the table is segmented.

Maybe Vertica Support can provide a hint that limits a query to a single thread?

PS
Hm... and what about CPU affinity? I will try. It looks like "CREATE TABLE IF NOT EXISTS" solves the problem with exceptions.
I will update.

Regards.
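The "CREATE TABLE IF NOT EXISTS" idea from the post above can be shown in isolation: when several workers each try to create the same result table, a plain CREATE TABLE fails for every worker after the first, while the IF NOT EXISTS form is safe to repeat. This is a small Python sketch against SQLite, used only because it runs standalone. The pivot_result table and its columns are hypothetical, and how Vertica reports the repeated statement (the post mentions a warning) is not modeled here.

```python
import sqlite3


def create_pivot_table(conn, if_not_exists):
    """Create the (hypothetical) result table, optionally tolerating an existing one."""
    clause = "IF NOT EXISTS " if if_not_exists else ""
    conn.execute(f"CREATE TABLE {clause}pivot_result (deptno INTEGER, clerk INTEGER)")


conn = sqlite3.connect(":memory:")

# The first worker creates the table.
create_pivot_table(conn, if_not_exists=False)

# A second worker repeating the plain statement gets an error.
try:
    create_pivot_table(conn, if_not_exists=False)
    second_plain_create_failed = False
except sqlite3.OperationalError:
    second_plain_create_failed = True

# With IF NOT EXISTS, the repeated statement is a no-op instead of an error.
create_pivot_table(conn, if_not_exists=True)

print("plain CREATE failed on repeat:", second_plain_create_failed)
```

This only removes the failure on table creation. It does not stop each worker from inserting its own copy of the pivoted rows, which is the other half of the problem described above.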

id10t · January 2015

Hi Massimo.

Yes, it is possible with a UDF multi-phase transform function, but I compared its performance with the DECODE function, and the performance is beneath all criticism. So far my code makes some assumptions (read: bugs), because I simplified the problem; I just wanted to understand whether I could do it.

***************************

When I fix the bugs I will publish it. Feel free to remind me.

***************************

But I recommend that you compare the performance of a UDF against a built-in function. For example, the Vertica SDK includes an example UDF aggregate function for AVG:

/opt/vertica/sdk/examples/AggregateFunctions/Average.cpp

Keep in mind that a multi-phase transform will be at least 100x slower than doing it with DECODE.

Regards.
