Are there sql commands to pivot data?

Comments

  • I prefer an implementation like the one below if possible: 
    http://www.postgresql.org/docs/9.1/static/tablefunc.html



  • Hi!

    You can try to open a feature request. Post under topic IDEAS

    Regards
  • Hm... took a look on tablefunc... its not so hard to implement it with UDF. Just for fun I implemented an easiest function -  normal_rand(numrows, mean, stddev)

    Example
    dev=> select * from user_functions where function_name ilike '%normal%';
    -[ RECORD 1 ]----------+-----------------------------------------------------
    schema_name | public
    function_name | normal_rand
    procedure_type | User Defined Transform
    function_return_type | Float
    function_argument_type |
    function_definition | Class 'NormRandFactory' in Library 'public.Gaussian'
    volatility |
    is_strict | f
    is_fenced | t
    comment |


    dev=> select normal_rand(USING PARAMETERS rows=3, mean=3.3, stddev=6.3) over (PARTITION AUTO);
    RAND_VALUE
    -------------------
    2.53161555990794
    -3.54695367884655
    7.6110266459183
    (3 rows)

    Bench
    dev=> \! lscpu | grep -P '(^CPU.s.)|(MHz)'
    CPU(s): 8
    CPU MHz: 1200.000
    dev=> \o /dev/null
    dev=> \timing
    Timing is on.


    1MIL rows:
    dev=> select normal_rand(USING PARAMETERS rows=1000000, mean=3.3, stddev=6.3) over (PARTITION AUTO);
    Time: First fetch (1000 rows): 72.207 ms. All rows formatted: 1405.582 ms
    2,5MIL rows:
    dev=> select normal_rand(USING PARAMETERS rows=2500000, mean=3.3, stddev=6.3) over (PARTITION AUTO);
    Time: First fetch (1000 rows): 68.825 ms. All rows formatted: 3521.669 ms
    5MIL rows:
    dev=> select normal_rand(USING PARAMETERS rows=5000000, mean=3.3, stddev=6.3) over (PARTITION AUTO);
    Time: First fetch (1000 rows): 72.705 ms. All rows formatted: 7135.159 ms
    Nice linear scalability
    +=========+========+
    | ROWS | TIME(s)|
    +=========+========+
    | 1000000 | 1.4 |
    | 2500000 | 3.5 |
    | 5000000 | 7.1 |
    +---------+--------+
    I can share a code if you are interesting in this function. May be (also just for fun) I will implement others functions too. Can you tell me witch function is more important for you? And I will start from this function.

    best
  • Great !  !

    crosstab !

    see you soon :-  )

  • Hi!


    I will try, but syntax will be differ from PG implementation.
    I will update you in any case - success or failed.


    Regards.
  • As I got to say, it would be important for people like me who uses the database as a data source for simulations and statistical models of machine learning, have a function that creates horizontal metrics that were calculated, I thank you and I appreciate your help, I hope also that hp-vertica may notice your work and adopt it as soon as helping you develop it.
  • Hi!

    I failed. I will explain where I failed.

    Implementation of PIVOT require:
    • table data
    • cardinality of PIVOT'ed column (its dynamic and mutable and its a problem)
    The only way I can implement it - via external procedures or UDF.
    Unfortunately UDF is out of scope:
    • Vertica will try to parallelize UDF
    • I need to know a cardinality of pivoted data. I need it to define a pivoted table columns. Of cause I can use in ODBC/JDBC and create a Flex table and so columns will be dynamic, but Vertica will parallelize it and you will get a garbage.
    How I see it with EP?
    vsql=> select pivot(src_table=<Table>, dest_table=<Table>, pivot_column=<column>);
    where
    • src_table - an original data for pivoting
    • pivot_column -  pivoted column
    • dest_table - EP will insert results to this table
    Or suggest your syntax(but take in mind - I must know how many columns will be in a new table and so I have to query a pivoted column for cardinality)


    @massimo
    Will you accept solution with EP?

    PS
    The main problem - is cardinality of pivoted column. I can limit query execution on one node only, but I can't create table with dynamic columns.
  • Gaussian Distribution source code:
    http://pastebin.com/gDPcQTE5

    Compile:
    g++ -std=c++11 -D HAVE_LONG_INT_64  -I /opt/vertica/sdk/include -Wall -shared -Wno-unused-value -fPIC -o Gaussian.so NormalDistribution.cpp /opt/vertica/sdk/include/Vertica.cpp
    Deploy:
    CREATE LIBRARY Gaussian AS '/tmp/Gaussian.so';
    CREATE TRANSFORM FUNCTION normal_rand AS LANGUAGE 'C++' NAME 'NormalDistributionFactory' LIBRARY Gaussian;
    Regards.
  • Hello  Genius !,  " id10t " it seems not appropriate to you,
    do you think this could be a possible solution (from oracle 11g) ?

    select * from( select deptno, job, sal from emp ) e 

    pivot(   sum (sal )  for job in ( 'CLERK', 'SALESMAN', 'MANAGER', 'ANALYST', 'PRESIDENT' ) )

    order by deptno        DEPTNO     'CLERK'     'SALESMAN'      'MANAGER'    'ANALYST'     'PRESIDENT' -----------   --------    ------------   ------------   ----------   -------------          10       1300                           2450                         5000          20       1900                           2975         6000          30        950           5600            2850   
    http://www.oracle.com/technetwork/issue-archive/2008/08-mar/o28asktom-087592.html


    Ciao

  • Hi!

    Q: How can you query a database from UDF for pivot?
    A: Only via ODBC or JDBC connection. So I have to create a connection to fetch data for pivot. Vertica isn't released a native connector and UDF doesn't support "data fetch" from database.


    Q: And is it a problem?
    A: Yes, Vertica parallelizes UDF and so many connections are opened. And Vertica raises exception, because each connection tries to create pivot table and to insert a pivot data to it. First tread succeeded but others throws an exception. I can't use in "OVER (PARTITION AUTO)" because i need "OVER ()" to define a pivot column.


    Q: Does Oracle approach can be implemented with UDF?
    A: Hm... interesting. I will try. Its more suitable, since I don't need to know a cardinality. Cardinality defined in query. Nice.


    Q: But you still need to fetch data from database. How you will limit a query to a single execution?
    A:
    • I can insert data to temporary table(but Im not sure it will solve a problem with "many connections")
    • I can define "CREATE TABLE IF NOT EXISTS", so a second thread will trow an warning only.
    • If these method will fail so I have a last option to limit a query to a single thread execution. I can define witch projections to use
    Example:
    SELECT set_optimizer_directives('AvoidUsingProjections=prj_sup,prj_rep');
    I have no more options. May be someone will suggest a something?


    Q: And what if a projection is segmented?
    A: So far I can't answer, I need to investigate it, but Im afraid that without a help from Vertica Support I can't implement it with UDF if table is segmented.

    May be Vertica Support will provide a hint that limits a query to a single thread?

    PS
    Hm... and what about CPU Affinity? I will try, looks like "CREATE TABLE IF NOT EXISTS" solves a problem with exceptions.
    I will update.


    Regards.
  • Hi Massimo.


    Yes, it is possible with UDF Multi-Phase Transform function, but I compared a performance with DECODE function - performance is under any critics.  So far my code have some assumptions(read bugs), because I simplified a problem, just wanted to understand - can I do it?

    ***************************

    When I will fix a bugs I will publish it.  Feel free to remind.

    ***************************

    But I recommend you to compare a performance of some UDF function against a built-in function. For example, Vertica provides an UDF Aggregate function - AVG:
    /opt/vertica/sdk/examples/AggregateFunctions/Average.cpp
    and take in mind that Transform Multi-Phase will be x100(at least) slower than doing it with DECODE.


    Regards.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file