Optimizing DISTANCE using projections

zunzun Community Edition User
edited October 2022 in General Discussion

Hi folks,

I am currently using the DISTANCE function in vertica. I'm not getting great performance, so wanted to get your opinion on if this can be optimized. If anyone has optimized DISTANCE using projections before, your opinion would be greatly appreciated.

My Query:

SELECT Date_trunc('month', modified_date),
      APPROXIMATE_COUNT_DISTINCT(userId)  AS distinct_user_count
FROM   mySchema.userProjection
WHERE modified_date BETWEEN '2021-01-01' AND '2022-01-31'
       AND country = 'US'
       AND DISTANCE(40.71427, -74.00597, userLatitude, userLongitude) < 160
GROUP BY 1 ORDER BY 1

My Projection:

CREATE PROJECTION mySchema.userProjection
    (
     country ENCODING RLE,
     modified_date ENCODING ZSTD_FAST_COMP,
     userLatitude ENCODING ZSTD_FAST_COMP,
     userLongitude ENCODING ZSTD_FAST_COMP,
     userId ENCODING ZSTD_FAST_COMP
     )
    AS
        SELECT myTable.country,
               myTable.modified_date,
               myTable.userLatitude,
               myTable.userLongitude,
               myTable.userId,
        FROM mySchema.myTable
        ORDER BY myTable.country,
                 myTable.userLatitude,
                 myTable.userLongitude
    SEGMENTED BY hash(myTable.country, myTable.userId, myTable.modified_date) ALL NODES;

The table is partitioned by modified_date. Query currently takes about 16s, I was really aiming for <5s.

My guess is DISTANCE might not be using the projection correctly.

Appreciate any help!

Answers

  • HibikiHibiki Vertica Employee Employee

    Can you check which operator took time? Does myTable have modified_date column?

  • zunzun Community Edition User

    @Hibiki How can I figure out which operator took more time? Yes, the table has a modified_date column. The table is partitioned on modified_date (specifically partitioned on Date_trunc('month', modified_date))

  • HibikiHibiki Vertica Employee Employee

    Please collect the query profile by using PROFILE clause.

    PROFILE
    SELECT Date_trunc('month', modified_date),
           APPROXIMATE_COUNT_DISTINCT(userId)  AS distinct_user_count
    FROM   mySchema.userProjection
    WHERE  modified_date BETWEEN '2021-01-01' AND '2022-01-31'
           AND country = 'US'
           AND DISTANCE(40.71427, -74.00597, userLatitude, userLongitude) < 160
    GROUP BY 1 ORDER BY 1;
    

    You can see the transaction ID and statement ID as below.

    HINT:  Select * from v_monitor.execution_engine_profiles where transaction_id=45035996766964481 and statement_id=1;
    

    Then, please execute the following query to collect the time for each operator.

    SELECT
      path_id,
      operator_name,
      SUM(CASE WHEN counter_name = 'execution time (us)' THEN counter_value ELSE 0 END) exec_time_us,
      SUM(CASE WHEN counter_name = 'clock time (us)' THEN counter_value ELSE 0 END) clock_time_us
    FROM
      v_monitor.execution_engine_profiles
    WHERE
      transaction_id = 45035996766964481
      AND statement_id = 1
    GROUP BY 1, 2
    ORDER BY 1, 2;
    

    You can get the query plan as well by the following query.

    SELECT
      path_id,
      path_line
    FROM
      v_monitor.query_plan_profiles
    WHERE
      transaction_id = 45035996766964481
      AND statement_id = 1
    ORDER BY
      path_id,
      path_line_index;
    

    Please replace the transaction ID and statement ID by yours.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file