Avoid Query going to other Nodes if segment upper and lower bound is provided.

esangrame · February 2019

Hi,
I'm trying to use projection_segmnets for exporting out data in parallel.

SELECT get_projection_segments('my_table_b0') shows below segment ranges

v_node0001 v_node0004 v_lnode0002 v_node0003 v_node0005
858993458 1717986917 2576980376 3435973835 4294967294
4294967295 858993459 1717986918 2576980377 3435973836

--> my_table is segmented and ordered on column User_id.
--> User_id (123) resides in v_node0004

case 1:

Explain Select * FROM my_table where User_id = 123;

QUERY PLAN DESCRIPTION:

explain
select * FROM my_table where User_id = 123;

Access Path:
+-STORAGE ACCESS for my_table [Cost: 411, Rows: 789K] (PATH ID: 1)
| Projection: my_table_b0
| Materialize: my_table.User_id
| Filter: (my_table.User_id = (123))
| Execute on: v_node0004

case 2:

explain select person_id FROM my_table where person_id in (123,123) ;

QUERY PLAN DESCRIPTION:

Access Path:
+-STORAGE ACCESS for my_table [Cost: 504K, Rows: 248M] (PATH ID: 1)
| Projection: my_table_b0
| Materialize: my_table.User_id
| Filter: (my_table.User_id = ANY (ARRAY[(123), (123)]))
| Execute on: All Nodes

case 3:

explain select count(*) from my_table WHERE (0xffffffff & hash(User_id)) BETWEEN 858993459 AND 1717986917 ;

QUERY PLAN DESCRIPTION:

Access Path:
+-GROUPBY NOTHING [Cost: 504K, Rows: 1] (PATH ID: 1)
| Aggregates: count(*)
| Execute on: All Nodes
| +---> STORAGE ACCESS for my_table [Cost: 504K, Rows: 248M] (PATH ID: 2)
| | Projection: my_table_b0
| | Filter: (((4294967295 & hash(my_table.User_id)) >= 858993459) AND ((4294967295 & hash(my_table.User_id)) <= 1717986917))
| | Execute on: All Nodes

case 1: When provided single user_id it executes on one node i.e on the Node where the data resides
Even if we have logged in through NODE 1 in VSQL, The query get's executed in NODE 4 only.

Kindly help me with below queries:

Case 2: Why the query is executing on All Nodes, though the user_id are same and belongs to NODE 4, shouldn't it execute only on NODE 4 as in Case 1.

Case 3: If my filter condition is only looking at segment range of NODE 4, why the query is getting distributed on ALL Nodes.

Though Query is getting distributed to ALL NODES, Is it that only NODE 4 is contributing to the query and other don't do any thing (i.e there is no processing cost for this on other nodes except NODE 4).
Is there a system table from where this could be confirmed for a run like this.
If I look into execution_engine_profiles, all Nodes are going through same steps as NODE 4 is.

I'm trying to avoid query going to other nodes if it belongs to Node-4.
Tried case 2 & case 3 by logging in from Node-4 but the execution plan remained same as earlier.

Also please suggest if there is any other way to achieve this use case. (Please Note: I may join this table to another table , but those tables are also segmented and ordered by user_id )

Dingqiang · February 2019

That's because at now, our optimizer can only generate single node plan with equal predicate on segment columns of segmented projections , or unsegmented projections.

esangrame · February 2019

Thanks.

I may be repeating the question, but please help me with this.

Case 3: If my filter condition is only looking at segment range of NODE 4, why the query is getting distributed on ALL Nodes.

Though Query is getting distributed to ALL NODES, Is it that only NODE 4 is contributing to the query and other do Nothing (i.e there is no processing cost for this on other nodes except NODE 4).
Is there a system table from where this could be confirmed for a run like this.
If I look into execution_engine_profiles, all Nodes are going through same steps as NODE 4 is.

Jim_Knicely · February 2019

Maybe LOCAL_NODE_NAME() can help?

dbadmin->vmart@sandbox1=>* SELECT node_name FROM nodes;
    node_name
------------------
 v_vmart_node0001
 v_vmart_node0002
 v_vmart_node0003
(3 rows)

dbadmin->vmart@sandbox1=>* SELECT c1, COUNT(c1) FROM test GROUP BY c1;
 c1 | COUNT
----+--------
  4 | 788160
  5 |    861
  1 |  16420
  2 |  16420
  3 | 229880
(5 rows)

dbadmin->vmart@sandbox1=>* SELECT COUNT(*) FROM test WHERE local_node_name() = 'v_vmart_node0001';
 COUNT
-------
   861
(1 row)

dbadmin->vmart@sandbox1=>* SELECT COUNT(*) FROM test WHERE local_node_name() = 'v_vmart_node0002';
 COUNT
--------
 262720
(1 row)

dbadmin->vmart@sandbox1=>* SELECT COUNT(*) FROM test WHERE local_node_name() = 'v_vmart_node0003';
 COUNT
--------
 788160
(1 row)

dbadmin->vmart@sandbox1=>* SELECT COUNT(*) FROM test;
  COUNT
---------
 1051741
(1 row)

dbadmin->vmart@sandbox1=>* SELECT 861+262720+788160 total_row_count;
 total_row_count
-----------------
         1051741
(1 row)

esangrame · March 2019

Thanks for replying.

It still goes to all nodes.

 QUERY PLAN DESCRIPTION: 
 ------------------------------
 explain select * from my_table where local_node_name() ='v_node0005' ;

 Access Path:
 +-STORAGE ACCESS for my_table [Cost: 5M, Rows: 2B] (PATH ID: 1)
 |  Projection: my_table_b0
 |  Materialize: my_table.User_id
 |  Filter: (local_node_name() = 'v_node0005')
 |  Execute on: All Nodes

Jim_Knicely · March 2019

Although all nodes participate, only the local node specified does the bulk of the work.

dbadmin->vmart@sandbox1=>* SELECT local_node_name(), COUNT(*) FROM test GROUP BY local_node_name() ORDER BY 1;
 local_node_name  | COUNT
------------------+--------
 v_vmart_node0001 |    861
 v_vmart_node0002 | 262720
 v_vmart_node0003 | 788160
(3 rows)

dbadmin->vmart@sandbox1=>* PROFILE  SELECT COUNT(*) FROM test WHERE local_node_name() = 'v_vmart_node0003';
NOTICE 4788:  Statement is being profiled
HINT:  Select * from v_monitor.execution_engine_profiles where transaction_id=45035996280421369 and statement_id=13;
NOTICE 3557:  Initiator memory for query: [on pool general: 19885 KB, minimum: 19885 KB]
NOTICE 5077:  Total memory required by query: [19885 KB]
 COUNT
--------
 788160
(1 row)

Time: First fetch (1 row): 43.151 ms. All rows formatted: 43.178 ms
dbadmin->vmart@sandbox1=>* Select node_name, SUM(counter_value) from v_monitor.execution_engine_profiles where transaction_id=45035996280421369 and statement_id=13 and counter_name = 'rows produced' group by node_name;
    node_name     |   SUM
------------------+---------
 v_vmart_node0001 |      10
 v_vmart_node0003 | 1576323
 v_vmart_node0002 |       3
(3 rows)

If you want only one node to participate, you'll have to replicate the table projection.

dbadmin->vmart@sandbox1=>* SELECT node_name, projection_name, is_segmented FROM projections WHERE anchor_table_name = 'test_rep';
    node_name     | projection_name | is_segmented
------------------+-----------------+--------------
 v_vmart_node0001 | test_rep_super  | f
 v_vmart_node0002 | test_rep_super  | f
 v_vmart_node0003 | test_rep_super  | f
(3 rows)

dbadmin->vmart@sandbox1=>* PROFILE SELECT COUNT(*) FROM test_rep;
NOTICE 4788:  Statement is being profiled
HINT:  Select * from v_monitor.execution_engine_profiles where transaction_id=45035996280421369 and statement_id=16;
NOTICE 3557:  Initiator memory for query: [on pool general: 9284 KB, minimum: 9284 KB]
NOTICE 5077:  Total memory required by query: [9284 KB]
  COUNT
---------
 2103482
(1 row)

dbadmin->vmart@sandbox1=>* Select node_name, SUM(counter_value) from v_monitor.execution_engine_profiles where transaction_id=45035996280421369 and statement_id=16 and counter_name = 'rows produced' group by node_name;
    node_name     |   SUM
------------------+---------
 v_vmart_node0001 | 4206971
(1 row)

 EXPLAIN SELECT COUNT(*) FROM test_rep;

 Access Path:
 +-GROUPBY NOTHING [Cost: 3, Rows: 1 (NO STATISTICS)] (PATH ID: 1)
 |  Aggregates: count(*)
 |  Execute on: Query Initiator
 | +---> STORAGE ACCESS for test_rep [Cost: 2, Rows: 10K (1 RLE) (NO STATISTICS)] (PATH ID: 2)
 | |      Projection: public.test_rep_super
 | |      Execute on: Query Initiator

esangrame · March 2019

Thanks Jim,

All Nodes do not produce the rows, this is correct as we have filtered on the segment range.
Could you please help me understand how we get the value 10 for Node1 and 3 for Node2, though all records for the query is coming from Node3.

    node_name     |   SUM
------------------+---------
 v_vmart_node0001 |      10
 v_vmart_node0003 | 1576323
 v_vmart_node0002 |       3

In my case I have to segment the table on user_id with a KSAFE of 1 so projection table will have "is segmented" as true. Is there any other way to get the result as you got but with segmented projections.

Jim_Knicely · March 2019

Hi,

Question 2:

If we only look at "Scanned" data we see that all rows do come from a single node:

dbadmin->vmart@sandbox1=>* Select node_name, operator_name, SUM(counter_value) from v_monitor.execution_engine_profiles where transaction_id=45035996280421369 and statement_id=13 and counter_name = 'rows produced' and operator_name = 'Scan' group by node_name, operator_name ORDER BY 1, 2;
    node_name     | operator_name |  SUM
------------------+---------------+--------
 v_vmart_node0001 | Scan          |      0
 v_vmart_node0002 | Scan          |      0
 v_vmart_node0003 | Scan          | 788160
(3 rows)

In my example, all of the data did come from node 3. There are other operators that produce rows (i.e. GroupByPipe, StorageUnion, etc.).

Question 3:

For a segmented projection I think the best you can do is limit the computation to 2 nodes (buddy projections) because Vertica is going to look for the buddy segments.

dbadmin->vmart@sandbox1=>* PROFILE SELECT /*+ skipnode(v_vmart_node0002, v_vmart_node0003) */ COUNT(*) FROM test WHERE local_node_name() = 'v_vmart_node0001';
ERROR 3586:  Insufficient projections to answer query
DETAIL:  Path segment called for on node v_vmart_node0002 can not be computed

dbadmin->vmart@sandbox1=>* PROFILE SELECT /*+ skipnode(v_vmart_node0003) */ COUNT(*) FROM test WHERE local_node_name() = 'v_vmart_node0001';
NOTICE 4788:  Statement is being profiled
HINT:  Select * from v_monitor.execution_engine_profiles where transaction_id=45035996280427674 and statement_id=9;
NOTICE 3557:  Initiator memory for query: [on pool general: 19376 KB, minimum: 19376 KB]
NOTICE 5077:  Total memory required by query: [19376 KB]
 COUNT
--------
 788160
(1 row)

dbadmin->vmart@sandbox1=>* Select node_name, operator_name, SUM(counter_value) from v_monitor.execution_engine_profiles where transaction_id=45035996280427674 and statement_id=9 and counter_name = 'rows produced' and operator_name = 'Scan' group by node_name, operator_name ORDER BY 1, 2;
    node_name     | operator_name |  SUM
------------------+---------------+--------
 v_vmart_node0001 | Scan          | 788160
 v_vmart_node0002 | Scan          |      0
(2 rows)

esangrame · March 2019

For question 3:
the tables created in my case has one buddy projection (ie _b0, _b1) and each of them are replicated in all nodes. So for 5 nodes all of them have (_b0, _b1).

It seems we can't do anything with this.
Please suggest.

We're Moving!

Create My New Community Account Now

Avoid Query going to other Nodes if segment upper and lower bound is provided.

case 1:

case 2:

Kindly help me with below queries:

Comments

Leave a Comment