Broadcast join mechanism
Hi, I am wondering how the broadcast join mechanism works. For some reason, I cannot set up all the nodes in my cluster in ONE sub-network. 1) Does that mean the broadcast join in my cluster is actually a "multicast join"? 2) In a broadcast join, would Vertica broadcast the data to each sub-network, or would it unicast the data to each node? 3) In the documentation, I found that only UDP and TCP are mentioned. Which protocol does Vertica use in a broadcast join?
Comments
SELECT sales_quantity, sales_dollar_amount, transaction_type, cc_name
FROM online_sales.online_sales_fact
INNER JOIN online_sales.call_center_dimension
  ON (online_sales.online_sales_fact.call_center_key = online_sales.call_center_dimension.call_center_key
      AND sale_date_key = 156)
ORDER BY sales_dollar_amount DESC;
In the join plan (using the default partition plan), I found a BROADCAST join:

+-SORT [Cost: 8K, Rows: 25K] (PATH ID: 1)
|  Order: online_sales_fact.sales_dollar_amount DESC
|  Execute on: All Nodes
| +---> JOIN HASH [Cost: 8K, Rows: 25K] (PATH ID: 2) Inner (BROADCAST)
| |      Join Cond: (online_sales_fact.call_center_key = call_center_dimension.call_center_key)
| |      Materialize at Output: online_sales_fact.sales_quantity, online_sales_fact.sales_dollar_amount, online_sales_fact.transaction_type
| |      Execute on: All Nodes
…
Furthermore, by monitoring the network traffic, I found that the BROADCAST join establishes N-1 TCP unicast connections (where N is the number of nodes) from one node to all the others, and sends the same data over each of them (i.e., the same data is sent across the network N-1 times). My question is: why does Vertica use TCP unicast to 'emulate' (implement) the BROADCAST join rather than a reliable broadcast protocol? Wouldn't a real reliable broadcast protocol save a lot of network traffic in a BROADCAST join?
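To put rough numbers on the observation above, here is a minimal back-of-the-envelope sketch in Python. It is not based on Vertica internals: the function names and the payload size are made up for illustration, and it assumes an idealized multicast in which every node sits on one multicast-capable segment and the sender puts each byte on the wire once. It also ignores protocol headers, acknowledgements, and retransmissions.

```python
def unicast_broadcast_bytes(num_nodes: int, payload_bytes: int) -> int:
    """Bytes the sending node emits when it copies the payload to every
    other node over its own unicast connection (N-1 copies in total)."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return (num_nodes - 1) * payload_bytes


def ideal_multicast_bytes(num_nodes: int, payload_bytes: int) -> int:
    """Bytes the sending node emits under an idealized multicast
    (assumption: one transmission reaches all other nodes)."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return payload_bytes if num_nodes > 1 else 0


if __name__ == "__main__":
    payload = 10 * 1024 * 1024  # hypothetical 10 MiB inner table
    for n in (2, 4, 8, 16):
        uni = unicast_broadcast_bytes(n, payload)
        multi = ideal_multicast_bytes(n, payload)
        print(f"N={n}: unicast={uni} bytes, ideal multicast={multi} bytes")
```

Under these assumptions the sender's outbound traffic grows linearly with the number of nodes in the unicast case and stays constant in the idealized multicast case. With nodes spread over several sub-networks, as in my cluster, a single transmission would not reach every node without multicast routing, so the real saving could be smaller than this sketch suggests.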