Broadcast join mechanism

Petrie_Wong · August 2013

Hi, I am wondering how the broadcast join mechanism works. For some reason, I cannot set up all the nodes in my cluster in ONE sub-network.

1) Does that mean that the broadcast join in my cluster is actually a "multicast join"?
2) In a broadcast join, would Vertica broadcast the data to each sub-network, or would it unicast the data to each node?
3) In the documentation, I found that only UDP and TCP are mentioned. Which protocol does Vertica use in a broadcast join?

[Deleted User] · August 2013

Hi Petrie,

Hm... I'm not quite sure how your questions apply to Vertica, but let me try to answer:

First, Vertica uses UDP Broadcast by default for its control messages between nodes. There is no traditional multicast option.

We do not support configurations that span subnets. It actually is technically possible to span subnets -- look at the installer options -- but it is likely to perform terribly. Many routers choke on the number of connections and the amount of throughput that we produce, and the added latency can cause issues with our cluster-agreement protocol. So your clusters are not just slow, but unstable. We do have customers with clusters that span subnets, but most of them got into that state by mistake and are working hard to get out of it.

Regarding broadcast joins, to answer your first question, I'm not sure what you're referring to exactly? Vertica doesn't do any sort of conventional broadcast join -- too slow. Broadcast sends all data to every node. But in a join, each node needs a different part of that data; so giving it all to everyone is wasteful. Also, small tables are typically pre-replicated in Vertica; there's no need to ship them around at all at query-time. As a result, for performance reasons, we're all TCP Unicast at the data layer.

Adam
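The contrast drawn above, between sending the whole table to every node and sending each node only the part it needs, can be illustrated with a small sketch. This is a toy model, not Vertica's implementation: the function names, the modulo-hash placement, the 200-row table, and the 4-node cluster are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy model only: these helpers are hypothetical and do not reflect Vertica internals.

def broadcast_rows_sent(rows, num_nodes):
    """Broadcast: the source node ships every row to each of the other nodes."""
    return len(rows) * (num_nodes - 1)

def resegment_by_key(rows, num_nodes, key):
    """Resegmentation: each row is sent only to the node that owns its join key."""
    per_node = defaultdict(list)
    for row in rows:
        # Assumed placement rule: hash of the join key modulo the node count.
        per_node[hash(row[key]) % num_nodes].append(row)
    return per_node

# Assumed sample data: 200 dimension rows on an assumed 4-node cluster.
rows = [{"call_center_key": k, "cc_name": f"center-{k}"} for k in range(200)]
num_nodes = 4

per_node = resegment_by_key(rows, num_nodes, "call_center_key")
rows_moved_resegmented = sum(len(part) for part in per_node.values())

print("rows shipped by broadcast:     ", broadcast_rows_sent(rows, num_nodes))
print("rows shipped by resegmentation:", rows_moved_resegmented)
```

In this model, broadcast cost grows with the number of nodes, while resegmentation moves each row at most once regardless of cluster size.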

Petrie_Wong · August 2013

Hi Adam, consider query_09 in Vertica's example database, VMart:

  SELECT sales_quantity, sales_dollar_amount,
         transaction_type, cc_name
  FROM online_sales.online_sales_fact
  INNER JOIN online_sales.call_center_dimension
    ON (online_sales.online_sales_fact.call_center_key =
        online_sales.call_center_dimension.call_center_key
        AND sale_date_key = 156)
  ORDER BY sales_dollar_amount DESC;

In the query plan (using the default partition plan), I found a BROADCAST join:

   +-SORT [Cost: 8K, Rows: 25K] (PATH ID: 1)
   |  Order: online_sales_fact.sales_dollar_amount DESC
   |  Execute on: All Nodes
   | +---> JOIN HASH [Cost: 8K, Rows: 25K] (PATH ID: 2) Inner (BROADCAST)
   | |      Join Cond: (online_sales_fact.call_center_key = call_center_dimension.call_center_key)
   | |      Materialize at Output: online_sales_fact.sales_quantity, online_sales_fact.sales_dollar_amount, online_sales_fact.transaction_type
   | |      Execute on: All Nodes
   …

Furthermore, by monitoring the network traffic, I found that the BROADCAST join establishes N-1 TCP unicast connections (where N is the number of nodes) from one node to all the others, and sends the same data over each of them (i.e., the same data is sent across the network N-1 times).

My question is: why does Vertica use TCP to 'emulate' (implement) the BROADCAST join instead of using a reliable broadcast protocol? Wouldn't a real reliable broadcast protocol save a lot of network traffic in a BROADCAST join?

[Deleted User] · August 2013

Hi Petrie,

It sounds like you haven't finished loading the VMart schema. In particular, please go through the "Create a Comprehensive Design" step in the Getting Started Guide (page 39): https://my.vertica.com/docs/6.1.x/PDF/HP_Vertica_6.1.x_GettingStartedGuide.pdf

You'll find that the join is no longer flagged as BROADCAST. (In Vertica, if you haven't run the DBD, you're not done setting up your schema :-) )

That said, it's a fair question: Why do we not use a proper broadcast protocol? Well, a few reasons.

One, actually it won't save much traffic. It's generally only used for small tables; bigger tables will be resegmented, as described above. (call_center_dimension is very small -- ~20kb. It'll take a fraction of a second to distribute via any protocol that we support.)

Two, it doesn't come up often. As noted above, a completed design generally doesn't attempt to broadcast tables.

Three, broadcast technologies are tricky to get right. As I mentioned, we use broadcast for control messages already. This causes headaches for people who use network hardware that deals poorly with lots of UDP traffic (as has been discussed in other threads in these forums).

That said, could we get it right? Yes, most likely. And do there exist queries somewhere that could be made more efficient with broadcast? Yes, slightly so. So it's an optimization that we may add in the future.

If you have data showing that it's very important for your use case (in such a way that you can't do better entirely by improving your projection design), you're welcome to bring up an Idea (with specific details) or a support case. If you have data showing that it's important, and you're experienced with broadcast networking and feel that you could implement it, well, we are hiring :-)

Adam
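To put rough numbers on the "it won't save much traffic" point, here is a back-of-the-envelope sketch. It assumes the ~20kb figure means about 20 kilobytes, that unicast sends one full copy to each of the other N-1 nodes (as observed earlier in the thread), and that an idealized broadcast puts exactly one copy on the wire with no retransmissions. The function names and node counts are hypothetical.

```python
# Back-of-the-envelope model; all names and node counts here are illustrative assumptions.

def unicast_bytes(table_bytes, num_nodes):
    """One full copy of the table is sent to each of the other N-1 nodes."""
    return table_bytes * (num_nodes - 1)

def ideal_broadcast_bytes(table_bytes):
    """Idealized broadcast: a single copy on the wire, ignoring retransmissions."""
    return table_bytes

# Assumption: "~20kb" in the thread is read as roughly 20 kilobytes.
TABLE_BYTES = 20 * 1024

for num_nodes in (4, 16, 64):
    sent = unicast_bytes(TABLE_BYTES, num_nodes)
    saved = sent - ideal_broadcast_bytes(TABLE_BYTES)
    print(f"{num_nodes:>3} nodes: unicast sends {sent / 1024:.0f} KB, "
          f"ideal broadcast would save {saved / 1024:.0f} KB")
```

Under these assumptions, even the 64-node case moves a little over a megabyte in total, which is consistent with the claim that a table this small distributes in a fraction of a second over any of the protocols discussed.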
