HP Vertica Innovations: try them out - give us your feedback!

Take a look at some of the cool new technologies under development and exploration at Vertica. Our new HP Vertica Innovations effort is a way for you to evaluate some of the fresh ideas we have in incubation and to provide your feedback. Our goal is to promote into the product the ideas that provide the most value to you, our customers.

Today, we have three Innovations available for download:

  • HP Vertica Place – Add the aspect of ‘where’ to your analytics through the storage and analysis of geospatial data in real time. Load well-known ESRI-format shapefiles and use Open Geospatial Consortium (OGC) standards-based functionality to represent and analyze big data related to the locations of people, properties, and regions.
  • HP Vertica Pulse – Answer questions about what customers are saying and feeling about a brand with this in-database sentiment analysis tool. Use HP Vertica Pulse to “score” Tweets related to products and services to help you gauge the most popular topics of interest, analyze how sentiment changes over time, identify advocates and detractors, and view high-level aggregate results as well as individual comment-level results.
  • HP Vertica Distributed R - Accelerate the analysis of large data sets by running R computations on multiple nodes. Now, data scientists can overcome the scalability and performance limitations of R to tackle problems not previously solvable.

Look for more in-depth posts on our blog and in this community over the coming weeks. To download, visit the HP Vertica Community Marketplace at http://vertica.com/marketplace and click on the Innovations tab. Please take a look through the Terms and Conditions, as they contain important information about the Innovations effort. If you have comments, questions, or problems with any of these efforts, please post to the Innovations category in our community.

We want to hear from you!  Try the Innovations and let us know what you think!  Would this functionality have an impact on your business?  What aspect did we overlook?  Posting your feedback to this community is the surest way to help us bring the most important efforts to market.

Comments

  • With respect to 'HP Vertica Distributed R', could you provide more details on how this is distributed/parallelised?

    Curt Monash provides an interesting report on the approach Revolution Analytics have taken to their parallel R implementation (http://www.dbms2.com/2013/11/19/how-revolution-analytics-parallelizes-r/).

    Is the HP Vertica approach to parallelising R similar to that taken by Revolution Analytics, or, does it take another tack?

  • Hello Neil,   

    Distributed R uses a master-slave approach to execute parallel algorithms. It is a full platform for implementing distributed data mining algorithms. Data is loaded into memory in parallel, and then functions are shipped and scheduled for data processing. This approach is general enough to encompass running the same sequential algorithm on different parts of the data or implementing a distributed version of a sequential algorithm. Of course, as a programmer you don't have to worry about function shipping or scheduling tasks; the runtime takes care of it. You pretty much write the usual R programs and use the language extensions that we provide.

    Unlike other R offerings, Distributed R provides "distributed data structures" (e.g., darray) and language extensions that allow anyone to write their own distributed algorithm. You don't have to wait for the software provider to implement and release new algorithms. 
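
    For a flavor of the programming model, here is a minimal sketch along the lines of the user-guide examples, built from the darray/foreach constructs described above; exact argument names may differ slightly in the version you download.

        # Start the master and workers, create a distributed array, and run a
        # function on every partition in parallel. The runtime ships the function
        # to the workers; update() publishes each partition back to the darray.
        library(distributedR)
        distributedR_start()

        A <- darray(dim=c(8,8), blocks=c(2,8))   # 8x8 array split into four 2x8 partitions

        foreach(i, 1:npartitions(A),
          init <- function(a = splits(A, i), idx = i) {
            a <- matrix(idx, nrow=nrow(a), ncol=ncol(a))   # fill the partition with its index
            update(a)
          })

        print(getpartition(A))   # gather the full array on the master
        distributedR_shutdown()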

    I encourage you to look at the Distributed R user guide for more examples and an architecture diagram.

    -Indrajit
  • Hello Neil,

    You can also get more insight into the Distributed R architecture from this blog post: http://www.vertica.com/2013/02/21/presto-distributed-r-for-big-data/. The user guide in the download gives much more detail on the inner workings of Distributed R.

    One of the key differentiators of the Distributed R platform is that it extends the R language with distributed constructs and offers a more open platform for writing parallel algorithms. The latest 0.3 version of Distributed R includes a parallel GLM implementation for logistic, linear, and Poisson regression. You can even explore the source code of this parallel algorithm to get more insight. We will soon have a GitHub repository to manage algorithm source code. If you are interested in GLM benchmark details, refer to the HPDGLM user guide.
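
    As a rough illustration, fitting a distributed logistic regression looks something like the sketch below. Treat the hpdglm() argument names as assumptions and check them against the HPDGLM user guide in the download; the data here is synthetic.

        library(distributedR)
        library(HPDGLM)
        distributedR_start()

        # Toy data: 1000 observations, 4 predictors, split into 4 partitions.
        X <- darray(dim=c(1000,4), blocks=c(250,4))
        Y <- darray(dim=c(1000,1), blocks=c(250,1))
        foreach(i, 1:npartitions(X),
          gen <- function(x = splits(X, i), y = splits(Y, i)) {
            x <- matrix(rnorm(nrow(x) * ncol(x)), nrow(x), ncol(x))   # random predictors
            y <- matrix(rbinom(nrow(y), 1, 0.5), nrow(y), 1)          # 0/1 responses
            update(x); update(y)
          })

        # Distributed analogue of glm(), as sketched from the HPDGLM guide.
        model <- hpdglm(responses=Y, predictors=X, family=binomial)
        distributedR_shutdown()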

    With vRODBC (Vertica-optimized RODBC) and the parallel data loader, you can achieve high data-transfer rates to enable scalable, advanced big data analytics.
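
    For example, a minimal vRODBC round trip could look like the following; the API mirrors RODBC, and the DSN, table, and column names here are just placeholders for your own setup.

        # Pull an aggregate result set from Vertica into an R data frame via vRODBC.
        library(vRODBC)
        conn  <- odbcConnect("VerticaDSN")   # ODBC data source configured for your cluster
        sales <- sqlQuery(conn, "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
        odbcClose(conn)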

    -Sunil
  • Thanks for the excellent & helpful responses, Sunil & Indrajit. I have had a quick read of that blog post too, which is good, and will read the papers it links to. Will take a look at the manual also, as suggested.
  • Quick follow-up questions:

    Given I have an existing Vertica cluster, and I want to use the Distributed R Platform, should I:
    A: Provision a new dedicated cluster to run the Distributed R Platform.
    B: Run the Distributed R Platform on the existing Vertica cluster (so the same machines run both).
    C: Something else?

    Thanks again for your support with this,

    Neil
  • Neil, advanced analytics are computationally intensive, and R/Distributed R use in-memory processing to gain high performance, so hardware requirements differ between Vertica nodes and Distributed R nodes. Also, to avoid resource contention, the currently preferred architecture allocates separate nodes for Distributed R within the same Vertica rack. We understand one size doesn't fit all, so we are exploring other integration options. I am curious to know: in your scenario, what would you prefer and why? Thanks, Sunil
  • Thanks for the quick and helpful response Sunil.

    We are investigating the use of Distributed R to enable us to run optimisation algorithms that leverage data we will hold in our Vertica cluster.

    Whilst we want to ensure that we consistently meet our performance requirements, we also want to balance this with our investment in hardware. Our thinking is that a single, larger cluster would allow us to better absorb spikes in demand more cost-effectively than two separate, smaller clusters.
