HP Vertica Innovations: try them out - give us your feedback!
Take a look at some of the cool new technologies under development and exploration at Vertica. Our new HP Vertica Innovations effort is a way for you to evaluate some of the fresh ideas we have in incubation and provide your feedback. Our goal is to promote into the product the ideas that provide the most value to you, our customers.
Today, we have three Innovations available for download:
- HP Vertica Place – Add the aspect of ‘where’ to your analytics through storage and analysis of geospatial data in real time. Load well-known ESRI-format shapefiles and use Open Geospatial Consortium (OGC) standards-based functionality to represent and analyze big data related to the locations of people, properties, and regions.
- HP Vertica Pulse – Answer questions about what customers are saying and feeling about a brand with this in-database sentiment analysis tool. Use HP Vertica Pulse to “score” tweets related to products and services to help you gauge the most popular topics of interest, analyze how sentiment changes over time, identify advocates and detractors, and view high-level aggregate results as well as low-level, comment-level results.
- HP Vertica Distributed R - Accelerate the analysis of large data sets by running R computations on multiple nodes. Now, data scientists can overcome the scalability and performance limitations of R to tackle problems not previously solvable.
Look for more in-depth posts on our blog and in this community over the coming weeks. To download, visit the HP Vertica Community Marketplace at http://vertica.com/marketplace and click the Innovations tab. Please read through the Terms and Conditions, as they contain important information about the Innovations effort. If you have comments, questions, or problems with any of these efforts, please post to the Innovations category in our community.
We want to hear from you! Try the Innovations and let us know what you think! Would this functionality have an impact on your business? What aspect did we overlook? Posting to this community with your feedback is the surest way to help us bring the most important efforts to market.
Comments
Curt Monash provides an interesting report on the approach Revolution Analytics have taken to their parallel R implementation (http://www.dbms2.com/2013/11/19/how-revolution-analytics-parallelizes-r/).
Is the HP Vertica approach to parallelising R similar to that taken by Revolution Analytics, or, does it take another tack?
Distributed R uses a master-slave approach to execute parallel algorithms. It is a full platform for implementing distributed data mining algorithms. Data is loaded in parallel into memory, and then functions are shipped and scheduled for data processing. This approach is general enough to encompass running the same sequential algorithm on different parts of the data or implementing a distributed version of a sequential algorithm. Of course, as a programmer you don't have to worry about function shipping or scheduling tasks; the runtime takes care of it. You pretty much write the usual R programs and use the language extensions that we provide.
Unlike other R offerings, Distributed R provides "distributed data structures" (e.g., darray) and language extensions that allow anyone to write their own distributed algorithm. You don't have to wait for the software provider to implement and release new algorithms.
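As a rough illustration, here is a minimal sketch of that programming model, modeled on the darray/foreach constructs described in the user guide; the exact function names and signatures (darray, npartitions, splits, update, getpartition) should be confirmed against the release you download.

```r
library(distributedR)

distributedR_start()              # start the master and worker processes

# Declare a 4x4 distributed array, split into 2x4 partitions across workers.
A <- darray(dim = c(4, 4), blocks = c(2, 4), sparse = FALSE)

# Ship a function to each partition; the runtime schedules the tasks.
foreach(i, 1:npartitions(A), function(a = splits(A, i), idx = i) {
  a <- matrix(idx, nrow(a), ncol(a))  # fill the local partition with its index
  update(a)                           # push the modified partition back
})

getpartition(A)                   # gather the whole array on the master

distributedR_shutdown()
```

The same pattern, declare a distributed data structure and then foreach over its partitions, extends naturally either to running one sequential algorithm per data split or to writing a genuinely distributed algorithm.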
I encourage you to look at the Distributed R user guide for more examples and an architecture diagram.
-Indrajit
You can also get more insight into the Distributed R architecture from this blog post: http://www.vertica.com/2013/02/21/presto-distributed-r-for-big-data/. The user guide included in the download gives much more detail on the inner workings of Distributed R.
One of the key differentiators of the Distributed R platform is that it extends the R language with distributed constructs, providing a more open platform for writing parallel algorithms. The latest 0.3 version of Distributed R includes a parallel GLM implementation for logistic, linear, and Poisson regression. You can even explore the source code of this parallel algorithm to gain more insight. We will soon have a GitHub repository to manage algorithm source code. If you are interested in GLM benchmark details, refer to the HPDGLM user guide.
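For a feel of what this looks like, below is a hypothetical sketch of fitting a parallel logistic regression with HPDGLM; the hpdglm function name, its arguments, and the darray shapes are assumptions made for illustration, so please verify them against the HPDGLM user guide.

```r
library(distributedR)
library(HPDGLM)

distributedR_start()

# Assume X (predictors) and Y (responses) are distributed arrays that have
# already been populated from your data source (e.g., via foreach/update).
X <- darray(dim = c(1e6, 10), blocks = c(1e5, 10))
Y <- darray(dim = c(1e6, 1),  blocks = c(1e5, 1))

# Hypothetical call: fit a logistic regression in parallel across the cluster.
model <- hpdglm(responses = Y, predictors = X, family = binomial(link = "logit"))

summary(model)

distributedR_shutdown()
```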
With vRODBC (a Vertica-optimized RODBC) and the parallel data loader, you can achieve high data-transfer rates to enable scalable, advanced big data analytics.
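Assuming vRODBC keeps the familiar RODBC interface (odbcConnect / sqlQuery), pulling a result set out of Vertica into an R session might look like this minimal sketch; the DSN, credentials, and table names are placeholders.

```r
library(vRODBC)   # Vertica-optimized fork of RODBC

# Connect through an ODBC DSN configured for your Vertica cluster.
conn <- odbcConnect("VerticaDSN", uid = "dbadmin", pwd = "password")

# Pull a (small) result set into a regular R data frame.
sales <- sqlQuery(conn, "SELECT region, SUM(amount) AS total
                         FROM sales_fact GROUP BY region")

head(sales)

odbcClose(conn)
```

For larger tables, the parallel data loader mentioned above is the route for moving data into distributed data structures rather than into a single in-memory data frame.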
-Sunil
Given that I have an existing Vertica cluster and I want to use the Distributed R Platform, should I:
A: Provision a new dedicated cluster to run the Distributed R Platform.
B: Run the Distributed R Platform on the existing Vertica cluster (so the same machines run both).
C: Something else?
Thanks again for your support with this,
Neil
We are investigating the use of Distributed R to enable us to run Optimisation algorithms that leverage data we will hold in our Vertica cluster.
Whilst we want to ensure that we consistently meet our performance requirements, we also want to balance this with our investment in hardware. Our thinking is that a single, larger cluster would allow us to better absorb spikes in demand more cost-effectively than two separate, smaller clusters.