Performance on the low end
We are currently looking at Vertica to see if it would be a good candidate to migrate our DWH application to.
The main issue is that our customers have a very wide range of data volumes. Smaller sites have 10 million records or fewer in their biggest tables. We need to choose a database that can run well in both low-end configurations for small sites and high-end configurations for bigger sites (and can scale up/out well and eventually migrate to the cloud, which is why I am looking at Vertica).
Our current database is fine for these customers, but in bigger installations it slows down.
The thing is, if we are going to retool our application to support a different database, we want to choose one that isn't going to increase the cost of installation for smaller sites (as they far outnumber our larger sites). Currently our smaller sites run on a single server.
While Vertica seems to be speedy on bigger installations with more nodes (from what I have read; I am still in the process of getting a test environment ready), what I am wondering is whether Vertica can provide decent performance on a single server. If the answer to that is yes, what is a good starting point in terms of CPU, memory, etc. to get snappy performance on an install with a relatively small amount of data?
Comments
Vertica is more than MPP; it is columnar, and it has encoding and compression too, which help query performance a lot by reducing I/O. However, it is important that you define the right use case: What type of data are you loading? Are you doing deletes and updates? How many concurrent queries do you expect? What type of queries? How many objects, etc.? Does the system need to be available all the time?
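As a rough illustration of the columnar and encoding points above, here is a minimal sketch using the vertica-python client; the connection settings, table, and column names are invented for illustration:

```python
import vertica_python

# Hypothetical connection settings -- adjust for your environment.
conn_info = {"host": "127.0.0.1", "port": 5433, "user": "dbadmin",
             "password": "", "database": "testdb"}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # Explicit encoding on a low-cardinality column: RLE compresses long
    # runs of repeated values, which is part of the I/O reduction above.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            site_id  INT ENCODING RLE,
            event_ts TIMESTAMP,
            metric   FLOAT
        )
    """)
    # A columnar engine reads only the site_id column for this query,
    # no matter how wide the table is.
    cur.execute("SELECT site_id, COUNT(*) FROM events GROUP BY site_id")
    print(cur.fetchall())
```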
In my experience, sometimes 3 low-end nodes are better than 1 super powerful node, and not only because of MPP (Massively Parallel Processing) but also because of high availability.
About the minimal requirements: when you install Vertica, under the /opt/vertica/ folder you will find 3 tools: vioperf, vnetperf, and vcpuperf. At a minimum you have to meet the requirements shown in the output of those tools (see the documentation for details), but keep in mind that if you use a 4-core machine, you should not expect to run 10 queries concurrently. Does that make sense?
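For reference, those tools live under /opt/vertica/bin and can be scripted. A sketch follows; the vioperf target directory is an assumption, and the flags vary by Vertica version, so check each tool's --help first:

```python
import subprocess

# Hardware validation tools shipped with Vertica. vnetperf is omitted here
# because it targets multi-node clusters and needs a hosts file.
CHECKS = [
    ["/opt/vertica/bin/vcpuperf"],                  # CPU speed check
    ["/opt/vertica/bin/vioperf", "/home/dbadmin"],  # disk I/O check (assumed data directory)
]

for cmd in CHECKS:
    print("Running:", " ".join(cmd))
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout)
    print(result.stderr)
```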
For what you mention, it could be a multi-tenant database; maybe you can combine small customers in one cluster to reduce the cost of hardware. You can do it with a different schema per customer (a sketch follows below), but avoid having too many objects in just one database. Objects are the number of tables and so on, not the number of rows.
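Here is a minimal sketch of that schema-per-customer layout, again with vertica-python; the tenant and table names are invented:

```python
import vertica_python

conn_info = {"host": "127.0.0.1", "port": 5433, "user": "dbadmin",
             "password": "", "database": "testdb"}

TENANTS = ["hospital_a", "hospital_b"]  # hypothetical customer names

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    for tenant in TENANTS:
        # One schema per customer keeps tenants separated inside one cluster.
        cur.execute(f"CREATE SCHEMA IF NOT EXISTS {tenant}")
        cur.execute(f"""
            CREATE TABLE IF NOT EXISTS {tenant}.admissions (
                patient_key INT,
                admitted    DATE,
                ward        VARCHAR(40)
            )
        """)
    # Each application session then scopes itself to a single tenant:
    cur.execute("SET SEARCH_PATH TO hospital_a, public")
```

Note that every tenant's tables count toward the total number of objects warned about above, so this trades hardware cost against catalog size.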
There is no fixed formula; it will always depend on different factors. However, if your system does a lot of updates and deletes, like a transactional system does, Vertica may not be the right solution. Vertica is an analytics database.
Hope this helps.
Eugenia
This is just a heads up that we are migrating this Community to a new platform in the next couple of days. Please visit our new community at https://community.dev.hp.com/t5/Big-Data-and-Analytics/ct-p/bigdata_analytics on Thursday!
Thank you for your response.
Our application is a medical data warehouse and analysis tool. I am relatively new to the company (a little less than a year) and am helping them redesign both the front and back end into something a bit more modern. The DWH design is definitely more snowflake than star; however, this is something I can (and likely will) change in the new design. The base system currently has around 350 tables (many of which could be combined, but that would result in a lot of null data). Out of curiosity, how many objects is "too many", and do wide sparse tables (tables with a lot of columns and null data) generally perform better?
Updates are made nightly. Installations are currently all on site (and generally with no internet access for privacy/security reasons), and thus cannot easily share a cluster.
Originally the data was centered around surgery and mostly used by doctors for research; however, over time we have received requests to integrate other hospital systems. This is one of the factors driving our redesign, as the system generally wasn't designed for the amount of data contained in all these other systems.
Even in bigger hospitals concurrent usage is low, though it could increase as we integrate more systems. Still, the number of installed terminals in a hospital is limited and generally fewer than 10.
While generally the usage of our various customers is probably very similar, the amount of data varies quite a bit depending on the size of the hospital and the amount of historical data they have.
Going forward we want to offer our customers an option for cloud installation. That said, many hospitals in this country (we are outside the U.S.) are very reluctant to store their patient information in the cloud, and on-site installations will likely remain the main configuration for some time still.
Also, we are looking at providing a consolidated database in the cloud so hospitals can compare their data with other hospitals. We would "clean" any personal/private data and also give each hospital the option of masking the source of its data (so other hospitals can see the data but don't know which hospital it came from). This is far more acceptable here as long as sensitive data is removed or masked, and it is also something we receive requests for (as more data can lead to more accurate research).
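One common way to implement that kind of source masking (a sketch of a general technique, not anything Vertica-specific) is a keyed hash that is stable across loads but not reversible without the per-site key:

```python
import hashlib
import hmac

# Assumption: each site holds its own secret key, managed outside this script.
SECRET = b"per-site-secret-key"

def mask_hospital(hospital_id: str) -> str:
    """Return a stable, non-reversible token for a hospital identifier."""
    return hmac.new(SECRET, hospital_id.encode(), hashlib.sha256).hexdigest()[:16]

# Hypothetical record headed for the consolidated cloud database;
# direct patient identifiers would simply be dropped before upload.
row = {"hospital_id": "Example General", "procedure": "appendectomy", "year": 2016}
row["hospital_id"] = mask_hospital(row["hospital_id"])
print(row)
```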
We don't want to have to develop separate interfaces and databases for each scenario, so we really want something that performs well on the low end but can scale up to the national level (well, for our customers at the national level anyway).
So in the specific case above, would Vertica be able to perform well with one node on the low end? (Data less than 100 GB in most cases, fewer than 5 concurrent users.) I am assuming the high end is fine with appropriate hardware.
I am really looking for something that scales well from the low end so we can go with the same platform for all of our customers/scenarios; requiring more expensive hardware for something that currently runs fine on the low end is something that won't go over well with customers.
Hi,
Sorry for the delay in responding; with the change of community platform I did not notice when you replied.
Going back to your issue: it will always depend on the design that you have. I can't tell you exactly how big is too big or how many objects are too many, but when you install the Management Console you will see how much memory the catalog is using. If the catalog uses more than 5% of your system memory, besides changing the resource pool to reflect that, you need to investigate whether it is related to the number of objects or to fragmentation of the data that is stored on disk. There is some analysis to be done.
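If you would rather check this from SQL than from the Management Console, the catalog is tracked by the built-in metadata resource pool. A sketch follows; the system table and column names should be verified against your Vertica version:

```python
import vertica_python

conn_info = {"host": "127.0.0.1", "port": 5433, "user": "dbadmin",
             "password": "", "database": "testdb"}

# Compare the metadata pool's in-use memory against total system RAM to
# apply the 5% rule of thumb mentioned above.
QUERY = """
    SELECT node_name, memory_inuse_kb
    FROM v_monitor.resource_pool_status
    WHERE pool_name = 'metadata'
"""

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute(QUERY)
    for node_name, kb in cur.fetchall():
        print(f"{node_name}: catalog using ~{kb / 1024:.0f} MB")
```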
You can have wider tables in Vertica and it should not be a problem. It is a columnar database, so when you query it will read only the columns that you need. However, there are some loading implications: Vertica needs to reserve memory to accommodate all the columns when you load, so the wider the table, the fewer rows it can process together, making it slower (not slow compared to other solutions, just slower). But this will depend on the column types and lengths.
I think that you would benefit from some Professional Services (PS) help. You should contact them to get more advice related to your use case.
Eugenia