Questions about analyze_statistics

We have a database where data is inserted continuously and older data is deleted regularly. I think running ANALYZE_STATISTICS may help with query performance. A couple of questions on ANALYZE_STATISTICS:

1. Does Vertica update statistics automatically based on the state of the data in the database? If so, how does it do it?
2. If we schedule statistics updates to run regularly, what would be the best practice: how to determine when to run it, how to schedule and run it, etc.? Are there any sample scripts that can be used to determine when statistics need to be updated?

Our system is up at all times, and there could be queries running against the database at any given time; ideally, updating the statistics should not affect the system much.

Comments

  • Hi Mason, Have you read the "Collecting Database Statistics" chapter of our Administrator's Guide? https://my.vertica.com/docs/6.1.x/PDF/HP_Vertica_6.1.x_AdminGuide.pdf It provides a much-more-detailed answer to your questions than would fit in a forum post. Could you look it over, and respond or post again if you have any further questions? Adam
  • Hi, Adam, Thank you so much for your reply! I looked at the documentation before posting the question, but it was not clear to me in which cases Vertica updates statistics automatically. For our system, as the newest data is inserted and the oldest data is deleted, the data in the middle does not change. Would ANALYZE_HISTOGRAM() be the most efficient way to update statistics? If so, what would be the best percentage? Also, our customers do not monitor or administer the databases; they only use our product, which runs on Vertica, so we cannot look at the database and determine when statistics are stale. We have to somehow figure out a way to update statistics automatically on the customers' systems. Any suggestions or best practices?
  • Hi Mason, Ah, I see what you're curious about. So, "statistics" in Vertica (like in many databases, to varying degrees) are not a black box that you push a button and stuff happens and then they're up-to-date. There are several kinds of statistics that we keep -- the min and max values in each container, for example, or a histogram describing the distribution of values in a column. Most types of Vertica statistics are (in recent Vertica versions) updated automatically. You can control that process if you need to, but we don't recommend it. The histogram data is the exception -- updating it is expensive so we don't update it automatically. The only way to do so is by calling ANALYZE_HISTOGRAM(). In Vertica, arguably the most visible way that histograms are used is in deciding what order to join tables in. So you should run ANALYZE_HISTOGRAM() if/when you see joins start to run way too slowly. You can do a less-expensive ANALYZE_HISTOGRAM() call by tuning its arguments. But the resulting histogram will be less precise. In particular, it will less-accurately represent highly skewed data. That's something where you'll have to experiment with your data and see what parameters work for you. In terms of how often to run automated maintenance, I'll let others here speak to that. I don't have a specific recommendation; it depends on your users' workload. Adam
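A minimal sketch of the call Adam describes (the table name `public.fact_events` is a placeholder, and the optional second argument is the percentage of rows to sample -- check the function reference for your Vertica version before relying on the exact signature):

```sql
-- Full-accuracy histogram for one table (reads all rows; expensive)
SELECT ANALYZE_HISTOGRAM('public.fact_events', 100);

-- Cheaper but less precise: sample only 10% of the rows
SELECT ANALYZE_HISTOGRAM('public.fact_events', 10);
```

Lower sampling percentages reduce the cost of the collection at the price of a histogram that represents skewed data less accurately, which is the trade-off Adam mentions.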
  • It really helps, thank you for the information! One last question: is a less-accurate histogram from running ANALYZE_HISTOGRAM() better than not running it at all?
  • Adam, any option to run histograms on filter columns only? This can probably improve the performance of the collection.
  • Hey all, Mason: Well, it's usually better than not running it :-) The problem with statistics is, well, they're statistical. It's all about probabilities. For example, if you have 1,000 complicated queries, odds are a few of them just happened to get the optimal join order without any statistics by sheer random luck, and will actually get slower with statistics -- statistics summarize your data; if you have quirky data, sometimes the summary is misleading. But maybe 990 of your queries will get faster, some of them much faster. By tuning the accuracy, you're tuning these odds. Eli: Yes, you can run analyze_statistics() on specific columns only; I believe you can see the syntax if you look the function up in the documentation? (If not, let us know.) Adam
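The column-scoped call Adam mentions takes a `schema.table.column` string; as a sketch (all names here are placeholders):

```sql
-- Collect statistics only for the columns used in filters and joins
SELECT ANALYZE_STATISTICS('public.fact_events.customer_id');
SELECT ANALYZE_STATISTICS('public.fact_events.order_date');
```

Restricting collection to the handful of predicate/join columns is a common way to keep the maintenance window short on wide tables.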
  • Hi Adam, Is there any parameter that can be set with analyze_statistics so that it runs automatically, say, on a weekly basis? If not, how can we achieve it? Thanks in advance
  • Hi Abhishek, I don't believe that there's a way to do this built into Vertica. (Though I could be wrong; someone may tell me that we've added one :-) ) Otherwise, one straightforward approach would be to use the Linux 'cron' utility; have the dbadmin user's crontab run "/opt/vertica/bin/vsql -c 'select analyze_statistics();'" weekly and/or on whatever schedule you'd like. Adam
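The cron approach Adam suggests might look like this in the dbadmin user's crontab (the schedule is a placeholder; this assumes vsql can authenticate non-interactively, e.g. via ~/.vsqlrc or trusted local auth):

```
# Collect statistics for the whole database every Sunday at 02:00
0 2 * * 0 /opt/vertica/bin/vsql -c "SELECT ANALYZE_STATISTICS('');"
```

Passing an empty string analyzes all tables; substituting a `schema.table` name scopes the run, as discussed above.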
  • Thanks Adam. I have made a script and scheduled it through Tidal to run on a daily basis. I have a question regarding health reports: does Vertica have any health reports that it can send to us daily? In Oracle we have the AWR report, or it can also send the DB size, etc. Abhi
  • Hi Abhi, Vertica doesn't work quite the same way as Oracle here: If you want to view the health of the cluster at any time, the Vertica Management Console will give you a nice graphical overview of each machine. Vertica also integrates with SNMP to provide instant notifications through existing tools. Vertica doesn't currently provide a built-in way to send out daily reports like this. However, it does (extensively) track the information that you would need to generate these reports. See the documentation for the Data Collector, or the tables listed if you run "select * from data_collector;". We don't provide a built-in way to send notifications directly from Vertica, but there are open-source add-on packages that add such functionality: <> (Currently there are packages for e-mail messages and SMS text messages. If you are or have a developer, these packages can easily be extended to send reports via any proprietary/internal reporting service you might have.) So you can write a single SQL query that sends whatever sort of report you would like. Then you just have to run that query daily (or with whatever frequency you would like). I would typically recommend that you do so using "cron" since it's readily available (often preinstalled) in most environments, but if you have Tidal, I expect that would work great too. Adam
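A sketch of exploring the Data Collector tables Adam points to (the `dc_requests_issued` table in the comment is just one example component; the first query lists what is available on your version):

```sql
-- List the available Data Collector components and their tables
SELECT * FROM data_collector;

-- Then query one of the dc_* tables it lists, for example:
-- SELECT * FROM dc_requests_issued ORDER BY time DESC LIMIT 10;
```

A daily report is then just a matter of wrapping whatever query you settle on in the same cron/Tidal scheduling used for statistics collection.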
  • Thanks a ton Adam !!!!!
