Which data is exported with "EXPORT TO PARQUET" ? How to get the real size ?
Hello
when we use the EXPORT command to Parquet, which data is really downloaded ? The one of the super projection ? We try to figure out the size of an archive, and as it is difficult to get the file size by itself, we try to compute it from the ROS size with this kind of query (we work on partitions)
select p.table_schema,
pr.anchor_table_name,
p.partition_key,
ROUND(SUM(ros_size_bytes)/(1024^2)) as space_Mb
from partitions P,
projections PR
where p.table_schema = pr.projection_schema
and p.projection_name = pr.projection_name
group by p.table_schema, pr.anchor_table_name, p.partition_key;
But I am afraid we take too much data here, when a table has several projections ?
How can we estimate the exported size (on top of number of rows) ?
Thanks
Best Answer
-
swalkaus Vertica Employee Employee
If you are doing a simple " EXPORT TO PARQUET(directory = '...') AS SELECT * FROM table;" then you can estimate based on the storage size for a super projection of that table. Keep in mind that YMMV. You may want to create a table with a subset of the data and export that to compare the actual size to your estimate. Even then, your estimate might not be accurate if the sample wasn't uniform but this shouldn't happen frequently. Something like this ought to work (assuming no pre-join projections):
SELECT t.table_schema, t.table_name, SUM(sc.used_bytes)//1024//1024 as size_mb FROM v_catalog.tables t JOIN v_catalog.projections p ON t.table_id = p.anchor_table_id JOIN (SELECT DISTINCT projection_id, storage_oid, used_bytes FROM v_monitor.storage_containers) sc ON p.projection_id = sc.projection_id WHERE p.projection_id IN ( SELECT max(projection_id) FROM v_catalog.projections WHERE is_up_to_date AND is_super_projection AND NOT is_prejoin GROUP BY anchor_table_id) GROUP BY 1,2;
5
Answers
EXPORT TO PARQUET exports the results of the SQL query provided. For example, the following would export all of the rows from the "sales" table.
You can add a predicate to export a subset of the data, for example, only export sales from today.
Of course, the SQL clause can be much more complex. You could join to other tables, perform aggregations, etc. Here's a slightly more complex example:
My question was about the size of data exported. From the DB information (ROS size) can we estimate the file size that will be created ?
May be this is not possible and we can only measure the files size once created....
Having the files created on several nodes add complexity .
You can now monitor the exported information
https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/HadoopIntegrationGuide/NativeFormats/ExportMonitor.htm