Which data is exported with "EXPORT TO PARQUET" ? How to get the real size ?

Isabelle_Vela · April 2020

Hello
when we use the EXPORT command to Parquet, which data is really downloaded ? The one of the super projection ? We try to figure out the size of an archive, and as it is difficult to get the file size by itself, we try to compute it from the ROS size with this kind of query (we work on partitions)
select p.table_schema,
pr.anchor_table_name,
p.partition_key,
ROUND(SUM(ros_size_bytes)/(1024^2)) as space_Mb
from partitions P,
projections PR
where p.table_schema = pr.projection_schema
and p.projection_name = pr.projection_name
group by p.table_schema, pr.anchor_table_name, p.partition_key;
But I am afraid we take too much data here, when a table has several projections ?
How can we estimate the exported size (on top of number of rows) ?
Thanks

swalkaus · April 2020

If you are doing a simple " EXPORT TO PARQUET(directory = '...') AS SELECT * FROM table;" then you can estimate based on the storage size for a super projection of that table. Keep in mind that YMMV. You may want to create a table with a subset of the data and export that to compare the actual size to your estimate. Even then, your estimate might not be accurate if the sample wasn't uniform but this shouldn't happen frequently. Something like this ought to work (assuming no pre-join projections):

SELECT t.table_schema, t.table_name, SUM(sc.used_bytes)//1024//1024 as size_mb
   FROM v_catalog.tables t
     JOIN v_catalog.projections p ON t.table_id = p.anchor_table_id
     JOIN (SELECT DISTINCT projection_id, storage_oid, used_bytes FROM v_monitor.storage_containers) sc ON p.projection_id = sc.projection_id
WHERE p.projection_id IN (
          SELECT max(projection_id)
             FROM v_catalog.projections
          WHERE is_up_to_date AND is_super_projection AND NOT is_prejoin
          GROUP BY anchor_table_id)
GROUP BY 1,2;

swalkaus · April 2020

EXPORT TO PARQUET exports the results of the SQL query provided. For example, the following would export all of the rows from the "sales" table.

 EXPORT TO PARQUET(directory = 'hdfs:///sales')  AS SELECT * FROM sales;

You can add a predicate to export a subset of the data, for example, only export sales from today.

 EXPORT TO PARQUET(directory = 'hdfs:///sales/daily')  AS SELECT * FROM sales WHERE sale_date = now()::DATE;

Of course, the SQL clause can be much more complex. You could join to other tables, perform aggregations, etc. Here's a slightly more complex example:

EXPORT TO PARQUET(directory = 'hdfs:///sales/region/summary')  AS
  SELECT c.region, s.sale_date::date, sum(s.price)
     FROM sales s JOIN customers c ON s.customer_id = c.id
  GROUP BY 1, 2;

Isabelle_Vela · April 2020

My question was about the size of data exported. From the DB information (ROS size) can we estimate the file size that will be created ?
May be this is not possible and we can only measure the files size once created....
Having the files created on several nodes add complexity .

DeepakMajeti · April 2020

You can now monitor the exported information
https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/HadoopIntegrationGuide/NativeFormats/ExportMonitor.htm

We're Moving!

Create My New Community Account Now

Which data is exported with "EXPORT TO PARQUET" ? How to get the real size ?

Best Answer

Answers

Leave a Comment