Vertica external table (S3-based) performance issues

Hi,

We are analyzing the performance of Vertica external tables. We use the Parquet file format with Snappy compression and have around 119,353 S3 partition objects totaling 23.2+ GB of data. We are facing serious performance issues even when we pass all the partition columns in the WHERE clause.
Our sample query and external table script are below.

A simple SELECT of ~20 columns, with all the partition columns in the WHERE clause, takes 30+ seconds:

Select record_code, station_code, dim_date_time_id, hh_hut_put_impression, hh_universe_estimate, hh_impression, c25_hut_put_impression, c25_universe_estimate, c25_impression, c611_hut_put_impression, c611_universe_estimate, c611_impression, m1214_hut_put_impression
from common.fact_rating_quarter_hour_20181_qa
where station_code=1001 and reporting_year=2018 and reporting_month=12 and market_code=202 and reporting_service=1 and sample_type_int=1 and sub_sample_indicator=2 and rating_stream_int=7;

create external table common.fact_rating_quarter_hour_20181_qa
(
record_code varchar(500),
playback_type varchar(500),
metroahousehold int ,
metrobhousehold int ,
hh_share float,
hh_hut_put float,
hh_hut_put_impression float,
hh_rating float,
hh_universe_estimate int ,
hh_impression float,
c25_share float,
c25_hut_put float,
c25_hut_put_impression float,
c25_rating float,
c25_universe_estimate int ,
c25_impression float,
c611_share float,
c611_hut_put float,
c611_hut_put_impression float,
c611_rating float,
c611_universe_estimate int ,
c611_impression float,
m1214_share float,
m1214_hut_put float,
m1214_hut_put_impression float,
m1214_rating float,
m1214_universe_estimate int ,
m1214_impression float,
m1517_share float,
m1517_hut_put float,
m1517_hut_put_impression float,
m1517_rating float,
m1517_universe_estimate int ,
m1517_impression float,
m1820_share float,
m1820_hut_put float,
m1820_hut_put_impression float,
m1820_rating float,
m1820_universe_estimate int ,
m1820_impression float,
m2124_share float,
m2124_hut_put float,
m2124_hut_put_impression float,
m2124_rating float,
m2124_universe_estimate int ,
m2124_impression float,
m2534_share float,
m2534_hut_put float,
m2534_hut_put_impression float,
m2534_rating float,
m2534_universe_estimate int ,
m2534_impression float,
m3549_share float,
m3549_hut_put float,
m3549_hut_put_impression float,
m3549_rating float,
m3549_universe_estimate int ,
m3549_impression float,
m5054_share float,
m5054_hut_put float,
m5054_hut_put_impression float,
m5054_rating float,
m5054_universe_estimate int ,
m5054_impression float,
m5564_share float,
m5564_hut_put float,
m5564_hut_put_impression float,
m5564_rating float,
m5564_universe_estimate int ,
m5564_impression float,
m65plus_share float,
m65plus_hut_put float,
m65plus_hut_put_impression float,
m65plus_rating float,
m65plus_universe_estimate int ,
m65plus_impression float,
f1214_share float,
f1214_hut_put float,
f1214_hut_put_impression float,
f1214_rating float,
f1214_universe_estimate int ,
f1214_impression float,
f1517_share float,
f1517_hut_put float,
f1517_hut_put_impression float,
f1517_rating float,
f1517_universe_estimate int ,
f1517_impression float,
f1820_share float,
f1820_hut_put float,
f1820_hut_put_impression float,
f1820_rating float,
f1820_universe_estimate int ,
f1820_impression float,
f2124_share float,
f2124_hut_put float,
f2124_hut_put_impression float,
f2124_rating float,
f2124_universe_estimate int ,
f2124_impression float,
f2534_share float,
f2534_hut_put float,
f2534_hut_put_impression float,
f2534_rating float,
f2534_universe_estimate int ,
f2534_impression float,
f3549_share float,
f3549_hut_put float,
f3549_hut_put_impression float,
f3549_rating float,
f3549_universe_estimate int ,
f3549_impression float,
f5054_share float,
f5054_hut_put float,
f5054_hut_put_impression float,
f5054_rating float,
f5054_universe_estimate int ,
f5054_impression float,
f5564_share float,
f5564_hut_put float,
f5564_hut_put_impression float,
f5564_rating float,
f5564_universe_estimate int ,
f5564_impression float,
f65plus_share float,
f65plus_hut_put float,
f65plus_hut_put_impression float,
f65plus_rating float,
f65plus_universe_estimate int ,
f65plus_impression float,
ww_share float,
ww_hut_put float,
ww_hut_put_impression float,
ww_rating float,
ww_universe_estimate int ,
ww_impression float,
daily_average_group varchar(500),
daily_group varchar(500),
daily_normal_group varchar(500),
daily_telecast_group varchar(500),
dow int,
excluded varchar(500),
monthly_average_group varchar(500),
monthly_group varchar(500),
monthly_normal_group varchar(500),
monthly_telecast_group varchar(500),
nielsenDow int,
program_code int ,
program_group varchar(500),
program_group_airing_time varchar(500),
program_name varchar(500),
programday_group varchar(500),
swing_break_type varchar(500),
week_number int,
weekly_average_group varchar(500),
weekly_group varchar(500),
weekly_normal_group varchar(500),
weekly_telecast_group varchar(500),
source_creation_epoch int ,
etl_timestamp TIMESTAMPTZ,
sample_type varchar(500),
dim_date_time_id int ,
job_run_id varchar(500),
rating_stream varchar(500),
half_hr_grain varchar(500),
hr_grain varchar(500),
dow_str varchar(500),
day_type varchar(500),
hour_minute int,
station_code int,
reporting_year integer,
reporting_month integer,
market_code integer,
reporting_service integer,
sample_type_int int,
sub_sample_indicator int,
rating_stream_int integer

) AS COPY FROM 's3://bucketname/1/nielsen/fact_rating_quarter_hour/*/*/*/*/*/*/*/*/*'
parquet(hive_partition_cols='station_code,reporting_year,reporting_month,market_code,reporting_service,sample_type_int,sub_sample_indicator,rating_stream_int');

Comments

  • Bryan_H Vertica Employee Administrator

    With that many distinct files across partitions, you could probably speed up metadata processing by collecting statistics on the external table by running
    SELECT ANALYZE_EXTERNAL_ROW_COUNT('common.fact_rating_quarter_hour_20181_qa');
    This is a background task so you might not see an immediate improvement.

  • You mean to say that after executing the analyze command it will not take effect immediately, even if the command executes successfully? One more thing: is there any way to find the query execution time breakdown, like how much time it took for metadata processing / partition pruning, or how many rows it brought back to the Vertica DB? Is there any system table that can give us a hint about the query execution?

  • Bryan_H Vertica Employee Administrator

    Yes, according to the documentation, ANALYZE_EXTERNAL_ROW_COUNT runs in the background. For timing, EXPLAIN and PROFILE should give more detail on execution plan cost (EXPLAIN) and actual run time and resource usage (PROFILE). EXPLAIN will also show whether statistics are ready. For example, before running ANALYZE the plan showed (NO STATISTICS), but after it runs and completes it does not:

    dbadmin=> select analyze_external_row_count('public.d1090f_parquet');
     analyze_external_row_count
    ----------------------------
                              0
    (1 row)

    dbadmin=> explain select * from d1090f_parquet;

    QUERY PLAN
    ------------------------------
    QUERY PLAN DESCRIPTION:

    explain select * from d1090f_parquet;

    Access Path:
    +-LOAD EXTERNAL TABLE [Cost: 0, Rows: 177] (PATH ID: 1)
    | Table: d1090f_parquet
    | COPY FROM '/data1/dbadmin/parquet/d1090f/*.parquet' PARQUET

    And PROFILE shows exact execution data:

    dbadmin=> profile select * from d1090f_parquet;
    WARNING 8762: SQL TIMESTAMPTZ is more appropriate for PARQUET TIMESTAMP because values are stored in UTC
    NOTICE 4788: Statement is being profiled
    HINT: Select * from v_monitor.execution_engine_profiles where transaction_id=45035996290527406 and statement_id=2;
    NOTICE 3557: Initiator memory for query: [on pool general: 79889 KB, minimum: 79889 KB]
    NOTICE 5077: Total memory required by query: [79889 KB]
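
    To get a per-operator time breakdown, you can then query the profile tables using the transaction_id and statement_id from the hint above (a sketch; counter names can vary by version, and overall wall time is also available in v_monitor.query_requests as request_duration_ms):

    select operator_name, path_id, counter_name, counter_value
    from v_monitor.execution_engine_profiles
    where transaction_id = 45035996290527406  -- values from the PROFILE hint
      and statement_id = 2
      and counter_name in ('execution time (us)', 'rows produced')
    order by counter_value desc;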

  • Ok, thanks for your quick response. I'll execute the analyze_external_row_count stats command and see if it helps in reducing the overall execution time. I'll post my results here. Thanks again :-)

  • mt_25
    edited July 2019

    Hi,
    I have run the SELECT ANALYZE_EXTERNAL_ROW_COUNT('common.fact_rating_quarter_hour_20181_qa'); command, but unfortunately there is no performance gain.
    Is there anything else I can do to get better performance?

    Below is my explain plan:

    QUERY PLAN DESCRIPTION:

    explain Select record_code,station_code, dim_date_time_id,hh_hut_put_impression,hh_universe_estimate,hh_impression,c25_hut_put_impression,c25_universe_estimate ,c25_impression,c611_hut_put_impression,c611_universe_estimate,c611_impression,m1214_hut_put_impression from common.fact_rating_quarter_hour_20181_qa where station_code=1001 and reporting_year=2018 and reporting_month=12 and market_code=202 and reporting_service=1 and sample_type_int=1 and sub_sample_indicator=2 and rating_stream_int=7

    Access Path:
    +-LOAD EXTERNAL TABLE [Cost: 0, Rows: 83M] (PATH ID: 1)
    | Table: fact_rating_quarter_hour_20181_qa
    | COPY FROM 's3://bucketname/1/nielsen/fact_rating_quarter_hour/*/*/*/*/*/*/*/*/*' parquet(hive_partition_cols='station_code,reporting_year,reporting_month,market_code,reporting_service,sample_type_int,sub_sample_indicator,rating_stream_int')
    | Filter: (fact_rating_quarter_hour_20181_qa.station_code = 1001)
    | Filter: (fact_rating_quarter_hour_20181_qa.reporting_year = 2018)
    | Filter: (fact_rating_quarter_hour_20181_qa.reporting_month = 12)
    | Filter: (fact_rating_quarter_hour_20181_qa.market_code = 202)
    | Filter: (fact_rating_quarter_hour_20181_qa.reporting_service = 1)
    | Filter: (fact_rating_quarter_hour_20181_qa.sample_type_int = 1)
    | Filter: (fact_rating_quarter_hour_20181_qa.sub_sample_indicator = 2)
    | Filter: (fact_rating_quarter_hour_20181_qa.rating_stream_int = 7)
    | Execute on: Query Initiator

    ------------------------------

    PLAN: BASE QUERY PLAN (GraphViz Format)

    digraph G {
    graph [rankdir=BT, label = "BASE QUERY PLAN\nQuery: explain Select record_code,station_code, dim_date_time_id,hh_hut_put_impression,hh_universe_estimate,hh_impression,c25_hut_put_impression,c25_universe_estimate ,c25_impression,c611_hut_put_impression,c611_universe_estimate,c611_impression,m1214_hut_put_impression from common.fact_rating_quarter_hour_20181_qa where station_code=1001 and reporting_year=2018 and reporting_month=12 and market_code=202 and reporting_service=1 and sample_type_int=1 and sub_sample_indicator=2 and rating_stream_int=7\n\nAll Nodes Vector: \n\n node[0]=v_operative_node0004 (initiator) Up\n node[1]=v_operative_node0005 (executor) Up\n node[2]=v_operative_node0006 (executor) Up\n", labelloc=t, labeljust=l ordering=out]
    0[label = "Root \nOutBlk=[UncTuple(13)]", color = "green", shape = "house"];
    1[label = "NewEENode \nOutBlk=[UncTuple(13)]", color = "green", shape = "box"];
    2[label = "LoadUnion\nUnc: Varchar(500)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Float(8)\nUnc: Integer(8)\nUnc: Float(8)\nUnc: Float(8)\nUnc: Integer(8)\nUnc: Float(8)\nUnc: Float(8)\nUnc: Integer(8)\nUnc: Float(8)\nUnc: Float(8)", color = "purple", shape = "box"];
    3[label = "ExprEval: \n fact_rating_quarter_hour_20181_qa.record_code\n fact_rating_quarter_hour_20181_qa.station_code\n fact_rating_quarter_hour_20181_qa.dim_date_time_id\n fact_rating_quarter_hour_20181_qa.hh_hut_put_impression\n fact_rating_quarter_hour_20181_qa.hh_universe_estimate\n fact_rating_quarter_hour_20181_qa.hh_impression\n fact_rating_quarter_hour_20181_qa.c25_hut_put_impression\n fact_rating_quarter_hour_20181_qa.c25_universe_estimate\n fact_rating_quarter_hour_20181_qa.c25_impression\n fact_rating_quarter_hour_20181_qa.c611_hut_put_impression\n fact_rating_quarter_hour_20181_qa.c611_universe_estimate\n fact_rating_quarter_hour_20181_qa.c611_impression\n fact_rating_quarter_hour_20181_qa.m1214_hut_put_impression\nUnc: Varchar(500)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Float(8)\nUnc: Integer(8)\nUnc: Float(8)\nUnc: Float(8)\nUnc: Integer(8)\nUnc: Float(8)\nUnc: Float(8)\nUnc: Integer(8)\nUnc: Float(8)\nUnc: Float(8)", color = "brown", shape = "box"];
    4[label = "FilterStep: \n(fact_rating_quarter_hour_20181_qa.station_code = 1001)\n(fact_rating_quarter_hour_20181_qa.reporting_year = 2018)\n(fact_rating_quarter_hour_20181_qa.reporting_month = 12)\n(fact_rating_quarter_hour_20181_qa.market_code = 202)\n(fact_rating_quarter_hour_20181_qa.reporting_service = 1)\n(fact_rating_quarter_hour_20181_qa.sample_type_int = 1)\n(fact_rating_quarter_hour_20181_qa.sub_sample_indicator = 2)\n(fact_rating_quarter_hour_20181_qa.rating_stream_int = 7)\nUnc: Varchar(500)\nUnc: Float(8)\nUnc: Integer(8)\nUnc: Float(8)\nUnc: Float(8)\nUnc: Integer(8)\nUnc: Float(8)\nUnc: Float(8)\nUnc: Integer(8)\nUnc: Float(8)\nUnc: Float(8)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Integer(8)", color = "brown", shape = "box"];
    5[label = "ExprEval: \n record_code(VAL(0))\n hh_hut_put_impression(VAL(1))\n hh_universe_estimate(VAL(2))\n hh_impression(VAL(3))\n c25_hut_put_impression(VAL(4))\n c25_universe_estimate(VAL(5))\n c25_impression(VAL(6))\n c611_hut_put_impression(VAL(7))\n c611_universe_estimate(VAL(8))\n c611_impression(VAL(9))\n m1214_hut_put_impression(VAL(10))\n dim_date_time_id(VAL(11))\n station_code(VAL(12))\n reporting_year(VAL(13))\n reporting_month(VAL(14))\n market_code(VAL(15))\n reporting_service(VAL(16))\n sample_type_int(VAL(17))\n sub_sample_indicator(VAL(18))\n rating_stream_int(VAL(19))\nUnc: Varchar(500)\nUnc: Float(8)\nUnc: Integer(8)\nUnc: Float(8)\nUnc: Float(8)\nUnc: Integer(8)\nUnc: Float(8)\nUnc: Float(8)\nUnc: Integer(8)\nUnc: Float(8)\nUnc: Float(8)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Integer(8)", color = "brown", shape = "box"];
    6[label = "LoadStep: fact_rating_quarter_hour_20181_qa\n (359 paths with unmatched Hive partition have been pruned from PARQUET source list; 0% of PARQUET data matched with co-located Vertica nodes; 0% of PARQUET data including co-located data can be loaded with rack locality)\n\nUnc: Varchar(500)\nUnc: Float(8)\nUnc: Integer(8)\nUnc: Float(8)\nUnc: Float(8)\nUnc: Integer(8)\nUnc: Float(8)\nUnc: Float(8)\nUnc: Integer(8)\nUnc: Float(8)\nUnc: Float(8)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Integer(8)\nUnc: Integer(8)", color = "brown", shape = "box"];
    1->0 [label = "V[0] C=13",color = "black",style="bold", arrowtail="inv"];
    2->1 [label = "0",color = "blue"];
    3->2 [label = "0",color = "blue"];
    4->3 [label = "0",color = "blue"];
    5->4 [label = "0",color = "blue"];
    6->5 [label = "0",color = "blue"];
    }

  • Bryan_H Vertica Employee Administrator

    Yes, try increasing the number of concurrent connections, for example:
    ALTER DATABASE DEFAULT SET AWSConnectionPoolSize = 4096;
    I've found that the default is 1024, which is likely too low in your case where you're trying to access >100K objects. However, you'll still be limited by any network limits on your instance type as well as any cap or throttling at the S3 level.
    S3 is considered a "User Defined File System", so you can find more details on bandwidth, connection issues, throttling, etc. in the logging tables dc_udfs_events and dc_udfs_statistics.
    I'll check with the AWS project manager to learn if there are any additional tuning options.
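
    To verify the setting took effect and to check those logging tables for throttling around a slow run, something like this should work (a sketch; the exact DC table columns vary by version):

    -- confirm the new pool size is active on all nodes
    select node_name, parameter_name, current_value, default_value
    from v_monitor.configuration_parameters
    where parameter_name = 'AWSConnectionPoolSize';

    -- recent UDFS (S3) events, e.g. retries or connection errors
    select * from dc_udfs_events order by "time" desc limit 50;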

  • Hi Bryan,

    It is still the same after increasing AWSConnectionPoolSize to 4096.
    In your prior experience with S3-based external tables and this many partition objects, what execution time could we ideally achieve?
    Would adding more nodes help? Or changing the instance type?

    Current Vertica instance type: r3.4xlarge
    Number of nodes: 3

  • Bryan_H Vertica Employee Administrator

    Changing node types might help: r3 instances are not guaranteed bandwidth (merely "medium" or "high" network performance), while e.g. r4.8xlarge is guaranteed 10 Gbps. Increasing the number of nodes also increases parallelism, though I see from your explain plan that the query runs on the initiator node only.
    Maybe the more pressing question is: since your query filters on every available partition key, how many files, and of what size, are in the folder that is selected? Is it just one file, or several hundred? Are they small files or large files? Whether it's one large file or a hundred small files likely makes a difference: adding more nodes or threads won't help if there's only one file to read, but increasing network bandwidth might. Also, it might be worth checking whether you are reading from a centralized S3 endpoint or have mapped an S3 gateway to the VPC where your cluster resides; the latter has made a difference for some users running thousands of concurrent S3 API calls.
    What version of Vertica is running in your cluster? Cloud storage has been a major focus of recent development, and newer versions have significant performance improvements. If you choose to try an r4 or r5 instance type for a CPU, I/O, and network boost, I recommend upgrading Vertica during that downtime as well.
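
    You can confirm the exact build with:

    select version();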

    how many files and what size is in the folder that is selected? Is it just one file, or several hundred?

    Only one file is hit by the query I gave you above; every partition has one file of no more than 1 MB.

    Are they small files or large files?

    Small files of less than 1 MB each.

    whether you are reading from a centralized S3 endpoint, or you have mapped an S3 gateway to the VPC where your cluster resides?

    We are reading from a centralized S3 endpoint, and we are certainly not running many concurrent S3 API calls; S3 call volume is minimal at this point in time.

    The Vertica version we are using is: Vertica Analytic Database v9.2.1-1

  • Edit: file size is less than 1 MB each.

  • Bryan_H Vertica Employee Administrator

    Hi, I did some tests on my AWS bucket, generating several thousand small files across four partition columns to try to replicate this. A few findings:
    When I set a WHERE clause that matches partition columns, performance improves steadily: ~30 seconds with no partition specified, ~5-20 seconds with one or more partitions set, and as good as 2 seconds with all partitions specified. However, I generated my Parquet files using Vertica EXPORT TO PARQUET, so I would expect Vertica to be able to read them back.
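
    For reference, I wrote my test data with something along these lines (a sketch; the bucket path, source table, and partition column names here are placeholders, not your schema):

    EXPORT TO PARQUET (directory = 's3://mybucket/test_export')
      OVER (PARTITION BY p1, p2, p3, p4)  -- four hypothetical partition columns
    AS SELECT * FROM my_source_table;
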
    I verified predicate pushdown by setting:

    select set_debug_log('SAL','ALL');

    This sends S3 file access details to vertica.log. I would recommend setting this, then running a query on the external table, then checking vertica.log to see whether Vertica is scanning the expected files and partition folders. Our engineers suggested there may be an issue with the partition files or keys, or perhaps Vertica is doing something unexpected; in any case, they'd like to see the log file.

  • mt_25
    edited July 2019

    removed log file

  • DMed you the execution logs.

  • Hi,

    To avoid all partitions being scanned while executing queries, we moved from S3 storage to HDFS.
    We got better performance, but encountered a problem, explained below:

    Case 1:
    select * from common.fact_rating_quarter_hour_qa_hdfs1 where (station_code,market_code) in ((6130,111),(7445,111));

    Explain plan:

    QUERY PLAN DESCRIPTION:

    explain select * from common.fact_rating_quarter_hour_qa_hdfs1 where (station_code,market_code) in ((6130,111),(7445,111))

    Access Path:
    +-LOAD EXTERNAL TABLE [Cost: 0, Rows: 0] (PATH ID: 1)
    | Table: fact_rating_quarter_hour_qa_hdfs1
    | COPY FROM 'hdfs:///user/hadoop/9/nielsen/fact_rating_quarter_hour/*/*/*/*/*/*/*/*/*' parquet(hive_partition_cols='station_code,reporting_year,reporting_month,market_code,reporting_service,sample_type_int,sub_sample_indicator,rating_stream_int', allow_no_match='true')
    | Filter: (fact_rating_quarter_hour_qa_hdfs1.market_code = 111)
    | Filter: ((fact_rating_quarter_hour_qa_hdfs1.station_code = 6130) OR (fact_rating_quarter_hour_qa_hdfs1.station_code = 7445))
    | Execute on: Query Initiator

    When I run the above query, it does not honor the partition columns and scans across all station_code partitions.

    I tried another simple query (case 2) just to test partition pruning, and there pruning happens as expected.
    Case 2:
    select * from common.fact_rating_quarter_hour_qa_hdfs1 where station_code=6130; worked fine and hit the correct path,

    whereas the case 1 style query:

    select * from common.fact_rating_quarter_hour_qa_hdfs1 where station_code=6130 OR station_code=7445; leads to a scan of all partitions.

    The bottom line is that whenever an OR clause comes into the picture, the partition columns are not honored and all partitions are scanned.

    Table definition:

    CREATE EXTERNAL TABLE common.fact_rating_quarter_hour_qa_hdfs1
    (
    record_code varchar(500),
    hour_minute int,
    station_code int,
    reporting_year int,
    reporting_month int,
    market_code int,
    reporting_service int,
    sample_type_int int,
    sub_sample_indicator int,
    rating_stream_int int
    ) AS COPY FROM 'hdfs:///user/hadoop/9/nielsen/fact_rating_quarter_hour/*/*/*/*/*/*/*/*/*'
    parquet(hive_partition_cols='station_code,reporting_year,reporting_month,market_code,reporting_service,sample_type_int,sub_sample_indicator,rating_stream_int', allow_no_match='true');

  • Bryan_H Vertica Employee Administrator

    Hi, does the query "select * from common.fact_rating_quarter_hour_qa_hdfs1 where station_code=6130 OR station_code=7445;" actually cause a full scan? The log only shows the IN clause on multiple fields.

    Have you tried alternate syntax for the query, e.g. "WHERE market_code = 111 AND station_code IN (6130,7445)", since this is how Vertica parses your predicate? It's possible that partition pruning doesn't understand the IN clause but might understand an expanded boolean.
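
    That is, as a full statement:

    select * from common.fact_rating_quarter_hour_qa_hdfs1
    where market_code = 111 and station_code in (6130, 7445);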

  • mt_25
    edited August 2019

    "select * from common.fact_rating_quarter_hour_qa_hdfs1 where station_code=6130 OR station_code=7445;"

    Yes it goes for full partition scan,whenever OR clause is coming into picture, attached log for above query as well.

    WHERE market_code = 111 and station_code IN (6130,7445).

    Yes the above query doesn't go for full scan but this is not a valid scenario for us. Below is the valid scenario for us:

    WHERE (market_code = 112 and station_code = 6192) OR (market_code = 111 and station_code IN (6130,7445));

    As soon as we have OR clause in our query, it goes for full scan again.

  • Just for information for other community members:
    the OR clause doesn't participate in partition pruning:
    https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/Partitions/EliminatingPartitions.htm

    @Bryan_H Is there any workaround for this?

  • Bryan_H Vertica Employee Administrator

    Logical OR could be simulated with UNION ALL, e.g.
    SELECT columns FROM t WHERE VALUE IN (a,b);
    becomes:
    SELECT columns FROM t WHERE VALUE = a
    UNION ALL
    SELECT columns FROM t WHERE VALUE = b;
    The UNION ALL might offset any gain from partition pruning, so please test whether this performs well enough for your use case.
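
    Applied to the predicate you posted above, the rewrite would look something like this (a sketch; each branch keeps only AND/IN conditions on the partition columns so both can be pruned independently):

    select * from common.fact_rating_quarter_hour_qa_hdfs1
    where market_code = 112 and station_code = 6192
    union all
    select * from common.fact_rating_quarter_hour_qa_hdfs1
    where market_code = 111 and station_code in (6130, 7445);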

  • reli Vertica Customer

    Also, it might be worth checking whether you are reading from a centralized S3 endpoint, or you have mapped an S3 gateway to the VPC where your cluster resides, which has made a difference for some users running thousands of concurrent S3 API calls

    What is the best setting?
    Our query runs on 1-2 files, each about 1.5 MB in size.
