Select from table vs. Select from Projection

Shadi_Mari · January 2015

I have a difference in performance when selecting from an anchor table as opposed to selecting directly from a projection although both query plans shows that the right projection is used.

For instance, It takes 8 seconds to execute this query:

select dateid, adid, countryid, deviceid, sum(requests) as requests,
sum(impressions) as impressions,
sum(appsiterevenue) as revenue,
sum(clicks) as clicks,
sum(spend) as spend,
sum(conversions) as conversions
from fact_stat2 where date_trunc('month', dateid) = '2014-10-01' and advaccountid = 62130
group by dateid, adid, countryid, deviceid
order by dateid, adid, countryid, deviceid
limit 10 offset 0;

whereas it takes 1 second to execute the same query from a projection.

select dateid, adid, countryid, deviceid, sum(requests) as requests,
sum(impressions) as impressions,
sum(appsiterevenue) as revenue,
sum(clicks) as clicks,
sum(spend) as spend,
sum(conversions) as conversions
from fact_stat2_adid_dev where date_trunc('month', dateid) = '2014-10-01' and advaccountid = 62130
group by dateid, adid, countryid, deviceid
order by dateid, adid, countryid, deviceid
limit 10 offset 0;

Please advise if there is something missing.

Thanks

Query plan for Stm #1:
--------------------------------------------------------------------------------------------
Access Path: +-SELECT LIMIT 10 [Cost: 345726.000000, Rows: 10.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000] [OutRowSz (B): 80] (PATH ID: 0)
| Output Only: 10 tuples
| Execute on: Query Initiator
| Sort Key: (fact_stat2.DateId, fact_stat2.AdId, fact_stat2.CountryId, fact_stat2.DeviceId)
| LDISTRIB_UNSEGMENTED
| +---> GROUPBY HASH (SORT OUTPUT) (LOCAL RESEGMENT GROUPS) [Cost: 345726.000000, Rows: 2747.000000 Disk(B): 0.000000 CPU(B): 4034295040.000000 Memory(B): 483472.000000 Netwrk(B): 0.000000
Parallelism: 2.000000] [OutRowSz (B): 80] (PATH ID: 2)
| | Aggregates: sum(fact_stat2.Requests), sum(fact_stat2.Impressions), sum(fact_stat2.AppSiteRevenue), sum(fact_stat2.Clicks), sum(fact_stat2.Spend), sum(fact_stat2.Conversions)
| | Group By: fact_stat2.DateId, fact_stat2.AdId, fact_stat2.CountryId, fact_stat2.DeviceId
| | Execute on: All Nodes
| | Sort Key: (fact_stat2.DateId, fact_stat2.AdId, fact_stat2.CountryId, fact_stat2.DeviceId)
| | LDISTRIB_SEGMENTED
| | +---> STORAGE ACCESS for fact_stat2 [Cost: 296467.000000, Rows: 50428688.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 2.000000] [OutRo
wSz (B): 80] (PATH ID: 3)
| | | Column Cost Aspects: [ Disk(B): 1702315015.160839 CPU(B): 1306595.356643 Memory(B): 3630868705.958018 Netwrk(B): 0.000000 Parallelism: 2.000000 ]
| | | Projection: public.fact_stat2_adid_dev
| | | Materialize: fact_stat2.DateId, fact_stat2.AdId, fact_stat2.DeviceId, fact_stat2.CountryId, fact_stat2.Requests, fact_stat2.Impressions, fact_stat2.Clicks, fact_stat2.Conversion
s, fact_stat2.Spend, fact_stat2.AppSiteRevenue
| | | Filter: (fact_stat2.AdvAccountId = 62130)/* sel=0.398601 ndv= 2 */
| | | Filter: (date_trunc('month', fact_stat2.DateId) = '2014-10-01 00:00:00'::timestamp)/* sel=1.000000 ndv= 61 */
| | | Execute on: All Nodes
| | | Sort Key: (fact_stat2.DateId, fact_stat2.AdId, fact_stat2.DeviceId, fact_stat2.OperatorId, fact_stat2.CountryId, fact_stat2.RegionId, fact_stat2.PlatformId, fact_stat2.Requests,
fact_stat2.Impressions, fact_stat2.Clicks, fact_stat2.Conversions, fact_stat2.Spend, fact_stat2.AppSiteRevenue, fact_stat2.AdFalconRevenue)
| | | LDISTRIB_SEGMENTED

--------------------------------------------------------------------------------------------
Query plan for Stm #2:
--------------------------------------------------------------------------------------------
Access Path: +-SELECT LIMIT 10 [Cost: 346769.000000, Rows: 10.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000] [OutRowSz (B): 80] (PATH ID: 0)
| Output Only: 10 tuples
| Execute on: Query Initiator
| Sort Key: (fact_stat2_adid_dev.DateId, fact_stat2_adid_dev.AdId, fact_stat2_adid_dev.CountryId, fact_stat2_adid_dev.DeviceId, fact_stat2_adid_dev.RegionId, fact_stat2_adid_dev.OperatorI
d, fact_stat2_adid_dev.PlatformId, fact_stat2_adid_dev.Requests, fact_stat2_adid_dev.Impressions, fact_stat2_adid_dev.Clicks, fact_stat2_adid_dev.Conversions, fact_stat2_adid_dev.Spend, fac
t_stat2_adid_dev.AppSiteRevenue, fact_stat2_adid_dev.AdFalconRevenue)
| LDISTRIB_UNSEGMENTED
| +---> GROUPBY PIPELINED [Cost: 346769.000000, Rows: 2747.000000 Disk(B): 0.000000 CPU(B): 4034295040.000000 Memory(B): 219760.000000 Netwrk(B): 0.000000 Parallelism: 2.000000] [OutRowSz
(B): 80] (PATH ID: 2)
| | Aggregates: sum(fact_stat2_adid_dev.Requests), sum(fact_stat2_adid_dev.Impressions), sum(fact_stat2_adid_dev.AppSiteRevenue), sum(fact_stat2_adid_dev.Clicks), sum(fact_stat2_adid_
dev.Spend), sum(fact_stat2_adid_dev.Conversions)
| | Group By: fact_stat2_adid_dev.DateId, fact_stat2_adid_dev.AdId, fact_stat2_adid_dev.CountryId, fact_stat2_adid_dev.DeviceId
| | Execute on: All Nodes
| | Sort Key: (fact_stat2_adid_dev.DateId, fact_stat2_adid_dev.AdId, fact_stat2_adid_dev.CountryId, fact_stat2_adid_dev.DeviceId, fact_stat2_adid_dev.RegionId, fact_stat2_adid_dev.Ope
ratorId, fact_stat2_adid_dev.PlatformId, fact_stat2_adid_dev.Requests, fact_stat2_adid_dev.Impressions, fact_stat2_adid_dev.Clicks, fact_stat2_adid_dev.Conversions, fact_stat2_adid_dev.Spen
d, fact_stat2_adid_dev.AppSiteRevenue, fact_stat2_adid_dev.AdFalconRevenue)
| | LDISTRIB_UNSEGMENTED
| | +---> STORAGE ACCESS for fact_stat2_adid_dev [Cost: 297516.000000, Rows: 50428688.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 2.00000
0] [OutRowSz (B): 80] (PATH ID: 3)
| | | Column Cost Aspects: [ Disk(B): 1710909397.034965 CPU(B): 1306595.356643 Memory(B): 3630868705.958018 Netwrk(B): 0.000000 Parallelism: 2.000000 ]
| | | Projection: public.fact_stat2_adid_dev
| | | Materialize: fact_stat2_adid_dev.DateId, fact_stat2_adid_dev.AdId, fact_stat2_adid_dev.CountryId, fact_stat2_adid_dev.DeviceId, fact_stat2_adid_dev.Requests, fact_stat2_adid_dev
.Impressions, fact_stat2_adid_dev.Clicks, fact_stat2_adid_dev.Conversions, fact_stat2_adid_dev.Spend, fact_stat2_adid_dev.AppSiteRevenue
| | | Filter: (fact_stat2_adid_dev.AdvAccountId = 62130)/* sel=0.398601 ndv= 2 */
| | | Filter: (date_trunc('month', fact_stat2_adid_dev.DateId) = '2014-10-01 00:00:00'::timestamp)/* sel=1.000000 ndv= 61 */
| | | Execute on: All Nodes
| | | Sort Key: (fact_stat2_adid_dev.DateId, fact_stat2_adid_dev.AdId, fact_stat2_adid_dev.CountryId, fact_stat2_adid_dev.DeviceId, fact_stat2_adid_dev.RegionId, fact_stat2_adid_dev.O
peratorId, fact_stat2_adid_dev.PlatformId, fact_stat2_adid_dev.Requests, fact_stat2_adid_dev.Impressions, fact_stat2_adid_dev.Clicks, fact_stat2_adid_dev.Conversions, fact_stat2_adid_dev.Sp
end, fact_stat2_adid_dev.AppSiteRevenue, fact_stat2_adid_dev.AdFalconRevenue)
| | | LDISTRIB_SEGMENTED
--------------------------------------------------------------------------------------------

id10t · January 2015

Hi!

Sorting...

I think we already discussed about GBY HASH vs GBY PIPELINED.

Table:

GROUPBY HASH (SORT OUTPUT) (LOCAL RESEGMENT GROUPS) [
    Cost: 345726.000000,
    Rows: 2747.000000
    Disk(B): 0.000000
    CPU(B): 4034295040.000000
    Memory(B): 483472.000000
    Netwrk(B): 0.000000
    Parallelism: 2.000000]
[OutRowSz(B): 80] (PATH ID: 2)

Projection:

GROUPBY PIPELINED [
    Cost: 346769.000000,
    Rows: 2747.000000
    Disk(B): 0.000000
    CPU(B): 4034295040.000000
    Memory(B): 219760.000000
    Netwrk(B): 0.000000
    Parallelism: 2.000000]
[OutRowSz(B): 80] (PATH ID: 2)

Query against table performs sorting. Are you sure queries are identical?

Sort key (PATH ID: 3):

+=================+=================+ |  TABLE          | PROJECTION      | +=================+=================+ |  DateId         | DateId          | |  AdId           | AdId            | |  DeviceId       | CountryId       | <-- not match |  OperatorId     | DeviceId        | <-- not match |  CountryId      | RegionId        | <-- not match |  RegionId       | OperatorId      | <-- not match |  PlatformId     | PlatformId      | |  Requests       | Requests        | |  Impressions    | Impressions     | |  Clicks         | Clicks          | |  Conversions    | Conversions     | |  Spend          | Spend           | |  AppSiteRevenue | AppSiteRevenue  | |  AdFalconRevenue| AdFalconRevenue | +=================+=================+

Can you post table and projection definition?

Shadi_Mari · January 2015

I still do not understand why Vertica would do this. I thought I do not have to select directly from a certain projection; Vertica should choose the right projection and sort order for my case accordingly. I dont want to write code so that a different projection is queried per each parameter set.

Queries are identical as I sent in my question and the only difference is the FROM statement.

Below is the Table Super Projection Definition and Custom Projection Definition.

CREATE PROJECTION fact_stat2_mysuper(
AdvAccountId ENCODING RLE,
PubAccountId ENCODING COMMONDELTA_COMP,
AppSiteId ENCODING BLOCKDICT_COMP,
CampaignId ENCODING RLE,
AdGroupId ENCODING COMMONDELTA_COMP,
AdId ENCODING COMMONDELTA_COMP,
CountryId ENCODING COMMONDELTA_COMP,
RegionId ENCODING COMMONDELTA_COMP,
OperatorId ENCODING COMMONDELTA_COMP,
ManufacturerId ENCODING COMMONDELTA_COMP,
DeviceId ENCODING COMMONDELTA_COMP,
PlatformId ENCODING COMMONDELTA_COMP,
DateId ENCODING RLE,
TimeId ENCODING RLE,
Requests ENCODING DELTARANGE_COMP,
Impressions ENCODING DELTARANGE_COMP,
Clicks ENCODING COMMONDELTA_COMP,
Conversions ENCODING RLE,
AppSiteRevenue ENCODING RLE,
Spend ENCODING COMMONDELTA_COMP,
AdFalconRevenue ENCODING RLE
)
AS
SELECT AdvAccountId,
PubAccountId,
AppSiteId,
CampaignId,
AdGroupId,
AdId,
CountryId,
RegionId,
OperatorId,
ManufacturerId,
DeviceId,
PlatformId,
DateId,
TimeId,
Requests,
Impressions,
Clicks,
Conversions,
AppSiteRevenue,
Spend,
AdFalconRevenue
FROM public.fact_stat2
ORDER BY AdvAccountId,
DateId,
TimeId,
CampaignId,
AdGroupId,
AdId,
DeviceId,
PlatformId,
ManufacturerId,
OperatorId,
RegionId,
CountryId,
Conversions,
Clicks,
AdFalconRevenue,
AppSiteRevenue,
Spend,
CountryId,
AdGroupId,
AppSiteId,
Impressions,
Requests,
PubAccountId,
AppSiteId
SEGMENTED BY MODULARHASH (AdvAccountId) ALL NODES OFFSET 0;

CREATE PROJECTION fact_stat2_adid_dev(
AdvAccountId ENCODING RLE,
DateId ENCODING RLE,
AdId ENCODING RLE,
CountryId ENCODING RLE,
DeviceId ENCODING RLE,
RegionId ENCODING RLE,
OperatorId ENCODING RLE,
PlatformId ENCODING RLE,
Requests ENCODING COMMONDELTA_COMP,
Impressions ENCODING COMMONDELTA_COMP,
Clicks ENCODING BLOCKDICT_COMP,
Conversions ENCODING RLE,
AppSiteRevenue ENCODING RLE,
Spend ENCODING COMMONDELTA_COMP,
AdFalconRevenue ENCODING RLE
)
AS
SELECT AdvAccountId,
DateId,
AdId,
DeviceId,
OperatorId,
CountryId,
RegionId,
PlatformId,
Requests,
Impressions,
Clicks,
Conversions,
AppSiteRevenue,
Spend,
AdFalconRevenue
FROM public.fact_stat2
ORDER BY AdvAccountId,
DateId,
AdId,
DeviceId,
OperatorId,
CountryId,
RegionId,
PlatformId,
Requests,
Impressions,
Clicks,
Conversions,
Spend,
AppSiteRevenue,
AdFalconRevenue
SEGMENTED BY HASH (AdId) ALL NODES OFFSET 0;

id10t · January 2015

Vertica should choose the right projection and sort order for my case accordingly.

1. By what I see now everything is correct, nothing wrong with Vertica.
Vertica is cost-optimized database, it means that Vertica will use in plan that uses in less resources(we are dealing with BigData, right?). That is a reason why plan with GBY HASH is chosen and not against a projection with GBY PIPELINED.

2. Your super and dev projections are different (encoding, columns order), so don't expect same time.

But now I see another bug:

Queries are identical as I sent in my question and the only difference is the FROM statement.

If queries are identical, so why columns order sort in PATH 3 isn't same(see my post above)? Columns order has impact:

ORDER BY col1, col2, col3

is not equals to

ORDER BY col3, col2, col1

If you are EE customer you should open a ticket on this issue.
If you are CE customer - we can only hope for the best.

Shadi_Mari · January 2015

Yes, queries are identical (and thus sort order), and am facing the same issue with other queries. Could it be a bug in Vertica?

To bypass this, Is it a good practice to query from a projection directly?! I dont think so because this will complicate reporting UI logic...

Shadi_Mari · January 2015

This is going to drive me crazy... Is there anything i am missing?!

Note the sort Order is different in each Query Plan...

Query1:
select dateid, adgroupname, regionid, deviceid, sum(requests) as requests,
sum(impressions) as impressions,
sum(appsiterevenue) as revenue,
sum(clicks) as clicks,
sum(spend) as spend,
sum(conversions) as conversions
from fact_stat2_adv_adgroupid where date_trunc('month', dateid) = '2014-10-01' and advaccountid = 62130
group by dateid, adgroupname, regionid, deviceid
order by dateid, adgroupname, regionid, deviceid
limit 10 offset 0;

Query1: Execution Plan:
----------------------------------------------------------------------------------------------------
Access Path: +-SELECT LIMIT 10 [Cost: 286105.000000, Rows: 10.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000] [OutRowSz (B): 80] (PATH ID: 0)
| Output Only: 10 tuples
| Execute on: Query Initiator
| Sort Key: (fact_stat2_adv_adgroupid.DateId, fact_stat2_adv_adgroupid.AdGroupName, fact_stat2_adv_adgroupid.RegionId, fact_stat2_adv_adgroupid.DeviceId)
| LDISTRIB_UNSEGMENTED
| +---> GROUPBY HASH (SORT OUTPUT) (LOCAL RESEGMENT GROUPS) [Cost: 286105.000000, Rows: 489.000000 Disk(B): 0.000000 CPU(B): 4373392480.000000 Memory(B): 86064.000000 Netwrk(B): 0.000000 P
arallelism: 3.000000] [OutRowSz (B): 80] (PATH ID: 2)
| | Aggregates: sum(fact_stat2_adv_adgroupid.Requests), sum(fact_stat2_adv_adgroupid.Impressions), sum(fact_stat2_adv_adgroupid.AppSiteRevenue), sum(fact_stat2_adv_adgroupid.Clicks),
sum(fact_stat2_adv_adgroupid.Spend), sum(fact_stat2_adv_adgroupid.Conversions)
| | Group By: fact_stat2_adv_adgroupid.DateId, fact_stat2_adv_adgroupid.AdGroupName, fact_stat2_adv_adgroupid.RegionId, fact_stat2_adv_adgroupid.DeviceId
| | Execute on: All Nodes
| | Sort Key: (fact_stat2_adv_adgroupid.DateId, fact_stat2_adv_adgroupid.AdGroupName, fact_stat2_adv_adgroupid.RegionId, fact_stat2_adv_adgroupid.DeviceId)
| | LDISTRIB_SEGMENTED
| | +---> STORAGE ACCESS for fact_stat2_adv_adgroupid [Cost: 250512.000000, Rows: 54667406.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 3.
000000] [OutRowSz (B): 80] (PATH ID: 3)
| | | Column Cost Aspects: [ Disk(B): 2290805877.251908 CPU(B): 2129669.862595 Memory(B): 3936057440.366373 Netwrk(B): 0.000000 Parallelism: 3.000000 ]
| | | Projection: public.fact_stat2_adv_adgroupid
| | | Materialize: fact_stat2_adv_adgroupid.DateId, fact_stat2_adv_adgroupid.AdGroupName, fact_stat2_adv_adgroupid.RegionId, fact_stat2_adv_adgroupid.DeviceId, fact_stat2_adv_adgroupi
d.Requests, fact_stat2_adv_adgroupid.Impressions, fact_stat2_adv_adgroupid.Clicks, fact_stat2_adv_adgroupid.Conversions, fact_stat2_adv_adgroupid.Spend, fact_stat2_adv_adgroupid.AppSiteReve
nue
| | | Filter: (fact_stat2_adv_adgroupid.AdvAccountId = 62130)/* sel=0.526718 ndv= 1 */
| | | Filter: (date_trunc('month', fact_stat2_adv_adgroupid.DateId) = '2014-10-01 00:00:00'::timestamp)/* sel=1.000000 ndv= 61 */
| | | Execute on: All Nodes
| | | Sort Key: (fact_stat2_adv_adgroupid.DateId, fact_stat2_adv_adgroupid.TimeId, fact_stat2_adv_adgroupid.AdGroupName, fact_stat2_adv_adgroupid.AdGroupId, fact_stat2_adv_adgroupid.C
ountryId, fact_stat2_adv_adgroupid.RegionId, fact_stat2_adv_adgroupid.OperatorId, fact_stat2_adv_adgroupid.ManufacturerId, fact_stat2_adv_adgroupid.DeviceId, fact_stat2_adv_adgroupid.Platfo
rmId, fact_stat2_adv_adgroupid.Requests, fact_stat2_adv_adgroupid.Impressions, fact_stat2_adv_adgroupid.Clicks, fact_stat2_adv_adgroupid.Conversions, fact_stat2_adv_adgroupid.Spend, fact_st
at2_adv_adgroupid.AppSiteRevenue, fact_stat2_adv_adgroupid.AdFalconRevenue)
| | | LDISTRIB_SEGMENTED

-----------------------------------------------------------------------------------------------------
Query2 (anchor table):
select dateid, f.name, regionid, deviceid, sum(requests) as requests,
sum(impressions) as impressions,
sum(appsiterevenue) as revenue,
sum(clicks) as clicks,
sum(spend) as spend,
sum(conversions) as conversions
from fact_stat2_adv a inner join id_name_adgroup f on a.adgroupid = f.id where date_trunc('month', dateid) = '2014-10-01' and advaccountid = 62130
group by dateid, f.name, regionid, deviceid
order by dateid, f.name, regionid, deviceid
limit 10 offset 0;

Query2 Execution Plan:
---------------------------------------------------------------------------------------------------------
Access Path: +-SELECT LIMIT 10 [Cost: 4689075.000000, Rows: 10.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000] [OutRowSz (B): 335] (PATH ID: 0)
| Output Only: 10 tuples
| Execute on: Query Initiator
| Sort Key: (fact_stat2_adv_adgroupid.DateId, fact_stat2_adv_adgroupid.AdGroupId, fact_stat2_adv_adgroupid.RegionId, fact_stat2_adv_adgroupid.DeviceId)
| LDISTRIB_UNSEGMENTED
| +---> GROUPBY HASH (SORT OUTPUT) (GLOBAL RESEGMENT GROUPS) (LOCAL RESEGMENT GROUPS) [Cost: 4689075.000000, Rows: 333.000000 Disk(B): 0.000000 CPU(B): 18313581010.000000 Memory(B): 228438
.000000 Netwrk(B): 18313692565.000000 Parallelism: 3.000000] [OutRowSz (B): 335] (PATH ID: 2)
| | Aggregates: sum(fact_stat2_adv_adgroupid.Requests), sum(fact_stat2_adv_adgroupid.Impressions), sum(fact_stat2_adv_adgroupid.AppSiteRevenue), sum(fact_stat2_adv_adgroupid.Clicks),
sum(fact_stat2_adv_adgroupid.Spend), sum(fact_stat2_adv_adgroupid.Conversions)
| | Group By: fact_stat2_adv_adgroupid.DateId, fact_stat2_adv_adgroupid.AdGroupId, fact_stat2_adv_adgroupid.RegionId, fact_stat2_adv_adgroupid.DeviceId
| | Execute on: All Nodes
| | Sort Key: (fact_stat2_adv_adgroupid.DateId, fact_stat2_adv_adgroupid.AdGroupId, fact_stat2_adv_adgroupid.RegionId, fact_stat2_adv_adgroupid.DeviceId)
| | LDISTRIB_SEGMENTED
| | +---> STORAGE ACCESS for <No Alias> [Cost: 470324.000000, Rows: 54667406.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 3.000000] [OutRo
wSz (B): 335] (PATH ID: 3)
| | | Column Cost Aspects: [ Disk(B): 2291289142.717557 CPU(B): 2129669.862595 Memory(B): 17438906556.442711 Netwrk(B): 0.000000 Parallelism: 3.000000 ]
| | | Projection: public.fact_stat2_adv_adgroupid
| | | Materialize: fact_stat2_adv_adgroupid.DateId, fact_stat2_adv_adgroupid.AdGroupId, fact_stat2_adv_adgroupid.RegionId, fact_stat2_adv_adgroupid.DeviceId, fact_stat2_adv_adgroupid.
Requests, fact_stat2_adv_adgroupid.Impressions, fact_stat2_adv_adgroupid.Clicks, fact_stat2_adv_adgroupid.Conversions, fact_stat2_adv_adgroupid.Spend, fact_stat2_adv_adgroupid.AppSiteRevenu
e
| | | Filter: (fact_stat2_adv_adgroupid.AdvAccountId = 62130)/* sel=0.526718 ndv= 1 */
| | | Filter: (date_trunc('month', fact_stat2_adv_adgroupid.DateId) = '2014-10-01 00:00:00'::timestamp)/* sel=1.000000 ndv= 61 */
| | | Execute on: All Nodes
| | | Sort Key: (fact_stat2_adv_adgroupid.DateId, fact_stat2_adv_adgroupid.TimeId, fact_stat2_adv_adgroupid.AdGroupName, fact_stat2_adv_adgroupid.AdGroupId, fact_stat2_adv_adgroupid.C
ountryId, fact_stat2_adv_adgroupid.RegionId, fact_stat2_adv_adgroupid.OperatorId, fact_stat2_adv_adgroupid.ManufacturerId, fact_stat2_adv_adgroupid.DeviceId, fact_stat2_adv_adgroupid.Platfo
rmId, fact_stat2_adv_adgroupid.Requests, fact_stat2_adv_adgroupid.Impressions, fact_stat2_adv_adgroupid.Clicks, fact_stat2_adv_adgroupid.Conversions, fact_stat2_adv_adgroupid.Spend, fact_st
at2_adv_adgroupid.AppSiteRevenue, fact_stat2_adv_adgroupid.AdFalconRevenue)
| | | LDISTRIB_SEGMENTED

----------------------------------------------------------------------------------------------------------

Projection Definition:

CREATE PROJECTION fact_stat2_adv_adgroupid(
AdvAccountId ENCODING RLE,
DateId ENCODING RLE,
TimeId ENCODING RLE,
AdGroupName ENCODING RLE,
AdGroupId ENCODING RLE,
CountryId ENCODING RLE,
RegionId ENCODING RLE,
OperatorId ENCODING RLE,
ManufacturerId ENCODING COMMONDELTA_COMP,
DeviceId ENCODING COMMONDELTA_COMP,
PlatformId ENCODING COMMONDELTA_COMP,
Requests ENCODING COMMONDELTA_COMP,
Impressions ENCODING COMMONDELTA_COMP,
Clicks ENCODING BLOCKDICT_COMP,
Conversions ENCODING RLE,
AppSiteRevenue ENCODING RLE,
Spend ENCODING COMMONDELTA_COMP,
AdFalconRevenue ENCODING RLE
)
AS
SELECT AdvAccountId,  DateId, TimeId, AdGroupId, f.Name as AdGroupName,
CountryId,   RegionId,   OperatorId,   ManufacturerId,   DeviceId,
PlatformId,   Requests,   Impressions,   Clicks,   Conversions,
AppSiteRevenue,   Spend,   AdFalconRevenue
FROM
public.fact_stat2_adv a INNER JOIN public.id_name_adgroup f ON a.AdGroupId = f.Id

ORDER BY AdvAccountId, DateId, TimeId, AdGroupId,
AdGroupName, CountryId, RegionId, OperatorId, ManufacturerId,
DeviceId, PlatformId, Requests, Impressions, Clicks,
Conversions, Spend, AppSiteRevenue, AdFalconRevenue
SEGMENTED BY HASH (AdGroupId) ALL NODES OFFSET 0;

id10t · January 2015

Hi!

Could it be a bug in Vertica?

Im waiting that someone another will take a look. Is it possible that we are missing something.

Is it a good practice to query from a projection directly?!

I don't see any issue/problem with it, but in general no, its a bad practice - by querying projection directly you actually and so you limits optimizer in alternatives.
Optimizer should choose a best plan by testing a different projections, but you do not gives him a choice.

I thought I do not have to select directly from a certain projection; Vertica should choose the right projection and sort order for my case accordingly. I dont want to write code so that a different projection is queried per each parameter set.

Agree.

This is going to drive me crazy... Is there anything i am missing?!

Me too and yes is it possible that both are missing something. As I wrote - will hope for a best.

*************************

I will hope that some employee will take a look and will find our mistake, because if its a bug... its not good.

*************************

Please, lets investigate an impact of this issue, its very important:

Can you check - does output of both queries are identical? Run both queries as sub-queries and in outer query sort by all columns while columns order are same and preserved. Check:

output is identical (num of rows, all rows are identical in both sets)
post explain verbose

Thank you very much.

id10t · January 2015

Hi!

Could it be a bug in Vertica?

Im waiting that someone another will take a look. Is it possible that we are missing something.

Is it a good practice to query from a projection directly?!

I don't see any issue/problem with it, but in general no, its a bad practice - by querying projection directly you actually and so you limits optimizer in alternatives.
Optimizer should choose a best plan by testing a different projections, but you do not gives him a choice.

I thought I do not have to select directly from a certain projection; Vertica should choose the right projection and sort order for my case accordingly. I dont want to write code so that a different projection is queried per each parameter set.

Agree.

This is going to drive me crazy... Is there anything i am missing?!

Me too and yes is it possible that both are missing something. As I wrote - will hope for a best.

*************************

I will hope that some employee will take a look and will find our mistake, because if its a bug... its not good.

*************************

Please, lets investigate an impact of this issue, its very important:

Can you check - does output of both queries are identical? Run both queries as sub-queries and in outer query sort by all columns while columns order are same and preserved. Check:

output is identical (num of rows, all rows are identical in both sets)
post explain verbose

Thank you very much.

id10t · January 2015

UPDATE
or you can check output by creating (temporary)tables from result and:

select * from output1 minus select * from output2

any output will indicate on difference.

Example:

daniel=> (select * from tbl limit 5) minus (select * from tbl limit 4);
 a | b |             c             
---+---+---------------------------
 e | 0 | 2015-01-06 12:47:28.83284
(1 row)

Shadi_Mari · January 2015

I executed the following query:

(select dateid, adid, regionid, deviceid, sum(requests) as requests,
sum(impressions) as impressions,
sum(appsiterevenue) as revenue,
sum(clicks) as clicks,
sum(spend) as spend,
sum(conversions) as conversions
from fact_stat where date_trunc('month', dateid) = '2014-10-01' and advaccountid = 62130
group by dateid, adid, regionid, deviceid
order by dateid, adid, regionid, deviceid
limit 5 offset 0)
minus
(select dateid, adid, regionid, deviceid,
sum(requests) as requests,
sum(impressions) as impressions,
sum(appsiterevenue) as revenue,
sum(clicks) as clicks,
sum(spend) as spend,
sum(conversions) as conversions
from fact_stat_adid_b0 where date_trunc('month', dateid) = '2014-10-01' and advaccountid = 62130
group by dateid, adid, regionid, deviceid
order by dateid, adid, regionid, deviceid
limit 4 offset 0);

Results are:

dateid | adid | regionid | deviceid | requests | impressions | revenue | clicks | spend | conversions------------+-------+----------+----------+----------+-------------+-----------+--------+-----------+-------------
2014-10-01 | 61118 | | 20202 | 2 | 2 | 0.0000000 | 0 | 0.0000000 | 0
2014-10-01 | 61118 | | 21008 | 1 | 1 | 0.0000000 | 0 | 0.0000000 | 0
2014-10-01 | 61118 | | 20200 | 1 | 1 | 0.0000000 | 0 | 0.0000000 | 0
2014-10-01 | 61118 | | 21016 | 2 | 2 | 0.0000000 | 0 | 0.0000000 | 0
2014-10-01 | 61118 | | | 81 | 55 | 0.0000000 | 0 | 0.0000000 | 0
(5 rows)

This indicates that results are not identical....

Trying out the same query but with the second buddy projection; results in different order:

(select dateid, adid, regionid, deviceid, sum(requests) as requests,
sum(impressions) as impressions,
sum(appsiterevenue) as revenue,
sum(clicks) as clicks,
sum(spend) as spend,
sum(conversions) as conversions
from fact_stat where date_trunc('month', dateid) = '2014-10-01' and advaccountid = 62130
group by dateid, adid, regionid, deviceid
order by dateid, adid, regionid, deviceid
limit 5 offset 0)
minus
(select dateid, adid, regionid, deviceid,
sum(requests) as requests,
sum(impressions) as impressions,
sum(appsiterevenue) as revenue,
sum(clicks) as clicks,
sum(spend) as spend,
sum(conversions) as conversions
from fact_stat_adid_b1 where date_trunc('month', dateid) = '2014-10-01' and advaccountid = 62130
group by dateid, adid, regionid, deviceid
order by dateid, adid, regionid, deviceid
limit 4 offset 0);

dateid | adid | regionid | deviceid | requests | impressions | revenue | clicks | spend | conversions------------+-------+----------+----------+----------+-------------+-----------+--------+-----------+-------------
2014-10-01 | 61118 | | 21016 | 2 | 2 | 0.0000000 | 0 | 0.0000000 | 0
2014-10-01 | 61118 | | | 81 | 55 | 0.0000000 | 0 | 0.0000000 | 0
2014-10-01 | 61118 | | 20202 | 2 | 2 | 0.0000000 | 0 | 0.0000000 | 0
2014-10-01 | 61118 | | 21008 | 1 | 1 | 0.0000000 | 0 | 0.0000000 | 0
2014-10-01 | 61118 | | 20200 | 1 | 1 | 0.0000000 | 0 | 0.0000000 | 0
(5 rows)

However, B0 - B1 gives:

(select dateid, adid, regionid, deviceid, sum(requests) as requests,
sum(impressions) as impressions,
sum(appsiterevenue) as revenue,
sum(clicks) as clicks,
sum(spend) as spend,
sum(conversions) as conversions
from fact_stat_adid_b0 where date_trunc('month', dateid) = '2014-10-01' and advaccountid = 62130
group by dateid, adid, regionid, deviceid
order by dateid, adid, regionid, deviceid
limit 5 offset 0)
minus
(select dateid, adid, regionid, deviceid,
sum(requests) as requests,
sum(impressions) as impressions,
sum(appsiterevenue) as revenue,
sum(clicks) as clicks,
sum(spend) as spend,
sum(conversions) as conversions
from fact_stat_adid_b1 where date_trunc('month', dateid) = '2014-10-01' and advaccountid = 62130
group by dateid, adid, regionid, deviceid
order by dateid, adid, regionid, deviceid
limit 4 offset 0);

dateid | adid | regionid | deviceid | requests | impressions | revenue | clicks | spend | conversions------------+-------+----------+----------+----------+-------------+-----------+--------+-----------+-------------
2014-10-01 | 61118 | 2 | | 63 | 42 | 0.0000000 | 0 | 0.0000000 | 0
(1 row)

explain verbose select dateid, adid, regionid, deviceid, sum(requests) as requests,
sum(impressions) as impressions,
sum(appsiterevenue) as revenue,
sum(clicks) as clicks,
sum(spend) as spend,
sum(conversions) as conversions
from fact_stat_adid_b0 where date_trunc('month', dateid) = '2014-10-01' and advaccountid = 62130
group by dateid, adid, regionid, deviceid
order by dateid, adid, regionid, deviceid
limit 5 offset 0;

Access Path: +-SELECT LIMIT 5 [Cost: 239245.000000, Rows: 5.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000] [OutRowSz (B): 80] (PATH ID: 0)
| Output Only: 5 tuples
| Execute on: Query Initiator
| Sort Key: (fact_stat_adid_b0.DateId, fact_stat_adid_b0.AdId, fact_stat_adid_b0.RegionId, fact_stat_adid_b0.DeviceId)
| LDISTRIB_UNSEGMENTED
| +---> GROUPBY HASH (SORT OUTPUT) (LOCAL RESEGMENT GROUPS) [Cost: 239245.000000, Rows: 2721.000000 Disk(B): 0.000000 CPU(B): 4034295040.000000 Memory(B): 478896.000000 Netwrk(B): 0.000000
Parallelism: 3.000000] [OutRowSz (B): 80] (PATH ID: 2)
| | Aggregates: sum(fact_stat_adid_b0.Requests), sum(fact_stat_adid_b0.Impressions), sum(fact_stat_adid_b0.AppSiteRevenue), sum(fact_stat_adid_b0.Clicks), sum(fact_stat_adid_b0.Spend)
, sum(fact_stat_adid_b0.Conversions)
| | Group By: fact_stat_adid_b0.DateId, fact_stat_adid_b0.AdId, fact_stat_adid_b0.RegionId, fact_stat_adid_b0.DeviceId
| | Execute on: All Nodes
| | Sort Key: (fact_stat_adid_b0.DateId, fact_stat_adid_b0.AdId, fact_stat_adid_b0.RegionId, fact_stat_adid_b0.DeviceId)
| | LDISTRIB_SEGMENTED
| | +---> STORAGE ACCESS for fact_stat_adid_b0 [Cost: 206406.000000, Rows: 50428688.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 3.000000]
[OutRowSz (B): 80] (PATH ID: 3)
| | | Column Cost Aspects: [ Disk(B): 1809927876.923077 CPU(B): 1816217.958042 Memory(B): 3630868715.524451 Netwrk(B): 0.000000 Parallelism: 3.000000 ]
| | | Projection: public.fact_stat_adid_b0
| | | Materialize: fact_stat_adid_b0.DateId, fact_stat_adid_b0.AdId, fact_stat_adid_b0.RegionId, fact_stat_adid_b0.DeviceId, fact_stat_adid_b0.Requests, fact_stat_adid_b0.Impressions,
fact_stat_adid_b0.Clicks, fact_stat_adid_b0.Conversions, fact_stat_adid_b0.AppSiteRevenue, fact_stat_adid_b0.Spend
| | | Filter: (fact_stat_adid_b0.AdvAccountId = 62130)/* sel=0.398601 ndv= 1 */
| | | Filter: (date_trunc('month', fact_stat_adid_b0.DateId) = '2014-10-01 00:00:00'::timestamp)/* sel=1.000000 ndv= 61 */
| | | Execute on: All Nodes
| | | Sort Key: (fact_stat_adid_b0.DateId, fact_stat_adid_b0.AdId, fact_stat_adid_b0.CountryId, fact_stat_adid_b0.RegionId, fact_stat_adid_b0.OperatorId, fact_stat_adid_b0.Manufacture
rId, fact_stat_adid_b0.DeviceId, fact_stat_adid_b0.PlatformId, fact_stat_adid_b0.Requests, fact_stat_adid_b0.Impressions, fact_stat_adid_b0.Clicks, fact_stat_adid_b0.Conversions, fact_stat_
adid_b0.AppSiteRevenue, fact_stat_adid_b0.Spend, fact_stat_adid_b0.AdFalconRevenue)
| | | LDISTRIB_SEGMENTED

explain verbose select dateid, adid, regionid, deviceid, sum(requests) as requests,
sum(impressions) as impressions,
sum(appsiterevenue) as revenue,
sum(clicks) as clicks,
sum(spend) as spend,
sum(conversions) as conversions
from fact_stat where date_trunc('month', dateid) = '2014-10-01' and advaccountid = 62130
group by dateid, adid, regionid, deviceid
order by dateid, adid, regionid, deviceid
limit 5 offset 0;

Access Path: +-SELECT LIMIT 5 [Cost: 239445.000000, Rows: 5.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000] [OutRowSz (B): 80] (PATH ID: 0)
| Output Only: 5 tuples
| Execute on: Query Initiator
| Sort Key: (fact_stat.DateId, fact_stat.AdId, fact_stat.RegionId, fact_stat.DeviceId)
| LDISTRIB_UNSEGMENTED
| +---> GROUPBY HASH (SORT OUTPUT) (LOCAL RESEGMENT GROUPS) [Cost: 239445.000000, Rows: 2721.000000 Disk(B): 0.000000 CPU(B): 4034295040.000000 Memory(B): 478896.000000 Netwrk(B): 0.000000
Parallelism: 3.000000] [OutRowSz (B): 80] (PATH ID: 2)
| | Aggregates: sum(fact_stat.Requests), sum(fact_stat.Impressions), sum(fact_stat.AppSiteRevenue), sum(fact_stat.Clicks), sum(fact_stat.Spend), sum(fact_stat.Conversions)
| | Group By: fact_stat.DateId, fact_stat.AdId, fact_stat.RegionId, fact_stat.DeviceId
| | Execute on: All Nodes
| | Sort Key: (fact_stat.DateId, fact_stat.AdId, fact_stat.RegionId, fact_stat.DeviceId)
| | LDISTRIB_SEGMENTED
| | +---> STORAGE ACCESS for fact_stat [Cost: 206606.000000, Rows: 50428688.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 3.000000] [OutRow
Sz (B): 80] (PATH ID: 3)
| | | Column Cost Aspects: [ Disk(B): 1812383414.601399 CPU(B): 1816217.958042 Memory(B): 3630868715.524451 Netwrk(B): 0.000000 Parallelism: 3.000000 ]
| | | Projection: public.fact_stat_adid_b0
| | | Materialize: fact_stat.DateId, fact_stat.AdId, fact_stat.DeviceId, fact_stat.RegionId, fact_stat.Requests, fact_stat.Impressions, fact_stat.Clicks, fact_stat.Conversions, fact_s
tat.AppSiteRevenue, fact_stat.Spend
| | | Filter: (fact_stat.AdvAccountId = 62130)/* sel=0.398601 ndv= 1 */
| | | Filter: (date_trunc('month', fact_stat.DateId) = '2014-10-01 00:00:00'::timestamp)/* sel=1.000000 ndv= 61 */
| | | Execute on: All Nodes
| | | Sort Key: (fact_stat.DateId, fact_stat.AdId, fact_stat.DeviceId, fact_stat.ManufacturerId, fact_stat.CountryId, fact_stat.OperatorId, fact_stat.RegionId, fact_stat.PlatformId, f
act_stat.Requests, fact_stat.Impressions, fact_stat.Clicks, fact_stat.Conversions, fact_stat.AppSiteRevenue, fact_stat.Spend, fact_stat.AdFalconRevenue)
| | | LDISTRIB_SEGMENTED

id10t · January 2015

Hi!

Thank you a lot!!!

This indicates that results are not identical....

Not good, so its a bug, I expected that results will be identically.

And now comes a most important questions:

How we can be sure that Vertica's output is correct?
How we can prove that Vertica calculus/analysis is correct?

If you are CE customer, so I will recommend to use in other database, because we have no response from Vertica Team and this bug is too much critical.
Of cause you can wait for fix/patch(but take in mind that Vertica Team ignores this question) but I don't think that you or your customers will be happy with incorrect results/reports/graphs.

Regards.

Shadi_Mari · January 2015

Thank you so much for your follow up. As a matter of fact, querying the projection gives incorrect results whereas querying the anchor table which uses the projection in question gives the correct results... It is unfortunate that Vertica team is not paying attention to this issue, it seems support is their last worries... Regards

id10t · January 2015

Hi!

whereas querying the anchor table which uses the projection in question gives the correct results...

By querying anchor table you do not get all LAP benefits

And you have to point to all your users to use in anchor table explicitly.

The problem here is - if by mistake I will query not an anchor table:
I can't determine if output is correct or wrong, no errors, everything will look correct.
And mistakes occurs...

PS
BTW: is it acceptable by you? And think twice.

Take in mind its an Enterprise application with big vendor like HP.
- You always should remember what you can query and what you can not.
- As I understand project still in DEV stage and you already fall in bug, what next?
- You pointed by yourself: it seems support is their last worries...
and your issue demonstrates how Support help is critical and important.

Regards

Shadi_Mari · January 2015

No it is not acceptable and i still do not understand why this issue appears in some queries and dont appear in the others. The projection does not sort according to the specified dimensions, i asked to sort by country and instead i get data sorted by device, most likely its a bug.

I fully agree with you that support is crucial, this is why Ent license is helpful, however i thought this migh affect all customers if it was confirmed as a bug by Vertica team.

Cheers

id10t · January 2015

Hi!

i thought this migh affect all customers if it was confirmed as a bug by Vertica team.

This is my point, there are no difference between CE and EE instances.
Bug reported by CE customer will appear to EE customer.

Will hope on our mistake, because Vertica is wonderful product.
IMO: its HP policy spoils everything.

We're Moving!

Create My New Community Account Now

Select from table vs. Select from Projection

Comments

Leave a Comment