Select from table vs. Select from Projection
I have a difference in performance when selecting from an anchor table as opposed to selecting directly from a projection although both query plans shows that the right projection is used.
For instance, It takes 8 seconds to execute this query:
select dateid, adid, countryid, deviceid, sum(requests) as requests,
sum(impressions) as impressions,
sum(appsiterevenue) as revenue,
sum(clicks) as clicks,
sum(spend) as spend,
sum(conversions) as conversions
from fact_stat2 where date_trunc('month', dateid) = '2014-10-01' and advaccountid = 62130
group by dateid, adid, countryid, deviceid
order by dateid, adid, countryid, deviceid
limit 10 offset 0;
whereas it takes 1 second to execute the same query from a projection.
select dateid, adid, countryid, deviceid, sum(requests) as requests,
sum(impressions) as impressions,
sum(appsiterevenue) as revenue,
sum(clicks) as clicks,
sum(spend) as spend,
sum(conversions) as conversions
from fact_stat2_adid_dev where date_trunc('month', dateid) = '2014-10-01' and advaccountid = 62130
group by dateid, adid, countryid, deviceid
order by dateid, adid, countryid, deviceid
limit 10 offset 0;
Please advise if there is something missing.
Thanks
Query plan for Stm #1:
--------------------------------------------------------------------------------------------
Access Path: +-SELECT LIMIT 10 [Cost: 345726.000000, Rows: 10.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000] [OutRowSz (B): 80] (PATH ID: 0)
| Output Only: 10 tuples
| Execute on: Query Initiator
| Sort Key: (fact_stat2.DateId, fact_stat2.AdId, fact_stat2.CountryId, fact_stat2.DeviceId)
| LDISTRIB_UNSEGMENTED
| +---> GROUPBY HASH (SORT OUTPUT) (LOCAL RESEGMENT GROUPS) [Cost: 345726.000000, Rows: 2747.000000 Disk(B): 0.000000 CPU(B): 4034295040.000000 Memory(B): 483472.000000 Netwrk(B): 0.000000
Parallelism: 2.000000] [OutRowSz (B): 80] (PATH ID: 2)
| | Aggregates: sum(fact_stat2.Requests), sum(fact_stat2.Impressions), sum(fact_stat2.AppSiteRevenue), sum(fact_stat2.Clicks), sum(fact_stat2.Spend), sum(fact_stat2.Conversions)
| | Group By: fact_stat2.DateId, fact_stat2.AdId, fact_stat2.CountryId, fact_stat2.DeviceId
| | Execute on: All Nodes
| | Sort Key: (fact_stat2.DateId, fact_stat2.AdId, fact_stat2.CountryId, fact_stat2.DeviceId)
| | LDISTRIB_SEGMENTED
| | +---> STORAGE ACCESS for fact_stat2 [Cost: 296467.000000, Rows: 50428688.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 2.000000] [OutRo
wSz (B): 80] (PATH ID: 3)
| | | Column Cost Aspects: [ Disk(B): 1702315015.160839 CPU(B): 1306595.356643 Memory(B): 3630868705.958018 Netwrk(B): 0.000000 Parallelism: 2.000000 ]
| | | Projection: public.fact_stat2_adid_dev
| | | Materialize: fact_stat2.DateId, fact_stat2.AdId, fact_stat2.DeviceId, fact_stat2.CountryId, fact_stat2.Requests, fact_stat2.Impressions, fact_stat2.Clicks, fact_stat2.Conversion
s, fact_stat2.Spend, fact_stat2.AppSiteRevenue
| | | Filter: (fact_stat2.AdvAccountId = 62130)/* sel=0.398601 ndv= 2 */
| | | Filter: (date_trunc('month', fact_stat2.DateId) = '2014-10-01 00:00:00'::timestamp)/* sel=1.000000 ndv= 61 */
| | | Execute on: All Nodes
| | | Sort Key: (fact_stat2.DateId, fact_stat2.AdId, fact_stat2.DeviceId, fact_stat2.OperatorId, fact_stat2.CountryId, fact_stat2.RegionId, fact_stat2.PlatformId, fact_stat2.Requests,
fact_stat2.Impressions, fact_stat2.Clicks, fact_stat2.Conversions, fact_stat2.Spend, fact_stat2.AppSiteRevenue, fact_stat2.AdFalconRevenue)
| | | LDISTRIB_SEGMENTED
--------------------------------------------------------------------------------------------
Query plan for Stm #2:
--------------------------------------------------------------------------------------------
Access Path: +-SELECT LIMIT 10 [Cost: 346769.000000, Rows: 10.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000] [OutRowSz (B): 80] (PATH ID: 0)
| Output Only: 10 tuples
| Execute on: Query Initiator
| Sort Key: (fact_stat2_adid_dev.DateId, fact_stat2_adid_dev.AdId, fact_stat2_adid_dev.CountryId, fact_stat2_adid_dev.DeviceId, fact_stat2_adid_dev.RegionId, fact_stat2_adid_dev.OperatorI
d, fact_stat2_adid_dev.PlatformId, fact_stat2_adid_dev.Requests, fact_stat2_adid_dev.Impressions, fact_stat2_adid_dev.Clicks, fact_stat2_adid_dev.Conversions, fact_stat2_adid_dev.Spend, fac
t_stat2_adid_dev.AppSiteRevenue, fact_stat2_adid_dev.AdFalconRevenue)
| LDISTRIB_UNSEGMENTED
| +---> GROUPBY PIPELINED [Cost: 346769.000000, Rows: 2747.000000 Disk(B): 0.000000 CPU(B): 4034295040.000000 Memory(B): 219760.000000 Netwrk(B): 0.000000 Parallelism: 2.000000] [OutRowSz
(B): 80] (PATH ID: 2)
| | Aggregates: sum(fact_stat2_adid_dev.Requests), sum(fact_stat2_adid_dev.Impressions), sum(fact_stat2_adid_dev.AppSiteRevenue), sum(fact_stat2_adid_dev.Clicks), sum(fact_stat2_adid_
dev.Spend), sum(fact_stat2_adid_dev.Conversions)
| | Group By: fact_stat2_adid_dev.DateId, fact_stat2_adid_dev.AdId, fact_stat2_adid_dev.CountryId, fact_stat2_adid_dev.DeviceId
| | Execute on: All Nodes
| | Sort Key: (fact_stat2_adid_dev.DateId, fact_stat2_adid_dev.AdId, fact_stat2_adid_dev.CountryId, fact_stat2_adid_dev.DeviceId, fact_stat2_adid_dev.RegionId, fact_stat2_adid_dev.Ope
ratorId, fact_stat2_adid_dev.PlatformId, fact_stat2_adid_dev.Requests, fact_stat2_adid_dev.Impressions, fact_stat2_adid_dev.Clicks, fact_stat2_adid_dev.Conversions, fact_stat2_adid_dev.Spen
d, fact_stat2_adid_dev.AppSiteRevenue, fact_stat2_adid_dev.AdFalconRevenue)
| | LDISTRIB_UNSEGMENTED
| | +---> STORAGE ACCESS for fact_stat2_adid_dev [Cost: 297516.000000, Rows: 50428688.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 2.00000
0] [OutRowSz (B): 80] (PATH ID: 3)
| | | Column Cost Aspects: [ Disk(B): 1710909397.034965 CPU(B): 1306595.356643 Memory(B): 3630868705.958018 Netwrk(B): 0.000000 Parallelism: 2.000000 ]
| | | Projection: public.fact_stat2_adid_dev
| | | Materialize: fact_stat2_adid_dev.DateId, fact_stat2_adid_dev.AdId, fact_stat2_adid_dev.CountryId, fact_stat2_adid_dev.DeviceId, fact_stat2_adid_dev.Requests, fact_stat2_adid_dev
.Impressions, fact_stat2_adid_dev.Clicks, fact_stat2_adid_dev.Conversions, fact_stat2_adid_dev.Spend, fact_stat2_adid_dev.AppSiteRevenue
| | | Filter: (fact_stat2_adid_dev.AdvAccountId = 62130)/* sel=0.398601 ndv= 2 */
| | | Filter: (date_trunc('month', fact_stat2_adid_dev.DateId) = '2014-10-01 00:00:00'::timestamp)/* sel=1.000000 ndv= 61 */
| | | Execute on: All Nodes
| | | Sort Key: (fact_stat2_adid_dev.DateId, fact_stat2_adid_dev.AdId, fact_stat2_adid_dev.CountryId, fact_stat2_adid_dev.DeviceId, fact_stat2_adid_dev.RegionId, fact_stat2_adid_dev.O
peratorId, fact_stat2_adid_dev.PlatformId, fact_stat2_adid_dev.Requests, fact_stat2_adid_dev.Impressions, fact_stat2_adid_dev.Clicks, fact_stat2_adid_dev.Conversions, fact_stat2_adid_dev.Sp
end, fact_stat2_adid_dev.AppSiteRevenue, fact_stat2_adid_dev.AdFalconRevenue)
| | | LDISTRIB_SEGMENTED
--------------------------------------------------------------------------------------------
For instance, It takes 8 seconds to execute this query:
select dateid, adid, countryid, deviceid, sum(requests) as requests,
sum(impressions) as impressions,
sum(appsiterevenue) as revenue,
sum(clicks) as clicks,
sum(spend) as spend,
sum(conversions) as conversions
from fact_stat2 where date_trunc('month', dateid) = '2014-10-01' and advaccountid = 62130
group by dateid, adid, countryid, deviceid
order by dateid, adid, countryid, deviceid
limit 10 offset 0;
whereas it takes 1 second to execute the same query from a projection.
select dateid, adid, countryid, deviceid, sum(requests) as requests,
sum(impressions) as impressions,
sum(appsiterevenue) as revenue,
sum(clicks) as clicks,
sum(spend) as spend,
sum(conversions) as conversions
from fact_stat2_adid_dev where date_trunc('month', dateid) = '2014-10-01' and advaccountid = 62130
group by dateid, adid, countryid, deviceid
order by dateid, adid, countryid, deviceid
limit 10 offset 0;
Please advise if there is something missing.
Thanks
Query plan for Stm #1:
--------------------------------------------------------------------------------------------
Access Path: +-SELECT LIMIT 10 [Cost: 345726.000000, Rows: 10.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000] [OutRowSz (B): 80] (PATH ID: 0)
| Output Only: 10 tuples
| Execute on: Query Initiator
| Sort Key: (fact_stat2.DateId, fact_stat2.AdId, fact_stat2.CountryId, fact_stat2.DeviceId)
| LDISTRIB_UNSEGMENTED
| +---> GROUPBY HASH (SORT OUTPUT) (LOCAL RESEGMENT GROUPS) [Cost: 345726.000000, Rows: 2747.000000 Disk(B): 0.000000 CPU(B): 4034295040.000000 Memory(B): 483472.000000 Netwrk(B): 0.000000
Parallelism: 2.000000] [OutRowSz (B): 80] (PATH ID: 2)
| | Aggregates: sum(fact_stat2.Requests), sum(fact_stat2.Impressions), sum(fact_stat2.AppSiteRevenue), sum(fact_stat2.Clicks), sum(fact_stat2.Spend), sum(fact_stat2.Conversions)
| | Group By: fact_stat2.DateId, fact_stat2.AdId, fact_stat2.CountryId, fact_stat2.DeviceId
| | Execute on: All Nodes
| | Sort Key: (fact_stat2.DateId, fact_stat2.AdId, fact_stat2.CountryId, fact_stat2.DeviceId)
| | LDISTRIB_SEGMENTED
| | +---> STORAGE ACCESS for fact_stat2 [Cost: 296467.000000, Rows: 50428688.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 2.000000] [OutRo
wSz (B): 80] (PATH ID: 3)
| | | Column Cost Aspects: [ Disk(B): 1702315015.160839 CPU(B): 1306595.356643 Memory(B): 3630868705.958018 Netwrk(B): 0.000000 Parallelism: 2.000000 ]
| | | Projection: public.fact_stat2_adid_dev
| | | Materialize: fact_stat2.DateId, fact_stat2.AdId, fact_stat2.DeviceId, fact_stat2.CountryId, fact_stat2.Requests, fact_stat2.Impressions, fact_stat2.Clicks, fact_stat2.Conversion
s, fact_stat2.Spend, fact_stat2.AppSiteRevenue
| | | Filter: (fact_stat2.AdvAccountId = 62130)/* sel=0.398601 ndv= 2 */
| | | Filter: (date_trunc('month', fact_stat2.DateId) = '2014-10-01 00:00:00'::timestamp)/* sel=1.000000 ndv= 61 */
| | | Execute on: All Nodes
| | | Sort Key: (fact_stat2.DateId, fact_stat2.AdId, fact_stat2.DeviceId, fact_stat2.OperatorId, fact_stat2.CountryId, fact_stat2.RegionId, fact_stat2.PlatformId, fact_stat2.Requests,
fact_stat2.Impressions, fact_stat2.Clicks, fact_stat2.Conversions, fact_stat2.Spend, fact_stat2.AppSiteRevenue, fact_stat2.AdFalconRevenue)
| | | LDISTRIB_SEGMENTED
--------------------------------------------------------------------------------------------
Query plan for Stm #2:
--------------------------------------------------------------------------------------------
Access Path: +-SELECT LIMIT 10 [Cost: 346769.000000, Rows: 10.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000] [OutRowSz (B): 80] (PATH ID: 0)
| Output Only: 10 tuples
| Execute on: Query Initiator
| Sort Key: (fact_stat2_adid_dev.DateId, fact_stat2_adid_dev.AdId, fact_stat2_adid_dev.CountryId, fact_stat2_adid_dev.DeviceId, fact_stat2_adid_dev.RegionId, fact_stat2_adid_dev.OperatorI
d, fact_stat2_adid_dev.PlatformId, fact_stat2_adid_dev.Requests, fact_stat2_adid_dev.Impressions, fact_stat2_adid_dev.Clicks, fact_stat2_adid_dev.Conversions, fact_stat2_adid_dev.Spend, fac
t_stat2_adid_dev.AppSiteRevenue, fact_stat2_adid_dev.AdFalconRevenue)
| LDISTRIB_UNSEGMENTED
| +---> GROUPBY PIPELINED [Cost: 346769.000000, Rows: 2747.000000 Disk(B): 0.000000 CPU(B): 4034295040.000000 Memory(B): 219760.000000 Netwrk(B): 0.000000 Parallelism: 2.000000] [OutRowSz
(B): 80] (PATH ID: 2)
| | Aggregates: sum(fact_stat2_adid_dev.Requests), sum(fact_stat2_adid_dev.Impressions), sum(fact_stat2_adid_dev.AppSiteRevenue), sum(fact_stat2_adid_dev.Clicks), sum(fact_stat2_adid_
dev.Spend), sum(fact_stat2_adid_dev.Conversions)
| | Group By: fact_stat2_adid_dev.DateId, fact_stat2_adid_dev.AdId, fact_stat2_adid_dev.CountryId, fact_stat2_adid_dev.DeviceId
| | Execute on: All Nodes
| | Sort Key: (fact_stat2_adid_dev.DateId, fact_stat2_adid_dev.AdId, fact_stat2_adid_dev.CountryId, fact_stat2_adid_dev.DeviceId, fact_stat2_adid_dev.RegionId, fact_stat2_adid_dev.Ope
ratorId, fact_stat2_adid_dev.PlatformId, fact_stat2_adid_dev.Requests, fact_stat2_adid_dev.Impressions, fact_stat2_adid_dev.Clicks, fact_stat2_adid_dev.Conversions, fact_stat2_adid_dev.Spen
d, fact_stat2_adid_dev.AppSiteRevenue, fact_stat2_adid_dev.AdFalconRevenue)
| | LDISTRIB_UNSEGMENTED
| | +---> STORAGE ACCESS for fact_stat2_adid_dev [Cost: 297516.000000, Rows: 50428688.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 2.00000
0] [OutRowSz (B): 80] (PATH ID: 3)
| | | Column Cost Aspects: [ Disk(B): 1710909397.034965 CPU(B): 1306595.356643 Memory(B): 3630868705.958018 Netwrk(B): 0.000000 Parallelism: 2.000000 ]
| | | Projection: public.fact_stat2_adid_dev
| | | Materialize: fact_stat2_adid_dev.DateId, fact_stat2_adid_dev.AdId, fact_stat2_adid_dev.CountryId, fact_stat2_adid_dev.DeviceId, fact_stat2_adid_dev.Requests, fact_stat2_adid_dev
.Impressions, fact_stat2_adid_dev.Clicks, fact_stat2_adid_dev.Conversions, fact_stat2_adid_dev.Spend, fact_stat2_adid_dev.AppSiteRevenue
| | | Filter: (fact_stat2_adid_dev.AdvAccountId = 62130)/* sel=0.398601 ndv= 2 */
| | | Filter: (date_trunc('month', fact_stat2_adid_dev.DateId) = '2014-10-01 00:00:00'::timestamp)/* sel=1.000000 ndv= 61 */
| | | Execute on: All Nodes
| | | Sort Key: (fact_stat2_adid_dev.DateId, fact_stat2_adid_dev.AdId, fact_stat2_adid_dev.CountryId, fact_stat2_adid_dev.DeviceId, fact_stat2_adid_dev.RegionId, fact_stat2_adid_dev.O
peratorId, fact_stat2_adid_dev.PlatformId, fact_stat2_adid_dev.Requests, fact_stat2_adid_dev.Impressions, fact_stat2_adid_dev.Clicks, fact_stat2_adid_dev.Conversions, fact_stat2_adid_dev.Sp
end, fact_stat2_adid_dev.AppSiteRevenue, fact_stat2_adid_dev.AdFalconRevenue)
| | | LDISTRIB_SEGMENTED
--------------------------------------------------------------------------------------------
0
Comments
Sorting...
I think we already discussed about GBY HASH vs GBY PIPELINED.
Table: Projection: Query against table performs sorting. Are you sure queries are identical?
Sort key (PATH ID: 3): Can you post table and projection definition?
Queries are identical as I sent in my question and the only difference is the FROM statement.
Below is the Table Super Projection Definition and Custom Projection Definition.
CREATE PROJECTION fact_stat2_mysuper(
AdvAccountId ENCODING RLE,
PubAccountId ENCODING COMMONDELTA_COMP,
AppSiteId ENCODING BLOCKDICT_COMP,
CampaignId ENCODING RLE,
AdGroupId ENCODING COMMONDELTA_COMP,
AdId ENCODING COMMONDELTA_COMP,
CountryId ENCODING COMMONDELTA_COMP,
RegionId ENCODING COMMONDELTA_COMP,
OperatorId ENCODING COMMONDELTA_COMP,
ManufacturerId ENCODING COMMONDELTA_COMP,
DeviceId ENCODING COMMONDELTA_COMP,
PlatformId ENCODING COMMONDELTA_COMP,
DateId ENCODING RLE,
TimeId ENCODING RLE,
Requests ENCODING DELTARANGE_COMP,
Impressions ENCODING DELTARANGE_COMP,
Clicks ENCODING COMMONDELTA_COMP,
Conversions ENCODING RLE,
AppSiteRevenue ENCODING RLE,
Spend ENCODING COMMONDELTA_COMP,
AdFalconRevenue ENCODING RLE
)
AS
SELECT AdvAccountId,
PubAccountId,
AppSiteId,
CampaignId,
AdGroupId,
AdId,
CountryId,
RegionId,
OperatorId,
ManufacturerId,
DeviceId,
PlatformId,
DateId,
TimeId,
Requests,
Impressions,
Clicks,
Conversions,
AppSiteRevenue,
Spend,
AdFalconRevenue
FROM public.fact_stat2
ORDER BY AdvAccountId,
DateId,
TimeId,
CampaignId,
AdGroupId,
AdId,
DeviceId,
PlatformId,
ManufacturerId,
OperatorId,
RegionId,
CountryId,
Conversions,
Clicks,
AdFalconRevenue,
AppSiteRevenue,
Spend,
CountryId,
AdGroupId,
AppSiteId,
Impressions,
Requests,
PubAccountId,
AppSiteId
SEGMENTED BY MODULARHASH (AdvAccountId) ALL NODES OFFSET 0;
CREATE PROJECTION fact_stat2_adid_dev(
AdvAccountId ENCODING RLE,
DateId ENCODING RLE,
AdId ENCODING RLE,
CountryId ENCODING RLE,
DeviceId ENCODING RLE,
RegionId ENCODING RLE,
OperatorId ENCODING RLE,
PlatformId ENCODING RLE,
Requests ENCODING COMMONDELTA_COMP,
Impressions ENCODING COMMONDELTA_COMP,
Clicks ENCODING BLOCKDICT_COMP,
Conversions ENCODING RLE,
AppSiteRevenue ENCODING RLE,
Spend ENCODING COMMONDELTA_COMP,
AdFalconRevenue ENCODING RLE
)
AS
SELECT AdvAccountId,
DateId,
AdId,
DeviceId,
OperatorId,
CountryId,
RegionId,
PlatformId,
Requests,
Impressions,
Clicks,
Conversions,
AppSiteRevenue,
Spend,
AdFalconRevenue
FROM public.fact_stat2
ORDER BY AdvAccountId,
DateId,
AdId,
DeviceId,
OperatorId,
CountryId,
RegionId,
PlatformId,
Requests,
Impressions,
Clicks,
Conversions,
Spend,
AppSiteRevenue,
AdFalconRevenue
SEGMENTED BY HASH (AdId) ALL NODES OFFSET 0;
Vertica is cost-optimized database, it means that Vertica will use in plan that uses in less resources(we are dealing with BigData, right?). That is a reason why plan with GBY HASH is chosen and not against a projection with GBY PIPELINED.
2. Your super and dev projections are different (encoding, columns order), so don't expect same time.
But now I see another bug: If queries are identical, so why columns order sort in PATH 3 isn't same(see my post above)? Columns order has impact: is not equals to If you are EE customer you should open a ticket on this issue.
If you are CE customer - we can only hope for the best.
To bypass this, Is it a good practice to query from a projection directly?! I dont think so because this will complicate reporting UI logic...
Note the sort Order is different in each Query Plan...
Query1:
select dateid, adgroupname, regionid, deviceid, sum(requests) as requests,
sum(impressions) as impressions,
sum(appsiterevenue) as revenue,
sum(clicks) as clicks,
sum(spend) as spend,
sum(conversions) as conversions
from fact_stat2_adv_adgroupid where date_trunc('month', dateid) = '2014-10-01' and advaccountid = 62130
group by dateid, adgroupname, regionid, deviceid
order by dateid, adgroupname, regionid, deviceid
limit 10 offset 0;
Query1: Execution Plan:
----------------------------------------------------------------------------------------------------
Access Path: +-SELECT LIMIT 10 [Cost: 286105.000000, Rows: 10.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000] [OutRowSz (B): 80] (PATH ID: 0)
| Output Only: 10 tuples
| Execute on: Query Initiator
| Sort Key: (fact_stat2_adv_adgroupid.DateId, fact_stat2_adv_adgroupid.AdGroupName, fact_stat2_adv_adgroupid.RegionId, fact_stat2_adv_adgroupid.DeviceId)
| LDISTRIB_UNSEGMENTED
| +---> GROUPBY HASH (SORT OUTPUT) (LOCAL RESEGMENT GROUPS) [Cost: 286105.000000, Rows: 489.000000 Disk(B): 0.000000 CPU(B): 4373392480.000000 Memory(B): 86064.000000 Netwrk(B): 0.000000 P
arallelism: 3.000000] [OutRowSz (B): 80] (PATH ID: 2)
| | Aggregates: sum(fact_stat2_adv_adgroupid.Requests), sum(fact_stat2_adv_adgroupid.Impressions), sum(fact_stat2_adv_adgroupid.AppSiteRevenue), sum(fact_stat2_adv_adgroupid.Clicks),
sum(fact_stat2_adv_adgroupid.Spend), sum(fact_stat2_adv_adgroupid.Conversions)
| | Group By: fact_stat2_adv_adgroupid.DateId, fact_stat2_adv_adgroupid.AdGroupName, fact_stat2_adv_adgroupid.RegionId, fact_stat2_adv_adgroupid.DeviceId
| | Execute on: All Nodes
| | Sort Key: (fact_stat2_adv_adgroupid.DateId, fact_stat2_adv_adgroupid.AdGroupName, fact_stat2_adv_adgroupid.RegionId, fact_stat2_adv_adgroupid.DeviceId)
| | LDISTRIB_SEGMENTED
| | +---> STORAGE ACCESS for fact_stat2_adv_adgroupid [Cost: 250512.000000, Rows: 54667406.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 3.
000000] [OutRowSz (B): 80] (PATH ID: 3)
| | | Column Cost Aspects: [ Disk(B): 2290805877.251908 CPU(B): 2129669.862595 Memory(B): 3936057440.366373 Netwrk(B): 0.000000 Parallelism: 3.000000 ]
| | | Projection: public.fact_stat2_adv_adgroupid
| | | Materialize: fact_stat2_adv_adgroupid.DateId, fact_stat2_adv_adgroupid.AdGroupName, fact_stat2_adv_adgroupid.RegionId, fact_stat2_adv_adgroupid.DeviceId, fact_stat2_adv_adgroupi
d.Requests, fact_stat2_adv_adgroupid.Impressions, fact_stat2_adv_adgroupid.Clicks, fact_stat2_adv_adgroupid.Conversions, fact_stat2_adv_adgroupid.Spend, fact_stat2_adv_adgroupid.AppSiteReve
nue
| | | Filter: (fact_stat2_adv_adgroupid.AdvAccountId = 62130)/* sel=0.526718 ndv= 1 */
| | | Filter: (date_trunc('month', fact_stat2_adv_adgroupid.DateId) = '2014-10-01 00:00:00'::timestamp)/* sel=1.000000 ndv= 61 */
| | | Execute on: All Nodes
| | | Sort Key: (fact_stat2_adv_adgroupid.DateId, fact_stat2_adv_adgroupid.TimeId, fact_stat2_adv_adgroupid.AdGroupName, fact_stat2_adv_adgroupid.AdGroupId, fact_stat2_adv_adgroupid.C
ountryId, fact_stat2_adv_adgroupid.RegionId, fact_stat2_adv_adgroupid.OperatorId, fact_stat2_adv_adgroupid.ManufacturerId, fact_stat2_adv_adgroupid.DeviceId, fact_stat2_adv_adgroupid.Platfo
rmId, fact_stat2_adv_adgroupid.Requests, fact_stat2_adv_adgroupid.Impressions, fact_stat2_adv_adgroupid.Clicks, fact_stat2_adv_adgroupid.Conversions, fact_stat2_adv_adgroupid.Spend, fact_st
at2_adv_adgroupid.AppSiteRevenue, fact_stat2_adv_adgroupid.AdFalconRevenue)
| | | LDISTRIB_SEGMENTED
-----------------------------------------------------------------------------------------------------
Query2 (anchor table):
select dateid, f.name, regionid, deviceid, sum(requests) as requests,
sum(impressions) as impressions,
sum(appsiterevenue) as revenue,
sum(clicks) as clicks,
sum(spend) as spend,
sum(conversions) as conversions
from fact_stat2_adv a inner join id_name_adgroup f on a.adgroupid = f.id where date_trunc('month', dateid) = '2014-10-01' and advaccountid = 62130
group by dateid, f.name, regionid, deviceid
order by dateid, f.name, regionid, deviceid
limit 10 offset 0;
Query2 Execution Plan:
---------------------------------------------------------------------------------------------------------
Access Path: +-SELECT LIMIT 10 [Cost: 4689075.000000, Rows: 10.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000] [OutRowSz (B): 335] (PATH ID: 0)
| Output Only: 10 tuples
| Execute on: Query Initiator
| Sort Key: (fact_stat2_adv_adgroupid.DateId, fact_stat2_adv_adgroupid.AdGroupId, fact_stat2_adv_adgroupid.RegionId, fact_stat2_adv_adgroupid.DeviceId)
| LDISTRIB_UNSEGMENTED
| +---> GROUPBY HASH (SORT OUTPUT) (GLOBAL RESEGMENT GROUPS) (LOCAL RESEGMENT GROUPS) [Cost: 4689075.000000, Rows: 333.000000 Disk(B): 0.000000 CPU(B): 18313581010.000000 Memory(B): 228438
.000000 Netwrk(B): 18313692565.000000 Parallelism: 3.000000] [OutRowSz (B): 335] (PATH ID: 2)
| | Aggregates: sum(fact_stat2_adv_adgroupid.Requests), sum(fact_stat2_adv_adgroupid.Impressions), sum(fact_stat2_adv_adgroupid.AppSiteRevenue), sum(fact_stat2_adv_adgroupid.Clicks),
sum(fact_stat2_adv_adgroupid.Spend), sum(fact_stat2_adv_adgroupid.Conversions)
| | Group By: fact_stat2_adv_adgroupid.DateId, fact_stat2_adv_adgroupid.AdGroupId, fact_stat2_adv_adgroupid.RegionId, fact_stat2_adv_adgroupid.DeviceId
| | Execute on: All Nodes
| | Sort Key: (fact_stat2_adv_adgroupid.DateId, fact_stat2_adv_adgroupid.AdGroupId, fact_stat2_adv_adgroupid.RegionId, fact_stat2_adv_adgroupid.DeviceId)
| | LDISTRIB_SEGMENTED
| | +---> STORAGE ACCESS for <No Alias> [Cost: 470324.000000, Rows: 54667406.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 3.000000] [OutRo
wSz (B): 335] (PATH ID: 3)
| | | Column Cost Aspects: [ Disk(B): 2291289142.717557 CPU(B): 2129669.862595 Memory(B): 17438906556.442711 Netwrk(B): 0.000000 Parallelism: 3.000000 ]
| | | Projection: public.fact_stat2_adv_adgroupid
| | | Materialize: fact_stat2_adv_adgroupid.DateId, fact_stat2_adv_adgroupid.AdGroupId, fact_stat2_adv_adgroupid.RegionId, fact_stat2_adv_adgroupid.DeviceId, fact_stat2_adv_adgroupid.
Requests, fact_stat2_adv_adgroupid.Impressions, fact_stat2_adv_adgroupid.Clicks, fact_stat2_adv_adgroupid.Conversions, fact_stat2_adv_adgroupid.Spend, fact_stat2_adv_adgroupid.AppSiteRevenu
e
| | | Filter: (fact_stat2_adv_adgroupid.AdvAccountId = 62130)/* sel=0.526718 ndv= 1 */
| | | Filter: (date_trunc('month', fact_stat2_adv_adgroupid.DateId) = '2014-10-01 00:00:00'::timestamp)/* sel=1.000000 ndv= 61 */
| | | Execute on: All Nodes
| | | Sort Key: (fact_stat2_adv_adgroupid.DateId, fact_stat2_adv_adgroupid.TimeId, fact_stat2_adv_adgroupid.AdGroupName, fact_stat2_adv_adgroupid.AdGroupId, fact_stat2_adv_adgroupid.C
ountryId, fact_stat2_adv_adgroupid.RegionId, fact_stat2_adv_adgroupid.OperatorId, fact_stat2_adv_adgroupid.ManufacturerId, fact_stat2_adv_adgroupid.DeviceId, fact_stat2_adv_adgroupid.Platfo
rmId, fact_stat2_adv_adgroupid.Requests, fact_stat2_adv_adgroupid.Impressions, fact_stat2_adv_adgroupid.Clicks, fact_stat2_adv_adgroupid.Conversions, fact_stat2_adv_adgroupid.Spend, fact_st
at2_adv_adgroupid.AppSiteRevenue, fact_stat2_adv_adgroupid.AdFalconRevenue)
| | | LDISTRIB_SEGMENTED
----------------------------------------------------------------------------------------------------------
Projection Definition:
CREATE PROJECTION fact_stat2_adv_adgroupid(
AdvAccountId ENCODING RLE,
DateId ENCODING RLE,
TimeId ENCODING RLE,
AdGroupName ENCODING RLE,
AdGroupId ENCODING RLE,
CountryId ENCODING RLE,
RegionId ENCODING RLE,
OperatorId ENCODING RLE,
ManufacturerId ENCODING COMMONDELTA_COMP,
DeviceId ENCODING COMMONDELTA_COMP,
PlatformId ENCODING COMMONDELTA_COMP,
Requests ENCODING COMMONDELTA_COMP,
Impressions ENCODING COMMONDELTA_COMP,
Clicks ENCODING BLOCKDICT_COMP,
Conversions ENCODING RLE,
AppSiteRevenue ENCODING RLE,
Spend ENCODING COMMONDELTA_COMP,
AdFalconRevenue ENCODING RLE
)
AS
SELECT AdvAccountId, DateId, TimeId, AdGroupId, f.Name as AdGroupName,
CountryId, RegionId, OperatorId, ManufacturerId, DeviceId,
PlatformId, Requests, Impressions, Clicks, Conversions,
AppSiteRevenue, Spend, AdFalconRevenue
FROM
public.fact_stat2_adv a INNER JOIN public.id_name_adgroup f ON a.AdGroupId = f.Id
ORDER BY AdvAccountId, DateId, TimeId, AdGroupId,
AdGroupName, CountryId, RegionId, OperatorId, ManufacturerId,
DeviceId, PlatformId, Requests, Impressions, Clicks,
Conversions, Spend, AppSiteRevenue, AdFalconRevenue
SEGMENTED BY HASH (AdGroupId) ALL NODES OFFSET 0;
I don't see any issue/problem with it, but in general no, its a bad practice - by querying projection directly you actually and so you limits optimizer in alternatives.
Optimizer should choose a best plan by testing a different projections, but you do not gives him a choice. Agree. Me too and yes is it possible that both are missing something. As I wrote - will hope for a best.
*************************
I will hope that some employee will take a look and will find our mistake, because if its a bug... its not good.
*************************
Please, lets investigate an impact of this issue, its very important:
Can you check - does output of both queries are identical? Run both queries as sub-queries and in outer query sort by all columns while columns order are same and preserved. Check:
- output is identical (num of rows, all rows are identical in both sets)
- post explain verbose
Thank you very much.I don't see any issue/problem with it, but in general no, its a bad practice - by querying projection directly you actually and so you limits optimizer in alternatives.
Optimizer should choose a best plan by testing a different projections, but you do not gives him a choice. Agree. Me too and yes is it possible that both are missing something. As I wrote - will hope for a best.
*************************
I will hope that some employee will take a look and will find our mistake, because if its a bug... its not good.
*************************
Please, lets investigate an impact of this issue, its very important:
Can you check - does output of both queries are identical? Run both queries as sub-queries and in outer query sort by all columns while columns order are same and preserved. Check:
- output is identical (num of rows, all rows are identical in both sets)
- post explain verbose
Thank you very much.or you can check output by creating (temporary)tables from result and: any output will indicate on difference.
Example:
dateid | adid | regionid | deviceid | requests | impressions | revenue | clicks | spend | conversions------------+-------+----------+----------+----------+-------------+-----------+--------+-----------+-------------
2014-10-01 | 61118 | | 20202 | 2 | 2 | 0.0000000 | 0 | 0.0000000 | 0
2014-10-01 | 61118 | | 21008 | 1 | 1 | 0.0000000 | 0 | 0.0000000 | 0
2014-10-01 | 61118 | | 20200 | 1 | 1 | 0.0000000 | 0 | 0.0000000 | 0
2014-10-01 | 61118 | | 21016 | 2 | 2 | 0.0000000 | 0 | 0.0000000 | 0
2014-10-01 | 61118 | | | 81 | 55 | 0.0000000 | 0 | 0.0000000 | 0
(5 rows)
This indicates that results are not identical....
Trying out the same query but with the second buddy projection; results in different order: dateid | adid | regionid | deviceid | requests | impressions | revenue | clicks | spend | conversions------------+-------+----------+----------+----------+-------------+-----------+--------+-----------+-------------
2014-10-01 | 61118 | | 21016 | 2 | 2 | 0.0000000 | 0 | 0.0000000 | 0
2014-10-01 | 61118 | | | 81 | 55 | 0.0000000 | 0 | 0.0000000 | 0
2014-10-01 | 61118 | | 20202 | 2 | 2 | 0.0000000 | 0 | 0.0000000 | 0
2014-10-01 | 61118 | | 21008 | 1 | 1 | 0.0000000 | 0 | 0.0000000 | 0
2014-10-01 | 61118 | | 20200 | 1 | 1 | 0.0000000 | 0 | 0.0000000 | 0
(5 rows)
However, B0 - B1 gives: dateid | adid | regionid | deviceid | requests | impressions | revenue | clicks | spend | conversions------------+-------+----------+----------+----------+-------------+-----------+--------+-----------+-------------
2014-10-01 | 61118 | 2 | | 63 | 42 | 0.0000000 | 0 | 0.0000000 | 0
(1 row) Access Path: +-SELECT LIMIT 5 [Cost: 239245.000000, Rows: 5.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000] [OutRowSz (B): 80] (PATH ID: 0)
| Output Only: 5 tuples
| Execute on: Query Initiator
| Sort Key: (fact_stat_adid_b0.DateId, fact_stat_adid_b0.AdId, fact_stat_adid_b0.RegionId, fact_stat_adid_b0.DeviceId)
| LDISTRIB_UNSEGMENTED
| +---> GROUPBY HASH (SORT OUTPUT) (LOCAL RESEGMENT GROUPS) [Cost: 239245.000000, Rows: 2721.000000 Disk(B): 0.000000 CPU(B): 4034295040.000000 Memory(B): 478896.000000 Netwrk(B): 0.000000
Parallelism: 3.000000] [OutRowSz (B): 80] (PATH ID: 2)
| | Aggregates: sum(fact_stat_adid_b0.Requests), sum(fact_stat_adid_b0.Impressions), sum(fact_stat_adid_b0.AppSiteRevenue), sum(fact_stat_adid_b0.Clicks), sum(fact_stat_adid_b0.Spend)
, sum(fact_stat_adid_b0.Conversions)
| | Group By: fact_stat_adid_b0.DateId, fact_stat_adid_b0.AdId, fact_stat_adid_b0.RegionId, fact_stat_adid_b0.DeviceId
| | Execute on: All Nodes
| | Sort Key: (fact_stat_adid_b0.DateId, fact_stat_adid_b0.AdId, fact_stat_adid_b0.RegionId, fact_stat_adid_b0.DeviceId)
| | LDISTRIB_SEGMENTED
| | +---> STORAGE ACCESS for fact_stat_adid_b0 [Cost: 206406.000000, Rows: 50428688.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 3.000000]
[OutRowSz (B): 80] (PATH ID: 3)
| | | Column Cost Aspects: [ Disk(B): 1809927876.923077 CPU(B): 1816217.958042 Memory(B): 3630868715.524451 Netwrk(B): 0.000000 Parallelism: 3.000000 ]
| | | Projection: public.fact_stat_adid_b0
| | | Materialize: fact_stat_adid_b0.DateId, fact_stat_adid_b0.AdId, fact_stat_adid_b0.RegionId, fact_stat_adid_b0.DeviceId, fact_stat_adid_b0.Requests, fact_stat_adid_b0.Impressions,
fact_stat_adid_b0.Clicks, fact_stat_adid_b0.Conversions, fact_stat_adid_b0.AppSiteRevenue, fact_stat_adid_b0.Spend
| | | Filter: (fact_stat_adid_b0.AdvAccountId = 62130)/* sel=0.398601 ndv= 1 */
| | | Filter: (date_trunc('month', fact_stat_adid_b0.DateId) = '2014-10-01 00:00:00'::timestamp)/* sel=1.000000 ndv= 61 */
| | | Execute on: All Nodes
| | | Sort Key: (fact_stat_adid_b0.DateId, fact_stat_adid_b0.AdId, fact_stat_adid_b0.CountryId, fact_stat_adid_b0.RegionId, fact_stat_adid_b0.OperatorId, fact_stat_adid_b0.Manufacture
rId, fact_stat_adid_b0.DeviceId, fact_stat_adid_b0.PlatformId, fact_stat_adid_b0.Requests, fact_stat_adid_b0.Impressions, fact_stat_adid_b0.Clicks, fact_stat_adid_b0.Conversions, fact_stat_
adid_b0.AppSiteRevenue, fact_stat_adid_b0.Spend, fact_stat_adid_b0.AdFalconRevenue)
| | | LDISTRIB_SEGMENTED
Access Path: +-SELECT LIMIT 5 [Cost: 239445.000000, Rows: 5.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000] [OutRowSz (B): 80] (PATH ID: 0)
| Output Only: 5 tuples
| Execute on: Query Initiator
| Sort Key: (fact_stat.DateId, fact_stat.AdId, fact_stat.RegionId, fact_stat.DeviceId)
| LDISTRIB_UNSEGMENTED
| +---> GROUPBY HASH (SORT OUTPUT) (LOCAL RESEGMENT GROUPS) [Cost: 239445.000000, Rows: 2721.000000 Disk(B): 0.000000 CPU(B): 4034295040.000000 Memory(B): 478896.000000 Netwrk(B): 0.000000
Parallelism: 3.000000] [OutRowSz (B): 80] (PATH ID: 2)
| | Aggregates: sum(fact_stat.Requests), sum(fact_stat.Impressions), sum(fact_stat.AppSiteRevenue), sum(fact_stat.Clicks), sum(fact_stat.Spend), sum(fact_stat.Conversions)
| | Group By: fact_stat.DateId, fact_stat.AdId, fact_stat.RegionId, fact_stat.DeviceId
| | Execute on: All Nodes
| | Sort Key: (fact_stat.DateId, fact_stat.AdId, fact_stat.RegionId, fact_stat.DeviceId)
| | LDISTRIB_SEGMENTED
| | +---> STORAGE ACCESS for fact_stat [Cost: 206606.000000, Rows: 50428688.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 3.000000] [OutRow
Sz (B): 80] (PATH ID: 3)
| | | Column Cost Aspects: [ Disk(B): 1812383414.601399 CPU(B): 1816217.958042 Memory(B): 3630868715.524451 Netwrk(B): 0.000000 Parallelism: 3.000000 ]
| | | Projection: public.fact_stat_adid_b0
| | | Materialize: fact_stat.DateId, fact_stat.AdId, fact_stat.DeviceId, fact_stat.RegionId, fact_stat.Requests, fact_stat.Impressions, fact_stat.Clicks, fact_stat.Conversions, fact_s
tat.AppSiteRevenue, fact_stat.Spend
| | | Filter: (fact_stat.AdvAccountId = 62130)/* sel=0.398601 ndv= 1 */
| | | Filter: (date_trunc('month', fact_stat.DateId) = '2014-10-01 00:00:00'::timestamp)/* sel=1.000000 ndv= 61 */
| | | Execute on: All Nodes
| | | Sort Key: (fact_stat.DateId, fact_stat.AdId, fact_stat.DeviceId, fact_stat.ManufacturerId, fact_stat.CountryId, fact_stat.OperatorId, fact_stat.RegionId, fact_stat.PlatformId, f
act_stat.Requests, fact_stat.Impressions, fact_stat.Clicks, fact_stat.Conversions, fact_stat.AppSiteRevenue, fact_stat.Spend, fact_stat.AdFalconRevenue)
| | | LDISTRIB_SEGMENTED
Thank you a lot!!! Not good, so its a bug, I expected that results will be identically.
And now comes a most important questions:
- How we can be sure that Vertica's output is correct?
- How we can prove that Vertica calculus/analysis is correct?
If you are CE customer, so I will recommend to use in other database, because we have no response from Vertica Team and this bug is too much critical.Of cause you can wait for fix/patch(but take in mind that Vertica Team ignores this question) but I don't think that you or your customers will be happy with incorrect results/reports/graphs.
Regards.
And you have to point to all your users to use in anchor table explicitly.
The problem here is - if by mistake I will query not an anchor table:
I can't determine if output is correct or wrong, no errors, everything will look correct.
And mistakes occurs...
PS
BTW: is it acceptable by you? And think twice.
Take in mind its an Enterprise application with big vendor like HP.
- You always should remember what you can query and what you can not.
- As I understand project still in DEV stage and you already fall in bug, what next?
- You pointed by yourself: it seems support is their last worries...
and your issue demonstrates how Support help is critical and important.
Regards
I fully agree with you that support is crucial, this is why Ent license is helpful, however i thought this migh affect all customers if it was confirmed as a bug by Vertica team.
Cheers
Bug reported by CE customer will appear to EE customer.
Will hope on our mistake, because Vertica is wonderful product.
IMO: its HP policy spoils everything.