Vertica & Clickstream Analysis?
Hello,
I am trying to use Vertica in a clickstream analysis scenario and I need to make sure I am using Vertica the right way!
I have Vertica installed on two "r3.4xlarge" AWS instances with 8 disks each, configured as RAID 0, and about 10 GB of raw data, which I guess is very small compared to the data sizes Vertica is supposed to handle.
I have one anchor table with many dimensions, e.g. account, app, date, time (0-23), country, region, operator, device, platform, and manufacturer, and many facts such as requests, impressions, clicks, and conversions.
I have two deployed projections proposed by the Database Designer, and I am trying to use Vertica to execute queries with varying dimensions in varying orders, aggregated on a daily, weekly, and monthly basis. However, when I execute the following daily aggregated query:
SELECT app, date AS dayno, countryid, operatorid, regionid, deviceid, platformid, manufacturerid,
       SUM(requests), SUM(impressions), SUM(clicks), SUM(conversions)
FROM fact_stat
WHERE date >= '2014-10-01' AND date <= '2014-10-30' AND accountid = 99999
GROUP BY app, dayno, countryid, operatorid, regionid, deviceid, platformid, manufacturerid
ORDER BY app, dayno, countryid, operatorid, regionid, deviceid, platformid, manufacturerid
LIMIT 10 OFFSET 0;
It takes up to 11 seconds to execute, while the weekly aggregation takes up to 3 seconds. Is there anything I can do to optimize performance, given that the number of dimensions might increase and their order might vary?
Below is the Query Plan:
-------------------------------
Access Path: +-SELECT LIMIT 10 [Cost: 565K, Rows: 10] (PATH ID: 0)
| Output Only: 10 tuples
| Execute on: Query Initiator
| +---> GROUPBY HASH (SORT OUTPUT) (LOCAL RESEGMENT GROUPS) [Cost: 565K, Rows: 482] (PATH ID: 2)
| | Aggregates: sum(fact_stat.Requests), sum(fact_stat.Impressions), sum(fact_stat.Clicks), sum(fact_stat.Conversions)
| | Group By: fact_stat.AppSiteId, fact_stat.DateId, fact_stat.CountryId, fact_stat.OperatorId, fact_stat.RegionId, fact_stat.DeviceId, fact_stat.PlatformId, fact_stat.ManufacturerId
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for fact_stat [Cost: 511K, Rows: 42M] (PATH ID: 3)
| | | Projection: public.fact_stat_DBD_1_seg_test3_b0
| | | Materialize: fact_stat.DateId, fact_stat.AppSiteId, fact_stat.CountryId, fact_stat.OperatorId, fact_stat.RegionId, fact_stat.DeviceId, fact_stat.PlatformId, fact_stat.ManufacturerId, fact_stat.Requests, fact_stat.Impressions, fact_stat.Clicks, fact_stat.Conversions
| | | Filter: (fact_stat.AccountId = 99999)
| | | Filter: ((fact_stat.DateId >= '2014-10-01'::date) AND (fact_stat.DateId <= '2014-10-30'::date))
| | | Execute on: All Nodes
-------------------------------
Thank you
Comments
Did you try Live Aggregate Projections?
http://vertica.tips/2014/07/30/first-look-at-live-aggregate-projections/
http://www.vertica.com/2014/09/10/whats-new-in-dragline-7-1-0-live-aggregate-projections/
http://www.vertica.com/category/aggregate-projections/
http://vertica.tips/2014/11/25/data-movement-live-aggregate-projections-and-database-designer-from-d...
and of course the documentation:
http://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/AdministratorsGuide/Projections/Retrieving...
http://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/AdministratorsGuide/Projections/AggregateP...
http://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/AdministratorsGuide/Projections/SupportedA...
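Just as an illustration, a minimal LAP sketch for the daily query above, assuming the column names from your query (not a tuned design; note LAPs restrict how the anchor table can be modified afterwards):
-- A Live Aggregate Projection computes the SUMs at load time, so the
-- daily rollup reads pre-aggregated rows instead of scanning ~42M rows.
CREATE PROJECTION fact_stat_daily_lap
AS SELECT accountid, app, date, countryid, operatorid, regionid,
          deviceid, platformid, manufacturerid,
          SUM(requests) AS requests, SUM(impressions) AS impressions,
          SUM(clicks) AS clicks, SUM(conversions) AS conversions
   FROM fact_stat
   GROUP BY accountid, app, date, countryid, operatorid, regionid,
            deviceid, platformid, manufacturerid;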
**********
1. You can try to simplify your filter. For example, the date-range predicate can be re-written as a single-month predicate - see the sketch below.
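For example (this is the same date_trunc predicate that shows up later in your EXPLAIN VERBOSE output; note the month predicate also covers Oct 31, which the original range excluded):
-- Original range filter:
WHERE date >= '2014-10-01' AND date <= '2014-10-30' AND accountid = 99999
-- Re-written as a single-month predicate (date_trunc truncates to the
-- first day of the month, so this matches the whole of October):
WHERE date_trunc('month', date) = '2014-10-01' AND accountid = 99999
**********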
2. Why GROUPBY HASH (PATH 2)? You should try to optimize your projection for GROUPBY PIPELINED
http://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/AdministratorsGuide/EXPLAIN/ViewingTheGROU...
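A minimal sketch of such a projection, assuming the columns from your query - GROUPBY PIPELINED needs the grouped columns to form a prefix of the projection's ORDER BY:
-- Sketch only: sort order chosen so the daily query's GROUP BY columns
-- are a prefix of the projection ORDER BY.
-- The segmentation expression is an assumption; pick high-cardinality columns.
CREATE PROJECTION fact_stat_pipelined
AS SELECT accountid, app, date, countryid, operatorid, regionid,
          deviceid, platformid, manufacturerid,
          requests, impressions, clicks, conversions
   FROM fact_stat
   ORDER BY accountid, app, date, countryid, operatorid, regionid,
            deviceid, platformid, manufacturerid
   SEGMENTED BY HASH(accountid, app, date) ALL NODES;
**********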
3. What is your PARTITION BY expression? By month? If not, try partitioning by month - it will help filter data via "partition elimination":
https://community.vertica.com/vertica/topics/partition_elimination_testing
https://my.vertica.com/docs/4.1/HTML/Master/13731.htm
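If the table is not already partitioned, a sketch (REORGANIZE rewrites existing storage, so it can be expensive on a loaded table):
-- Partition by month so a one-month filter touches a single partition
-- (partition elimination) instead of the whole table.
ALTER TABLE fact_stat
PARTITION BY EXTRACT(YEAR FROM date) * 100 + EXTRACT(MONTH FROM date)
REORGANIZE;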
**********
So you have to try:
- optimize your projection for PIPELINED
- re-write the filter
- define partitioning
- and if the query still returns in unacceptable time => create a Live Aggregate Projection
PS: Just as an example, we get LAP queries (SUM only, but complex enough) back in 0.01 - 0.07 secs on 40 TB of data.
Next time, please post EXPLAIN VERBOSE <query>, not just EXPLAIN <query>.
A: This is not going to work in my case, first because of the many constraints on live aggregate projections, including the inability to add dimensions once projections are deployed. In addition, I need to run queries using GROUP BY with up to 20 dimensions in varying orders, so the number of projections that would need to be created is far too high. Let's say I have a, b, c as dimensions; then I need to be able to group by (a,b,c), (a,c), (b,c), (a,b), (c,a), (b,a), (c,b,a), (b,a,c), and so on.
2) Why GROUPBY HASH (PATH 2)? You should try to optimize your projection for GROUPBY PIPELINED
A: GROUP BY PIPELINED is not always feasible in my case, because I can have queries whose dimensions come in varying orders, or that use fewer dimensions than are included in the projection's ORDER BY, so Vertica must revert to GROUP BY HASH. For instance, I can have a, b, c in the projection ORDER BY and queries such as GROUP BY c, a or GROUP BY c, b, a or GROUP BY c, b; in all such cases GROUP BY PIPELINED cannot be achieved (see the EXPLAIN sketch below this list).
3) Rewriting the filter did not make any performance difference.
4) I have partitioning by month...
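To illustrate the prefix issue, with a hypothetical table t that has a projection sorted on (a, b, c):
-- The plan shows GROUPBY PIPELINED when the grouped columns form a
-- prefix of the projection's sort order, and GROUPBY HASH otherwise.
EXPLAIN SELECT a, b, SUM(x) FROM t GROUP BY a, b;  -- (a, b) is a prefix  -> PIPELINED
EXPLAIN SELECT c, b, SUM(x) FROM t GROUP BY c, b;  -- (c, b) is no prefix -> HASH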
Can you let me know if LAP can still be used in my case? There are plenty of ad companies who were using Vertica before LAP was released, and I wonder how they were able to achieve good performance on large data volumes!
Below are the results of EXPLAIN VERBOSE:
---------------------------------------
Access Path: +-SELECT LIMIT 10 [Cost: 551704.000000, Rows: 10.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000] [OutRowSz (B): 96] (PATH ID: 0)
| Output Only: 10 tuples
| Execute on: Query Initiator
| Sort Key: (fact_stat.AppSiteId, V(104,-3), fact_stat.CountryId, fact_stat.OperatorId, fact_stat.RegionId, fact_stat.DeviceId, fact_stat.PlatformId, fact_stat.ManufacturerId)
| LDISTRIB_UNSEGMENTED
| +---> GROUPBY HASH (SORT OUTPUT) (LOCAL RESEGMENT GROUPS) [Cost: 551704.000000, Rows: 516.000000 Disk(B): 0.000000 CPU(B): 4041903744.000000 Memory(B): 107328.000000 Netwrk(B): 0.000000 Parallelism: 2.000000] [OutRowSz (B): 96] (PATH ID: 2)
| | Aggregates: sum(fact_stat.Requests), sum(fact_stat.Impressions), sum(fact_stat.AppSiteRevenue), sum(fact_stat.Clicks)
| | Group By: fact_stat.AppSiteId, date_part('day', fact_stat.DateId), fact_stat.CountryId, fact_stat.OperatorId, fact_stat.RegionId, fact_stat.DeviceId, fact_stat.PlatformId, fact_stat.ManufacturerId
| | Execute on: All Nodes
| | Sort Key: (fact_stat.AppSiteId, V(104,-3), fact_stat.CountryId, fact_stat.OperatorId, fact_stat.RegionId, fact_stat.DeviceId, fact_stat.PlatformId, fact_stat.ManufacturerId)
| | LDISTRIB_SEGMENTED
| | +---> STORAGE ACCESS for fact_stat [Cost: 502361.000000, Rows: 42103164.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 2.000000] [OutRowSz (B): 96] (PATH ID: 3)
| | | Column Cost Aspects: [ Disk(B): 3441009904.536914 CPU(B): 4348247.624161 Memory(B): 3369201879.351852 Netwrk(B): 0.000000 Parallelism: 2.000000 ]
| | | Projection: public.fact_stat_DBD_1_seg_test3_b0
| | | Materialize: fact_stat.DateId, fact_stat.AppSiteId, fact_stat.CountryId, fact_stat.OperatorId, fact_stat.RegionId, fact_stat.DeviceId, fact_stat.PlatformId, fact_stat.ManufacturerId, fact_stat.Requests, fact_stat.Impressions, fact_stat.Clicks, fact_stat.AppSiteRevenue
| | | Filter: (fact_stat.AccountId = 99999)/* sel=0.342282 ndv= 8 */
| | | Filter: (date_trunc('month', fact_stat.DateId) = '2014-10-01 00:00:00'::timestamp)/* sel=1.000000 ndv= 59 */
| | | Execute on: All Nodes
| | | Sort Key: (fact_stat.AppSiteId, fact_stat.DateId, fact_stat.TimeId, fact_stat.CountryId, fact_stat.OperatorId, fact_stat.RegionId, fact_stat.DeviceId, fact_stat.PlatformId, fact_stat.ManufacturerId)
| | | LDISTRIB_SEGMENTED
By the query you provided - yes, sure, LAP can still be used. Only one exception - you have too many different ordering variations (more than 3). You can even create a single anchor projection and a few LAPs (depending on the grouping fields). Of course this has an impact on license and loads, but I don't know your limitations.
1. That's one of the reasons why LAPs were created.
2. And don't try to understand how. Vertica provides engineers to every big company like FB and Zynga. Vertica engineers have access to internal information and to the source code, so the Vertica architecture for these companies is non-standard.
http://vertica.tips/2014/08/27/how-vertica-met-facebooks-35tb-per-hour-ingest-sla/
For FB, Vertica created an architecture very similar to PADB (IMO): compute nodes and data nodes (some nodes at FB do parsing only - compute nodes - while other nodes respond to requests - data nodes).
PS
I hate GetSatisfaction!
@moderator
feel free to remove this post or ban me as a hater!
So the conclusion, I guess, is that Vertica does not have a magic wand to solve nearly every use case. I will have to choose a few queries that can be optimized through a small set of projections or LAPs.
Yes, Vertica is an RDBMS and not some "wizard". Vertica has its own niche. Agreed.
Best, idiot.
Do you mean how other companies do it with BigQuery?
You will laugh a long time, but BigQuery is not for BigData. Paradox? Yes.
Google BigQuery can't do a join on more than 2 billion rows.
That is what the Google clowns call BigData.
Excellent slides about BigQuery: http://www.slideshare.net/gabehamilton/how-bigquery-broke-my-heart (250 GB of raw data - and BigQuery fails)
))))
Regards, ID10T.
But Google's BigQuery is the reason behind Map/Reduce!! I am confused.
Best
They can't co-exist in the same system.
Vertica is more suitable for OLAP workloads.
Hope this helps.
NC
Did you take a look at the slides? 250 GB fails. Of course Google can give better performance and fewer limitations, but you will have to pay a lot of money to get all the benefits of the Google platform. No, I mean what I mean :))) Yahoo, for example, had a 42K-node cluster in 2011. And what about Google? And finally, some numbers (from 2004!). Let's check some customers: 2400 nodes for a scalability test only? Nice. So, 10K is not a big number for Google.
I believe that Analytics runs on a cluster with more than 10K nodes; the ~10K figure is only for BigQuery.
By your EXPLAIN:
The projection is optimized. Fully or not, I don't know, but network is 0 until the last path - it means each node does its own work (by the way, it looks like you have a 2-node cluster).
But take a look at CPU and MEMORY (mem is expected, because you do GBY HASH) - sorting kills your performance. As I told you, it's a "trade off" and you should decide what is best for you. 5 keys for sorting.
***
How fast does your data grow? Currently you have about 43 million rows. Be careful about the case when the query data does not fit in memory.
Best, idiot.