Vertica & Clickstream Analysis?

Hello,

I am trying to use Vertica in a clickstream analysis scenario, and I need to make sure I am using it the right way!

I have Vertica installed on two "r3.4xlarge" AWS instances, each with 8 disks configured as RAID 0, and I have 10 GB of raw data, which I guess is very small compared to the data sizes Vertica is supposed to handle.

I have one anchor table with many dimensions, e.g. account, app, date, time (0-23), country, region, operator, device, platform, and manufacturer, and many facts such as requests, impressions, clicks, and conversions.

I have two deployed projections proposed by the Database Designer, and I am trying to use Vertica to execute queries over varying dimensions in varying orders, aggregated on a daily, weekly, and monthly basis. However, when I execute the following daily aggregated query:

SELECT app, date AS dayno, countryid, operatorid, regionid, deviceid, platformid, manufacturerid,
       SUM(requests), SUM(impressions), SUM(clicks), SUM(conversions)
FROM fact_stat
WHERE date >= '2014-10-01' AND date <= '2014-10-30' AND accountid = 99999
GROUP BY app, dayno, countryid, operatorid, regionid, deviceid, platformid, manufacturerid
ORDER BY app, dayno, countryid, operatorid, regionid, deviceid, platformid, manufacturerid
LIMIT 10 OFFSET 0;

It takes up to 11 seconds to execute, while the weekly aggregation takes up to 3 seconds. Is there anything I can do to optimize performance, given that the number of dimensions might increase and their order might vary?

Below is the Query Plan:
-------------------------------
 Access Path: +-SELECT  LIMIT 10 [Cost: 565K, Rows: 10] (PATH ID: 0)
 |  Output Only: 10 tuples
 |  Execute on: Query Initiator
 | +---> GROUPBY HASH (SORT OUTPUT) (LOCAL RESEGMENT GROUPS) [Cost: 565K, Rows: 482] (PATH ID: 2)
 | |      Aggregates: sum(fact_stat.Requests), sum(fact_stat.Impressions),  sum(fact_stat.Clicks), sum(fact_stat.Conversions)
 | |      Group By: fact_stat.AppSiteId, fact_stat.DateId, fact_stat.CountryId, fact_stat.OperatorId, fact_stat.RegionId, fact_stat.DeviceId, fact_stat.PlatformId, fact_stat.ManufacturerId
 | |      Execute on: All Nodes
 | | +---> STORAGE ACCESS for fact_stat [Cost: 511K, Rows: 42M] (PATH ID: 3)
 | | |      Projection: public.fact_stat_DBD_1_seg_test3_b0
 | | |      Materialize: fact_stat.DateId, fact_stat.AppSiteId, fact_stat.CountryId, fact_stat.OperatorId, fact_stat.RegionId, fact_stat.DeviceId, fact_stat.PlatformId, fact_stat.ManufacturerId, fact_stat.Requests, fact_stat.Impressions, fact_stat.Clicks, fact_stat.Conversions
 | | |      Filter: (fact_stat.AccountId = 99999)
 | | |      Filter: ((fact_stat.DateId >= '2014-10-01'::date) AND (fact_stat.DateId <= '2014-10-30'::date))
 | | |      Execute on: All Nodes
-------------------------------
Thank you


Comments

  • Hi!

    Did you try Live Aggregate Projections?

    http://vertica.tips/2014/07/30/first-look-at-live-aggregate-projections/
    http://www.vertica.com/2014/09/10/whats-new-in-dragline-7-1-0-live-aggregate-projections/
    http://www.vertica.com/category/aggregate-projections/
    http://vertica.tips/2014/11/25/data-movement-live-aggregate-projections-and-database-designer-from-d...

    and of course the documentation:

    http://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/AdministratorsGuide/Projections/Retrieving...
    http://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/AdministratorsGuide/Projections/AggregateP...
    http://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/AdministratorsGuide/Projections/SupportedA...

    **********

    1. You can try to simplify your filter. For example:
    date >= '2014-10-01' and date <= '2014-10-30'
    can be rewritten as
    date_trunc('MONTH', date) = '2014-10-01'

    Example:
    daniel=> select date_trunc('MONTH', date('2014-10-15'));
    date_trunc
    ---------------------
    2014-10-01 00:00:00
    (1 row)
    **********

    2. Why GROUPBY HASH (PATH 2)? You should try to optimize your projection for GROUPBY PIPELINED
    http://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/AdministratorsGuide/EXPLAIN/ViewingTheGROU...
    GROUPBY PIPELINED requires that inputs be presorted on the columns specified in the group, which means that HP Vertica need only retain data in the current group in memory. GROUPBY PIPELINED operations are preferred because they are generally faster and require less memory than GROUPBY HASH. GROUPBY PIPELINED is especially useful for queries that process large numbers of high-cardinality group by columns or DISTINCT aggregates.
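    For example, a projection whose ORDER BY leads with the filtered column and then the grouping columns would let the optimizer pick GROUPBY PIPELINED for your query. A minimal sketch, assuming the fact_stat column names from your plan (the projection name is made up):

    CREATE PROJECTION fact_stat_pipelined
    AS SELECT AccountId, AppSiteId, DateId, CountryId, OperatorId, RegionId,
              DeviceId, PlatformId, ManufacturerId,
              Requests, Impressions, Clicks, Conversions
       FROM fact_stat
       -- sort on the filtered column first, then the grouping columns,
       -- so rows arrive presorted and each group can be aggregated in turn
       ORDER BY AccountId, AppSiteId, DateId, CountryId, OperatorId, RegionId,
                DeviceId, PlatformId, ManufacturerId
       SEGMENTED BY HASH(AccountId, AppSiteId) ALL NODES;

    After creating it, run SELECT START_REFRESH(); and re-check the plan with EXPLAIN.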
    **********

    3. What is your PARTITION BY expression? By month? If not, try partitioning by month - it will help filter data via "partition elimination":
    https://community.vertica.com/vertica/topics/partition_elimination_testing
    https://my.vertica.com/docs/4.1/HTML/Master/13731.htm
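    If the table is not yet partitioned by month, a sketch (table and column names taken from your plan; REORGANIZE moves existing data into the new partitions):

    ALTER TABLE fact_stat
    PARTITION BY EXTRACT(year FROM DateId) * 100 + EXTRACT(month FROM DateId)
    REORGANIZE;

    With monthly partitions, a filter like DateId >= '2014-10-01' AND DateId <= '2014-10-30' lets Vertica skip every ROS container outside October 2014.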

    **********

    So you have to try:
    • optimize your projection for PIPELINED
    • rewrite the filter
    • define partitioning
    • and if the query still runs in unacceptable time, create a Live Aggregate Projection (see the sketch below)
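    For the daily rollup in your first post, a minimal LAP sketch (assuming the fact_stat columns from your plan; the projection name is made up, and LAPs require Vertica 7.1+):

    CREATE PROJECTION fact_stat_daily_lap
    AS SELECT AccountId, AppSiteId, DateId, CountryId, OperatorId, RegionId,
              DeviceId, PlatformId, ManufacturerId,
              -- measures are pre-aggregated and maintained at load time
              SUM(Requests), SUM(Impressions), SUM(Clicks), SUM(Conversions)
       FROM fact_stat
       GROUP BY AccountId, AppSiteId, DateId, CountryId, OperatorId, RegionId,
                DeviceId, PlatformId, ManufacturerId;

    Queries that aggregate over these columns can then read the precomputed sums instead of scanning 42M raw rows.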
    PS
    Just as an example, we get LAP queries (SUM only, but complex enough) in 0.01-0.07 seconds on 40 TB of data.

    Next time, please post EXPLAIN VERBOSE <query>, not just EXPLAIN <query>.
    Try, for example:
    explain select 1+1;
    explain verbose select 1+1;

  • 1) Did you try Live Aggregate Projections?

    A: This is not going to work in my case, first because of the many constraints on live aggregate projections, including the inability to add dimensions once projections are deployed. In addition, I need to run queries using GROUP BY over up to 20 dimensions in varying orders, so the number of projections that would need to be created is very high. Say I have a, b, c as dimensions; then I need to be able to group by (a,b,c), (a,c), (b,c), (a,b), (c,a), (b,a), (c,b,a), (b,a,c), and so on.

    2) Why GROUPBY HASH (PATH 2)? You should try to optimize your projection for GROUPBY PIPELINED

    A: GROUP BY PIPELINED is not always feasible in my case, because queries can group by dimensions in varying orders and on fewer dimensions than the projection's ORDER BY includes, so Vertica must revert to GROUP BY HASH. For instance, with a, b, c in the projection ORDER BY, I can have queries such as GROUP BY c, a or GROUP BY c, b, a or GROUP BY c, b; in such cases GROUP BY PIPELINED cannot be achieved.
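    One quick way to see which algorithm a given query gets is the plan itself - for example, using columns from my query above:

    EXPLAIN SELECT countryid, app, SUM(clicks)
    FROM fact_stat
    GROUP BY countryid, app;
    -- the aggregation step in the output reads either GROUPBY PIPELINED or GROUPBY HASH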

    3) Rewriting the filter did not make any performance difference.

    4) I have partitioning by month...

    Can you let me know if LAP can still be used in my case? There are plenty of ad companies who were using Vertica before LAP was released, and I wonder how they were able to achieve good performance on large data volumes!

    Below are the results of EXPLAIN VERBOSE:

    ---------------------------------------

    Access Path: +-SELECT  LIMIT 10 [Cost: 551704.000000, Rows: 10.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000] [OutRowSz (B): 96] (PATH ID: 0)
     |  Output Only: 10 tuples
     |  Execute on: Query Initiator
     |  Sort Key: (fact_stat.AppSiteId, V(104,-3), fact_stat.CountryId, fact_stat.OperatorId, fact_stat.RegionId, fact_stat.DeviceId, fact_stat.PlatformId, fact_stat.ManufacturerId)
     |  LDISTRIB_UNSEGMENTED
     | +---> GROUPBY HASH (SORT OUTPUT) (LOCAL RESEGMENT GROUPS) [Cost: 551704.000000, Rows: 516.000000 Disk(B): 0.000000 CPU(B): 4041903744.000000 Memory(B): 107328.000000 Netwrk(B): 0.000000 Parallelism: 2.000000] [OutRowSz (B): 96] (PATH ID: 2)
     | |      Aggregates: sum(fact_stat.Requests), sum(fact_stat.Impressions), sum(fact_stat.AppSiteRevenue), sum(fact_stat.Clicks)
     | |      Group By: fact_stat.AppSiteId, date_part('day', fact_stat.DateId), fact_stat.CountryId, fact_stat.OperatorId, fact_stat.RegionId, fact_stat.DeviceId, fact_stat.PlatformId, fact_stat.ManufacturerId
     | |      Execute on: All Nodes
     | |      Sort Key: (fact_stat.AppSiteId, V(104,-3), fact_stat.CountryId, fact_stat.OperatorId, fact_stat.RegionId, fact_stat.DeviceId, fact_stat.PlatformId, fact_stat.ManufacturerId)
     | |      LDISTRIB_SEGMENTED
     | | +---> STORAGE ACCESS for fact_stat [Cost: 502361.000000, Rows: 42103164.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 2.000000] [OutRowSz (B): 96] (PATH ID: 3)
     | | |      Column Cost Aspects: [ Disk(B): 3441009904.536914 CPU(B): 4348247.624161 Memory(B): 3369201879.351852 Netwrk(B): 0.000000 Parallelism: 2.000000 ]
     | | |      Projection: public.fact_stat_DBD_1_seg_test3_b0
     | | |      Materialize: fact_stat.DateId, fact_stat.AppSiteId, fact_stat.CountryId, fact_stat.OperatorId, fact_stat.RegionId, fact_stat.DeviceId, fact_stat.PlatformId, fact_stat.ManufacturerId, fact_stat.Requests, fact_stat.Impressions, fact_stat.Clicks, fact_stat.AppSiteRevenue
     | | |      Filter: (fact_stat.AccountId = 99999)/* sel=0.342282 ndv= 8 */
     | | |      Filter: (date_trunc('month', fact_stat.DateId) = '2014-10-01 00:00:00'::timestamp)/* sel=1.000000 ndv= 59 */
     | | |      Execute on: All Nodes
     | | |      Sort Key: (fact_stat.AppSiteId, fact_stat.DateId, fact_stat.TimeId, fact_stat.CountryId, fact_stat.OperatorId, fact_stat.RegionId, fact_stat.DeviceId, fact_stat.PlatformId, fact_stat.ManufacturerId)
     | | |      LDISTRIB_SEGMENTED


  • Hi!
    GROUP BY PIPELINED is not always feasible in my case
    Sorting is one of the most expensive operations { O(n*log(n)) }, and your GROUPBY HASH (SORT OUTPUT) requires sorting. You have to choose your trade-off.
     | +---> GROUPBY HASH (SORT OUTPUT)
    I still recommend using PIPELINED. I don't know your business requirements, but think about the case where a dimension doesn't fit in RAM or the GROUPBY HASH doesn't fit in memory. Spill to disk? A bad solution.
    in such cases GROUP BY PIPELINED cannot be achieved.
    I don't agree. One more time - I don't know your business requirements and limitations, but you can create several projections on the table with different orderings.
    The only exception is when you have too many different ordering variations (more than 3).
    Of course it has a major impact on license and loads, but I don't know your limitations.
    Can you let me know if LAP can still be used in my case?
    Judging by the query you provided - yes, sure. And even more - you can create a single anchor projection and a few LAPs (depending on the grouping fields).
    how they were able to achieve good performance on large data volumes!
    1. That's one of the reasons why LAPs were created.
    2. And don't try to understand. Vertica assigns engineers to every big company like FB and Zynga. Those engineers have access to internal information and to the source code, so the Vertica architecture at these companies is non-standard.

    http://vertica.tips/2014/08/27/how-vertica-met-facebooks-35tb-per-hour-ingest-sla/

    For FB, Vertica created an architecture very similar to PADB (IMO): compute nodes and data nodes (some nodes at FB do parsing only - compute nodes - while other nodes respond to requests - data nodes).


    PS
    I hate GetSatisfaction!

    @moderator
    feel free to remove my post or ban me as a hater!
  • I am not able to create too many projections in my case: I have 10 dimensions that might grow to 20, and about 100 ordering varieties, so neither LAP nor GROUP BY PIPELINED will work for all cases; they can work for a few cases though, I understand!

    So the conclusion, I guess, is that Vertica does not have a magic wand :) to solve nearly every use case. I will have to choose a few queries that can be optimized through a small set of projections or LAPs.
  • Hi!
    So the conclusion, I guess, is that Vertica does not have a magic wand :) to solve nearly every use case.
    Can you point to some product that solves every or nearly every case? :)))))
    Yes, Vertica is an RDBMS, not some "wizard". Vertica has its own niche.
    I will have to choose a few queries that can be optimized through a small set of projections or LAPs.
    Agree.

    Best, idiot.
  • I wonder how Google Analytics does such jobs with ease, no matter the number of dimensions and no matter their order!
  • Hi!
    I wonder how Google Analytics does such jobs
    What do you want from Google? Google has a cluster of ~10K nodes.


    Do you mean how other companies do it with BigQuery?

    You will laugh a long time, but BigQuery is not for BigData. Paradox? Yes.
    Google BigQuery can't do a join on more than 2 billion rows.
    That is what the Google Clowns call BigData.


    Excellent slides about BigQuery: http://www.slideshare.net/gabehamilton/how-bigquery-broke-my-heart (250GB of raw data - and BigQuery fails)
    :)))))

    Regards, ID10T.
  • 10K cluster nodes for Google Web Analytics only?!! That sounds like too large a number to believe :) probably you mean 10K nodes for all the Big Data analysis stuff they're doing.

    But Google's BigQuery is the reason behind Map/Reduce!! I am confused.

    Best
  • Navin_C (Vertica Customer)
    Let's not try to compare Vertica and BigQuery.

    They can't co-exist in the same system.

    Vertica is more suitable for OLAP workloads.

    Hope this helps.
    NC

  • Hi!
    Let's not try to compare Vertica and BigQuery.
    You are right, there is no place for a comparison, but we are not comparing - we are just discussing the Google Clowns. You need to abase yourself to work at Google.
    But Google's BigQuery is the reason behind Map/Reduce!! I am confused.
    Did you take a look at the slides? :) 250GB fails. Of course Google can offer better performance and fewer limitations, but you will need to pay a lot of money to get all the benefits of the Google platform.
    probably you mean 10K nodes for all the Big Data analysis stuff they're doing.
    No, I mean what I mean :))) Yahoo, for example, had a 42K-node cluster in 2011:
    Of the companies listed, Yahoo! has by far the most number of nodes in its massive Hadoop clusters at over 42,000 nodes as of July 2011.
    And what about Google?
    Google keeps its data center architecture a closely held secret, but occasionally journalists get a glimpse of the behind-the-scenes and take a guess at the full size of its infrastructure.
    and finally some numbers (from 2004!!!):
    And in a 2004 presentation, Dean said, one system withstood a failure of 1,600 servers in a 1,800-unit cluster.
    Let's check some customers:
    Delightfully, we were able to operate a cluster of 2400 nodes. We are extremely satisfied with this result, which validates that Akka can be used for large clusters on GCE. - See more at: http://typesafe.com/blog/running-a-2400-akka-nodes-cluster-on-google-compute-engine#sthash.fiXxuYe4....
    2400 nodes for a scalability test only? Nice. So 10K is not a big number for Google.

    I believe that Analytics runs on a cluster with more than 10K nodes; the ~10K figure is for BigQuery only.
  • Hi!

    By your EXPLAIN:

    The projection is optimized. Fully or not I don't know, but network is 0 until the last path - it means each node does its own work (by the way, it looks like you have a 2-node cluster).

    But take a look at CPU and MEMORY (memory is expected, because you do GBY HASH) - sorting kills your performance. As I said - it's a "trade-off" and you should decide what is best for you.
    Sort Key: (fact_stat.AppSiteId, V(104,-3), fact_stat.CountryId, fact_stat.OperatorId, fact_stat.RegionId, fact_stat.DeviceId,
    fact_stat.PlatformId, fact_stat.ManufacturerId)
    Eight keys for sorting.

    ***

    How fast does your data grow? Currently you have about 43 million rows. Be careful of the case where the queried data does not fit in memory.


    Best, idiot.
