How to get ANY row in aggregate function
When writing SQL with GROUP BY, quite often I want to get value for some column from any row.
For example, I am looking to get SQL that was executed most often in last several hours (by digest), and see one SQL as sample:
select count() cnt, min(request) req
where time > current_timestamp - 0.1
and request_type = 'QUERY'
group by digest
order by count() desc;
I am using MIN as it is GROUP BY.
I actually do not care what rows will be picked by MIN function, I would be fine with ANY. MIN function consume CPU resources, and taking MIN from very long strings like SQL is resource - intensive.
Is there a way to ask Vertica to return ANY row value in GROUP BY?
I can easily implement this ANY function as c++ UDX aggregate (UDAFs to be exact). Just pick up first value in partition, save it, ignore rest rows, and return saved value at the end. Would be trivial implementation.
mosheg Vertica Employee Administrator
With Vertica ANY and ALL operators, you can make comparisons on subqueries that return multiple rows.
As shown here: https://www.vertica.com/docs/11.1.x/HTML/Content/Authoring/AnalyzingData/Queries/Subqueries/ANYSOMEAndALL.htm
To get the query digest count without using Min on the Query field, consider the following example
with list1 as ( select digest, count(1) as count from dc_requests_issued where request_type='QUERY' group by digest) select m.digest, list1.count, left(m.request,60) from dc_requests_issued m, list1 where list1.digest=m.digest LIMIT 1 OVER (PARTITION BY m.digest ORDER BY m.digest);0
I'm not sure I follow what you're trying to do here. This is a simpler way to get the most frequent queries over the last hour, for example:
select request, count(request) cnt
where time > current_timestamp - interval '1 hour'
and request_type = 'QUERY'
group by 1
order by 2 desc;
Otherwise, the FIRST_VALUE [Analytic] or TS_FIRST_VALUE might get the result you're looking for.
Thanks for taking time to answer.
Yes, I did not explain well enough what I am trying to achieve.
FYI you SQL will provide count of identical SQL. In real life nobody is running same SQL again and again.
Users are running same SQL with different literal constant.
For example, right now one of user is running batch job, issued over 85000 SQL in last 8 hours, and those SQL are identical and differ only in value, like symbol = 'MSFT'. Instead of MSFT, it is each time different symbol. Your SQL will consider those queries different, while they are same for my purpose. Vertica has new column "digest" in dc_request_issued, it is calculated hash from static for of request, with whitespace removed, all lowercase, literal constant replaced with some enumerated strings. Digest is same for all 85000 SQL issued by my user, despite requests are not literally identical. That allow to identify most often issued requests in database much better.
(Above comment is not related to question).
More details what I was looking for.
Request column is quite long, users easily can create multipage SQL over 100K long. Digest group requests nicely together, and I want to see any SQL that is inside group. because I am using group by digest, and request column is not in group by, it should be some aggregate applied to request column. I am using MIN but it requires a bunch of comparing. It works just fine, I was just curious is there any way to get any row from group by, without involving calculations like MIN. FIRST_VALUE analytical would do but need sorting, which is unnecessary.
When I am doing GROUP BY, for columns not participating in GROUP BY clause I need to apply some aggregate. Vertica have aggregate functions MIN, MAX. I was looking for aggregate function ANY, that would return value from any row in group. Ideally should be no overhead associated with call to ANY, (i.e. no sorting).
Just curious if something like this exist.
Thanks, that would do.