Extremely Slow ST_Contains Query Across 20 Billion Rows in Vertica Place
Hello,
I have a table with about 20 billion rows in Vertica Place. Each row has a point (long, lat) and a property value.
I need to find all points that fall within a small rectangle (long1 lat1, long2 lat2).
I converted the point column in the table to a geometry type, and then I am trying to run an ST_Contains query.
Essentially:
select * from table_a where ST_Contains(ST_GeomFromText('Polygon((long1 lat1, long2 lat2, .......))'), pointColumnInTable_A_of_geometrytype)
This query seemed pretty fast on a few hundred million rows, but it has slowed down considerably as the number of rows in the table increased.
I want to understand if there's a way to speed up the query to deal with 20 billion rows. Would some kind of indexing help? I'd appreciate any pointers.
One possibility might be to split the table and then query the appropriate table, but ideally I wouldn't want to do that. I'd imagine Vertica Place has a way to take care of this problem.
Thanks.
Comments
I think I can split the tables and do range queries on long/lat columns to speed things up in a normal database.
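For example, in a plain relational setup with the point stored as two numeric columns (the longitude/latitude column names here are hypothetical, just to illustrate the idea), the range query would be roughly:
-- sketch only: assumes the point is stored as plain numeric columns
-- instead of a geometry type; long1/lat1/long2/lat2 are the rectangle corners
SELECT *
FROM table_a
WHERE longitude BETWEEN long1 AND long2
  AND latitude  BETWEEN lat1 AND lat2;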
But I thought Vertica Place simplifies working with geographical and geometrical data. Am I completely on the wrong track?
Hm... Do you have a spatial index on this table? If not, have you considered creating one?
Vertica does not support traditional indices in part because, for a variety of common Vertica workloads at least, projections are a better tool for the job. But good spatial indices are a fundamentally different data structure; they are optimized for looking up a region in an N-space, rather than for a single value or a linear range of values. So we do support spatial indices.
Usage is as documented in the Place documentation PDF, included with the Place download. Look in particular for the "STV_Create_Index" function:
SELECT STV_Create_Index( [table.]gid, [table.]geom
USING PARAMETERS index='index_name'
[, overwrite={'true' | 'false' } ]
[, max_mem_mb=maxmem_value] )
OVER()
[ AS (polygons, min_x, min_y, max_x, max_y, size_bytes) ]
FROM table
Without an index, the ST_Contains() function has to check every possible row in your data to see if it is a match. You can improve performance by having it check fewer rows, for example by adding additional predicates on non-geospatial columns.
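For instance (a rough sketch; the non-spatial column name is hypothetical, since I don't know your schema), a cheap filter evaluated alongside ST_Contains cuts down the rows the spatial predicate has to examine:
-- hypothetical non-spatial column "region_code" narrows the scan before
-- ST_Contains has to evaluate each remaining point
SELECT *
FROM table_a
WHERE region_code = 7
  AND ST_Contains(
        ST_GeomFromText('POLYGON((long1 lat1, long2 lat1, long2 lat2, long1 lat2, long1 lat1))'),
        pointColumnInTable_A_of_geometrytype);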
Indices are expensive to create and maintain, and they may not help much if you do have these sorts of mixed geospatial/regular queries. For some workloads, the extra weight just slows things down. So we strongly recommend testing them for your particular data and overall workload before deploying them widely.
Adam
I did consider creating a spatial index for the table. However, the geometry column in my table contains POINTs (long, lat), not Polygons. As I understand from the Vertica documentation, STV_Create_Index works with Polygons and MultiPolygons, not with Points.
For reference, and to provide more context, my table resembles the following:
CREATE TABLE Table_A (id INTEGER, rawlonglat varchar(500), loc_property INTEGER, coordinates geometry(10000));
where coordinates is the geometry representation of the Point(long, lat).
This table has close to 20B rows.
Now, given a small bounding rectangle ((long1, lat1), (long2, lat2)), I want to find out which rows from the table fall inside it. For this I am running the ST_Contains query, which, as I mentioned before, is terribly slow, as it is probably touching all rows.
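Concretely, the query looks roughly like this (the corner coordinates are just placeholders):
-- long1/lat1/long2/lat2 are placeholders for the rectangle corners
SELECT id, rawlonglat, loc_property
FROM Table_A
WHERE ST_Contains(
        ST_GeomFromText('POLYGON((long1 lat1, long2 lat1, long2 lat2, long1 lat2, long1 lat1))'),
        coordinates);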
Am I doing something completely wrong here?
Thanks.
Vertica is indeed designed to handle large volumes of data. You can, and should, use spatial indexes to get good performance when querying 20 billion rows.
Here are the details. But first, one general recommendation: size your geometry column according to the shape type you plan to store. In your case, since you're storing points, I would use a column sized geometry(100) at most:
CREATE TABLE Table_A (id INTEGER, rawlonglat varchar(500), loc_property INTEGER, coordinates geometry(100));
The next step is to create an index on the (rectangle) polygon. STV_Create_Index takes as arguments a polygon identifier (integer) and the polygon shape (geometry). It is OK to create an index on a single polygon input. You just need to come up with a polygon id for it (1 in the example below, but it could be any integer). Note that the input polygon geometry could be any valid, complex polygon in general; a rectangle is just a very simple polygon instance.
You'd invoke the create index API as follows:
SELECT STV_Create_Index(1, ST_GeomFromText('Polygon((long1 lat1, long2 lat2, .......))') USING PARAMETERS index='rect_idx') OVER();
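If you want to sanity-check that the index was built, you can describe it (if I recall correctly, Place provides an STV_Describe_Index function for this; double-check the exact name in your Place documentation):
-- lists the type, polygon count, and bounding box of the named index
SELECT STV_Describe_Index(USING PARAMETERS index='rect_idx') OVER();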
Once you have indexed the rectangle polygon, you can use it (instead of the original polygon geometry) to perform fast intersections. The query that finds the points inside the rectangle then becomes:
SELECT STV_Intersect(coordinates USING PARAMETERS index='rect_idx') FROM Table_A WHERE STV_Intersect(coordinates USING PARAMETERS index='rect_idx') IS NOT NULL;
In this example, STV_Intersect will output '1' (the polygon id we chose at index creation time) every time 'coordinates' intersects the indexed rectangle polygon, and NULL otherwise. Hence we add a query predicate to filter out the NULL results.
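If you want the matching rows themselves rather than the polygon id, the same predicate can be used to filter the full row set, along the lines of:
-- returns the rows whose point falls inside the indexed rectangle
SELECT id, rawlonglat, loc_property
FROM Table_A
WHERE STV_Intersect(coordinates USING PARAMETERS index='rect_idx') IS NOT NULL;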
An optional step you could take to further improve the response time is to keep your point data sorted in the table. That will help prune points that are far away from the target rectangle. Additionally, if you're running in a cluster environment (and I hope you are), you could segment your points across all the nodes to use all the computing power of your cluster during the intersection operation. The CREATE TABLE statement would look something like this:
CREATE TABLE Table_A (id INTEGER, rawlonglat varchar(500), loc_property INTEGER, coordinates geometry(100)) ORDER BY coordinates SEGMENTED BY HASH(id) ALL NODES;
Ariel
When I try to use this function, I get this error. Any idea what I should do here to resolve it?
Error calling processBlock() in User Function ST_Contains at [src/UnfencedBoolOperator.h:246], error code: 0, message: Could not compute the result. This usually happens when input geometries are too complex (5861) (SQLExecDirectW)
Hi Vankadar,
Could you please share the query that is generating the error, along with some information about the table you are querying? I would also read through Ariel's earlier post about using a spatial index. Spatial indexes allow HP Vertica Place to optimize spatial joins.
-Casey
I know this post is old.
Question: does the STV_Intersect you've just described return all the points INSIDE the polygon, or just the points that intersect/cross the polygon's boundary?
thx, Markus.