WHERE IN based on set of primary keys across multiple columns

gpascale · August 2017

Hi,

Let's say I have a table in which the primary key is composed of multiple columns - state and city. I have another service which produces a list of state, city pairs. I want to take that list, and retrieve the full row for each state, city pair in my table. So I do something like this.

SELECT * FROM MyTable WHERE (State, City) in (('New York', 'New York'), ('Washington', 'Seattle'), ('California', 'Los Angeles'), ...);

What I'm finding i that this query is vastly slower than just using one of the columns, as in

SELECT * FROM MyTable WHERE (City) in ('New York', 'Seattle', 'Los Angeles', ...);

The first query is taking 2 minutes while the second one takes under a second.

For reference, my table contains about 18 million rows and I'm passing roughly 100 state, city pairs in my WHERE IN clause. Because of the data, both queries in this particular example end up returning the same data. In theory, though, there could be multiple cities with the same name in different states, which is why I wrote the first query to match on tuples.

Any thoughts on what could be making the first query so expensive? Is there a better way to write that? My table has the primary key (State, City) but I haven't experimented with segmenting or anything else.

Thanks,
Greg

TomM · August 2017

The first thing that comes to my mind is why you would use city and state as a PK?

The second thing is have you tried running EXPLAIN on both queries to see the Optimizer's execution plan? Compare the two and you will start on the path to determining what's going on with the two queries.

gpascale · August 2017

Thanks, Tom

This is admittedly a contrived example, but the city to state is a close approximation of how my data is keyed. I used these two as the key because those two things together define a unique piece of data in my world. There may be multiple cities with the same name, but Portland, ME and Portland, OR are two separate entities.

My understanding is that this is precisely why primary key can span multiple columns. Would a better approach be to create a separate column that combines city and state, say by concatenation or hashing, and using that as the key?

The optimizer is something I definitely need to dig into, though I'm also curious about general best practices regarding laying out and accessing this kind of data.

Greg

sKwa · August 2017

Hi!

Would a better approach be to create a separate column that combines city and state, say by concatenation or hashing, and using that as the key?

Yes, hash(State, City) should improve performance significantly, but it will change querying experience. Concatenation also should improve(but significantly? Not sure).

Also you can try to create an additional projection with GROUPED clause for (State, City). Trial and error

(without EXPLAIN plan and projection/s it's hard for us to help )
GROUPED clause - https://my.vertica.com/docs/8.0.x/HTML/index.htm#Authoring/SQLReferenceManual/Statements/GROUPEDClause.htm

marcothesane · August 2017

I would try;
WITH
search(state,city) AS (
SELECT 'New York','New York'
UNION ALL SELECT 'Washington','Seattle'
UNION ALL SELECT 'California','Sacramento'
)
SELECT
*
FROM mytable JOIN search USING(state,city)

I'm almost completely sure that, this way, you change the explain plan from a complex Boolean OR(... AND ...) OR (... AND ..) (etc.) expression to an inner join with an in-line table.

Jim_Knicely · August 2017

Hi,

Have you tried Database Designer yet?

I have a 4 node cluster running on my laptop and created 2 tables, one named "mytable_just_super" using the lazy default super projection created by Vertica, and the other named "mytable_after_dbd" using the query specific design generated by Vertica's Database Designer ... Each table has 20,639,441 rows.

I ran the following query against each table getting the run times:

SELECT State, City FROM XXXX WHERE (State, City) in (
 ('Az', 'Wikieup'),
 ('Pa', 'Cresson'),
 ('Nc', 'Rocky Mount'),
 ('Ky', 'Bloomfield'),
 ('In', 'Terre Haute'),
 ('Ny', 'Amsterdam'),
 ('Al', 'Moundville'),
 ('Md', 'Massey'),
 ('Mi', 'Decker'),
 ('Tn', 'Pickwick Dam'),
 ('Nc', 'East Bend'),
 ('Wy', 'Tie Siding'),
 ('Ok', 'Rufe'),
 ('Mt', 'Huntley'),
 ('Mo', 'La Grange'),
 ('Ne', 'Mc Cool Junction'),
 ('Ms', 'Belden'),
 ('Il', 'Hardin'),
 ('Md', 'Bowie'),
 ('Va', 'Appomattox'),
 ('Il', 'Towanda'),
 ('Mn', 'Young America'),
 ('Ny', 'New Rochelle'),
 ('Pa', 'Hawk Run'),
 ('In', 'Laconia'),
 ('Tx', 'Milano'),
 ('Ga', 'Newnan'),
 ('Ny', 'Corfu'),
 ('Tx', 'Kildare'),
 ('Pa', 'Reading'),
 ('Mo', 'Mill Spring'),
 ('Ia', 'Marathon'),
 ('Ca', 'Sultana'),
 ('Mi', 'Paradise'),
 ('In', 'Tell City'),
 ('Tx', 'Quitaque'),
 ('Ny', 'Moriches'),
 ('Ca', 'City Of Industry'),
 ('Id', 'Idaho City'),
 ('Ky', 'Bradfordsville'),
 ('Co', 'Durango'),
 ('Ks', 'Arma'),
 ('Tx', 'Euless'),
 ('Ca', 'Mc Kittrick'),
 ('Md', 'Charlotte Hall'),
 ('Tx', 'Lindale'),
 ('Ia', 'Homestead'),
 ('Oh', 'Vincent'),
 ('Ca', 'Northridge'),
 ('In', 'Valparaiso'),
 ('Tx', 'Chester'),
 ('Pr', 'Yabucoa'),
 ('Ok', 'Tulsa'),
 ('In', 'Plymouth'),
 ('Ny', 'Hancock'),
 ('Va', 'Radford'),
 ('Tn', 'Silerton'),
 ('Ma', 'Foxboro'),
 ('Mn', 'Isle'),
 ('Sc', 'Smoaks'),
 ('Wa', 'Silverdale'),
 ('Mn', 'Gaylord'),
 ('Mh', 'Majuro'),
 ('Vt', 'Fairfield'),
 ('Tx', 'San Angelo'),
 ('Ga', 'Clinchfield'),
 ('Ny', 'West Hurley'),
 ('Nj', 'Stratford'),
 ('Ny', 'North Bay'),
 ('Ia', 'Lanesboro'),
 ('Mo', 'Columbia'),
 ('Mi', 'Little Lake'),
 ('Ga', 'Baconton'),
 ('Wi', 'Elroy'),
 ('Sd', 'Huron'),
 ('Pa', 'Calvin'),
 ('Tn', 'Eidson'),
 ('Nj', 'Rochelle Park'),
 ('In', 'Winslow'),
 ('Id', 'Emmett'),
 ('Ca', 'Boulevard'),
 ('Ky', 'Lexington'),
 ('Pa', 'Danboro'),
 ('Nc', 'Kinston'),
 ('La', 'Kelly'),
 ('Sd', 'Harrisburg'),
 ('La', 'Mora'),
 ('Pa', 'Keisterville'),
 ('Or', 'Salem'),
 ('Mo', 'Stella'),
 ('Al', 'Myrtlewood'),
 ('Wi', 'Windsor'),
 ('Pa', 'Stewartstown'),
 ('Va', 'Craddockville'),
 ('Mi', 'Atlantic Mine'),
 ('Va', 'Montpelier Station'),
 ('Wa', 'Stanwood'),
 ('Nv', 'Searchlight'),
 ('Ky', 'Robinson Creek'),
 ('Al', 'Anniston')
 );

Timings:
mytable_just_super --> 25973.419 ms
mytable_after_dbd --> 356.023 ms

So I'd recommend trying DBD!

gpascale · August 2017

Thanks all for the suggestions.

@sKwa Yes, I figured it would, but was hoping to avoid having to manage an extra column. I will give the GROUP idea a try.

@marcothesane I did see by running explain that my query generates a complex OR (x AND Y) OR (xx AND YY)... statement as you mentioned. The city-only query generates an ANY(ARRAY(... instead. So that could be the ticket.

Not sure if I have access to database designer - I'm in a rather restricted corporate environment - but I will see!

sKwa · August 2017

Hi!

@gpascale

Not sure if I have access to database designer...

Actually you can emulate a DBD from client(vsql for example).

I will give the GROUP idea a try.

Create projection manually when all encoding are AUTO, after it run:

SELECT dbd_design_projection_encodings('<projection_name>', '/tmp/encodings_output.sql', FALSE);

In the file /tmp/encodings_output.sql you will find a DDL for better encodings.

We're Moving!

Create My New Community Account Now

WHERE IN based on set of primary keys across multiple columns

Comments

Leave a Comment