User-User similarity?

Hello, I received the following question from a Vertica community user. I am testing Vertica with R right now. I went through the example SDK for R and looking to see what I can do with it. I also read about presto but I don't see it being available just yet. I am also trying to build user-user similarity with Vertica. My input table is 3.9m rows of 372k users x itms. For example - user itm wt 1,000,156 19 1 1,000,156 11 25 1,000,156 18 1 1,000,156 1 1 As first step I am calculating the sum of weights for each user, or co-occurrence if you ignore weights. Following your example "http://www.vertica.com/2011/09/21/counting-triangles/", I ran - create table usr_cooccurence as select mat1.uid mat1_uid, mat2.uid mat2_uid, sum(mat1.wt + mat2.wt) total_wt from v_user_item_matrix mat1 inner join v_user_item_matrix mat2 on (mat1.style = mat2.style and mat1.uid < mat2.uid) group by 2,1 segmented by hash(mat1_uid, mat2_uid) all nodes; But I run out of temp space (increased it to 32G so that's all I can do). I can do this on Hadoop and am wondering if that is the right way to go. But so far, I like the convenience of SQL in Vertica. I think you have done a very good job and want to see if this is the right tool for the job. Can somebody please take a look at this and provide a recommendation? Thanks, Matt

Comments

  • Thanks for asking Matt, we will look into this. In the mean time, there is a blog post from February on Presto that states "Today, you can use Presto on top of Vertica to accelerate your data mining analysis. Soon we will support in-database operations as well. Stay tuned." (http://www.vertica.com/2013/02/21/presto-distributed-r-for-big-data/)
  • can anybody take a stab at this?
  • Hi Matt: I am checking on getting a response for you. Kanti
  • Hey Matt, Hm... If you stick an EXPLAIN in front of that SELECT statement, what does the result look like? Also, have you run this query through the DBD?; are your projections reasonable for it? In particular, it looks like you're hashing by id but joining by 'style'. Looks like that example is careful to hash and join on the same field so adjacent fields tend to end up on the same DB node. Adam
  • Also, how many tuples do you expect to get out of this thing? (What does "select count(distinct style) from v_user_item_matrix;" return?)
  • Adam, Thank you for your help. The user has responded with the following: "I tried the alternate route of partitioning by style but it still failed with the same error. Meanwhile, I am trying to play with the R UDX capabilities, as I am looking for a replacement for Hadoop. I was able to run through the basic examples but the default BLAS is slow so I went ahead and updated it. It works from simple R but does not work from Vertica. This is crucial as even in parallel, it would be too slow otherwise to run any simulations on." Any guidance here?
  • I think this needs more than just re-partitioning... I asked about a few different things; I'd need some responses to the rest of the questions in order to help more there. This user is exploring and is trying some new ideas, on a system that's at least not overprovisioned for the work at hand; it'll likely take some fiddling (and/or running the DBD) to get it right. Regarding R, I'm not the best person for guidance there; I work on our C++ and (occasionally) Java UDx functionality, but not so much R. An obvious question, though, would be "what do you mean by 'does not work'?" Also, "how, specifically, did you go about updating the package?" This user could also try contacting the BLAS user forum -- I'm certainly glad to try to help if they can give more details, but we're using the standard version of R; if something doesn't work, it's most likely either an issue configuring the package or a bug in the package.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file