Copying 116 billion row table - Disk usage while in progress

I copied a 116 billion row table between two 8-node clusters running version 5.0 using the copy from direct command. The table has one segmented projection and its associated segmented buddy projection, with a K-safety of 1. The total space used for the table on each node is 346GB. While the table was copying, I monitored the file system usage on the destination cluster's nodes and noticed it had increased by around 700GB; it then cleared down and began steadily increasing again. It did this a few times until the table had finished copying, at which point disk usage dropped to ~340GB. Could you help me understand the fluctuation in disk space used during the copy?

Comments

  • Hi Mark, well, the start of the story is that Vertica stores its data sorted. What you're seeing is Vertica sorting your data. 340GB is large enough that we use an external (disk-backed) sorting algorithm rather than an in-memory sort. Vertica's external sort is a merge-based algorithm: we write many smaller temp files, then merge those into a smaller number of big temp files, and so on, until we have just one large sorted file. As we write each merged file, disk usage of course increases. At the moment we finish writing a merged file, you have two copies of that piece of data: all of the little files plus the one big file. At that point we start deleting the many smaller files, and disk usage decreases. Adam
  • Many thanks Adam, very helpful
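To make the disk-usage pattern Adam describes concrete, here is a minimal, hypothetical sketch of an external merge sort (not Vertica's actual implementation): sorted runs are written to temp files, then merged into one output file. While the merge is being written, the data briefly exists twice on disk (all the small runs plus the growing merged file), and the runs are only deleted after the merge completes, which is the rise-and-fall in usage seen during the copy.

```python
import heapq
import os
import tempfile


def _write_run(sorted_values):
    """Write one sorted run to a temp file and return its path."""
    f = tempfile.NamedTemporaryFile("w", delete=False)
    for v in sorted_values:
        f.write(f"{v}\n")
    f.close()
    return f.name


def external_sort(values, run_size):
    """Sort values too large for memory: write sorted runs to disk,
    then k-way merge them into one file. Illustrative only."""
    # Phase 1: write many small sorted runs (disk usage grows to ~1x data).
    run_files, buf = [], []
    for v in values:
        buf.append(v)
        if len(buf) >= run_size:
            run_files.append(_write_run(sorted(buf)))
            buf = []
    if buf:
        run_files.append(_write_run(sorted(buf)))

    # Phase 2: merge all runs into one sorted file. While this file is
    # being written, runs + merged output coexist (~2x data on disk),
    # analogous to the ~700GB peaks observed during the copy.
    merged = tempfile.NamedTemporaryFile("w+", delete=False)
    streams = [open(path) for path in run_files]
    for line in heapq.merge(*streams, key=int):
        merged.write(line)
    merged.flush()
    for s in streams:
        s.close()

    # Phase 3: delete the small runs -- this is when usage drops back down.
    for path in run_files:
        os.remove(path)

    merged.seek(0)
    result = [int(line) for line in merged]
    merged.close()
    os.remove(merged.name)
    return result
```

Real systems repeat the merge phase over multiple passes when there are too many runs to merge at once, so the grow-then-drop cycle can occur several times, matching the repeated fluctuation described in the question.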
