Database copy results in broken instance

I have a new install on CentOS 6.4, using the RPM installer, and a successful run of the install_vertica script.

A database was then copied from another single-node host, as described in https://my.vertica.com/docs/7.0.x/HTML/index.htm#Authoring/AdministratorsGuide/BackupRestore/Copying...

vbr.py on the source indicated the copy was successful, but attempting to start the database in admintools results in:
    *** Starting database: hwdw ***
    Starting nodes:
    All nodes to start have failed startup tests.  Nothing to do.
    Database start up failed.  Failed startup tests.  Nothing to do.
    Press RETURN to continue
I note that running "/etc/init.d/verticad status" reports "Vertica: No Spread installed". I can find no trace of Spread on this machine; there is no spreadd init script.

Before copying the database to this host, I was able to create the empty target database via admintools, but this server had no data in Vertica prior to the copy operation.

Is Spread needed for a single-node install? Any suggestions on how to resolve this problem?


Comments

  • To confirm the install, I dropped the database on the copy target, recreated a new one, created a simple table, inserted and selected data successfully. Then I ran "vbr.py --task copycluster", which again resulted in an un-startable database.

    This is the config file I am using with vbr.py:
    [Misc]
    snapshotName = copyhwdw
    restorePointLimit = 5
    verticaConfig = False
    tempDir = /tmp/vbr
    retryCount = 5
    retryDelay = 1
    [Database]
    dbName = hwdw
    dbUser = dbadmin
    dbPassword =
    dbPromptForPassword = True
    [Transmission]
    encrypt = False
    checksum = False
    port_rsync = 50000
    bwlimit = 50000
    [Mapping]
    ; backupDir is not used for cluster copy
    v_hwdw_node0001=<hostname>:/backup/vertica
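Before running the copy it can be worth confirming the config file even parses and that [Mapping] names exactly one node. This is just a sketch on a scratch copy of the file (the hostname is a placeholder, and the abbreviated sections stand in for the full config above):

```shell
# Write a trimmed-down stand-in for the vbr.py config (placeholder hostname).
cat > /tmp/copyhwdw.ini <<'EOF'
[Misc]
snapshotName = copyhwdw
[Database]
dbName = hwdw
[Transmission]
port_rsync = 50000
[Mapping]
v_hwdw_node0001 = node02.example.com:/backup/vertica
EOF
# Parse it and check the [Mapping] section lists exactly one node.
python3 - <<'EOF'
import configparser
c = configparser.ConfigParser()
c.read("/tmp/copyhwdw.ini")
nodes = list(c["Mapping"])
assert nodes == ["v_hwdw_node0001"], nodes
print("mapping nodes:", nodes)
EOF
# The copy itself is then run on the source cluster as the dbadmin user:
# vbr.py --task copycluster --config-file /tmp/copyhwdw.ini
```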
  • Looking at the adminTools log I see this entry for an attempt to start the database:
    Jan 22 23:56:36  <BashAdapter: [email protected]>: ['/opt/vertica/oss/python/bin/python -m vertica.config.VDatabase /localdata/vertica/hwdw/v_hwdw_node0001_catalog /opt/vertica']
    Jan 22 23:56:42  <BashAdapter: [email protected]>: (rc=0)
    {
     "name": "hwdw", 
     "spreadversion": 3014, 
     "version": 3014, 
     "flags": {}, 
     "deps": [
      [
       "v_hwdw_node0001"
      ]
     ], 
     "nodes": [
      {
       "host": "192.168.17.72", 
       "catalogpath": "/localdata/vertica/hwdw/v_hwdw_node0001_catalog", 
       "controlnode": "45035996273704980", 
       "name": "v_hwdw_node0001", 
       "storagelocs": [
        "/localdata/vertica/hwdw/v_hwdw_node0001_data"
       ], 
       "oid": "45035996273704980", 
       "startcmd": "/opt/vertica/bin/vertica -D \"/localdata/vertica/hwdw/v_hwdw_node0001_catalog\" -C \"hwdw\" -n \"v_hwdw_node0001\" -h \"192.168.17.72\" -p 5433 -P \"4803\"", 
       "port": "5433"
      }
     ], 
     "willupgrade": false, 
     "controlmode": "pt2pt"
    }
If I extract the call to the vertica daemon and simplify it, I seem to get a working database instance:
    /opt/vertica/bin/vertica -D /localdata/vertica/hwdw/v_hwdw_node0001_catalog -C hwdw -n v_hwdw_node0001 -h 192.168.17.72 -p 5433 -P 4803
    Could the vbr.py database copy be writing some invalid metadata or an invalid start script?
  • Possible workaround:

The source system was using 127.0.0.1 as the address in the :Site section of the Catalog/config.cat file, but the target had the local eth0 IP. After I edited the target's config.cat to use 127.0.0.1 as the address, the database started normally via adminTools.
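As a sketch of that edit (the field layout below is purely illustrative, not the exact config.cat format, which varies by version; always stop the database and back the file up first), the change amounts to rewriting the node address back to the loopback IP:

```shell
# Demo on a scratch file; on a real system the file is
# <catalog dir>/Catalog/config.cat and its layout is version-specific.
cat > /tmp/config.cat.demo <<'EOF'
:Site v_hwdw_node0001
 address: 192.168.17.72
EOF
cp /tmp/config.cat.demo /tmp/config.cat.demo.bak   # back up before editing
# Rewrite the eth0 IP back to the loopback address.
sed -i 's/address: 192\.168\.17\.72/address: 127.0.0.1/' /tmp/config.cat.demo
grep 'address: 127.0.0.1' /tmp/config.cat.demo
```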

    Is this an issue with the copy method in vbr.py, or is my config file for the copy invalid above?

Even if my config file is invalid, the copy should not leave the database in an unstartable state.
>> attempting to run "/etc/init.d/verticad status" reports "Vertica: No Spread installed"

I had the same problem with 7.0.0. The spreadd checks look to be left over from version 6, when spread ran externally to Vertica. I found this post/thread:

    https://community.vertica.com/vertica/topics/error_starting_verticad_no_spread_installed

    which contains:

    "The verticad service has a bug where it refers to the spreadd service.  You can either delete the lines that reference spreadd in /etc/init.d/verticad or wait for a forthcoming bug fix."

After commenting out the spreadd check lines, the init script works.
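The edit can be done mechanically: comment out every line that mentions spreadd. This sketch runs against a made-up stand-in for the script rather than the real /etc/init.d/verticad, and the sample contents are illustrative only:

```shell
# Create a stand-in for the init script (contents are illustrative).
cat > /tmp/verticad.demo <<'EOF'
#!/bin/sh
check_spread() { service spreadd status; }
start() { echo "starting vertica"; }
EOF
# Comment out every line referencing spreadd, keeping a .bak copy.
sed -i.bak '/spreadd/ s/^/#/' /tmp/verticad.demo
grep '^#check_spread' /tmp/verticad.demo
```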
  • >> Finding that the source system was using 127.0.0.1 as the address in the :Site section of the Catalog/config.cat file, but the target had the local eth0 IP, I edited the config.cat file to use 127.0.0.1 as the address and the database started normally via adminTools.
     
    I don't know if you hit the same bug I did, but this patch might be of interest:
     
    https://community.vertica.com/vertica/topics/patch_for_vertica_7_vertica_network_ssh_py_in_nicinfofr...
  • I'm having this same problem.

    I was testing different methods of copying the database to another cluster. I first copied all the tables from cluster1 to cluster2 with COPY table FROM VERTICA and that worked fine. So then I shut down the database on cluster1 and created a new, empty one there. I then did a vbr.py copycluster from cluster2 back to cluster1's new database.

    When trying to start the new database on cluster1 with admintools, I see the same message as MattK about failing the startup checks. I noticed in the adminTools-dbadmin.log file that when trying to start the database, it calls a python module that inspects the catalog.

    So looking at the newly copied database catalog:
    $ /opt/vertica/oss/python/bin/python -m vertica.config.VDatabase /opt/verticadata/catalog/db2/v_db2_node0001_catalog /opt/vertica
    {
     "name": "db2", 
     "spreadversion": 1099, 
     "version": 1099, 
     "flags": {}, 
     "deps": [
      [
       "v_db2_node0001"
      ]
     ], 
     "nodes": [
      {
       "host": "10.1.83.30", 
       "catalogpath": "/opt/verticadata/catalog/db2/v_db2_node0001_catalog", 
       "controlnode": "45035996273704980", 
       "name": "v_db2_node0001", 
       "storagelocs": [
        "/opt/verticadata/data/db2/v_db2_node0001_data", 
        "/opt/verticadata/temp/db2"
       ], 
       "oid": "45035996273704980", 
       "startcmd": "/opt/vertica/bin/vertica -D \"/opt/verticadata/catalog/db2/v_db2_node0001_catalog\" -C \"db2\" -n \"v_db2_node0001\" -h \"10.1.83.30\" -p 5433 -P \"4803\"", 
       "port": "5433"
      }
     ], 
     "willupgrade": false, 
     "controlmode": "broadcast"
    }
    And now looking at the original (working) database catalog:
    $ /opt/vertica/oss/python/bin/python -m vertica.config.VDatabase /opt/verticadata/catalog/db1/v_db1_node0001_catalog /opt/vertica
    {
     "name": "db1", 
     "spreadversion": 1, 
     "version": 18236, 
     "flags": {}, 
     "deps": [
      [
       "v_db1_node0001"
      ]
     ], 
     "nodes": [
      {
       "host": "127.0.0.1", 
       "catalogpath": "/opt/verticadata/catalog/kount/v_db1_node0001_catalog", 
       "controlnode": "45035996273704980", 
       "name": "v_db1_node0001", 
       "storagelocs": [
        "/opt/verticadata/data/db1/v_db1_node0001_data", 
        "/opt/verticadata/temp/db1"
       ], 
       "oid": "45035996273704980", 
       "startcmd": "/opt/vertica/bin/vertica -D \"/opt/verticadata/catalog/db1/v_db1_node0001_catalog\" -C \"db1\" -n \"v_db1_node0001\" -h \"127.0.0.1\" -p 5433 -P \"4803\"", 
       "port": "5433"
      }
     ], 
     "willupgrade": false, 
     "controlmode": "broadcast"
    }
The differences I notice are the host IP addresses, the same as MattK reported. The original, working DB uses 127.0.0.1, whereas the newly copied one uses that host's eth0 IP. Also, I see that spreadversion and version are quite different. Looking at another database I have, its version numbers look more like the working DB here (spreadversion is a single-digit number and version is a large number). Having spreadversion and version be equal seems odd, but I know nothing about where those numbers come from.

    Then I tried running the startcmd directly:
    $ /opt/vertica/bin/vertica -D "/opt/verticadata/catalog/db2/v_db2_node0001_catalog" -C "db2" -n "v_db2_node0001" -h "10.1.83.30" -p 5433 -P "4803"
    # ... snip Spread toolkit banner
    Conf_load_conf_file: using file: /opt/verticadata/catalog/db2/v_db2_node0001_catalog/spread.conf
    Successfully configured Segment 0 [192.168.30.255:4803] with 1 procs:
            N192168030010: 192.168.30.10
    02/28/14 12:45:59 SP_connect: unable to connect mailbox 9: Connection refused
    02/28/14 12:46:00 SP_connect: unable to connect mailbox 9: Connection refused
    02/28/14 12:46:00 SP_connect: unable to connect mailbox 9: Connection refused
    02/28/14 12:46:01 SP_connect: unable to connect mailbox 9: Connection refused
    02/28/14 12:46:01 SP_connect: unable to connect mailbox 9: Connection refused
    # ... and so on for a while
    VSpread could not connect on local domain socket 4803: -2
    Unable to open indirect spread information: /opt/vertica/config/local-spread.conf
    After which the vertica process terminated.

The IP used by Spread (recorded in /opt/verticadata/catalog/db2/v_db2_node0001_catalog/spread.conf) is the private LAN IP of the cluster2 node the database was copied from. This is most definitely wrong. The original database's spread.conf uses 127.0.0.1.

In this case, cluster1 is a single node, whereas cluster2 has 3 nodes. For this test, I created the database in cluster2 with only one node so that it would be identical (or so I thought) to the database in cluster1. Might this be an edge case not expected by the copycluster process?

    Next, I copied the original db1 spread.conf to the new db2 spread.conf, and ran the above command to start vertica and it worked properly.
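That swap looks like this, demonstrated here on scratch paths (on the real system the files live in each database's catalog directory, and the Spread_Segment contents below are illustrative, shaped after the banner output above):

```shell
# Set up stand-ins for the two catalogs' spread.conf files.
mkdir -p /tmp/db1_catalog /tmp/db2_catalog
cat > /tmp/db1_catalog/spread.conf <<'EOF'
Spread_Segment 127.0.0.255:4803 {
    N127000001 127.0.0.1
}
EOF
cat > /tmp/db2_catalog/spread.conf <<'EOF'
Spread_Segment 192.168.30.255:4803 {
    N192168030010 192.168.30.10
}
EOF
# Keep the broken file around, then borrow the working one.
cp /tmp/db2_catalog/spread.conf /tmp/db2_catalog/spread.conf.broken
cp /tmp/db1_catalog/spread.conf /tmp/db2_catalog/spread.conf
grep '127.0.0.1' /tmp/db2_catalog/spread.conf
```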

However, admintools still will not start the database. After some digging I discovered the problem: in /opt/vertica/config/admintools.conf all the nodes listed have 127.0.0.1 as the IP, but, as noted earlier, the IP in the copied database's catalog is the eth0 IP. After changing only the line for v_db2_node0001 to the eth0 IP, admintools successfully starts the db.
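A sketch of that admintools.conf edit, run here against a scratch copy (the [Nodes] layout shown is approximate and may vary by version; on the real system you would back up and edit /opt/vertica/config/admintools.conf):

```shell
# Stand-in for the [Nodes] section of admintools.conf (layout approximate).
cat > /tmp/admintools.conf.demo <<'EOF'
[Nodes]
v_db2_node0001 = 127.0.0.1,/opt/verticadata/catalog/db2,/opt/verticadata/data/db2
EOF
cp /tmp/admintools.conf.demo /tmp/admintools.conf.demo.bak   # back up first
# Point the node at the eth0 IP recorded in the copied catalog.
sed -i 's/^v_db2_node0001 = 127\.0\.0\.1/v_db2_node0001 = 10.1.83.30/' /tmp/admintools.conf.demo
grep '10.1.83.30' /tmp/admintools.conf.demo
```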

    ...

    Well, I didn't expect to actually find the solution but I guess I have. Hopefully someone at Vertica can take this information and use it to fix this permanently.

    In summary, the two problems are:
• vbr.py copycluster changes the "primary" IP of the database so that it no longer matches the IP listed in /opt/vertica/config/admintools.conf, preventing admintools from even attempting to start the database.
• When a start is attempted, it fails because the wrong IP address is used in spread.conf.
    Workarounds:
    • Change the entry in /opt/vertica/config/admintools.conf for the node named after the newly copied database to match the IP used in the catalog. This seems simpler than digging into Catalog/config.cat as suggested by MattK, although I think that would work as well.
    • Change the spread.conf file to use a correct IP. I found it worked with both 127.0.0.1 and the eth0 IP. So, it could be as simple as copying a spread.conf from another database on the same host that actually works.
    My suggestions for fixing this bug:
    • Because you need to have created the destination database before you can use vbr.py copycluster, spread.conf already exists and should be correct. Therefore, do not copy spread.conf from the source database.
• When the catalog metadata is created, the IP chosen is a valid IP for that host, so this part is evidently not copied verbatim from the source the way spread.conf is. However the IP is chosen (I suspect it is guessed from the ifconfig output), the destination database was already created with correct parameters, so the IP should be read from the original catalog and used when generating/modifying the new one.
    I hope this helps.

