Failed to create a Vertica cluster with 8+ nodes

julianjulian Registered User
edited October 3 in Vertica Forum

I am creating a Vertica cluster on OpenStack (Kilo) through Trove. Creating a 2-node, 4-node, or 6-node Vertica cluster works, but creating an 8-node Vertica cluster (formed by 8 Nova instances) fails.

Background:
I have 4 physical servers (1 controller node, 3 compute nodes), each with 8 physical CPUs and 128 GB of RAM. The flavor used for the Vertica cluster has 2 VCPUs, 4 GB RAM, a 15 GB root disk, and a 2 GB swap disk.

OpenStack allows CPU over-committing at 16:1 and RAM over-committing at 1.5:1, so 512 vCPUs and 768 GB of memory can be used in the environment. (At most 8 vCPUs are assigned to each instance.)
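The capacity figures above can be checked with a few lines of arithmetic. This is only a sketch: it counts all 4 servers, as the post does, and treats "128 RAM" as 128 GB, which is the reading that matches the 768 GB total. The function and constant names are made up for illustration.

```python
# Overcommit capacity, using the figures from the post.
PHYSICAL_SERVERS = 4       # 1 controller + 3 compute nodes, all counted as in the post
CPUS_PER_SERVER = 8
RAM_GB_PER_SERVER = 128    # assumption: "128 RAM" means 128 GB
CPU_OVERCOMMIT = 16        # 16:1
RAM_OVERCOMMIT = 1.5       # 1.5:1


def schedulable_capacity(servers, cpus, ram_gb, cpu_ratio, ram_ratio):
    """Return (vcpus, ram_gb) available after applying the overcommit ratios."""
    return servers * cpus * cpu_ratio, servers * ram_gb * ram_ratio


vcpus, ram_gb = schedulable_capacity(
    PHYSICAL_SERVERS, CPUS_PER_SERVER, RAM_GB_PER_SERVER,
    CPU_OVERCOMMIT, RAM_OVERCOMMIT,
)
print(vcpus, ram_gb)  # 512 768.0

# Demand of an 8-node cluster with the 2 VCPU / 4 GB flavor.
NODES = 8
print(NODES * 2, NODES * 4)  # 16 32
```

If only the 3 compute nodes actually host instances, the same formula gives 384 vCPUs and 576 GB. Either way, an 8-node cluster with this flavor is far below the limit, so raw capacity does not look like the constraint.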

After creating the Vertica cluster, I found that the status of the cluster's Trove instances is ERROR. I then used /opt/vertica/bin/admintools to view the state of the database cluster, which is UP. I logged in to the database and tried '\d' and '\dn', but they hang with no response.

I cannot find any hints in vertica.log, install.log, or adminTools-dbadmin.log, and ErrorReport.txt is empty on all 8 nodes of the Vertica cluster. I then tried the Vertica analytics tools vcpuperf, vioperf, and vnetperf. Only 1 node of the 8-node Vertica cluster can run vnetperf; the other 7 nodes hang on the command as follows:

Results of the Vertica analytics tools (as mentioned, only 1 node of the 8-node Vertica cluster can run vnetperf):

vcpuperf:

CPU Time: 8.880000s
Real Time:9.470000s

This machine's high load time: 86 microseconds.
This machine's low load time: 135 microseconds.

vioperf:

vnetperf:

Log:
For Trove:
In /var/log/trove/trove-guestagent.log:
2017-07-23 17:45:51.201 1110 ERROR trove.guestagent.datastore.experimental.vertica.service [-] Failed to get database status.
2017-07-23 17:45:51.201 1110 TRACE trove.guestagent.datastore.experimental.vertica.service Stdout: u'\nERROR: /opt/vertica/config/admintools.conf does not exist.\nThis file is created by the installer upon successful completion.\nYou must successfully run /opt/vertica/sbin/install_vertica before running admintools.\n\n'

Can anyone offer suggestions on the situation above?

Comments

  • Jim_KnicelyJim_Knicely Employee, Registered User, VerticaExpert

    Hi,

    Did you create the dbadmin Linux user yourself or did the Vertica installer create it?

    Some suggestions:

    • Make sure that each node can ssh to all of the other nodes as the dbadmin user without a password. You'll probably see those "fingerprint" errors on each node. Fix all of those. I think this is why vnetperf is not running on those nodes.
    • On the node where you ran the Vertica installer, run the command "admintools -t list_allnodes" as the dbadmin user to list the status of the nodes.
    • Did the Vertica RPM install on each node? Use the command "rpm -qa | grep -i vertica" to find out.
    • You might be able to distribute the config files via admintools from the node where you ran the installer. Run "admintools -t distribute_config_files".
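The passwordless-ssh check in the first suggestion can be scripted. The following is a minimal sketch, assuming it is run as dbadmin on each node in turn. The helper names and the host list are hypothetical. It relies on OpenSSH's `BatchMode=yes` option, which makes ssh fail instead of prompting for a password or a host-key ("fingerprint") confirmation.

```python
import subprocess


def ssh_check_command(host, user="dbadmin", timeout_s=5):
    """Build an ssh command that fails instead of prompting."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # no password or host-key prompts
        "-o", f"ConnectTimeout={timeout_s}",
        f"{user}@{host}",
        "true",                               # trivial remote command
    ]


def unreachable_hosts(hosts, run=subprocess.run):
    """Return the hosts for which non-interactive ssh does not succeed."""
    failed = []
    for host in hosts:
        try:
            result = run(ssh_check_command(host), capture_output=True, timeout=30)
            ok = result.returncode == 0
        except (subprocess.TimeoutExpired, OSError):
            ok = False
        if not ok:
            failed.append(host)
    return failed


if __name__ == "__main__":
    # Hypothetical addresses; replace with the cluster members' addresses.
    print(unreachable_hosts(["10.0.0.1", "10.0.0.2"]))
```

An empty list on every node would suggest that ssh between nodes is not what blocks vnetperf. Any host that is listed needs its key or known_hosts entry fixed first.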
  • julianjulian Registered User
    edited October 3

    I created a Trove guest image that is used to create the Vertica cluster; the dbadmin user is created before the install_vertica script runs.

    For your first suggestion:
    You're right. After I ssh'd to the other nodes, the "fingerprint" errors were resolved, and vnetperf can now be run on all nodes.

    For your second suggestion:
    I ran "admintools -t list_allnodes" on one of the nodes in the 8-node Vertica cluster, and it shows all nodes in 'UP' status.

    For your third suggestion:
    All nodes should have vertica.deb installed, as follows:

    For the last suggestion:
    The config files can be distributed, as follows:
    Intiating admintools.conf distribution...
    Local admintools.conf sent to all nodes in the cluster.

    After I solved the fingerprint errors, I restarted the Vertica cluster and it reported that the Vertica database started successfully, but the 8-node Vertica cluster gives the following error in vertica.log:

    2017-10-03 18:00:19.849 PartitionTables:0x7f03d4012cc0 @v_db_srvr_node0002: {threadShim} 08006/4539: Received no response from v_db_srvr_node0001, v_db_srvr_node0003, v_db_srvr_node0004, v_db_srvr_node0005, v_db_srvr_node0006, v_db_srvr_node0007, v_db_srvr_node0008 in transaction bind
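When there are many such lines across 8 nodes, it can help to pull out which node reported the problem and which nodes stayed silent. Here is a rough sketch of a parser. The regular expression is inferred only from the log lines quoted in this thread, not from any documented vertica.log format, and the function name is made up.

```python
import re

# Pattern inferred from the vertica.log lines quoted in this thread.
NO_RESPONSE = re.compile(
    r"@(?P<reporter>\w+):.*?"
    r"Received no response from (?P<silent>.+?) in transaction bind"
)


def silent_nodes(log_line):
    """Return (reporting_node, [silent_nodes]) or None if the line does not match."""
    match = NO_RESPONSE.search(log_line)
    if match is None:
        return None
    silent = [name.strip() for name in match.group("silent").split(",")]
    return match.group("reporter"), silent


line = (
    "2017-10-08 06:22:23.027 CatchUp:0x7f340c00eb70-d000000000000e "
    "@v_db_srvr_node0004: {runRecover} 08006/4539: Received no response "
    "from v_db_srvr_node0006 in transaction bind"
)
print(silent_nodes(line))  # ('v_db_srvr_node0004', ['v_db_srvr_node0006'])
```

Applied to the line above from node0002, it would list all seven other nodes as silent. That points at node0002 being cut off from the whole cluster rather than at one bad peer.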

  • Jim_KnicelyJim_Knicely Employee, Registered User, VerticaExpert

    Hi,

    FYI, Vertica 7.1.2 is no longer supported. Could it be an old bug you are seeing?

    Are you sure you disabled the Linux firewalls or opened up the ports required by Vertica?

    See:
    https://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/InstallationGuide/BeforeYouInstall/iptablesEnabled.htm?Highlight=firewall

    Also, when you installed Vertica, did you use the -T or --point-to-point option? This option is almost always better for a virtual environment.

    See: https://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/InstallationGuide/InstallingVertica/RunTheInstallScript.htm?Highlight=point-to-point
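A quick way to test the firewall question from each node is to try a TCP connection to every other node. The sketch below uses only the Python standard library. The helper names are hypothetical, and the port list is left as a parameter because the authoritative list is in the installation guide linked above. Note that this only covers TCP; UDP traffic is not tested by a connect check like this.

```python
import socket


def tcp_port_open(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def closed_ports(hosts, ports, timeout_s=2.0):
    """Return every (host, port) pair that could not be reached over TCP."""
    return [
        (host, port)
        for host in hosts
        for port in ports
        if not tcp_port_open(host, port, timeout_s)
    ]


if __name__ == "__main__":
    # Hypothetical addresses and ports; take the real port list from the docs.
    print(closed_ports(["10.0.0.1", "10.0.0.2"], [22, 5433]))
```

A port that shows up here may simply have nothing listening yet, so the result is most useful while the database is running.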

  • julianjulian Registered User

    Thanks for your suggestions.

    I still wonder why 2-node, 4-node, and 6-node Vertica clusters can be created successfully, while only clusters with 8 or more nodes fail.

    I have disabled the firewalls of the VMs and the physical machines. The ports are opened by the Trove Taskmanager.

    I have changed the install_vertica command to "/opt/vertica/sbin/install_vertica -s %s -d %s -X -N -T -r /vertica.deb -L /vertica.dat -Y --no-system-checks --failure-threshold NONE", but creating an 8-node Vertica cluster still fails.
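For reference, this is roughly how the two %s placeholders in that command get filled in. It is a sketch under assumptions: that the first %s is a comma-separated host list for -s and the second is the data directory for -d. The function name, addresses, and directory are hypothetical.

```python
import shlex

# Template copied from the command quoted above.
INSTALL_TEMPLATE = (
    "/opt/vertica/sbin/install_vertica -s %s -d %s -X -N -T "
    "-r /vertica.deb -L /vertica.dat -Y --no-system-checks "
    "--failure-threshold NONE"
)


def build_install_command(hosts, data_dir):
    """Fill the template; assumes -s takes a comma-separated host list."""
    host_list = ",".join(hosts)
    return INSTALL_TEMPLATE % (shlex.quote(host_list), shlex.quote(data_dir))


# Hypothetical addresses for an 8-node cluster and a hypothetical data directory.
hosts = [f"10.0.0.{i}" for i in range(1, 9)]
print(build_install_command(hosts, "/var/lib/vertica"))
```

Printing the command for the failing 8-node case and comparing it with a working 6-node case is a cheap way to rule out a malformed host list.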

    I scanned through all the logs in the 8-node Vertica cluster and list the relevant entries here.

    1. vertica.log
      The network socket experienced an error. This Spread mailbox will no longer work until the connection is disconnected and then reconnected

    2017-10-08 06:22:23.027 CatchUp:0x7f340c00eb70-d000000000000e @v_db_srvr_node0004: {runRecover} 08006/4539: Received no response from v_db_srvr_node0006 in transaction bind
    2017-10-08 06:23:19.437 CatchUp:0x7f340c00eb70 @v_db_srvr_node0004: 00000/3298: Event Posted: Event Code:6 Event Id:3 Event Severity: Informational [6] PostedTimestamp: 2017-10-08 06:23:19.437249 ExpirationTimestamp: 2085-10-26 09:37:26.437249 EventCodeDescription: Node State Change ProblemDescription: Changing node v_db_srvr_node0004 startup state to RECOVER_ERROR DatabaseName: db_srvr Hostname: vertica-cluster-member-8
    2017-10-08 06:23:19.437 CatchUp:0x7f340c00eb70 [Recover] Changing node v_db_srvr_node0004 startup state from SHUTDOWN to RECOVER_ERROR

    2. dbLog
      10/08/17 06:21:56 SP_connect: unable to connect via UNIX socket to /tmp/4803 (pid=4472): Error: No such file or directory

    3. adminTools-dbadmin.log
      Oct 8 06:34:17 [5791] [vsql.connect] EOF ERROR: vsql: could not connect to server: Connection refused
      Oct 8 06:34:17 [5791] [vsql.connect] EOF ERROR:

    4. trove-guestagent.log
      ERROR trove.guestagent.datastore.experimental.vertica.service [-] Vertica database create failed.

    DB server is not installed or is in restart mode, so for now we'll skip determining the status of DB on this instance.

  • julianjulian Registered User

    To simplify the situation, I now create the Vertica cluster on a single physical machine. There, an 8-node Vertica cluster can be created, but creating a 16-node Vertica cluster fails.

  • Jim_KnicelyJim_Knicely Employee, Registered User, VerticaExpert
    edited October 11

    Hmm. Vertica should run on any commodity hardware. I can set up 1, 2, 4, 8, 10, or any number of VM nodes on my laptop (i.e., using Oracle VirtualBox).

    What OS are you using?

  • julianjulian Registered User
    edited October 12

    Hi, I installed OpenStack on an Ubuntu 14.04 server and I use Trove to create the database cluster. Since the database failed to install on the 16-node cluster, I re-ran the install_vertica script manually and it shows me this. Is it necessary to fix this problem?

    I previously used --failure-threshold to hide those messages.

    Additionally, dbLog shows this error when the 16-node Vertica cluster is built (the error does not appear when creating 2-, 4-, or 6-node Vertica clusters):

    10/11/17 09:55:45 SP_connect: unable to connect via UNIX socket to /tmp/4803 (pid=10826): Error: No such file or directory
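The "No such file or directory" part of that message can be checked directly on each node. Here is a small sketch that reports whether the path from the error exists and is actually a UNIX socket. The function name is made up, and reading "missing" as "the process that should create the socket is not running, or has not created it yet" is my assumption, not something the log states.

```python
import os
import stat


def unix_socket_status(path="/tmp/4803"):
    """Classify path as 'socket', 'not-a-socket', or 'missing'."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    return "socket" if stat.S_ISSOCK(mode) else "not-a-socket"


if __name__ == "__main__":
    print(unix_socket_status())
```

Running this on every node right after startup, on both a working 6-node cluster and a failing one, would show whether the socket is absent everywhere or only on some nodes.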

    All 16 nodes of the Vertica cluster are built on the same physical machine, so this should not be related to a network problem (UDP and TCP bandwidth), should it?