How to fix Kafka load NETWORK_ISSUE end reason

We have a Kafka cluster with 3 brokers (no SSL; SASL/PLAIN over plaintext, i.e. SASL_PLAINTEXT) and a Vertica cluster with 3 nodes.
I've written a bash script that creates and configures a scheduler with a frame duration of 00:01:00. It works almost fine and loads the new data every minute.

But when the connection breaks (even for a couple of seconds), I see a network issue ("The cluster was unavailable") in the Load tab of MC. The scheduler keeps trying to load new data every minute, but it stays stuck on NETWORK_ISSUE and never recovers the connection.

The only way I've found to recover the connection is a bash script in crontab that checks these things every 5 minutes (a sketch of such a checker follows the list):

  1. Query the scheduler state (running or stopped) and launch the scheduler if it's stopped.

  2. Query stream_microbatch_history and check the end_reason; if it's NETWORK_ISSUE, then:
    2.1. Shut down the instance
    2.2. Delete and recreate the Kafka source
    2.3. Delete and recreate the microbatch
    2.4. Launch the instance again
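
For what it's worth, here is a minimal sketch of that checker. The config path, the process-match pattern, and the unqualified table name are assumptions; adjust for your scheduler's schema and environment:

    #!/usr/bin/env bash
    # Checker sketch; the paths and names below are hypothetical.
    VKCONFIG=/opt/vertica/packages/kafka/bin/vkconfig
    CONF=/etc/vkconfig/scheduler.conf

    # 1. Relaunch the scheduler if its launcher process is not running
    #    (matching on the Java command line is an assumption).
    if ! pgrep -f 'vkconfig.*launch' >/dev/null; then
        nohup "$VKCONFIG" launch --conf "$CONF" >/dev/null 2>&1 &
    fi

    # 2. If the most recent microbatch ended with NETWORK_ISSUE, bounce the
    #    instance; the source/microbatch would be recreated in between.
    last_reason=$(vsql -t -A -c "SELECT end_reason FROM stream_microbatch_history ORDER BY batch_start DESC LIMIT 1;")
    if [ "$last_reason" = "NETWORK_ISSUE" ]; then
        "$VKCONFIG" shutdown --conf "$CONF"
        # ... delete and recreate the Kafka source and the microbatch here ...
        nohup "$VKCONFIG" launch --conf "$CONF" >/dev/null 2>&1 &
    fi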

So,
1. Is there any other way to recover the connection?
2. How should I keep the scheduler up when failures happen (network or DB failures, etc.)?

Best Answers

  • lop2lop
    edited June 2021 Answer ✓

    @SergeB said:
    You should really get to the bottom of your network issue, but can you provide details on how the scheduler is stuck? There are regular checks on Kafka's health (look at these scheduler CLI options: pushback-policy, pushback-max-count, and auto-sync). If the health check fails too often, the affected microbatch gets disabled (look at the stream_microbatches table).

    After a lot of testing and research, I found the problem, and it was absolutely my fault. But there's another little problem which I think is Vertica's fault :smiley:

    So, my fault:

    I queried the Data Collector table dc_requests_issued to see the exact COPY statement the scheduler executes:

    select time,request, * from dc_requests_issued where time >= timestampadd(hour,-1,current_timestamp) and request ilike 'copy%' order by time desc;

    And I realized that when I restart the node or the connection breaks, the kafka_conf parameter inside the COPY statement changes after a while, because my checker bash script (which is generated by the create-and-configure-scheduler script) tries to export VERTICA_RDKAFKA_CONF to the environment again, using this command:

    export VERTICA_RDKAFKA_CONF=sasl.username=consumer;sasl.password=consumer_password;sasl.mechanism=PLAIN;security.protocol=SASL_PLAINTEXT;

    And it fails when it reaches the first semicolon, because bash treats everything after a semicolon as a new command:

    export VERTICA_RDKAFKA_CONF=sasl.username=consumer;sasl.password=consumer_password;sasl.mechanism=PLAIN;security.protocol=SASL_PLAINTEXT;
    sasl.password=consumer_password: command not found
    sasl.mechanism=PLAIN: command not found
    security.protocol=SASL_PLAINTEXT: command not found

    It must be changed to this:

    export VERTICA_RDKAFKA_CONF='sasl.username=consumer;sasl.password=consumer_password;sasl.mechanism=PLAIN;security.protocol=SASL_PLAINTEXT;'

    It turned out the single quotes around the kafka_conf value had been stripped when the checker script file was generated!
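
    Since the checker file is written out by the create-and-configure script, the quotes have to survive that generation step. A minimal sketch, assuming the generator appends lines to a hypothetical checker.sh:

    # Append the export line with the value wrapped in single quotes; printf
    # writes it verbatim, so the quotes land in the generated file intact.
    printf "export VERTICA_RDKAFKA_CONF='%s'\n" \
        'sasl.username=consumer;sasl.password=consumer_password;sasl.mechanism=PLAIN;security.protocol=SASL_PLAINTEXT;' >> checker.sh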

    The new problem is that when I completely delete the scheduler and recreate it, it consumes all the messages in the Kafka topic from the beginning; it uses neither the consumer group's saved offsets nor the offsets shown by KafkaOffsets() over().

    According to this:
    https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/KafkaIntegrationGuide/UtilityOptions/MicroBatchUtilityOptions.htm?tocpath=Integrating with Apache Kafka|vkconfig Script Options|_____7#Special

    I tried to set the microbatch offset to -3 using the parameter --offset '-3|-3|-3|-3|-3|-3|-3|-3|-3|-3|-3|-3|-3|-3|-3|-3' and it throws an exception:

    Exception in thread "main" com.vertica.solutions.kafka.exception.ConfigurationException: Invalid Offset number detected: -3. Offset out of range. Offset should be positive integer, or -2 which indicates head of the stream.
    at com.vertica.solutions.kafka.model.StreamMicrobatch.validateConfiguration(StreamMicrobatch.java:455)
    at com.vertica.solutions.kafka.model.StreamMicrobatch.setFromMapAndValidate(StreamMicrobatch.java:362)
    at com.vertica.solutions.kafka.cli.CLI.run(CLI.java:72)
    at com.vertica.solutions.kafka.cli.MicrobatchCLI.run(MicrobatchCLI.java:77)
    at com.vertica.solutions.kafka.cli.CLI._main(CLI.java:141)
    at com.vertica.solutions.kafka.cli.MicrobatchCLI.main(MicrobatchCLI.java:72)

    It seems that it only accepts -2 and integer values >= 0!

    How can I force it to use the consumer group's saved offsets?
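
    For now, the only values the CLI seems to accept are explicit offsets (>= 0) or -2 for the head of the stream, e.g. something like this (the microbatch name, partition count, and conf path are placeholders for my setup):

    VKCONFIG=/opt/vertica/packages/kafka/bin/vkconfig

    # Rewind a 3-partition microbatch to the head of the stream (-2 per
    # partition); quoting stops the shell from treating '|' as a pipe.
    "$VKCONFIG" microbatch --update --microbatch my_microbatch \
        --offset '-2|-2|-2' --conf scheduler.conf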

  • SergeB - Employee
    Answer ✓

    As you found out, currently -3 won't work. This will be fixed in the next major release.

    In the meantime, you could read the last consumed offsets by running /kafka-consumer-groups.sh (or equivalent) and use those offsets when you reset the scheduler.

    After the first microbatch, the scheduler no longer relies on starting offsets or on the consumer group; it maintains the last consumed offsets in one of its tables (microbatch_history).
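
    A sketch of that workaround (the broker address, consumer group, and microbatch name are assumptions; the offsets are sample values, pipe-separated as in the post above):

    # 1. Read the group's last committed offsets from Kafka.
    kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
        --describe --group vertica_scheduler_group
    # The output lists TOPIC, PARTITION, CURRENT-OFFSET, LOG-END-OFFSET, LAG.

    # 2. Seed the recreated microbatch with those offsets, then relaunch.
    /opt/vertica/packages/kafka/bin/vkconfig microbatch --update \
        --microbatch my_microbatch --offset '1200|1187|1214' --conf scheduler.conf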

Answers

  • SergeB - Employee

    You should really get to the bottom of your network issue, but can you provide details on how the scheduler is stuck? There are regular checks on Kafka's health (look at these scheduler CLI options: pushback-policy, pushback-max-count, and auto-sync). If the health check fails too often, the affected microbatch gets disabled (look at the stream_microbatches table).

  • Ok, thank you so much 🌹
