How can I make my consumers fault-tolerant?

I’m trying to figure out how to make my consumers fault-tolerant. So, take the following typical scenario:

[Image: a topic whose streams are distributed across the replicas of a consumer group]

I have a topic with multiple streams, and those streams are automatically distributed across the replicas in a consumer group. This is horizontal scaling. So far, so good.
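For context, the consumer-group part of this is just a shared group.id: every replica runs the same code, and Kafka balances the topic’s partitions across whichever instances are alive. A minimal sketch using confluent-kafka (broker address, topic, and group names are placeholders):

```python
from confluent_kafka import Consumer

# Every replica runs this same code with the same group.id;
# Kafka assigns each partition to exactly one live member of the
# group, which is what gives you the horizontal scaling.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker address
    "group.id": "my-app-consumers",         # shared by all replicas
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["my-topic"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        print(msg.value())  # placeholder for the real processing logic
finally:
    consumer.close()
```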

But let’s say I push an update and redeploy, and I’ve inadvertently introduced an error, say a dodgy Dockerfile. Because the replicas are instances of the same code, they will all fail to start.

Yes, I could fix the problem, restart, and the group could “catch up” with processing, since Kafka retains the messages for me. But let’s say the issue is more complicated and might take hours, or even days, to find and fix, and I have a real-time app that cannot be offline for that long - processing of the topic must continue…

So, how could I arrange things to get both horizontal scaling and fault tolerance, so that if a consumer group fails, processing continues?

Thanks!

The obvious solution in this case is to roll back the code update and redeploy, which can be done easily in Quix. But I was curious whether a backup consumer group made sense. Here’s what ChatGPT has to say:

Yes, I could roll back the code update to fix this problem. But is there another solution? Could I have a backup consumer group?

ChatGPT

Having a backup consumer group can be a viable solution to mitigate the risk of failures caused by code updates in a primary consumer group. It provides an additional layer of redundancy and enables a fallback option in case of issues with the updated code.

Here’s how you can set up a backup consumer group:

1. Clone the Consumer Group: Create a separate consumer group that mirrors the primary consumer group. This backup consumer group should have the same configuration, topic subscriptions, and processing logic as the primary group.
2. Delayed Code Updates: Instead of updating the code simultaneously in both groups, introduce a delay in applying the code updates to the backup consumer group. This delay gives you a window to observe the behavior of the updated code in the primary group before rolling it out to the backup group.
3. Testing and Validation: During the delay period, thoroughly test and validate the updated code in the primary consumer group. Monitor its performance, check for any issues or errors, and verify that the code update is functioning as expected.
4. Rollout to the Backup Group: Once you are confident in the stability and correctness of the code update in the primary consumer group, proceed with rolling out the update to the backup consumer group. This ensures that both groups eventually run the updated code.
5. Monitoring and Failover: Continuously monitor both the primary and backup consumer groups for any issues. If the primary group encounters failures or inconsistencies after the code update, you can initiate a failover process by directing the Kafka traffic to the backup consumer group. This can be done by updating the consumer group configuration or by reassigning the topic partitions to the backup group.

By having a backup consumer group, you create a safety net that allows you to quickly switch to a stable and functioning codebase if issues arise. It helps minimize downtime and ensure continuous message processing in the event of failures caused by code updates.

Remember to regularly synchronize the codebase between the primary and backup consumer groups to maintain consistency. Additionally, consider automating the failover process to minimize manual intervention and streamline the recovery process.
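To make step 1 concrete: in Kafka terms, a backup consumer group is just the same consumer code registered under a different group.id, so it commits its own offsets independently of the primary group. A minimal sketch, assuming the group name is injected per deployment (all names here are placeholders):

```python
import os
from confluent_kafka import Consumer

# The primary and backup deployments run identical code; only the
# CONSUMER_GROUP environment variable differs between them, so each
# group tracks its own committed offsets on the same topic.
group_id = os.environ.get("CONSUMER_GROUP", "my-app-primary")

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker address
    "group.id": group_id,  # "my-app-primary" or "my-app-backup"
    # Where a group with no committed offsets starts consuming; this
    # decides whether a cold backup reprocesses history or skips it.
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["my-topic"])
```

One detail to settle up front: if the backup group has never committed offsets, auto.offset.reset determines where it starts on failover, so “earliest” may mean reprocessing old messages, while “latest” may mean skipping anything produced during the switch.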

ChatGPT’s plan sounds sensible to me. It’s worth pointing out that the Portal API could be used for the monitoring and failover steps - you can get the status of deployments, and also start specified deployments through it.
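As a rough sketch of what that monitoring loop might look like - note that the endpoint paths, status values, and deployment IDs below are illustrative placeholders, not the documented Portal API routes, so check the API reference for the real ones:

```python
import time
import requests

PORTAL = "https://portal-api.example.quix.io"  # hypothetical base URL
HEADERS = {"Authorization": "Bearer <portal-api-token>"}

def deployment_status(deployment_id: str) -> str:
    # Hypothetical endpoint shape; the real route may differ.
    resp = requests.get(f"{PORTAL}/deployments/{deployment_id}", headers=HEADERS)
    resp.raise_for_status()
    return resp.json()["status"]

def start_deployment(deployment_id: str) -> None:
    # Hypothetical endpoint shape; the real route may differ.
    resp = requests.put(f"{PORTAL}/deployments/{deployment_id}/start", headers=HEADERS)
    resp.raise_for_status()

# Poll the primary deployment; if it stops running, start the backup.
while True:
    if deployment_status("primary-deployment-id") not in ("Running", "Starting"):
        start_deployment("backup-deployment-id")
        break
    time.sleep(30)
```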

If anyone would like to comment further on this, please do! 🙂

Hi Tony,

In this scenario there are a few things you can do.

  1. Test using a dev environment rather than prod.
    This is the most straightforward thing you can change: you should not be developing code in a production environment. If it works in dev, you port it over to prod and it should work as expected. The biggest obstacle you might run into here is the Docker image building differently - for example, because you haven’t pinned dependency versions, resulting in unexpected issues, or because you’re depending on external resources that have changed or gone missing.
  2. Create a new deployment with a single replica that uses the same consumer group.
    You could try deploying the new version of the code with a single replica and have some of the workload processed by it. If it is successful, shut down the old version and increase the new version’s replica count.
  3. A “backup consumer group”, as you say.
    Create a new deployment against the original source topic using another consumer group and a different destination topic. This would let you validate that the new code is working as expected. If you have both topics persisted, the output streams can also be compared (see the sketch below), though this comparison is currently not supported in live view.
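For option 3, if both output topics are persisted, one rough way to compare them offline is to drain each into a map keyed by message key and diff the values. A sketch, assuming keyed messages; the topic names and message limit are placeholders:

```python
from confluent_kafka import Consumer

def drain(topic: str, group_id: str, limit: int = 1000) -> dict:
    """Read up to `limit` messages from a topic into a key -> value map."""
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",  # placeholder broker address
        "group.id": group_id,
        "auto.offset.reset": "earliest",  # read the topic from the start
    })
    consumer.subscribe([topic])
    out = {}
    while len(out) < limit:
        msg = consumer.poll(5.0)
        if msg is None:
            break  # nothing new within the timeout; assume drained
        if not msg.error():
            out[msg.key()] = msg.value()
    consumer.close()
    return out

# Compare the old deployment's output against the new one's.
old = drain("output-topic", "compare-old")
new = drain("output-topic-v2", "compare-new")
mismatches = {k for k in old if new.get(k) != old[k]}
print(f"{len(mismatches)} mismatched keys out of {len(old)}")
```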

Hope that helps!
