Optimizing for Availability
To optimize for high availability, you should tune your Kafka application to
recover as quickly as possible from failure scenarios.
The values for some of the configuration parameters in this page depend on
other factors, such as the average message size and number of partitions.
These can differ greatly from environment to environment.
Some configuration parameters have a range of values, so benchmarking
helps to validate the configuration for your application and environment.
Minimum Insync Replicas
When a producer sets acks=all
or acks=-1
, the configuration parameter
min.insync.replicas
specifies the minimum number of replicas that must
acknowledge a write for the write to be considered successful. If this minimum
cannot be met, then the producer raises an exception. In the case of a shrinking
ISR, the higher this minimum value is, the more likely there is to be a failure
on producer send, which decreases availability for the partition. On the other
hand, by setting this value low (for example, min.insync.replicas=1
), the
system tolerates more replica failures. As long as the minimum number of
replicas is met, the producer requests continue to succeed, which increases
availability for the partition.
Consumer failures
Consumers in a consumer group can share processing load. If a consumer
unexpectedly fails, Kafka can detect the failure and rebalance the partitions
amongst the remaining consumers in the consumer group. The consumer failures can
be hard failures (for example, SIGKILL
) or soft failures (for example,
expired session timeouts). These failures can be detected either when consumers
fail to send heartbeats or when they fail to send poll()
calls. The consumer
liveness is maintained with a heartbeat, now in a background thread since
KIP-62,
and the configuration parameter session.timeout.ms
dictates the timeout used
to detect failed heartbeats. Increase the session timeout to take into account
potential network delays and to avoid soft failures. Soft failures occur most
commonly in two cases: when poll()
returns a batch of messages that take too
long to process or when a JVM GC pause takes too long. If you have a poll()
loop that spends much time processing messages, you can do one of the following:
- Increase the upper bound on the amount of time that a consumer can be idle
before fetching more records with
max.poll.interval.ms
.
- Reduce the maximum size of batches the
max.poll.records
configuration
parameter returns.
Although higher session timeouts increase the amount of time to detect and
recover from a consumer failure, failed client incidents are less likely
than network issues.
Restoring Task Processing State
Finally, when rebalancing workloads by moving tasks between event streaming
application instances, you can reduce the time it takes to restore task
processing state before the application instance resumes processing. In
Kafka Streams, you can accomplish state restoration by replaying the
corresponding changelog topic to reconstruct the state store. The application
can replicate local state stores to minimize changelog-based restoration time by
setting the configuration parameter num.standby.replicas
. Thus, when you
initialize a stream task or reinitialize on the application instance, the state
store is restored to the most recent snapshot accordingly:
- If a local state store doesn’t exist (
num.standby.replicas=0
), then
the changelog is replayed from the earliest offset.
- If a local state store does exist (
num.standby.replicas
is
greater than 0), the changelog is replayed from the previously checkpointed
offset. This method takes less time because it is applies a smaller portion
of the changelog.
Summary of Configurations for Optimizing Availability
Consumer
session.timeout.ms
: increase (default 10000)
Streams
StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG
: 1 or more (default 0)
- Streams applications have embedded producers and consumers, so also
check those configuration recommendations