# librdkafka v2.11.0

librdkafka v2.11.0 is a feature release:

* [KIP-1102](https://cwiki.apache.org/confluence/display/KAFKA/KIP-1102%3A+Enable+clients+to+rebootstrap+based+on+timeout+or+error+code) Enable clients to rebootstrap based on timeout or error code (#4981).
* [KIP-1139](https://cwiki.apache.org/confluence/display/KAFKA/KIP-1139%3A+Add+support+for+OAuth+jwt-bearer+grant+type) Add support for OAuth jwt-bearer grant type (#4978).
* Fix for poll ratio calculation in case the queues are forwarded (#5017).
* Fix for a data race: buffer queues are now reset instead of being re-initialized (#4718).
* Features BROKER_BALANCED_CONSUMER and SASL_GSSAPI no longer depend on JoinGroup v0, which is missing in AK 4.0 and CP 8.0 (#5131).
* Improve HTTPS CA certificates configuration by probing several paths when OpenSSL is statically linked, and provide a way to customize their location or value (#).

## Fixes

### General fixes

* Issues: #4522. A data race happened when emptying the buffers of a failing broker in its own thread while the statistics callback in the main thread gathered the buffer counts. Solved by resetting the atomic counters instead of re-initializing them. Happening since 1.x (#4718).
* Issues: #4948. Features BROKER_BALANCED_CONSUMER and SASL_GSSAPI no longer depend on JoinGroup v0, which is missing in AK 4.0 and CP 8.0. This PR partially fixes the linked issue; a complete fix for all features will follow. The remaining fixes are necessary only for a subsequent Apache Kafka major version (e.g. AK 5.x). Happening since 1.x (#5131).

### Telemetry fixes

* Issues: #5109. Fix for poll ratio calculation in case the queues are forwarded. The poll ratio is now calculated per queue instead of per instance, which avoids the calculation problems caused by sharing the same field. Happens since 2.6.0 (#5017).

# librdkafka v2.10.1

librdkafka v2.10.1 is a maintenance release:

* Fix to add locks when updating the metadata cache for the consumer after no broker connection is available (@marcin-krystianc, #5066).
* Fix to the re-bootstrap case when `bootstrap.servers` is `NULL` and brokers were added manually through `rd_kafka_brokers_add` (#5067).
* Fix an issue where the first message to any topic produced via `producev` or `produceva` was delivered late (by up to 1 second) (#5032).
* Fix for a loop of re-bootstrap sequences in case the client reaches the `all brokers down` state (#5086).
* Fix for frequent disconnections on push telemetry requests with particular metric configurations (#4912).
* Avoid copying outside buffer boundaries when reading metric names in the telemetry subscription (#5105).
* Metrics aren't duplicated when multiple prefixes match them (#5104).

## Fixes

### General fixes

* Issues: #5088. Fix for a loop of re-bootstrap sequences in case the client reaches the `all brokers down` state. The client kept selecting the bootstrap brokers, given they had no connection attempt, and didn't re-connect to the learned ones. When this happens, a broker restart can break the loop for clients using the affected version. Fixed by giving a higher chance of connecting to the learned brokers even if there are new ones that never attempted a connection. Happens since 2.10.0 (#5086).
* Issues: #5057. Fix to the re-bootstrap case when `bootstrap.servers` is `NULL` and brokers were added manually through `rd_kafka_brokers_add`. Avoids a segmentation fault in this case (a usage sketch follows this list). Happens since 2.10.0 (#5067).
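The following is a minimal sketch (not part of the original release notes) of the scenario covered by #5067: creating a client without `bootstrap.servers` and adding brokers manually through `rd_kafka_brokers_add()`. The broker addresses and group id are placeholders.

```c
/* Minimal sketch: consumer created without `bootstrap.servers`, brokers
 * added manually afterwards, the scenario covered by the #5067 fix.
 * Error handling is abbreviated. */
#include <stdio.h>
#include <librdkafka/rdkafka.h>

int main(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        /* Intentionally no `bootstrap.servers`: brokers are added below. */
        rd_kafka_conf_set(conf, "group.id", "example-group",
                          errstr, sizeof(errstr));

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf,
                                      errstr, sizeof(errstr));
        if (!rk) {
                fprintf(stderr, "Failed to create consumer: %s\n", errstr);
                return 1;
        }

        /* Manually add the initial brokers; with the fix, re-bootstrap no
         * longer segfaults when `bootstrap.servers` was never set. */
        if (rd_kafka_brokers_add(rk, "broker1:9092,broker2:9092") == 0)
                fprintf(stderr, "No valid brokers specified\n");

        rd_kafka_destroy(rk);
        return 0;
}
```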
### Producer fixes

* In case of `producev` or `produceva`, the producer did not enqueue a leader query metadata request immediately, but waited for the 1-second timer to kick in. This could delay the sending of the first message by up to 1 second. Happens since 1.x (#5032).

### Consumer fixes

* Issues: #5051. Fix to add locks when updating the metadata cache for the consumer. The missing locks could cause memory corruption or a use-after-free when there's no broker connection and the consumer group metadata needs to be updated. Happens since 2.10.0 (#5066).

### Telemetry fixes

* Issues: #5106. Fix for frequent disconnections on push telemetry requests with particular metric configurations. A `NULL` payload was sent in a push telemetry request when an empty one was needed. This caused disconnections every time the push was sent, but only when metrics were requested and some matched the producer but none the consumer, or the other way around. Happens since 2.5.0 (#4912).
* Issues: #5102. Avoid copying outside buffer boundaries when reading metric names in the telemetry subscription. It could cause some metrics not to be matched. Happens since 2.5.0 (#5105).
* Issues: #5103. Telemetry metrics aren't duplicated when multiple prefixes match them. Fixed by keeping track of the metrics that already matched. Happens since 2.5.0 (#5104).

# librdkafka v2.10.0

librdkafka v2.10.0 is a feature release:

> [!WARNING]
> It's suggested to upgrade to v2.10.1 or later
> because of the possibly critical bug #5088.

## [KIP-848](https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol) – Now in **Preview**

- [KIP-848](https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol) has transitioned from *Early Access* to *Preview*.
- Added support for **regex-based subscriptions**.
- Implemented client-side member ID generation as per [KIP-1082](https://cwiki.apache.org/confluence/display/KAFKA/KIP-1082%3A+Require+Client-Generated+IDs+over+the+ConsumerGroupHeartbeat+RPC).
- `rd_kafka_DescribeConsumerGroups()` now supports KIP-848-style `consumer` groups. Two new fields have been added:
  - **Group type** – Indicates whether the group is `classic` or `consumer`.
  - **Target assignment** – Applicable only to `consumer` protocol groups (defaults to `NULL`).
- Group configuration is now supported in `AlterConfigs`, `IncrementalAlterConfigs`, and `DescribeConfigs`. ([#4939](https://github.com/confluentinc/librdkafka/pull/4939))
- Added **Topic Authorization Error** support in the `ConsumerGroupHeartbeat` response.
- Removed usage of the `partition.assignment.strategy` property for the `consumer` group protocol. An error will be raised if this is set with `group.protocol=consumer` (see the sketch after this section).
- Deprecated and disallowed the following properties for the `consumer` group protocol:
  - `session.timeout.ms`
  - `heartbeat.interval.ms`
  - `group.protocol.type`

  Attempting to set any of these will result in an error.
- Enhanced handling for `subscribe()` and `unsubscribe()` edge cases.

> [!NOTE]
> The [KIP-848](https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol) consumer is currently in **Preview** and should not be used in production environments. The implementation is feature complete, but the contract could see minor changes before General Availability.
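As an illustration of the Preview protocol described above, here is a minimal, hedged sketch (not part of the original notes) of configuring a consumer with `group.protocol=consumer` and a regex subscription. The broker list, group id and topic pattern are placeholders, and properties such as `partition.assignment.strategy` or `session.timeout.ms` are intentionally left unset since they are rejected with this protocol.

```c
/* Sketch: KIP-848 `consumer` group protocol (Preview) with a regex
 * subscription. Placeholder broker list, group id and topic pattern. */
#include <stdio.h>
#include <librdkafka/rdkafka.h>

int main(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        rd_kafka_conf_set(conf, "bootstrap.servers", "localhost:9092",
                          errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "group.id", "kip848-preview-group",
                          errstr, sizeof(errstr));
        /* Opt in to the new rebalance protocol; do NOT set
         * partition.assignment.strategy, session.timeout.ms,
         * heartbeat.interval.ms or group.protocol.type with it. */
        rd_kafka_conf_set(conf, "group.protocol", "consumer",
                          errstr, sizeof(errstr));

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf,
                                      errstr, sizeof(errstr));
        if (!rk) {
                fprintf(stderr, "Consumer creation failed: %s\n", errstr);
                return 1;
        }
        rd_kafka_poll_set_consumer(rk);

        /* Regex subscriptions start with "^". */
        rd_kafka_topic_partition_list_t *topics =
                rd_kafka_topic_partition_list_new(1);
        rd_kafka_topic_partition_list_add(topics, "^orders\\..*",
                                          RD_KAFKA_PARTITION_UA);
        rd_kafka_subscribe(rk, topics);
        rd_kafka_topic_partition_list_destroy(topics);

        /* ... poll with rd_kafka_consumer_poll(), then close ... */
        rd_kafka_consumer_close(rk);
        rd_kafka_destroy(rk);
        return 0;
}
```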
## Upgrade considerations

Starting from this version, brokers not reported in a Metadata RPC response are removed along with their threads.
Brokers and their threads are added back when they appear in a Metadata RPC response again. When no brokers are left or none of them is reachable, the client starts a re-bootstrap sequence by default. This is controlled by `metadata.recovery.strategy`, which defaults to `rebootstrap`. Setting `metadata.recovery.strategy` to `none` avoids any re-bootstrapping and leaves only the brokers received in the last successful metadata response, as in the sketch below.
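A hedged configuration sketch (not from the original notes) showing the two `metadata.recovery.strategy` values mentioned above; broker addresses are placeholders.

```c
/* Sketch: choosing the metadata recovery strategy introduced in v2.10.0.
 * "rebootstrap" (the default) re-bootstraps from bootstrap.servers when all
 * brokers are gone or unreachable; "none" disables re-bootstrapping. */
#include <stdio.h>
#include <librdkafka/rdkafka.h>

static rd_kafka_conf_t *make_conf(const char *strategy) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        if (rd_kafka_conf_set(conf, "bootstrap.servers",
                              "broker1:9092,broker2:9092",
                              errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK ||
            rd_kafka_conf_set(conf, "metadata.recovery.strategy", strategy,
                              errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK) {
                fprintf(stderr, "Configuration error: %s\n", errstr);
                rd_kafka_conf_destroy(conf);
                return NULL;
        }
        return conf;
}

int main(void) {
        /* Default behaviour: re-bootstrap when all brokers are down. */
        rd_kafka_conf_t *conf_rebootstrap = make_conf("rebootstrap");
        /* Opt out: keep only the brokers from the last metadata response. */
        rd_kafka_conf_t *conf_none = make_conf("none");

        if (conf_rebootstrap) rd_kafka_conf_destroy(conf_rebootstrap);
        if (conf_none) rd_kafka_conf_destroy(conf_none);
        return 0;
}
```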
## Enhancements and Fixes

* [KIP-899](https://cwiki.apache.org/confluence/display/KAFKA/KIP-899%3A+Allow+producer+and+consumer+clients+to+rebootstrap) Allow producer and consumer clients to rebootstrap.
* Identify brokers only by broker id (#4557, @mfleming).
* Remove unavailable brokers and their threads (#4557, @mfleming).
* Commits during a cooperative incremental rebalance no longer cause a lost assignment if the generation id was bumped in between (#4908).
* Fix for librdkafka yielding before timeouts had been reached (#4970).
* Removed a 500ms latency when a consumer partition switches to a different leader (#4970).
* The mock cluster implementation removes brokers from the Metadata response when they're not available, which better simulates the actual behavior of a cluster that is using KRaft (#4970).
* Topics are no longer removed from the cache on temporary Metadata errors, only on metadata cache expiry (#4970).
* A topic isn't marked as unknown if it had been marked as existent earlier and `topic.metadata.propagation.max.ms` hasn't passed yet (@marcin-krystianc, #4970).
* Partition leaders aren't updated if the topic in the metadata response has errors (#4970).
* Only topic authorization errors in a metadata response are considered permanent and are returned to the user (#4970).
* The function `rd_kafka_offsets_for_times` refreshes leader information if the error requires it, allowing it to succeed on subsequent manual retries (#4970).
* Deprecated the `api.version.request`, `api.version.fallback.ms` and `broker.version.fallback` configuration properties (#4970).
* When the consumer is closed before destroying the client, the operations queue isn't purged anymore, as it contains operations unrelated to the consumer group (#4970).
* When making multiple changes to the consumer subscription in a short time, no unknown topic error is returned for topics that are in the new subscription but weren't in the previous one (#4970).
* Prevent metadata cache corruption when the topic id changes (@kwdubuc, @marcin-krystianc, @GerKr, #4970).
* Fix for the case where a metadata refresh enqueued on an unreachable broker prevents refreshing the controller or the coordinator until that broker becomes reachable again (#4970).
* Remove a one-second wait after a partition fetch is restarted following a leader change and offset validation (#4970).
* Fix the Nagle algorithm (TCP_NODELAY) on broker sockets to not be enabled by default (#4986).

## Fixes

### General fixes

* Issues: #4212. Identify brokers only by broker id, as the Java client does, to avoid finding a broker with the same hostname and reusing its thread and connection. Happens since 1.x (#4557, @mfleming).
* Issues: #4557. Remove brokers not reported in a metadata call, along with their threads. Avoids selecting unavailable brokers for a new connection when no broker is available. We cannot tell whether a broker was removed temporarily or permanently, so we always remove it; it'll be added back when it becomes available again. Happens since 1.x (#4557, @mfleming).
* Issues: #4970. librdkafka code using `cnd_timedwait` was yielding before a timeout occurred, without the condition being fulfilled, because of spurious wake-ups. Solved by verifying with a monotonic clock that the expected point in time was reached and calling the function again if needed. Happens since 1.x (#4970).
* Issues: #4970. Topics are no longer removed from the cache on temporary Metadata errors, only on metadata cache expiry. This allows the client to continue working in case of temporary problems in the Kafka metadata plane. Happens since 1.x (#4970).
* Issues: #4970. A topic isn't marked as unknown if it had been marked as existent earlier and `topic.metadata.propagation.max.ms` hasn't passed yet. This achieves the property's expected effect even if a different broker had previously reported the topic as existent. Happens since 1.x (@marcin-krystianc, #4970).
* Issues: #4907. Partition leaders aren't updated if the topic in the metadata response has errors. This is in line with what the Java client does and avoids segmentation faults for unknown partitions. Happens since 1.x (#4970).
* Issues: #4970. Only topic authorization errors in a metadata response are considered permanent and are returned to the user. This is in line with what the Java client does and avoids returning to the user an error that wasn't meant to be permanent. Happens since 1.x (#4970).
* Issues: #4964, #4778. Prevent metadata cache corruption when the topic id for the same topic name changes. Solved by correctly removing the entry with the old topic id from the metadata cache to prevent a subsequent use-after-free. Happens since 2.4.0 (@kwdubuc, @marcin-krystianc, @GerKr, #4970).
* Issues: #4970. Fix for the case where a metadata refresh enqueued on an unreachable broker prevents refreshing the controller or the coordinator until that broker becomes reachable again. Given the request continues to be retried on that broker, the counter for refreshing complete broker metadata doesn't reach zero and prevents the client from obtaining the new controller, group coordinator or transactional coordinator. It causes a series of debug messages like "Skipping metadata request: ... full request already in-transit" until the broker the request is enqueued on is up again. Solved by not retrying these kinds of metadata requests. Happens since 1.x (#4970).
* The Nagle algorithm (TCP_NODELAY) is now disabled by default. It caused a large increase in latency for some use cases, for example when using an SSL connection. For efficient batching, the application should use `linger.ms`, `batch.size`, etc. Happens since 0.x (#4986).

### Consumer fixes

* Issues: #4059. Commits during a cooperative incremental rebalance could cause a lost assignment if the generation id was bumped by a second join group request. Solved by not rejoining the group in case an illegal generation error happens during a rebalance. Happening since v1.6.0 (#4908).
* Issues: #4970. When switching to a different leader, a consumer could wait 500ms (`fetch.error.backoff.ms`) before starting to fetch again. The fetch backoff wasn't reset when joining the new broker. Solved by resetting it, given it's not necessary to back off the first fetch on a different node. This way faster leader switches are possible. Happens since 1.x (#4970).
* Issues: #4970. The function `rd_kafka_offsets_for_times` refreshes leader information if the error requires it, allowing it to succeed on subsequent manual retries (a usage sketch follows this list). Similar to the fix done in 2.3.0 in `rd_kafka_query_watermark_offsets`. Additionally, the partition's current leader epoch is taken from the metadata cache instead of from the passed partitions. Happens since 1.x (#4970).
* Issues: #4970. When the consumer is closed before destroying the client, the operations queue isn't purged anymore, as it contains operations unrelated to the consumer group. Happens since 1.x (#4970).
* Issues: #4970. When making multiple changes to the consumer subscription in a short time, no unknown topic error is returned for topics that are in the new subscription but weren't in the previous one. This was due to the metadata request relative to the previous subscription. Happens since 1.x (#4970).
* Issues: #4970. Remove a one-second wait after a partition fetch is restarted following a leader change and offset validation. This is done by resetting the fetch error backoff and waking up the delegated broker if present. Happens since 2.1.0 (#4970).
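To make the `rd_kafka_offsets_for_times` behaviour above concrete, here is a minimal, hedged usage sketch (not part of the original notes): offsets are looked up by timestamp, and on an error the call can be retried manually since the fix refreshes the leader information. The topic name, partition and handle are placeholders.

```c
/* Sketch: look up the offset for a given timestamp with
 * rd_kafka_offsets_for_times(); `rk` is an existing client instance.
 * On input the `offset` field holds the timestamp (ms); on output it
 * holds the earliest offset whose timestamp is >= the given one. */
#include <stdio.h>
#include <librdkafka/rdkafka.h>

static void lookup_offset_for_timestamp(rd_kafka_t *rk, int64_t ts_ms) {
        rd_kafka_topic_partition_list_t *parts =
                rd_kafka_topic_partition_list_new(1);
        rd_kafka_topic_partition_t *p =
                rd_kafka_topic_partition_list_add(parts, "example-topic", 0);

        p->offset = ts_ms; /* timestamp in, offset out */

        rd_kafka_resp_err_t err =
                rd_kafka_offsets_for_times(rk, parts, 10 * 1000 /* 10s */);

        if (err)
                fprintf(stderr, "offsets_for_times failed: %s "
                        "(a manual retry may succeed after leader refresh)\n",
                        rd_kafka_err2str(err));
        else
                printf("Offset for ts %lld: %lld\n",
                       (long long)ts_ms, (long long)p->offset);

        rd_kafka_topic_partition_list_destroy(parts);
}
```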
*Note: there was no v2.9.0 librdkafka release, it was a dependent clients release only*

# librdkafka v2.8.0

librdkafka v2.8.0 is a maintenance release:

* Socket options are now all set before connection (#4893).
* The client certificate chain is now sent when using `ssl.certificate.pem`, `ssl_certificate` or `ssl.keystore.location` (#4894).
* Avoid sending client certificates whose chain doesn't match the broker's trusted root certificates (#4900).
* Fixes to allow migrating partitions to leaders with the same leader epoch, or a NULL leader epoch (#4901).
* Support versions of OpenSSL without the ENGINE component (Chris Novakovic, #3535 and @remicollet, #4911).

## Fixes

### General fixes

* Socket options are now all set before connection, as the [documentation](https://man7.org/linux/man-pages/man7/tcp.7.html) says this is needed for socket buffers to take effect, even if in some cases they could take effect even after connection. Happening since v0.9.0 (#4893).
* Issues: #3225. The client certificate chain is now sent when using `ssl.certificate.pem`, `ssl_certificate` or `ssl.keystore.location`. Without it, the broker must explicitly add any intermediate certification authority certificate to its truststore to be able to accept the client certificate. Happens since 1.x (#4894).

### Consumer fixes

* Issues: #4796. Fix to allow migrating partitions to leaders with a NULL leader epoch. A NULL leader epoch can happen during a cluster roll with an upgrade to a version supporting KIP-320. Happening since v2.1.0 (#4901).
* Issues: #4804. Fix to allow migrating partitions to leaders with the same leader epoch. The same leader epoch can happen when a partition is temporarily migrated to the internal broker (#4804), or if the broker implementation never bumps it, as it's not needed to validate the offsets. Happening since v2.4.0 (#4901).

*Note: there was no v2.7.0 librdkafka release*

# librdkafka v2.6.1

librdkafka v2.6.1 is a maintenance release:

* Fix for a Fetch regression when connecting to Apache Kafka < 2.7 (#4871).
* Fix for an infinite loop happening with the cooperative-sticky assignor under some particular conditions (#4800).
* Fix for retrieving offset commit metadata when it contains zeros and librdkafka is configured with `strndup` (#4876).
* Fix for a loop of ListOffsets requests, happening in a Fetch From Follower scenario, if such a request is made to the follower (#4616, #4754, @kphelps).
* Fix to remove fetch queue messages that blocked the destroy of rdkafka instances (#4724).
* Upgrade Linux dependencies: OpenSSL 3.0.15, CURL 8.10.1 (#4875).
* Upgrade Windows dependencies: MSVC runtime to 14.40.338160.0, zstd 1.5.6, zlib 1.3.1, OpenSSL 3.3.2, CURL 8.10.1 (#4872).
* SASL/SCRAM authentication fix: avoid concatenating the client-side nonce once more, as it's already prepended to the server-sent nonce (#4895).
* Allow retrying for status code 429 ('Too Many Requests') in HTTP requests for OAUTHBEARER OIDC (#4902).

## Fixes

### General fixes

* SASL/SCRAM authentication fix: avoid concatenating the client-side nonce once more, as it's already prepended to the server-sent nonce. librdkafka was incorrectly concatenating the client-side nonce again, leading to [this fix](https://github.com/apache/kafka/commit/0a004562b8475d48a9961d6dab3a6aa24021c47f) being made on the AK side, released with 3.8.1, using `endsWith` instead of `equals`. Happening since v0.0.99 (#4895).

### Consumer fixes

* Issues: #4870. Fix for a Fetch regression when connecting to Apache Kafka < 2.7, causing fetches to fail. Happening since v2.6.0 (#4871).
* Issues: #4783. A consumer configured with the `cooperative-sticky` partition assignment strategy could get stuck in an infinite loop, with a corresponding spike of main-thread CPU usage. That happened with some particular orders of members and potentially assignable partitions. Solved by removing the cause of the infinite loop. Happening since 1.6.0 (#4800).
* Issues: #4649. When retrieving offset metadata, if the binary value contained zeros and librdkafka was configured with `strndup`, the part of the buffer after the first zero contained uninitialized data instead of the rest of the metadata. Solved by avoiding the use of `strndup` for copying metadata. Happening since 0.9.0 (#4876).
* Issues: #4616. When an out-of-range error on a follower caused an offset reset, the corresponding ListOffsets request was made to the follower, causing a repeated "Not leader for partition" error. Fixed by always sending the request to the leader. Happening since 1.5.0 (tested version) or earlier (#4616, #4754, @kphelps).
* Issues: Fix to remove fetch queue messages that blocked the destroy of rdkafka instances. Circular dependencies from a partition fetch queue message to the same partition blocked the destroy of an instance; this happened in case the partition was removed from the cluster while it was being consumed. Solved by purging the internal partition queue, after the partition has been stopped and removed, to allow the reference count to reach zero and trigger a destroy. Happening since 2.0.2 (#4724).

# librdkafka v2.6.0

librdkafka v2.6.0 is a feature release:

* [KIP-460](https://cwiki.apache.org/confluence/display/KAFKA/KIP-460%3A+Admin+Leader+Election+RPC) Admin Leader Election RPC (#4845).
* [KIP-714] Complete consumer metrics support (#4808).
* [KIP-714] Produce latency average and maximum metrics support for parity with the Java client (#4847).
* [KIP-848] The ListConsumerGroups Admin API now has an optional filter to return only groups of given types.
* Added the Transactional id resource type for ACL operations (@JohnPreston, #4856).
* Fix for permanent fetch errors when using a newer Fetch RPC version with an older inter-broker protocol (#4806).

## Fixes

### Consumer fixes

* Issues: #4806. Fix for permanent fetch errors when brokers support a Fetch RPC version greater than 12 but the cluster is configured to use an inter-broker protocol lower than 2.8. In this case the returned topic ids are zero-valued and Fetch has to fall back to version 12, using topic names. Happening since v2.5.0 (#4806).

# librdkafka v2.5.3

librdkafka v2.5.3 is a feature release.

* Fix an assert being triggered during the push telemetry call when no metrics matched on the client side (#4826).
## Fixes

### Telemetry fixes

* Issue: #4833. Fix a regression introduced with [KIP-714](https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability) support in which an assert is triggered during the **PushTelemetry** call. This happens when no metric is matched on the client side among those requested by the broker subscription. Happening since 2.5.0 (#4826).

*Note: there were no v2.5.1 and v2.5.2 librdkafka releases*

# librdkafka v2.5.0

> [!WARNING]
> This version has introduced a regression in which an assert is triggered during the **PushTelemetry** call. This happens when no metric is matched on the client side among those requested by the broker subscription.
>
> You won't face any problem if:
> * The broker doesn't support [KIP-714](https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability).
> * The [KIP-714](https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability) feature is disabled on the broker side.
> * The [KIP-714](https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability) feature is disabled on the client side. It is enabled by default; set the configuration `enable.metrics.push` to `false` to disable it (a configuration sketch follows the Enhancements section below).
> * [KIP-714](https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability) is enabled on the broker side and there is no subscription configured there.
> * [KIP-714](https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability) is enabled on the broker side with subscriptions that match the [KIP-714](https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability) metrics defined on the client.
>
> Having said this, we strongly recommend using `v2.5.3` and above to not face this regression at all.

librdkafka v2.5.0 is a feature release.

* [KIP-951](https://cwiki.apache.org/confluence/display/KAFKA/KIP-951%3A+Leader+discovery+optimisations+for+the+client) Leader discovery optimisations for the client (#4756, #4767).
* Fix a segfault when using a long client id, caused by an erased segment when using flexver (#4689).
* Fix for an idempotent producer error, with a message batch not reconstructed identically when retried (#4750).
* Removed support for CentOS 6 and CentOS 7 (#4775).
* [KIP-714](https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability) Client metrics and observability (#4721).

## Upgrade considerations

* CentOS 6 and CentOS 7 support was removed as they reached EOL and security patches aren't publicly available anymore. ABI compatibility from CentOS 8 onwards is maintained through pypa/manylinux, AlmaLinux based. See also the [Confluent supported OSs page](https://docs.confluent.io/platform/current/installation/versions-interoperability.html#operating-systems) (#4775).

## Enhancements

* Update bundled lz4 (used when `./configure --disable-lz4-ext`) to [v1.9.4](https://github.com/lz4/lz4/releases/tag/v1.9.4), which contains bugfixes and performance improvements (#4726).
* [KIP-951](https://cwiki.apache.org/confluence/display/KAFKA/KIP-951%3A+Leader+discovery+optimisations+for+the+client) With this KIP, leader updates are received through Produce and Fetch responses in case of errors corresponding to leader changes, and a partition migration happens before refreshing the metadata cache (#4756, #4767).
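As referenced in the warning above, here is a minimal configuration sketch (not part of the original notes) for disabling KIP-714 metrics push on the client side, for clients that cannot yet move to v2.5.3 or later; the broker address is a placeholder.

```c
/* Sketch: opt out of KIP-714 client metrics (workaround for the v2.5.0
 * PushTelemetry assert; upgrading to v2.5.3+ is the recommended fix). */
#include <stdio.h>
#include <librdkafka/rdkafka.h>

int main(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        rd_kafka_conf_set(conf, "bootstrap.servers", "localhost:9092",
                          errstr, sizeof(errstr));
        /* Enabled by default; set to "false" to disable metrics push. */
        if (rd_kafka_conf_set(conf, "enable.metrics.push", "false",
                              errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK)
                fprintf(stderr, "enable.metrics.push: %s\n", errstr);

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf,
                                      errstr, sizeof(errstr));
        if (!rk) {
                fprintf(stderr, "Client creation failed: %s\n", errstr);
                return 1;
        }
        rd_kafka_destroy(rk);
        return 0;
}
```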
## Fixes

### General fixes

* Issues: [confluentinc/confluent-kafka-dotnet#2084](https://github.com/confluentinc/confluent-kafka-dotnet/issues/2084). Fix a segfault when a segment is erased and more data is written to the buffer. Happens since 1.x when a portion of the buffer (segment) is erased for flexver or compression. More likely to happen since 2.1.0, because of the upgrades to flexver, with certain string sizes like a long client id (#4689).

### Idempotent producer fixes

* Issues: #4736. Fix for an idempotent producer error, with a message batch not reconstructed identically when retried. It caused the error message "Local: Inconsistent state: Unable to reconstruct MessageSet", happening on large batches. Solved by using the same backoff baseline for all messages in the batch. Happens since 2.2.0 (#4750).

# librdkafka v2.4.0

librdkafka v2.4.0 is a feature release:

* [KIP-848](https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol): The Next Generation of the Consumer Rebalance Protocol. **Early Access**: This should be used only for evaluation and must not be used in production. Features and the contract of this KIP might change in the future (#4610).
* [KIP-467](https://cwiki.apache.org/confluence/display/KAFKA/KIP-467%3A+Augment+ProduceResponse+error+messaging+for+specific+culprit+records): Augment ProduceResponse error messaging for specific culprit records (#4583).
* [KIP-516](https://cwiki.apache.org/confluence/display/KAFKA/KIP-516%3A+Topic+Identifiers) Continue the partial implementation by adding a metadata cache by topic id and updating the topic id corresponding to the partition name (#4676).
* Upgrade OpenSSL to v3.0.12 (while building from source) with various security fixes, check the [release notes](https://www.openssl.org/news/cl30.txt).
* Integration tests can be started in KRaft mode and run against any GitHub Kafka branch other than the released versions.
* Fix pipeline inclusion of static binaries (#4666).
* Fix to the main loop timeout calculation leading to a tight loop for a max period of 1 ms (#4671).
* Fixed a bug causing duplicate message consumption from a stale fetch start offset in some particular cases (#4636).
* Fix to metadata cache expiration on full metadata refresh (#4677).
* Fix for a wrong error returned on full metadata refresh before joining a consumer group (#4678).
* Fix to metadata refresh interruption (#4679).
* Fix for an undesired partition migration with a stale leader epoch (#4680).
* Fix hang in cooperative consumer mode if an assignment is processed while closing the consumer (#4528).
* Upgrade OpenSSL to v3.0.13 (while building from source) with various security fixes, check the [release notes](https://www.openssl.org/news/cl30.txt) (@janjwerner-confluent, #4690).
* Upgrade zstd to v1.5.6, zlib to v1.3.1, and curl to v8.8.0 (@janjwerner-confluent, #4690).

## Upgrade considerations

* With KIP-467, `INVALID_MSG` (Java: CorruptRecordException) will be retried automatically. `INVALID_RECORD` (Java: InvalidRecordException) instead is not retriable and will be set only on the records that caused the error. The rest of the records in the batch will fail with the new error code `_INVALID_DIFFERENT_RECORD` (Java: KafkaException) and can be retried manually, depending on the application logic (#4583), as in the sketch below.
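A hedged sketch (not from the original notes) of a delivery report callback acting on the KIP-467 behaviour described above. It assumes the local error surfaces as `RD_KAFKA_RESP_ERR__INVALID_DIFFERENT_RECORD` in the C API (following librdkafka's `_`-prefixed naming for local errors) and that re-producing the affected messages is acceptable for the application.

```c
/* Sketch: delivery report callback distinguishing the per-record
 * INVALID_RECORD failure from batch-mates that failed with the local
 * _INVALID_DIFFERENT_RECORD code, which the application may retry. */
#include <stdio.h>
#include <librdkafka/rdkafka.h>

static void dr_msg_cb(rd_kafka_t *rk, const rd_kafka_message_t *msg,
                      void *opaque) {
        (void)opaque;

        if (!msg->err)
                return; /* Delivered successfully. */

        switch (msg->err) {
        case RD_KAFKA_RESP_ERR_INVALID_RECORD:
                /* This specific record was rejected by the broker:
                 * retrying it as-is will fail again. */
                fprintf(stderr, "Record permanently rejected: %s\n",
                        rd_kafka_err2str(msg->err));
                break;
        case RD_KAFKA_RESP_ERR__INVALID_DIFFERENT_RECORD:
                /* Another record in the same batch was invalid; this one
                 * may be re-produced if the application logic allows it. */
                rd_kafka_producev(rk,
                        RD_KAFKA_V_TOPIC(rd_kafka_topic_name(msg->rkt)),
                        RD_KAFKA_V_VALUE(msg->payload, msg->len),
                        RD_KAFKA_V_MSGFLAGS(RD_KAFKA_MSG_F_COPY),
                        RD_KAFKA_V_END);
                break;
        default:
                fprintf(stderr, "Delivery failed: %s\n",
                        rd_kafka_err2str(msg->err));
                break;
        }
}

/* Registered on the configuration before creating the producer:
 *   rd_kafka_conf_set_dr_msg_cb(conf, dr_msg_cb);
 */
```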
## Early Access

### [KIP-848](https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol): The Next Generation of the Consumer Rebalance Protocol

* With this new protocol the role of the Group Leader (a member) is removed and the assignment is calculated by the Group Coordinator (a broker) and sent to each member through heartbeats. The feature is still _not production-ready_. It's possible to try it in a non-production environment. A [guide](INTRODUCTION.md#next-generation-of-the-consumer-group-protocol-kip-848) is available with considerations and steps to follow to test it (#4610).

## Fixes

### General fixes

* Issues: [confluentinc/confluent-kafka-go#981](https://github.com/confluentinc/confluent-kafka-go/issues/981). In the librdkafka release pipeline a static build containing libsasl2 could be chosen instead of the alternative one without it. That caused the libsasl2 dependency to be required in confluent-kafka-go v2.1.0-linux-musl-arm64 and v2.3.0-linux-musl-arm64. Solved by correctly excluding the binary configured with that library when targeting a static build. Happening since v2.0.2, with the specified platforms, when using static binaries (#4666).
* Issues: #4684. When the main thread loop was awakened less than 1 ms before the expiration of a timeout, it was serving with a zero timeout, leading to increased CPU usage until the timeout was reached. Happening since 1.x.
* Issues: #4685. The metadata cache was cleared on full metadata refresh, leading to unnecessary refreshes and occasional `UNKNOWN_TOPIC_OR_PART` errors. Solved by updating the cache for existing or hinted entries instead of clearing them. Happening since 2.1.0 (#4677).
* Issues: #4589. A metadata call before a member joins the consumer group could lead to an `UNKNOWN_TOPIC_OR_PART` error. Solved by updating the consumer group following a metadata refresh only in safe states. Happening since 2.1.0 (#4678).
* Issues: #4577. Metadata refreshes without partition leader changes could lead to a loop of metadata calls at fixed intervals. Solved by stopping the metadata refresh when all existing metadata is non-stale. Happening since 2.3.0 (#4679).
* Issues: #4687. A partition migration could happen, using stale metadata, when the partition was undergoing a validation and being retried because of an error. Solved by doing a partition migration only with a non-stale leader epoch. Happening since 2.1.0 (#4680).

### Consumer fixes

* Issues: #4686. In case of a subscription change with a consumer using the cooperative assignor, it could resume fetching from a previous position. That could also happen when resuming a partition that wasn't paused. Fixed by ensuring that a resume operation is completely a no-op when the partition isn't paused. Happening since 1.x (#4636).
* Issues: #4527. While using the cooperative assignor, if an assignment is received while closing the consumer it's possible that it gets stuck in state WAIT_ASSIGN_CALL, while the method is converted to a full unassign. Solved by changing the state from WAIT_ASSIGN_CALL to WAIT_UNASSIGN_CALL while doing this conversion. Happening since 1.x (#4528).

# librdkafka v2.3.0

librdkafka v2.3.0 is a feature release:

* [KIP-516](https://cwiki.apache.org/confluence/display/KAFKA/KIP-516%3A+Topic+Identifiers) Partial support of topic identifiers. Topic identifiers in the metadata response are available through the new `rd_kafka_DescribeTopics` function (#4300, #4451).
* [KIP-117](https://cwiki.apache.org/confluence/display/KAFKA/KIP-117%3A+Add+a+public+AdminClient+API+for+Kafka+admin+operations) Add support for the AdminAPI `DescribeCluster()` and `DescribeTopics()` (#4240, @jainruchir).
* [KIP-430](https://cwiki.apache.org/confluence/display/KAFKA/KIP-430+-+Return+Authorized+Operations+in+Describe+Responses): Return authorized operations in Describe responses (#4240, @jainruchir).
* [KIP-580](https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients): Added an exponential backoff mechanism for retriable requests, with `retry.backoff.ms` as the minimum backoff and `retry.backoff.max.ms` as the maximum backoff, with 20% jitter (#4422).
* [KIP-396](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=97551484): completed the implementation with the addition of ListOffsets (#4225).
* Fixed ListConsumerGroupOffsets not fetching offsets for all the topics in a group with Apache Kafka versions below 2.4.0.
* Add a missing destroy that led to leaking partition structure memory when there are partition leader changes and a stale leader epoch is received (#4429).
* Fix a segmentation fault when closing a consumer using the cooperative-sticky assignor before the first assignment (#4381).
* Fix for insufficient buffer allocation when allocating rack information (@wolfchimneyrock, #4449).
* Fix for an infinite loop of OffsetForLeaderEpoch requests on quick leader changes (#4433).
* Fix to add the leader epoch to control messages, to make sure they're stored for committing even without a subsequent fetch message (#4434).
* Fix for stored offsets not being committed if they lacked the leader epoch (#4442).
* Upgrade OpenSSL to v3.0.11 (while building from source) with various security fixes, check the [release notes](https://www.openssl.org/news/cl30.txt) (#4454, started by @migarc1).
* Fix to ensure permanent errors during offset validation continue being retried and don't cause an offset reset (#4447).
* Fix to ensure `max.poll.interval.ms` is reset when `rd_kafka_poll` is called with `consume_cb` (#4431).
* Fix for idempotent producer fatal errors, triggered after a possibly persisted message state (#4438).
* Fix `rd_kafka_query_watermark_offsets` continuing beyond timeout expiry (#4460).
* Fix `rd_kafka_query_watermark_offsets` not refreshing the partition leader after a leader change and a subsequent `NOT_LEADER_OR_FOLLOWER` error (#4225).

## Upgrade considerations

* `retry.backoff.ms`: If it is set greater than `retry.backoff.max.ms`, which has a default value of 1000 ms, then it assumes the value of `retry.backoff.max.ms`. To avoid this behaviour make sure that `retry.backoff.ms` is always less than `retry.backoff.max.ms`. If the two are equal, the backoff will be linear instead of exponential (see the sketch below).
* `topic.metadata.refresh.fast.interval.ms`: If it is set greater than `retry.backoff.max.ms`, which has a default value of 1000 ms, then it assumes the value of `retry.backoff.max.ms`. To avoid this behaviour make sure that `topic.metadata.refresh.fast.interval.ms` is always less than `retry.backoff.max.ms`. If the two are equal, the backoff will be linear instead of exponential.
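A hedged configuration sketch (not part of the original notes) keeping the KIP-580 backoff exponential by ensuring `retry.backoff.ms` stays below `retry.backoff.max.ms`; the chosen values are illustrative only.

```c
/* Sketch: exponential retry backoff per KIP-580. The minimum must stay
 * below the maximum, otherwise it is capped and the backoff turns linear. */
#include <stdio.h>
#include <librdkafka/rdkafka.h>

int main(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        /* Illustrative values: start retries at 200 ms ... */
        rd_kafka_conf_set(conf, "retry.backoff.ms", "200",
                          errstr, sizeof(errstr));
        /* ... and let the exponential backoff grow up to 2000 ms. */
        rd_kafka_conf_set(conf, "retry.backoff.max.ms", "2000",
                          errstr, sizeof(errstr));

        rd_kafka_conf_destroy(conf); /* Would normally be passed to rd_kafka_new(). */
        return 0;
}
```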
## Fixes

### General fixes

* An assertion failed with insufficient buffer size when allocating rack information on 32-bit architectures. Solved by aligning all allocations to the maximum allowed word size (#4449).
* The timeout for `rd_kafka_query_watermark_offsets` was not enforced after making the necessary ListOffsets requests, and thus it never timed out in case of broker/network issues. Fixed by setting an absolute timeout (#4460).

### Idempotent producer fixes

* After a possibly persisted error, such as a disconnection or a timeout, the next expected sequence used to increase, leading to a fatal error if the message wasn't persisted and the second one in the queue failed with an `OUT_OF_ORDER_SEQUENCE_NUMBER`. The error could contain the message "sequence desynchronization" with just one possibly persisted error, or "rewound sequence number" in case of multiple errored messages. Solved by treating the possibly persisted message as _not_ persisted, and expecting a `DUPLICATE_SEQUENCE_NUMBER` error in case it was, or `NO_ERROR` in case it wasn't; in both cases the message will be considered delivered (#4438).

### Consumer fixes

* Stored offsets were excluded from the commit if the leader epoch was less than the committed epoch, which is possible if the leader epoch is the default -1. This didn't happen in the Python, Go and .NET bindings when the stored position was taken from the message. Solved by checking only that the stored offset is greater than the committed one if either the stored or the committed leader epoch is -1 (#4442).
* If an OffsetForLeaderEpoch request was being retried, and the leader changed while the retry was in-flight, an infinite loop of requests was triggered, because the leader epoch wasn't being updated correctly. Fixed by updating the leader epoch before sending the request (#4433).
* During offset validation a permanent error like a host resolution failure would cause an offset reset. This isn't what's expected or what the Java implementation does. Solved by retrying even in case of permanent errors (#4447).
* Using `rd_kafka_poll_set_consumer` along with a consume callback, and then calling `rd_kafka_poll` to service the callbacks, would not reset `max.poll.interval.ms`. This was because only `rk_rep` was being checked for consumer messages, while the method that services the queue internally also services the queue `rk_rep` is forwarded to, which is `rkcg_q`. Solved by moving the `max.poll.interval.ms` check into `rd_kafka_q_serve` (#4431).
* After a leader change a `rd_kafka_query_watermark_offsets` call would continue trying to call ListOffsets on the old leader if the topic wasn't included in the subscription set, so it started querying the new leader only after `topic.metadata.refresh.interval.ms` (#4225).

# librdkafka v2.2.0

librdkafka v2.2.0 is a feature release:

* Fix a segmentation fault when subscribing to non-existent topics and using the consume batch functions (#4273).
* Store offset commit metadata in `rd_kafka_offsets_store` (@mathispesch, #4084).
* Fix a bug that happens when skipping tags, causing a buffer underflow in MetadataResponse (#4278).
* Fix a bug where the topic leader is not refreshed in the same metadata call even if the leader is present.
* [KIP-881](https://cwiki.apache.org/confluence/display/KAFKA/KIP-881%3A+Rack-aware+Partition+Assignment+for+Kafka+Consumers): Add support for rack-aware partition assignment for consumers (#4184, #4291, #4252).
* Fix several bugs with the sticky assignor in case of partition ownership changing between members of the consumer group (#4252).
* [KIP-368](https://cwiki.apache.org/confluence/display/KAFKA/KIP-368%3A+Allow+SASL+Connections+to+Periodically+Re-Authenticate): Allow SASL Connections to Periodically Re-Authenticate (#4301, started by @vctoriawu).
* Avoid treating an OpenSSL error as a permanent error and treat unclean SSL closes as normal ones (#4294).
* Added `fetch.queue.backoff.ms` to the consumer to control how long the consumer backs off the next fetch attempt (@bitemyapp, @edenhill, #2879).
* [KIP-235](https://cwiki.apache.org/confluence/display/KAFKA/KIP-235%3A+Add+DNS+alias+support+for+secured+connection): Add DNS alias support for secured connection (#4292).
* [KIP-339](https://cwiki.apache.org/confluence/display/KAFKA/KIP-339%3A+Create+a+new+IncrementalAlterConfigs+API): IncrementalAlterConfigs API (started by @PrasanthV454, #4110).
* [KIP-554](https://cwiki.apache.org/confluence/display/KAFKA/KIP-554%3A+Add+Broker-side+SCRAM+Config+API): Add Broker-side SCRAM Config API (#4241).

## Enhancements

* Added `fetch.queue.backoff.ms` to the consumer to control how long the consumer backs off the next fetch attempt. When the pre-fetch queue has exceeded its queuing thresholds, `queued.min.messages` and `queued.max.messages.kbytes`, it backs off for 1 second. If those parameters have to be set too high to hold 1 s of data, this new parameter allows backing off the fetch earlier, reducing memory requirements (a configuration sketch follows at the end of this release's notes).

## Fixes

### General fixes

* Fix a bug that happens when skipping tags, causing a buffer underflow in MetadataResponse. This is triggered since RPC version 9 (v2.1.0), when using Confluent Platform, only when racks are set, observers are activated and there is more than one partition. Fixed by skipping the correct amount of bytes when tags are received.
* Avoid treating an OpenSSL error as a permanent error and treat unclean SSL closes as normal ones. When SSL connections are closed without `close_notify`, OpenSSL 3.x sets a new type of error that was interpreted as permanent in librdkafka. It can cause a different issue depending on the RPC: if received while waiting for an OffsetForLeaderEpoch response, it triggers an offset reset following the configured policy. Solved by treating SSL errors as transport errors and by setting an OpenSSL flag that allows treating unclean SSL closes as normal ones. These types of errors can happen if the other side doesn't support `close_notify` or if there's a TCP connection reset.

### Consumer fixes

* In case of multiple owners of a partition with different generations, the sticky assignor would pick the earliest (lowest generation) member as the current owner, which would lead to stickiness violations. Fixed by choosing the latest (highest generation) member.
* The case where the same partition is owned by two members with the same generation indicates an issue. The sticky assignor had some code to handle this, but it was non-functional and did not have parity with the Java assignor. Fixed by invalidating any such partition from the current assignment completely.
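As mentioned in the Enhancements above, here is a minimal, hedged configuration sketch (not from the original notes) showing `fetch.queue.backoff.ms` together with the pre-fetch queue thresholds; all values are illustrative.

```c
/* Sketch: bound consumer pre-fetch memory and tune the fetch backoff.
 * Values are illustrative, not recommendations. */
#include <stdio.h>
#include <librdkafka/rdkafka.h>

int main(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        /* Pre-fetch queue thresholds. */
        rd_kafka_conf_set(conf, "queued.min.messages", "10000",
                          errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "queued.max.messages.kbytes", "65536",
                          errstr, sizeof(errstr));
        /* Back off the next fetch for 500 ms once the thresholds are
         * exceeded (previously a fixed 1 s backoff). */
        rd_kafka_conf_set(conf, "fetch.queue.backoff.ms", "500",
                          errstr, sizeof(errstr));

        rd_kafka_conf_destroy(conf); /* Would normally be passed to rd_kafka_new(). */
        return 0;
}
```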
# librdkafka v2.1.1

librdkafka v2.1.1 is a maintenance release:

* Avoid duplicate messages when a fetch response is received in the middle of an offset validation request (#4261).
* Fix a segmentation fault when subscribing to a non-existent topic and calling `rd_kafka_message_leader_epoch()` on the polled `rkmessage` (#4245).
* Fix a segmentation fault when fetching from a follower and the partition lease expires while waiting for the result of a list offsets operation (#4254).
* Fix the documentation for the admin request timeout, which incorrectly stated -1 for an infinite timeout. That timeout can't be infinite.
* Fix the CMake pkg-config cURL requirement and use the pkg-config `Requires.private` field (@FantasqueX, @stertingen, #4180).
* Fixes certain cases where polling would not keep the consumer in the group or make it rejoin it (#4256).
* Fix to the C++ `set_leader_epoch` method of `TopicPartitionImpl`, which wasn't storing the passed value (@pavel-pimenov, #4267).

## Fixes

### Consumer fixes

* Duplicate messages could be emitted when a fetch response was received in the middle of an offset validation request. Solved by avoiding a restart from the last application offset when offset validation succeeds.
* When fetching from a follower, if the partition lease expired after 5 minutes and a list offsets operation was requested to retrieve the earliest or latest offset, it resulted in a segmentation fault. This was fixed by allowing threads other than the main one to call the `rd_kafka_toppar_set_fetch_state` function, given they hold the lock on the `rktp`.
* In v2.1.0, a bug was fixed which caused polling any queue to reset `max.poll.interval.ms`. Only certain functions were made to reset the timer, but it is possible for the user to obtain the queue with messages from the broker, skipping these functions. This was fixed by encoding in the queue itself whether polling it should reset the timer.

# librdkafka v2.1.0

librdkafka v2.1.0 is a feature release:

* [KIP-320](https://cwiki.apache.org/confluence/display/KAFKA/KIP-320%3A+Allow+fetchers+to+detect+and+handle+log+truncation) Allow fetchers to detect and handle log truncation (#4122).
* Fix a reference count issue blocking the consumer from closing (#4187).
* Fix a protocol issue with the ListGroups API, where an extra field was appended for API versions greater than or equal to 3 (#4207).
* Fix an issue with `max.poll.interval.ms`, where polling any queue would cause the timeout to be reset (#4176).
* Fix the seek partition timeout, which was one thousand times lower than the passed value (#4230).
* Fix multiple inconsistent behaviours in batch APIs during **pause** or **resume** operations (#4208). See the **Consumer fixes** section below for more information.
* Update lz4.c from upstream. Fixes [CVE-2021-3520](https://github.com/advisories/GHSA-gmc7-pqv9-966m) (by @filimonov, #4232).
* Upgrade OpenSSL to v3.0.8 with various security fixes, check the [release notes](https://www.openssl.org/news/cl30.txt) (#4215).

## Enhancements

* Added `rd_kafka_topic_partition_get_leader_epoch()` (and `set..()`).
* Added partition leader epoch APIs (a usage sketch follows this list):
  - `rd_kafka_topic_partition_get_leader_epoch()` (and `set..()`)
  - `rd_kafka_message_leader_epoch()`
  - `rd_kafka_*assign()` and `rd_kafka_seek_partitions()` now support partitions with a leader epoch set.
  - `rd_kafka_offsets_for_times()` will return per-partition leader epochs.
  - `leader_epoch`, `stored_leader_epoch`, and `committed_leader_epoch` added to per-partition statistics.
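A hedged usage sketch (not part of the original notes) reading the leader epoch from a consumed message and from a committed topic partition; it assumes an already-subscribed consumer handle `rk`, a placeholder topic name, and that -1 is returned when the epoch is unknown.

```c
/* Sketch: reading partition leader epochs introduced in v2.1.0.
 * `rk` is an existing, subscribed consumer instance. */
#include <stdio.h>
#include <librdkafka/rdkafka.h>

static void poll_and_print_leader_epoch(rd_kafka_t *rk) {
        rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 1000);
        if (!msg)
                return;

        if (!msg->err) {
                /* -1 means the leader epoch is not known. */
                int32_t epoch = rd_kafka_message_leader_epoch(msg);
                printf("offset %lld, leader epoch %d\n",
                       (long long)msg->offset, (int)epoch);
        }
        rd_kafka_message_destroy(msg);

        /* Leader epoch on committed offsets, too. */
        rd_kafka_topic_partition_list_t *committed =
                rd_kafka_topic_partition_list_new(1);
        rd_kafka_topic_partition_list_add(committed, "example-topic", 0);
        if (!rd_kafka_committed(rk, committed, 5000)) {
                int32_t epoch = rd_kafka_topic_partition_get_leader_epoch(
                        &committed->elems[0]);
                printf("committed leader epoch %d\n", (int)epoch);
        }
        rd_kafka_topic_partition_list_destroy(committed);
}
```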
## Fixes

### OpenSSL fixes

* Fixed the OpenSSL static build not being able to use external modules like the FIPS provider module.

### Consumer fixes

* A reference count issue was blocking the consumer from closing. The problem would happen when a partition is lost, because it was forcibly unassigned from the consumer or the corresponding topic was deleted.
* When using `rd_kafka_seek_partitions`, the remaining timeout was converted from microseconds to milliseconds, but the expected unit for that parameter is microseconds.
* Fixed the known issues related to the Batch Consume APIs mentioned in the v2.0.0 release notes.
* Fixed `rd_kafka_consume_batch()` and `rd_kafka_consume_batch_queue()` intermittently updating `app_offset` and `store_offset` incorrectly when **pause** or **resume** was being used for a partition.
* Fixed `rd_kafka_consume_batch()` and `rd_kafka_consume_batch_queue()` intermittently skipping offsets when **pause** or **resume** was being used for a partition.

## Known Issues

### Consume Batch API

* When the `rd_kafka_consume_batch()` and `rd_kafka_consume_batch_queue()` APIs are used with any of the **seek**, **pause**, **resume** or **rebalancing** operations, `on_consume` interceptors might be called incorrectly (maybe multiple times) for messages that were not consumed.

### Consume API

* Duplicate messages can be emitted when a fetch response is received in the middle of an offset validation request.
* Segmentation fault when subscribing to a non-existent topic and calling `rd_kafka_message_leader_epoch()` on the polled `rkmessage`.

# librdkafka v2.0.2

librdkafka v2.0.2 is a maintenance release:

* Fix the OpenSSL version in the Win32 nuget package (#4152).

# librdkafka v2.0.1

librdkafka v2.0.1 is a maintenance release:

* Fixed the nuget package for the Linux ARM64 release (#4150).

# librdkafka v2.0.0

librdkafka v2.0.0 is a feature release:

* [KIP-88](https://cwiki.apache.org/confluence/display/KAFKA/KIP-88%3A+OffsetFetch+Protocol+Update) OffsetFetch Protocol Update (#3995).
* [KIP-222](https://cwiki.apache.org/confluence/display/KAFKA/KIP-222+-+Add+Consumer+Group+operations+to+Admin+API) Add Consumer Group operations to the Admin API (started by @lesterfan, #3995).
* [KIP-518](https://cwiki.apache.org/confluence/display/KAFKA/KIP-518%3A+Allow+listing+consumer+groups+per+state) Allow listing consumer groups per state (#3995).
* [KIP-396](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=97551484) Partially implemented: support for AlterConsumerGroupOffsets (started by @lesterfan, #3995).
* OpenSSL 3.0.x support - the maximum bundled OpenSSL version is now 3.0.7 (previously 1.1.1q).
* Fixes to the transactional and idempotent producer.

## Upgrade considerations

### OpenSSL 3.0.x

#### OpenSSL default ciphers

The introduction of OpenSSL 3.0.x in the self-contained librdkafka bundles changes the default set of available ciphers; in particular, all obsolete or insecure ciphers and algorithms listed in the OpenSSL [legacy](https://www.openssl.org/docs/man3.0/man7/OSSL_PROVIDER-legacy.html) manual page are now disabled by default.

**WARNING**: These ciphers are disabled for security reasons and it is highly recommended NOT to use them. Should you need to use any of these old ciphers you'll need to explicitly enable the `legacy` provider by configuring `ssl.providers=default,legacy` on the librdkafka client.

#### OpenSSL engines and providers

OpenSSL 3.0.x deprecates the use of engines, which are being replaced by providers. As such, librdkafka will emit a deprecation warning if `ssl.engine.location` is configured. OpenSSL providers may be configured with the new `ssl.providers` configuration property.

### Broker TLS certificate hostname verification

The default value for `ssl.endpoint.identification.algorithm` has been changed from `none` (no hostname verification) to `https`, which enables broker hostname verification (to counter man-in-the-middle impersonation attacks) by default. A configuration sketch for these TLS-related properties follows below.
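A hedged configuration sketch (not from the original notes) touching the TLS-related properties discussed above: enabling the `legacy` OpenSSL provider and setting the hostname verification algorithm explicitly. The broker address is a placeholder, and the legacy provider should only be enabled if obsolete ciphers are truly required.

```c
/* Sketch: TLS-related settings discussed in the v2.0.0 upgrade notes. */
#include <stdio.h>
#include <librdkafka/rdkafka.h>

int main(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        rd_kafka_conf_set(conf, "bootstrap.servers", "broker1:9093",
                          errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "security.protocol", "ssl",
                          errstr, sizeof(errstr));

        /* Only if obsolete ciphers are truly required (not recommended). */
        rd_kafka_conf_set(conf, "ssl.providers", "default,legacy",
                          errstr, sizeof(errstr));

        /* Hostname verification defaults to "https" since v2.0.0; set it
         * to "none" only to reproduce the pre-2.0.0 behaviour. */
        rd_kafka_conf_set(conf, "ssl.endpoint.identification.algorithm",
                          "https", errstr, sizeof(errstr));

        rd_kafka_conf_destroy(conf); /* Would normally be passed to rd_kafka_new(). */
        return 0;
}
```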