Description
Originally discussed and narrowed down based on log evidence in #8776.
Consider the following scenario in a cluster with very short-lived connections and thus very high connection churn:
- A client connection is opened, and an Erlang process (a "green thread", if you will) is started to handle it. The connection is identified by a process ID such as <0.1139.0>
- The connection is closed a few milliseconds later
- `file_handle_cache` handles the connection closure event and deletes the row holding that connection's metrics
So far so good. Now, concurrently with that, another connection is opened and
- It gets the same local Erlang process identifier, say, <0.1139.0> (the same value as the recently terminated connection)
- It starts performing operations that result in metrics being updated
- But the closure event handler for the earlier connection concurrently deletes the metrics row from the table
- Now all metric update (write) operations fail due to the missing key
- And with that, so do all client operations on the connection
In other words, for this scenario to happen, two very short-lived connections must be "assigned" the same Erlang process ("green thread") ID in quick succession; then two independent metric table updates can step on one another.
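The interleaving above can be modeled with a minimal sketch. This is a Python stand-in for the Erlang/ETS metrics table, and every name in it is hypothetical; it only demonstrates why a naive keyed write fails once a stale closure handler has removed the row:

```python
# Minimal model of the race described above. A dict stands in for the
# ETS-backed metrics table; all function names here are hypothetical.

metrics = {}  # connection pid -> operation count (stands in for an ETS row)

def connection_opened(pid):
    metrics[pid] = 0

def record_operation(pid):
    # Naive update: assumes the row still exists.
    metrics[pid] += 1  # KeyError if the row was already deleted

def connection_closed(pid):
    # The closure handler clears the metrics row for this pid.
    metrics.pop(pid, None)

pid = "<0.1139.0>"
connection_opened(pid)    # first, very short-lived connection
# ... its closure event is queued but not yet handled ...
connection_opened(pid)    # a new connection reuses the same pid; row re-created
connection_closed(pid)    # stale closure handler for the FIRST connection runs now

update_failed = False
try:
    record_operation(pid)  # the new connection's metric update
except KeyError:
    update_failed = True   # the write fails: the row is gone
print("update failed:", update_failed)
```

In the real system the two events run in separate Erlang processes, so the ordering shown sequentially here happens nondeterministically under high connection churn.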
`file_handle_cache` and `file_handle_cache_stats` should be more defensive when updating metrics.
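One possible shape of such defensiveness, sketched in the same hypothetical Python model (the actual fix would live in the Erlang modules named above and likely operate on ETS directly): treat a missing row as a no-op rather than letting the write take down the connection's operations.

```python
# Hedged sketch: a defensive metric update that tolerates a missing row.
# The name and signature are hypothetical, not the project's real API.

def record_operation_defensively(metrics, pid):
    # If the row was already deleted by a (possibly stale) closure
    # handler, skip the update instead of failing the whole operation.
    if pid not in metrics:
        return False  # row gone: treat the update as a no-op
    metrics[pid] += 1
    return True

metrics = {"<0.1139.0>": 0}
record_operation_defensively(metrics, "<0.1139.0>")   # normal path: row exists
del metrics["<0.1139.0>"]                             # row cleared concurrently
record_operation_defensively(metrics, "<0.1139.0>")   # no crash, update skipped
```

In Erlang terms, the analogous approach is an update that supplies a default (or tolerates a missing key) instead of assuming the row is present.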
Besides avoiding connection churn, using classic queues v2 reduces the probability of running into this issue: CQv2 does not rely on the file handle cache where CQv1 did.