Description
Originally discussed and narrowed down based on log evidence in #8776.
Consider the following scenario in a cluster with very short-lived connections and thus very high connection churn:
- A client connection is opened, and an Erlang process (a "green thread", if you will) is started to handle it. The connection is identified by a process ID such as <0.1139.0>
- The connection is closed a few milliseconds later
- `file_handle_cache` handles the connection closure event and deletes the row holding that connection's metrics
So far so good. Now, concurrently with that, another connection is opened and
- It gets the same local Erlang process identifier, say, <0.1139.0> (the same value as the recently terminated connection)
- It starts performing operations that result in metrics being updated
- But the closure event handler for the earlier connection concurrently deletes the metrics row from the table
- Now all metric update (write) operations fail due to the missing key
- And with that, so do all client operations on the connection
In other words, for this scenario to happen, two very short-lived connections must be "assigned" the same Erlang process ("green thread") ID in quick succession; then two independent metric table updates can step on one another.
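The interleaving above can be modeled with a minimal sketch. This is a Python stand-in for the Erlang/ETS metrics table, and every name in it is hypothetical; it only demonstrates why a naive keyed write fails once a stale closure handler has removed the row:

```python
# Minimal model of the race described above. A dict stands in for the
# ETS-backed metrics table; all function names here are hypothetical.

metrics = {}  # connection pid -> operation count (stands in for an ETS row)

def connection_opened(pid):
    metrics[pid] = 0

def record_operation(pid):
    # Naive update: assumes the row still exists.
    metrics[pid] += 1  # KeyError if the row was already deleted

def connection_closed(pid):
    # The closure handler clears the metrics row for this pid.
    metrics.pop(pid, None)

pid = "<0.1139.0>"
connection_opened(pid)    # first, very short-lived connection
# ... its closure event is queued but not yet handled ...
connection_opened(pid)    # a new connection reuses the same pid; row re-created
connection_closed(pid)    # stale closure handler for the FIRST connection runs now

update_failed = False
try:
    record_operation(pid)  # the new connection's metric update
except KeyError:
    update_failed = True   # the write fails: the row is gone
print("update failed:", update_failed)
```

In the real system the two events run in separate Erlang processes, so the ordering shown sequentially here happens nondeterministically under high connection churn.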
`file_handle_cache` and `file_handle_cache_stats` should be more defensive when updating metrics.
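One possible shape of such defensiveness, sketched in the same hypothetical Python model (the actual fix would live in the Erlang modules named above and likely operate on ETS directly): treat a missing row as a no-op rather than letting the write take down the connection's operations.

```python
# Hedged sketch: a defensive metric update that tolerates a missing row.
# The name and signature are hypothetical, not the project's real API.

def record_operation_defensively(metrics, pid):
    # If the row was already deleted by a (possibly stale) closure
    # handler, skip the update instead of failing the whole operation.
    if pid not in metrics:
        return False  # row gone: treat the update as a no-op
    metrics[pid] += 1
    return True

metrics = {"<0.1139.0>": 0}
record_operation_defensively(metrics, "<0.1139.0>")   # normal path: row exists
del metrics["<0.1139.0>"]                             # row cleared concurrently
record_operation_defensively(metrics, "<0.1139.0>")   # no crash, update skipped
```

In Erlang terms, the analogous approach is an update that supplies a default (or tolerates a missing key) instead of assuming the row is present.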
Besides avoiding connection churn, using classic queues v2 reduces the probability of running into this issue: CQv2 does not rely on the file handle cache where CQv1 did.