High connection or CQv1 queue churn can cause file_handle_cache to terminate, making the node unavailable #8784

Closed
@michaelklishin

Description

Originally discussed and narrowed down based on log evidence in #8776.

Consider the following scenario in a cluster with very short-lived connections and thus very high connection churn:

  • A client connection is opened, and an Erlang process (a "green thread" if you will) is started to handle it. This connection has an identifier such as <0.1139.0>
  • The connection is closed a few milliseconds later
  • file_handle_cache handles the connection closure event and clears a row with some metrics

So far so good. Now, concurrently with that, another connection is opened, and:

  • It gets the same local Erlang process identifier, say, <0.1139.0> (the same value as the recently terminated connection)
  • It starts performing operations that result in metrics being updated
  • But the above connection closure event handler will concurrently delete the metric table row
  • Now all metric update (write) operations fail due to the missing key
  • And with that, all client operations on the connection fail

In other words, for this scenario to happen, one very short-lived connection must be followed by another very short-lived connection that gets "assigned" the same Erlang process ("green thread") ID. Two independent metric table updates can then step on one another.
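
The interleaving described above can be replayed with a toy model. This is only a sketch: a plain Python dict stands in for the metrics table, and every name in it (metrics, on_connection_opened, on_connection_closed, record_write, replay_interleaving) is hypothetical and not taken from RabbitMQ. It assumes that a strict update fails when its row is missing, as the scenario states.

```python
# Toy model of the interleaving described in the issue.
# A dict stands in for the metrics table; all names are hypothetical.

metrics = {}  # process identifier -> counters


def on_connection_opened(pid):
    # A new connection process registers a fresh metrics row.
    metrics[pid] = {"writes": 0}


def on_connection_closed(pid):
    # Closure event handler: clears the row for this identifier.
    metrics.pop(pid, None)


def record_write(pid):
    # Strict update: raises KeyError when the row is missing.
    metrics[pid]["writes"] += 1


def replay_interleaving():
    pid = "<0.1139.0>"
    on_connection_opened(pid)   # first connection starts
    # The first connection terminates here, but its closure event
    # has not been handled yet.
    on_connection_opened(pid)   # second connection reuses the identifier
    record_write(pid)           # second connection updates its metrics
    on_connection_closed(pid)   # late closure event of the FIRST connection
    try:
        record_write(pid)       # the row is gone, so the update fails
    except KeyError:
        return "update failed: row missing"
    return "update succeeded"
```

In this model the late closure event removes a row that now belongs to the second connection, so its next update fails even though that connection is still open.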

file_handle_cache and file_handle_cache_stats should be more defensive when updating metrics.
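
As a rough illustration of what "more defensive" could mean, the sketch below shows two ways an update can tolerate a missing row instead of failing. It models the metrics table as a plain Python dict. The function names update_or_skip and update_or_create are hypothetical, and the issue does not say which approach, if either, the actual fix takes.

```python
# Two defensive update strategies for a table keyed by process identifier.
# The table is a plain dict; all names are hypothetical.


def update_or_skip(table, pid, amount=1):
    """Apply the update only if the row exists; report whether it did."""
    row = table.get(pid)
    if row is None:
        # The row was removed concurrently: drop this metric update
        # instead of raising and failing the caller.
        return False
    row["writes"] += amount
    return True


def update_or_create(table, pid, amount=1):
    """Recreate the row with a zeroed counter if it is missing."""
    row = table.setdefault(pid, {"writes": 0})
    row["writes"] += amount
    return row["writes"]
```

Skipping loses a single metric sample, while recreating keeps counting but may leave behind a row that nothing cleans up later. In both cases the client operation itself is no longer affected by the missing key.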

Besides avoiding connection churn, using classic queues v2 (CQv2) reduces the probability of running into this issue: unlike CQv1, they do not rely on the file handle cache.

