Kusama Validators Litep2p - Monitoring and Feedback #7076

Open · 1 task

lexnv opened this issue Jan 7, 2025 · 8 comments

lexnv commented Jan 7, 2025

This is a placeholder issue for the community (Kusama validators) to share their feedback, monitoring, and logs.

We’re excited to announce the next step in improving the Kusama network with the introduction of litep2p—a more resource-efficient network backend. We need your help to make this transition successful!

Enable Litep2p Backend

We’re gradually rolling out litep2p across all validators. Here’s how you can help:

  1. Ensure you're running the latest Polkadot release (version 2412 or newer).
  2. Restart your node with the following flag (a sketch of the full restart is shown after this list):
--network-backend litep2p
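
For reference, here is a minimal sketch of what the restart might look like on a systemd-managed validator. The unit name, binary path, chain, and other flags below are assumptions about a typical setup; only the --network-backend litep2p flag itself comes from this issue.

    # Assumed: a systemd unit named "polkadot" and the polkadot binary on PATH.
    # 1. Append the flag to the ExecStart= line of the unit file, e.g.:
    #    ExecStart=/usr/bin/polkadot --validator --chain kusama --network-backend litep2p
    # 2. Reload systemd and restart the service:
    sudo systemctl daemon-reload
    sudo systemctl restart polkadot

    # Or, when launching by hand, add the flag to your usual command line:
    polkadot --validator --chain kusama --network-backend litep2p

To go back to the default backend, remove the flag again and restart the node.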

Rollout Plan

  • Phase 1: We need around 100 validators to start the transition.
  • Phase 2 (in a few days): Increase to 500 validators running litep2p.
  • Phase 3: Full rollout—inviting all validators to switch.

Monitoring & Feedback

Please keep an eye on your node after restarting and report any warnings or errors you encounter. In the first 15–30 minutes after the restart, you may see some temporary warnings, such as:

    Some network error occurred when fetching erasure chunk
    Low connectivity

We'd like to pay special attention to at least the following metrics (a quick spot-check command follows the list):

  • Sync peers (substrate_sync_peers)
  • Block height (substrate_block_height)
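
As a quick spot-check, both metrics can be read straight from the node's Prometheus endpoint. The sketch below assumes the default exporter on localhost:9615; adjust the host and port if you changed them, or query through your existing Prometheus/Grafana setup instead.

    # Fetch the two metrics from the node's Prometheus exporter (default port 9615 assumed):
    curl -s http://127.0.0.1:9615/metrics | grep -E '^substrate_(sync_peers|block_height)'

A sudden drop in substrate_sync_peers, or a finalized substrate_block_height that stops advancing after the restart, would be worth reporting here.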

lexnv added this to the Networking project on Jan 7, 2025
alexggh commented Jan 7, 2025

This is a concern for me: #7077; we should pay attention to it.

alexggh commented Jan 8, 2025

> This is a concern for me: #7077; we should pay attention to it.

And this is the impact. I assume it was because validators restarted to use litep2p, but we should keep an eye on it if it keeps repeating.

(Screenshot 2025-01-08 at 11:50:49)

https://grafana.teleport.parity.io/goto/-FyJtnvNg?orgId=1

alexggh commented Jan 9, 2025

> This is a concern for me: #7077; we should pay attention to it.
>
> And this is the impact. I assume it was because validators restarted to use litep2p, but we should keep an eye on it if it keeps repeating.

Confirmed with paranodes: he had some validators that were constantly restarting, so that was the reason for these finality delays.

eskimor commented Jan 10, 2025

And what was the reason for constantly restarting?

alexggh commented Jan 10, 2025

> And what was the reason for constantly restarting?

Paranodes had a script that restarted the node on low connectivity, which is exactly what #7077 will produce.

Nevertheless, even after the script was stopped we are still seeing occasional dips in finality because of no-shows on around ~20 validators. I'm working with @lexnv to understand what might cause that, because it correlates perfectly with the enablement of litep2p on validators.

alexggh commented Jan 14, 2025

> And what was the reason for constantly restarting?
>
> Paranodes had a script that restarted the node on low connectivity, which is exactly what #7077 will produce.
>
> Nevertheless, even after the script was stopped we are still seeing occasional dips in finality because of no-shows on around ~20 validators. I'm working with @lexnv to understand what might cause that, because it correlates perfectly with the enablement of litep2p on validators.

I did a bit more investigation along this path. The following candidates are slow to be approved, and they induce a finality lag of around ~16 blocks:

0x8fe297cb881a48611829b911b9dfc4c176d5a540b5fd0ab4a2114b6b65e04d71
0x16a13885e900a4afc18d1689a0197602ac64176d70ccd8608f9a260de3b3a22e
0x869c13acf8857fc2df09bd3991cfda24f81b639373bdc68ba7a10436940c4a4d
0x83a323ec47de77df87e9e2dfc49650169622fd641304e62f8f83db085fc1822c
0x7881ad19ad2cd502c7af16313bf569c9c5ac5d42cd47b9b2bede8df71f63cd56
0xb062a7c221fcd1ecfe33f1aabc1b48abf3949a9ab09713e8ca6a86800388103c
0xd233f3836a3a4bc7703c7b211d8da49bd7455e7b2d264b6cfd053f42e73948f7
0x8f811f8b2d246badb748a46156fc0a992316616e36f8779f32a0606853f68df8
0x477c4030533603182240063e80b1169fac6674bdd1b207c719d5363b231a28ba
0x756973880ea6a9a08d3ab881ef7e908a4779c5c2f1d4e4aa3f01f2c2510171ec

For these particular candidates, around 20-30 random validators (running different Polkadot versions) are no-shows. Those validators aren't no-shows on any other candidate before or after, so it is a one-off for these particular candidates.

What these candidates have in common is that all of them (9 of 9) have been backed in a group that contains STKD.IO/01 (https://apps.turboflakes.io/?chain=kusama#/validator/5FKStTNJCk5J3EuuYcvJpNn8CxbkzW1J7mst3aayWCT8XrXh), which seems to be one of the nodes that enabled litep2p.

So my theory is that the presence of this node in the backing group might make others slow on availability-recovery, which results in no-shows and finality lag; however, I don't have definitive proof of where this happens.

Next

  • Confirm STKD.IO/01 is rolled back from litep2p to the default networking backend and that the occasional lag goes away.
  • Root-cause why the above happens

lexnv commented Jan 14, 2025

Confirmed STKD.IO/01 runs litep2p; a restart back to libp2p will happen soon.

@Sudo-Whodo

STKD.IO/01 was restarted with the litep2p flag around 2025-01-08 04:02:20 (at the start of the log file). It ran and emitted errors for about 25-30 minutes, which cleared up around 2025-01-08 04:30:00. I restarted the service a couple of times at the beginning. The flag was removed at 2025-01-14 14:43:47.

https://public-logs-stkd.s3.us-west-2.amazonaws.com/extracted-messages.txt

If you need any more info or have any questions, let me know.

github-merge-queue bot pushed a commit that referenced this issue Jan 15, 2025
This PR rejects inbound requests from banned peers (peers whose reputation is below the banned threshold).

This mirrors the request-response implementation from the libp2p side. I don't expect this to get triggered too often, but we'll monitor this metric.

While at it, I have registered a new inbound failure metric to gain visibility into this.

Discovered during the investigation of:
#7076 (comment)

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
github-actions bot pushed commits that referenced this issue Jan 15, 2025 (the same change, cherry picked from commit ef064a3).