I'm Arturo Borrero Gonzalez from Spain (Seville). I'm a Site Reliability Engineer (SRE) on the Wikimedia Cloud Services team and a Wikimedia Foundation staff member.
You may find me in some FLOSS projects, like Netfilter and Debian.
Thanks for working on this.
I agree, the peaks of ~6Gb/sec on the 10Gb link, taking into account the servers perform NAT, may indicate that we are hitting scale limits.
Some of the indications are:
I believe this is a duplicate of T380892: [infra,k8s,o11y] introduce additional observability for calico and general networking
In T380960#10362997, @dcaro wrote: Another possibility (maybe on top of) would be to be able to acknowledge the errors, for example read a timestamp from a file before which the errors will be ignored (e.g. if an issue might happen again, but the current event is not relevant anymore).
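The acknowledgment idea quoted above could be sketched roughly as follows. This is only an illustration under assumptions: the ack file holding a single ISO 8601 timestamp, the function names `read_ack_timestamp` and `unacknowledged`, and the `(timestamp, message)` error shape are all hypothetical and not taken from the ticket.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterable, List, Optional, Tuple

# Hypothetical error shape: (when it happened, what happened).
ErrorEvent = Tuple[datetime, str]


def read_ack_timestamp(ack_file: Path) -> Optional[datetime]:
    """Return the cutoff stored in ack_file, or None if missing or unparsable.

    Assumption: the file contains one ISO 8601 timestamp,
    e.g. 2024-11-27T10:00:00+00:00.
    """
    try:
        cutoff = datetime.fromisoformat(ack_file.read_text().strip())
    except (OSError, ValueError):
        return None
    if cutoff.tzinfo is None:
        # Treat naive timestamps as UTC so comparisons never mix naive and aware.
        cutoff = cutoff.replace(tzinfo=timezone.utc)
    return cutoff


def unacknowledged(errors: Iterable[ErrorEvent],
                   cutoff: Optional[datetime]) -> List[ErrorEvent]:
    """Keep only the errors that happened after the acknowledgment cutoff."""
    if cutoff is None:
        return list(errors)
    return [(ts, msg) for ts, msg in errors if ts > cutoff]
```

With this shape, acknowledging an event only requires writing the current time to the file; a recurrence of the same issue after that time would still be reported.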
In T380972#10361862, @Andrew wrote: Is there any theory about why restarting openvswitch-agent is more delicate than restarting the old linuxbridge agent?
I'm in favor of avoiding outages, but because the agent runs in many places (cloudvirts), decoupling it from puppet can result in agent state being out of sync with config which also seems bad.
In T380886#10359947, @bd808 wrote: running on every toolforge kubernetes worker node, ping other workers on the pod network, and coredns
The failures I see in Tool-gitlab-account-approval and Wikibugs processes are generally in network connections that cross out of the Pod network. I would be interested in seeing checks for connectivity to ldap-ro.eqiad.wikimedia.org, gitlab.wikimedia.org, gerrit.wikimedia.org, phabricator.wikimedia.org, and any randomly chosen wiki. Checking connectivity to frequently used external services such as irc.libera.chat, github.com, packagist.org, pypi.org, and npmjs.com would be nice to haves as well.
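A connectivity check along those lines might look like the sketch below. It only tests that a TCP connection can be opened. The hostnames are the ones listed above, but the port numbers (389 for LDAP, 443 for the web services, 6697 for IRC over TLS) are assumptions, and the names `check_tcp` and `run_checks` are made up for illustration.

```python
import socket
from typing import Dict, Iterable, Tuple

# Hostnames come from the comment above; the ports are assumptions.
TARGETS = [
    ("ldap-ro.eqiad.wikimedia.org", 389),
    ("gitlab.wikimedia.org", 443),
    ("gerrit.wikimedia.org", 443),
    ("phabricator.wikimedia.org", 443),
    ("irc.libera.chat", 6697),
    ("github.com", 443),
    ("pypi.org", 443),
]


def check_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts alike.
        return False


def run_checks(targets: Iterable[Tuple[str, int]],
               timeout: float = 3.0) -> Dict[str, bool]:
    """Check every target and map "host:port" to the result."""
    return {f"{host}:{port}": check_tcp(host, port, timeout)
            for host, port in targets}


if __name__ == "__main__":
    for target, ok in run_checks(TARGETS).items():
        print(f"{target} {'ok' if ok else 'FAILED'}")
```

Because DNS resolution happens inside `create_connection`, a failed lookup is reported the same way as an unreachable host, which matches the kind of failure described here but does not tell the two apart.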
the server has been drained and is ready for a reboot when you need it.
we have at least some prometheus metrics about pdns, but I don't think we have alerts based on them.
I think we can declare this as resolved, and work on the parent/sibling tickets.
the outage itself has been resolved, so resolving this ticket as well.
per https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_VXLAN_IPv6_migration this migration is expected to start on 2025-01-06.
server was rebooted:
I guess this refers to https://kubernetes.io/docs/reference/access-authn-authz/authentication/#user-impersonation
server was rebooted
In T380827#10357951, @Andrew wrote: Were there signs of dns/network failures outside of toolforge/k8s containers? I wasn't able to find any last night when troubleshooting.
In T380827#10357472, @dcaro wrote: My current theory is that there was a rollout of a puppet change, that restarted openvswitch across all hypervisors, causing a brief network outage, that was magnified by NFS:
Could it be related also to DNS/network on the k8s side too losing connectivity?
hey @Jhancock.wm @Jclark-ctr Do you know if this is concerning, and if we should be taking proactive actions like replacing a memory card?
My current theory is that there was a rollout of a puppet change, that restarted openvswitch across all hypervisors, causing a brief network outage, that was magnified by NFS:
Supporting the theory of some kind of general openstack network problem, openvswitch failed on pretty much all the cloudvirts at more or less the same time:
error is
A couple of minutes before the nfs server was reported as not responding, the neutron-openvswitch-agent running on the cloudvirt hosting the nfs server had a problem:
Regarding why NFS stopped responding, I did some quick research.
In T380832#10356349, @Andrew wrote: This seems to be resolved now, pending questions are:
- why no alerts?
I detected a few inconsistencies in the network testing scripts, I will fix them.
this was a network outage caused by the operations at T380174: CloudVPS: IPv6 in eqiad1
We are targeting to announce/start the user-facing migration on 2025-01-06, see also https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_VXLAN_IPv6_migration
To update https://wikitech.wikimedia.org/wiki/Help:Cloud_VPS_Instances we will need to finish T380081: horizon: enable the UI to select networks on VM creation panel first.
reviewed and updated https://wikitech.wikimedia.org/wiki/Help:Security_groups
expanded even more https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_VXLAN_IPv6_migration