
kind [ipv6?] CI jobs failing sometimes on network not ready #131948


Open
BenTheElder opened this issue May 23, 2025 · 16 comments
Labels

kind/failing-test: Categorizes issue or PR as related to a consistently or frequently failing test.
sig/k8s-infra: Categorizes an issue or PR as relevant to SIG K8s Infra.
sig/testing: Categorizes an issue or PR as relevant to SIG Testing.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@BenTheElder (Member) commented May 23, 2025

Which jobs are failing?

pull-kubernetes-e2e-kind-ipv6

possibly others..

Which tests are failing?

cluster creation, network is unready

Since when has it been failing?

looks like this is failing a lot more in the past day: https://go.k8s.io/triage?pr=1&job=kind&test=SynchronizedBeforeSuite

Testgrid link

No response

Reason for failure (if possible)

[ERROR] plugin/errors: 2 4527517896100725881.1499910709173596313. HINFO: dial udp 172.18.0.1:53: connect: network is unreachable

and similar network unready errors (visible in e.g. coredns logs)

Anything else we need to know?

containerd 2.1.1 was adopted a few days ago (kubernetes-sigs/kind@31a79fd).

That doesn't align with the failure spike though:

Image

Further back, we updated other dependencies recently-ish, but again, that doesn't align.

We haven't merged anything in kind since the 20th, but there's a failure spike in the past day or so.
So I suspect either the CI infra or kubernetes/kubernetes changes.

Relevant SIG(s)

/sig testing

@BenTheElder added the kind/failing-test label May 23, 2025
@k8s-ci-robot added the sig/testing label May 23, 2025
@k8s-ci-robot added the needs-triage label May 23, 2025
@BenTheElder (Member Author)

#131883 was pretty recent on the kubernetes/kubernetes changes side of things. It doesn't appear to have flaked on this though. https://prow.k8s.io/pr-history/?org=kubernetes&repo=kubernetes&pr=131883

Other commits don't stand out.

Not sure about the infra yet, but that sounds more likely at the moment.

@BenTheElder (Member Author)

https://github.com/kubernetes/test-infra/commits/master/ nothing obvious here?

@BenTheElder (Member Author)

Or in https://github.com/kubernetes/k8s.io/commits/main/

Maybe the cluster itself. These ran on gke-prow-build-pool5 .... nodes in k8s-infra-prow-build, which are not the new nodepool experiment @upodroid and @ameukam have been working on.

I don't think we've had other changes to that cluster lately; it could have auto-upgraded, maybe.

@BenTheElder (Member Author)

If we ignore the compat-version jobs, which have had other issues, we get a clearer picture:
https://go.k8s.io/triage?pr=1&job=kind&test=SynchronizedBeforeSuite&xjob=compatibility

Image

There's a lone failure in the alpha-beta-features job on Tuesday the 20th that isn't clearly related; the rest start from 4:00 UTC-7 (US Pacific) on the 23rd onward.

@BenTheElder (Member Author)

Upgrade logs might align:

{
  "insertId": "1m0nhfte82p86",
  "jsonPayload": {
    "operation": "operation-1747999572322-1bde5da1-4f3f-4340-9c30-ad42e3f6cdf2",
    "@type": "type.googleapis.com/google.container.v1beta1.UpgradeEvent",
    "resource": "projects/k8s-infra-prow-build/locations/us-central1/clusters/prow-build/nodePools/pool5-20210928124956061000000001",
    "currentVersion": "1.32.2-gke.1297002",
    "operationStartTime": "2025-05-23T11:26:12.322248554Z",
    "resourceType": "NODE_POOL",
    "targetVersion": "1.32.3-gke.1785003"
  },
  "resource": {
    "type": "gke_nodepool",
    "labels": {
      "location": "us-central1",
      "nodepool_name": "pool5-20210928124956061000000001",
      "cluster_name": "prow-build",
      "project_id": "k8s-infra-prow-build"
    }
  },
  "timestamp": "2025-05-23T11:26:25.010880914Z",
  "severity": "NOTICE",
  "logName": "projects/k8s-infra-prow-build/logs/container.googleapis.com%2Fnotifications",
  "receiveTimestamp": "2025-05-23T11:26:25.026034909Z"
}

@BenTheElder (Member Author)

I think this is a kernel issue with the node pool.

SIG K8s Infra is looking to migrate to COS + cgroup v2 + C4 VMs (this pool is running Ubuntu + cgroup v1 + N1), but we were still testing this.

Looking into downgrading/upgrading the CI node pool.
/assign

@BenTheElder (Member Author)

/triage accepted
/sig k8s-infra

@k8s-ci-robot added the triage/accepted and sig/k8s-infra labels and removed the needs-triage label May 23, 2025
@BenTheElder (Member Author)

gcloud container node-pools rollback pool5-20210928124956061000000001 --cluster=prow-build --project=k8s-infra-prow-build --region=us-central1 is currently running.

@BenTheElder (Member Author) commented May 23, 2025

Specifically, there appear to be netfilter UDP bug(s) that affected Ubuntu and COS (and others; it's an upstream kernel issue). COS already has a patched release available, but I'm not sure Ubuntu does. AFAICT the upstream issue in Ubuntu tracking one of the kernel reports is still open.

EDIT: There's a workaround available for the known impact to GKE clusters with intranode visibility; the kind angle is new ...

@BenTheElder (Member Author) commented May 23, 2025

https://bugzilla.netfilter.org/show_bug.cgi?id=1795
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2109889
https://bugzilla.netfilter.org/show_bug.cgi?id=1797
EDIT: this is possibly different; there was a known issue impacting this version that pointed to these. I haven't had a chance to root-cause it, as I'm focused on getting CI green in between other things right now ...

@BenTheElder changed the title from "pull-kubernetes-e2e-kind* failing sometimes on network not ready" to "kind CI jobs failing sometimes on network not ready" May 23, 2025
@BenTheElder (Member Author)

This is impacting ~all kind e2e jobs.

If you see a job fail at SynchronizedBeforeSuite, it's probably this; the logs will have an error like:

[ERROR] plugin/errors: 2 7730850699321325609.2949657682776755114. HINFO: dial udp 172.18.0.1:53: connect: network is unreachable
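For context, that error string comes from Go's net package: CoreDNS dials UDP to the upstream resolver from /etc/resolv.conf (here 172.18.0.1, presumably the docker network gateway) and connect() fails because there is no route. A minimal probe, just a sketch with the address hard-coded for illustration and not taken from the issue, reproduces the same message when run somewhere without a route to that address:

// Hypothetical probe, not from the issue: net.Dial for UDP performs a
// connect() syscall, so a missing route to the upstream resolver surfaces
// as exactly this error string.
package main

import (
	"fmt"
	"net"
)

func main() {
	conn, err := net.Dial("udp", "172.18.0.1:53") // upstream resolver seen in the CoreDNS logs
	if err != nil {
		// On an affected node this prints something like:
		// dial udp 172.18.0.1:53: connect: network is unreachable
		fmt.Println(err)
		return
	}
	defer conn.Close()
	fmt.Println("route to 172.18.0.1:53 looks fine")
}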

@BenTheElder (Member Author) commented May 23, 2025

NOTE: we're heading into a 3-day weekend here in the US.

I think this might be IPv6 only; digging through more of the failures.

The kernel version upgraded from 6.8.0-1019 to 6.8.0-1022.

@BenTheElder changed the title from "kind CI jobs failing sometimes on network not ready" to "kind [ipv6?] CI jobs failing sometimes on network not ready" May 23, 2025
@BenTheElder (Member Author)

Tentatively the COS + C4 + cgroup v2 nodepool (pool6....) is good. https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/126563/pull-kubernetes-e2e-kind-ipv6/1926052312141271040

To migrate a job:

  1. Set the taint and tolerations as in canary job followups test-infra#34841 (this will run it on the nodepool)
  2. Drop the preset-kind-volume-mounts label as in canary job followups test-infra#34841 (we really don't want to be bind-mounting cgroups anymore; we never should have been)

The operation to roll back the main (pool5...) nodepool upgrade is still pending; capacity issues are slowing it down, so the pool is only partially rolled back.

This can be checked like:
gcloud container operations describe operation-1748028469601-078a0d5e-5797-4c6c-a6cb-6e3ac98b04d4 --region=us-central1 --project=k8s-infra-prow-build

@BenTheElder (Member Author)

The job updated above (pull-kubernetes-e2e-kind-ipv6) seems to be working reliably.

https://prow.k8s.io/job-history/gs/kubernetes-ci-logs/pr-logs/directory/pull-kubernetes-e2e-kind-ipv6

@aojea (Member) commented May 25, 2025

The job indicated: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/131869/pull-kubernetes-e2e-kind-ipv6/1925942985422278656

https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/131869/pull-kubernetes-e2e-kind-ipv6/1925942985422278656/artifacts/kind-control-plane/pods/kube-system_kube-proxy-jfpnz_622983a3-cbb6-47d3-b8a2-640b56839328/kube-proxy/0.log

2025-05-23T16:09:53.05128622Z stderr F E0523 16:09:53.051168       1 proxier.go:1565] "Failed to execute iptables-restore" err=<
2025-05-23T16:09:53.051310201Z stderr F 	exit status 2: Ignoring deprecated --wait-interval option.
2025-05-23T16:09:53.051314559Z stderr F 	Warning: Extension MARK revision 0 not supported, missing kernel module?
2025-05-23T16:09:53.051317661Z stderr F 	ip6tables-restore v1.8.9 (legacy): unknown option "--xor-mark"
2025-05-23T16:09:53.05132104Z stderr F 	Error occurred at line: 22
2025-05-23T16:09:53.05132394Z stderr F 	Try `ip6tables-restore -h' or 'ip6tables-restore --help' for more information.
2025-05-23T16:09:53.051337798Z stderr F  > ipFamily="IPv6"
2025-05-23T16

There are other failures, checking https://testgrid.k8s.io/sig-release-1.33-blocking#kind-1.33-parallel

https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-kind-e2e-parallel-1-33/1925908531593089024

Kernel Version: 6.8.0-1022-gke

2025-05-23T14:03:03.718610318Z stderr F I0523 14:03:03.718387       1 iptables.go:452] "Running" command="ip6tables-restore" arguments=["-w","5","-W","100000","--noflush","--counters"]
2025-05-23T14:03:03.735525109Z stderr F E0523 14:03:03.735334       1 proxier.go:1553] "Failed to execute iptables-restore" err=<
2025-05-23T14:03:03.735618813Z stderr F 	exit status 2: Ignoring deprecated --wait-interval option.
2025-05-23T14:03:03.735627923Z stderr F 	Warning: Extension MARK revision 0 not supported, missing kernel module?
2025-05-23T14:03:03.735633917Z stderr F 	ip6tables-restore v1.8.9 (legacy): unknown option "--xor-mark"
2025-05-23T14:03:03.73564045Z stderr F 	Error occurred at line: 17
2025-05-23T14:03:03.73564673Z stderr F 	Try `ip6tables-restore -h' or 'ip6tables-restore --help' for more information.
2025-05-23T14:03:03.735654031Z stderr F  > ipFamily="IPv6"

The --wait-interval flag was removed in iptables v1.8.8, it seems: https://git.netfilter.org/iptables/commit/?id=07e2107ef0cbc1b81864c3c0f0ef297a9dfff44d

/cc @danwinship @aroradaman
I think we should extend the code to avoid adding it for 1.8.8+

func getIPTablesWaitFlag(version *utilversion.Version) []string {
	switch {
	case version.AtLeast(WaitIntervalMinVersion):
		return []string{WaitString, WaitSecondsValue, WaitIntervalString, WaitIntervalUsecondsValue}
	case version.AtLeast(WaitSecondsMinVersion):
		return []string{WaitString, WaitSecondsValue}
	case version.AtLeast(WaitMinVersion):
		return []string{WaitString}
	default:
		return nil
	}
}
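
A minimal sketch of what that gating might look like, assuming a hypothetical new constant WaitIntervalDeprecatedMinVersion pinned to 1.8.8; the names and the actual fix may differ:

// Hypothetical sketch only: stop passing -W/--wait-interval once iptables
// deprecates it. WaitIntervalDeprecatedMinVersion is an assumed new
// constant (e.g. parsed from "1.8.8"), not existing code.
func getIPTablesWaitFlag(version *utilversion.Version) []string {
	switch {
	case version.AtLeast(WaitIntervalDeprecatedMinVersion):
		// iptables >= 1.8.8 warns about (and newer removes) --wait-interval,
		// so fall back to just -w <seconds>.
		return []string{WaitString, WaitSecondsValue}
	case version.AtLeast(WaitIntervalMinVersion):
		return []string{WaitString, WaitSecondsValue, WaitIntervalString, WaitIntervalUsecondsValue}
	case version.AtLeast(WaitSecondsMinVersion):
		return []string{WaitString, WaitSecondsValue}
	case version.AtLeast(WaitMinVersion):
		return []string{WaitString}
	default:
		return nil
	}
}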

The --xor-mark thing is https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2101914, which causes a lot of issues in GitHub Actions (actions/runner-images#11985). @BenTheElder @ameukam, we need to avoid those kernels.
