kind [ipv6?] CI jobs failing sometimes on network not ready #131948


Open
BenTheElder opened this issue May 23, 2025 · 15 comments
Assignees
Labels
kind/failing-test · sig/k8s-infra · sig/testing · triage/accepted

Comments

@BenTheElder
Member

BenTheElder commented May 23, 2025

Which jobs are failing?

pull-kubernetes-e2e-kind-ipv6

possibly others..

Which tests are failing?

cluster creation, network is unready

Since when has it been failing?

looks like this is failing a lot more in the past day: https://go.k8s.io/triage?pr=1&job=kind&test=SynchronizedBeforeSuite

Testgrid link

No response

Reason for failure (if possible)

[ERROR] plugin/errors: 2 4527517896100725881.1499910709173596313. HINFO: dial udp 172.18.0.1:53: connect: network is unreachable

and similar network unready errors (visible in e.g. coredns logs)
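A minimal sketch for checking the CoreDNS logs directly, assuming the default kind cluster name "kind" (so the kubectl context is kind-kind):

# grab CoreDNS logs from the kind cluster and look for the unreachable-network error
kubectl --context kind-kind -n kube-system logs -l k8s-app=kube-dns --prefix \
  | grep "network is unreachable"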

Anything else we need to know?

containerd 2.1.1 was adopted a few days ago kubernetes-sigs/kind@31a79fd

That doesn't align with the failure spike though:

Image

Further back, we updated other dependencies recently-ish, but again, that doesn't align.

We haven't merged anything in kind since the 20th, but there's a failure spike in the past day or so.
So I suspect either the CI infra or kubernetes/kubernetes changes.

Relevant SIG(s)

/sig testing

@BenTheElder added the kind/failing-test label May 23, 2025
@k8s-ci-robot added the sig/testing label May 23, 2025
@k8s-ci-robot added the needs-triage label May 23, 2025
@BenTheElder
Member Author

#131883 was pretty recent on the kubernetes/kubernetes changes side of things. It doesn't appear to have flaked on this though. https://prow.k8s.io/pr-history/?org=kubernetes&repo=kubernetes&pr=131883

Other commits don't stand out.

Not sure about the infra yet, but that sounds more likely at the moment.

@BenTheElder
Member Author

https://github.com/kubernetes/test-infra/commits/master/ : nothing obvious here either?

@BenTheElder
Member Author

Or in https://github.com/kubernetes/k8s.io/commits/main/

Maybe the cluster itself. These ran in gke-prow-build-pool5 .... nodes in k8s-infra-prow-build, which are not the new nodepool experiment @upodroid and @ameukam have been working on.

I don't think we've had other changes to that cluster lately; it could've auto-upgraded, maybe.
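A minimal sketch for checking that, assuming gcloud access to the k8s-infra-prow-build project:

# report the node pool's current GKE version and whether auto-upgrade is enabled
gcloud container node-pools describe pool5-20210928124956061000000001 \
  --cluster=prow-build --region=us-central1 --project=k8s-infra-prow-build \
  --format='value(version,management.autoUpgrade)'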

@BenTheElder
Member Author

If we ignore the compat-version jobs, which have had other issues, we get a clearer picture:
https://go.k8s.io/triage?pr=1&job=kind&test=SynchronizedBeforeSuite&xjob=compatibility

Image

There's a lone failure in the alpha-beta-features job on Tuesday the 20th that isn't clearly related; the rest start from 4:00 UTC-7 (US Pacific) on the 23rd.

@BenTheElder
Member Author

Upgrade logs might align:

{
  "insertId": "1m0nhfte82p86",
  "jsonPayload": {
    "operation": "operation-1747999572322-1bde5da1-4f3f-4340-9c30-ad42e3f6cdf2",
    "@type": "type.googleapis.com/google.container.v1beta1.UpgradeEvent",
    "resource": "projects/k8s-infra-prow-build/locations/us-central1/clusters/prow-build/nodePools/pool5-20210928124956061000000001",
    "currentVersion": "1.32.2-gke.1297002",
    "operationStartTime": "2025-05-23T11:26:12.322248554Z",
    "resourceType": "NODE_POOL",
    "targetVersion": "1.32.3-gke.1785003"
  },
  "resource": {
    "type": "gke_nodepool",
    "labels": {
      "location": "us-central1",
      "nodepool_name": "pool5-20210928124956061000000001",
      "cluster_name": "prow-build",
      "project_id": "k8s-infra-prow-build"
    }
  },
  "timestamp": "2025-05-23T11:26:25.010880914Z",
  "severity": "NOTICE",
  "logName": "projects/k8s-infra-prow-build/logs/container.googleapis.com%2Fnotifications",
  "receiveTimestamp": "2025-05-23T11:26:25.026034909Z"
}
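A sketch for pulling these notifications directly from Cloud Logging, assuming access to the project (the filter fields are the ones visible in the payload above):

# list recent GKE upgrade notifications for the prow-build cluster
gcloud logging read \
  'logName="projects/k8s-infra-prow-build/logs/container.googleapis.com%2Fnotifications" AND jsonPayload."@type"="type.googleapis.com/google.container.v1beta1.UpgradeEvent"' \
  --project=k8s-infra-prow-build --limit=10 \
  --format='value(timestamp,jsonPayload.currentVersion,jsonPayload.targetVersion)'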

@BenTheElder
Member Author

I think this is a kernel issue with the node pool.

SIG K8s Infra is looking to migrate to COS + cgroup v2 + C4 VMs (this is running on Ubuntu + cgroup v1 + N1), but we were still testing this.

Looking into a CI node pool downgrade/upgrade.
/assign

@BenTheElder
Member Author

/triage accepted
/sig k8s-infra

@k8s-ci-robot added the triage/accepted and sig/k8s-infra labels and removed the needs-triage label May 23, 2025
@BenTheElder
Member Author

gcloud container node-pools rollback pool5-20210928124956061000000001 --cluster=prow-build --project=k8s-infra-prow-build --region=us-central1 is currently running.

@BenTheElder
Member Author

BenTheElder commented May 23, 2025

Specifically, there appear to be netfilter UDP bug(s) that affected Ubuntu and COS (and others; it's an upstream kernel issue). COS has a patched release already available, but I'm not sure Ubuntu does. AFAICT the upstream Ubuntu issue tracking one of the kernel reports is still open.

EDIT: There's a workaround available for the known impact on GKE clusters with intranode visibility; the kind angle is new ...

@BenTheElder
Member Author

BenTheElder commented May 23, 2025

https://bugzilla.netfilter.org/show_bug.cgi?id=1795
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2109889
https://bugzilla.netfilter.org/show_bug.cgi?id=1797
EDIT: this is possibly different. There was a known issue impacting this version that pointed to these; I haven't had a chance to root-cause it, as I'm focused on getting CI green in between other things right now ...

@BenTheElder changed the title from "pull-kubernetes-e2e-kind* failing sometimes on network not ready" to "kind CI jobs failing sometimes on network not ready" May 23, 2025
@BenTheElder
Member Author

This is impacting ~all kind e2e jobs.

If you see it fail at SynchronizedBeforeSuite, it's probably this; the logs will have an error like:

[ERROR] plugin/errors: 2 7730850699321325609.2949657682776755114. HINFO: dial udp 172.18.0.1:53: connect: network is unreachable
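If you're reproducing locally rather than reading CI artifacts, a minimal sketch for finding it in the exported cluster logs:

# export the kind cluster logs and search them for the CoreDNS error
kind export logs ./kind-logs
grep -r "network is unreachable" ./kind-logs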

@BenTheElder
Member Author

BenTheElder commented May 23, 2025

NOTE: we're heading into a 3-day weekend here in the US.

I think this might be IPv6-only; digging through more of the failures.

The kernel version upgraded from 6.8.0-1019 to 6.8.0-1022.
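A minimal sketch for checking the running kernel per node, assuming kubectl access to the build cluster (kernel and OS image are reported in node status):

# show kernel version and OS image for every node
kubectl get nodes -o custom-columns='NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion,OS-IMAGE:.status.nodeInfo.osImage'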

@BenTheElder changed the title from "kind CI jobs failing sometimes on network not ready" to "kind [ipv6?] CI jobs failing sometimes on network not ready" May 23, 2025
@BenTheElder
Member Author

Tentatively the COS + C4 + cgroup v2 nodepool (pool6....) is good. https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/126563/pull-kubernetes-e2e-kind-ipv6/1926052312141271040

To migrate a job:

  1. Set the taint and tolerations as in canary job followups test-infra#34841 (this will run it on the nodepool)
  2. Drop the preset-kind-volume-mounts label as in canary job followups test-infra#34841 (we really don't want to be bind-mounting cgroups anymore; we never should have)

The operation to roll back the main (pool5...) nodepool upgrade is still pending; some capacity issues are slowing it down, so the pool is only partially rolled back.

This can be checked with:
gcloud container operations describe operation-1748028469601-078a0d5e-5797-4c6c-a6cb-6e3ac98b04d4 --region=us-central1 --project=k8s-infra-prow-build
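To see everything still in flight on the cluster rather than a single operation by ID, a sketch assuming the same project access:

# list operations that are still running against the cluster
gcloud container operations list \
  --region=us-central1 --project=k8s-infra-prow-build --filter='status=RUNNING'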

@BenTheElder
Member Author

The job updated above (pull-kubernetes-e2e-kind-ipv6) seems to be working reliably.

https://prow.k8s.io/job-history/gs/kubernetes-ci-logs/pr-logs/directory/pull-kubernetes-e2e-kind-ipv6








