(1.17) Kubelet won't reconnect to Apiserver after NIC failure (use of closed network connection) #87615
Comments
/sig node Taking a look into the code, the error happens here. The explanation is that the code assumes it is probably an EOF (IsProbableEOF), while in this case it doesn't seem to be. |
/assign @caesarxuchao |
Exactly, @caesarxuchao, so this is our question. I've tracked the error down basically by grepping for it through the code and cross-referencing with what kubelet was doing at the time (watching secrets) to get to that part. Not an advanced method, though this seems to be the exact point where the error originates. The question is: since the connection is closed, is something flagging this as the end of the watch (EOF) instead of recognizing it as an error? |
I don't have anything smarter to add, other than that we had another node fail the same way, increasing the occurrences over the last 4 days to 4. I'll try to map whether bond-disconnect events are happening on other nodes and whether kubelet recovers; it might just be bad luck on some recoveries, and not a 100% reproducible event. |
I think we are seeing this too, but we do not have bonds; we only see these networkd "carrier lost" messages for Calico. |
I have encountered this as well, with no bonds involved. Restarting the node fixes the problem, but just restarting the Kubelet service does not (all API calls fail with "Unauthorized"). |
Update: restarting Kubelet did fix the problem after enough time (1 hour?) was allowed to pass. |
I am seeing this same behavior. Ubuntu 18.04.3 LTS clean installs, cluster built with Rancher 2.3.4. I have seen this happen periodically lately, and just restarting kubelet tends to fix it for me. Last night all 3 of my worker nodes exhibited this same behavior. I corrected 2 to bring my cluster up. The third is still in this state while I'm digging around. |
We are seeing the same issue on CentOS 7, on a cluster freshly built with Rancher (1.17.2). We are using Weave. All 3 worker nodes are showing this issue. Restarting kubelet does not work for us; we have to restart the entire node. |
We are also seeing the same issue. From the logs, we found that after the problem occurred, all subsequent requests were still sent on the same connection. It seems that although the client resends the request to the apiserver, the underlying http2 library still maintains the old connection, so all subsequent requests are still sent on that connection and receive the same error. So the question is why http2 still maintains an already closed connection. Maybe the connection it maintains is indeed alive, but some intermediate connection is closed unexpectedly? |
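For context on the comment above: with HTTP/2, Go's transport multiplexes every request to the same host over a single cached TCP connection, so when that connection silently dies, every subsequent request fails the same way until the connection is evicted from the pool. Below is a minimal, self-contained Go illustration of that reuse; it is not kubelet code, and example.com is just a placeholder host standing in for the apiserver.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptrace"
)

func main() {
	// The default transport negotiates HTTP/2 with HTTPS servers and
	// multiplexes all requests to the same host over one cached TCP connection.
	client := &http.Client{}

	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) {
			fmt.Printf("reused existing connection: %v\n", info.Reused)
		},
	}

	for i := 0; i < 2; i++ {
		ctx := httptrace.WithClientTrace(context.Background(), trace)
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://example.com", nil)
		if err != nil {
			panic(err)
		}
		resp, err := client.Do(req)
		if err != nil {
			// If the cached connection is dead but the client has not noticed,
			// this is where every request keeps failing with the same error.
			fmt.Println("request error:", err)
			continue
		}
		resp.Body.Close() // the second iteration typically reports Reused: true
	}
}
```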
I have the same issue with a Raspberry Pi cluster on k8s 1.17.3, very often. Based on some older issues, I set the kube-apiserver HTTP/2 stream limit to 1000 ("--http2-max-streams-per-connection=1000"). It was fine for more than 2 weeks, but now it is starting again. |
Is it possible to rebuild kube-apiserver https://github.com/kubernetes/apiserver/blob/b214a49983bcd70ced138bd2717f78c0cff351b2/pkg/server/secure_serving.go#L50 |
Same here (Ubuntu 18.04, Kubernetes 1.17.3). |
We also observed this in two of our clusters. Not entirely sure about the root cause, but at least we were able to see this happen in clusters with very high watch counts. I was not able to reproduce it by forcing a high number of watches per kubelet, though (I started pods with 300 secrets per pod, which also resulted in 300 watches per pod in the Prometheus metrics). Also setting very low |
As a workaround, all of my nodes restart kubelet every night via a local cronjob. Now, 10 days later, I can say it works for me: I have no more "use of closed network connection" errors on my nodes. |
@sbiermann |
24 hours |
I can also confirm this issue; we are not yet on 1.17.3, currently running Ubuntu 19.10:
|
I can also confirm this on Kubernetes 1.17.4 deployed through Rancher 2.3.5 on RancherOS 1.5.5 nodes. Restarting the kubelet seems to work for me, I don't have to restart the whole node. The underlying cause for me seems to be RAM getting close to running out and kswapd0 getting up to 100% CPU usage due to that, since I forgot to set the swappiness to 0 for my Kubernetes nodes. After setting the swappiness to 0 and adding some RAM to the machines, the issue hasn't reoccurred for me yet. |
If the underlying issue was "http2 using dead connections", then restarting kubelet should fix the problem. #48670 suggested that reducing TCP_USER_TIMEOUT can mitigate the problem. I have opened golang/net#55 to add a client-side connection health check to the http2 library, but it's going to take more time to land. If restarting kubelet didn't solve the issue, then it's probably a different root cause. |
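The health check mentioned above eventually landed as the ReadIdleTimeout and PingTimeout fields on the golang.org/x/net/http2 Transport, and newer client-go versions enable it by default. Below is a rough sketch of wiring it up directly on a plain Go HTTP client; the 30s/15s values are illustrative choices, not values prescribed anywhere in this thread.

```go
package main

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newHealthCheckedTransport returns an http.Transport whose HTTP/2 connections
// send a PING frame after ReadIdleTimeout of silence and are torn down (and
// later re-dialed) if the PING is not answered within PingTimeout.
func newHealthCheckedTransport() (*http.Transport, error) {
	t1 := &http.Transport{
		TLSHandshakeTimeout: 10 * time.Second,
	}
	t2, err := http2.ConfigureTransports(t1) // enable HTTP/2 on t1
	if err != nil {
		return nil, err
	}
	t2.ReadIdleTimeout = 30 * time.Second // silence before a health-check PING is sent
	t2.PingTimeout = 15 * time.Second     // how long to wait for the PING response
	return t1, nil
}

func main() {
	t, err := newHealthCheckedTransport()
	if err != nil {
		panic(err)
	}
	client := &http.Client{Transport: t}
	_ = client // use for long-lived watch requests against the apiserver
}
```

With this in place, an HTTP/2 connection that stops delivering frames gets pinged and, if the ping goes unanswered, is torn down so the next request dials a fresh connection instead of reusing the dead one.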
I have the same issue with v1.17.2 when restarting the network, but only one of the nodes has this issue (my cluster has five nodes), and I can't reproduce it. Restarting kubelet solved the problem. How can I avoid this issue? Upgrade to the latest version, or is there another way to fix it? |
I've fixed it by running this bash script every 5 minutes:
|
Golang 1.13 has horrible bugs that might have affected MetalLB, as it's using client-go / watches; see kubernetes/kubernetes#87615. Newer client-go versions (which we are now using) have implemented the HTTP/2 health check by default, but this is a workaround, not a fix. Also, Golang 1.13 is not supported anymore. Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
As reported in [1], Go's HTTP2 client < 1.16 had some serious bugs which could result in lost connections to kube-apiserver. Worse, the client couldn't recover. In the case of CoreDNS, the loss of connectivity to kube-apiserver was not even logged.

I have validated this by adding the following rule on the node which was running the CoreDNS pod (port 6443, as the socket-lb was doing the service translation):

iptables -I FORWARD 1 -m tcp --proto tcp --src $CORE_DNS_POD_IP --dport=6443 -j DROP

After upgrading CoreDNS to a version compiled with Go >= 1.16, the pod not only logged the errors, but was also able to recover from them quickly. An example of such an error:

W1126 12:45:08.403311 1 reflector.go:436] pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: watch of *v1.Endpoints ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding

To determine the minimum version bump, I used the following:

for i in 1.7.0 1.7.1 1.8.0 1.8.1 1.8.2 1.8.3 1.8.4; do docker run --rm -ti "k8s.gcr.io/coredns/coredns:v$i" --version; done

CoreDNS-1.7.0 linux/amd64, go1.14.4, f59c03d
CoreDNS-1.7.1 linux/amd64, go1.15.2, aa82ca6
CoreDNS-1.8.0 linux/amd64, go1.15.3, 054c9ae
k8s.gcr.io/coredns/coredns:v1.8.1 not found: manifest unknown
k8s.gcr.io/coredns/coredns:v1.8.2 not found: manifest unknown
CoreDNS-1.8.3 linux/amd64, go1.16, 4293992
CoreDNS-1.8.4 linux/amd64, go1.16.4, 053c4d5

Hopefully, the bumped version will fix the CI flakes in which a service domain name is not available after 7 min; in other words, CoreDNS is not able to resolve the name, which means it hasn't received the update from kube-apiserver for the service.

[1]: kubernetes/kubernetes#87615 (comment)

Signed-off-by: Martynas Pumputis <m@lambda.lt>
@champtar Hi, may I ask which solution (or which exact K8s version) you tried? I noticed someone merged this into 1.18, and I was able to find it in 1.18.18's CHANGELOG. #100376 |
The exact version at the time was v1.19.10 |
The same issue here, on kernel 5.15, kubelet 1.18.19, kube-apiserver v1.18.6. |
We've just upgraded our production cluster to 1.17.2.
Since the update on Saturday, we've had this strange outage: after a NIC bond failure (which recovers not long after), kubelet has all of its connections broken and won't try to re-establish them unless it is manually restarted.
Here is the timeline of the last time it occurred:
01:31:16: The kernel detects a failure on the bond interface. The failure lasts for a while; eventually the interface recovers.
As expected, all watches are closed. The message is the same for all of them:
So these messages begin:
Which I'm guessing shouldn't be a problem for a while. But it never recovers. Our event happened at 01:31 AM, and we had to manually restart kubelet around 9 AM to get things back to normal.
The apiservers were up and running, all other nodes were up and running, and everything else was pretty uneventful. This node was the only one affected (today) by this problem.
Is there any way to mitigate this kind of event?
Would this be a bug?