Windows node PLEG not healthy during load test with 1 pod/s rate #88153
Comments
We saw the same issue in different pod scaling scenarios. The probability of hitting "PLEG not healthy" correlated with the scaling parameters ([number of pods]/[sec] or [number of pods]/[scaling step]) and with the Windows instance flavor (more vCPUs == better node "durability"). Cloud providers in use: AWS and Azure.
Yes, we use Docker. I'm wondering if there is anything we can do to improve it. Currently we use a 1 pod/33s scaling speed for Windows, compared to 5 pods/s for Linux.
I'm seeing this on Windows nodes frequently. RDP/SSH gives permission denied; a node restart seems to be the only workaround.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I've been testing recently with containerd to see if the issue still reproduces. Based on that testing, the errors are different, but we still face the same underlying problem: high pod throughput is not possible on Windows nodes. Here is a recent test on the latest 1.23 alpha with containerd 1.5.4 (GCE provider):
Kernel Version: 10.0 17763 (17763.1.amd64fre.rs5_release.180914-1434)
I created 40 pods against each of the two Windows nodes (running the tests serially), with a 10-second delay between creations in the first test and a 5-second delay in the second. An initial deployment of the same pod spec is also done first so the image is cached on the node.
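For illustration, here is a hypothetical client-go helper for timing pod startup in a test like this; the package, function name, and polling interval are my own and not taken from the actual test harness:

```go
// Hypothetical helper for timing pod startup in a test like the one above.
// The package, function, and interval are illustrative, not from the real harness.
package loadtest

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podStartupLatency polls a pod until it reports Ready (or the timeout expires)
// and returns how long that took since the pod's CreationTimestamp.
func podStartupLatency(client kubernetes.Interface, ns, name string, timeout time.Duration) (time.Duration, error) {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		pod, err := client.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return 0, err
		}
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				return cond.LastTransitionTime.Sub(pod.CreationTimestamp.Time), nil
			}
		}
		time.Sleep(2 * time.Second) // poll interval; arbitrary for this sketch
	}
	return 0, fmt.Errorf("pod %s/%s not Ready after %v", ns, name, timeout)
}
```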
In the first test, all 40 pods came up and ran normally, though with a noticeable delay towards the last set of pods:
In the second test, the first ~25-30 pods/containers were created and started with no issues, but the remaining ones went into a loop of CreateContainerError/RunContainerError before finally succeeding after 2-3 attempts. It took ~12 minutes for the last pod to start up successfully (measured from its initial creation). The failure in the last state & events is as follows:
You can see how startup begins to get significantly delayed at around ~20 pods; the delay then keeps building up and pods start failing:
The CPU on this second node stayed under 50%. In the kubelet logs, I see these repetitive errors:
So what's the recommended throttling on Windows nodes to avoid big delays and failures - should we have guidelines? And is it possible to get rid of this limitation or improve it further, maybe by modifying the scheduler in the future? Earlier tests were done on GKE clusters (v1.21 + containerd 1.5.2).
/reopen
@ibabou: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hey @jsturtevant @immuzz, I've reopened the issue as we discussed. I've also included a repro on the latest k8s + containerd 1.5.4 - the results were slightly better, but I still saw similar context deadline errors.
Let's build an e2e test for the pseudocode in #88153 (comment).
/remove-lifecycle rotten
/cc @claudiubelu
Is there currently a way to limit the new pods per minute/second?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Any update on this, @YangLu1031? Did you solve the problem?
What happened:
When running a Windows pod startup latency load test with a pod creation rate of 1 pod/s, the kubelet on the node becomes NotReady with the error message:
PLEG is not healthy: pleg was last seen active 3m8.068354s ago; threshold is 3m0s
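For context, this message comes from the kubelet's PLEG (Pod Lifecycle Event Generator) health check, which compares the time since the last successful relist (the loop that queries the container runtime for all container states) against a fixed threshold. Below is a simplified sketch of that logic, modeled loosely on the kubelet's generic PLEG rather than the exact source; when the runtime's list/inspect calls slow down under load, the relist falls behind the 3-minute threshold and the node goes NotReady.

```go
// Simplified sketch of the PLEG health check behind the error above
// (modeled on kubelet's generic PLEG; type and field names are illustrative).
package main

import (
	"fmt"
	"time"
)

// relistThreshold mirrors the "threshold is 3m0s" part of the message: if the
// relist loop has not completed within this window, the check fails.
const relistThreshold = 3 * time.Minute

type pleg struct {
	lastRelist time.Time // updated each time a relist of all containers finishes
}

func (p *pleg) healthy() (bool, error) {
	elapsed := time.Since(p.lastRelist)
	if elapsed > relistThreshold {
		return false, fmt.Errorf("pleg was last seen active %v ago; threshold is %v", elapsed, relistThreshold)
	}
	return true, nil
}

func main() {
	// Simulate a relist that finished 3m8s ago, matching the symptom in this issue.
	p := &pleg{lastRelist: time.Now().Add(-(3*time.Minute + 8*time.Second))}
	if ok, err := p.healthy(); !ok {
		fmt.Println("PLEG is not healthy:", err)
	}
}
```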
What you expected to happen:
For comparison, the Linux node load test still works fine at 5 pods/s.
Is there anything we can do to improve the performance on Windows nodes?
How to reproduce it (as minimally and precisely as possible):
For simplicity, I created a script to reproduce it:
https://gist.github.com/YangLu1031/a318ad5e92ae1e61102801fdb9109788
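The gist above is the actual repro; for reference, a minimal client-go sketch of the same idea might look like the following. The pod names, namespace, image, and kubeconfig handling are placeholders, not values from the gist.

```go
// Minimal sketch of a rate-limited pod creation loop, similar in spirit to the
// linked gist. Names, namespace, image, and kubeconfig path are placeholders.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	const (
		podCount = 40
		interval = 1 * time.Second // 1 pod/s; the workaround mentioned above was 1 pod/33s
	)

	for i := 0; i < podCount; i++ {
		pod := &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("win-load-%d", i)},
			Spec: corev1.PodSpec{
				NodeSelector: map[string]string{"kubernetes.io/os": "windows"},
				Containers: []corev1.Container{{
					Name:  "pause",
					Image: "mcr.microsoft.com/oss/kubernetes/pause:3.6", // placeholder image
				}},
			},
		}
		if _, err := client.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
			fmt.Printf("create %s failed: %v\n", pod.Name, err)
		}
		time.Sleep(interval) // throttle creations so the node's PLEG can keep up
	}
}
```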
Anything else we need to know?:
#45419
Scenarios in which this failure happens:
It seems there are situations in our current GKE Windows clusters where there's a risk of this issue happening and then causing cascading / continuous node failures.
Steps to reproduce these cascading node failures through a Deployment & ReplicationController:
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 2        # how many pods can be added at a time during the rolling update
    maxUnavailable: 0  # how many pods can be unavailable during the rolling update
/sig windows
/cc @PatrickLang @dineshgovindasamy @pjh @yliaog @ddebroy