Increase scheduling backoff queue max duration and attach specific error message to unschedulable pods #81214
Comments
/sig scheduling
/cc
For lower priority pods getting stuck behind high priority pods, what value are you thinking of for the backoff queue max duration? Thanks
Unfortunately, increasing the max back-off increases scheduling latency in some other scenarios:
The existing 10 seconds is a relatively long time. Given that the scheduler can schedule 100 pods/s in 5K-node clusters, it can evaluate/schedule 1000 pods during that back-off period. Do you often have more than 1000 pending pods in the cluster? One solution to this problem is to make the max back-off configurable. It can be 10 seconds by default, but users can increase it if that is not enough for their use case.
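For illustration, here is a minimal sketch of the per-pod exponential backoff idea with a configurable cap. The names (`backoffCalculator`, `initialDuration`, `maxDuration`) are hypothetical and not the scheduler's actual API; it is only meant to show what "make max back-off configurable" would mean in practice.

```go
package main

import (
	"fmt"
	"time"
)

// backoffCalculator doubles a pod's backoff on every failed scheduling
// attempt but never exceeds a configurable maximum. Today the cap is
// effectively hard-coded to 10s; exposing something like maxDuration is
// the "configurable" idea discussed above.
type backoffCalculator struct {
	initialDuration time.Duration
	maxDuration     time.Duration
}

// durationFor returns the backoff for a pod that has already failed
// `attempts` scheduling attempts.
func (b backoffCalculator) durationFor(attempts int) time.Duration {
	d := b.initialDuration
	for i := 1; i < attempts; i++ {
		d *= 2
		if d >= b.maxDuration {
			return b.maxDuration
		}
	}
	return d
}

func main() {
	// Compare the default cap of 10s with a user-configured cap of 1m.
	for _, max := range []time.Duration{10 * time.Second, time.Minute} {
		b := backoffCalculator{initialDuration: time.Second, maxDuration: max}
		fmt.Println(max, "->", b.durationFor(1), b.durationFor(4), b.durationFor(10))
	}
}
```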
SchedulingQueue is created by NewConfigFactory.
@bsalamat Thanks for your reply. BTW: can we attach an error message to pods when we add them to the unschedulable queue? That way, when specific events come in, we only need to move the pods that may be affected instead of blindly moving everything from the unschedulable queue to the active (backoff) queue.
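As a rough illustration of that idea, the entry stored in the unschedulable queue could carry the failure reason (for example the name of the rejecting predicate) alongside the pod. The types below (`unschedulableEntry`, `unschedulableQueue`) are hypothetical sketches, not the scheduler's real data structures.

```go
package main

import "fmt"

// unschedulableEntry is a hypothetical record kept in the unschedulable
// queue: the pod name plus the predicate(s) that rejected it. With this
// information, an event only needs to re-queue the pods whose recorded
// failure could actually be resolved by that event.
type unschedulableEntry struct {
	PodName          string
	FailedPredicates []string // e.g. "PodFitsResources", "MatchInterPodAffinity"
}

type unschedulableQueue struct {
	entries []unschedulableEntry
}

// add records why the pod was unschedulable at the moment it is queued.
func (q *unschedulableQueue) add(pod string, failedPredicates ...string) {
	q.entries = append(q.entries, unschedulableEntry{PodName: pod, FailedPredicates: failedPredicates})
}

func main() {
	q := &unschedulableQueue{}
	q.add("low-prio-1", "PodFitsResources")
	q.add("svc-backend-1", "CheckServiceAffinity")
	fmt.Printf("%+v\n", q.entries)
}
```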
What is described in the last paragraph is more than attaching an error message to Pods. It implies type information that distinguishes the Pods.
We could start with making the scheduling backoff queue max duration configurable.
/assign
We have a daily performance test that builds the latest version of K8s from HEAD, runs it in a real cluster, and measures the performance of various components, including the scheduler. In that setup, the scheduler -> API server QPS is rate limited at 100 qps. You can see the graph for a 5K node cluster here. As the graph shows, the scheduler can schedule 100 pods/s. In fact, it may be able to schedule more than 100 pods/s if it is not rate limited.
While this may seem like a good optimization, it is error prone and hard to maintain. The scheduler code changes frequently and various predicates are added. For each one of these changes, we would need to revisit which pods should be moved to the active queue at every event. This becomes even harder to maintain after users add their own predicates.
Even if we forget to filter pods from the unschedulable queue for new predicates, the worst case is moving all pods to the active queue, just like what it does today.
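One way to read that argument as code (everything here is a hypothetical sketch, not the scheduler's actual event handling): keep a map from cluster events to the predicates they might un-block, and conservatively fall back to moving every pod when an event, or a predicate nobody remembered to register, has no entry in the map.

```go
package main

import "fmt"

// queuedPod is a hypothetical unschedulable-queue entry that remembers
// which predicate rejected the pod.
type queuedPod struct {
	Name            string
	FailedPredicate string
}

// eventToPredicates maps a cluster event to the predicates whose outcome
// that event might change. Predicates missing from this map simply never
// match, which triggers the move-all fallback below.
var eventToPredicates = map[string][]string{
	"ServiceAdd": {"CheckServiceAffinity"},
	"PVCAdd":     {"CheckVolumeBinding"},
}

// podsToMove returns the pods worth retrying for the given event. If the
// event is unknown, it returns everything, which is exactly what
// MoveAllToActiveQueue does today.
func podsToMove(event string, queue []queuedPod) []queuedPod {
	preds, ok := eventToPredicates[event]
	if !ok {
		return queue // worst case: behave like today and move all pods
	}
	var moved []queuedPod
	for _, p := range queue {
		for _, pred := range preds {
			if p.FailedPredicate == pred {
				moved = append(moved, p)
				break
			}
		}
	}
	return moved
}

func main() {
	queue := []queuedPod{
		{"pod-a", "CheckServiceAffinity"},
		{"pod-b", "PodFitsResources"},
	}
	fmt.Println(podsToMove("ServiceAdd", queue))      // only pod-a
	fmt.Println(podsToMove("NodeTaintUpdate", queue)) // unknown event: all pods
}
```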
Do we test the scheduler with affinity/anti-affinity policies?
Not in the manner of "X pods/s", but we have hourly benchmark tests on that.
Compared to regular predicates, Pod(Anti-)Affinity involves a lot of extra calculation, and even more to guarantee its "symmetry". We're continuously improving the internal data structures and logic to improve performance. BTW: what are the characteristics of your workloads? NodeAffinity, PodAffinity, or PodAntiAffinity, Required or Preferred, and how many AffinityTerms do you usually have? Understanding this can help us better improve the codebase. Also, in 1.16, we're bringing up the EvenPodsSpread feature to solve some problems in this area.
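For reference, this is roughly what a single required anti-affinity term looks like when built with the `k8s.io/api` Go types; the labels and topology key are just placeholders, and each extra term (plus the implicit symmetry check against existing pods' terms) adds to the per-attempt scheduling work discussed above.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// One required PodAffinityTerm: do not co-locate two pods carrying
	// app=web on the same node (topology key kubernetes.io/hostname).
	affinity := &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{
				{
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"app": "web"},
					},
					TopologyKey: "kubernetes.io/hostname",
				},
			},
		},
	}
	fmt.Printf("%+v\n", affinity)
}
```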
Where can I find the parameters of the test mentioned above, such as --percentage-of-nodes-to-score, the policy config, the workload (pods and nodes) spec, etc.?
Most of our workload is suited to EvenPodsSpread. I tried to optimize this scenario in our branch before. It is really exciting to hear about official support for this feature.
They both employ the code in https://github.com/kubernetes/perf-tests/tree/master/clusterloader2 - which is maintained by sig-scalability. But I don't have a link at hand pointing to the specific scheduler config file.
/unassign
The "Increase scheduling backoff queue max duration" part was merged into master with #81263
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@alculquicondor: Closing this issue.
What would you like to be added:
Increase backoff queue max duration and attach specific error message to unschedulable pods
Why is this needed:
Currently the scheduling backoff queue max duration is 10 seconds. We find that some pods in our cluster (5K nodes and 100k+ pods) wait a very long time to be scheduled. These pods sit in the active queue with lower priority.
If some higher priority pods cannot be scheduled and are added to the backoff queue, the many events that trigger MoveAllToActiveQueue move those higher priority pods back to the active queue within at most 10 seconds, so the lower priority pods never even get a chance to be scheduled. Can we increase the backoff queue max duration to relieve this situation?
Also, some events such as PVC/Service ADD/UPDATE events blindly move all pods in the unschedulable queue to the active queue. Can we attach a specific error message when we add pods to the unschedulable queue, so that events only move the relevant subset of pods from the unschedulable queue to the active queue?
/assign