Increase scheduling backoff queue max duration and attach specific error message to unschedulable pods #81214

NickrenREN · 2019-08-09T07:10:27Z

What would you like to be added:

Increase backoff queue max duration and attach specific error message to unschedulable pods

Why is this needed:

Now the scheduling backoff queue max duration is 10 seconds. We find that some pods in our cluster(5K nodes and 10w+ pods) will wait a very long time to be scheduled. These pods are in the active queue with lower priority.
If some higher priority pods can not be scheduled and be added to backoff queue because of many events which trigger MoveAllToActiveQueue, these higher priority pods will be moved back to active queue in at most 10 seconds, which makes the lower priority pods can not even get a chance to be scheduled, can we increase the backoff queue max duration to relieve this situation ?

And also, some events such as PVC/Service ADD/UPDATE events will blindly move all pods in unschedulable queue to active queue. Can we attach the specific error message when we add pods to unschedulable queue so that events will only move partial pods in unschedulable queue to active queue ?

/assign

The text was updated successfully, but these errors were encountered:

NickrenREN · 2019-08-09T07:10:51Z

/sig scheduling

draveness · 2019-08-09T08:10:54Z

/cc

tedyu · 2019-08-09T11:24:44Z

For lower priority pods getting behind high priority pods, what value are you thinking for the backoff queue max duration ?

Thanks

bsalamat · 2019-08-09T17:54:57Z

Unfortunately, increasing the max back-off increases scheduling latency in some other scenarios:

It raises scheduling latency in auto-scaled clusters: the scheduler retries scheduling pods in a cluster that has run out of resources. Many of such retried pods get max back-off. When auto-scaler adds a new node, the pending pods will have to wait a long time for their back-off to expire before they can be scheduled to these nodes.
A somewhat similar problem can happen after preemption. A cluster is out of resources, a new high priority pod arrives. The scheduler preempts one or more low priority pods. The victim pods get their graceful termination period which is by default 30 seconds (can be longer). Other events in the cluster trigger scheduler retries. The high priority pod is retried and its back-off time increases. The victims terminate and are removed, but the high priority pod has to wait for it back-off time to expire before it is scheduled.

The existing 10 seconds is a relatively long time. Give that the scheduler can schedule 100 pods/s in 5K node clusters, it can evaluate/schedule 1000 pods during that back-off period. Do you often have more than 1000 pending pods in the cluster?

One solution to this problem is to make max back-off configurable. It can be 10 seconds by default, but users can increase it if it is no enough for their use-case.

tedyu · 2019-08-10T02:57:15Z

SchedulingQueue is created by NewConfigFactory.
If we make max back-off duration configurable, does it mean we need to add field to ConfigFactoryArgs ?

NickrenREN · 2019-08-11T02:48:00Z

@bsalamat Thanks for your reply.
Actually, it is hard for scheduler to schedule 100 pods/s because of bottlenecks of many components(apiserver, controller(informer), scheduler itself...). If we do not make any optimizations, scheduler can usually only schedule 10 pods/s in 5k node clusters. 10s max duration is not enough in this case.
I do agree that we can make max back-off duration configurable.

BTW: can we attach error message to pods when we add them to unschedulable queue ? So that when specific events come, we just need to move pods which may be influenced and do not need to blindly move all from unschedulable queue to active(backoff) queue ?

tedyu · 2019-08-11T03:08:47Z

What is described in last paragraph is more than attaching error message to Pods.
It implies type information that distinguishes the Pods.

draveness · 2019-08-11T08:23:04Z

We could start with making the scheduling backoff queue max duration configurable. /assign

…

On Aug 11, 2019, 11:11 AM +0800, Ted Yu ***@***.***>, wrote: What is described in last paragraph is more than attaching error message to Pods. It implies type information that distinguishes the Pods. — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

bsalamat · 2019-08-13T00:16:22Z

Actually, it is hard for scheduler to schedule 100 pods/s because of bottlenecks of many components(apiserver, controller(informer), scheduler itself...). If we do not make any optimizations, scheduler can usually only schedule 10 pods/s in 5k node clusters. 10s max duration is not enough in this case.
I do agree that we can make max back-off duration configurable.

We have a daily performance test that runs the latest version of K8s built from HEAD and runs a real cluster and measures performance of various components, including the scheduler. In that setup, the scheduler -> API server QPS is rate limited at 100 qps. You can see the graph for a 5K node cluster here. As the graph shows, the scheduler can scheduler 100 pods/s. In fact, it may be able to schedule more than 100 pods/s if it is not rate limited.

BTW: can we attach error message to pods when we add them to unschedulable queue ? So that when specific events come, we just need to move pods which may be influenced and do not need to blindly move all from unschedulable queue to active(backoff) queue ?

While this may seem a good optimization, it is error prone and hard to maintain. The scheduler code changes frequently and various predicates are added. For each one of these changes, we would need to revisit which pods should be moved to the active queue at every event. This becomes even harder to maintain after users add their own filter plugins to the scheduling framework. For these reasons, I don't think this optimization is worth the risks.

NickrenREN · 2019-08-13T07:03:37Z

While this may seem a good optimization, it is error prone and hard to maintain. The scheduler code changes frequently and various predicates are added. For each one of these changes, we would need to revisit which pods should be moved to the active queue at every event. This becomes even harder to maintain after users add their own filter plugins to the scheduling framework. For these reasons, I don't think this optimization is worth the risks.

Even if we forget to filter pods from unschedulable queue for new predicates, the worst case is : moving all pods to active queue just like what it does today.

ricky1993 · 2019-08-13T12:07:06Z

We have a daily performance test that runs the latest version of K8s built from HEAD and runs a real cluster and measures performance of various components, including the scheduler. In that setup, the scheduler -> API server QPS is rate limited at 100 qps. You can see the graph for a 5K node cluster here. As the graph shows, the scheduler can scheduler 100 pods/s. In fact, it may be able to schedule more than 100 pods/s if it is not rate limited.

Do we test scheduler with affinity/anti-affinity policy?
As #72479 mentioned, it is a really slow policy for scheduler.
I find our scheduler schedule a pod cost more than one second if the topology across the nodes for a 5K node cluster.

Huang-Wei · 2019-08-13T17:24:04Z

Do we test scheduler with affinity/anti-affinity policy?

Not in the manner of "X pods/s", but we have hourly benchmark tests on that.

As #72479 mentioned, it is a really slow policy for scheduler.
I find our scheduler schedule a pod cost more than one second if the topology across the nodes for a 5K node cluster.

Comparing to regular predicates, Pod(Anti-)Affinity involves a lot of extra calculation, and even more to guarantee its "Symmetry". We're continously improving the internal data structure and logic to improve the performance.

BTW: what's the characteristics of your workloads? NodeAffinity, PodAffinity, or PodAntiAffinity, Required or Preferred, and how many AffinityTerm do you usually have? Understanding this can help us better improve the codebase. And in 1.16, we're also bringing up EvenPodsSpread feature to solve some problems in this area.

ricky1993 · 2019-08-13T18:36:20Z

Not in the manner of "X pods/s", but we have hourly benchmark tests on that.

Where can I find the parameters of the test mentioned above? Such as --percentage-of-nodes-to-score, policy config, workload(pods and nodes) spec and etc.

BTW: what's the characteristics of your workloads? NodeAffinity, PodAffinity, or PodAntiAffinity, Required or Preferred, and how many AffinityTerm do you usually have? Understanding this can help us better improve the codebase. And in 1.16, we're also bringing up EvenPodsSpread feature to solve some problems in this area.

Most of our workload is suit to EvenPodsSpread. I have tried to optimize this scenario in our branch before. It is really excited to hear the official supported of this feature.
At the same time, we have some workloads of all of other scenario, but the number of that is limited, because of the scheduler performance concern.

Huang-Wei · 2019-08-14T00:31:36Z

Where can I find the parameters of the test mentioned above? Such as --percentage-of-nodes-to-score, policy config, workload(pods and nodes) spec and etc.

gce-5000 nodes test link - https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-release-blocking-jobs.yaml#L73
benchmark test link - https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-periodic-jobs.yaml#L280

They both employ the code in https://github.com/kubernetes/perf-tests/tree/master/clusterloader2 - which is maintained by sig-scalability. But I don't have a link around pointing to the specific scheduler config file.

draveness · 2019-10-10T14:01:45Z

/unassign

The "Increase scheduling backoff queue max duration" part was merged into the master with #81263

fejta-bot · 2020-01-08T15:15:56Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

alculquicondor · 2020-01-27T21:19:35Z

/close

#81263 should be enough for what this issue was asking.

I'm deduping the rest of the discussion into #86373

k8s-ci-robot · 2020-01-27T21:19:37Z

@alculquicondor: Closing this issue.

In response to this:

/close

#81263 should be enough for what this issue was asking.

I'm deduping the rest of the discussion into #86373

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

NickrenREN added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 9, 2019

k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Aug 9, 2019

k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 9, 2019

k8s-ci-robot assigned NickrenREN Aug 9, 2019

k8s-ci-robot assigned draveness Aug 11, 2019

draveness mentioned this issue Aug 11, 2019

feat: update scheduling queue with options #81263

Merged

k8s-ci-robot unassigned draveness Oct 10, 2019

yqwang-ms mentioned this issue Dec 18, 2019

Totally avoid Pod starvation (HOL blocking) or clarify the user expectation on the wiki #86373

Closed

yingnanzhang666 mentioned this issue Jan 3, 2020

de-prioritize ScheduleFailed pods in scheduler #85792

Closed

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 8, 2020

k8s-ci-robot closed this as completed Jan 27, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Increase scheduling backoff queue max duration and attach specific error message to unschedulable pods #81214

Increase scheduling backoff queue max duration and attach specific error message to unschedulable pods #81214

NickrenREN commented Aug 9, 2019 •

edited

Loading

NickrenREN commented Aug 9, 2019

draveness commented Aug 9, 2019

tedyu commented Aug 9, 2019

bsalamat commented Aug 9, 2019

tedyu commented Aug 10, 2019

NickrenREN commented Aug 11, 2019

tedyu commented Aug 11, 2019

draveness commented Aug 11, 2019 via email •

edited

Loading

bsalamat commented Aug 13, 2019 •

edited

Loading

NickrenREN commented Aug 13, 2019 •

edited

Loading

ricky1993 commented Aug 13, 2019

Huang-Wei commented Aug 13, 2019

ricky1993 commented Aug 13, 2019

Huang-Wei commented Aug 14, 2019

draveness commented Oct 10, 2019 •

edited

Loading

fejta-bot commented Jan 8, 2020

alculquicondor commented Jan 27, 2020

k8s-ci-robot commented Jan 27, 2020

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Increase scheduling backoff queue max duration and attach specific error message to unschedulable pods #81214

Increase scheduling backoff queue max duration and attach specific error message to unschedulable pods #81214

Comments

NickrenREN commented Aug 9, 2019 • edited Loading

NickrenREN commented Aug 9, 2019

draveness commented Aug 9, 2019

tedyu commented Aug 9, 2019

bsalamat commented Aug 9, 2019

tedyu commented Aug 10, 2019

NickrenREN commented Aug 11, 2019

tedyu commented Aug 11, 2019

draveness commented Aug 11, 2019 via email • edited Loading

bsalamat commented Aug 13, 2019 • edited Loading

NickrenREN commented Aug 13, 2019 • edited Loading

ricky1993 commented Aug 13, 2019

Huang-Wei commented Aug 13, 2019

ricky1993 commented Aug 13, 2019

Huang-Wei commented Aug 14, 2019

draveness commented Oct 10, 2019 • edited Loading

fejta-bot commented Jan 8, 2020

alculquicondor commented Jan 27, 2020

k8s-ci-robot commented Jan 27, 2020

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

NickrenREN commented Aug 9, 2019 •

edited

Loading

draveness commented Aug 11, 2019 via email •

edited

Loading

bsalamat commented Aug 13, 2019 •

edited

Loading

NickrenREN commented Aug 13, 2019 •

edited

Loading

draveness commented Oct 10, 2019 •

edited

Loading