Increase scheduling backoff queue max duration and attach specific error message to unschedulable pods #81214

Closed
NickrenREN opened this issue Aug 9, 2019 · 18 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.


@NickrenREN
Contributor

NickrenREN commented Aug 9, 2019

What would you like to be added:

Increase backoff queue max duration and attach specific error message to unschedulable pods

Why is this needed:

The scheduling backoff queue max duration is currently 10 seconds. We find that some pods in our cluster (5K nodes and 100k+ pods) wait a very long time to be scheduled. These pods sit in the active queue with lower priority.
If some higher-priority pods cannot be scheduled and are added to the backoff queue, the many events that trigger MoveAllToActiveQueue move them back to the active queue within at most 10 seconds, so the lower-priority pods never even get a chance to be scheduled. Can we increase the backoff queue max duration to relieve this situation?
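
For context, here is a minimal, illustrative sketch (not the actual kube-scheduler code; the names and the initial interval are assumptions) of how an exponential per-pod back-off capped at a fixed maximum behaves. After a few failed attempts a pod hits the 10-second cap, so any event that flushes the queues returns it to the active queue within at most 10 seconds:

```go
package main

import (
	"fmt"
	"time"
)

// backoffDuration returns the back-off for a pod after `attempts` failed
// scheduling attempts: an exponential back-off capped at maxBackoff.
// Illustrative only; the real logic lives in the scheduler's internal
// scheduling queue, but the shape of the calculation is the same.
func backoffDuration(attempts int, initial, maxBackoff time.Duration) time.Duration {
	d := initial
	for i := 1; i < attempts; i++ {
		d *= 2
		if d >= maxBackoff {
			return maxBackoff
		}
	}
	return d
}

func main() {
	const (
		initial    = 1 * time.Second  // assumed initial back-off
		maxBackoff = 10 * time.Second // the cap discussed in this issue
	)
	for attempts := 1; attempts <= 6; attempts++ {
		fmt.Printf("attempt %d -> back-off %v\n", attempts, backoffDuration(attempts, initial, maxBackoff))
	}
	// With a 10s cap, a repeatedly-unschedulable high-priority pod re-enters
	// the active queue at most 10s after any event that flushes the
	// unschedulable/back-off queues, ahead of the lower-priority pods.
}
```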

Also, some events, such as PVC/Service ADD/UPDATE, blindly move all pods in the unschedulable queue to the active queue. Can we attach the specific error message when we add pods to the unschedulable queue, so that an event only moves the relevant subset of pods from the unschedulable queue to the active queue?

/assign

@NickrenREN NickrenREN added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 9, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Aug 9, 2019
@NickrenREN
Contributor Author

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 9, 2019
@draveness
Contributor

/cc

@tedyu
Contributor

tedyu commented Aug 9, 2019

For lower-priority pods getting stuck behind high-priority pods, what value are you thinking of for the backoff queue max duration?

Thanks

@bsalamat
Member

bsalamat commented Aug 9, 2019

Unfortunately, increasing the max back-off increases scheduling latency in some other scenarios:

  1. It raises scheduling latency in auto-scaled clusters: the scheduler retries scheduling pods in a cluster that has run out of resources, and many of those retried pods reach the max back-off. When the auto-scaler adds a new node, the pending pods have to wait a long time for their back-off to expire before they can be scheduled onto it.
  2. A somewhat similar problem can happen after preemption. A cluster is out of resources and a new high-priority pod arrives. The scheduler preempts one or more low-priority pods. The victim pods get their graceful termination period, which is 30 seconds by default (and can be longer). Other events in the cluster trigger scheduler retries, so the high-priority pod is retried and its back-off time increases. The victims terminate and are removed, but the high-priority pod still has to wait for its back-off time to expire before it is scheduled.

The existing 10 seconds is a relatively long time. Given that the scheduler can schedule 100 pods/s in 5K-node clusters, it can evaluate/schedule 1000 pods during that back-off period. Do you often have more than 1000 pending pods in the cluster?

One solution to this problem is to make the max back-off configurable. It could be 10 seconds by default, but users could increase it if that is not enough for their use case.

@tedyu
Contributor

tedyu commented Aug 10, 2019

SchedulingQueue is created by NewConfigFactory.
If we make the max back-off duration configurable, does that mean we need to add a field to ConfigFactoryArgs?
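
If it helps the discussion, a rough sketch of what threading a configurable cap through the factory could look like; the field and function names below are hypothetical and do not reflect the change that eventually merged:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical sketch only: PodMaxBackoffSeconds and podMaxBackoff are
// illustrative names, not the actual fields added later.
type ConfigFactoryArgs struct {
	// ... existing factory arguments elided ...

	// PodMaxBackoffSeconds caps the per-pod back-off; 0 means "use the default".
	PodMaxBackoffSeconds int64
}

const defaultPodMaxBackoff = 10 * time.Second

// podMaxBackoff resolves the configured cap, falling back to the 10s default.
func podMaxBackoff(args ConfigFactoryArgs) time.Duration {
	if args.PodMaxBackoffSeconds > 0 {
		return time.Duration(args.PodMaxBackoffSeconds) * time.Second
	}
	return defaultPodMaxBackoff
}

func main() {
	fmt.Println(podMaxBackoff(ConfigFactoryArgs{}))                         // 10s default
	fmt.Println(podMaxBackoff(ConfigFactoryArgs{PodMaxBackoffSeconds: 60})) // raised for large clusters
}
```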

@NickrenREN
Contributor Author

@bsalamat Thanks for your reply.
Actually, it is hard for the scheduler to schedule 100 pods/s because of bottlenecks in many components (apiserver, controllers/informers, the scheduler itself, ...). Without any optimizations, the scheduler can usually only schedule about 10 pods/s in 5K-node clusters, so the 10s max duration is not enough in this case.
I do agree that we could make the max back-off duration configurable.

BTW: can we attach an error message to pods when we add them to the unschedulable queue? Then, when specific events come, we only need to move the pods that may be affected, instead of blindly moving everything from the unschedulable queue to the active (backoff) queue.
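
To make the idea concrete, a rough sketch with invented reason and event names (this is not existing scheduler code): record why a pod was unschedulable when it enters the queue, and on an event move only the pods whose recorded reason that event could plausibly resolve:

```go
package main

import "fmt"

// Illustrative sketch of the proposal; reason and event names are invented.
type unschedReason string
type clusterEvent string

const (
	reasonInsufficientCPU unschedReason = "InsufficientCPU"
	reasonPVCNotBound     unschedReason = "PVCNotBound"

	eventNodeAdd clusterEvent = "NodeAdd"
	eventPVCAdd  clusterEvent = "PVCAdd"
)

// Which recorded reasons could a given event resolve?
var eventResolves = map[clusterEvent][]unschedReason{
	eventNodeAdd: {reasonInsufficientCPU},
	eventPVCAdd:  {reasonPVCNotBound},
}

type queuedPod struct {
	name   string
	reason unschedReason // attached when the pod enters the unschedulable queue
}

// podsToMove returns only the pods whose failure reason the event may fix,
// instead of blindly moving the whole unschedulable queue.
func podsToMove(unschedulable []queuedPod, ev clusterEvent) []queuedPod {
	resolvable := map[unschedReason]bool{}
	for _, r := range eventResolves[ev] {
		resolvable[r] = true
	}
	var out []queuedPod
	for _, p := range unschedulable {
		if resolvable[p.reason] {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	q := []queuedPod{
		{"web-0", reasonInsufficientCPU},
		{"db-0", reasonPVCNotBound},
	}
	fmt.Println(podsToMove(q, eventPVCAdd)) // only db-0 moves on a PVC add
}
```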

@tedyu
Contributor

tedyu commented Aug 11, 2019

What is described in the last paragraph is more than attaching an error message to Pods.
It implies type information that distinguishes the Pods.

@draveness
Contributor

draveness commented Aug 11, 2019 via email

@bsalamat
Member

bsalamat commented Aug 13, 2019

Actually, it is hard for the scheduler to schedule 100 pods/s because of bottlenecks in many components (apiserver, controllers/informers, the scheduler itself, ...). Without any optimizations, the scheduler can usually only schedule about 10 pods/s in 5K-node clusters, so the 10s max duration is not enough in this case.
I do agree that we could make the max back-off duration configurable.

We have a daily performance test that builds the latest K8s from HEAD, runs a real cluster, and measures the performance of various components, including the scheduler. In that setup, the scheduler -> API server QPS is rate limited at 100 qps. You can see the graph for a 5K-node cluster here. As the graph shows, the scheduler can schedule 100 pods/s. In fact, it might schedule more than 100 pods/s if it were not rate limited.

BTW: can we attach an error message to pods when we add them to the unschedulable queue? Then, when specific events come, we only need to move the pods that may be affected, instead of blindly moving everything from the unschedulable queue to the active (backoff) queue.

While this may seem like a good optimization, it is error prone and hard to maintain. The scheduler code changes frequently and new predicates get added; for each of these changes, we would need to revisit which pods should be moved to the active queue for every event. This becomes even harder to maintain once users add their own filter plugins to the scheduling framework. For these reasons, I don't think this optimization is worth the risks.

@NickrenREN
Contributor Author

NickrenREN commented Aug 13, 2019

While this may seem like a good optimization, it is error prone and hard to maintain. The scheduler code changes frequently and new predicates get added; for each of these changes, we would need to revisit which pods should be moved to the active queue for every event. This becomes even harder to maintain once users add their own filter plugins to the scheduling framework. For these reasons, I don't think this optimization is worth the risks.

Even if we forget to filter pods from the unschedulable queue for new predicates, the worst case is moving all pods to the active queue, which is exactly what happens today.

@ricky1993
Contributor

We have a daily performance test that builds the latest K8s from HEAD, runs a real cluster, and measures the performance of various components, including the scheduler. In that setup, the scheduler -> API server QPS is rate limited at 100 qps. You can see the graph for a 5K-node cluster here. As the graph shows, the scheduler can schedule 100 pods/s. In fact, it might schedule more than 100 pods/s if it were not rate limited.

Do we test the scheduler with affinity/anti-affinity policies?
As #72479 mentions, it is a really slow policy for the scheduler.
I find that our scheduler takes more than one second to schedule a pod when the topology spans the nodes of a 5K-node cluster.

@Huang-Wei
Member

Do we test the scheduler with affinity/anti-affinity policies?

Not in the manner of "X pods/s", but we have hourly benchmark tests on that.

As #72479 mentions, it is a really slow policy for the scheduler.
I find that our scheduler takes more than one second to schedule a pod when the topology spans the nodes of a 5K-node cluster.

Compared to regular predicates, Pod(Anti-)Affinity involves a lot of extra calculation, and even more to guarantee its "symmetry". We're continuously improving the internal data structures and logic to improve performance.

BTW: what are the characteristics of your workloads? NodeAffinity, PodAffinity, or PodAntiAffinity; Required or Preferred; and how many AffinityTerms do you usually have? Understanding this can help us better improve the codebase. In 1.16, we're also bringing up the EvenPodsSpread feature to solve some problems in this area.
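
For reference, this is the shape of a single required anti-affinity term expressed with the Kubernetes Go API types (a generic example, not a description of any particular workload): each PodAffinityTerm is one of the AffinityTerms asked about above, and each additional term adds per-node work plus the symmetry checks mentioned earlier.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// "Spread replicas of app=web across nodes": one required anti-affinity
	// term keyed on the node hostname topology.
	affinity := &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{
				{
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"app": "web"},
					},
					TopologyKey: "kubernetes.io/hostname",
				},
			},
		},
	}
	fmt.Printf("%+v\n", affinity.PodAntiAffinity)
}
```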

@ricky1993
Contributor

Not in the manner of "X pods/s", but we have hourly benchmark tests on that.

Where can I find the parameters of the test mentioned above, such as --percentage-of-nodes-to-score, the policy config, the workload (pods and nodes) spec, etc.?

BTW: what are the characteristics of your workloads? NodeAffinity, PodAffinity, or PodAntiAffinity; Required or Preferred; and how many AffinityTerms do you usually have? Understanding this can help us better improve the codebase. In 1.16, we're also bringing up the EvenPodsSpread feature to solve some problems in this area.

Most of our workload is suited to EvenPodsSpread; I tried to optimize this scenario in our own branch before, so it is really exciting to hear about official support for this feature.
At the same time, we have some workloads covering all of the other scenarios, but their number is limited because of scheduler performance concerns.

@Huang-Wei
Member

Where can I find the parameters of the test mentioned above, such as --percentage-of-nodes-to-score, the policy config, the workload (pods and nodes) spec, etc.?

They both use the code in https://github.com/kubernetes/perf-tests/tree/master/clusterloader2, which is maintained by sig-scalability, but I don't have a link handy pointing to the specific scheduler config file.

@draveness
Contributor

draveness commented Oct 10, 2019

/unassign

The "Increase scheduling backoff queue max duration" part was merged into master via #81263.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 8, 2020
@alculquicondor
Member

/close

#81263 should be enough for what this issue was asking.

I'm deduping the rest of the discussion into #86373

@k8s-ci-robot
Contributor

@alculquicondor: Closing this issue.

In response to this:

/close

#81263 should be enough for what this issue was asking.

I'm deduping the rest of the discussion into #86373

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
