
Prevent CPU spike in task supervisor when heartbeat timeout exceeded by kaxil · Pull Request #51023 · apache/airflow · GitHub

Prevent CPU spike in task supervisor when heartbeat timeout exceeded #51023


Merged: kaxil merged 1 commit into apache:main from stop-cpu-spike on May 25, 2025

kaxil (Member) commented on May 23, 2025

closes #50507

When the task supervisor's heartbeat timeout is exceeded, the max_wait_time calculation can evaluate to 0 (the raw value goes negative and is clamped to 0). selector.select(timeout=0) then returns immediately on every iteration, so the loop spins without blocking and consumes 100% CPU, as explained in issue #50507.

Add a minimum timeout of 0.01s in _service_subprocess to prevent this issue while keeping task monitoring responsive.
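The effect of such a floor can be sketched in isolation. This is a minimal illustration, not the code from this PR: `clamp_timeout`, `count_polls`, and `MIN_SELECT_TIMEOUT` are hypothetical names, and the idle socket pair stands in for the supervisor's real sockets. Only the 0.01s value comes from the description above.

```python
import selectors
import socket
import time

MIN_SELECT_TIMEOUT = 0.01  # the 0.01s floor named in the PR description


def clamp_timeout(max_wait_time: float) -> float:
    """Hypothetical helper: never hand the selector a zero or negative timeout."""
    return max(MIN_SELECT_TIMEOUT, max_wait_time)


def count_polls(timeout: float, duration: float = 0.2) -> int:
    """Count how often select() returns on an idle socket within `duration` seconds."""
    reader, writer = socket.socketpair()
    sel = selectors.DefaultSelector()
    sel.register(reader, selectors.EVENT_READ)
    polls = 0
    deadline = time.monotonic() + duration
    try:
        while time.monotonic() < deadline:
            sel.select(timeout=timeout)  # timeout <= 0 means "do not block"
            polls += 1
    finally:
        sel.close()
        reader.close()
        writer.close()
    return polls


if __name__ == "__main__":
    print("polls with timeout=0:      ", count_polls(0))
    print("polls with clamped timeout:", count_polls(clamp_timeout(0)))
```

With a timeout of 0 the loop should iterate as fast as the CPU allows, while the clamped version blocks for about 10 ms per call, so it is limited to roughly 100 wake-ups per second.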

Root Cause:
The issue occurs in the _monitor_subprocess method when:

  1. last_heartbeat_ago becomes very large (e.g., 100+ seconds), for example because of network issues or bugs such as #50500 (the task supervisor continues running indefinitely, even after the associated task process has completed)
  2. The calculation HEARTBEAT_TIMEOUT - last_heartbeat_ago * 0.75 becomes negative
  3. max(0, negative_value) results in 0
  4. selector.select(timeout=0) runs in a non-blocking tight loop
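The arithmetic in steps 1 to 3 can be checked with a short sketch. The value of `HEARTBEAT_TIMEOUT` below is an assumption chosen for illustration, and `max_wait_time` is written here as a standalone function rather than as it appears in the supervisor; only the expression itself is taken from the description above.

```python
HEARTBEAT_TIMEOUT = 30.0  # assumed value, for illustration only


def max_wait_time(last_heartbeat_ago: float) -> float:
    """Reproduce the calculation from steps 2 and 3 of the root cause."""
    remaining = HEARTBEAT_TIMEOUT - last_heartbeat_ago * 0.75
    # A negative result is clamped to 0, which makes select() non-blocking.
    return max(0, remaining)


if __name__ == "__main__":
    for ago in (10.0, 40.0, 100.0):
        print(f"last_heartbeat_ago={ago:>5}: max_wait_time={max_wait_time(ago)}")
```

Under this assumed timeout, any last_heartbeat_ago of 40 seconds or more yields a wait time of 0, so every later pass through the loop polls without blocking until a heartbeat succeeds.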

kaxil requested review from ashb and amoghrajesh as code owners on May 23, 2025 23:43
kaxil added the backport-to-v3-0-test label on May 23, 2025
kaxil merged commit beb7b62 into apache:main on May 25, 2025
72 checks passed
kaxil deleted the stop-cpu-spike branch on May 25, 2025 11:08
github-actions bot pushed a commit that referenced this pull request May 25, 2025
…ut exceeded (#51023)

(cherry picked from commit beb7b62)

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>

Backport successfully created: v3-0-test

Status Branch Result
v3-0-test PR Link

github-actions bot pushed a commit to aws-mwaa/upstream-to-airflow that referenced this pull request May 25, 2025
…ut exceeded (apache#51023)

(cherry picked from commit beb7b62)

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
github-actions bot pushed a commit to guan404ming/airflow that referenced this pull request May 25, 2025
…ut exceeded (apache#51023)

(cherry picked from commit beb7b62)

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
amoghrajesh (Contributor) left a comment:

Good find!

kaxil added a commit that referenced this pull request Jun 2, 2025
…ut exceeded (#51023) (#51047)

(cherry picked from commit beb7b62)

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
kaxil added a commit that referenced this pull request Jun 3, 2025
…ut exceeded (#51023) (#51047)

(cherry picked from commit beb7b62)

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
Labels
area:task-sdk, backport-to-v3-0-test
Development

Successfully merging this pull request may close these issues.

Task supervisor CPU spike with 0 timeout for socket selector events
4 participants