Improve detection and handling of timed out DAG processor processes #49868

Merged
merged 6 commits into from
Apr 29, 2025


amoghrajesh
Contributor

closes: #49689

Problem

The start_time property on a DAG file processor subprocess (https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/dag_processing/processor.py#L315-L317) is calculated using boot_time in psutil: https://github.com/giampaolo/psutil/blob/d461f4c0f0aad1a039c7d8bb724a4c7288ef2f39/psutil/_pslinux.py#L1557

The problem appears to be in how we use it. When we read it as a property, it looks like caching in psutil (https://github.com/giampaolo/psutil/blob/d461f4c0f0aad1a039c7d8bb724a4c7288ef2f39/psutil/__init__.py#L774-L784) means the start_time of a subprocess does not update when the system sleeps, which leaves us with an earlier start_time. When calculating the duration we compare against time.time(), which does shift, so the computed duration is inflated and the subprocess gets killed.

Switching to time.monotonic gives a more accurate uptime calculation because it is not affected by restarts or by changes to the system clock.

This change appears to resolve the problem.
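To illustrate the idea, here is a minimal sketch of measuring a subprocess's run time on a single monotonic clock. `ProcessTimer`, its `clock` parameter, and `is_timed_out` are hypothetical names for illustration only, not the actual Airflow code. The key point is that the start time and the later readings come from the same clock, so a wall-clock adjustment cannot inflate the duration. How `time.monotonic` behaves across system sleep is platform-dependent.

```python
import time
from typing import Callable


class ProcessTimer:
    """Hypothetical helper: tracks elapsed run time against a timeout."""

    def __init__(self, timeout: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self.timeout = timeout
        # Record the start on the same clock that is later used to measure
        # the duration, instead of mixing a cached creation time with time.time().
        self.start_time = clock()

    @property
    def duration(self) -> float:
        # Difference of two readings from one clock; system clock updates
        # do not affect time.monotonic().
        return self._clock() - self.start_time

    def is_timed_out(self) -> bool:
        return self.duration > self.timeout
```

The `clock` argument exists only so the behaviour can be exercised deterministically in tests by passing a fake clock.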



Member

@kaxil kaxil left a comment

Triggerer tests are failing:

FAILED airflow-core/tests/unit/jobs/test_triggerer_job.py::test_trigger_create_race_condition_38599 - TypeError: __init__() missing 1 required keyword-only argument: 'start_time'
FAILED airflow-core/tests/unit/jobs/test_triggerer_job.py::test_failed_trigger - TypeError: __init__() missing 1 required keyword-only argument: 'start_time'
XPASS airflow-core/tests/unit/jobs/test_scheduler_job.py::TestSchedulerJob::test_do_not_schedule_removed_task - This test does not verify anything; no time to fix; see notes below
XPASS airflow-core/tests/unit/jobs/test_triggerer_job.py::test_trigger_can_access_variables_connections_and_xcoms - We know that test is flaky and have no time to fix it before 3.0. We should fix it later. TODO: AIP-72
XPASS airflow-core/tests/unit/jobs/test_triggerer_job.py::test_trigger_can_fetch_trigger_dag_run_count_and_state_in_deferrable - We know that test is flaky and have no time to fix it before 3.0. We should fix it later. TODO: AIP-72
XPASS airflow-core/tests/unit/jobs/test_triggerer_job.py::test_trigger_can_fetch_dag_run_count_ti_count_in_deferrable - We know that test is flaky and have no time to fix it before 3.0. We should fix it later. TODO: AIP-72

You probably need:

diff --git a/airflow-core/tests/unit/jobs/test_triggerer_job.py b/airflow-core/tests/unit/jobs/test_triggerer_job.py
index d3c9ae6f27..a717b0bb83 100644
--- a/airflow-core/tests/unit/jobs/test_triggerer_job.py
+++ b/airflow-core/tests/unit/jobs/test_triggerer_job.py
@@ -174,6 +174,7 @@ def supervisor_builder(mocker, session):
             process=process,
             requests_fd=-1,
             capacity=10,
+            start_time=time.monotonic(),
         )
         # Mock the selector
         mock_selector = mocker.Mock(spec=selectors.DefaultSelector)

or make start_time optional and do

self.start_time = start_time or time.monotonic()
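As a rough sketch of the second option, the constructor could take an optional keyword-only `start_time` and fall back to the monotonic clock. The `Supervisor` class below is a hypothetical stand-in, not the real class under test. It checks `is None` explicitly instead of using `or`, because `or` would also replace a legitimate `0.0` value; that case is unlikely in practice but possible, since the reference point of `time.monotonic()` is undefined.

```python
import time
from typing import Optional


class Supervisor:
    """Hypothetical stand-in for the supervisor class built in the tests."""

    def __init__(self, *, capacity: int, start_time: Optional[float] = None) -> None:
        self.capacity = capacity
        # Default to "now" on the monotonic clock when the caller passes nothing.
        # An explicit None check keeps a caller-supplied 0.0 intact.
        self.start_time = time.monotonic() if start_time is None else start_time


# Existing call sites that omit start_time keep working:
supervisor = Supervisor(capacity=10)
```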

@amoghrajesh
Contributor Author

Ah, I had made it optional but didn't assign a default. Pushing a fix.

@ashb
Member

ashb commented Apr 28, 2025

We should add some of the reasoning in a code comment too, else someone might "optimise" it by switching back to the created time.

@kaxil kaxil force-pushed the dagprocessor-crashing-bug branch from e36b5e8 to f5d2783 Compare April 28, 2025 19:30
@kaxil kaxil added this to the Airflow 3.0.1 milestone Apr 28, 2025
@amoghrajesh amoghrajesh added the backport-to-v3-0-test Mark PR with this label to backport to v3-0-test branch label Apr 29, 2025
@amoghrajesh amoghrajesh merged commit af10644 into apache:main Apr 29, 2025
71 checks passed
@amoghrajesh amoghrajesh deleted the dagprocessor-crashing-bug branch April 29, 2025 05:10
github-actions bot pushed a commit that referenced this pull request Apr 29, 2025
… processes (#49868)

(cherry picked from commit af10644)

Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>
Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
Co-authored-by: Ash Berlin-Taylor <ash_github@firemirror.com>

Backport successfully created: v3-0-test

Status Branch Result
v3-0-test PR Link

github-actions bot pushed a commit to aws-mwaa/upstream-to-airflow that referenced this pull request Apr 29, 2025
… processes (apache#49868)

(cherry picked from commit af10644)

Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>
Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
Co-authored-by: Ash Berlin-Taylor <ash_github@firemirror.com>
amoghrajesh added a commit that referenced this pull request Apr 29, 2025
… processes (#49868) (#49925)

(cherry picked from commit af10644)

Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>
Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
Co-authored-by: Ash Berlin-Taylor <ash_github@firemirror.com>
mvfc pushed a commit to mvfc/airflow that referenced this pull request Apr 29, 2025
…pache#49868)

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
Co-authored-by: Ash Berlin-Taylor <ash_github@firemirror.com>
mvfc pushed a commit to mvfc/airflow that referenced this pull request Apr 29, 2025
…pache#49868)

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
Co-authored-by: Ash Berlin-Taylor <ash_github@firemirror.com>
jroachgolf84 pushed a commit to jroachgolf84/airflow that referenced this pull request Apr 30, 2025
…pache#49868)

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
Co-authored-by: Ash Berlin-Taylor <ash_github@firemirror.com>
Labels
area:DAG-processing area:task-sdk backport-to-v3-0-test Mark PR with this label to backport to v3-0-test branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dag processor gets SIGKILL signal and all DAGs are removed from UI
3 participants