
Commit 4a06f89

pcolladosoto and potiuk authored
Fix processor cleanup on DagFileProcessorManager (#22685)
* Fix processor cleanup

  References to processors weren't being cleaned up after killing them in the event of a timeout. This led to a crash caused by an unhandled exception when trying to read from a closed end of a pipe.

* Reap the zombie when killing the processor

  When calling `_kill_process()` we're generating zombies which weren't being `wait()`ed for. This led to a process leak, which we fix by calling `waitpid()` on the appropriate PIDs.

* Reap resulting zombies in a safe way

  According to @potiuk's and @malthe's input, the way we were reaping the zombies could cause some racy and unwanted situations. As seen in the discussion over at `https://bugs.python.org/issue42558`, we can safely reap the spawned zombies with the changes we have introduced.

* Explain why we are actively waiting

  As suggested by @potiuk, explaining why we chose to actively wait in a scenario such as this one can be useful for anybody looking at the code some time from now.

  Co-authored-by: Jarek Potiuk <jarek@potiuk.com>

* Fix small typo and trailing whitespace

  After accepting the changes proposed on the PR we found a small typo and a trailing whitespace we thought it best to delete.

* Fix call to `poll()`

  We were calling `poll()` through the `_process` attribute and, as shown by the static checks triggered by GitHub, it's not defined for the `BaseProcess` class. We instead have to call `poll()` through `BaseProcess`'s `_popen` attribute.

* Prevent static check from failing

  After reading through `multiprocessing`'s implementation we couldn't tell why the static check on line `239` was failing: the process should contain a `_popen` attribute. That's when we found line `223` and discovered the trailing `# type: ignore` comment, which instructs *MyPy* not to statically check that very line. Given we're having trouble with the exact same attribute, we decided to include the same directive for the static checker.

* Fix test for `_kill_timed_out_processors()`

  We hadn't updated the tests for the method whose body we've altered. This caused the tests to fail when trying to retrieve a processor's *waitable*, a property similar to a *file descriptor* in UNIX-like systems.

  We have added a mock property to the `processor` and we've also updated the `manager`'s attributes so as to faithfully recreate the state of the data structures at the moment when a `processor` is to be terminated. Note that the assertions at the end are meant to check we reach the `manager`'s expected state. We have chosen to check the number of processors against an explicit value because we're defining `manager._processors` explicitly within the test. On the other hand, `manager.waitables` can have a different length depending on the call to `DagFileProcessorManager`'s `__init__()`. In this test the expected initial length is `1`, given we're passing `MagicMock()` as the `signal_conn` when instantiating the manager; if this were to be changed, the tests would 'inexplicably' fail. Instead of checking `manager.waitables`' length against a hardcoded value, we compare it to its initial length so as to emphasize we're interested in the change in length, not its absolute value.

* Fix `black` checks and `mock` decorators

  One of the methods we mock required a rather long `@mock.patch` decorator which didn't pass the checks made by `black` in the pre-commit hooks. On top of that, we messed up the ordering of the `@mock.patch` decorators, which meant we didn't set them up properly. This manifested as a `KeyError` in the method we're currently testing.

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
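The crash described in the first item can be reproduced in isolation: once a processor is killed, its end of the channel is closed, but before this fix its handle was left behind in `manager.waitables`, so the manager could still try to read from a closed connection. A minimal sketch of that failure mode, using a plain `multiprocessing.Pipe` outside Airflow:

```python
from multiprocessing import Pipe

# Stand-in for the manager's end of a processor's channel. Killing the
# processor closes this end, but before the fix its handle was still
# listed among the manager's waitables and could be polled and read.
parent_end, child_end = Pipe()
parent_end.close()

caught = None
try:
    parent_end.recv()  # what reading a stale waitable amounted to
except OSError as err:
    caught = err

print(caught)  # OSError: handle is closed
```

Removing the handle from `self.waitables` when the processor is killed (as the manager.py hunk below does) means the stale connection is never selected on again.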
1 parent c0c08b2 commit 4a06f89

File tree

3 files changed: +25, -1 lines changed

airflow/dag_processing/manager.py

Lines changed: 9 additions & 0 deletions
@@ -1065,6 +1065,7 @@ def prepare_file_path_queue(self):
     def _kill_timed_out_processors(self):
         """Kill any file processors that timeout to defend against process hangs."""
         now = timezone.utcnow()
+        processors_to_remove = []
         for file_path, processor in self._processors.items():
             duration = now - processor.start_time
             if duration > self._processor_timeout:
@@ -1080,6 +1081,14 @@ def _kill_timed_out_processors(self):
                 Stats.incr('dag_file_processor_timeouts')
                 processor.kill()
 
+                # Clean up processor references
+                self.waitables.pop(processor.waitable_handle)
+                processors_to_remove.append(file_path)
+
+        # Clean up `self._processors` after iterating over it
+        for proc in processors_to_remove:
+            self._processors.pop(proc)
+
     def max_runs_reached(self):
         """:return: whether all file paths have been processed max_runs times"""
         if self._max_runs == -1:  # Unlimited runs.
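The two-phase removal in this hunk (collect paths into `processors_to_remove`, pop them after the loop) is required because a dict must not change size while it is being iterated. A standalone illustration of both the failure and the pattern the patch uses:

```python
# Popping inside the loop changes the dict's size mid-iteration and raises.
stale = {"a.py": object(), "b.py": object()}
error = None
try:
    for path in stale:
        stale.pop(path)
except RuntimeError as err:
    error = err

print(error)  # dictionary changed size during iteration

# The pattern the patch uses: remember the keys, remove them after the loop.
stale = {"a.py": object(), "b.py": object()}
to_remove = [path for path in stale]
for path in to_remove:
    stale.pop(path)

print(stale)  # {}
```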

airflow/dag_processing/processor.py

Lines changed: 7 additions & 0 deletions
@@ -21,6 +21,7 @@
 import os
 import signal
 import threading
+import time
 from contextlib import redirect_stderr, redirect_stdout, suppress
 from datetime import timedelta
 from multiprocessing.connection import Connection as MultiprocessingConnection
@@ -231,6 +232,12 @@ def _kill_process(self) -> None:
         if self._process.is_alive() and self._process.pid:
             self.log.warning("Killing DAGFileProcessorProcess (PID=%d)", self._process.pid)
             os.kill(self._process.pid, signal.SIGKILL)
+
+            # Reap the spawned zombie. We actively wait, because in Python 3.9 `waitpid` might lead to an
+            # exception, due to a change in the Python standard library and the possibility of a race
+            # condition, see https://bugs.python.org/issue42558
+            while self._process._popen.poll() is None:  # type: ignore
+                time.sleep(0.001)
         if self._parent_channel:
             self._parent_channel.close()
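The committed loop polls the private `_popen` handle; the same reap can be sketched with `multiprocessing`'s public `exitcode` property, which polls (and thus reaps) the child via `waitpid(..., WNOHANG)` under the hood. This is an illustrative sketch, not Airflow code, and it assumes a POSIX system with the default `fork` start method:

```python
import multiprocessing
import os
import signal
import time


def simulate_processor():
    # Stand-in for a DAG file processor stuck past its timeout.
    time.sleep(60)


proc = multiprocessing.Process(target=simulate_processor)
proc.start()
os.kill(proc.pid, signal.SIGKILL)

# Actively wait instead of calling os.waitpid() ourselves: multiprocessing
# reaps its own children, and a competing waitpid() can race with it
# (https://bugs.python.org/issue42558). Reading `exitcode` polls with
# waitpid(WNOHANG), so this loop both detects and reaps the zombie.
while proc.exitcode is None:
    time.sleep(0.001)

print(proc.exitcode)  # -9: terminated by SIGKILL
```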

tests/dag_processing/test_manager.py

Lines changed: 9 additions & 1 deletion
@@ -501,10 +501,14 @@ def test_deactivate_stale_dags(self):
 
         assert len(active_dags) == 0
 
+    @mock.patch(
+        "airflow.dag_processing.processor.DagFileProcessorProcess.waitable_handle", new_callable=PropertyMock
+    )
     @mock.patch("airflow.dag_processing.processor.DagFileProcessorProcess.pid", new_callable=PropertyMock)
     @mock.patch("airflow.dag_processing.processor.DagFileProcessorProcess.kill")
-    def test_kill_timed_out_processors_kill(self, mock_kill, mock_pid):
+    def test_kill_timed_out_processors_kill(self, mock_kill, mock_pid, mock_waitable_handle):
         mock_pid.return_value = 1234
+        mock_waitable_handle.return_value = 3
         manager = DagFileProcessorManager(
             dag_directory='directory',
             max_runs=1,
@@ -518,8 +522,12 @@ def test_kill_timed_out_processors_kill(self, mock_kill, mock_pid):
         processor = DagFileProcessorProcess('abc.txt', False, [], [])
         processor._start_time = timezone.make_aware(datetime.min)
         manager._processors = {'abc.txt': processor}
+        manager.waitables[3] = processor
+        initial_waitables = len(manager.waitables)
         manager._kill_timed_out_processors()
         mock_kill.assert_called_once_with()
+        assert len(manager._processors) == 0
+        assert len(manager.waitables) == initial_waitables - 1
 
     @mock.patch("airflow.dag_processing.processor.DagFileProcessorProcess.pid", new_callable=PropertyMock)
     @mock.patch("airflow.dag_processing.processor.DagFileProcessorProcess")
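The commit message notes that mis-ordering the `@mock.patch` decorators produced a `KeyError`. Decorators apply bottom-up, so the mock for the decorator closest to the function arrives as the first argument; swapping the parameter names silently wires each name to the wrong mock. A toy reproduction of the correct wiring (the `Processor` class here is a hypothetical stand-in, not Airflow's):

```python
from unittest import mock


class Processor:
    # Hypothetical stand-in for DagFileProcessorProcess.
    pid = None

    def kill(self):
        raise NotImplementedError


# Decorators apply bottom-up: the decorator closest to the function
# ("kill") supplies the *first* mock argument, the one above it ("pid")
# the second. Swapping the parameter names wires each to the wrong mock.
@mock.patch.object(Processor, "pid", new_callable=mock.PropertyMock)
@mock.patch.object(Processor, "kill")
def exercise(mock_kill, mock_pid):
    mock_pid.return_value = 1234
    proc = Processor()
    assert proc.pid == 1234  # PropertyMock answers attribute access
    proc.kill()
    mock_kill.assert_called_once_with()
    return True


print(exercise())  # True
```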
