dag_processing code needs to handle OSError("handle is closed") in poll() and recv() calls #22191
Comments
Thanks for opening your first issue here! Be sure to follow the issue template!
Feel free to submit a pull request to handle the exception! We can figure out how to test the solution in the review process. BTW I don’t know what your current fix looks like, but ...
I plan to submit a PR within the next two weeks.
Trying to explain things... Our team has run into this issue time and time again. We have tried different combinations of both Airflow and Python versions to no avail.

TL;DR: When a DAG file processor is killed after timing out, the manager keeps trying to read results from its (by then closed) communication channel, which raises OSError("handle is closed") and brings the scheduler down.

The long read... We're running a decoupled Airflow deployment within a k8s cluster. We are currently using a 3-container pod where one container runs the Web Server, another one executes the Scheduler and the third one runs Flower (we're using the CeleryExecutor). The backbone of the deployment is implemented through a StatefulSet that runs the Celery workers themselves. The trace we were seeing on the scheduler time and time again was:
This has been thrown by Airflow 2.1.3, but we've seen very similar (if not identical) variations with versions all the way up to Airflow 2.2.4. Given we traced the problem down to the way multiprocessing synchronisation was being handled, we played around with the multiprocessing-related code in airflow/airflow/utils/mixins.py (lines 27 to 38 in f309ea7).
The containers we are using leverage ... For a while we coped with this behaviour by just restarting the Airflow deployment on an hourly basis, but we decided to set some time apart today to delve a bit deeper into all this. The good news is that, after a thorough investigation, we noticed a pattern that preceded the crash. In order to pin it down we ran:
The above led us to believe something was a bit off in the way the DAG file processor manager handles timed-out processors.
Can you see how the number of waitables doesn't go down even though we're killing timed-out processors? Note we added the extra logging ourselves; the original method lives at airflow/airflow/dag_processing/manager.py, lines 1159 to 1175 in 614858f.
After we added the additional logging it looked like:

```python
def _kill_timed_out_processors(self):
    """Kill any file processors that timeout to defend against process hangs."""
    self.log.debug("FOO - Looking for DAG processors to terminate due to timeouts!")
    now = timezone.utcnow()
    processors_to_remove = []
    for file_path, processor in self._processors.items():
        duration = now - processor.start_time
        if duration > self._processor_timeout:
            self.log.error(
                "Processor for %s with PID %s started at %s has timed out, killing it.",
                file_path,
                processor.pid,
                processor.start_time.isoformat(),
            )
            Stats.decr('dag_processing.processes')
            Stats.incr('dag_processing.processor_timeouts')
            # TODO: Remove after Airflow 2.0
            Stats.incr('dag_file_processor_timeouts')
            self.log.debug(f"FOO - # of waitables BEFORE killing timed out processor: {len(self.waitables)}")
            processor.kill()
            self.log.debug(f"FOO - # of waitables AFTER killing timed out processor: {len(self.waitables)}")
```

You can see how we call the processor's `kill()` method, defined at airflow/airflow/dag_processing/processor.py, lines 238 to 246 in 614858f.
Notice how, as part of that call, the processor closes its end of the communicating pipe, the very one opened at airflow/airflow/dag_processing/processor.py, line 187 in 614858f. That's exactly the same pipe end (i.e. file descriptor) the code at airflow/airflow/dag_processing/processor.py, line 286 in 614858f later tries to poll(). What it's trying to do is read the parsing result the processor sends back over that pipe. So... why are we trying to collect results from a processor we have just killed? The answer lies in the manager's main parsing loop, airflow/airflow/dag_processing/manager.py, lines 612 to 734 in 614858f.
It basically runs an infinite loop (unless we specify a maximum number of runs) that, among other things, calls the code at airflow/airflow/dag_processing/manager.py, lines 1015 to 1034 in 614858f, where processors and their pipe ends are tracked through the `self.waitables` dictionary the loop later waits on to collect results. A simplified, standalone sketch of what goes wrong is shown below.
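As a toy, standalone illustration (this is not the Airflow code, just our mental model of the bookkeeping problem): if the entry for a killed processor is left behind in a waitables-style dict, the next attempt to collect results touches a pipe end that has already been closed.

```python
import multiprocessing


def child(conn):
    # Stand-in for a DAG file processor sending its result back.
    conn.send("pretend this is a parsing result")
    conn.close()


if __name__ == "__main__":
    parent_conn, child_conn = multiprocessing.Pipe()
    proc = multiprocessing.Process(target=child, args=(child_conn,))
    proc.start()

    waitables = {parent_conn: proc}  # analogous to the manager's self.waitables

    # Simulate the timeout path: kill the child and close our pipe end,
    # but "forget" to remove the stale entry from waitables.
    proc.terminate()
    proc.join()
    parent_conn.close()

    # The next round of result collection hits the stale entry and blows up.
    for handle in list(waitables):
        handle.poll()  # OSError: handle is closed
```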
However, this dictionary is not updated when a processor is killed due to a timeout: you can check that out in the snippet we included above. Thus, after the timed-out processor is killed and its pipe end closed, the loop still tries to get results out of that dead handle on its next iteration, and that's when OSError("handle is closed") brings everything down.

After some testing we arrived at the following implementation of `_kill_timed_out_processors()`:

```python
def _kill_timed_out_processors(self):
    """Kill any file processors that timeout to defend against process hangs."""
    self.log.debug("FOO - ** Entering _kill_timed_out_processors() **")
    # We'll get a clear picture of what these two attributes look like before
    # killing anything.
    self.log.debug(f"FOO - We'll iterate over: {self._processors}")
    self.log.debug(f"FOO - Current waitables dir: {self.waitables}")
    now = timezone.utcnow()
    processors_to_remove = []
    for file_path, processor in self._processors.items():
        duration = now - processor.start_time
        if duration > self._processor_timeout:
            self.log.error(
                "Processor for %s with PID %s started at %s has timed out, killing it.",
                file_path,
                processor.pid,
                processor.start_time.isoformat(),
            )
            Stats.decr('dag_processing.processes')
            Stats.incr('dag_processing.processor_timeouts')
            # TODO: Remove after Airflow 2.0
            Stats.incr('dag_file_processor_timeouts')
            # Add some logging to check stuff out
            self.log.debug(f"FOO - # of waitables BEFORE killing timed out processor: {len(self.waitables)}")
            self.log.debug(f"FOO - We'll iterate over: {self._processors}")
            self.log.debug(f"FOO - Current waitables dir: {self.waitables}")
            # Kill the hung processor
            processor.kill()
            # Update self.waitables to avoid asking for results later on
            self.waitables.pop(processor.waitable_handle)
            # Make a note of the file_paths we are to remove later on: we feel a bit uneasy about
            # modifying the container we're currently iterating over...
            processors_to_remove.append(file_path)
            # Add some logging to check how stuff is doing...
            self.log.debug(f"FOO - # of waitables AFTER killing timed out processor: {len(self.waitables)}")
            self.log.debug(f"FOO - We'll iterate over: {self._processors}")
            self.log.debug(f"FOO - Current waitables dir: {self.waitables}")
    # Clean up `self._processors` too!
    for proc in processors_to_remove:
        self._processors.pop(proc)
    # And after we're done take a look at the final state
    self.log.debug(f"FOO - Processors after cleanup: {self._processors}")
    self.log.debug(f"FOO - Waitables after cleanup: {self.waitables}")
    self.log.debug(f"FOO - ** Leaving _kill_timed_out_processors() with **")
```

We know the above can surely be written in a more succinct/better way: we're by no means good programmers! Against all odds, the code above seems to prevent the crash! 🎉 It does, however, spawn zombies when we kill the timed-out processors. We decided to also play around with the processor's `_kill_process()` method so that it reaps them:

```python
def _kill_process(self) -> None:
    if self._process is None:
        raise AirflowException("Tried to kill process before starting!")
    if self._process.is_alive() and self._process.pid:
        self.log.warning("Killing DAGFileProcessorProcess (PID=%d)", self._process.pid)
        os.kill(self._process.pid, signal.SIGKILL)
        # Reap the resulting zombie! Note the call to `waitpid()` blocks unless we
        # leverage the `WNOHANG` (https://docs.python.org/3/library/os.html#os.WNOHANG)
        # option. Given we were just playing around we decided not to bother with that...
        self.log.warning(f"FOO - Waiting to reap the zombie spawned from PID {self._process.pid}")
        os.waitpid(self._process.pid, 0)
        self.log.warning(f"FOO - Reaped the zombie spawned from PID {self._process.pid}")
    if self._parent_channel:
        self._parent_channel.close()
```

From what we could see, the above reaped the zombie like we initially expected it to.

So, after all this nonsense we just wanted to end up by saying that we believe it's the way timed-out processors are killed and (not) cleaned up afterwards that triggers this whole issue. We would also like to thank everybody making Airflow possible: it's one heck of a tool! Feel free to ask for more details and, if we got anything wrong (it wouldn't be the first time), please do let us know!
@pcolladosoto - I ❤️ your detailed description and explanation. It reads like a good crime story 🗡️ Fantastic investigation and outcome. How about you clean it up a bit and submit a PR fixing it?
Hi @potiuk! I'm glad you found our investigation useful and that you had fun reading through it. Reading so many Agatha Christie books has to pay off at some point 😜 I would be more than happy to polish it all up and open a Pull Request so that the changes are incorporated into Airflow itself. I'll do my best to find some time to do it throughout the week. And thanks a ton for the kind words! I really appreciate it 😋
Cool. If you have any questions during the contribution process, I'm happy to help - just "@" me. And even if you are not sure about some of the decisions, we can discuss them in the PR and iterate before we merge (and drag in more of the minds here to make it really good).
This looks like a better approach than what I was thinking. One comment on the reaping of the killed processor: the question is whether or not to use a timeout for the join. Because SIGKILL was used, doing a join afterwards should always work, since KILL should always end the process (unlike SIGTERM, which might not). I don't know if the community wants to be extra cautious and put a timeout on the join, and, if the timeout expires, abnormally end the whole process with an exception because the OS did something unexpected. Abending would be better than hanging forever.
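A minimal sketch of the two options being weighed here, assuming the processor is a multiprocessing.Process (the helper name and the 5-second bound are made up for illustration):

```python
import os
import signal
from multiprocessing import Process
from typing import Optional


def kill_and_reap(process: Process, join_timeout: Optional[float] = 5.0) -> None:
    """Send SIGKILL and reap the child so it doesn't linger as a zombie.

    join_timeout=None is the "trust SIGKILL" option (unbounded join); a number
    is the extra-cautious option that fails loudly instead of hanging forever
    if the OS does something unexpected.
    """
    if process.pid is None:
        return
    os.kill(process.pid, signal.SIGKILL)  # SIGKILL cannot be caught or ignored
    process.join(join_timeout)            # join() also reaps the child for us
    if process.is_alive():
        raise RuntimeError(f"Could not reap PID {process.pid} after SIGKILL")
```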
That's one of the issues I wanted to bring up on the PR... As you say, SIGKILL should always end the process, so an unbounded join should be safe in principle. The other key aspect is thinking about how we want that to be handled. Should we halt it all with an exception? Should we just cope with the (hopefully) few zombies that could be spawned and just show something in the log? I'm going to open a PR right now to try and move this discussion there. That way we can iterate over the available choices and decide what's best. Thanks a ton for the input!
It does look good. I think it's the simplest but also the most efficient solution.
Apache Airflow version
2.1.4
What happened
The problem also exists in the latest version of the Airflow code, but I experienced it in 2.1.4.
This is the root cause of the problems experienced in issue #13542.
I'll provide a stack trace below. The problem is in the code of airflow/dag_processing/processor.py (and manager.py): all poll() and recv() calls on the multiprocessing communication channels need to be wrapped in exception handlers that handle OSError("handle is closed"). If one looks at the Python multiprocessing source code, it throws this exception when the channel's handle has been closed.
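This is easy to confirm outside Airflow with a minimal standalone snippet:

```python
from multiprocessing import Pipe

parent_conn, child_conn = Pipe()
parent_conn.close()

try:
    parent_conn.poll()   # recv() behaves the same way on a closed end
except OSError as err:
    print(err)           # -> handle is closed
```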
This occurs in Airflow when a DAG File Processor has been killed or terminated; the Airflow code closes the communication channel when it is killing or terminating a DAG File Processor process (for example, when a dag_file_processor_timeout occurs). This killing or terminating happens asynchronously (in another process) from the process calling poll() or recv() on the communication channel. This is why the exception needs to be handled: a pre-check of the handle being open is not good enough, because the other process doing the kill or terminate may close the handle between your pre-check and the actual poll() or recv() call (a race condition).
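A rough sketch of the kind of guard this implies (a hypothetical helper, not the actual fix): treat a channel that was closed underneath us as "nothing to read".

```python
import logging
from multiprocessing.connection import Connection

log = logging.getLogger(__name__)


def safe_poll(channel: Connection) -> bool:
    """Poll a processor channel, tolerating it being closed concurrently."""
    try:
        return channel.poll()
    except OSError as err:
        # Raised when the handle was already closed, e.g. because the DAG file
        # processor was killed after a timeout while we were about to poll.
        log.warning("Processor channel closed while polling: %s", err)
        return False
```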
What you expected to happen
Here is the stack trace of the occurrence I saw:
This corresponded in time to the following log entries:
You can see that when this exception occurred, there was a hang in the scheduler for almost 4 minutes, no scheduling loops, and no scheduler_job heartbeats.
This hang probably also caused the stuck queued jobs that issue #13542 describes.
How to reproduce
This is hard to reproduce because it is a race condition. But you might be able to reproduce it by putting top-level code in a DAG file that calls sleep, so that the file takes longer to parse than the core dag_file_processor_timeout setting allows. That would cause the parsing processes to be terminated, creating the conditions for this bug to occur. Something along the lines of the sketch below might do it.
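A hypothetical DAG file for triggering the timeout path (the sleep just has to exceed [core] dag_file_processor_timeout, which defaults to 50 seconds):

```python
# slow_to_parse.py - the top-level sleep makes parsing exceed the processor
# timeout, so the DAG file processor gets killed and its channel closed.
import time
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

time.sleep(120)  # longer than [core] dag_file_processor_timeout

with DAG(
    dag_id="slow_to_parse",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    DummyOperator(task_id="noop")
```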
Operating System
NAME="Ubuntu" VERSION="18.04.6 LTS (Bionic Beaver)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 18.04.6 LTS" VERSION_ID="18.04" HOME_URL="https://www.ubuntu.com/" SUPPORT_URL="https://help.ubuntu.com/" BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" VERSION_CODENAME=bionic UBUNTU_CODENAME=bionic
Versions of Apache Airflow Providers
Not relevant, this is a core dag_processing issue.
Deployment
Composer
Deployment details
"composer-1.17.6-airflow-2.1.4"
In order to isolate the scheduler on a separate machine, so as to avoid interference from other processes such as airflow-workers running alongside it, we created an additional node pool for the scheduler and applied these k8s patches to move it there.
New node pool definition:
patch.sh
composer-fluentd-daemon-patch.yaml
airflow-scheduler-patch.yaml
Anything else
On the below checkbox about submitting a PR: I could submit one, but it would be untested code; I don't really have the environment set up to test the patch.
Are you willing to submit PR?
Code of Conduct