Fix bad delete logic for dagruns #32684
Conversation
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
I spent all day debugging to figure out how our TIs disappeared from all DAGs at the same time, and why most of the DAGs were stuck (we're using depends_on_past). After much analysis of the log table and the RDS query history in Datadog, I found that the query that should have removed TIs from 2 dag runs removed over 2k TIs. Luckily I found this PR, which confirms my hypothesis. Thank you for this fix @dstandish! However, I wonder if we can improve our release notes by adding a severity level for each bug and its impact on the whole stack. For example, this bug is critical and could have a significant impact on the whole platform, yet it is listed in the middle of the bug list with a generic commit message.
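The failure mode described above can be sketched on a toy schema (illustrative only; `task_instance` keyed by `dag_id` and `run_id` mirrors Airflow's table, but this is not the actual code the PR changed):

```python
import sqlite3

# Illustrative sketch: a delete scoped only by dag_id wipes task instances
# from EVERY run of the DAG, while the intended delete targets one run.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (dag_id TEXT, run_id TEXT, task_id TEXT)")
conn.executemany("INSERT INTO task_instance VALUES (?, ?, ?)", [
    ("my_dag", "run_1", "t1"),
    ("my_dag", "run_1", "t2"),
    ("my_dag", "run_2", "t1"),   # a different run of the same DAG
    ("other_dag", "run_1", "t1"),
])

# Buggy filter: keyed on dag_id alone, it also matches run_2's TIs.
buggy = conn.execute(
    "SELECT count(*) FROM task_instance WHERE dag_id = 'my_dag'"
).fetchone()[0]

# Correct filter: scope the delete to the specific (dag_id, run_id) pair.
conn.execute(
    "DELETE FROM task_instance WHERE dag_id = ? AND run_id = ?",
    ("my_dag", "run_1"),
)
remaining = conn.execute("SELECT count(*) FROM task_instance").fetchone()[0]
print(buggy, remaining)  # buggy filter matches 3 rows; 2 rows survive the scoped delete
```

Any delete keyed on `dag_id` alone fans out to every run of that DAG, which matches the symptom above: over 2k TIs removed instead of 2 runs' worth.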
Yeah, this took me many hours to find :( I wonder if we should add a newsfragment?
And I think we have another similar report from a user who was affected by it: https://apache-airflow.slack.com/archives/CCQ7EGB1P/p1693805145064939
@hussein-awala @RNHTTR -> what is the way to recover from this when your task instances have been deleted? Would back-filling work, do you think? Or do you need to manually delete the DagRun?
Related: #34082
For the old DagRuns I did nothing because they are not important, but for the new ones that were blocked because of the bug, I deleted the emptied dag runs and their orphaned rows with:

```sql
DELETE FROM dag_run dr WHERE dr.dag_id IN (
    SELECT DISTINCT dr.dag_id
    FROM dag_run dr
    LEFT JOIN task_instance ti
        ON dr.dag_id = ti.dag_id AND dr.run_id = ti.run_id AND ti.state IS NOT NULL
    WHERE ti.task_id IS NULL AND dr.execution_date >= '2023-08-30 08:00:00'
) AND dr.execution_date >= '2023-08-30 08:00:00';

DELETE FROM task_instance ti WHERE ti.dag_id IN (
    SELECT DISTINCT dr.dag_id
    FROM dag_run dr
    LEFT JOIN task_instance ti
        ON dr.dag_id = ti.dag_id AND dr.run_id = ti.run_id AND ti.state IS NOT NULL
    WHERE ti.task_id IS NULL AND dr.execution_date >= '2023-08-30 08:00:00'
) AND ti.run_id IN (
    SELECT DISTINCT dr.run_id
    FROM dag_run dr
    LEFT JOIN task_instance ti
        ON dr.dag_id = ti.dag_id AND dr.run_id = ti.run_id AND ti.state IS NOT NULL
    WHERE ti.task_id IS NULL AND dr.execution_date >= '2023-08-30 08:00:00'
);
```

Then, to fix the issue completely without upgrading to 2.7.0, I cherry-picked the fix (https://github.com/leboncoin/airflow/tree/lbc/2.6.3-r1), built Airflow, and pushed it to our private PyPI, then used the patched version instead of the official one.
So likely deleting the DagRun and THEN back-filling should also work?
Yes, the TIs are already deleted from the DB, so you can just delete the empty dag runs and recreate them using the backfill command, or via the UI (trigger DAG w/ config, then just update the logical date).
@wolfier @RNHTTR