OpenTelemetry traces implementation cleanup #49180

xBis7 · 2025-04-13T11:03:26Z

This is a follow-up PR for #43941. It's also related to #43789.

The patch consists of the following changes:

The active_spans dict has been using ti.key as the key for task instance spans. ti.key is replaced by ti.id. The change is based on a comment on #43941 (Provide an alternative OpenTelemetry implementation for traces that follows standard otel practices). That comment refers to try_id, but try_id was removed in favor of id by PR #48749 (Ensure that TI.id is unique per try.).
The previous PR added some integration tests that weren't running on the CI. The patch fixes that. The tests use the redis integration. I've verified that the provider tests that use the redis integration don't run on this new CI step.
We are generating spans for a number of Airflow's internal methods. The information from these spans might be relevant to developers but not to end users, so I added a config flag to turn these traces on or off.
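A minimal sketch of how a flag can gate spans for internal methods. It is not the code from this PR: the `settings` dict, the `debug_traces` key, and the `start_span` helper are hypothetical stand-ins for the real config option and tracer, and only the decorator name `add_debug_span` is borrowed from the commit list.

```python
import functools
from contextlib import contextmanager

# Hypothetical stand-in for the config flag; the real option name is not shown here.
settings = {"debug_traces": False}

# Hypothetical stand-in for an exporter: records the names of started spans.
recorded_spans = []


@contextmanager
def start_span(name):
    """Pretend to open a span by recording its name."""
    recorded_spans.append(name)
    yield name


def add_debug_span(func):
    """Wrap func in a span only when debug traces are enabled."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if not settings["debug_traces"]:
            # Flag is off: call through without creating a span.
            return func(*args, **kwargs)
        with start_span(func.__qualname__):
            return func(*args, **kwargs)

    return wrapper


@add_debug_span
def heartbeat():
    return "ok"
```

The flag is read on every call in this sketch, so toggling it takes effect without re-decorating the function.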



xBis7 · 2025-04-15T17:50:37Z

That's probably because I updated it using a merge instead of a rebase, to avoid rewriting the commit history.

xBis7 · 2025-04-16T15:11:37Z

@potiuk I've added a redis integration on the CI so that it runs my new otel integration tests. I've noticed that roughly 1 in 10 times it freezes right before the tests start, and the entire step times out after 30 mins. I thought it had to do with my tests, so I added a timeout to make them fail fast if that's the case, but something else seems to be causing it. Any insight or idea why this might be happening?

https://github.com/apache/airflow/actions/runs/14476740125/job/40604515009?pr=49180#logs

Looks random

https://github.com/xBis7/airflow/actions/runs/14497144925/job/40668243402#step:5:4279

xBis7 · 2025-04-24T14:49:13Z

@ashb @potiuk Can you take a look at this PR?

airflow-core/src/airflow/config_templates/config.yml

ashb · 2025-04-25T09:48:03Z

airflow-core/src/airflow/executors/workloads.py

@@ -69,7 +69,6 @@ class TaskInstance(BaseModel):

    parent_context_carrier: dict | None = None
    context_carrier: dict | None = None
-    queued_dttm: datetime | None = None


This looks like an unintended change or a bad rebase?

No, this is intentional. I added it in the previous PR in order to use it as the start_time for the task span in the base_executor.py.

https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/executors/base_executor.py#L446

When workloads.TaskInstance is initialized, queue time isn't available and the field ends up None. I removed it because it's redundant.

airflow-core/src/airflow/jobs/scheduler_job_runner.py

ashb · 2025-04-25T09:51:36Z

airflow-core/src/airflow/jobs/scheduler_job_runner.py

+        for prefixed_key, span in self.active_spans.get_all().items():
+            # Use partition to split on the first occurrence of ':'.
+            prefix, sep, key = prefixed_key.partition(":")


I wonder if instead of prefixing the string we should store the keys as tuples:

("ti", str(ti.id)), ("dr", dr.id) etc?

Not much in it but maybe it makes things slightly clearer? (Though we'd have to be more careful of the type of the id we put in -- because ("ti", UUID(...)) wouldn't match ("ti", "the_uuid").)

Initially, I thought of using nested dictionaries

    active_spans: {
        "ti": { str(ti.id): span, str(ti.id): span, str(ti.id): span, ... },
        "dr": { dr.id: span, dr.id: span, ... },
    }

I went with the string prefix, but if we change the dag_run key type to an integer then the nested approach might make more sense than the others.

> Though we'd have to be more careful of the type of the id we put in -- cos ("ti", UUID(...)) wouldn't match ("ti", "the_uuid")

The ti.id is always a UUID str.

https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/models/taskinstance.py#L717

https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/models/taskinstance.py#L555-L557
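The two key schemes discussed in this thread can be compared with a small sketch. The helper names (`prefixed_key`, `split_prefixed_key`, `tuple_key`) and the string span values are hypothetical and exist only for illustration; the sketch also shows the type pitfall raised above, where a tuple holding a UUID object does not match a tuple holding its string form.

```python
from uuid import uuid4


def prefixed_key(kind, id_):
    """Build a string key such as 'ti:<id>'."""
    return f"{kind}:{id_}"


def split_prefixed_key(prefixed):
    """Split on the first ':' only, so the id part may itself contain ':'."""
    prefix, _sep, key = prefixed.partition(":")
    return prefix, key


def tuple_key(kind, id_):
    """Build a tuple key, normalizing the id to str so lookups are type-stable."""
    return (kind, str(id_))


ti_id = uuid4()  # a UUID object, as an id might look before conversion to str
dr_id = 7        # an integer id

# Scheme 1: prefixed strings. The integer id is stringified by the f-string.
spans_by_prefixed_key = {
    prefixed_key("ti", ti_id): "ti-span",
    prefixed_key("dr", dr_id): "dr-span",
}

# Scheme 2: tuples with ids normalized to str.
spans_by_tuple_key = {
    tuple_key("ti", ti_id): "ti-span",
    tuple_key("dr", dr_id): "dr-span",
}

# A tuple built from the raw UUID object misses, because UUID(...) != str(UUID(...)).
raw_uuid_lookup = spans_by_tuple_key.get(("ti", ti_id))
normalized_lookup = spans_by_tuple_key.get(tuple_key("ti", ti_id))
```

Either scheme works as long as every writer and reader normalizes the id the same way; the tuple form just makes the kind and the id separate fields instead of requiring a split.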

airflow-core/src/airflow/jobs/scheduler_job_runner.py

airflow-core/src/airflow/traces/tracer.py

airflow-core/tests/integration/otel/test_otel.py

ashb · 2025-04-25T09:56:30Z

airflow-core/tests/integration/otel/test_otel.py

@@ -1287,6 +1291,7 @@ def test_scheduler_exits_forcefully_in_the_middle_of_the_first_task(
            # Dag run should have succeeded. Test the spans in the output.
            check_spans_without_continuance(output=out, dag=dag, is_recreated=True, check_t1_sub_spans=False)

+    @pytest.mark.xfail(reason="Tests with a control file are flaky when running on the remote CI.")


What is the plan with these long term? Leaving them as xfail adds very little, so is there any point in keeping them?

The problem is that they always pass locally, and pretty fast: the entire class takes around 6 mins to run. On the CI, however, these tests are flaky and I don't see any errors. They mostly freeze and time out.

We shouldn't remove them and I would like to have them running on the ci to make sure that future changes don't break the otel implementation. I'm looking at refactoring the test class to see if I can get rid of the flakiness.

I've removed the xfail annotations. I refactored the tests a bit and I was able to run the tests 5 times in a row successfully.

https://github.com/apache/airflow/actions/runs/14759772893/job/41437793893?pr=49180

https://github.com/xBis7/airflow/actions/runs/14759759875/job/41437846694

https://github.com/xBis7/airflow/actions/runs/14760507799/job/41468986413

https://github.com/xBis7/airflow/actions/runs/14760511337/job/41453655641

https://github.com/xBis7/airflow/actions/runs/14760515145/job/41473346959
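As a rough illustration of the two ideas mentioned in this PR for control-file tests (fail fast with a timeout instead of hanging, and clean up the control file on failure), here is a minimal sketch. It is not the code in test_otel.py: `wait_for`, `run_with_control_file`, and the `.control` suffix are hypothetical names chosen for the example.

```python
import os
import tempfile
import time


def wait_for(condition, timeout=5.0, interval=0.05):
    """Poll condition() until it returns True or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False  # timed out: the caller can fail fast instead of hanging


def read_text(path):
    """Read the whole file, closing the handle afterwards."""
    with open(path) as handle:
        return handle.read()


def run_with_control_file(body):
    """Create a temporary control file, run body(path), and always remove the file."""
    fd, path = tempfile.mkstemp(suffix=".control")
    os.close(fd)
    try:
        return body(path)
    finally:
        # Runs on success and on failure, so a failed test leaves nothing behind.
        if os.path.exists(path):
            os.remove(path)
```

A test body would write a marker such as "continue" to the file and call `wait_for(lambda: read_text(path) == "continue", timeout=...)` on the other side, asserting on the returned boolean.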

airflow-core/tests/unit/core/test_otel_tracer.py

pyproject.toml


xBis7 · 2025-05-01T14:22:05Z

@ashb I think I fixed the flaky tests. I've run the CI multiple times without failures.

The only pending item is to figure out what to do with the active_spans prefixing for TIs and DRs. I made the changes but left the dict format as it was.

xBis7 · 2025-05-07T16:13:29Z

The failure for the gremlin provider seems unrelated. The new CI tests passed.

xBis7 · 2025-05-19T15:00:26Z

@ashb Can you help get this PR merged?

xBis7 added 13 commits April 13, 2025 13:39

replace ti.key with ti.try_id in active_spans

786a7da

add otel to integrations

e234bbf

modify pytest_plugin.py

92682be

add redis integration to global_constants.py

33c28f7

fix ci failures

a3ef753

fix test_edge_command.py

4fbda4b

fix test_edge_executor.py

9e71f2f

increase test timeout

d21af67

increase test timeout

5a4c246

mark flaky tests with xfail

d65196f

add config for disabling spans from internal operations + disable decorated function spans

a287246

disable all debugging traces

9b5d168

keep redis env variables

81d9d85

xBis7 requested review from potiuk, ashb, jedcunningham, gopidesupavan, jscheffl, dstandish, hussein-awala, ephraimbuddy, XD-DENG, o-nikolas and pierrejeambrun as code owners April 13, 2025 11:03

boring-cyborg bot added area:DAG-processing area:dev-tools area:Executors-core LocalExecutor & SequentialExecutor area:providers area:Scheduler including HA (high availability) scheduler area:Triggerer labels Apr 13, 2025

xBis7 added 3 commits April 15, 2025 20:51

Merge remote-tracking branch 'origin/main' into otel_cleanup

5b355e0

Merge remote-tracking branch 'origin/main' into otel_cleanup

221ab47

Merge remote-tracking branch 'origin/main' into otel_cleanup

f5ec1f4

xBis7 added 2 commits April 17, 2025 17:49

Merge remote-tracking branch 'origin/main' into otel_cleanup

a78cf91

merge with main and resolve conflicts

e08543b

xBis7 requested a review from ashb April 24, 2025 15:18

ashb reviewed Apr 25, 2025


xBis7 added 8 commits April 28, 2025 15:22

Merge remote-tracking branch 'origin/main' into otel_cleanup

6b672d6

update config target version

35f93bc

convert leftover ti.id to str

248af6b

simplify TaskInstance query

d90d29e

replace dag_run.run_id with dag_run.id

74ebb42

remove conf.add_section from test_otel_tracer.py

3fee0b0

increase timeout for each test in test_otel.py

39abf69

replace class annotation with method annotation for a timeout in test_otel.py

e511c16

xBis7 requested a review from amoghrajesh as a code owner April 29, 2025 09:46

xBis7 added 4 commits April 29, 2025 17:22

rename add_span to add_debug_span

4769ac9

initialize db only once at the start, in test_otel.py

afcc997

Merge remote-tracking branch 'origin/main' into otel_cleanup

71280ce

cleanup control_file in case of failure

e5e91d0

xBis7 requested a review from ashb May 6, 2025 14:56

merge with main and resolve conflicts

e27feb1

xBis7 added 2 commits May 8, 2025 17:34

convert log info to debug

2c86006

Merge branch 'main' into otel_cleanup

c56a02a
