Content-Length: 614194 | pFad | https://github.com/apache/airflow/pull/49180

4F OpenTelemetry traces implementation cleanup by xBis7 · Pull Request #49180 · apache/airflow · GitHub
Skip to content

OpenTelemetry traces implementation cleanup #49180

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 39 commits into
base: main
Choose a base branch
from

Conversation

xBis7
Copy link
Contributor

@xBis7 xBis7 commented Apr 13, 2025

This is a follow up PR for #43941. It's also related to #43789.

The patch is composed by the following changes

  1. The active_spans dict has been using the ti.key as a key for task instance spans. ti.key will be replaced by ti.id. The change is based on this comment Provide an alternative OpenTelemetry implementation for traces that follows standard otel practices #43941 (comment). The comment is referring to try_id but it was removed by PR Ensure that TI.id is unique per try. #48749 in favor of id.

  2. The previous PR added some integration tests which aren't running on the CI. The patch fixes that. The tests are using the redis integration. I've verified that the providers tests with the redis integration, don't run on this new CI step.

  3. We are generating spans for a bunch of airflow's internal methods. The information that comes from the spans, might be relevant to developers but not to end users. Added a config flag to turn on/off the traces.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.

@xBis7
Copy link
Contributor Author

xBis7 commented Apr 15, 2025

That's probably because I updated it using a merge instead of a rebase, to avoid rewriting the commit history.

@xBis7
Copy link
Contributor Author

xBis7 commented Apr 16, 2025

@potiuk I've added a redis integration on the ci so that it runs my new otel integration tests. I've noticed that 1/10 times, it freezes right before the tests start and the entire step times out after 30 mins. I thought that it had to do with my tests and I added a timeout to make it fail fast if that's the case, but it seems to be something else causing it. Any insight or any idea why this might be happening?

https://github.com/apache/airflow/actions/runs/14476740125/job/40604515009?pr=49180#logs

Looks random

https://github.com/xBis7/airflow/actions/runs/14497144925/job/40668243402#step:5:4279

@xBis7
Copy link
Contributor Author

xBis7 commented Apr 24, 2025

@ashb @potiuk Can you take a look at this PR?

@xBis7 xBis7 requested a review from ashb April 24, 2025 15:18
@@ -69,7 +69,6 @@ class TaskInstance(BaseModel):

parent_context_carrier: dict | None = None
context_carrier: dict | None = None
queued_dttm: datetime | None = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like an un-intended change/a bad rebase?

Copy link
Contributor Author

@xBis7 xBis7 Apr 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is intentional. I added it in the previous PR in order to use it as the start_time for the task span in the base_executor.py.

https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/executors/base_executor.py#L446

When workloads.TaskInstance is initialized, queue time isn't available and the field ends up None. I removed it because it's redundant.

Comment on lines +1063 to +1065
for prefixed_key, span in self.active_spans.get_all().items():
# Use partition to split on the first occurrence of ':'.
prefix, sep, key = prefixed_key.partition(":")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if instead of prefixing the string we should store the keys as tuples:

("ti", str(ti.id)), ("dr", dr.id) etc?

Not much in it but maybe it makes things slightly clearer? (Though we'd have to be more careful of the type of the id we put in -- cos ("ti", UUID(...)) wouldn't match ("ti", "the_uuid")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially, I thought of using nested dictionaries

active_spans: {
    "ti": {
          str(ti.id): span,
          str(ti.id): span,
          str(ti.id): span,
          ...
    },
    "dr": {
          dr.id: span,
          dr.id: span,
          ...
    },
}

I went with the string prefix but if we change the dag_run key type to an integer then this approach might make more sense over the rest.

Though we'd have to be more careful of the type of the id we put in -- cos ("ti", UUID(...)) wouldn't match ("ti", "the_uuid")

The ti.id is always a UUID str.

https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/models/taskinstance.py#L717

https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/models/taskinstance.py#L555-L557

@@ -1287,6 +1291,7 @@ def test_scheduler_exits_forcefully_in_the_middle_of_the_first_task(
# Dag run should have succeeded. Test the spans in the output.
check_spans_without_continuance(output=out, dag=dag, is_recreated=True, check_t1_sub_spans=False)

@pytest.mark.xfail(reason="Tests with a control file are flaky when running on the remote CI.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the plan with these long term? Having them left as xfail adds very little and is there any point keeping them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that they are always passing locally and pretty fast, it takes around 6 mins for the entire class to run, while these tests are flaky on the ci and I don't see any errors. They mostly freeze and timeout.

We shouldn't remove them and I would like to have them running on the ci to make sure that future changes don't break the otel implementation. I'm looking at refactoring the test class to see if I can get rid of the flakiness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xBis7 xBis7 requested a review from amoghrajesh as a code owner April 29, 2025 09:46
@xBis7
Copy link
Contributor Author

xBis7 commented May 1, 2025

@ashb I think I fixed the flaky tests. I've run the CI multiple times without failures.

The only pending item is to figure out what to do with the active_spans prefixing for TIs and DRs. I made the changes but left the dict format as it was.

@xBis7 xBis7 requested a review from ashb May 6, 2025 14:56
@xBis7
Copy link
Contributor Author

xBis7 commented May 7, 2025

The failure for the gremlin provider seems unrelated. The new CI tests passed.

@xBis7
Copy link
Contributor Author

xBis7 commented May 19, 2025

@ashb Can you help get this PR merged?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:DAG-processing area:dev-tools area:Executors-core LocalExecutor & SequentialExecutor area:providers area:Scheduler including HA (high availability) scheduler area:Triggerer provider:edge Edge Executor / Worker (AIP-69)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants








ApplySandwichStrip

pFad - (p)hone/(F)rame/(a)nonymizer/(d)eclutterfier!      Saves Data!


--- a PPN by Garber Painting Akron. With Image Size Reduction included!

Fetched URL: https://github.com/apache/airflow/pull/49180

Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy