Example tutorial_objectstorage DAG failing #49149

Closed
1 of 2 tasks
vatsrahul1001 opened this issue Apr 12, 2025 · 10 comments · Fixed by #50828
Assignees
Labels
affected_version:3.0 Issues Reported for 3.0 area:core kind:bug This is a clearly a bug priority:medium Bug that should be fixed before next release but would not block a release

Comments

@vatsrahul1001
Contributor

Apache Airflow version

3.0.0

If "Other Airflow 2 version" selected, which one?

No response

What happened?

The example DAG is failing with the error below.

DAG CODE: https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/example_dags/tutorial_objectstorage.py

KeyError: "Only a column name can be used for the key in a dtype mappings argument. 'fmisid' not found in columns."

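For context, pandas raises this KeyError when astype() is given a dtype mapping whose key is not one of the DataFrame's columns. A minimal sketch of the presumed failure mode (an assumption: the FMI response came back empty, so the expected column was never created):

import pandas as pd

# An empty API response produces a DataFrame without the "fmisid" column,
# so the dtype mapping used by the example DAG no longer applies.
df = pd.DataFrame({})
df = df.astype({"fmisid": "int32"})  # raises the KeyError quoted above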

What you think should happen instead?

No response

How to reproduce

  1. Run breeze start-airflow --executor CeleryExecutor --backend postgres --load-example-dag
  2. Run tutorial_objectstorage

Operating System

Linux

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@vatsrahul1001 vatsrahul1001 added affected_version:3.0.0rc area:core kind:bug This is a clearly a bug priority:high High priority bug that should be patched quickly but does not require immediate new release labels Apr 12, 2025
@vatsrahul1001
Contributor Author

vatsrahul1001 commented Apr 12, 2025

Looks like the API (https://opendata.fmi.fi/timeseries) used in the example DAG is returning 400. We might need to use something else.


Changing priority to medium, as this looks like a DAG issue.

@vatsrahul1001 vatsrahul1001 added priority:medium Bug that should be fixed before next release but would not block a release and removed priority:high High priority bug that should be patched quickly but does not require immediate new release labels Apr 12, 2025
@vatsrahul1001
Contributor Author

vatsrahul1001 commented Apr 12, 2025

OK, so I have updated the DAG with a new API to test this out, and I see the issue below.

  1. The analyze task is failing with the error below:

[2025-04-12, 09:09:00] ERROR - Task failed with exception: source="task"
Error: ParamValidationError: Parameter validation failed:
Invalid bucket name "aws_default@airflow-tutorial-data": Bucket name must match the regex "^[a-zA-Z0-9.-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).:(s3|s3-object-lambda):[a-z-0-9]:[0-9]{12}:accesspoint[/:][a-zA-Z0-9-.]{1,63}$|^arn:(aws).*:s3-outposts:[a-z-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9-]{1,63}$"

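This validation error is raised by botocore's client-side bucket-name check. A minimal, purely illustrative repro (assumption: boto3 installed and a region configured; the bucket-name check is client-side parameter validation):

import boto3

s3 = boto3.client("s3", region_name="us-east-1")
# The conn-id prefix from the ObjectStoragePath string ends up being treated
# as part of the bucket name, which fails botocore's bucket-name regex.
s3.head_bucket(Bucket="aws_default@airflow-tutorial-data")  # ParamValidationError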

My new DAG CODE

from __future__ import annotations

# [START tutorial]
# [START import_module]
import pendulum
import requests

from airflow.sdk import ObjectStoragePath, dag, task

# [END import_module]

API = "https://air-quality-api.open-meteo.com/v1/air-quality"

aq_fields = [
    "pm10",
    "pm2_5",
    "carbon_monoxide",
    "nitrogen_dioxide",
    "sulphur_dioxide",
    "ozone",
    "european_aqi",
    "us_aqi",
]

# [START create_object_storage_path]
base = ObjectStoragePath("s3://aws_default@airflow-tutorial-data/")
# [END create_object_storage_path]


@dag(
    schedule=None,
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    tags=["example"],
)
def tutorial_objectstorage():
    """
    ### Object Storage Tutorial Documentation
    This is a tutorial DAG to showcase the usage of the Object Storage API.
    Documentation that goes along with the Airflow Object Storage tutorial is
    located
    [here](https://airflow.apache.org/docs/apache-airflow/stable/tutorial/objectstorage.html)
    """

    # [START get_air_quality_data]
    @task
    def get_air_quality_data(**kwargs) -> ObjectStoragePath:
        """
        #### Get Air Quality Data
        This task gets air quality data from the Open-Meteo air quality API.
        The data is saved as parquet.
        """
        import pandas as pd

        logical_date = kwargs["logical_date"]
        start_time = kwargs["data_interval_start"]

        latitude = 28.6139
        longitude = 77.2090

        params = {
            "latitude": latitude,
            "longitude": longitude,
            "hourly": ",".join(aq_fields),
            "timezone": "Asia/Kolkata",
            "start_date": start_time.format("YYYY-MM-DD"),
            "end_date": logical_date.format("YYYY-MM-DD"),
        }

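        # request hourly air quality data from the Open-Meteo API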
        response = requests.get(API, params=params)
        response.raise_for_status()

        data = response.json()
        hourly_data = data.get("hourly", {})

        df = pd.DataFrame(hourly_data)

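        # parse the ISO timestamps returned by the API into pandas datetimes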
        df["time"] = pd.to_datetime(df["time"])

        base.mkdir(exist_ok=True)

        formatted_date = logical_date.format("YYYYMMDD")
        path = base / f"air_quality_{formatted_date}.parquet"

        with path.open("wb") as file:
            df.to_parquet(file)
        return path

    # [END get_air_quality_data]

    # [START analyze]
    @task
    def analyze(path: ObjectStoragePath, **kwargs):
        """
        #### Analyze
        This task analyzes the air quality data, prints the results
        """
        import duckdb

        conn = duckdb.connect(database=":memory:")
        conn.register_filesystem(path.fs)
        conn.execute(f"CREATE OR REPLACE TABLE airquality_urban AS SELECT * FROM read_parquet('{path}')")

        df2 = conn.execute("SELECT * FROM airquality_urban").fetchdf()

        print(df2.head())

    # [END analyze]

    # [START main_flow]
    obj_path = get_air_quality_data()
    analyze(obj_path)
    # [END main_flow]


# [START dag_invocation]
tutorial_objectstorage()
# [END dag_invocation]
# [END tutorial]

@vatsrahul1001
Contributor Author

I see the issue here: it is failing at conn.execute(f"CREATE OR REPLACE TABLE airquality_urban AS SELECT * FROM read_parquet('{path}')") because the path in the XCom below contains aws_default, which is not part of the actual bucket name.

If I pass the exact path, conn.execute(f"CREATE OR REPLACE TABLE airquality_urban AS SELECT * FROM read_parquet('s3://airflow-tutorial-data/air_quality_20250412.parquet')"), it works.

Maybe we should not add connection details to the path. Also, the way the XCom is displayed could be better:

{'_data__': {'path': 's3://aws_default@airflow-tutorial-data/air_quality_20250412.parquet', 'kwargs': {}, 'conn_id': 'aws_default'}, '__version__': 1, '__classname__': 'airflow.sdk.io.path.ObjectStoragePath'}

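A possible workaround sketch for the analyze task, not the committed fix in #50828: rebuild a plain s3:// URI without the conn-id userinfo before handing it to DuckDB. It assumes ObjectStoragePath exposes .bucket and .key properties (as airflow.io.path does):

import duckdb

from airflow.sdk import ObjectStoragePath

path = ObjectStoragePath("s3://aws_default@airflow-tutorial-data/air_quality_20250412.parquet")
plain_uri = f"s3://{path.bucket}/{path.key}"  # drops the "aws_default@" userinfo

conn = duckdb.connect(database=":memory:")
conn.register_filesystem(path.fs)  # reuse the fsspec filesystem backing the path
conn.execute(f"CREATE OR REPLACE TABLE airquality_urban AS SELECT * FROM read_parquet('{plain_uri}')")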

@vatsrahul1001 vatsrahul1001 added priority:high High priority bug that should be patched quickly but does not require immediate new release and removed priority:medium Bug that should be fixed before next release but would not block a release labels Apr 12, 2025
@vatsrahul1001
Contributor Author

cc: @kaxil

@vatsrahul1001
Contributor Author

The same behaviour is observed in 2.10.5 as well, so changing the priority.

I see the issue here: it is failing at conn.execute(f"CREATE OR REPLACE TABLE airquality_urban AS SELECT * FROM read_parquet('{path}')") because the path in the XCom contains aws_default, which is not part of the actual bucket name.

@vatsrahul1001 vatsrahul1001 added priority:medium Bug that should be fixed before next release but would not block a release and removed priority:high High priority bug that should be patched quickly but does not require immediate new release labels Apr 14, 2025
@amoghrajesh
Contributor

@vatsrahul1001 an update: I discussed this with the team, and I will try to take a stab at it tomorrow and squeeze it into the RC timeline. But I am talking about the XCom issue. Could you create an issue for that?

@kaxil
Member

kaxil commented Apr 24, 2025

Is this still an issue?

@vatsrahul1001
Contributor Author

I think yes. Let me check on this.

@HuseynA28

Hi @vatsrahul1001, did you manage to solve the issue? I'm getting the same error. Do you know what might be causing it?

@vatsrahul1001
Contributor Author

Closing, fixed in #50828.

5 participants
@kaxil @amoghrajesh @vatsrahul1001 @HuseynA28 and others