Apache Airflow
Overview
Apache Airflow is a platform for programmatically authoring,
scheduling, and monitoring workflows. It is completely open-source and is
especially useful for architecting complex data pipelines. It is written in
Python, so you can interface with any third-party Python API or
database to extract, transform, or load your data into its final destination. It
was created to solve the issues that come with long-running cron tasks that
execute hefty scripts.
With Airflow, workflows are architected and expressed as DAGs, with
each step of the DAG defined as a specific Task. It is designed with the
belief that all ETL (Extract, Transform, Load) data processing is best
expressed as code, and as such it is a code-first platform that lets you
iterate on your workflows quickly and efficiently. As a result of this
code-first design philosophy, Airflow allows for a degree of customizability
and extensibility that other ETL tools do not support.
Core Concepts
DAG
DAG stands for "Directed Acyclic Graph". Each DAG represents a
collection of all the tasks you want to run and is organized to show
relationships between tasks directly in the Airflow UI. DAGs are defined
this way for the following reasons (a minimal example follows the list):
1. Directed: If multiple tasks exist, each must have at least one
defined upstream or downstream task.
2. Acyclic: Tasks cannot depend on themselves, either directly or
through a cycle of other tasks. This is to avoid creating infinite loops.
3. Graph: All tasks are laid out in a clear structure, with processes
occurring at clear points and set relationships to other tasks.
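As a sketch only, the DAG below shows these properties in code. The
dag_id, schedule, and task names are illustrative, not taken from this
project, and the import paths assume an Airflow 1.x install (to match the
REST API plugin used later in this document).

# Illustrative DAG: three tasks joined by directed, acyclic dependencies.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id="example_etl",                # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
)

extract = DummyOperator(task_id="extract", dag=dag)
transform = DummyOperator(task_id="transform", dag=dag)
load = DummyOperator(task_id="load", dag=dag)

# Directed edges run one way (extract -> transform -> load) and never
# loop back, so the graph stays acyclic.
extract >> transform >> load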
Tasks
Tasks are the nodes of a defined DAG. Each task is a visual
representation of one step of work in the workflow, with the actual work
it performs being defined by an Operator.
Operators
Operators in Airflow determine the actual work that gets done. They define
a single task, or one node of a DAG. DAGs make sure that operators get
scheduled and run in a certain order, while operators define the work that
must be done at each step of the process.
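As a rough sketch, the snippet below pairs two common operators with
placeholder work (the echo command and the _transform callable are
illustrative, not from this project; import paths again assume Airflow 1.x).

# Operators define the actual work carried out by each task.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id="example_operators",          # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
)

def _transform():
    # Placeholder for the real transform logic.
    print("transforming extracted data")

extract = BashOperator(
    task_id="extract",
    bash_command="echo 'pulling rows from the source'",  # placeholder command
    dag=dag,
)

transform = PythonOperator(
    task_id="transform",
    python_callable=_transform,
    dag=dag,
)

# The DAG orders the tasks; each operator defines what its task does.
extract >> transform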
User Interface
The Airflow UI makes it easy to monitor and troubleshoot your data
pipelines.
API Endpoints
The examples below target the main (CASHE) server:
HOST: 191.132.135.205
PORT: 8989
list_dags
List all the DAGs.
http://191.132.135.205:8989/admin/rest_api/api?api=list_dags
This endpoint returns the list of DAGs as a JSON document.
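As a minimal sketch, the call below exercises this endpoint with Python's
requests library; it assumes the CASHE server above is reachable and the
REST API plugin is installed.

# Call the list_dags endpoint and print the parsed JSON response.
import requests

BASE_URL = "http://191.132.135.205:8989/admin/rest_api/api"

response = requests.get(BASE_URL, params={"api": "list_dags"})
response.raise_for_status()

print(response.json())   # the plugin answers with a JSON document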
trigger_dag
Triggers a DAG run.
http://191.132.135.205:8989/admin/rest_api/api?api=trigger_dag&dag_id=test_id
pause
Pauses a DAG.
http://191.132.135.205:8989/admin/rest_api/api?api=pause&dag_id=test_id
unpause
Resumes a paused DAG.
http://191.132.135.205:8989/admin/rest_api/api?api=unpause&dag_id=test_id
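Since trigger_dag, pause, and unpause differ only in the api and dag_id
query parameters, one small helper covers all three. This is a sketch under
the same assumptions as above; test_id is the sample dag_id from the URLs.

# Helper for the dag_id-based endpoints of the REST API plugin.
import requests

BASE_URL = "http://191.132.135.205:8989/admin/rest_api/api"

def call_dag_endpoint(api, dag_id):
    response = requests.get(BASE_URL, params={"api": api, "dag_id": dag_id})
    response.raise_for_status()
    return response.json()

call_dag_endpoint("trigger_dag", "test_id")   # start a run of test_id
call_dag_endpoint("pause", "test_id")         # stop scheduling new runs
call_dag_endpoint("unpause", "test_id")       # resume scheduling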
GOLANG API
APIs were written in Go using the External Apache Airflow API Plugin.
When an external Airflow API endpoint is called, the GoLang API parses
the JSON output and sends the response to the UI/front-end team.
For example: when the list_dags API is called, the JSON output for that
API is parsed and the value corresponding to the key "output" is sent to
the UI team.
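The Go source itself is not shown here; purely as an illustration of the
parsing step described above, the Python sketch below fetches the plugin's
response and forwards only the value stored under the "output" key.

# Illustration only (the real service is written in Go): parse the
# list_dags response and keep just the "output" value for the UI.
import requests

BASE_URL = "http://191.132.135.205:8989/admin/rest_api/api"

def list_dags_output():
    response = requests.get(BASE_URL, params={"api": "list_dags"})
    response.raise_for_status()
    payload = response.json()
    return payload["output"]   # this value is what the UI team receives

print(list_dags_output())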