airflow-hdinsight

Build status Documentation Status PyPi Version Supported versions Coverage status PyPi downloads

A set of airflow hooks, operators and sensors to allow airflow DAGs to operate with the Azure HDInsight platform, for cluster creation and monitoring as well as job submission and monitoring. Also included are some enhanced Azure Blob and Data Lake sensors.

This project is both an amalgamation and enhancement of existing open source airflow extensions, plus new extensions to solve the problem.

Installation

pip install airflow-hdinsight

Extensions

airflowhdi

Type Name What it does
Hook AzureHDInsightHook Uses the HDInsightManagementClient from the HDInsight SDK for Python to expose several operations on an HDInsight cluster - get cluster state, create, delete.
Operator AzureHDInsightCreateClusterOperator Use the AzureHDInsightHook to create a cluster
Operator AzureHDInsightDeleteClusterOperator Use the AzureHDInsightHook to delete a cluster
Operator ConnectedAzureHDInsightCreateClusterOperator Extends the AzureHDInsightCreateClusterOperator to allow fetching of the security credentials and cluster creation spec from an airflow connection
Operator AzureHDInsightSshOperator Uses the AzureHDInsightHook and SSHHook to run an SSH command on the master node of the given HDInsight cluster
Sensor AzureHDInsightClusterSensor A sensor to monitor the provisioning state or running state (can switch between either mode) of a given HDInsight cluster. Uses the AzureHDInsightHook.
Sensor WasbWildcardPrefixSensor An enhancement to the WasbPrefixSensor to support sensing on a wildcard prefix
Sensor AzureDataLakeStorageGen1WebHdfsSensor Uses airflow’s AzureDataLakeHook to sense a glob path (which implicitly supports wildcards) on ADLS Gen 1. ADLS Gen 2 is not yet supported in airflow.

airflowlivy

Type Name What it does
Hook LivyBatchHook Uses the Apache Livy Batch API to submit spark jobs to a livy server, get batch state, verify batch state by quering either the spark history server or yarn resource manager, spill the logs of the spark job post completion, etc.
Operator LivyBatchOperator Uses the LivyBatchHook to submit a spark job to a livy server
Sensor LivyBatchSensor Uses the LivyBatchHook to sense termination and verify completion, spill logs of a spark job submitted earlier to a livy server

Origins of the HDinsight operator work

The HDInsight operator work is loosely inspired from alikemalocalan/airflow-hdinsight-operators, however that has a huge number of defects, as to why it was never accepted to be merged into airflow in the first place. This project solves all of those issues and more, and is frankly a full rewrite.

Origins of the livy work

The livy batch operator is based on the work by panovvv’s project airfllow-livy-operators. It does some necessary changes:

  • Seperates the operator into a hook (LivyBatchHook), an operator (LivyBatchOperator) and a sensor (LivyBatchSensor)
  • Adds additional verification and log spilling to the sensor (the original sensor does not)
  • Removes additional verification and log spilling from the operator - hence alllowing a async pattern akin to the EMR add step operator and step sensor.
  • Creates livy, spark and YARN airflow connections dynamically from an Azure HDInsight connection
  • Returns the batch ID from the operator so that a sensor can use it after being passed through XCom
  • Changes logging to LoggingMixin calls
  • Allows templatization of fields

State of airflow livy operators in the wild..

As it stands today (June of 2020), there are multiple airflow livy operator projects out there:

Indices and tables