from __future__ import annotations

import logging
from datetime import datetime

from airflow.sdk import dag, task

log = logging.getLogger(__name__)


@dag(
    schedule=None,
    start_date=datetime(2021, 1, 1),
    catchup=False,
    tags=['example']
)
def tutorial_taskflow_api_virtualenv():
    @task.virtualenv(
        serializer='dill',
        system_site_packages=False,
        requirements=['funcsigs']
    )
    def extract():
        import json
        data_string = "{'1001': 301.27, '1002': 433.21, '1003': 502.22}"
        return json.loads(data_string)

What does serializer='dill' mean?
In this context (Airflow’s @task.virtualenv decorator), serializer='dill' means Airflow will use the dill library to serialize (pickle) the task’s inputs, outputs, and execution context when running the task in an isolated virtual environment.
Why serialization is needed here
@task.virtualenv runs your function in a separate Python process and virtualenv. To do that, Airflow must:
- Serialize the function's arguments (the function itself is handed over as source code)
- Send them to the virtualenv
- Execute the function
- Serialize the return value
- Send it back to the main Airflow process (XCom)
That serialization step is what serializer controls.
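The steps above can be sketched with the standard library alone. This is only an illustration of the round trip, not Airflow's actual implementation: it reuses the current interpreter instead of building a real virtualenv, it uses pickle where Airflow would use whichever serializer you configured, and the names CHILD_SCRIPT, run_isolated, and transform are made up for this example. In this sketch the function is embedded in the child script as source text, and only the arguments and the return value are serialized.

```python
import pickle
import subprocess
import sys
import tempfile
from pathlib import Path

# Script executed by the child interpreter (hypothetical, for illustration).
# It carries the function as source text, loads the pickled arguments from
# one file, and pickles the return value into another file.
CHILD_SCRIPT = '''
import pickle
import sys

def transform(order_data):
    return sum(order_data.values())

with open(sys.argv[1], "rb") as f:
    payload = pickle.load(f)
result = transform(*payload["args"], **payload["kwargs"])
with open(sys.argv[2], "wb") as f:
    pickle.dump(result, f)
'''


def run_isolated(args, kwargs):
    """Run CHILD_SCRIPT in a separate process and return its result."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        script = tmp_path / "script.py"
        input_file = tmp_path / "input.pkl"
        output_file = tmp_path / "output.pkl"

        script.write_text(CHILD_SCRIPT)
        # Serialize the arguments for the child process.
        input_file.write_bytes(pickle.dumps({"args": args, "kwargs": kwargs}))
        # Execute the function in a separate interpreter process.
        subprocess.run(
            [sys.executable, str(script), str(input_file), str(output_file)],
            check=True,
        )
        # Deserialize the return value back in the parent process.
        return pickle.loads(output_file.read_bytes())


total = run_isolated(({"1001": 301.27, "1002": 433.21, "1003": 502.22},), {})
print(total)
```

Anything that crosses the process boundary (here the dict going in and the float coming out) has to survive the serializer, which is why the choice of serializer matters.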
What is dill?
dill is an enhanced alternative to Python’s built-in pickle.
Compared to pickle, dill can serialize more complex Python objects, such as:
- Lambdas
- Nested functions
- Closures
- Functions defined interactively
- Some class instances that pickle cannot handle
Why choose dill instead of the default
Airflow’s default serializer is usually pickle. You’d specify dill when:
- Your task function uses non-top-level functions
- You pass complex objects as arguments
- You’ve hit errors like “Can’t pickle local object”
Example: serializing a lambda or a nested function
pickle → ❌ fails
dill → ✅ works
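The pickle vs. dill difference can be reproduced with a short script. This is a minimal sketch: make_adder and try_dumps are names invented for the example, and dill is a third-party package that is assumed to be installed (the script skips that part if it is not).

```python
import pickle


def make_adder(n):
    # Nested function: it only exists inside make_adder's local scope
    # and closes over n.
    def add(x):
        return x + n
    return add


def try_dumps(serializer, obj):
    """Return (True, payload) if obj serializes, else (False, error text)."""
    try:
        return True, serializer.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError) as exc:
        return False, str(exc)


add_five = make_adder(5)

# Standard pickle stores functions by reference (module + qualified name),
# so a function that cannot be looked up that way is rejected.
pickle_ok, pickle_info = try_dumps(pickle, add_five)
print("pickle:", "works" if pickle_ok else f"fails ({pickle_info})")

try:
    import dill  # third-party; assumed installed, e.g. via `pip install dill`
except ImportError:
    dill = None
    print("dill: not installed, skipping")
else:
    dill_ok, dill_payload = try_dumps(dill, add_five)
    print("dill:", "works" if dill_ok else f"fails ({dill_payload})")
    if dill_ok:
        # The restored closure behaves like the original one.
        restored = dill.loads(dill_payload)
        print(restored(10))
```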
Important note about your code
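Unrelated to serializer='dill', this line will fail:
    return json.loads(data_string)
because data_string uses single quotes, and JSON requires double quotes. The string should be:
    data_string = '{"1001": 301.27, "1002": 433.21, "1003": 502.22}'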
Summary
- serializer='dill' tells Airflow how to serialize the task's arguments and return value
- dill is more powerful than pickle
- It’s especially useful for complex Python objects
- Commonly used with @task.virtualenv to avoid serialization errors

import json

data_string = "{'1001': 301.27, '1002': 433.21, '1003': 502.22}"

print(json.loads(data_string))

(airflow-venv) frank@ZZHPC:~/dags$ python ztest.py
Traceback (most recent call last):
  File "/home/frank/dags/ztest.py", line 5, in <module>
    print(json.loads(data_string))
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
               ^^^^^^^^^^^^^^^^^^^^^^
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
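
The traceback confirms the quoting problem. A small sketch of the same test with both variants side by side, using only the standard json module (the variable names bad, good, and order_data are chosen for this example):

```python
import json

# Single-quoted keys form a valid Python dict literal but are not valid JSON.
bad = "{'1001': 301.27, '1002': 433.21, '1003': 502.22}"
try:
    json.loads(bad)
    bad_parsed = True
except json.JSONDecodeError as exc:
    bad_parsed = False
    print(f"invalid JSON: {exc}")

# The same data with double-quoted keys parses into a dict.
good = '{"1001": 301.27, "1002": 433.21, "1003": 502.22}'
order_data = json.loads(good)
print(order_data)
```

Wrapping the JSON text in single quotes on the Python side keeps the double quotes inside it free of escaping.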
