
 

from __future__ import annotations

import logging
from datetime import datetime

from airflow.sdk import dag, task

log = logging.getLogger(__name__)


@dag(
    schedule=None,
    start_date=datetime(2021, 1, 1),
    catchup=False,
    tags=['example']
)
def tutorial_taskflow_api_virtualenv():
    @task.virtualenv(
        serializer='dill',
        system_site_packages=False,
        requirements=['funcsigs']
    )
    def extract():
        import json

        data_string = "{'1001': 301.27, '1002': 433.21, '1003': 502.22}"

        return json.loads(data_string)

    # Call the task and instantiate the DAG so Airflow registers it.
    extract()


tutorial_taskflow_api_virtualenv()

What does "serializer='dill'" mean?

 

In this context (Airflow’s @task.virtualenv decorator),

serializer='dill'

means Airflow will use the dill library to serialize (pickle) the task’s inputs, outputs, and execution context when running the task in an isolated virtual environment.
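
For reference, recent Airflow releases accept 'pickle' (the default), 'cloudpickle', or 'dill' for this parameter. A sketch of the same task with the default left in place (extract_with_default_serializer is just an illustrative name):

from airflow.sdk import task


@task.virtualenv(
    serializer='pickle',  # the default; 'cloudpickle' and 'dill' are the alternatives
    system_site_packages=False,
    requirements=['funcsigs'],
)
def extract_with_default_serializer():
    # A top-level function returning built-in types: plain pickle suffices.
    import json

    return json.loads('{"1001": 301.27, "1002": 433.21, "1003": 502.22}')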

Why serialization is needed here

@task.virtualenv runs your function in a separate Python process and virtualenv. To do that, Airflow must:

  1. Serialize the Python function and its arguments

  2. Send them to the virtualenv

  3. Execute the function

  4. Serialize the return value

  5. Send it back to the main Airflow process (XCom)

That serialization step is what serializer controls.
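
A minimal sketch of that round trip done by hand with dill (it mimics the steps above, not Airflow's actual internals):

import dill


def task_fn(order_data):
    # Stand-in for the task body that runs inside the virtualenv.
    return sum(order_data.values())


# Steps 1-2: serialize the callable plus its arguments and ship them over.
payload = dill.dumps((task_fn, ({"1001": 301.27, "1002": 433.21},)))

# Step 3: the subprocess deserializes and executes.
fn, args = dill.loads(payload)
result = fn(*args)

# Steps 4-5: serialize the return value and send it back (XCom).
print(dill.loads(dill.dumps(result)))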

What is dill?

dill is an enhanced alternative to Python’s built-in pickle.

Compared to pickle, dill can serialize more complex Python objects, such as:

  • Lambdas

  • Nested functions

  • Closures

  • Functions defined interactively

  • Some class instances that pickle cannot handle
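
A lambda, for example, round-trips through dill but is rejected by the standard pickler:

import pickle

import dill

double = lambda x: x * 2

restored = dill.loads(dill.dumps(double))  # dill handles the lambda
print(restored(21))                        # 42

try:
    pickle.dumps(double)
except pickle.PicklingError as e:
    print(f"pickle failed: {e}")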

Why choose dill instead of the default

Airflow’s default serializer is pickle. You’d specify dill when:

  • Your task function uses non-top-level functions

  • You pass complex objects as arguments

  • You’ve hit errors like “Can’t pickle local object”

Example:

import pickle
import dill

def outer():
    def inner():
        return 42
    return inner

f = outer()
dill.dumps(f)    # ✅ works
pickle.dumps(f)  # ❌ AttributeError: Can't pickle local object 'outer.<locals>.inner'

Important note about your code

Unrelated to serializer='dill', this line will fail:

json.loads(data_string)

JSON requires double quotes around keys and strings; single quotes are invalid. It should be:

data_string = '{"1001": 301.27, "1002": 433.21, "1003": 502.22}'
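
With that fix the parse succeeds:

import json

data_string = '{"1001": 301.27, "1002": 433.21, "1003": 502.22}'
print(json.loads(data_string))
# {'1001': 301.27, '1002': 433.21, '1003': 502.22}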

Summary

    • serializer='dill' tells Airflow how to serialize task code and data

    • dill is more powerful than pickle

    • It’s especially useful for complex Python objects

    • Commonly used with @task.virtualenv to avoid serialization errors

 

Running the original single-quoted string through json.loads in a standalone script (ztest.py) confirms the failure:

import json

data_string = "{'1001': 301.27, '1002': 433.21, '1003': 502.22}"

print(json.loads(data_string))

 

(airflow-venv) frank@ZZHPC:~/dags$ python ztest.py
Traceback (most recent call last):
  File "/home/frank/dags/ztest.py", line 5, in <module>
    print(json.loads(data_string))
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
               ^^^^^^^^^^^^^^^^^^^^^^
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

 

posted on 2025-12-15 10:27  ZhangZhihuiAAA