Diagnosing Airflow’s Auto-Scaling Flaw in AWS MWAA

⚠️ UPDATE – May 2025 ⚠️ This is no longer an issue, read full details below.

Apache Airflow is a fantastic platform for scheduling the ETL workflows of a Data Warehouse. MWAA is the AWS-managed implementation of Airflow, allowing for easier management and scalability of workers, among other improvements. The problem with MWAA, however, is that the automatic scaling of workers is currently broken, with a major architectural miss.

⚠️ UPDATE – May 2025 ⚠️

This issue has now been resolved in MWAA. The base architecture was rewritten and improved, and one outstanding bug was squashed as of May 2025. We have now been using auto-scaling in eight out of ten of our MWAA environments for two weeks with no issues (including using the replication steps listed in this article). Our final two environments will be using auto-scaling within the next month.

Big thanks to AWS and the MWAA team for working with us to resolve this issue.

⚠️ UPDATE – May 2025 ⚠️

The issue itself is that a worker can be assigned a task while it is being spun down by the auto-scaling process. There is a small window between a worker being sent the signal to spin down and that worker no longer accepting work. If work becomes available during this window, it can be picked up by a dying worker and never complete.
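The race is easiest to see in a toy sketch (plain Python threading, not MWAA internals): a worker that has already been signalled to shut down can still pull one last task from the queue, and that task never completes.

```python
import queue
import threading
import time

tasks = queue.Queue()
completed = []
shutdown = threading.Event()       # stands in for the scale-down signal

def worker():
    # Keep polling for work until told to shut down -- but note the gap:
    # a task grabbed just before the signal lands is still picked up.
    while not shutdown.is_set():
        try:
            task = tasks.get(timeout=0.01)
        except queue.Empty:
            continue
        time.sleep(0.05)           # the task outlives the shutdown window
        if shutdown.is_set():
            return                 # worker dies mid-task; work is lost
        completed.append(task)

worker_thread = threading.Thread(target=worker)
worker_thread.start()
tasks.put("etl_job")               # work arrives just as scale-down begins
shutdown.set()                     # the scale-down signal
worker_thread.join()
print(completed)                   # [] -- picked up, but never finished
```

In MWAA the equivalent gap sits between the Fargate container receiving its scale-down signal and the Celery worker actually ceasing to consume from the queue.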

Diagnosis Hurdles

Diagnosing the issue has been a difficult journey.

  • Workers being spun down don’t log anything out of the ordinary while a task is in progress; they simply cease to process any further work.
  • Suggestions from AWS Support reduced the chance of the issue occurring, but didn’t eliminate it.
  • The DAGs we first saw the issue with are complex, with multiple tasks and a large amount of Python code within each Python callable task.

The Symptoms

Following an upgrade in MWAA from Airflow 1.10.12 to 2.0.2, DAGs started to terminate randomly with no explanation, logging, or reason. The scheduler marked them as completed, but the Airflow UI marked them as failed.

Worker logs for the failing tasks either didn’t appear in CloudWatch at all, or contained very little logging. In one scenario, the following message appeared:

Executed failure callback for <TaskInstance: XXXXXXXXXX [failed]> in state failed

There were many errors regarding the workers losing their connection to the metadata DB, but AWS support stated that these errors relate to log lines being written simultaneously to both the metadata DB and CloudWatch.

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

[SQL: SELECT celery_taskmeta.id AS celery_taskmeta_id, celery_taskmeta.task_id AS celery_taskmeta_task_id, celery_taskmeta.status AS celery_taskmeta_status, celery_taskmeta.result AS celery_taskmeta_result, celery_taskmeta.date_done AS celery_taskmeta_date_done, celery_taskmeta.traceback AS celery_taskmeta_traceback
FROM celery_taskmeta
WHERE celery_taskmeta.task_id = %(task_id_1)s]
[parameters: {'task_id_1': '3f40a621-5832-4486-8b78-bb341f17c360'}]

While AWS support provided a multitude of suggestions around adjusting configuration settings and adding verbose logging, we discussed these two symptoms in detail within the development team and started to suspect some sort of connection-dropping issue with the workers. Our main theory was that the workers lost connection to the outside world (CloudWatch, the metadata DB, etc.) and therefore stopped processing, or that the workers had an intermittent problem during startup.

Theory Crafting

With a theory in place, we set out to prove it. On several of our MWAA environments we set the min/max worker count to 6/6 to rule out the startup or shutdown of workers as the issue. We kept one of our environments on 1/10. In under a day, the issue occurred again on the 1/10 environment, but not on any of the 6/6 environments.

AWS support stated that this approach was “an odd solution to the issue” initially, though relayed these findings to the service team. After some investigation, the service team suggested the issue could have one of two possible causes (quote):

  1. Lots of tasks are being sent to the workers at once, overwhelming the Scheduler before MWAA has the chance to autoscale.

  2. The issue is a known edge case where tasks are being scheduled during worker scaledown.

The second point was very similar to our working theory, so we wanted to explore it further and attempt to produce a definitive set of replication steps for it.

AWS support explained that if there are no running or queued tasks for more than 2 minutes, MWAA tells ECS (Fargate) to scale down to the minimum number of workers. This scaledown takes 2-5 minutes, and if Airflow schedules a task during this period there is a chance that the task will land on an ECS container slated for removal.
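Using the timings from that description, the at-risk window can be modelled with a back-of-the-envelope sketch (the numbers come from AWS support’s explanation above; the function itself is purely illustrative, not MWAA code):

```python
IDLE_BEFORE_SCALEDOWN = 120        # seconds of no running/queued tasks
MAX_SCALEDOWN_DURATION = 300       # the scaledown itself takes 2-5 minutes

def task_at_risk(seconds_since_last_task: float) -> bool:
    """True if a task scheduled this long after the previous task ended
    could land on an ECS container that is slated for removal."""
    window_start = IDLE_BEFORE_SCALEDOWN
    window_end = IDLE_BEFORE_SCALEDOWN + MAX_SCALEDOWN_DURATION
    return window_start <= seconds_since_last_task <= window_end

print(task_at_risk(180))   # True: lands mid-scaledown
print(task_at_risk(60))    # False: workers are not yet scaling down
```

In the worst case that window is roughly five minutes wide per scaledown, which is why shifting cron timings only reduces, rather than eliminates, the probability of a hit.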

We were advised to either increase or decrease the cadence between our DAG cron timings to reduce the chance of this issue occurring. The problem with this advice, though, is that we can’t control the duration of our DAGs. Some DAG executions process more data than others, so the gap between one task ending and the next starting will always be an unknown variable.

AWS support assured us that this was an edge case; however, they estimated up to a year of focused effort to resolve the problem.

At this point we still couldn’t reliably replicate the issue, but we had tracked down the root cause: poor MWAA architecture.

Reliable Replication

After being given an estimate of a year to resolve what we believe to be a major fundamental flaw in the architecture of the MWAA auto-scaling, we continued trying to find a reliable method of replication, in the hope that providing this to AWS support would speed up the fix.

With some tinkering, we developed a set of eight DAGs with specific cron timings, such that four would run every 10 minutes, then four more would run 2-3, 3-4, 4-5, and 5-6 minutes after each one of these had completed. These DAGs simply logged an INFO statement every 10 seconds before completing successfully after a couple of minutes. We set an environment to have a min/max worker count of 1/4, and started the DAGs.
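For illustration, the schedule set looked roughly like the sketch below. The cron expressions are hypothetical reconstructions (our actual offsets were tuned per environment), but the shape matches the description above: a first wave every 10 minutes, and a second wave landing a few minutes after each first-wave DAG (which runs for ~2 minutes) completes.

```python
# Hypothetical reconstruction of the eight test schedules.
# First wave: four DAGs that fire at the top of every 10-minute window.
first_wave = ['*/10 * * * *'] * 4

# Second wave: four DAGs offset so each starts 2-6 minutes after a
# first-wave DAG has completed, landing inside the suspected
# worker scale-down window.
second_wave = [
    '4-59/10 * * * *',   # ~2-3 minutes after first-wave completion
    '5-59/10 * * * *',   # ~3-4 minutes
    '6-59/10 * * * *',   # ~4-5 minutes
    '7-59/10 * * * *',   # ~5-6 minutes
]

print(len(first_wave) + len(second_wave))   # 8 DAGs in total
```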

We managed to replicate the issue in two of our DAGs roughly every 12 hours, like clockwork. The two DAGs set up to execute 2-3 and 3-4 minutes after the first set completed were reliably reproducing the failure. This was no longer an edge case, but a confirmed and proven period of instability of at least one minute during the scaledown of MWAA workers.

After I sent these findings to AWS support, the next failure occurred, like clockwork, at 2021-10-25T16:38:17.

AWS support confirmed the task was getting stuck in the SQS queue-handling code, which causes the 12-hour gap between failures. Unfortunately, our ability to replicate the issue reliably did not have the desired result of increasing the issue’s priority on the AWS side (a screenshot of AWS support’s response has been removed from here at Amazon’s request).

The code to replicate the issue is shown below:

from datetime import datetime, timedelta
import logging as logger
import pendulum
import time as t

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

local_tz = pendulum.timezone("Europe/London")

default_args = {
    'owner': 'airflow',
    'wait_for_downstream': False,
    'start_date': datetime(2021, 2, 25, tzinfo=local_tz),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=2)
}

# One DAG run every 10 minutes, with no overlapping runs or tasks.
dag = DAG(
    'test_dag_1'
    , catchup=False
    , max_active_runs=1
    , default_args=default_args
    , schedule_interval='*/10 * * * *'
    , concurrency=1)

def dummy_callable(logger, **kwargs):
    # Vary the task duration slightly by the minute of the hour, then log
    # an INFO line every 10 seconds until the task completes successfully.
    execution = int(t.strftime('%M')) // 10
    for i in range(1, 13 - execution):
        t.sleep(10)
        logger.info(f'Task is still running, time elapsed is {i*10} seconds')

##################################################
# DAG Tasks
##################################################

po_long_logger = PythonOperator(
    task_id='test_task_1'
    , python_callable=dummy_callable
    , op_kwargs={'logger': logger}
    , provide_context=True
    , dag=dag)

##################################################
# DAG Operator Flow
##################################################

po_long_logger

The only lines to change from one DAG to the next (other than the DAG name and task name) are the schedule_interval and the execution duration of each DAG.

Conclusion

At this point, we’re hoping that AWS support and the service team can increase the priority of this issue. We have already seen other people on forums encountering symptoms similar to ours, who may be hitting the same problem.

Until this issue is resolved however, AWS MWAA worker auto-scaling is not fit for purpose, and cannot be relied upon in any Production environment.

34 thoughts on “Diagnosing Airflow’s Auto-Scaling Flaw in AWS MWAA”

  1. Tenzin Choedak

    Hi there,
    Since your last time tackling this issue, has anything changed resulting in an improvement or complete avoidance of this issue?

    1. Hi Tenzin, I’ve been in contact with the MWAA team, and they’re working on a fix for this. They’re hoping to have it completed in the first half of 2022. Thanks.

          1. Hey Thom :)!
            One thing I couldn’t quite understand is whether the 6/6 environment eventually showed similar issues, or whether the approach of “hardcoding a non-scaling down of workers” worked as a fix.
            I’m having a whole host of problems with MWAA as well, with one environment actually failing to scale UP, and a few others eventually having all DAGs stuck in the queue.

          2. Hey Pedro, 6/6 never showed the same issue, hardcoding away from the auto-scaling worked as a fix. That said, AWS MWAA should be releasing a fix for this issue within the next few weeks (it was due last Friday, but they found some bugs while testing).

  2. We are in the same boat with you. I’ll talk to our premium AWS support on the status of this issue. Thank you for your work here – it helped us save a lot of time.

    1. Hi, as of May 2025, the AWS MWAA team have rolled out some improvements to the auto-scaling, including finding an edge-case bug that was causing this issue. My team and I are no longer encountering this issue with auto-scaling and are now rolling the feature out to all our environments.

  3. Hi all! Thanks for this investigation!
    We were going crazy trying to fix this issue by tweaking the settings :)
    Are there any updates from AWS regarding the fix?

    1. The latest update we had from the AWS MWAA team is that they’re having to roll the fix out across all infrastructure, including applying it to older Airflow 1.10.12 instances too. Because of this it’s been delayed, but should be resolved soon.

  4. Hi Thom,
    Did AWS resolve this issue?
    We have the same problem and it is very inconvenient.

    1. The latest update we had from the AWS MWAA team is that they’re having to roll the fix out across all infrastructure, including applying it to older Airflow 1.10.12 instances too. Because of this it’s been delayed, but should be resolved soon.

  5. Derek Martin

    Don’t suppose this has been fixed yet? While I’ve encountered this problem, I’ve been encountering it alongside multiple instances of the Signals.SIGKILL: 9 error. I’ve been making changes to mwaa’s configuration to try and address it, but haven’t tried upping the minimum workers yet which I’ll try next.

    1. The latest update we had from the AWS MWAA team is that they’re having to roll the fix out across all infrastructure, including applying it to older Airflow 1.10.12 instances too. Because of this it’s been delayed, but should be resolved soon.

  6. No updates from AWS yet, I believe?
    It’s not satisfactory that they have been fixing the issue for more than a year now and it’s still “coming anytime soon…”

    1. They updated recently saying the issue has been resolved, but I’m yet to test it personally at this time.

      When I get a chance to test it (in the next couple of weeks), if it’s fixed, I’ll make an update to the original post.

        1. Hi, as of May 2025, the AWS MWAA team have rolled out some improvements to the auto-scaling, including finding an edge-case bug that was causing this issue. My team and I are no longer encountering this issue with auto-scaling and are now rolling the feature out to all our environments.

  7. Zachary Naylor

    It is surprising that AWS have not jumped on their own ECS Task Scale-in protection feature from Nov-2022 to address the issue (or is this the feature they have quietly implemented for the resolved issue?).

    Looking forward to hearing if it is fixed in your testing!

    1. Hi, as of May 2025, the AWS MWAA team have rolled out some improvements to the auto-scaling, including finding an edge-case bug that was causing this issue. My team and I are no longer encountering this issue with auto-scaling and are now rolling the feature out to all our environments.

  8. We have noticed this issue too, but when we reached out to AWS support about it, we were told that there were no plans to fix it. Still happening for us.

    1. Hi, as of May 2025, the AWS MWAA team have rolled out some improvements to the auto-scaling, including finding an edge-case bug that was causing this issue. My team and I are no longer encountering this issue with auto-scaling and are now rolling the feature out to all our environments.

  9. Hello, we still have that problem. Could you share the settings of your MWAA env, like the “environment class” and the “Airflow configuration options”? If possible, could you also share how many DAGs you run every day? In my case, I have “celery.worker_autoscale 4,4” and “celery.sync_parallelism 1”. I still have problems with queued tasks in MWAA, but more rarely than when my celery.worker_autoscale was 2,2. Maybe I should also scale to 6,6. How do you work out how many workers are needed? I saw this formula: (tasks running + tasks queued) / (tasks per worker) = (required workers), but don’t know if it works.

    1. Hi, as of May 2025, the AWS MWAA team have rolled out some improvements to the auto-scaling, including finding an edge-case bug that was causing this issue. My team and I are no longer encountering this issue with auto-scaling and are now rolling the feature out to all our environments.

  10. Hey Thom, any updates about testing AWS’ fix? We are experiencing very odd behavior with downstream tasks not being queued and I’m not sure if it could be related or not. We get the same “Executed failure callback for in state failed” error you mentioned in the article, then the task will succeed once the service it called (Databricks) has finished its job, but will result in the downstream task not being queued. The failure callback is called at some point in there, even though the end result of the task is a success… very strange, and seemingly random among the DAGs and schedules it is occurring on.

    1. Hi, as of May 2025, the AWS MWAA team have rolled out some improvements to the auto-scaling, including finding an edge-case bug that was causing this issue. My team and I are no longer encountering this issue with auto-scaling and are now rolling the feature out to all our environments.

  11. Happening for us as well. Looks like Amazon were just trying to keep you as a client by throwing out fake promises.

    1. Hi, as of May 2025, the AWS MWAA team have rolled out some improvements to the auto-scaling, including finding an edge-case bug that was causing this issue. My team and I are no longer encountering this issue with auto-scaling and are now rolling the feature out to all our environments.

  12. Almost a year since the last comment, is this still unfixed? Also wondering if Amazon’s suggested workaround of having a task within a DAG that just runs continuously for the duration of your job works at all.

    1. Hi, as of May 2025, the AWS MWAA team have rolled out some improvements to the auto-scaling, including finding an edge-case bug that was causing this issue. My team and I are no longer encountering this issue with auto-scaling and are now rolling the feature out to all our environments.

  13. The issue described is now apparently resolved.
    We have been using min/max worker and no longer see the problem.

    There is a change on 2024-05-10 described here: https://docs.aws.amazon.com/mwaa/latest/userguide/doc-history.html

    “Updated the following topic to reflect the new Amazon MWAA automatic scaling behavior when workers pick up new tasks as Fargate workers downscale.”

    On the documentation page it says: “If workers pick up new tasks while downscaling, Amazon MWAA keeps the Fargate resource and does not remove the worker.”

    This was pointed to by customer support as an indication that the fix has been released.

    1. Hi, as of May 2025, the AWS MWAA team have rolled out some improvements to the auto-scaling, including finding an edge-case bug that was causing this issue. My team and I are no longer encountering this issue with auto-scaling and are now rolling the feature out to all our environments.
