Diagnosing Airflow’s Auto-Scaling Flaw in AWS MWAA

Apache Airflow is a fantastic platform for scheduling the ETL workflows of a Data Warehouse. MWAA (Amazon Managed Workflows for Apache Airflow) is the AWS managed implementation of it, allowing for easier management and scalability of workers, among other improvements. The problem with MWAA, however, is that the automatic scaling of workers is currently broken, with a major architectural flaw.

The issue itself is that a worker can be assigned a task while it is being spun down by the auto-scaling mechanism. There is a small window between a worker being sent the signal to spin down and that worker no longer accepting work, meaning that if work becomes available during this window, it can be picked up by a dying worker and never complete.

Diagnosis Hurdles

Diagnosing the issue has been a difficult journey.

  • As the workers are being spun down, they don’t log anything out of the ordinary while a task is in progress; they simply cease to process any further work.
  • The suggestions from AWS Support reduce the chance of the issue happening, but don’t eliminate it.
  • The DAGs we first saw the issue with are complex, with multiple tasks and a large amount of Python code within each Python callable task.

The Symptoms

Following an upgrade from Airflow 1.10.12 to 2.0.2 in MWAA, DAGs started to terminate randomly with no explanation or logging. The scheduler marked them as completed, but the Airflow UI marked them as failed.

Worker logs for the failing tasks either didn’t appear in CloudWatch at all, or contained very little logging. In one scenario, the following message appeared:

Executed failure callback for <TaskInstance: XXXXXXXXXX [failed]> in state failed

There were many errors regarding the workers losing their connection to the metadata DB, but AWS support stated that these errors were related to simultaneous writing of log lines to both the metadata DB and CloudWatch.

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

[SQL: SELECT celery_taskmeta.id AS celery_taskmeta_id, celery_taskmeta.task_id AS celery_taskmeta_task_id, celery_taskmeta.status AS celery_taskmeta_status, celery_taskmeta.result AS celery_taskmeta_result, celery_taskmeta.date_done AS celery_taskmeta_date_done, celery_taskmeta.traceback AS celery_taskmeta_traceback
FROM celery_taskmeta
WHERE celery_taskmeta.task_id = %(task_id_1)s]
[parameters: {'task_id_1': '3f40a621-5832-4486-8b78-bb341f17c360'}]

While AWS support provided a multitude of suggestions around adjusting configuration settings and adding verbose logging, we discussed these two symptoms in detail within the development team and started to suspect some sort of connection-dropping issue with the workers. Our main theory was that the workers either lost connection to the outside world (CloudWatch, the metadata DB, etc.) and therefore stopped processing, or had an intermittent problem with their startup.

Theory Crafting

With a theory in place, we set out to try and prove it. On several of our MWAA environments we set the min/max worker count to 6/6 to rule out the startup or shutdown of workers as the issue, and we kept one of our environments on 1/10. In under a day, the issue occurred again on the 1/10 environment, but not yet on any of the 6/6 environments.
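For reference, pinning the worker counts can be done through the MWAA UpdateEnvironment API. The snippet below is a minimal sketch of that step using boto3, assuming appropriate IAM permissions; the environment names are placeholders rather than our real ones.

import boto3

mwaa = boto3.client('mwaa')

# Pin workers to 6/6 so that no scale-down (and therefore no drain window)
# can occur on these environments.
for env_name in ['etl-env-1', 'etl-env-2']:
    mwaa.update_environment(Name=env_name, MinWorkers=6, MaxWorkers=6)

# Leave one environment on 1/10 as the control, so auto-scaling still happens.
mwaa.update_environment(Name='etl-env-control', MinWorkers=1, MaxWorkers=10)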

AWS support initially described this approach as “an odd solution to the issue”, though they relayed these findings to the service team. After some investigation, the service team suggested the issue could have one of two possible causes (quote):

  1. Lots of tasks are being sent to the workers at once, overwhelming the Scheduler before MWAA has the chance to autoscale.

  2. The issue is a known edge case where tasks are being scheduled during worker scaledown.

The second point was very similar to our working theory, so we wanted to explore it further and attempt to get a definitive set of replication steps for it.

AWS support explained that if there are no running or queued tasks for more than 2 minutes, MWAA tells ECS (Fargate) to scale down to the minimum number of workers. This scale-down takes 2-5 minutes, and if Airflow schedules a task during this period there is a chance that the task will land on an ECS container slated for removal.
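To make that window concrete, the snippet below is a small, illustrative model of the behaviour AWS described, not MWAA’s actual code. The 2-minute idle threshold and the 2-5 minute drain window come from AWS support’s explanation; the uniform distribution and the example gap values are assumptions for illustration only.

import random

IDLE_BEFORE_SCALEDOWN_S = 120   # no running/queued tasks for 2 minutes
DRAIN_WINDOW_S = (120, 300)     # scale-down then takes roughly 2-5 minutes

def task_lands_on_dying_worker(gap_between_tasks_s):
    """Return True if a task scheduled after this much idle time would be
    assigned to a worker that is already being removed."""
    if gap_between_tasks_s < IDLE_BEFORE_SCALEDOWN_S:
        return False  # scale-down never triggered, the task is safe

    # Scale-down starts once the idle threshold is hit; the worker is fully
    # gone at some point within the drain window after that.
    drain_ends = IDLE_BEFORE_SCALEDOWN_S + random.uniform(*DRAIN_WINDOW_S)

    # A task arriving while the worker is draining can be assigned to that
    # worker and never complete.
    return gap_between_tasks_s < drain_ends

if __name__ == '__main__':
    trials = 10000
    for gap in (60, 150, 240, 400):
        lost = sum(task_lands_on_dying_worker(gap) for _ in range(trials))
        print(f'{gap}s gap between tasks -> ~{lost / trials:.0%} chance of a lost task')

In other words, workloads whose idle gaps happen to fall inside that drain window, as ours regularly did, are exposed on every scale-down.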

We were advised to either increase or decrease the cadence between our DAG cron timings to reduce the chance of this issue occurring. The problem with this advice, though, is that we can’t control the duration of our DAGs. Some DAG executions will process more data than others, so the gap between one task ending and the next one starting will always be an unknown variable.

AWS support assured us that this was an edge case; however, they estimated up to a year of focused effort to resolve the problem.

At this point we still couldn’t reliably replicate the issue, but we had tracked down the root cause: poor MWAA auto-scaling architecture.

Reliable Replication

After being given an estimate of a year to resolve what we believe to be a fundamental flaw in the architecture of MWAA auto-scaling, we continued trying to find a reliable method of replication, in the hope that providing this to AWS support would speed up the fix.

With some tinkering, we developed a set of eight DAGs with specific cron timings, such that four would run every 10 minutes, and four more would run 2-3, 3-4, 4-5, and 5-6 minutes after each of these had completed. These DAGs simply logged an INFO statement every 10 seconds before completing successfully after a couple of minutes. We set an environment to a min/max worker count of 1/4 and started the DAGs.

We managed to replicate the issue in two of our DAGs roughly every 12 hours, like clockwork. The two DAGs set up to execute 2-3 and 3-4 minutes after the first set completed reliably reproduced the issue, meaning this was no longer an edge case but a confirmed, reproducible period of instability of at least one minute during the scale-down of MWAA workers.

After I sent these replication steps to AWS support, the next failure occurred, like clockwork, at 2021-10-25T16:38:17.

AWS support confirmed the task was getting stuck in the SQS queue handling code, which causes the 12-hour gap between failures. Unfortunately, however, our ability to replicate the issue reliably did not have the desired effect of increasing the issue’s priority on the AWS side (the screenshot of AWS support’s response has been removed from here at Amazon’s request).

The code to replicate the issue is shown below:

from datetime import datetime, timedelta
import logging as logger
import pendulum
import time as t

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

local_tz = pendulum.timezone("Europe/London")

default_args = {
    'owner': 'airflow',
    'wait_for_downstream': False,
    'start_date': datetime(2021, 2, 25, tzinfo=local_tz),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=2)
}

# Each of the eight DAGs is identical apart from its name, its task name,
# its schedule_interval, and the execution duration of its callable.
dag = DAG(
    'test_dag_1'
    , catchup=False
    , max_active_runs=1
    , default_args=default_args
    , schedule_interval='*/10 * * * *'
    , concurrency=1)

def dummy_callable(logger, **kwargs):
    # Vary the run length slightly depending on the current 10-minute window,
    # then log an INFO line every 10 seconds until the task completes.
    execution = int(int(t.strftime('%M')) / 10)
    for i in range(1, 13 - execution):
        t.sleep(10)
        logger.info(f'Task is still running, time elapsed is {i*10} seconds')

##################################################
# DAG Tasks
##################################################

# The standard logging module is passed in as the 'logger' kwarg so the
# callable can write to the task log.
po_long_logger = PythonOperator(
    task_id='test_task_1'
    , python_callable=dummy_callable
    , op_kwargs={'logger': logger}
    , provide_context=True
    , dag=dag)

##################################################
# DAG Operator Flow
##################################################

po_long_logger

The only lines to change from one DAG to the next (other than the DAG name and task name) are the schedule_interval and the execution duration of each DAG, as sketched below.
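As a purely illustrative example, one of the offset DAGs might differ from the file above only in lines like these. The exact schedule offset and loop length below are placeholders; our real values were tuned against the completion times of the first set of DAGs, and these lines slot into the same file shown above.

dag = DAG(
    'test_dag_5'
    , catchup=False
    , max_active_runs=1
    , default_args=default_args
    # Fire a few minutes after the */10 DAGs are expected to finish.
    , schedule_interval='4-59/10 * * * *'
    , concurrency=1)

def dummy_callable(logger, **kwargs):
    # A shorter run, so the worker goes idle again before the next cycle.
    for i in range(1, 7):
        t.sleep(10)
        logger.info(f'Task is still running, time elapsed is {i*10} seconds')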

Conclusion

At this point, we’re hoping that AWS support and the service team can increase the priority of this issue. We have already seen other people on forums describing symptoms similar to ours, who may also be hitting the same problem.

Until this issue is resolved, however, AWS MWAA worker auto-scaling is not fit for purpose and cannot be relied upon in any production environment.

17 thoughts on “Diagnosing Airflow’s Auto-Scaling Flaw in AWS MWAA”

  1. Tenzin Choedak

    Hi there,
    Since your last time tackling this issue, has anything changed resulting in an improvement or complete avoidance of this issue?

    1. Hi Tenzin, I’ve been in contact with the MWAA team, and they’re working on a fix for this. They’re hoping to have it completed in the first half of 2022. Thanks.

          1. Hey Thom :)!
            One thing I couldn’t quite understand is whether the 6/6 environment eventually showed similar issues, or whether the approach of “hardcoding a non-scaling down of workers” worked as a fix.
            I’m having a whole host of problems with MWAA as well, with one env actually failing to scale UP, and a few others eventually having all DAGs stuck in the queue.

          2. Hey Pedro, 6/6 never showed the same issue, hardcoding away from the auto-scaling worked as a fix. That said, AWS MWAA should be releasing a fix for this issue within the next few weeks (it was due last Friday, but they found some bugs while testing).

  2. We are in the same boat as you. I’ll talk to our premium AWS support about the status of this issue. Thank you for your work here – it helped us save a lot of time.

  3. Hi all! Thanks for this investigation!
    We were going crazy trying to fix this issue by tweaking the settings :)
    Are there any updates from AWS regarding the fix?

    1. The latest update we had from the AWS MWAA team is that they’re having to roll the fix out across all infrastructure, including applying it to older Airflow 1.10.12 instances too. Because of this it’s been delayed, but should be resolved soon.

  4. Hi Thom,
    Did AWS resolve this issue?
    We have the same problem and it is very inconvenient.

    1. The latest update we had from the AWS MWAA team is that they’re having to roll the fix out across all infrastructure, including applying it to older Airflow 1.10.12 instances too. Because of this it’s been delayed, but should be resolved soon.

  5. Derek Martin

    Don’t suppose this has been fixed yet? While I’ve encountered this problem, I’ve been encountering it alongside multiple instances of the Signals.SIGKILL: 9 error. I’ve been making changes to MWAA’s configuration to try and address it, but haven’t tried upping the minimum workers yet, which I’ll try next.

    1. The latest update we had from the AWS MWAA team is that they’re having to roll the fix out across all infrastructure, including applying it to older Airflow 1.10.12 instances too. Because of this it’s been delayed, but should be resolved soon.

  6. No updates from AWS yet, I believe?
    It’s not satisfactory that they have been fixing the issue for more than a year now and it’s still “coming anytime soon…”

    1. They updated us recently saying the issue has been resolved, but I’ve yet to test it personally.

      When I get a chance to test it (in the next couple of weeks), if it’s fixed, I’ll make an update to the original post.

  7. Zachary Naylor

    It is surprising that AWS have not jumped on their own ECS Task Scale-in protection feature from Nov-2022 to address the issue (or perhaps that is the feature they have quietly implemented for the resolved issue).

    Looking forward to hearing if it is fixed in your testing!

  8. We have noticed this issue too, but when we reached out to AWS support about it, we were told that there were no plans to fix it. Still happening for us.
