pretix.eu outage post-mortem
This morning, the availability of the pretix.eu Hosted service was partly interrupted multiple times over the course of approximately 10 hours. During this timeframe, all ticket shops were available and all ticket sales were processed correctly, however, the download of tickets, export files, invoices, as well as e-mail notifications and webhooks have been severly delayed or did not work at all.
We want to apologize to all our customers and users for the inconvenience this might have caused. The problems are now fixed and we're taking additional precautions to prevent this from happening again.
If you rely on email notifications, slack notifications or webhooks for your processes, please check your events if you missed any notifications between 04:00 and 12:00 CEST today. Otherwise, you do not need to take any action. If you are interested in the technical details of what happened today, you are invited to continue reading this blog post.
We first noticed that something was wrong when a customer called in around 10:35 this morning, asking why downloading a list of attendees in the backend didn't work but just kept showing a loading indicator. We were able to reproduce this behaviour, but it took us a while to realize the scope of the problem and find the root cause.
Lots of things in pretix that potentially take a bit longer (such as creating orders, creating PDF files, sending emails, communicating with third-party web services) are not handled by our web application itself. Instead, we have a bunch of task workers and we use the popular task queuing system Celery to distribute the workload to one of those workers and report the results back. If you click to export an attendee list in pretix, a task "generate attendee list" will be put on this task queue and wait for a worker to pick it up.
Only in this case no worker did pick it up. When looking into the log files of our worker processes, we noticed a
lot of tasks failing because our custom-built file storage system
rejected any file uploads with an HTTP status code of
A short look at the error log of one of our file storage nodes very quickly revealed the cause:
2018/11/26 08:00:59 [crit] 12740#12740: *5225976 pwrite() "/var/nanocdn/tmp/0000002057" failed (28: No space left on device),
And indeed, the file storage had apparently filled up around 04:00 am this morning. We didn't see that coming since the storage usually filles up at a rather predictable rate and when we left for the weekend, we were running at less than 40 % disk capacity. We do have monitoring for disk capacity in place, but since our monitoring system had a short network outage over the weekend, we had been swamped in monitoring alerts and missed the needle in the haystack.
Running a quick disk scan showed that one single customer created 60 gigabytes of cached ticket PDF files – over night! We've seen things like this before when customers use a multiple megabyte photo as their logo on invoices (which we prevent for a while already). In pretix, customers can upload arbitrary PDF files to be used as the background of their tickets, so we figured that this customer had just used a very large PDF file. As it turned out, the tickets generated for this event were indeed around 1.6 MB large and the customer had sold upwards of a thousand tickets, but that still doesn't explain a disk usage of more than one or two gigabytes. This was weird, but we'll get to that.
Fixing the error at hand was easy, we just deleted the files. They are just a cache, so if a user were to download their ticket again, it wouldn't cause any trouble. However, our task queue still contained tens of thousands of tasks waiting to be done, so the system was not yet ready to fulfill any new requests.
We had a similar issue already in June 2017, when a customer's SMTP misconfiguration caused our task queue to be swamped with failing retries of the same task. Back then, this caused pretix to cease working completely, since when the task queue was swamped with emails, no other tasks would be handled in time and therefore no tickets could be sold. To prevent this from happening again, we had split our task queue into four separate queues. One queue would handle everything required to create carts and orders, one queue would handle all e-mail sending, one queue would handle everything that was expected to be slow and no users are waiting for (such as clean-up tasks), and one queue would handle everything else. This way, no matter what happened to the email queue, orders would still be processed, and vice versa.
Thanks to this split, pretix didn't stop working, only parts of it did. To make those parts come up again as quickly as possible, we decided to drop all tasks currently queued and start fresh. In case of PDF generation requests, we figured the users who clicked that button stopped trying hours ago, so they wouldn't really care. The downside was that it also dropped any pending webhook or notification tasks that has not yet been sent out.
To make this less likely from happening again, we introduced a cleanup task for cached ticket files and started relaxing.
Just minutes later, though, we noticed the queue filling up with tasks again, faster than our workers could work through them. This is highly unusual, normally all tasks are grabbed by a free worker instantly. We looked into the log files to find that the worker processes all grabbed a task to generate a new PDF for the same customer that we talked about before and then did not report back for minutes, using up all CPU resources on their respective host system at the same time.
As it turns out, the issue wasn't that the background PDF used by the customer had too many bytes: Instead, it apparently was was structured in a way that made PyPDF2, our PDF processing library, take forever to parse it. The frontend code triggering the task, however, only waited for five minutes for the PDF to be generated before it started to assume that the process failed and requested the creation of the file for a second time. This way, every ticket was queued to be created an many many times – leading to the congestion in the task queue, and the full disk on the file storage.
We ended up replacing PyPDF2 with pdftk for this operation, as well as completely removing the (rather ancient) retry mechanism in the frontend and transformed ticket downloads to the same waiting mechanism we use everywhere else already. Since this was a larger software change, it took us a while to create, test, and deploy it with confidence that it would not introduce any new problems. Finally, the problem was solved.
When something like this happens, we never stop at fixing the problem, though, we always spend some time to think about how we could prevent even remotely similar problems from ever happening again, or at least from having a large impact.
This time, we moved webhook and email notifications to their own separate task queue. If we ever experience congestion on the "main" queue again, this means we can empty that queue without losing any precious notifications. Had we done this before, this outage would have caused much less actual impact, but only some delays. To go further, we also just extended our monitoring system to regularly look at the length of every task queue and alert us if it gets longer than usual.