
Mandrill, a Mailchimp add-on, announced an outage early this week. The company reported that users could send email but could not receive it. Users also experienced errors with webhooks and scheduled mailings. Many users have expressed disappointment with the lack of communication about the issue and the slow pace of the fix.
The company eventually provided the following explanation:
Mandrill uses a sharded Postgres setup as one of our main datastores. On Sunday, February 3, at 10:30pm EST, 1 of our 5 physical Postgres instances saw a significant spike in writes. The spike in writes triggered a Transaction ID Wraparound issue. When this occurs, database activity is completely halted. The database sets itself in read-only mode until offline maintenance (known as vacuuming) can occur.
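For readers unfamiliar with the failure mode described above, the following is a minimal sketch of how wraparound headroom can be estimated. It is not Mandrill's tooling. It assumes PostgreSQL's 32-bit transaction IDs, where roughly 2^31 transactions can sit "in the past" at once, and the default `autovacuum_freeze_max_age` of 200 million. The function name `wraparound_headroom` and the returned field names are hypothetical, chosen for illustration. The SQL string shows the standard catalog query for transaction ID age; the sketch does not execute it.

```python
# Standard catalog query reporting, per database, the age of the oldest
# unfrozen transaction ID. Shown for reference; not executed in this sketch.
XID_AGE_QUERY = "SELECT datname, age(datfrozenxid) FROM pg_database;"

# Transaction IDs are 32-bit, so about 2^31 of them can be "in the past".
WRAPAROUND_LIMIT = 2**31

# PostgreSQL's default autovacuum_freeze_max_age (assumed unchanged here).
AUTOVACUUM_FREEZE_MAX_AGE = 200_000_000


def wraparound_headroom(xid_age: int) -> dict:
    """Estimate how close a database is to transaction ID wraparound.

    xid_age is the value returned by age(datfrozenxid) for one database.
    """
    remaining = WRAPAROUND_LIMIT - xid_age
    return {
        # Transactions left before the theoretical wraparound point.
        "remaining": remaining,
        # Share of the available XID space already consumed.
        "percent_used": round(100 * xid_age / WRAPAROUND_LIMIT, 1),
        # True once the age passes the threshold at which autovacuum
        # forces a freeze, under the default setting assumed above.
        "needs_freeze": xid_age >= AUTOVACUUM_FREEZE_MAX_AGE,
    }
```

A sudden spike in writes, like the one in the statement, consumes transaction IDs faster than routine vacuuming can freeze old rows, so the remaining headroom can shrink quickly. In practice PostgreSQL stops assigning new transaction IDs somewhat before the theoretical limit, so the `remaining` figure here is an upper bound.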
On Wednesday, February 5th, the company reported that it had resolved the issue. Because of the complexity of the problem, Mailchimp/Mandrill is not entirely sure how many users were affected, and the company chose to accept the loss of some customer data in order to restore full service. The outage may have some residual effects. For instance, stats and metrics from the outage period will be incorrect or missing. Otherwise, the service should be back in full operation.
The company has said it will compensate affected customers for the outage. It will also conduct a complete post-mortem investigation to better understand what happened. For now, the issue appears to be resolved.