Atlassian Outage: A Week Gone, but Restoration May Take up to Two More

essidsolutions

Enterprise software solutions provider Atlassian has been struggling with a week-long outage of some of its cloud services. Worryingly, the company believes it may take up to two more weeks to resolve the issues.

By 11:37 PM EST on April 12, seven days after the outage began, the company had restored functionality for 45% of impacted users. That progress seems slow given the resources of a provider of some of the most coveted collaboration and project management products.

Just a reminder, Atlassian is a publicly traded company with a market capitalization of ~$68 billion. It provides Jira Software, Jira Work Management, Jira Service Management, Confluence, Opsgenie Cloud, Statuspage, and Atlassian Access, all of which are down for nearly 400 customers (less than 1% of the total).

“While running a maintenance script, a small number of sites were disabled unintentionally,” Atlassian tweeted. “To be clear, this incident was not a cyber attack nor was it a failure of our systems to scale. Additionally, the majority of restored customers have had no data loss, while some have reported data loss for up to 5 minutes prior to the incident,” clarified Atlassian CTO Sri Viswanath.

Viswanath explained that a fully integrated standalone application, Insight – Asset Management, in Jira Service Management and Jira Software needed to be deactivated to remove native functionality. Here’s what went wrong:

“First, there was a communication gap between the team that requested the deactivation and the team that ran the deactivation. Instead of providing the IDs of the intended app being marked for deactivation, the team provided the IDs of the entire cloud site where the apps were to be deactivated,” Viswanath said.

“Second, the script we used provided both the ‘mark for deletion’ capability used in normal day-to-day operations (where recoverability is desirable), and the ‘permanently delete’ capability that is required to permanently remove data when required for compliance reasons. The script was executed with the wrong execution mode and the wrong list of IDs. The result was that sites for approximately 400 customers were improperly deleted.”

In other words, Atlassian fumbled with the maintenance, thus sparking customer data loss concerns.
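The two failure points Viswanath describes (the wrong kind of ID, and the wrong execution mode) are the sort of mistake a script can be written to refuse. The following is only an illustrative sketch, not Atlassian's actual tooling: every name in it (`IdKind`, `Mode`, `Target`, `plan_deactivation`, `confirm_permanent`) is hypothetical. It assumes each ID carries its type, defaults to the recoverable mode, and requires an explicit extra flag before planning a permanent delete.

```python
from dataclasses import dataclass
from enum import Enum


class IdKind(Enum):
    APP = "app"
    SITE = "site"


class Mode(Enum):
    MARK_FOR_DELETION = "mark_for_deletion"  # recoverable, for day-to-day use
    PERMANENT_DELETE = "permanent_delete"    # irreversible, for compliance cases


@dataclass(frozen=True)
class Target:
    kind: IdKind
    id: str


def plan_deactivation(targets, expected_kind,
                      mode=Mode.MARK_FOR_DELETION,
                      confirm_permanent=False):
    """Return the list of (mode, id) actions, or raise if the request looks wrong.

    Hypothetical guardrails: reject IDs of the wrong kind, and refuse a
    permanent delete unless the caller opts in explicitly.
    """
    wrong = [t.id for t in targets if t.kind is not expected_kind]
    if wrong:
        raise ValueError(
            f"expected {expected_kind.value} IDs, got other kinds: {wrong}"
        )
    if mode is Mode.PERMANENT_DELETE and not confirm_permanent:
        raise RuntimeError("permanent delete requires confirm_permanent=True")
    return [(mode.value, t.id) for t in targets]


if __name__ == "__main__":
    apps = [Target(IdKind.APP, "app-1"), Target(IdKind.APP, "app-2")]
    print(plan_deactivation(apps, IdKind.APP))
    try:
        # Passing site IDs where app IDs are expected is rejected up front.
        plan_deactivation([Target(IdKind.SITE, "site-9")], IdKind.APP)
    except ValueError as err:
        print("rejected:", err)
```

In this sketch, a list of site IDs handed to an app-deactivation run fails before any action is planned, and the destructive mode can never be reached by default.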

Atlassian said, in its reply to a customer, “We expect most site recoveries to occur with minimal or no data loss.” Restoration efforts are ongoing at the Sydney, Australia-based company.

“The rebuild stage is particularly complex due to several steps that are required to validate sites and verify data. These steps require extra time, but are critical to ensuring the integrity of rebuilt sites,” the company added.


According to multiple users, this extra time to restore services may be as much as two more weeks.

A Reddit user’s organization received the following reply on their support ticket: “We were unable to confirm a more firm ETA until now due to the complexity of the rebuild process for your site. While we are beginning to bring some customers back online, we estimate the rebuilding effort to last for up to 2 more weeks.”


The same timeline was communicated to another Atlassian customer.

Atlassian has also shared technical details of the recovery: “Currently, we are restoring customers in batches of up to 60 tenants at a time. End-to-end, it takes between 4 and 5 elapsed days to hand a site back to a customer.”
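Those figures allow a rough back-of-the-envelope check. The sketch below assumes about 400 affected sites and treats one customer as one tenant; both are simplifications, and the function names are purely illustrative. It counts the minimum number of 60-tenant batches and shows how long the rebuild would take if batches ran strictly one after another.

```python
import math


def batches_needed(sites, batch_size):
    """Minimum number of batches to cover all sites."""
    return math.ceil(sites / batch_size)


def sequential_days(sites, batch_size, days_per_batch):
    """Total elapsed days if each batch started only after the previous one finished."""
    return batches_needed(sites, batch_size) * days_per_batch


if __name__ == "__main__":
    print(batches_needed(400, 60))      # 7
    print(sequential_days(400, 60, 4))  # 28
    print(sequential_days(400, 60, 5))  # 35
```

Run strictly in sequence, seven batches would take 28 to 35 days, which is longer than the two weeks quoted to customers. That gap suggests batches are being worked on in parallel or with overlap, though Atlassian's statement does not say so explicitly.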

Its on-premises products and services, scheduled to be phased out by February 2024, have not been impacted.

The company said it deployed “hundreds of engineers” to get the services up and running again. “Our global engineering teams are working 24/7 to make progress on this incident.”

Stay tuned for more updates on the Atlassian outage.

Are you impacted by the Atlassian outage? Let us know on LinkedIn, Twitter, or Facebook. We would love to hear from you!
