Chat Server Outage – May 20, 2016

Intro

We take the availability of our service very seriously at Ryver. We have now had our second connectivity issue in as many weeks, after essentially no downtime since we came out of beta last October, so it is understandable that some customers are concerned, and rightfully upset by the recent inconvenience. I’d like to take some time here to explain the issues we encountered and describe how we are handling things to ensure reliable service.

For those who don’t want to read through the details below, here is a quick summary:

On May 11, our customer base grew to the point where it triggered a previously unknown issue in a code library used by our chat service. We had intermittent connection issues because a server was being overrun with too many concurrent chat connections. We fixed the part of our system that was supposed to safeguard against that problem, scaled up our system to handle future growth, and put additional monitoring in place. On May 20, we had an issue with that safeguard itself. We now have end-to-end protection in place for the full chat server communication cycle, including the safeguard, but we’re not stopping there (see the last two sections of this article).

Outage Description

Around 4:43am PST on May 20, the available connections on our chat server filled up, and users started seeing constant “reconnecting” messages in the Ryver client. You could still access the application and even create Posts and Comments, but you could not send chat messages or receive real-time notifications of new activity. The issue lasted about 3.5 hours.

Previous Week’s Issue

On the morning of May 11, some customers started having trouble connecting to the chat server. The problem was intermittent: it came and went, and it did not affect everybody.

Problem/Solution

On May 11, we discovered that one of our servers was being asked for more concurrent connections than it was configured to handle. A third-party code library we were using was supposed to safeguard against this, but the safeguard was not functioning as expected. We got it working, confirmed that the server was no longer being overrun with too many connections, and monitored closely during peak hours for the next few days.
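
As a rough illustration of what a safeguard like this does, here is a minimal Python sketch of a connection cap. It is not the third-party library's code or our actual implementation; the `ConnectionLimiter` class and its method names are hypothetical, chosen only for this example. The key property is that it turns away new connections once the configured limit is reached, instead of letting the server be overrun.

```python
import threading


class ConnectionLimiter:
    """Hypothetical cap on concurrent connections; rejects instead of blocking."""

    def __init__(self, max_connections):
        if max_connections < 1:
            raise ValueError("max_connections must be at least 1")
        self.max_connections = max_connections
        self._active = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Reserve a slot and return True, or return False if the server is full."""
        with self._lock:
            if self._active >= self.max_connections:
                return False
            self._active += 1
            return True

    def release(self):
        """Free a slot when a connection closes."""
        with self._lock:
            if self._active > 0:
                self._active -= 1

    @property
    def active(self):
        """Number of slots currently in use."""
        with self._lock:
            return self._active
```

A server using something like this would call `try_acquire()` before accepting a chat connection and `release()` when that connection closes, so a rejected client can retry later or be routed elsewhere.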

On May 20, when the full chat outage occurred, we discovered that while the original problem was still fully addressed, a different problem in the same third-party library was causing the safeguard to lock up and prevent new connections to the chat server. We have identified the source of the issue and put an additional level of protection in place, along with a monitor on the safeguard itself.
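
To make the idea of monitoring the safeguard itself more concrete, here is a small Python sketch of a health probe with a timeout. It is an illustration under assumptions, not a description of our monitoring: `probe_with_timeout` and the `check` callable are hypothetical names. The point is that a component that locks up never answers at all, so the monitor has to treat "no answer within the deadline" as a failure rather than wait forever.

```python
import threading


def probe_with_timeout(check, timeout_s):
    """Run check() in a helper thread; report healthy only if it returns
    a truthy value before timeout_s seconds have passed."""
    result = {"ok": False}

    def run():
        try:
            result["ok"] = bool(check())
        except Exception:
            # A check that raises counts as unhealthy.
            result["ok"] = False

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(timeout_s)

    # If the worker is still running, the check hung: report unhealthy.
    return (not worker.is_alive()) and result["ok"]
```

In this sketch, `check` might attempt to reserve and release a connection slot through the safeguard. A hung safeguard then shows up as a failed probe within `timeout_s` seconds, rather than going unnoticed until users report problems.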

But wait, there’s more

We now feel good about having identified all of the vulnerability points in the chat server communication process, and about our solutions. But we aren’t satisfied. We are also working on self-recovery mechanisms that would let automation address something like the May 20 outage immediately, rather than relying on someone to inspect and fix the problem manually. We will also put together a public-facing status page so that you can see in real time whether there are problems with our servers.
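
One common shape for the kind of self-recovery described above is a rule that triggers an automated fix after several consecutive failed health checks. The Python sketch below shows that pattern only; it is not the mechanism we are building, and the `AutoRecovery` class, the `recover` callback, and the threshold of three failures are all assumptions made for the example.

```python
class AutoRecovery:
    """Hypothetical rule: run recover() after N consecutive failed health checks."""

    def __init__(self, recover, failure_threshold=3):
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.recover = recover
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.recoveries = 0

    def observe(self, healthy):
        """Record one health-check result; return True if recovery was triggered."""
        if healthy:
            # Any success resets the streak, so brief blips do not cause restarts.
            self.consecutive_failures = 0
            return False

        self.consecutive_failures += 1
        if self.consecutive_failures < self.failure_threshold:
            return False

        self.recover()
        self.recoveries += 1
        self.consecutive_failures = 0
        return True
```

Requiring several failures in a row is a deliberate trade-off: it delays recovery by a few check intervals, but it avoids restarting a healthy service because of a single slow response.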

Conclusion

We are grateful to our customers for putting their trust in Ryver to solve their team communication needs. We consider anything less than 100% availability outside of specified maintenance updates unacceptable, and we will work tirelessly not only to provide the service you expect, but also to improve our communication and response time when there is a problem. Thank you for your support, and please feel free to contact us at any time with questions or feedback!

Thank you,
The Ryver Team