If you were using Ryver late in the evening of Sept 19 (PST), you likely encountered degraded performance, an inability to access the NOTIFICATIONS view, and probably some periodic 504 errors during the hour following the upgrade. Ultimately, we decided to pull the system back into maintenance mode because our primary DB was under severe stress and becoming unusable.
We are very sorry that you encountered this significant disruption in service. We want to share some information about what happened and the steps we are taking for the future.
First, here are some changes in procedure we intend to make around upgrades in general:
- Avoid significant back-end upgrades during the week if at all possible, even if we think they will only take a couple of minutes. We will do them on the weekend instead, because even if we choose a “slow” time during the week, any issue we encounter can extend the maintenance window into business hours for some people, depending on where they are located.
- Provide better info on our lock screen when going down for maintenance, including a link to http://status.ryver.com
- Provide notice of our maintenance plans further in advance
- Ultimately, get to where we can do live updates without any user downtime
As for what happened with this particular upgrade:
The problem was completely unrelated to the primary features we were rolling out. A developer had done some “minor code cleanup” in a few areas and ended up leaving off a set of parentheses, which turned a database query condition of the form AND (OR OR) into AND OR OR. Because AND binds more tightly than OR, the leading condition no longer applied to every OR branch, so the query matched far more rows than intended. This resulted in massive numbers of table scans on our DB once all of the users came back online following the upgrade.
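To illustrate how much a single pair of parentheses matters, here is a minimal sketch using Python's built-in `sqlite3` module. It is not the actual Ryver query: the `messages` table, its columns, and the sample data are hypothetical, and SQLite simply stands in for whatever database is in use.

```python
import sqlite3

# Hypothetical filters: the only difference is one pair of parentheses.
INTENDED = "user_id = ? AND (kind = 'a' OR kind = 'b' OR kind = 'c')"
BROKEN = "user_id = ? AND kind = 'a' OR kind = 'b' OR kind = 'c'"


def build_demo_db():
    """Create an in-memory table with 3 users x 4 kinds = 12 rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages ("
        "id INTEGER PRIMARY KEY, user_id INTEGER, kind TEXT)"
    )
    rows = [(u, k) for u in (1, 2, 3) for k in ("a", "b", "c", "d")]
    conn.executemany(
        "INSERT INTO messages (user_id, kind) VALUES (?, ?)", rows
    )
    return conn


def count_rows(conn, where_clause, params):
    """Count the rows matched by a WHERE clause."""
    sql = f"SELECT COUNT(*) FROM messages WHERE {where_clause}"
    return conn.execute(sql, params).fetchone()[0]


if __name__ == "__main__":
    conn = build_demo_db()
    # With parentheses, only user 1's rows of kind a, b, or c match.
    print("intended:", count_rows(conn, INTENDED, (1,)))
    # Without them, AND binds first, so every user's b and c rows match too.
    print("broken:  ", count_rows(conn, BROKEN, (1,)))
```

In this toy data set the intended filter matches 3 rows and the broken one matches 7, because the user filter only constrains the first OR branch.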
Moving forward, we are going to:
- Implement new monitoring of our DB query plans so we can catch problems even in a testing environment
- Improve our load testing
- Create some automation tools for quickly getting out of a bind if something like this happens again
- Implement a “Limp Mode” similar to what automobiles sometimes have for limping along down the road when there are transmission or other problems. In our case, we would do everything we can to make sure chat remains available, even if other subsystems need to be temporarily taken offline.
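As a rough illustration of the first item above, checking query plans in a test environment, the sketch below flags any query whose plan includes a full table scan. It assumes SQLite and its `EXPLAIN QUERY PLAN` statement purely for demonstration; the `full_scans` helper, the `messages` table, and the index name are hypothetical, and a production database would need its own equivalent of this check.

```python
import sqlite3

# Hypothetical queries: the second one is missing its parentheses.
GOOD = "SELECT id FROM messages WHERE user_id = ? AND (kind = 'a' OR kind = 'b')"
BAD = "SELECT id FROM messages WHERE user_id = ? AND kind = 'a' OR kind = 'b'"


def build_schema():
    """Create an empty in-memory table with an index on user_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages ("
        "id INTEGER PRIMARY KEY, user_id INTEGER, kind TEXT)"
    )
    conn.execute("CREATE INDEX idx_messages_user ON messages (user_id)")
    return conn


def full_scans(conn, sql, params=()):
    """Return the plan steps that scan a whole table for this query."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row is a human-readable description;
    # SQLite prefixes full scans with "SCAN" and index lookups with "SEARCH".
    return [row[-1] for row in plan if row[-1].startswith("SCAN")]


if __name__ == "__main__":
    conn = build_schema()
    for name, sql in (("good", GOOD), ("bad", BAD)):
        scans = full_scans(conn, sql, (1,))
        status = "FULL SCAN" if scans else "ok"
        print(f"{name}: {status} {scans}")
```

A check like this could run as part of a test suite over the application's most frequent queries, so that a change in query plan is noticed before users come back online.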
We take this experience very seriously and strive for zero unplanned downtime for your communication platform.