V1.5 added info on pinned tweets (15th July 2017 7am AEST)
I am a big fan of companies having external (and internal) self-service status pages that list the statuses of applications and services. If you have an online presence, or are developing an API or service, you should consider building an automated status page that lists the status of each service.
Currently, users can check if a website is down by visiting sites like http://downforeveryoneorjustme.com/
An app with multiple or secure back ends is harder for customers to diagnose when something is down, so offering inbuilt status screens is essential. It is a good idea to create a dedicated system status page (e.g. https://status.youproduct.com) and have that page show statuses collected from a separate server or the same server. You don’t need a dedicated subdomain (a subfolder will do). Apple and Google use subpage status pages.
A good status page will show the status of each service you offer, e.g.
- Online shopping cart: UP
- Online forum: UP
- Payment Gateway: UP
- SMS gateway: UP (Resolved connection issue 12 mins ago).
- App API: DOWN (expected restoration in 12 mins).
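A list like the one above can be generated from a small data structure rather than typed by hand. Here is a minimal Python sketch. The `ServiceStatus` class, its fields and `render_status_lines` are names I made up for illustration, so adapt them to whatever your monitoring already records.

```python
from dataclasses import dataclass


@dataclass
class ServiceStatus:
    name: str        # e.g. "Payment Gateway"
    up: bool         # True if the most recent check succeeded
    note: str = ""   # optional detail shown in brackets


def render_status_lines(services):
    """Return one '- name: UP' or '- name: DOWN (note).' line per service."""
    lines = []
    for service in services:
        state = "UP" if service.up else "DOWN"
        suffix = f" ({service.note})." if service.note else ""
        lines.append(f"- {service.name}: {state}{suffix}")
    return lines
```

Calling `render_status_lines([ServiceStatus("Online forum", True)])` returns `["- Online forum: UP"]`, which you can drop into a page template.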
If things are down it is a good idea to add a balloon message or alert to live systems (and link to your status page). Not everyone remembers maintenance windows or keeps up to date.
The status page can also contain other data that may help internal teams diagnose faults like:
- Server Room Air conditioner temperature: 39°C.
- Room temperature: 41°C.
- Floor water sensor #1: TRUE.
- Floor water sensor #2: FALSE.
- Humidity: 89%.
- Secure Server room 001 photos (link 1, link 2).
- AD server: UP.
- DNS: UP.
- Server Rack 001 Intake Temperature: 38°C.
- Server Rack 001 Internal Temperature: 78°C.
- Server Rack 001 External Temperature: 64°C.
Proactive vs Reactive monitoring
It is a good idea to proactively detect and automatically remediate issues before you are forced to resolve something reactively. Don’t rely on an email from a monitoring service saying your server is down (or was down), or on a user reporting an issue. Users will often sit back and use the outage to do something else, which affects your service reputation and risks your business.
I would monitor in this order.
- External HTTP checking (External monitor checking your server).
- External Application checking (external verification of logins or application services).
- Internal Server stats (network link up status, link speed, network connections and network failure rates. A status screen can be easily built importing server stats and server performance).
- Known historical issues (monitor what has caused your sites to break before).
- Data from applications (historical patterns or known triggers).
- User Error Reports
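The first item in that list, an external HTTP check, is the easiest to sketch. The following is a minimal Python example using only the standard library. The function names `is_healthy` and `check_http`, the 5 second timeout and the rule that any 2xx or 3xx answer counts as "up" are my assumptions, not a standard. Run it from a machine outside your own network so it sees what customers see.

```python
import http.client
import time
import urllib.error
import urllib.request


def is_healthy(status):
    """Treat any 2xx/3xx response as up; None means no response at all."""
    return status is not None and 200 <= status < 400


def check_http(url, timeout=5.0):
    """Make one GET request and return (is_up, http_status, seconds_taken)."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code   # the server answered, but with a 4xx/5xx code
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        status = None       # DNS failure, refused connection, timeout, etc.
    elapsed = time.monotonic() - started
    return is_healthy(status), status, elapsed
```

A scheduler (cron, for example) could call `check_http("https://status.youproduct.com")` every minute and store the result. The returned latency is also useful later for graphs.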
Waiting for users to report errors is bad. Sites like www.trello.com and www.onesignal.com have good programmable services like web push, mobile push and/or phone and SMS alerts that can be connected into your support processes.
Showing current service performance and endpoint status allows your customers to set their expectations, and it shows you take your service’s uptime seriously.
If you have logs or data available from applications you may as well automate and summarise them. “Without data you’re just another person with an opinion.” – W. Edwards Deming
Ignoring data and not reporting issues is a recipe for poor service.
Archiving multiple data points
It is a good idea to log and archive network usage and service CPU and resource usage (app, web server, I/O etc.) so you can find correlations and failure points. Analysis is key.
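As a rough illustration of archiving data points, here is a small Python sketch that stores timestamped samples in SQLite (part of the standard library). The table layout, the metric names and the function names are hypothetical. A real setup may prefer a time-series database, but the idea is the same: keep every sample with its time so you can line metrics up against outages later.

```python
import sqlite3
import time


def open_archive(path=":memory:"):
    """Open (or create) the archive; ':memory:' is only suitable for testing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "taken_at REAL NOT NULL, metric TEXT NOT NULL, value REAL NOT NULL)"
    )
    return conn


def record_sample(conn, metric, value, taken_at=None):
    """Store one reading, timestamped now unless taken_at is given."""
    if taken_at is None:
        taken_at = time.time()
    conn.execute(
        "INSERT INTO samples (taken_at, metric, value) VALUES (?, ?, ?)",
        (taken_at, metric, value),
    )
    conn.commit()


def samples_for(conn, metric, since=0.0):
    """Return (taken_at, value) pairs for one metric, oldest first."""
    rows = conn.execute(
        "SELECT taken_at, value FROM samples "
        "WHERE metric = ? AND taken_at >= ? ORDER BY taken_at",
        (metric, since),
    )
    return rows.fetchall()
```

With samples for, say, `cpu_percent` and `rack_intake_temp_c` in the same table, you can query both around the time of a failure and look for a pattern.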
Provide ETAs on resolutions when things fail and as you resolve an issue.
Listing planned and scheduled maintenance (e.g. code rollouts, server reboots etc.) allows you to prevent support calls.
You can automate many things from a status page. If a certain event happens you can attempt an automated resolution (e.g. reboot a server) or let diagnosing staff know a resolution has happened.
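One way to wire this up is a lookup table from event names to fix-it functions. The Python sketch below is only an outline under my own assumptions: the event names, the `handle_event` function and the `notify` callback are all hypothetical, and the actual reboot or restart command is left to you.

```python
def handle_event(event, remediations, notify):
    """Try the automated fix registered for an event and tell staff the outcome.

    remediations maps an event name to a function that attempts the fix.
    notify is any callable that takes a message (email, SMS, chat, ...).
    Returns True only if a fix ran without raising an error.
    """
    action = remediations.get(event)
    if action is None:
        notify(f"{event}: no automated fix available, needs a human")
        return False
    try:
        action()
    except Exception as err:  # report any failure rather than crash the monitor
        notify(f"{event}: automated fix failed ({err})")
        return False
    notify(f"{event}: automated fix applied")
    return True
```

For example, `handle_event("web-server-hung", {"web-server-hung": reboot_web_server}, send_sms)` would call your own `reboot_web_server` function and then message staff either way, so nobody wastes time diagnosing a fault that has already been resolved.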
You can automatically change email autoresponder text (mentioning things are down) when you reply to incoming emails and tickets, and/or automatically post status changes to social media. Automatically informing users (instead of ignoring and burying problems) goes a long way to building trust.
You can automate notifications of potential problems to internal staff from the status page and automatically inform key staff when certain things happen (e.g. when secure certificates are about to expire, when the network or API is overloaded, or when the network is congested).
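Certificate expiry is a good candidate for this because the expiry date is known well in advance. Below is a minimal Python sketch using the standard `ssl` and `socket` modules. The function names and the idea of warning at some number of days are my assumptions. Pick a threshold that suits your renewal process.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left given a certificate 'notAfter' string such as
    'Jan  5 09:34:43 2018 GMT' (negative if it has already expired)."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400


def fetch_not_after(hostname, port=443, timeout=5.0):
    """Connect to a TLS server and return its certificate's 'notAfter' string."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]
```

A daily job could call `days_until_expiry(fetch_not_after("status.youproduct.com"))` and alert key staff when the result drops below, say, 14. Note that with the default context an already expired certificate fails the connection with an error, which is itself worth alerting on.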
A good status page will list when the status was last updated (e.g. 3 minutes ago).
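Turning the age of the last check into text like "3 minutes ago" takes only a few lines. This Python sketch is one possible wording. The function name `time_ago` and the "just now" cut-off of one minute are my choices.

```python
def time_ago(seconds):
    """Describe an age in seconds as 'just now', '3 minutes ago', etc."""
    seconds = int(max(0, seconds))
    if seconds < 60:
        return "just now"
    # Largest unit first, so 90000 seconds reads as "1 day ago".
    for unit, size in (("day", 86400), ("hour", 3600), ("minute", 60)):
        if seconds >= size:
            count = seconds // size
            plural = "" if count == 1 else "s"
            return f"{count} {unit}{plural} ago"
```

For example, `time_ago(180)` returns `"3 minutes ago"`. If the age keeps growing past your check interval, that is a hint the monitor itself has stopped.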
Statistics and Graphs
Statistics like uptime and historical graphs (uptime and latency) help you keep track of reliability trends.
Inform your users when everything is OK.
Don’t forget to inform staff when everything is up. Generally, staff will stop using a product or service until a system is back up, and users will not sit there pressing refresh for long. Offer web push or RSS feeds.
Improve your documentation. Having good documentation (and known past problems and resolutions) on hand will allow for quicker resolutions in future.
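If you record the result of each check as up or down, an uptime figure is simple arithmetic: successful checks divided by total checks. A small Python sketch (the function name is mine, and it assumes checks are taken at regular intervals):

```python
def uptime_percent(checks):
    """Percentage of checks that passed; checks is a list of True/False.

    Returns None when there is no data, rather than pretending 0% or 100%.
    """
    if not checks:
        return None
    passed = sum(1 for ok in checks if ok)
    return 100.0 * passed / len(checks)
```

Three passes and one failure, `uptime_percent([True, True, True, False])`, gives `75.0`.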
Allow customers to subscribe to status changes (via RSS) or use dedicated status accounts on social media. Providing a JSON feed also shows your commitment to openness about your service.
Adding website headers to inform users of upcoming outages is a good idea. The Department of Industry, Innovation and Science does it right.
You should also set up social media status accounts and pin status information like civocloud do.
Allowing customers to see your past problems (description, date, time and resolution) lets the customer know the risks and allows you to focus on remediation.
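A JSON feed can be as simple as a list of service names and states plus the time it was generated. The Python sketch below shows one possible shape. The field names (`generated_at`, `services`, `name`, `status`) are my assumptions and not any standard format.

```python
import json
from datetime import datetime, timezone


def build_status_feed(services, generated_at=None):
    """Return a JSON string for a list of (name, is_up) pairs."""
    if generated_at is None:
        generated_at = datetime.now(timezone.utc)
    payload = {
        "generated_at": generated_at.isoformat(),
        "services": [
            {"name": name, "status": "up" if is_up else "down"}
            for name, is_up in services
        ],
    }
    return json.dumps(payload, indent=2)
```

Serve the output of `build_status_feed([("Online forum", True), ("App API", False)])` at a fixed URL and customers can poll it from their own dashboards.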
Example Status Pages
Apple Status Page ( systems, validity, ticket ).
AWS Status Page ( history, more, regions, subscribe, validity ).
DigitalOcean Status Page ( history, description and resolution ).
Rackspace Status Page ( general notices, current status, maintenance ).
Heroku Status Page ( history, apps, tools, services, subscribe ).
Discord Status Page ( services, history ).
Google Cloud Status Page ( history, description and resolution ).
Shopify Status Page ( response times, services, validity, subscribe, history ).
PlayStation Status Page ( services ).
GitHub Status Page ( validity, response time, history ).
TeamViewer Status Page ( validity, services, history, subscribe ).
Office 365 Status Page ( services ).
G Suite Status Page ( services, history ).
Telstra Status Page ( status, webpage ).
Commercial Status Page Services
If you are not into developing a custom status page you can use a commercial status page service (but they are expensive),
e.g. https://www.statuspage.io at $49 a month (Atlassian owned).
V1.2 data and analysis