UPDATED Jan 2017.
If you have read my other guides on http://www.fearby.com you may tell I like the self-managed Ubuntu servers you can buy from Digital Ocean for as low as $5 a month (click here to get $10 in free credit and start your server in 5 minutes ). Digital ocean is a great place to start up your own server in the cloud, install some software and deploy some web apps or backend (API/databases/content) for mobile apps or services. If you need more memory, processor cores or hard drive storage simple shutdown your Digital Ocean server, click a few options to increase your server resources and you are good to go (this is called “scaling up“).
This scalability guide is a work in progress (watch this space). My aim is to get 2000 concurrent users a second serving geo queries (like PokeMon Go) for under $80 a month (1x server and 1x mongoDB cluster). Currently serving 600~1200/sec.
Planning for success
Anyone who has researched application scalability has come across articles on apps that have launched and crashed under load at launch. Even governments can spend tens of millions on developing a scalable solution, plan for years and fail dismally on launch (check out the Australian Census disaster). The Australian government contracted IBM to develop a solution to receive up to 15 million census submissions between the 28th of July to the 5th of September. IBM designed a system and a third party performance test planned up to 400 submissions a second but the maximum submissions received on census night before the system crashed was only o154 submissions a second. Predicting application usage can be hard, in the case of the Australian census the bulk of people logged on to submit census data on the recommended night of the 9th of August 2016.
Sticking to a budget
This guide is not for people with deep pockets wanting to deploy a service to 15 million people but for solo app developers or small non-funded startups on a serious budget . If you want a very reliable scalable solution or service provider you may want to skip this article and check out services by the following vendors.
- Azure (good guides by Troy Hunt: here, here and here).
- Amazon Web Services
- Google Cloud
- NGINX Plus
The above vendors have what seems like an infinite array of products and services that can form part of your solution but beware, the more products you use the more complex it will be and the higher the costs. A popular app can be an expensive app. That’s why I like Digital Ocean as you don’t need a degree to predict and plan you servers average usage and buy extra resource credits if you go over predicted limits. With Digital Ocean you buy a virtual server and you get known Memory, Storage and Data transfer limits.
Let’s go over topics that you will need to consider when designing or building a scalable app on a budget.
Your application needs will ultimately decide the technology and servers you require.
- A simple business app that shares events, products and contacts would require a basic server and MySQL database.
- A turn-based multiplayer app for a few hundred people would require more server resources and endpoints (a NGINX, NODEJS and an optimised MySQL database would be ok).
- A larger augmented reality app for thousands of people would require a mix of databases and servers to separate the workload (a NGINX webserver and NodeJS powered API talking to a MySQL database to handle logins and a single server NoSQL database for the bulk of the shared data).
- An augmented reality app with tens of thousands of users (a NGINX web server, NodeJS powered API talking to a MySQL database to handle logins and NoSQL cluster for the bulk of the shared data).
- A business critical multi-user application with real-time chat – are you sure you are on a budget as this will require a full solution from Azure Firebase or Amazon Web Serers.
A native app, hybrid app or full web app can drastically change how your application works ( learn the difference here ).
Location, location, location.
You want your server and resources to be as close to your customers as possible, this is one rule that cannot be broken. If you need to spend more money to buy a server in a location closer to your customers do it.
I have a Digital Ocean server with 2 cores and 2GB of ram in Singapore that I use to test and develop apps. That one server has MySQL, NGINX, NodeJS , PHP and many scripts running on it in the background. I also have a MongoDB cluster (3 servers) running on AWS in Sydney via MongoDB.com. I looked into CouchDB via Cloudant but needed the Geo JSON features with fair dedicated pricing. I am considering moving a Ubuntu server off Digital Ocean (in Singapore) and onto AWS server (in Sydney). I am using promise based NodeJS calls where possible to prevent non blocking calls to the operating system, database or web.
Here is a benchmark for HTTP and HTTPS request from Rural NSW to Sydney Australia, then Melbourne, then Adelaide the Perth then Singapore to a Node Server on a NGINX Server that does a call back to Sydney Australia to get a GeoQuery from a large database and return to back to the customer via Singapore.
Here is a breakdown of the hops from my desktop in Regional NSW making a network call to my Digital Ocean Server in Singapore (with private parts redacted or masked).
<span class="kd">traceroute to destination-server-redacted.com (###.###.###.##), 64 hops max, 52 byte packets
1 192-168-1-1 (192.168.1.1) 11.034 ms 6.180 ms 2.169 ms
2 xx.xx.xx.xxx.isp.com.au (xx.xx.xx.xxx) 32.396 ms 37.118 ms 33.749 ms
3 xxx-xxx-xxx-xxx (xxx.xxx.xxx.xxx) 40.676 ms 63.648 ms 28.446 ms
4 syd-gls-har-wgw1-be-100 (18.104.22.168) 38.736 ms 38.549 ms 29.584 ms
5 203-219-107-198.static.tpgi.com.au (22.214.171.124) 27.980 ms 38.568 ms 43.879 ms
6 tengige0-3-0-19.chw-edge901.<strong>sydney</strong>.telstra.net (126.96.36.199) 30.304 ms 35.090 ms 43.836 ms
7 bundle-ether13.chw-core10.sydney.telstra.net (188.8.131.52) 29.477 ms 28.705 ms 40.764 ms
8 bundle-ether8.exi-core10.<strong>melbourne</strong>.telstra.net (184.108.40.206) 41.885 ms 50.211 ms 45.917 ms
9 bundle-ether5.way-core4.<strong>adelaide</strong>.telstra.net (220.127.116.11) 66.795 ms 59.570 ms 59.084 ms
10 bundle-ether5.pie-core1.<strong>perth</strong>.telstra.net (18.104.22.168) 90.671 ms 91.315 ms 89.123 ms
11 22.214.171.124 (126.96.36.199) 80.295 ms 82.578 ms 85.224 ms
12 i-0-0-1-0.skdi-core01.bx.telstraglobal.net (<strong>Singapore) </strong>(188.8.131.52) 132.445 ms 129.205 ms 147.320 ms
13 i-0-1-0-0.istt04.bi.telstraglobal.net (184.108.40.206) 156.488 ms
220.127.116.11 (18.104.22.168) 161.982 ms
i-0-0-0-4.istt04.bi.telstraglobal.net (22.214.171.124) 160.952 ms
14 unknown.telstraglobal.net (126.96.36.199) 155.392 ms 152.938 ms 197.915 ms
15 * * *
16 destination-server-redacted.com (xx.xx.xx.xxx) <strong>177.883 ms 158.938 ms 153.433 ms</strong></span>
160ms to send a request to the server. This is on a good day when the Netflix Effect is not killing links across Australia.
Here is the route for a call from the server above to the MongoDB Cluster on an Amazon Web Services in Sydney from the Digital Ocean Server in Singapore.
<span class="kd">traceroute to redactedname-shard-00-00-nvjmn.mongodb.net (##.##.##.##), 30 hops max, 60 byte packets
1 ###.###.###.### (###.###.###.###) 0.475 ms ###.###.###.### (###.###.###.###) 0.494 ms ###.###.###.### (###.###.###.###) 0.405 ms
2 188.8.131.52 (184.108.40.206) 0.367 ms 220.127.116.11 (18.104.22.168) 0.392 ms 0.377 ms
3 unknown.telstraglobal.net (22.214.171.124) 1.460 ms 126.96.36.199 (188.8.131.52) 0.283 ms unknown.telstraglobal.net (184.108.40.206) 1.456 ms
4 i-0-2-0-10.istt-core02.bi.telstraglobal.net (220.127.116.11) 1.338 ms i-0-4-0-0.istt-core02.bi.telstraglobal.net (18.104.22.168) 3.817 ms unknown.telstraglobal.net (22.214.171.124) 1.443 ms
5 i-0-2-0-9.istt-core02.bi.telstraglobal.net (126.96.36.199) 1.270 ms i-0-1-0-0.pthw-core01.bx.telstraglobal.net (188.8.131.52) 50.869 ms i-0-0-0-0.pthw-core01.bx.telstraglobal.net (184.108.40.206) 49.789 ms
6 i-0-1-0-5.sydp-core03.bi.telstraglobal.net (220.127.116.11) 107.395 ms 108.350 ms 105.924 ms
7 i-0-1-0-5.sydp-core03.bi.telstraglobal.net (18.104.22.168) 105.911 ms 21459.tauc01.cu.telstraglobal.net (22.214.171.124) 108.258 ms 107.337 ms
8 21459.tauc01.cu.telstraglobal.net (126.96.36.199) 107.330 ms unknown.telstraglobal.net (188.8.131.52) 101.459 ms 102.337 ms
9 * unknown.telstraglobal.net (184.108.40.206) 102.324 ms 102.314 ms
10 * * *
11 220.127.116.11 (18.104.22.168) 103.016 ms 103.892 ms 105.157 ms
12 * * 22.214.171.124 (126.96.36.199) 103.843 ms
13 * * *
14 * * *
15 * * *
16 * * *
17 * * *
18 * * *
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * * *
25 * * *
26 * * *
27 * * *
28 * * *
29 * * *
30 * * *</span>
It appears Telstra Global or AWS block the tracking of network path closer to the destination so I will ping to see how long the trip takes
<span class="kd">bytes from ec2-##-##-##-##.ap-southeast-2.compute.amazonaws.com (##.##.##.##): icmp_seq=1 ttl=50 <strong>time=103 ms</strong></span>
It is obvious the longest part of the response to the client is not the GeoQuery on the MongoDB cluster or processing in NodeJS but the travel time for the packet and securing the packet.
My server locations are not optimal, I cannot move the AWS MongoDB to Singapore because MongoDB don’t have servers in Singapore and Digital Ocean don’t have servers in Sydney. I should move my services on Digital Ocean to Sydney but for now lets let’s see how far this Digital Ocean Server in Singapore and MongoDB cluster in Sydney can go.
Secure (SSL) communication is now mandatory for Apple and Android apps talking over the internet so we can’t eliminate that to speed up the connection but we can move the server. I am using more modern SSL ciphers in my SSL certificate so they may slow down the process also. Here is a speed test of my servers cyphers. If you use stronger security so I expect a small CPU hit.
fyi: I have a few guides on adding a commercial SSL certificate to a Digital Ocean VM and Updating OpenSSL on a Digital Ocean VM. Guide on configuring NGINX SSL and SSL. Limiting ssh connection rates to prevent brute force attacks.
Server Limitations and Benchmarking
If you are running your website on a shared server (e.g CPanel domain) you may encounter resource limit warnings as web hosts and some providers want to charge you more for moderate to heavy use.
<span class="kd">Resource Limit Is Reached 508
The website is temporarily unable to service your request as it exceeded resource limit. Please try again later.
I have never received a resource limit reached warning with Digital Ocean.
Most hosts (AWS/Digital Ocean/Azure etc) all have limitations on your server and when you exceed a magical limit they restrict your server or start charging excess fees (they are not running a charity). AWS and Azure have different terminology for CPU credits and you really need to predict your applications CPU usage to factor in the scalability and monthly costs. Servers and databases generally have a limited IOPS (Input/Output operations a second) and lower tier plans offer lower IOPS. MongoDB Atlas lower tiers have < 120 IOPS a second, middle tiers have 240~2400 IOPS and higher tiers have 3,000,20,000 IOPS
Know your bottlenecks
The siege HTTP stress testing tool is good, the below command will throw 400 local HTTP connections to your website.
siege -t1m -c400 'http://your.server.com/page'
The results seem a bit low: 47.3 trans/sec. No failed transactions though
** SIEGE 3.0.5
** Preparing 400 concurrent users for battle.
The server is now under siege...
Lifting the server siege.. done.
Transactions: 2803 hits
Availability: 100.00 %
Elapsed time: 59.26 secs
Data transferred: 79.71 MB
Response time: 7.87 secs
Transaction rate: 47.30 trans/sec
Throughput: 1.35 MB/sec
Successful transactions: 2803
Failed transactions: 0
Longest transaction: 8.56
Shortest transaction: 2.37
Sites like http://loader.io/ allow you to hit your web server or web page with many hits a second from outside of your server. Below I threw 50 concurrent users at a node API endpoint that was hitting a geo query on my MongoDB cluster.
The server can easily handle 50 concurrent users a second. Latency is an issue though.
I can see the two secondary MongoDB servers being queried 🙂
Node has decided to only use one CPU under this light load.
I tried 100 concurrent users over 30 seconds. CPU activity was about 40% of one core.
I tried again with a 100-200 concurrent user limit (passed). CPU activity was about 50% using two cores.
I tried again with a 200-400 concurrent user limit over 1 minute (passed). CPU activity was about 80% using two cores.
It is nice to know my promised based NodeJS code can handle 400 concurrent users requesting a large dataset from GeoJSON without timeouts. The result is about the same as siege (47.6 trans/sec) The issue now is the delay in the data getting back to the user.
I checked the MongoDB cluster and I was only reaching 0.17 IOPS (maximum 100) and 16% CPU usage so the database cluster is not the bottleneck here.
Out of curiosity, I ran a 400 connection benchmark to the node server over HTTP instead of HTTPS and the results were near identical (400 concurrent connections with 8000ms delay).
I really need to move my servers closer together to avoid the delays in responding. 47.6 served geo queries (4,112,640 a day) a second with a large payload is ok but it it not good enough for my application yet.
I may limit access to my API based on geo lookups ( http://ipinfo.io is a good site that allows you to programmatically limit access to your app services) and auth tokens but this will slow down uncached requests.
I can always add more cores or memory to my server in minutes but that requires a shutdown. 400 concurrent users does max my CPU and raise the memory to above 80% so adding more cores and memory would be beneficial.
Digital Ocean does allow me to permanently or temporarily raise the resources of the virtual machine. To obtain 2 more cores (4) and 4x the memory (8GB) I would need to jump to the $80/month plan and adjust the NGINX and Node configuration to use the more cores/ram.
If my app is profitable I can certainly reinvest.
With MongoDB clusters I can easily clone ( shard ) a cluster and gain extra throughput if I need it, but with 0.17% of my existing cluster being utilised I should focus on moving servers closer together.
NGINX do have commercial level products that handle scalability but this costs thousands. I could scale out manually by setting up a Node Proxies to point to multiple servers that receive parent calls. This may be more beneficial as Digital Ocean servers start at $5 a month but this would add a whole lot of complexity.
- Nginx Caching
- OpCache if you are using PHP.
- Node-cache – In memory caching.
- Redis – In memory caching.
Monitoring your server and resources is essential in detecting memory leaks and spikes in activity. HTOP is a great monitoring tool from the command line in Linux.
http://pm2.keymetrics.io/ is a good node package monitoring app but it does go a bit crazy with processes on your box.
It is a good idea to inform users of server status and issues with delayed queries and when things go down inform people early.
UPDATE: 17th August 2016
I set up an Amazon Web Services ECS server ( read AWS setup guide here ) with only 1 CPU and 1GB ram and have easily achieved 700 concurrent connections. That’s 41,869 geo queries served a minute.
The MongoDB Cluster CPU was 25% usage with 200 query opcounters on each secondary server.
I think I will optimise the AWS OS ‘swappiness’ and performance stats and aim for 2000 queries.
This is how many hits I can get with the CPU remaining under 95% (794 geo serves a second). AMAZING.
Another recent benchmark:
UPDATE: 3rd Jan 2017
I decided to ditch the cluster of three AWS servers running MongoDB and instead setup a single MongoDB instance on a Amazon t2.medium server (2 CPU/4GB ram) server for about $50 a month. I can always upgrade to the AWS MongoDB cluster later if I need it.
Ok, I just threw 2000 concurrent users at the new AWS single server MongoDB server and the server was able to handle the delivery (no dropped connections but the average response time was 4,027 ms, this is not ideal but this is 2000 users a second (and that is after API handles the ip (banned list), user account validity, last 5 min query limit check (from MySQL), payload validation on every field and then MongoDB geo query).
The two cores on the server were hitting about 95% usage. The benchmark here is the same dataset as above but the API has a whole series of payload, user limiting and logging
Benchmarking with 1000 maintained users a second the average response time is a much lower 1,022 ms. Honestly if I have 1000-2000 users queries a second I would upgrade the server or add in a series of lower spec AWS t2.miro servers and create my own cluster.
If this guide has helped please consider donating a few dollars.