08-04 Server Migration

2019-08-04

This weekend, I was finally ready to do some much-needed devops work.

Problem

The main AVWX API servers and cache were hosted in separate regions. This was due to poor planning early on, when the goal was just getting things to work and I was handling 10k requests per day instead of per minute.

The caching layer was implemented with Azure CosmosDB accessed through a MongoDB client. This worked well for high-read, low-write caching because CosmosDB pricing is driven by provisioned throughput, and writes cost far more request units than reads. However, while you don't need to worry about clusters or migrations, the table throughput would not autoscale. It would also become too expensive to add metric counting, because analytics workloads are low-read, high-write.
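For reference, the pattern in play is plain read-through caching. Here is a minimal sketch in Python using pymongo, with hypothetical collection and field names; the same client code works against CosmosDB's MongoDB API or a real MongoDB cluster.

    from datetime import datetime, timedelta
    from typing import Optional

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # placeholder URI
    cache = client["avwx"]["metar_cache"]  # hypothetical names

    CACHE_TTL = timedelta(minutes=2)

    def get_report(station: str) -> Optional[dict]:
        """Return a cached report if it is still fresh, else None (cache miss)."""
        doc = cache.find_one({"_id": station})
        if doc and datetime.utcnow() - doc["timestamp"] < CACHE_TTL:
            return doc["report"]
        return None

    def set_report(station: str, report: dict) -> None:
        """Upsert the latest report for a station: one write per upstream fetch."""
        cache.update_one(
            {"_id": station},
            {"$set": {"report": report, "timestamp": datetime.utcnow()}},
            upsert=True,
        )

Every cache hit is a read; only misses trigger a write, which is what made write-heavy CosmosDB pricing tolerable for this workload.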

Solution

Replace CosmosDB with a traditional database cluster and host it in the same data center as the API.

Process

First, I replaced CosmosDB with a MongoDB cluster hosted in Azure US East 2 and managed by MongoDB Atlas. I was already using the Mongo client, so no code changes were necessary. Because CosmosDB was only holding cache data, I did not need to worry about data migration either.
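In practice the swap amounts to changing a connection string. A sketch with placeholder hosts and credentials; the URI formats shown are typical for each service, not AVWX's actual values:

    from pymongo import MongoClient

    # Before: CosmosDB exposing the MongoDB wire protocol (placeholder account/key)
    # client = MongoClient("mongodb://account:<key>@account.documents.azure.com:10255/?ssl=true")

    # After: MongoDB Atlas cluster in Azure US East 2 (placeholder user/password/host)
    client = MongoClient("mongodb+srv://user:<password>@cluster0.mongodb.net/avwx")

    # Everything downstream of the client stays untouched
    cache = client["avwx"]["metar_cache"]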

Next, I created a new Azure Web App in US East 2 to host the Docker container. The configuration was identical to the previous server's, but Azure does not support duplicating a Linux container app between regions, so it had to be rebuilt by hand. Once the server was properly configured, the A record was updated to point to the new IP address.

A happy accident I discovered was that traffic slowly migrated to the new server as the updated record propagated through the various DNS systems. This let the new cache build up gracefully and gave me time to quickly address any issues that arose during the switch.
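One way to watch a cutover like this from the outside is to poll what the record currently resolves to. A stdlib-only sketch; the hostname is illustrative:

    import socket
    import time

    # Print which IP the A record resolves to (per this machine's resolver) once a minute
    while True:
        print(time.strftime("%H:%M:%S"), socket.gethostbyname("api.avwx.rest"))
        time.sleep(60)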

Result

I now have the API and database in the same region. This cut the average response time from 48ms down to 11ms. As of this post, 85% of traffic has moved to the new server in the first 16 hours, with the old server scaled down to handle the remainder. It will eventually be deleted as the traffic nears zero. There are still some 5xx errors popping up that inflate the average response time, but these are erratic and should either disappear or be addressed later since they represent less than 0.001% of traffic.

Next Steps

Now that the database is friendlier to high-write data, I need to enable the station counting system I previously tested.
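Station counting is exactly the low-read, high-write workload that motivated the move, and a natural fit is an atomic $inc upsert per request. A minimal sketch with a hypothetical collection and schema:

    from datetime import datetime

    from pymongo import MongoClient

    client = MongoClient("mongodb+srv://user:<password>@cluster0.mongodb.net/avwx")
    counts = client["avwx"]["station_counts"]  # hypothetical collection

    def count_request(station: str) -> None:
        """Record one lookup for a station, bucketed by day (hypothetical schema)."""
        day = datetime.utcnow().strftime("%Y-%m-%d")
        counts.update_one(
            {"_id": f"{station}:{day}"},
            {"$inc": {"count": 1}},
            upsert=True,
        )

Each request becomes a single small write, which a standard MongoDB cluster absorbs far more cheaply than request-unit-billed CosmosDB.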