Site Reliability Engineering at Starship

Photo by Ben Davis, Instagram slovaceck_

Running autonomous robots on city streets is very much a software engineering challenge. Some of this software runs on the robot itself, but a lot of it runs in the backend: remote control, path finding, matching robots to customers, and fleet health management, as well as interactions with customers and merchants. All of this needs to run 24×7 without interruptions and scale dynamically to match the workload.

SRE at Starship is responsible for providing the cloud infrastructure and platform services for running these backend services. We’ve standardized on Kubernetes for our microservices and run it on top of AWS. MongoDB is the primary database for most backend services, but we also like PostgreSQL, especially where strong typing and transactional guarantees are required. For async messaging, Kafka is the platform of choice, and we use it for pretty much everything aside from shipping video streams from robots. For observability we rely on Prometheus and Grafana, Loki, Linkerd and Jaeger. CI/CD is handled by Jenkins.

A good portion of SRE time is spent maintaining and improving the Kubernetes infrastructure. Kubernetes is our main deployment platform and there’s always something to improve, be it fine-tuning autoscaling settings, adding Pod disruption budgets or optimizing Spot instance usage. Sometimes it is like laying bricks: simply installing a Helm chart to provide particular functionality. But oftentimes the “bricks” must be carefully picked and evaluated (is Loki good for log management, is a service mesh worth adopting and, if so, which one), and occasionally the functionality doesn’t exist anywhere and has to be written from scratch. When this happens we usually turn to Python and Golang, but also Rust and C when needed.

Another big piece of infrastructure that SRE is responsible for is data and databases. Starship started out with a single monolithic MongoDB, a strategy that has worked well so far. However, as the business grows we need to revisit this architecture and start thinking about supporting robots by the thousand. Apache Kafka is part of the scaling story, but we also need to figure out sharding, regional clustering and microservice database architecture. On top of that we are constantly developing tools and automation to manage the current database infrastructure. Examples: adding MongoDB observability with a custom sidecar proxy that analyzes database traffic, enabling PITR (point-in-time recovery) support for databases, automating regular failover and recovery tests, collecting metrics for Kafka re-sharding, and enabling data retention.

Finally, one of the most important goals of Site Reliability Engineering is to minimize downtime for Starship’s production. While SRE is occasionally called out to deal with infrastructure outages, the more impactful work goes into preventing outages and ensuring that we can recover quickly. This is a very broad topic, ranging from having rock-solid K8s infrastructure all the way to engineering practices and business processes. There are great opportunities to make an impact!

A day in the life of an SRE

Arriving at work, some time between 9 and 10 (sometimes working remotely). Grab a cup of coffee, check Slack messages and emails. Review alerts that fired during the night, see if there’s anything interesting there.

Find that MongoDB connection latencies have spiked during the night. Digging into the Prometheus metrics with Grafana, find that this is happening while backups are running. Why is this suddenly a problem? We’ve run those backups for ages. Turns out that we’re very aggressively compressing the backups to save on network and storage costs, and this is consuming all available CPU. It looks like the load on the database has grown just enough to make this noticeable. This is happening on a standby node and is not impacting production, but it would still be a problem should the primary fail. Add a Jira item to fix this.
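The trade-off behind that finding, smaller backups in exchange for more CPU, can be sketched in a few lines. This is only an illustration: the compressor and settings used for the actual backups are not stated here, so zlib, the chosen levels, the `compare_levels` helper and the synthetic payload are all assumptions.

```python
import time
import zlib


def compare_levels(payload, levels=(1, 6, 9)):
    """Compress payload at each zlib level; return (level, size, cpu_seconds) tuples."""
    results = []
    for level in levels:
        start = time.process_time()  # CPU time consumed, not wall-clock time
        compressed = zlib.compress(payload, level)
        cpu_seconds = time.process_time() - start
        results.append((level, len(compressed), cpu_seconds))
    return results


if __name__ == "__main__":
    # Synthetic, repetitive stand-in for a backup stream (hypothetical data).
    sample = b"".join(b"robot=%d status=ok battery=87\n" % i for i in range(50_000))
    for level, size, cpu in compare_levels(sample):
        print(f"level={level} size={size} cpu={cpu:.3f}s")
```

On real backup data the numbers will differ, but comparing size against CPU time per level is one way to decide how much compression a standby node can afford.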

In passing, change the MongoDB prober code (Golang) to add more histogram buckets to get a better understanding of the latency distribution. Run a Jenkins pipeline to roll the new probe out to production.
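Why more buckets help can be shown with a small standard-library sketch. The prober itself is written in Go and is not reproduced here; the bucket boundaries, sample latencies and the `cumulative_buckets` helper below are hypothetical. The counting follows the cumulative "less than or equal" convention that Prometheus histograms use.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds, in seconds.
COARSE_BOUNDS = [0.005, 0.05, 0.5]
FINE_BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.5]


def cumulative_buckets(samples, bounds):
    """Return cumulative counts, one per upper bound plus a final +Inf bucket."""
    per_bucket = [0] * (len(bounds) + 1)
    for sample in samples:
        # bisect_left returns the index of the first bound >= sample,
        # i.e. the smallest bucket whose upper bound still contains it.
        per_bucket[bisect_left(bounds, sample)] += 1
    cumulative, running = [], 0
    for count in per_bucket:
        running += count
        cumulative.append(running)
    return cumulative


if __name__ == "__main__":
    latencies = [0.002, 0.004, 0.012, 0.03, 0.045, 0.2]  # made-up samples
    print("coarse:", cumulative_buckets(latencies, COARSE_BOUNDS))
    print("fine:  ", cumulative_buckets(latencies, FINE_BOUNDS))
```

With the coarse bounds, three of the six samples land somewhere between 5 ms and 50 ms with no further detail; the finer bounds split that range, which is what makes quantile estimates over the histogram more precise.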

At 10 am there’s a standup meeting. Share your updates with the team and learn what others have been up to: setting up monitoring for a VPN server, instrumenting a Python app with Prometheus, setting up ServiceMonitors for external services, debugging MongoDB connectivity issues, piloting canary deployments with Flagger.

After the meeting, resume the planned work for the day. One of the things I planned to do today was to set up an additional Kafka cluster in a test environment. We’re running Kafka on Kubernetes, so it should be straightforward to take the existing cluster YAML files and tweak them for the new cluster. Or, on second thought, should we use Helm instead, or maybe there’s a good Kafka operator available now? No, not going there: too much magic, I want more explicit control over my StatefulSets. Raw YAML it is. An hour and a half later a new cluster is running. The setup was fairly straightforward; just the init containers that register Kafka brokers in DNS needed a config change. Generating the credentials for the applications required a small bash script to set up the accounts on Zookeeper. One bit that was left dangling was setting up Kafka Connect to capture database change log events. Turns out that the test databases are not running in replica set mode and Debezium cannot read the oplog from them. Backlog this and move on.

Now it is time to prepare a scenario for the Wheel of Misfortune exercise. At Starship we’re running these to improve our understanding of systems and to share troubleshooting techniques. It works by breaking some part of the system (usually in test) and having some misfortunate person try to troubleshoot and mitigate the problem. In this case I’ll set up a load test with hey to overload the microservice for route calculations. Deploy this as a Kubernetes job called “haymaker” and hide it well enough so that it doesn’t immediately show up in the Linkerd service mesh (yes, evil 😈). Later run the “Wheel” exercise and take note of any gaps that we have in playbooks, metrics, alerts etc.

In the last few hours of the day, block all interrupts and try to get some coding done. I’ve reimplemented the Mongoproxy BSON parser as a streaming, asynchronous parser (Rust+Tokio) and want to figure out how well this works with real data. Turns out there’s a bug somewhere in the parser guts and I need to add deep logging to figure this out. Find a wonderful tracing library for Tokio and get carried away with it …
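The core idea of a streaming BSON parser can be sketched in a few lines. This is not the Mongoproxy implementation, which is asynchronous Rust on Tokio; it is a synchronous Python illustration of the framing step only. It relies on the BSON rule that every document starts with a little-endian int32 holding its total length and ends with a zero byte. The `BsonFramer` class and its size cap are hypothetical.

```python
import struct


class BsonFramer:
    """Incrementally split a byte stream into raw BSON documents."""

    MIN_DOC_SIZE = 5  # 4-byte length prefix + 1-byte terminator (empty document)

    def __init__(self, max_doc_size=16 * 1024 * 1024):
        self._buf = bytearray()
        self._max_doc_size = max_doc_size  # assumed safety cap, not a parser requirement

    def feed(self, chunk):
        """Buffer a chunk of any size; return the documents completed by it."""
        self._buf.extend(chunk)
        docs = []
        while len(self._buf) >= 4:
            (size,) = struct.unpack_from("<i", self._buf, 0)
            if size < self.MIN_DOC_SIZE or size > self._max_doc_size:
                raise ValueError(f"implausible BSON document size: {size}")
            if len(self._buf) < size:
                break  # document not complete yet, wait for more bytes
            doc = bytes(self._buf[:size])
            if doc[-1] != 0:
                raise ValueError("BSON document is missing its 0x00 terminator")
            del self._buf[:size]
            docs.append(doc)
        return docs


if __name__ == "__main__":
    empty_doc = b"\x05\x00\x00\x00\x00"
    framer = BsonFramer()
    print(framer.feed(empty_doc[:2]))  # not enough bytes yet
    print(framer.feed(empty_doc[2:]))  # the document is now complete
```

Because the framer keeps only the bytes of the document currently in flight, chunks can arrive at arbitrary boundaries, which is the property a proxy sitting on a live database connection needs.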

Disclaimer: the events described here are based on a true story. Not all of it happened on the same day. Some meetings and interactions with coworkers have been edited out. We are hiring.