Autonomous robots out in the wild — a software engineering challenge

Starship is bringing autonomous delivery to the world. We are here to solve the last-mile delivery problem with fleets of sidewalk delivery robots. In 2017, Starship became the first company to operate autonomous delivery robots in public spaces without safety drivers. Today Starship is the leader in the autonomous delivery space and has completed more than 500,000 deliveries to customers. We’re operating in 5 countries (including the US and the UK) with hundreds of robots delivering 7 days a week, 365 days a year. More information about the company background is available here.

People may think Starship is all about robotics. In fact, only a handful of engineers at Starship work directly on that. We are building a rich set of products to automate delivery, and Starship provides multiple challenges to solve, from both a hardware and a software point of view. At Starship, we design and build our robots, both the hardware and the embedded software. We’re also building the backend infrastructure and services to communicate with the robots. On top of that, we have the marketplace connecting the consumers of the service with the merchants. The robots are autonomous, but in some situations human help is needed. Remote operations enable us to deal with the trickier situations where automation is too risky or technically expensive. We also have people in the service areas where we operate, ready to help the robots out and charge them overnight. Since the robots need centimetre-level accuracy, we have built our own GIS-based 3D mapping solution (you can read more about it on the Starship Engineering blog).

What makes Starship a great place for engineers is the broad range of engineering disciplines we cover. For example, our autonomous driving department is building machine-learning-based solutions that give the robots the intelligence to drive autonomously. We also have the infrastructure to communicate with the robots in a time-critical manner, including access to all the sensor data from the robot, such as the video streams.

Our fleet management solutions match the right robot with the right tasks to optimise delivery times and fulfil the ETA promises we’ve made to our consumers. We’re also building our own marketplace, which is a full-blown e-commerce challenge. The marketplace handles the order flow end-to-end, from the moment consumers place an order through our consumer application all the way to it being fulfilled by the kitchens or grocery stores. To summarise, Starship is full of fascinating engineering challenges, and many of them haven’t yet been solved by any other company in the world:

Our Marketplace team is responsible for connecting the two-sided market of consumers and merchants. Essentially, it’s about building a full-blown e-commerce solution on top of our robot infrastructure. The consumer mobile application is the first point of contact with Starship for most of our customers. It allows them to select their favourite restaurants or grocery stores, see the ETAs (estimated times of arrival), fill the shopping basket, make the payment and track the delivery. Once the robot has arrived, customers unlock it using the application.

Our marketplace includes order state management and a range of payment integrations (varying from credit card payments to US university dining dollars). Merchants use the fulfilment solutions we provide, such as the kitchen and runner applications, to accept an order and manage its state. They also use them to physically interact with the robots: scanning to identify the right robot, unlocking and loading it, and sending it on its journey towards the customer. In addition, the team handles the business logic, inventory and stock management, product enrichment and product pricing of our marketplace offering. The backend is mainly built with Node.js and Go, using SQS and Kafka for messaging and GraphQL for the endpoints. The mobile applications are built with React Native.

Each of our sites has tens to hundreds of robots, and demand for even more orders at any given time. Deciding which robot should do which delivery, given many competing goals, is therefore a non-trivial optimisation exercise. This optimisation is based on estimates for the many delivery steps, made up to an hour ahead: which robot could handle the task the quickest? When will the merchant have the goods ready? How long will it take the robot to drive through the local geography? And of course, all estimates are wrong. This is the probabilistic land of logistics optimisation, genetic algorithms, and random forests. Niels Bohr is often credited with saying, “prediction is hard, especially if it is about the future”, and we excitedly agree. Fleet Orchestration is very much a data science application evolving through continuous real-life iterations.
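To make the assignment problem concrete, here is a minimal sketch in Python. It is not our production algorithm: it simply tries every way of pairing orders with robots and keeps the one with the lowest total estimated delivery time. The function name `assign_robots` and the ETA numbers are hypothetical, and exhaustive search only works for toy-sized inputs, which is part of why approximate methods such as genetic algorithms are attractive at fleet scale.

```python
from itertools import permutations


def assign_robots(eta_minutes):
    """Pick the order-to-robot pairing with the lowest total estimated time.

    eta_minutes[order][robot] is a hypothetical estimate (in minutes) of how
    long that robot would need to complete that order.
    Returns (best_total, assignment), where assignment[i] is the robot index
    chosen for order i.
    """
    n_orders = len(eta_minutes)
    n_robots = len(eta_minutes[0])
    if n_orders > n_robots:
        raise ValueError("this sketch needs at least one robot per order")

    best_total, best_assignment = float("inf"), None
    # Try every way of giving each order its own distinct robot.
    for robots in permutations(range(n_robots), n_orders):
        total = sum(eta_minutes[order][robot] for order, robot in enumerate(robots))
        if total < best_total:
            best_total, best_assignment = total, robots
    return best_total, list(best_assignment)


# Illustrative numbers only: two orders, two robots.
etas = [
    [10, 11],  # order 0: robot 0 needs 10 min, robot 1 needs 11 min
    [10, 20],  # order 1: robot 0 needs 10 min, robot 1 needs 20 min
]
total, assignment = assign_robots(etas)
print(total, assignment)
```

Note that giving order 0 its individually fastest robot (robot 0) would leave order 1 with a 20-minute robot, for 30 minutes in total. The search instead gives order 0 the slightly slower robot 1, for 21 minutes in total, which is the kind of trade-off between competing goals described above.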

Core Backend deals with the link between robots and their duties. Robots know exactly where they are and what is happening around them as our Core Backend guides them. Routeserver provides the best available route along with its characteristics, Orchestration server provisions commands, and Commandserver enables all the data flow that both robots and remote operators provide. Handling 5k+ messages per second can be a hassle sometimes, but we manage. Scaling, optimising messaging, cutting milliseconds from latencies, reducing the number of messages per robot, aggregating, and constantly improving orchestration are the prime focus of the team, which works hand-in-hand with the Hardware and Autonomous Driving teams.
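As an illustration of what reducing the number of messages per robot through aggregation can look like, here is a minimal Python sketch. The message shape (`robot_id`, `fields`) and the function name `aggregate` are assumptions made for this example and do not describe our actual protocol. The idea is to collect the updates that arrive within a short window and forward a single merged message per robot, with the latest value for each field winning.

```python
from collections import defaultdict


def aggregate(messages):
    """Merge a window of updates into one message per robot.

    Each message is a hypothetical dict: {"robot_id": str, "fields": dict}.
    When the same field appears more than once, the later value wins.
    """
    merged = defaultdict(dict)
    for msg in messages:
        merged[msg["robot_id"]].update(msg["fields"])
    return [{"robot_id": rid, "fields": fields} for rid, fields in merged.items()]


# Illustrative window of four updates from two robots.
window = [
    {"robot_id": "r1", "fields": {"battery": 81}},
    {"robot_id": "r2", "fields": {"battery": 55}},
    {"robot_id": "r1", "fields": {"speed": 1.2}},
    {"robot_id": "r1", "fields": {"battery": 80}},
]
batched = aggregate(window)
print(len(window), "->", len(batched), "messages")
```

The trade-off is latency: a longer window merges more updates but delays each one, so time-critical data would need a short window or no batching at all.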

Another aspect for the team is systems reliability: when robots lose connectivity, they are eventually unable to drive, so it had better not happen.

The team mainly uses Go and Node.js to implement services, with some Elixir and Python mixed in.

Running autonomous robots on city streets is very much a software engineering challenge. Some of this software runs on the robot itself, but a lot of it actually runs in the backend: remote control, path finding, matching robots to customers and fleet health management, but also interactions with customers and merchants. All of this needs to run 24×7 without interruptions and scale elastically to match the workload.

SRE at Starship is responsible for providing the cloud infrastructure and platform services for running these backend services. We’ve standardised on Kubernetes for our microservices and are running it on top of AWS. MongoDB is the primary database for most backend services, but we also like PostgreSQL, especially where strong typing and transactional guarantees are required. For async messaging, Kafka is the platform of choice, and we’re using it for pretty much everything aside from shipping video streams from robots. For observability we rely on Prometheus and Grafana, Loki, Linkerd and Jaeger. CI/CD is handled by Jenkins.

A good portion of SRE time is spent maintaining and improving the Kubernetes infrastructure. Another big piece of infrastructure that SRE is responsible for is data and databases where we mainly rely on MongoDB and PostgreSQL.

Finally, one of the most important goals of Site Reliability Engineering is to minimise the downtime of Starship’s production environment. While SRE is occasionally called out to deal with infrastructure outages, the more impactful work is done on preventing outages and ensuring that we can recover quickly. This can be a very broad topic, ranging from having rock-solid K8s infrastructure all the way to engineering practices and business processes. There are great opportunities to make an impact. Read more about the SRE team.

In addition to external customers, some of the most important users of our solutions are the Starship people who support the robots online and offline (either operating them remotely or working physically in the field). The remote operations panel is like an airplane’s cockpit: it gives operators live video feeds, sensor data and remote control of the robots. The frontend is built with React and Redux; the API backend is mainly Node.js, supported by persistent Go servers that handle time-critical communication.

Field operators’ work is guided by the field assistant application, which helps them with daily duties like preparing the fleet to be rolled out in the mornings, charging the robots overnight and occasionally changing a wheel here and there. The app is built with React Native and uses GraphQL endpoints.

The team also creates developer tools to simulate past and future robot events and to debug problematic scenarios we have encountered in real life, enabling other engineering teams to be more creative and productive.

Autonomous driving in human spaces is at the heart of our robot, and is one of the widest and most interesting software engineering challenges today. The AD team develops the software to solve these probabilistic problems on the robot in real time and without an internet connection. This has grown into a code base of over two million lines that handles many requirements of an autonomous vehicle, including but not limited to image recognition using deep learning, shape recognition and tracking, and path planning.

Other challenges the team solves include things like determining robot orientation and location in space, driving gracefully in the vicinity of pedestrians, safety analytics, signal processing for radar and ultrasonic signals and hardware fault detection.

Our utmost priority is the safety of our robots and of people. As we can’t test all real-life scenarios in the field, we rely on extensive simulation and in silico testing to make sure the software we deploy actually works before releasing it to our test ground (on a nightly basis).

Our autonomous driving software is primarily written in C++ for both the CPU and GPU, with the remainder written in Rust and Python. Python is, of course, also used for our neural network training across multiple frameworks.

Starship is a data-driven company, and our petabytes are precious to us. In fact, “Data guides the way” is one of our company’s cultural values. In addition to the data lake containing robot data feeds, we also have a structured data warehouse with 600+ tables, and an extensive set of analytical dashboards that provide insights into all kinds of aspects of our business. The data stack is in continuous development by our data engineers and data scientists, keeping in sync with our business priorities. We use Spark, Databricks, Tableau, Redash and Airflow.

The business problems we are tackling come from a very wide range of topics. What kinds of customers use us more frequently? How many remote operators should we schedule for tomorrow? What is the optimal robot fleet allocation between sites? Does bad terrain make robot wheels break more often? Which are the new cities and sites we should launch our service in?

Our hardware team is responsible for the electronics and mechanics of the robot, as well as the embedded software, operating system and communication layer of the robot and the infrastructure around it.

The challenges the electronics team faces are twofold: how to get the best possible signal from the real world, which is rather messy, noisy and unpredictable, and how to design things reliably enough that they work in the rather harsh conditions (water, vibration, heat, snow) our robots face while wandering around neighbourhoods or university campuses.

Solving those challenges requires a very good understanding of electronics, signal processing and physical process control. As with any complicated system, troubleshooting and debugging is fun in its own right.

We design the majority of the hardware components (both electrical and mechanical) in-house.

Inside the robot we have a main compute unit (a Tegra TK1 in older robots and an AMD Ryzen based system in newer ones), which gives us enough processing power to perform the computations needed for autonomous driving. Many of those computations deal with signal and image processing, so they are very GPU-heavy. Because of network latency, not all computations can be offloaded to servers in the cloud.

Besides the main compute unit, we have a number of different sensors (cameras, radars, ultrasound, gyros), many actuators (motors, bogies and locks), and controllers and processors. Some of the signal processing is so time-sensitive that it requires an FPGA. The controller software is also written in-house by our embedded software developers.

To summarise, Starship is building an end-to-end autonomous delivery platform, all the way from designing and building the robots to building the consumer- and merchant-facing applications.

If bringing autonomous deliveries to the world is a mission you’d be keen on helping us with, do have a look at our open positions at https://www.starship.xyz/careers/ or feel free to get in touch with me directly.

You can also check out our Starship Engineering Youtube channel here — https://www.youtube.com/channel/UC6Vee4Zqd6oJayLO5uIcb9Q

— — — — — —

PS. When our Co-Founder Ahti Heinla started the company, I’m sure he wasn’t thinking of sheep detection algorithms 😉