šŸ¦€ Serving ML at the speed of Rust

Shiv · Glance · Jul 28, 2022

Serving 150+ million users is no joke and also not cheap

At Glance, we run recommender systems that rank content on the lock screens of over 150 million users. Not all users have the same recommendation algorithm; we call each recommendation algorithm a Prediction Service.

To keep up with this traffic we can do two things:

  1. Horizontally scale up the prediction services
  2. Score more items per second or request

The first is super easy but also crazy expensive. The second is much harder, as no silver bullet exists to solve it. Also note that our Prediction Services are written in Python, which leaves you with only a handful of tricks to add more speed.

The second is imperative, but how do we get there?

Our Prediction Service (PS) consists of two classes of operations:

  1. Network calls to feature-store (irreducible)
  2. CPU cycles spent on parsing + ranking compute + post-processing

To solve for (2), I decided to re-implement one of our largest Prediction Services (which does ~1.5 million predictions/second with 20% of traffic) in a compiled language. After a bit of research, I decided to write it in Rust. Why? Because:

  1. Actix showed up as one of the top web frameworks in this benchmark (https://www.techempower.com/benchmarks/)
  2. Rust is memory-safe, fast, and its package management is 100x better than C/C++'s

(Why not Go? Check footnotes[1])

What was it like to port the Python prediction framework to Rust?

  1. In the first few hours I made a lot of mistakes, but the compiler was really nice about it, politely telling me what I needed to correct. Rust wasn't that hard to pick up after that.
  2. Writing endpoints in Actix was fairly straightforward thanks to their rich documentation (see the sketch after this list).
  3. Although Rust has really strong community support, it did not have a client implemented to call the Vertex Feature-Store. Fortunately, Rust has great cryptography libraries, which allowed me to implement an auth-enabled client from scratch in a couple of hours.
  4. The structure of a Rust project felt very close to a React project: adding a package dependency was as simple as adding a line to the Cargo.toml file.
  5. Rust's tooling system just works.
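
For a flavor of what that looked like, here is a minimal sketch of an Actix endpoint, assuming actix-web = "4" and serde = { version = "1", features = ["derive"] } in Cargo.toml. The PredictRequest/PredictResponse types and the /predict route are hypothetical stand-ins, not our actual schema:

    use actix_web::{web, App, HttpServer, Responder};
    use serde::{Deserialize, Serialize};

    // Hypothetical request/response shapes, purely for illustration
    #[derive(Deserialize)]
    struct PredictRequest {
        user_id: String,
        item_ids: Vec<String>,
    }

    #[derive(Serialize)]
    struct PredictResponse {
        scores: Vec<f32>,
    }

    // Toy handler: feature-store calls, ranking and post-processing would live here
    async fn predict(req: web::Json<PredictRequest>) -> impl Responder {
        let scores = vec![0.0_f32; req.item_ids.len()];
        web::Json(PredictResponse { scores })
    }

    #[actix_web::main]
    async fn main() -> std::io::Result<()> {
        HttpServer::new(|| App::new().route("/predict", web::post().to(predict)))
            .bind(("0.0.0.0", 8080))?
            .run()
            .await
    }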

Initial Benchmarking Results:

Finally, it was time to put the prediction service to the test. I ran some stress tests and this is what I got:

Requests Per Second (RPS):

Rust barely reaching 60 RPS :(

Latencies:

This did not make sense! After a certain load, the model latencies started rising exponentially. Note that the Python PS was able to easily do ~160 RPS.

Rust lied to me!?

I was scratching my head at this point. I had been promised a great deal, and I thought Rust had lied to me. Or so I thought.

I spent a couple of days digging deeper and found this epic blog by ScyllaDB on their debugging experience with Rust. I had a new shiny tool in my arsenal: Flamegraphs!

What are Flamegraphs? (quoting the above link)

“Flame graphs are a visualization of hierarchical data, created to visualize stack traces of profiled software so that the most frequent code-paths can be identified quickly and accurately.”

How to interpret these graphs? (quoting ScyllaDB blog):

“a rule of thumb is to look for operations that take up the majority of the total width of the graph — the width indicates time spent on executing a particular operation.”

Generating a Flamegraph of your Rust service is as easy as:
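
A minimal sketch, assuming the cargo-flamegraph subcommand and a Linux box where perf is available:

    # one-time install of the cargo subcommand
    cargo install flamegraph
    # profile the service binary and write flamegraph.svg to the project root
    cargo flamegraph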

Here is the Flamegraph of the Rust service:

Nothing too suspicious, but I do see huge chunks of OpenSSL taking 27–30% of the CPU cycles. Fortunately, I found rustls, an alternative to OpenSSL that was far more reliable and easier to use.
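
What the switch looks like depends on your HTTP stack; as a hypothetical example (not necessarily our dependency set), with reqwest it is just a feature flag in Cargo.toml:

    # swap the default native-tls (OpenSSL) backend for rustls
    reqwest = { version = "0.11", default-features = false, features = ["rustls-tls", "json"] }

Here is the flame graph after switching from OpenSSL to rustls: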

OK cool, the flames are looking nice and pointy, and no single operation is hogging all the CPU resources. That should solve it, right?

Well yes but actually noā€¦

Although the RPS went up slightly, it was nowhere near the Python prediction service's RPS! Also, the latencies kept rising exponentially as the load increased. Something was still wrong!

Down the debugging rabbit hole

I decided to check the Flamegraph at the point where latencies started rising exponentially. But something unexpected happened: when I stress-tested the Rust service locally, the latencies didn't balloon, and in fact the throughput crossed the Python service's throughput. If the same code performs differently in two environments (local vs. prod), the issue had to be in the Docker image.

To keep the Docker image lightweight, I was using scratch as the base image. Furthermore, I was generating my binary for the musl target, roughly:
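
    # the key part is the *-musl target triple; the exact flags are an assumption
    rustup target add x86_64-unknown-linux-musl
    cargo build --release --target x86_64-unknown-linux-musl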

The difference is that this target links against musl-libc, while my local environment uses glibc.

Upon further digging, I found a post by the musl-libc author discussing major problems in its malloc implementation.

TL;DR: musl-libc's malloc implementation can be really slow under high load [2].

To fix this, we can simply use a different memory allocator [3]. But an even simpler way was to not be greedy about image size and use a more sensible Docker base image that ships glibc. :) And I wasn't the first one to make this mistake.
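
For completeness, swapping Rust's global allocator is a two-line change; here is a minimal sketch using the mimalloc crate (an illustrative choice, not what we shipped; jemalloc would work similarly):

    // in main.rs: override the default (libc) allocator with mimalloc
    // assumes `mimalloc = "0.1"` has been added to Cargo.toml
    use mimalloc::MiMalloc;

    #[global_allocator]
    static GLOBAL: MiMalloc = MiMalloc;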

Cool. Lesson learned. We use slim-buster. Letā€™s keep movingā€¦
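
For reference, here is a rough sketch of the kind of multi-stage Dockerfile this implies; the image tags, binary name, and paths are placeholders, not our actual setup:

    # build stage: compile the release binary against glibc
    FROM rust:1-slim-buster AS builder
    WORKDIR /app
    COPY . .
    RUN cargo build --release

    # runtime stage: a glibc-based slim image instead of scratch + musl
    FROM debian:buster-slim
    COPY --from=builder /app/target/release/prediction-service /usr/local/bin/prediction-service
    CMD ["prediction-service"]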

Finally, after switching to slim-buster, here are the metrics:

Epic! The service comfortably crosses 900 RPS under higher load, with a p99 latency of under 90 ms!

To summarize:

To handle production traffic, the Rust service would require at most 4 VMs, whereas the Python service requires at least 20 VMs (which roughly tracks the benchmarks: 900+ RPS per Rust instance versus ~160 RPS per Python instance).

This is pretty sweet! Lots of $$$

Bringing Rust home to meet your parents

If you are tempted to bring Rust into your engineering stack, you'd better make a really strong case for it. Consider these points if you plan to pitch Rust to your team:

  1. Actix is a crazy fast web framework, and if performance matters to you, look no further. Having a high-throughput service will also keep infra costs in check. That said, productionizing a Rust microservice can come with several caveats if you are new to it.
  2. Rust is not mature enough to support ML out of the box. It worked in this case because we do minimal math on top of pre-computed scores. If you want to serve an XGBoost or deep learning model, Rust is not the right choice yet (but hopefully one day).
  3. Rust is very elegantly designed and has a powerful compiler. You will have to try really hard to write a program that breaks. Fault tolerance is built-in.
  4. Rust has a learning curve, but if you are familiar with C/C++ or Java, it will hardly take you an hour to become productive.

Footnotes:

  1. Why not use Golang? I simply didn't have enough time. But Rags did, and it's equally epic.
  2. Musl-libc is working on a much more performant implementation of malloc: https://github.com/richfelker/mallocng-draft
  3. Here is a detailed performance comparison of various memory allocators: https://www.linkedin.com/pulse/testing-alternative-c-memory-allocators-pt-2-musl-mystery-gomes

Follow me on Twitter if you want to bully your computers into going fast: https://twitter.com/shvbsle
