Next steps

As of today, I am a software engineer in Facebook's Distributed Artificial Intelligence group. I will be working on the infrastructure side, tackling the problems that come with scaling the training of large machine learning models. Amazon SageMaker is a publicly available product that faces many of the same challenges the group wrestles with: see the SageMaker documentation on distributed training.

Specifically, the group aims to provide the underlying infrastructure for splitting large training tasks across multiple nodes, and to develop new methods of distributed training that make better use of GPUs. Here are excerpts from two Facebook research papers that sketch what the group has been working on:

"Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective" (2018)

Traditionally, models were trained on a single machine. (...) Given that the data needed for training is increasing over time, hardware limitations can result in an unacceptable increase in overall training latency and time to convergence. Distributed training is one solution for overcoming these hardware limitations and reducing latency. (...)

A common assumption is that for data parallelism across machines, a specialized interconnect is required. However, during our work on distributed training, we have found Ethernet-based networking to be sufficient, providing near-linear scaling capability. (...)

If models become exceptionally large, model parallelism training can be employed, where the model layers are grouped and distributed to optimize for throughput with activations pipelined between machines. (...) In many cases, during inference, the DNN models themselves are designed to be run on a single machine, as partitioning the model graph among the machines can result in large amount of communication. But major services are consistently weighing the cost/benefits of scaling their models.
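To make the data parallelism mentioned in the excerpt above concrete, here is a minimal sketch of my own (it is not code from the paper). It assumes a toy one-parameter linear model and simulates two workers in a single process. The names `gradient`, `all_reduce_mean`, and `data_parallel_step` are hypothetical, and `all_reduce_mean` merely stands in for the collective communication a real system would perform over the network.

```python
# Illustrative only: a toy model y = w * x trained with squared error.
# Workers, shards, and helper names are hypothetical, not from the paper.

def gradient(w, xs, ys):
    """Mean gradient of 0.5 * (w*x - y)^2 with respect to w over a batch."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def all_reduce_mean(values):
    """Stand-in for a collective all-reduce: every worker receives the mean."""
    return sum(values) / len(values)

def data_parallel_step(w, shards, lr):
    """One synchronous step: local gradients, averaging, identical update."""
    local_grads = [gradient(w, xs, ys) for xs, ys in shards]
    return w - lr * all_reduce_mean(local_grads)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by w = 2

# Two simulated workers, each holding an equal-sized shard of the batch.
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]

w_parallel = data_parallel_step(0.0, shards, lr=0.1)
w_single = 0.0 - 0.1 * gradient(0.0, xs, ys)  # same step on one machine
```

With equal-sized shards, averaging the per-worker gradients gives the same update as computing the gradient over the full batch on one machine, which is why synchronous data parallelism can speed up training without changing the result of a step. The cost is the communication needed for the averaging, which is where the interconnect question raised in the excerpt comes in.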

"Understanding Training Efficiency of Deep Learning Recommendation Models at Scale" (2021)

While training deep learning recommendation models on CPUs offer memory capacity advantage, a large degree of parallelism in the training process, which could be unlocked by utilizing accelerators, is left unexploited. An alternative is to train recommendation models on Facebook's Big Basin GPU servers, originally designed for non-recommendation AI workloads. (...) training Facebook-scale recommendation models on the Big Basin system is not straightforward — optimization techniques tailored for the placement of embedding tables are needed to overcome the capacity requirement while addressing the potential increase in the embedding vector access latency. (...)

For training on CPU-only servers, embedding tables can be simply placed on system memory. For accelerated systems, the optimal embedding placement strategy might differ given the hardware properties. We discuss four strategies for storing embedding tables for GPU platforms: GPU memory, system memory of the GPU server, system memory of remote CPU servers, hybrid system and GPU memory. (...)
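As an aside of my own, the hybrid strategy can be pictured as a tiering problem. The sketch below is not the paper's algorithm: it is a simple greedy heuristic I am assuming for illustration, which places the most frequently accessed tables per gigabyte in GPU memory, spills to the GPU server's system memory, and falls back to remote CPU servers. The `Table` class, the `place_tables` function, the table names, and all sizes are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Table:
    name: str
    size_gb: float         # memory footprint of the embedding table
    lookups_per_step: int  # how often the table is accessed per training step

def place_tables(tables, gpu_gb, host_gb):
    """Greedy placement (an assumed heuristic): hottest tables per GB
    go to the fastest memory tier that still has room for them."""
    placement = {}
    by_heat = sorted(
        tables, key=lambda t: t.lookups_per_step / t.size_gb, reverse=True
    )
    for t in by_heat:
        if t.size_gb <= gpu_gb:
            placement[t.name] = "gpu_memory"
            gpu_gb -= t.size_gb
        elif t.size_gb <= host_gb:
            placement[t.name] = "system_memory"
            host_gb -= t.size_gb
        else:
            placement[t.name] = "remote_cpu_server"
    return placement

# Made-up tables and capacities, purely for illustration.
tables = [
    Table("user_id", size_gb=40.0, lookups_per_step=100),
    Table("ad_category", size_gb=2.0, lookups_per_step=500),
    Table("page_id", size_gb=300.0, lookups_per_step=50),
]
placement = place_tables(tables, gpu_gb=16.0, host_gb=256.0)
```

In this made-up example the small, hot `ad_category` table lands in GPU memory, `user_id` in system memory, and the oversized `page_id` table on a remote CPU server. This mirrors the trade-off the excerpt describes between capacity and embedding access latency.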

Numbers of three types of servers need to be configured for each workflow: number of trainer servers, number of parameter servers (if used) and number of reader servers. Selection of number of servers is made based on the throughput requirement for training and the memory capacity requirement of the embedding tables. (...) Trainer servers dictate the data parallelism with which we train. (...) Parameter servers hold the model parameters in memory. (...) Readers access model training data in parallel from remote storage Hive, Facebook's exabyte-scale data warehouse.
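The sizing logic in the excerpt above can be sketched as back-of-the-envelope arithmetic. This is my own simplification, not the paper's procedure: I assume trainer and reader counts follow from a throughput target, and the parameter server count follows from the total size of the embedding tables. The function `size_workflow`, its parameters, and every number below are hypothetical.

```python
import math

def size_workflow(target_examples_per_s, per_trainer_examples_per_s,
                  per_reader_examples_per_s, embedding_gb, per_ps_memory_gb):
    """Rough server counts under the stated assumptions (illustrative only)."""
    return {
        # Enough trainers to process the target throughput.
        "trainers": math.ceil(target_examples_per_s / per_trainer_examples_per_s),
        # Enough readers to feed the trainers at that rate.
        "readers": math.ceil(target_examples_per_s / per_reader_examples_per_s),
        # Enough parameter servers to hold the embedding tables in memory.
        "parameter_servers": math.ceil(embedding_gb / per_ps_memory_gb),
    }

# Made-up inputs: 500k examples/s target, 1.5 TB of embedding tables.
counts = size_workflow(
    target_examples_per_s=500_000,
    per_trainer_examples_per_s=60_000,
    per_reader_examples_per_s=40_000,
    embedding_gb=1_500,
    per_ps_memory_gb=256,
)
```

With these made-up inputs the sketch asks for 9 trainers, 13 readers, and 6 parameter servers. A real configuration would presumably also account for network bandwidth, stragglers, and headroom, none of which is modeled here.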

I am excited to join a team where I can work on hands-on distributed systems problems while staying close to the company's core business, e.g., personalized ads. While the exact details of what I will be working on are not yet clear, I will likely start on the more traditional, CPU-centric side of distributed training.