As of today, I am a software engineer in Facebook's Distributed Artificial Intelligence group. I will be working on the infrastructure side of things, tackling the scaling problems of training large machine learning models. AWS's SageMaker is a publicly available product that wrestles with many of the same challenges as the group: see the SageMaker documentation on distributed training.
Specifically, the group aims to provide the underlying infrastructure for splitting large training tasks across multiple nodes, and to develop new methods of distributed training that better leverage GPUs. Here are excerpts from two Facebook research papers that sketch what the group has been working on:
Traditionally, models were trained on a single machine. (...) Given that the data needed for training is increasing over time, hardware limitations can result in an unacceptable increase in overall training latency and time to convergence. Distributed training is one solution for overcoming these hardware limitations and reducing latency. (...)
A common assumption is that for data parallelism across machines, a specialized interconnect is required. However, during our work on distributed training, we have found Ethernet-based networking to be sufficient, providing near-linear scaling capability. (...)
If models become exceptionally large, model parallelism training can be employed, where the model layers are grouped and distributed to optimize for throughput with activations pipelined between machines. (...) In many cases, during inference, the DNN models themselves are designed to be run on a single machine, as partitioning the model graph amongst the machines can result in a large amount of communication. But major services are consistently weighing the cost/benefits of scaling their models.
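The data-parallel scheme described in the first excerpt can be sketched in a few lines. This is a toy illustration of my own (not Facebook's code): each worker computes a gradient on its own data shard, the gradients are averaged (the job an all-reduce does over the network, e.g. over the Ethernet interconnect mentioned above), and every replica applies the same averaged update, keeping the model copies in sync.

```python
def local_gradient(weight, shard):
    # Gradient of mean squared error for the 1-D model y = w * x.
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(weight, shards, lr=0.01):
    grads = [local_gradient(weight, s) for s in shards]  # runs in parallel in reality
    avg = sum(grads) / len(grads)                        # the "all-reduce"
    return weight - lr * avg                             # identical update on every replica

# Four workers, each holding one shard of points from the line y = 3x.
shards = [[(x, 3 * x)] for x in (1.0, 2.0, 3.0, 4.0)]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
print(round(w, 2))  # converges toward 3.0
```

Since every replica sees the same averaged gradient, training behaves as if a single machine had processed the combined batch, which is why scaling can be near-linear when the gradient exchange is fast enough.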
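The model-parallel pipelining described above can also be sketched. In this toy illustration of my own (the stage count and layers are made up, and real pipelines overlap micro-batches across machines for throughput), the model's layers are grouped into contiguous stages, and the activations produced by one stage become the input handed to the next:

```python
def partition(layers, num_stages):
    # Group layers into contiguous stages, one stage per machine.
    per_stage = -(-len(layers) // num_stages)  # ceiling division
    return [layers[i:i + per_stage] for i in range(0, len(layers), per_stage)]

def run_pipeline(stages, micro_batches):
    # Push each micro-batch through every stage in order; the activations
    # from one stage are "pipelined" to the machine holding the next stage.
    outputs = []
    for x in micro_batches:
        for stage in stages:
            for layer in stage:
                x = layer(x)
        outputs.append(x)
    return outputs

layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
stages = partition(layers, num_stages=2)  # two machines, two layers each
print(run_pipeline(stages, [1, 2, 3]))    # [1, 9, 25]
```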
While training deep learning recommendation models on CPUs offers a memory capacity advantage, a large degree of parallelism in the training process, which could be unlocked by utilizing accelerators, is left unexploited. An alternative is to train recommendation models on Facebook's Big Basin GPU servers, originally designed for non-recommendation AI workloads. (...) training Facebook-scale recommendation models on the Big Basin system is not straightforward — optimization techniques tailored for the placement of embedding tables are needed to overcome the capacity requirement while addressing the potential increase in the embedding vector access latency. (...)
For training on CPU-only servers, embedding tables can be simply placed on system memory. For accelerated systems, the optimal embedding placement strategy might differ given the hardware properties. We discuss four strategies for storing embedding tables for GPU platforms: GPU memory, system memory of the GPU server, system memory of remote CPU servers, and hybrid system/GPU memory. (...)
The numbers of three types of servers need to be configured for each workflow: the number of trainer servers, the number of parameter servers (if used), and the number of reader servers. The server counts are selected based on the throughput requirement for training and the memory capacity requirement of the embedding tables. (...) Trainer servers dictate the data parallelism with which we train. (...) Parameter servers hold the model parameters in memory. (...) Readers access model training data in parallel from remote storage in Hive, Facebook's exabyte-scale data warehouse.
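The embedding-placement trade-off in the second excerpt amounts to picking the fastest tier the table fits in. Here is a hedged sketch of that decision (the rules and thresholds are my own illustration, not the paper's algorithm), covering the four strategies named above:

```python
def place_table(size_gb, hot, gpu_free_gb, host_free_gb):
    """Choose a placement for one embedding table (illustrative only).

    size_gb      -- memory footprint of the table
    hot          -- True if the table's rows are accessed very frequently
    gpu_free_gb  -- remaining GPU memory (HBM)
    host_free_gb -- remaining system memory on the GPU server
    """
    if size_gb <= gpu_free_gb:              # fastest: keep it entirely in HBM
        return "gpu_memory"
    if hot and gpu_free_gb > 0:             # hot rows in HBM, the rest in host DRAM
        return "hybrid_system_and_gpu_memory"
    if size_gb <= host_free_gb:             # slower access, but local to the trainer
        return "gpu_server_system_memory"
    return "remote_cpu_server_memory"       # largest capacity, highest latency

print(place_table(2,   hot=True,  gpu_free_gb=16, host_free_gb=256))
print(place_table(40,  hot=True,  gpu_free_gb=16, host_free_gb=256))
print(place_table(40,  hot=False, gpu_free_gb=16, host_free_gb=256))
print(place_table(500, hot=False, gpu_free_gb=16, host_free_gb=256))
```

The ordering reflects the latency/capacity trade-off the paper describes: each fallback buys more capacity at the price of slower embedding vector access.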
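The server-count selection described above is, at its core, a back-of-the-envelope capacity calculation. A minimal sketch, with entirely made-up per-server capacities (the real sizing surely involves far more factors):

```python
import math

def size_workflow(target_examples_per_sec, embedding_gb, read_gb_per_sec,
                  trainer_eps=5_000, ps_mem_gb=256, reader_bw_gb_per_sec=1.0,
                  use_parameter_servers=True):
    """Illustrative server counts for one training workflow.

    trainer_eps, ps_mem_gb, reader_bw_gb_per_sec are hypothetical
    per-server capacities, not real Facebook numbers.
    """
    # Trainers set the data parallelism needed to hit the throughput target.
    trainers = math.ceil(target_examples_per_sec / trainer_eps)
    # Parameter servers must jointly hold the embedding tables in memory.
    param_servers = (math.ceil(embedding_gb / ps_mem_gb)
                     if use_parameter_servers else 0)
    # Readers must stream training data from the warehouse fast enough.
    readers = math.ceil(read_gb_per_sec / reader_bw_gb_per_sec)
    return trainers, param_servers, readers

print(size_workflow(40_000, embedding_gb=1_000, read_gb_per_sec=3.5))  # (8, 4, 4)
```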
I am excited to join a team where I can work on hands-on distributed systems problems while staying close to the company's core business, e.g., personalized ads. While the exact details of what I will be working on are not yet clear, I will likely start on the more traditional, CPU-centric side of distributed training.
Speaking of distributed systems, I have been attending Aleksey Charapko's Distributed Systems Reading Group, and I cannot recommend it enough. Learning distributed systems on your own is difficult: there is no canonical first-course textbook that covers all the important recent developments, and reading research papers without the foundational training can be cumbersome. Charapko's reading group gathers academics and practitioners in the field to present and discuss recent papers, providing the necessary context for understanding them properly. Having failed a few times at trying to assemble my own reading group for distributed systems papers, I very much appreciate this consistently-running weekly group full of incredibly knowledgeable people to learn from.
I still do intend to go through the foundational material in the near future, as an ad-hoc reading of a whole bunch of papers I do not understand well is not a substitute for a ground-up study of the first principles. Specifically, I plan to go through Robert Morris's famous distributed systems course at MIT (which now has lecture videos available online). I also plan to carefully study at least one book on distributed algorithms, e.g., Lynch or Attiya–Welch.
But, for now, I must focus on the less lofty (or perhaps even more impossible?) goal of developing a basic proficiency in C++, the infrastructure programming language of choice at Facebook. To give myself a deadline to work with, I signed up for a company-internal two-day training course on performance optimization in C++, scheduled for the end of May. Beyond consulting the standard textbooks by Stroustrup, I am thinking it might be beneficial to spend some time learning the foundations of multiprocessor programming from the recently released new edition of The Art of Multiprocessor Programming. I likely won't get around to it before the one-month deadline, but we'll see!