MLEnv: Standardizing ML at Pinterest Under One ML Engine to Accelerate Innovation | by Pinterest Engineering | Pinterest Engineering Blog | Sep, 2023


Pong Eksombatchai | Principal Engineer; Karthik Anantha Padmanabhan | Manager II, Engineering

Photo from https://unsplash.com/photos/w7ZyuGYNpRQ

Pinterest’s mission is to bring everyone the inspiration to create a life they love. We rely on an extensive suite of AI-powered products to connect over 460M users to hundreds of billions of Pins, resulting in hundreds of millions of ML inferences per second and hundreds of thousands of ML training jobs per month, supported by just a couple hundred ML engineers.

In 2021, ML was siloed at Pinterest, with 10+ different ML frameworks relying on different deep learning frameworks, framework versions, and boilerplate logic to connect with our ML platform. This was a major bottleneck for ML innovation at Pinterest: the amount of engineering resources each ML team spent maintaining its own ML stack was immense, and there was limited knowledge sharing across teams.

To fix these problems we introduced MLEnv — a standardized ML engine at Pinterest now leveraged by 95% of ML jobs at Pinterest (up from <5% in 2021). Since launching our platform we have:

  • Observed a 300% increase in the number of training jobs, a world-class Net Promoter Score (NPS) of 88 for MLEnv, and a 43% increase in ML Platform NPS
  • Shifted the paradigm for ML innovation and delivered aggregate gains in Pinner engagement on the order of mid-double-digit percentages
Growth of MLEnv jobs as a share of all Pinterest ML jobs over time. MLEnv started in Q3 2021, and by Q1 2023 almost all Pinterest ML jobs were MLEnv jobs.

When we started working on the project, ML development at Pinterest was in a siloed state where each team owned most of its own unique ML stack. With standardization in tooling and modern ML libraries offering roughly the same functionality, maintaining multiple ML stacks in a company at Pinterest scale is suboptimal for ML productivity and innovation. Both ML and ML Platform engineers felt the full brunt of this situation.

For ML engineers, this meant:

  • Maintaining their own environment, including the work to ensure code quality and maintainability, the runtime environment, and the CI/CD pipeline. Questions each team had to answer and continuously revisit include how to enable unit/integration testing, how to ensure consistency between the training and serving environments, what coding best practices to enforce, and so on.
  • Handling integrations to leverage tools and frameworks that are critical for developer velocity. Heavy engineering work is required for basic quality-of-life functionality. For example, a project needs to integrate with MLFlow to track training runs, with Pinterest's internal ML training and serving platforms to train and serve models at scale, and so on.
  • Enabling advanced ML capabilities to properly develop cutting-edge ML at scale. ML has seen an explosion of innovation in recent years, especially with the prominence of large language models and generative AI, and modern workloads are much more complicated than training a model on one GPU and serving it on CPU. Teams had to spend an inordinate amount of time and resources reinventing the wheel on different platforms to enable distributed training, re-implement state-of-the-art algorithms on TensorFlow, optimize serving, and so on.
  • Worst of all, everything was done in a silo. There was a lot of repeated work as each team maintained its own environment and handled the various integrations, and any effort put into enabling advanced ML capabilities could only be applied to an individual project, because every project had a unique ML stack.
The pillars critical for ML productivity at scale, in which teams spent substantial resources and duplicated effort maintaining their own ML stacks. Teams struggled to maintain and enable all of the functionality in these pillars, given how much resource and effort each one requires.

For Platform engineers, this meant:

  • Major struggles in the creation and adoption of platform tools, which severely limited the value platform teams could add for ML engineers. It is extremely difficult for platform engineers to build good standardized tools that fit diverse ML stacks. The platform team also had to work closely with ML stacks one by one in order to integrate offerings from the ML Platform — tools like the distributed training platform, automated hyperparameter tuning, and so on took far longer than needed because the work had to be repeated for every team.
  • Having to build expertise in both TensorFlow and PyTorch, which stretched ML Platform engineering resources to the limit. The nuances of the underlying deep learning framework need to be considered in order to build a high-performance ML system, and the platform team spent several times the effort needed because it had to support multiple deep learning frameworks and versions (PyTorch vs. TensorFlow vs. TensorFlow2).
  • Inability to drive software and hardware upgrades. Individual teams fell very far behind on ML-related software upgrades, even though each upgrade brings a lot of new functionality. Rather than the upgrade process being handled by platform engineers, most teams ended up using very outdated versions of TensorFlow, CUDA, etc. because of how cumbersome the upgrade process usually is. Similarly, it was very difficult to drive hardware upgrades, which limited Pinterest's ability to take advantage of the latest NVIDIA accelerators. Hardware upgrades usually required months of collaboration with various client teams to bring lagging software versions up to date.
MLEnv architecture diagram with major components

In mid-2021, we gained alignment from various ML stakeholders at Pinterest and built the ML Environment (MLEnv), a full-stack ML developer framework that aims to make ML engineers more productive by abstracting away technical complexities that are irrelevant to ML modeling. MLEnv directly addresses the issues described in the previous section and provides four major components for ML developers.

Code Runtime and Build Environment

MLEnv provides a standardized code runtime and build environment for its users: a monorepo (single code repository) for all ML projects; a single shared environment, built with Docker, in which all ML training and serving runs; and a CI/CD pipeline offering powerful components that are not otherwise easily available, such as GPU unit tests and ML trainer integration tests. Platform engineers handle the heavy lifting of setting these up once, and every ML project at Pinterest simply reuses them.
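GPU unit tests of this kind are typically gated so they only execute on CI workers that actually have an accelerator. A minimal standard-library sketch of the pattern (the detection heuristic and test names are illustrative assumptions, not MLEnv's actual setup):

```python
import shutil
import unittest

# Heuristic GPU check; a real CI image might instead call torch.cuda.is_available()
HAS_GPU = shutil.which("nvidia-smi") is not None


class TestModelForward(unittest.TestCase):
    @unittest.skipUnless(HAS_GPU, "GPU not available on this runner")
    def test_forward_on_gpu(self):
        # Placeholder for a real forward-pass check that runs on the GPU
        self.assertTrue(HAS_GPU)

    def test_forward_on_cpu(self):
        # The CPU path always runs, so every CI runner exercises the model code
        self.assertEqual(1 + 1, 2)
```

On a CPU-only runner the GPU case is reported as skipped rather than failed, so the same test suite runs unchanged across heterogeneous CI workers.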

ML Dev Toolbox

MLEnv provides ML developers with the ML Dev Toolbox, a set of commonly used tools that help them be more productive in training and deploying models. Many are popular third-party tools such as MLFlow, TensorBoard, and profilers, while others are internal tools and frameworks built by our ML Platform team, such as our model deployment pipeline, ML serving platform, and ML training platform.

The toolbox lets ML engineers use dev-velocity tools through a single interface and skip integrations that are usually very time-consuming. One tool to highlight is the training launcher CLI, which makes the transition between local development and training the model at scale on Kubernetes, through our internal training platform, seamless. Combined, these tools create a streamlined ML development experience in which engineers can quickly iterate on their ideas, use various tools to debug, scale up training, and deploy the model for inference.
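To illustrate the idea of a launcher that switches between a local run and a cluster submission, here is a toy standard-library sketch — the flag names and the `tcp-submit` command are invented for illustration and are not MLEnv's real CLI:

```python
import argparse


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Toy MLEnv-style training launcher")
    parser.add_argument("--mode", choices=["local", "remote"], default="local")
    parser.add_argument("--entrypoint", default="trainer.main")
    parser.add_argument("--config", default="configs/job.yaml")
    parser.add_argument("--gpus", type=int, default=1)
    return parser.parse_args(argv)


def build_launch_command(args: argparse.Namespace) -> list[str]:
    """Translate one set of flags into either a local run or a cluster submission."""
    train_cmd = ["python", "-m", args.entrypoint, "--config", args.config]
    if args.mode == "local":
        return train_cmd
    # In "remote" mode the same entrypoint is handed off to the training platform
    # (hypothetical submit command; a real launcher would package the monorepo image)
    return ["tcp-submit", "--gpus", str(args.gpus)] + train_cmd
```

The key property is that the same entrypoint and config are used in both modes, so nothing about the model code changes when an engineer moves from a laptop run to a multi-GPU Kubernetes job.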

Advanced Functionalities

MLEnv gives customers access to advanced functionality that was previously available only to the team that created it, because of our earlier siloed state. ML projects now have access to a portfolio of techniques that help speed up training, such as distributed training, mixed-precision training, and libraries like Accelerate and DeepSpeed. Similarly, on the serving side, ML projects have access to highly optimized ML components for online serving as well as newer technologies such as GPU serving for recommender models.
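As an illustration only, opting a job into these capabilities can be as small as a few configuration lines; the keys below are hypothetical and do not reflect MLEnv's actual schema:

```yaml
# Hypothetical MLEnv job config -- key names are illustrative only
trainer:
  distributed:
    strategy: ddp        # e.g. PyTorch DistributedDataParallel
    num_workers: 8
  mixed_precision: fp16  # AMP-style mixed-precision training
serving:
  accelerator: gpu       # opt into GPU serving for recommender models
```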

Native Deep Learning Library

With the previous three components combined, ML developers can focus on the interesting part: the logic to train their model. We took extra care not to add any abstraction over the modeling logic that would pollute the experience of working with well-designed deep learning libraries such as TensorFlow2 and PyTorch. In our framework, ML engineers retain full control over dataset loading, model architecture, and the training loop, all implemented using native deep learning libraries, while having access to the complementary components described above.
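To make "full control over the training loop" concrete, the loop stays ordinary code owned by the engineer. The shape is the familiar one below, shown framework-free with a toy gradient-descent step so it runs without a deep learning library installed; in a real MLEnv project this body would be native PyTorch or TF2 code:

```python
def train(steps: int = 100, lr: float = 0.1) -> float:
    """Minimize the toy loss f(w) = (w - 3)^2 with a hand-written loop.

    The engineer owns the data, the model, and this loop; the platform
    supplies the runtime, tooling, and launch path around it.
    """
    w = 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # analytic gradient of the toy loss
        w -= lr * grad          # vanilla SGD update
    return w


final_w = train()
print(round(final_w, 4))  # converges toward the optimum w = 3
```

Because MLEnv adds no abstraction layer here, swapping in a different optimizer, loss, or data pipeline is a local code change rather than a framework extension.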

After MLEnv reached general availability in late 2021, we entered a very interesting period of rapid advancement in both ML modeling and the ML platform at Pinterest, which resulted in massive improvements in recommendation quality and our ability to serve more inspiring content to our Pinners.

ML Development Velocity

The direct impact of MLEnv is a massive improvement in ML dev velocity for ML engineers at Pinterest. The ability to offload most of the ML boilerplate engineering work, access to a complete set of useful ML tools through an easy-to-use interface, and easy access to advanced ML capabilities are game changers for developing and deploying cutting-edge ML models.

ML developers are very happy with the new tooling. MLEnv maintains an NPS of 88, which is world-class, and was a key contributor to improving ML Platform NPS by 43%. In one of the organizations we work with, NPS improved by 93 points once MLEnv was fully rolled out.

Teams are also much more productive as a result. We see several-fold growth in the number of ML jobs (i.e., offline experiments) each team runs, even though the number of ML engineers has stayed roughly the same. Teams can now also take models to online experimentation in days rather than months, resulting in a several-fold increase in the number of online ML experiments.

Explosion in the number of ML jobs over time due to developer velocity improvements

ML Platform 2.0

MLEnv made the ML Platform team much more productive by allowing the team to focus on a single ML environment. The ML Platform team can now build standardized tools and cutting-edge ML capabilities, and drive adoption through a single integration with MLEnv.

An example on the ML training platform side is the Training Compute Platform (TCP), our in-house distributed training platform. Before MLEnv, the team struggled to maintain the platform because it had to support diverse ML environments with different deep learning framework libraries and setups. The team also struggled with adoption, having to onboard client teams one by one, each with different needs. With MLEnv, however, the team was able to greatly reduce maintenance overhead by narrowing down to a single unified environment, while seeing explosive growth in the number of jobs on the platform. With much lower maintenance overhead, the team could focus on natural extensions to TCP: more advanced functionality such as distributed training, automated hyperparameter tuning, and distributed data loading through Ray became simple to implement and is released through MLEnv for client teams to adopt with minimal effort.