The Machine Learning (ML) development process requires a great deal of coding and tooling support from the engineering team. In this blog, we will describe how our engineering group applies software engineering principles to our methods, techniques, methodologies, and tools to support various ML use cases.
As PayPal’s ML engineering organization, we power our data science with automated tools to create models and features at scale. Like other organizations, we face growing challenges in managing software development specific to machine learning (ML) as well as in dealing with unique issues encountered during the development of large-scale ML infrastructure and applications.
To deal with these challenges, we adapt the software engineering (SE) discipline to be ML-centric:
- We define methods and techniques based on SE principles to integrate ML development practices
- On top of this set of methods & techniques, we establish model and features life-cycle methodologies
- We create automated tools to develop, productize, and manage ML products according to the methodologies
Application of Principles
Software engineering principles and their application to ML
In this blog, we will describe how we apply the following SE principles to our methods, techniques, methodologies, and tools:
- Rigor and formality
- Separation of concerns
- Anticipation of change
Rigor and formality
We organize data scientists and software engineers into combined scrum teams, creating heterogeneous teams with strengths in both ML research / modeling and engineering. While allowing for a high degree of creativity and innovation, we follow a well-defined ML software development process of which ML specifics are a central part.
In the framework of this process, the development proceeds in several clearly defined phases:
Data onboarding — With the continuing inflow of business requirements, there is a constant need to onboard new sources of data. We create standards for the data model and onboarding processes to curate collected data and to ensure consistency, reliability, and trustworthiness.
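One way to make such onboarding standards enforceable is to express them as a machine-checked contract. The sketch below is purely illustrative (the column names and `ColumnSpec` contract are hypothetical, not PayPal's actual data model): each new source must conform to a declared schema before its data is curated.

```python
from dataclasses import dataclass

@dataclass
class ColumnSpec:
    name: str
    dtype: type
    nullable: bool = False

# Hypothetical onboarding contract for a new data source.
CONTRACT = [
    ColumnSpec("customer_id", str),
    ColumnSpec("txn_amount", float),
    ColumnSpec("txn_time", str),
    ColumnSpec("memo", str, nullable=True),
]

def validate_record(record: dict) -> list:
    """Return a list of violations; an empty list means the record conforms."""
    errors = []
    for spec in CONTRACT:
        value = record.get(spec.name)
        if value is None:
            if not spec.nullable:
                errors.append(f"missing required column: {spec.name}")
        elif not isinstance(value, spec.dtype):
            errors.append(
                f"{spec.name}: expected {spec.dtype.__name__}, "
                f"got {type(value).__name__}")
    return errors
```

Records that fail validation can be quarantined rather than silently ingested, which is what keeps the curated data trustworthy downstream.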
Research — In this phase, data scientists focus on modeling and prototyping. This phase is highly exploratory; thus, the focus of our engineering organization here is to provide tools to shorten research iterations and speed up the overall process.
Productization — The primary goal of this phase is to deliver production-grade ML in terms of performance and maintainability. We invest significant time and effort in doing architecture and design so as to ensure that we develop high-quality software according to the SE principles and our own best practices. As part of this process, we sometimes need to take a step back to the research phase to adjust algorithms to be feasible in terms of resource consumption and stability. To accelerate the deployment process to production environments, we constantly invest in the automation of tests to quickly identify problems in the early stages and to build CI / CD pipelines.
Monitoring — We practice system monitoring to ensure that the pipelines are running and functional as well as to continuously assess model performance and stability with general metrics and anomaly-based thresholds.
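An anomaly-based threshold of the kind mentioned above can be as simple as flagging a metric that drifts too far from its own recent history. The following is a minimal sketch, not our production monitoring stack; the three-sigma rule is one common choice among many.

```python
from statistics import mean, stdev

def is_anomalous(history, current, k=3.0):
    """Flag `current` when it deviates from recent history by more than
    k standard deviations (a simple anomaly-based threshold)."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) > k * sigma
```

Fed with a rolling window of a model metric (e.g. daily AUC or score distribution quantiles), such a check can page the team before a silent degradation reaches customers.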
Model Refresh — This is an automated process for models to learn continuously over time from newer data and achieve stable model performance without impacting day-to-day production.
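The refresh loop can be sketched as a guarded retrain-and-promote step: a candidate trained on newer data replaces the serving model only if it does not degrade performance. This is a simplified illustration of the idea, with hypothetical `train_fn` / `evaluate_fn` callables standing in for real pipelines.

```python
def refresh_model(train_fn, evaluate_fn, current_model, new_data, min_gain=0.0):
    """Retrain on newer data and promote the candidate only if it does not
    degrade performance, so day-to-day production is unaffected."""
    candidate = train_fn(new_data)
    current_score = evaluate_fn(current_model, new_data)
    candidate_score = evaluate_fn(candidate, new_data)
    if candidate_score >= current_score + min_gain:
        return candidate      # promote the refreshed model
    return current_model      # keep serving the existing model
```

In practice the evaluation would run on a held-out window rather than the training data itself, but the promotion gate is the part that protects production.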
This formal development process helps us to work efficiently in heterogeneous teams across multiple sites.
To bring more formality to critical areas, we define and track KPIs that reward the retirement of features, measure feature adoption, assess the stability and accuracy of our products, and track continuous monitoring and test coverage.
ML workflows are highly non-linear and contain several feedback loops. What makes feature engineering distinctive is the amount of experimentation needed to arrive at a good model for the problem. We work in an iterative-incremental fashion, following the "prototype fast and get feedback quickly" methodology.
- Feature engineering is essentially the creation of new features from existing ones or from raw data. Existing feature data is stored and managed in the Feature Store and can be easily accessed through PayPal Notebooks or UI tools.
- We develop UI tools for quick and easy creation of simple features. Data scientists can create tens of variations of a feature with just a few clicks.
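Those "tens of variations with a few clicks" typically come from crossing one base signal with a grid of aggregations and time windows. A minimal sketch of that expansion (the naming scheme and parameter grid here are illustrative, not our actual tool):

```python
from itertools import product

def feature_variations(base, aggs=("sum", "count", "avg"),
                       windows_days=(1, 7, 30, 90)):
    """Expand one base signal into many windowed-aggregation feature names,
    the kind of variations a UI tool can generate in a few clicks."""
    return [f"{base}_{agg}_{days}d" for agg, days in product(aggs, windows_days)]
```

Each generated name corresponds to one candidate feature; the feature-selection step later decides which of them actually earn a place in the model.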
Get feedback quickly
Feature selection requires historical data. For new features, we need to simulate their logic over the past time frame in a timely and cost-effective manner, since some models might require access to older data. The guidelines we define here are:
- Features should be developed in such a way that they can run on any Point in Time (PiT) of historical data (aka Time Travel for features); we call this feature simulatability.
- We have a layered feature structure wherein online features are created on top of batch and Near Real Time (NRT) features. To simulate such features, we need to trigger upstream simulation, which requires good lineage management and upstream features data availability in Feature Store.
- Sometimes it is not feasible to run batch features over all historical data, so we use several specific techniques: (i) sample raw data for a specific population, (ii) run the simulation on a reduced population, or (iii) decrease the granularity of the resulting PiT snapshots — for example, run monthly computations instead of daily.
We invest significant resources and effort in creating an isolated simulation environment with high parity to production data and compute platforms. This allows us to run experiments freely, without placing stress on production services. Another valuable piece of feedback we get during simulation concerns feature performance, which enables us to tune and optimize logic before the productization phase.
This methodology provides us with speedy feedback loops for feature engineering. After several cycles, most predictive features for the model are selected, and we move to the productization phase.
Separation of concerns
With business needs growing in breadth and depth, delivering ML solutions requires increasing levels of direct effort from engineers, which is not a sustainable scaling strategy.
To address that challenge, we separate ML projects into three levels of complexity, an approach that allows us to break dependencies between teams and streamline the process.
For simple features, we provide self-service tools that data scientists can use to create features without direct support from engineers. We expect a data scientist to be able to go from idea to production within hours. Behind the scenes, the system operates at a much lower TCO by ensuring efficient and consistent result computation and by eliminating feature logic duplications and effort duplications across teams and locations.
For more complex projects, we plan for side-by-side development to reduce the number of iterations between data scientists and software engineers. To create a bridge between prototyping and productization, we encourage the use of PayPal Notebooks with built-in production coding standards and templates.
For state-of-the-art feature generation, we develop dedicated platforms, such as a graph of customer assets.
The key attribute of engineering large-scale software systems is modularity. In software, the primary units of reuse are functions, algorithms, libraries, and modules.
We define ML features as functions to reuse or customize, and they constitute the basic unit of delivery. We catalog features so that they can be easily found. We encourage a separate code base for each feature so that it can be easily used or extended by downstream features and models.
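Treating features as named, cataloged functions can be sketched as a small registry: each feature is registered under a name, can be looked up by downstream consumers, and composes with other features instead of duplicating their logic. The catalog class and the two example features below are hypothetical illustrations, not our actual Feature Store API.

```python
class FeatureCatalog:
    """Minimal sketch of a feature catalog: each feature is a named
    function that can be looked up, reused, and composed downstream."""

    def __init__(self):
        self._features = {}

    def register(self, name):
        def wrap(fn):
            self._features[name] = fn
            return fn
        return wrap

    def get(self, name):
        return self._features[name]

catalog = FeatureCatalog()

@catalog.register("txn_amount_usd")
def txn_amount_usd(raw):
    return raw["amount"] * raw["fx_rate"]

# A downstream feature reuses an existing one instead of duplicating logic.
@catalog.register("is_large_txn")
def is_large_txn(raw):
    return catalog.get("txn_amount_usd")(raw) > 1000.0
```

Because `is_large_txn` resolves its dependency through the catalog, the lineage between the two features is explicit — which is exactly what makes discovery, reuse, and later retirement manageable.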
Another major standard we apply is the interoperability of features, meaning that existing features can be easily used by other features or models, regardless of the feature creation tool or the underlying online / offline / NRT computation platform.
Feature modularity and reuse are critical to our data science organization, which includes hundreds of data scientists and engineers across different site locations.
Differences between research and production environments increase the effort required for communication and solution implementation. For example, during the research phase, data scientists mostly leverage data warehouse (SQL-based) tools to develop models, while in production we use various platforms like real-time services, streaming, and big data batch jobs for inferencing. To solve this problem, we use domain-specific languages (DSLs) to abstract the details of the execution platforms:
- Ontology-based variable creation, which enables data scientists to concentrate only on the logic itself. The underlying DSL can be compiled to different platforms for execution.
- PiT support in customized SQL enables data scientists to research models using a historical snapshot of the database; under the hood, we select the PiT snapshot data and de-dupe the underlying data if needed.
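The de-duping behind that PiT support amounts to collapsing a change-history table to the latest record per key at or before the requested point in time. A minimal in-memory sketch of that semantics (the row layout here is hypothetical; the real implementation operates inside the SQL engine):

```python
from datetime import datetime

def pit_snapshot(rows, as_of):
    """Collapse a change-history table of (key, updated_at, value) rows to
    the latest value per key at or before `as_of` -- the de-duping that
    PiT-aware SQL performs under the hood."""
    latest = {}
    for key, updated_at, value in rows:
        if updated_at <= as_of:
            if key not in latest or updated_at > latest[key][0]:
                latest[key] = (updated_at, value)
    return {k: v for k, (_, v) in latest.items()}
```

With this semantics, the same customized query can be replayed at any historical `as_of` and always sees the world as it was at that moment.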
Model spec representations vary by training algorithm and framework. To enable the platform to support them all, we developed an in-house unified model spec format and model inference engine covering every model type we currently use in our solutions (linear, tree-based, neural networks [PMML, H2O], TensorFlow-based deep learning, etc.).
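The core of such a unified engine is a dispatcher: a model spec declares its type, and the engine routes scoring to the matching back-end behind one interface. The sketch below registers only a linear back-end for brevity; the spec schema and function names are hypothetical, not our in-house format.

```python
def _score_linear(spec, features):
    """Back-end for the 'linear' model type: intercept + weighted sum."""
    return spec["intercept"] + sum(
        w * features.get(name, 0.0) for name, w in spec["weights"].items())

# Tree-based and neural back-ends would register here the same way.
_BACKENDS = {"linear": _score_linear}

def score(spec, features):
    """Unified inference entry point: dispatch on the spec's model type."""
    return _BACKENDS[spec["type"]](spec, features)
```

Callers never branch on the model family themselves, which is what lets one serving platform host heterogeneous models.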
Anticipation of change
Changes are inevitable, so for every model / feature we build, we plan to retire it gracefully. To achieve this, we do two main things:
- Manage lineage between the features and models. This helps to identify all the redundant features and analyze upstream dependencies for the features to be retired.
- A fast and simple process for conducting Impact of Change analysis. The straightforward approach would be to do the impact analysis in a staging environment, then compare production and staging results using model result comparison tools. Instead, we use simulation to speed up this process and assess the impact in hours instead of weeks.
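The comparison step at the heart of that analysis can be sketched as follows: score the same population with the production setup and with the simulated change, then report how many cases moved beyond a tolerance. This is a simplified illustration of the idea, not our comparison tooling.

```python
def impact_of_change(prod_scores, sim_scores, tolerance=0.01):
    """Compare production scores with simulated scores after a change and
    return the share of cases whose score moved more than `tolerance`."""
    assert len(prod_scores) == len(sim_scores)
    moved = sum(1 for p, s in zip(prod_scores, sim_scores)
                if abs(p - s) > tolerance)
    return moved / len(prod_scores)
```

A near-zero result is the evidence needed to retire a feature safely; a large one tells us which models depend on it more than the lineage graph alone suggested.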
Aside from business deliveries, we constantly search for general infrastructure and tools that can support numerous similar use cases.
We have engineering teams across different sites, so design review serves as a meeting point for the teams. We ask two main questions:
- Can we solve this problem with an existing (reused) solution? If so, reusing that solution applies the generality principle.
- Is this tool / capability able to solve multiple problems? If so, it should be a good generalized solution.
Benefits of generality:
- Software is more reusable because it solves problems that are more general and thus can be easily adapted or extended to solve similar problems.
- The software can be developed faster because we reuse existing general-purpose modules / services / tools.
The above-described principles, techniques, methodologies, and tools help us to speed up the ML development and deployment process, save precious hardware and human resources, and allow us to scale with the business and ultimately serve our global customer base.
Thanks to Zhu Danick for co-authoring this article.
Please subscribe to our blog on medium or reach out to us at firstname.lastname@example.org with any questions.