How Amazon Retail Systems Use Apache Spark and the Deep Java Library (DJL) to Build Propensity Models That Enhance the Customer Experience
In the contemporary landscape of personalized marketing, many companies, including Amazon, are leveraging advanced machine learning techniques to tailor content and recommendations to individual customers. A critical component in achieving this personalization is understanding each customer’s propensity to engage with different product categories based on their past behaviors and preferences. This propensity data allows for more targeted marketing efforts, including personalized email campaigns, advertisements, and website banners.
At Amazon, the retail systems team has developed a multi-label classification model using MXNet to gauge customer propensity across a vast array of product categories. The goal of this model is to enhance the customer experience by delivering more relevant recommendations and promotions. In this post, we’ll delve into the challenges we faced while constructing these propensity models and explain how we addressed them using Apache Spark and the Deep Java Library (DJL). DJL is an open-source library designed to facilitate deep learning in Java, offering flexibility and performance for large-scale applications.
Challenges
One of the primary challenges was developing a production system capable of scaling to meet Amazon’s extensive demands while remaining easy to maintain. Apache Spark emerged as a critical tool for managing and scaling our data processing needs within the desired runtime. For our machine learning framework, MXNet proved to be highly effective. It handled our large dataset of hundreds of millions of records efficiently, offering superior execution times and model accuracy compared to other frameworks.
Another challenge was reconciling the differing preferences of our team members. Our engineering team, which specializes in Java and Scala, wanted to build a robust production system using Apache Spark. In contrast, our research scientists preferred Python-based frameworks. To bridge this gap, we turned to DJL, which supports multiple machine learning frameworks and allowed us to integrate MXNet with our Java-based system. Scientists could develop and train models using MXNet’s Python API, while the engineering team could use DJL to deploy these models and perform inference within Spark, all in Scala. DJL’s framework-agnostic nature means that if the team decides to switch to another ML framework in the future, such as PyTorch or TensorFlow, minimal changes to the existing codebase would be required.
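The pattern described above can be sketched in Scala. In this illustrative example (the model path, feature representation, and translator are assumptions, not the team's actual code), DJL's Criteria API loads an MXNet-trained model and scores customer feature vectors inside a Spark mapPartitions call, loading the model once per partition:

```scala
import java.nio.file.Paths

import ai.djl.ndarray.NDList
import ai.djl.repository.zoo.Criteria
import ai.djl.translate.{Batchifier, Translator, TranslatorContext}
import org.apache.spark.sql.{Dataset, SparkSession}

// Maps a raw feature vector to the model's NDList input, and the
// model's output tensor back to a per-category score array.
object PropensityTranslator extends Translator[Array[Float], Array[Float]] {
  override def processInput(ctx: TranslatorContext, features: Array[Float]): NDList =
    new NDList(ctx.getNDManager.create(features))

  override def processOutput(ctx: TranslatorContext, output: NDList): Array[Float] =
    output.singletonOrThrow().toFloatArray

  override def getBatchifier: Batchifier = Batchifier.STACK
}

object PropensityScoring {
  // Scores per-customer feature vectors with a DJL predictor. DJL model
  // objects are not serializable, so the model is loaded inside the
  // closure — once per partition, not once per row.
  def score(features: Dataset[Array[Float]])(implicit spark: SparkSession): Dataset[Array[Float]] = {
    import spark.implicits._
    features.mapPartitions { rows =>
      val criteria = Criteria.builder()
        .setTypes(classOf[Array[Float]], classOf[Array[Float]])
        .optModelPath(Paths.get("/models/propensity")) // illustrative path
        .optEngine("MXNet") // engine the scientists trained with
        .optTranslator(PropensityTranslator)
        .build()
      val predictor = criteria.loadModel().newPredictor()
      rows.map(predictor.predict)
    }
  }
}
```

Because the Criteria builder only names the engine, swapping MXNet for PyTorch or TensorFlow would largely come down to changing the `optEngine` argument and the exported model artifact, which is what makes the framework-agnostic claim above concrete.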
Data
Training our classification model required careful management of two critical datasets: features and labels.
Feature Data
Feature data is crucial for any machine learning model. Multi-label classification let us streamline feature generation: a single pipeline captures signals from multiple product categories, so we can maintain one comprehensive multi-label classification model instead of several separate binary classifiers. This consolidation not only reduces operational overhead but also improves our ability to derive customer propensity across diverse product categories.
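To make the single-model-versus-many idea concrete, here is a minimal sketch of how a multi-label model's output is typically consumed: one forward pass yields an independent sigmoid score per category, replacing N separate binary classifiers. The category names, logit values, and 0.5 threshold are purely illustrative:

```scala
object MultiLabelPropensity {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def main(args: Array[String]): Unit = {
    // One logit per category from the model's output layer (hypothetical values).
    val logits = Map("books" -> 2.1, "electronics" -> -0.7, "grocery" -> 0.4)

    // A single pass produces a propensity score for every category at once.
    val propensities = logits.map { case (cat, z) => cat -> sigmoid(z) }

    // Threshold the scores into engage / don't-engage flags for targeting.
    val likelyToEngage = propensities.collect { case (cat, p) if p >= 0.5 => cat }
    println(likelyToEngage.toSeq.sorted.mkString(", ")) // prints: books, grocery
  }
}
```

In a binary-per-category design, each of these scores would require its own trained model and its own feature pipeline; with the multi-label formulation they all fall out of one model and one pipeline.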
Conclusion
By integrating Apache Spark with DJL, Amazon’s retail systems team has successfully created a scalable and efficient machine learning infrastructure that enhances personalized marketing efforts. This combination of technologies allows us to build and deploy sophisticated models that drive better customer experiences and optimize marketing strategies. The ability to handle large-scale data processing and model deployment seamlessly demonstrates the power and versatility of modern machine learning tools in a real-world enterprise setting.