How TalkingData Uses AWS Open Source Deep Java Library with Apache Spark for Scalable Machine Learning Inference
TalkingData is a leading data intelligence service provider, specializing in delivering actionable insights on consumer behavior, preferences, and trends. A core component of their offering is leveraging advanced machine learning and deep learning models to predict consumer behaviors. For instance, a car dealer might use these insights to target ads more effectively, focusing on potential buyers who are predicted to purchase a car within the next few months.
Initially, TalkingData relied on an XGBoost model for such predictions. However, their data science team sought to explore whether deep learning models could deliver superior performance for their use case. After extensive experimentation, they developed a deep learning model using PyTorch, an open-source deep learning framework. The new model improved the recall rate by 13%: it identified more of the true positives while holding precision at the same level.
Despite these improvements, deploying deep learning models at TalkingData’s scale presented significant challenges. The company needed to generate hundreds of millions of predictions daily, which required robust processing capabilities. Previously, they used Apache Spark, an open-source distributed processing engine, to manage large-scale data processing tasks. While Spark excels at distributing tasks across multiple instances for faster processing, it is a Java/Scala-based platform that can encounter issues when integrating with Python-based applications. Specifically, memory consumed by Python worker processes sits outside the JVM heap, where Spark’s Java garbage collector cannot manage it, which can lead to executors exceeding their memory limits and crashing.
Although the XGBoost model had native support for Java, allowing TalkingData to deploy it directly within Spark, PyTorch did not offer a similar Java API. This lack of native support created a problem: TalkingData could not directly execute their PyTorch model within Apache Spark due to the aforementioned memory management issues. To address this, they had to transfer data from Spark to a separate GPU instance for model inference. This workaround not only increased the overall processing time but also added complexity and maintenance overhead.
A breakthrough came when TalkingData’s production team learned about DJL (Deep Java Library) through the article “Implement Object Detection with PyTorch in Java in 5 Minutes with DJL.” DJL, an open-source deep learning framework developed by AWS, runs deep learning models natively in Java. It supports multiple engines, including PyTorch, and thus offers a way to integrate deep learning models with Java-based environments like Apache Spark.
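To illustrate the pattern, DJL loads a model through its `Criteria` builder and converts between Java types and tensors with a `Translator`. The sketch below is a minimal, hedged example of that API; the `model.pt` path, the `float[]` feature layout, and the single-score output are illustrative assumptions, not TalkingData's actual model.

```java
import java.nio.file.Paths;

import ai.djl.inference.Predictor;
import ai.djl.ndarray.NDList;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ZooModel;
import ai.djl.translate.Batchifier;
import ai.djl.translate.Translator;
import ai.djl.translate.TranslatorContext;

public class DjlPyTorchExample {

    public static void main(String[] args) throws Exception {
        // A Translator converts between Java float arrays and the NDList
        // tensors that the underlying engine consumes and produces.
        Translator<float[], float[]> translator = new Translator<float[], float[]>() {
            @Override
            public NDList processInput(TranslatorContext ctx, float[] input) {
                return new NDList(ctx.getNDManager().create(input));
            }

            @Override
            public float[] processOutput(TranslatorContext ctx, NDList output) {
                return output.singletonOrThrow().toFloatArray();
            }

            @Override
            public Batchifier getBatchifier() {
                return null; // feed single examples; no automatic batching
            }
        };

        Criteria<float[], float[]> criteria = Criteria.builder()
                .setTypes(float[].class, float[].class)
                .optModelPath(Paths.get("model.pt")) // hypothetical TorchScript artifact
                .optEngine("PyTorch")                // select the PyTorch engine explicitly
                .optTranslator(translator)
                .build();

        // ZooModel and Predictor are AutoCloseable; closing them releases the
        // native, off-heap memory that the JVM garbage collector cannot see.
        try (ZooModel<float[], float[]> model = criteria.loadModel();
             Predictor<float[], float[]> predictor = model.newPredictor()) {
            float[] score = predictor.predict(new float[] {0.1f, 0.5f, 0.9f});
            System.out.println("score = " + score[0]);
        }
    }
}
```

Because DJL keeps tensors in native memory and exposes explicit lifecycles, the JVM garbage-collection problems described above for Python workers do not arise in the same way.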
By adopting DJL, TalkingData was able to execute their PyTorch model directly within Apache Spark, eliminating the need for separate GPU instances. This integration streamlined their processing pipeline, resulting in a 66% reduction in running time and significant cuts in maintenance costs. DJL’s compatibility with Spark allowed TalkingData to optimize their deep learning deployment, achieving greater efficiency and performance.
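A common way to wire this kind of inference into Spark is to load the model once per partition inside `mapPartitions`, so each executor pays the model-loading cost once rather than once per record. The following is a hedged sketch of that pattern, assuming a hypothetical TorchScript file `model.pt` available on every worker and a simple `float[]` feature layout; it is not TalkingData's production pipeline.

```java
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

import ai.djl.inference.Predictor;
import ai.djl.ndarray.NDList;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ZooModel;
import ai.djl.translate.Batchifier;
import ai.djl.translate.Translator;
import ai.djl.translate.TranslatorContext;

public class SparkDjlInference {

    // Build the Criteria on the executor side: loaded models are not
    // serializable, so only this configuration travels with the closure.
    static Criteria<float[], float[]> criteria() {
        return Criteria.builder()
                .setTypes(float[].class, float[].class)
                .optModelPath(Paths.get("model.pt")) // hypothetical model file on each worker
                .optEngine("PyTorch")
                .optTranslator(new Translator<float[], float[]>() {
                    @Override
                    public NDList processInput(TranslatorContext ctx, float[] in) {
                        return new NDList(ctx.getNDManager().create(in));
                    }

                    @Override
                    public float[] processOutput(TranslatorContext ctx, NDList out) {
                        return out.singletonOrThrow().toFloatArray();
                    }

                    @Override
                    public Batchifier getBatchifier() {
                        return null; // score one record at a time
                    }
                })
                .build();
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("djl-inference").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Stand-in feature vectors; in practice these come from upstream ETL.
        JavaRDD<float[]> features = jsc.parallelize(Arrays.asList(
                new float[] {0.1f, 0.2f, 0.3f},
                new float[] {0.4f, 0.5f, 0.6f}));

        // Load the model once per partition, then score every record in it.
        JavaRDD<Float> scores = features.mapPartitions(rows -> {
            List<Float> out = new ArrayList<>();
            try (ZooModel<float[], float[]> model = criteria().loadModel();
                 Predictor<float[], float[]> predictor = model.newPredictor()) {
                while (rows.hasNext()) {
                    out.add(predictor.predict(rows.next())[0]);
                }
            }
            return out.iterator();
        });

        scores.collect().forEach(System.out::println);
        spark.stop();
    }
}
```

Keeping inference inside the Spark job this way is what removes the separate GPU hop: the data never leaves the executors between feature processing and scoring.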
In summary, the use of DJL enabled TalkingData to overcome the challenges associated with deploying deep learning models at scale, integrating seamlessly with their existing Apache Spark infrastructure. This solution not only improved processing efficiency but also simplified maintenance, illustrating how advancements in technology can lead to substantial operational benefits.