Chapter 3 Design

• You can use Spark's MLlib or other ML libraries such as TensorFlow and PyTorch, which allow data scientists to focus on their data problems and models instead of on the complexities of distributed data, such as infrastructure and configuration.

• Machine learning algorithms involve a sequence of tasks, including preprocessing, feature extraction, and model fitting, to identify outliers. In Apache Spark, ML Pipeline is a high-level ML API that chains such a sequence of stages and executes them with distributed processing capabilities.

• Data streams and processing:

• Data stream processing helps data engineers and data scientists process real-time data from sources and stream engines such as Apache Kafka, RabbitMQ, Redis Simple Message Queue (RSMQ), and Flume.

• Search and query web services:

• Processed data can be pushed out to file systems, databases, and live dashboards using web services.

• Web services are exposed to the UI dashboard, as shown in Figure 3-26. You can trigger a query using a Web API. These Web APIs in turn interact with a trained ML model; the model loads and processes the real-time data and returns prediction results to databases and UI dashboards.
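To make the pipeline idea concrete, here is a minimal, framework-free Python sketch of the stage-chaining pattern described above: each stage transforms the data and hands it to the next, and the fitted pipeline can then score new records for outliers. Spark's real API (`pyspark.ml.Pipeline`) follows the same pattern but runs each stage on distributed data; all class names and the simple threshold model below are illustrative assumptions, not Spark's implementation.

```python
class Preprocessor:
    """Stage 1: clean raw records (drop rows with missing values)."""
    def transform(self, rows):
        return [r for r in rows if None not in r]

class FeatureExtractor:
    """Stage 2: reduce each record to a single numeric feature (its mean)."""
    def transform(self, rows):
        return [sum(r) / len(r) for r in rows]

class OutlierModel:
    """Stage 3: fit a simple threshold model for outlier detection."""
    def fit(self, features):
        mean = sum(features) / len(features)
        var = sum((f - mean) ** 2 for f in features) / len(features)
        # Flag anything more than 2 standard deviations from the mean.
        self.mean, self.threshold = mean, 2 * var ** 0.5
        return self

    def predict(self, feature):
        return abs(feature - self.mean) > self.threshold

class Pipeline:
    """Chain the stages, mirroring the pyspark.ml.Pipeline(stages=[...]) idea."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        # All stages but the last transform the data; the last one is fitted.
        *self.transformers, model = self.stages
        for t in self.transformers:
            rows = t.transform(rows)
        self.model = model.fit(rows)
        return self

    def predict(self, row):
        data = [row]
        for t in self.transformers:
            data = t.transform(data)
        return self.model.predict(data[0])

pipeline = Pipeline([Preprocessor(), FeatureExtractor(), OutlierModel()])
pipeline.fit([[1, 2], [2, 3], [1, 3], [None, 4], [2, 2]])
print(pipeline.predict([100, 200]))  # a far-off record is flagged as an outlier
```

In the Spark version, each `transform` would operate on a distributed DataFrame rather than a Python list, which is what lets the same sequence of stages scale to streaming or large batch data without changing the pipeline structure.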
Building Digital Experience Platforms