RandomForest with PySpark - A ground up implementation

Apr 30, 2023 · 1 min read

[Repo link here](https://github.com/jethrocsau/msbd5003-random-forest)

Abstract: In this study, we use the Spark Python API (PySpark) to implement the random forest algorithm. This distributed variant of random forest can handle large datasets and trains trees in parallel to reduce overall training time. To assess the performance and scalability of the distributed random forest, we ran a comparative analysis across several factors: the algorithm’s performance at different dataset sizes, executor counts, and numbers of RDD partitions, and its performance relative to the built-in implementation in MLlib.
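To make the idea of training trees independently and then combining their votes concrete, here is a minimal sketch in plain Python. It is not the code from the repository and it does not use Spark: a thread pool stands in for Spark executors, each "tree" is only a depth-1 stump, and every name in it (`train_stump`, `train_forest`, `predict`, the toy `rows` data) is hypothetical and chosen for illustration.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels, default=0):
    """Most common label, or `default` when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def train_stump(rows, seed):
    """Fit a depth-1 tree on a bootstrap sample, using one random feature."""
    rng = random.Random(seed)
    sample = [rng.choice(rows) for _ in rows]      # sample with replacement
    f = rng.randrange(len(rows[0][0]))             # random feature choice
    fallback = majority([y for _, y in sample])
    best = None
    for t in sorted({x[f] for x, _ in sample}):    # candidate thresholds
        left = [y for x, y in sample if x[f] <= t]
        right = [y for x, y in sample if x[f] > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(sample)
        if best is None or score < best[0]:
            best = (score, t, majority(left, fallback), majority(right, fallback))
    _, t, left_label, right_label = best
    return f, t, left_label, right_label


def train_forest(rows, n_trees=15, n_workers=4, seed=0):
    """Train the stumps independently, one task per tree.

    The thread pool is only a stand-in for distributed workers; the trees
    do not depend on each other, which is what makes the work parallelizable.
    """
    seeds = range(seed, seed + n_trees)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(lambda s: train_stump(rows, s), seeds))


def predict(forest, x):
    """Majority vote over all stumps."""
    return majority([left if x[f] <= t else right for f, t, left, right in forest])


# Toy data (made up for this sketch): label is 1 when the first feature is >= 10.
rows = [((i, 2 * i), int(i >= 10)) for i in range(20)]
forest = train_forest(rows)
```

Calling `predict(forest, (2, 4))` and `predict(forest, (17, 34))` classifies one point from each side of the toy boundary. In a Spark setting the same map-then-vote shape would presumably be expressed over RDDs rather than a thread pool, but how the repository actually partitions data and builds trees is not shown here.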
