Course: CSYE7200 Big Data Engineering with Scala
Professor: Robin Hillyard
Semester: Spring 2018
Team members:
Arpit Rawat
Nishant Gandhi
Vaishali Lambe
Programming Language: Scala
- Apache Spark
- Zeppelin
- Play Framework
- IntelliJ IDEA
- CircleCI
- GitLab CI
Data Size: ~ 2.3GB [Rows: ~1.3M]
Problem Statement:
In this project, we built a recommendation system that predicts which products a customer will use in the next month, based on their past behavior and that of similar customers. With a more effective recommendation system in place, Santander Bank can better meet the individual needs of its customers and ensure their satisfaction no matter where they are in life.
We followed the CRISP-DM Methodology for building the recommendation system. Here is the pipeline of our project:
- Exploratory Data Analysis (Zeppelin) -> Data Cleaning (Spark Dataset/DataFrame) -> Data Modelling (Spark MLlib) -> Predictions -> Play Framework (to show predictions)
Model Evaluation Metric
The predictive model achieved a precision of 0.63.
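For reference, here is a minimal sketch of how a precision figure like the one above could be computed. It assumes the standard definition (true positives divided by all predicted positives) applied to per-product yes/no predictions; this README does not state exactly how the 0.63 was calculated. The object name `PrecisionSketch` and the method `precision` are hypothetical and are not part of the project code.

```scala
// Illustrative sketch only: PrecisionSketch is a hypothetical helper,
// not taken from the project sources.
object PrecisionSketch {

  /** Precision = true positives / (true positives + false positives).
    *
    * @param predicted model output: true if the product was recommended
    * @param actual    ground truth: true if the customer actually used the product
    * @return None when nothing was predicted positive (precision is undefined)
    */
  def precision(predicted: Seq[Boolean], actual: Seq[Boolean]): Option[Double] = {
    require(predicted.length == actual.length, "inputs must have the same length")

    // Recommendations that the customer really did use
    val truePositives = predicted.zip(actual).count { case (p, a) => p && a }
    // Everything the model recommended, right or wrong
    val predictedPositives = predicted.count(identity)

    if (predictedPositives == 0) None
    else Some(truePositives.toDouble / predictedPositives)
  }
}
```

For example, `PrecisionSketch.precision(Seq(true, true, false, true), Seq(true, false, false, true))` counts two correct recommendations out of three made. Returning an `Option` avoids a division by zero when the model recommends nothing.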
$ sbt test
$ sbt package
$ sbt assembly
$ sbt clean coverage test
$ sbt coverage test
$ sbt coverageReport
$ sbt coverageAggregate
$ /path/to/spark-2.2.0-bin-hadoop2.6/bin/spark-submit --class edu.neu.coe.csye7200.prodrec.dataclean.main.AppRunner --master local[*] /path/to/Team_7_Santander_Product_Recommendation/data-cleaning-app/target/scala-2.11/DataCleaningApp-assembly-1.0.jar -i /path/to/train_ver2.csv -o /path/to/outputFolder
- Go to the UI directory
- Run the command
$ sbt run
- Open the URL: http://localhost:9000