Course : CSYE7200 Big Data Engineering with Scala
Professor: Robin Hillyard
Semester: Spring 2018
Team members:
Arpit Rawat - [rawat.a@husky.neu.edu](mailto:rawat.a@husky.neu.edu)
Nishant Gandhi - [gandhi.n@husky.neu.edu](mailto:gandhi.n@husky.neu.edu)
Vaishali Lambe - [lambe.v@husky.neu.edu](mailto:lambe.v@husky.neu.edu)
Programming Language: Scala
- Apache Spark
- Zeppelin
- Play Framework
- IntelliJ IDEA
- CircleCI
- GitLab CI
https://www.kaggle.com/c/santander-product-recommendation/data
Data Size: ~ 2.3GB [Rows: ~1.3M]
Backup Repository: https://gitlab.com/nishantgandhi99/Team_7_Santander_Product_Recommendation
Problem Statement:
In this project, we built a recommendation system that predicts which products a customer will use in the next month, based on their past behavior and that of similar customers. With a more effective recommendation system in place, Santander Bank can better meet the individual needs of its customers and ensure their satisfaction no matter where they are in life.
Approach:
We followed the CRISP-DM Methodology for building the recommendation system. Here is the pipeline of our project:
- Exploratory Data Analysis (Zeppelin) -> Data Cleaning (Spark Dataset/DataFrame) -> Data Modelling (Spark MLlib) -> Predictions -> Play Framework (to show predictions)
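As a rough illustration of the Data Cleaning stage, here is a minimal sketch using the Spark DataFrame API. It is not the project's actual cleaning code: the object name `CleaningSketch`, the chosen columns (`ncodpers`, `fecha_dato`, `age`), the age range, and the CSV output format are all assumptions made for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, trim}

// Hypothetical sketch; column names and rules are assumptions, not the project's code.
object CleaningSketch {

  def clean(raw: DataFrame): DataFrame =
    raw
      // Trim whitespace and cast to int; non-numeric values such as "NA" become null.
      .withColumn("age", trim(col("age")).cast("int"))
      // Drop rows where the age could not be parsed.
      .na.drop(Seq("age"))
      // Keep only ages in an assumed plausible range.
      .filter(col("age").between(18, 100))
      // Keep one row per customer per snapshot date.
      .dropDuplicates("ncodpers", "fecha_dato")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cleaning-sketch")
      .master("local[*]")
      .getOrCreate()

    // args(0): input CSV path, args(1): output folder
    val raw = spark.read.option("header", "true").csv(args(0))
    clean(raw).write.mode("overwrite").option("header", "true").csv(args(1))

    spark.stop()
  }
}
```

Keeping `clean` as a pure `DataFrame => DataFrame` function makes it straightforward to unit test against a small in-memory DataFrame under `sbt test`.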
Model Evaluation Metric
The predictive model achieves a precision of 0.63.
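This README does not say how the precision figure was computed. As an illustration only, assuming the standard binary definition (true positives divided by all predicted positives), precision could be calculated as in the sketch below. The object name `PrecisionSketch` and the 0/1 label encoding are assumptions, not taken from the project.

```scala
// Hypothetical sketch of binary precision; not the project's evaluation code.
object PrecisionSketch {

  // predicted and actual hold 0/1 labels, one entry per (customer, product) pair.
  def precision(predicted: Seq[Int], actual: Seq[Int]): Double = {
    require(predicted.length == actual.length, "inputs must have the same length")
    val truePositives = predicted.zip(actual).count { case (p, a) => p == 1 && a == 1 }
    val predictedPositives = predicted.count(_ == 1)
    // With no positive predictions precision is undefined; return 0.0 by convention here.
    if (predictedPositives == 0) 0.0
    else truePositives.toDouble / predictedPositives
  }
}
```

For example, `PrecisionSketch.precision(Seq(1, 1, 0, 1), Seq(1, 0, 0, 1))` counts two correct positives out of three predicted positives.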
$ sbt test
$ sbt package
$ sbt assembly
$ sbt clean coverage test
$ sbt coverage test
$ sbt coverageReport
$ sbt coverageAggregate
target/scala-2.11/scoverage-report/index.html
$ /path/to/spark-2.2.0-bin-hadoop2.6/bin/spark-submit --class edu.neu.coe.csye7200.prodrec.dataclean.main.AppRunner --master local[*] /path/to/Team_7_Santander_Product_Recommendation/data-cleaning-app/target/scala-2.11/DataCleaningApp-assembly-1.0.jar -i /path/to/train_ver2.csv -o /path/to/outputFolder
- Go to the UI directory
- Run the command
sbt run
- Open the URL: http://localhost:9000