Step by Step Installation of a Local Data Lake (1/3)
NASDAG.org, a data scientist blog by Philippe Dagher
This post will guide you through a step-by-step installation and configuration of a Local Data Lake on Ubuntu, with packages such as Hadoop, Hive, Spark, Thriftserver, Maven, Scala, Python, Jupyter and Zeppelin.
It is the first of a series of 3 posts that will let you familiarize yourself with state-of-the-art tools to practice Data Science on Big Data.
In the first post we will set up the environment on Ubuntu using a cloud host or a virtual machine. In the second post we will crunch incoming data and expose it to data mining and machine learning tools. In the third post, we will apply machine learning and data science techniques to conventional business cases.
You will need to install Ubuntu 15.10 on a virtual machine, either locally on a PC with at least 8 GB of RAM, or in the cloud with DigitalOcean, Azure or AWS with at least 4 GB of RAM. If you choose to continue on a local machine, we will also be installing and configuring IntelliJ IDEA in our Data Lake environment to code with Scala. I have tested what follows on both a DigitalOcean droplet 15.10x64 and a VMware Workstation 12 Pro running ubuntu-15.10-desktop-amd64.iso (if you prefer to download a preconfigured virtual machine, just follow this link - password: ghghgh).
Once you get your host booted, make sure that you have access as root and as another user with root privileges, which I will call nasdag in this tutorial. Otherwise, from your root account, add this user:
1 adduser nasdag
Grant nasdag root privileges: type visudo, which will open nano to edit the sudoers file. Find the line that says root ALL=(ALL:ALL) ALL and add a line beneath it that says nasdag ALL=(ALL:ALL) ALL. Save by hitting Ctrl-o and then Enter when asked for the file name. Exit nano with Ctrl-x.
If you have an account with root privileges, you can execute a command as root by preceding it with sudo or operate as root with the following command: sudo su -
Now stop being root and start being nasdag, using su - nasdag from your root account or by logging in directly to your host as nasdag.
We will start by installing some basics. Make sure that your system is up-to-date:
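A minimal sketch with apt:

sudo apt-get update
sudo apt-get upgrade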
Let’s synchronize first with your GitHub account. Install git, configure with your e-mail and name; generate a public key:
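A sketch of those steps (substitute your own e-mail and name):

sudo apt-get install git
git config --global user.email "you@example.com"
git config --global user.name "Your Name"
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub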
Copy-paste it into GitHub.com / Personal Settings / SSH Keys / Add SSH Key.
We will later need nasdag to connect over ssh to localhost, so let's add the key to the authorized keys and install the SSH server:
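For example:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
sudo apt-get install openssh-server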
Python
Then we will install Python with some basic packages: sklearn, textblob, the Jupyter notebook, and pyhs2, which we will use to test the JDBC connection to Hive:
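A minimal sketch, assuming pip and the Ubuntu system Python; numpy, scipy and pandas are added here as extras, and pyhs2 needs the SASL development headers to build:

sudo apt-get install python-dev python-pip libsasl2-dev
sudo pip install numpy scipy pandas scikit-learn textblob jupyter pyhs2

Next, generate a hashed password for the notebook server: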
1 ipython
2 In [1]: from IPython.lib import passwd
3 In [2]: passwd()
4 Enter password:
5 Verify password:
6 Out[2]: 'sha1:c7a1db8c1db8:67e3xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
Create a certificate valid for 365 days with both the key and certificate data written to the same file:
1 openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
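Put the settings below in the notebook server configuration file; the exact location depends on your IPython/Jupyter version (for example ~/.ipython/profile_default/ipython_notebook_config.py for a profile-based IPython, or ~/.jupyter/jupyter_notebook_config.py for Jupyter):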
1 c = get_config()
2 c.IPKernelApp.pylab = 'inline' # if you want plotting support always
3 c.NotebookApp.certfile = u'/home/nasdag/mycert.pem'
4 c.NotebookApp.ip = '*'
5 c.NotebookApp.open_browser = False
6 c.NotebookApp.password = u'sha1:c7a1db8c1db8:67e3xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
7 c.NotebookApp.port = 4334
8 c.NotebookApp.base_url = '/pyspark/'
9 c.NotebookApp.webapp_settings = {'static_url_prefix':'/pyspark/static/'}
10 c.NotebookApp.notebook_dir = '/home/nasdag/tutorials/pyspark/'
Test the IPython notebook server by browsing to https://host_ip_address:4334/pyspark/ (see the Start Jupyter section at the end for the command to launch it). Do not run the test notebooks that you downloaded from GitHub yet, as you first need to start other services, as explained later.
Java 7
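A sketch, assuming the WebUpd8 Oracle Java PPA (it installs to /usr/lib/jvm/java-7-oracle, the path used throughout this setup):

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer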
MySQL
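A sketch: install the MySQL server together with the JDBC connector (on Ubuntu, libmysql-java provides /usr/share/java/mysql-connector-java.jar, which Spark is pointed to later):

sudo apt-get install mysql-server libmysql-java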
We need to prepare a metastore database for Hive. Download Hive from https://hive.apache.org/downloads.html and get hive-schema-1.2.0.mysql.sql and hive-txn-schema-0.13.0.mysql.sql:
1 wget http://apache.crihan.fr/dist/hive/stable/apache-hive-1.2.1-bin.tar.gz
2 tar -zxvf apache-hive-1.2.1-bin.tar.gz apache-hive-1.2.1-bin/scripts/metastore/upgrade/mysql/hive-schema-1.2.0.mysql.sql apache-hive-1.2.1-bin/scripts/metastore/upgrade/mysql/hive-txn-schema-0.13.0.mysql.sql
3 cd apache-hive-1.2.1-bin/scripts/metastore/upgrade/mysql/
4
5 mysql -u root -p
6 Enter password:
7 mysql> CREATE DATABASE metastore;
8 mysql> USE metastore;
9 mysql> SOURCE hive-schema-1.2.0.mysql.sql;
10 mysql> CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'hivepassword';
11 mysql> GRANT all on *.* to 'hiveuser'@localhost identified by 'hivepassword';
12 mysql> flush privileges;
13 mysql> exit;
You can now delete all the downloaded Hive files as we will no longer use them: cd; rm -r apache-hive-1.2.1-bin*.
Scala
1 wget http://downloads.typesafe.com/scala/2.10.6/scala-2.10.6.tgz
2 sudo tar -xzf scala-2.10.6.tgz -C /usr/local/share
3 rm scala-2.10.6.tgz
Maven
1 wget http://mirrors.ircam.fr/pub/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
2 sudo tar -xzf apache-maven-3.3.9-bin.tar.gz -C /usr/local/share
3 sudo mv /usr/local/share/apache-maven-3.3.9 /usr/local/share/maven-3.3.9
4 rm apache-maven-3.3.9-bin.tar.gz
Hadoop
1 wget http://apache.mirrors.ovh.net/ftp.apache.org/dist/hadoop/common/hadoop-2.6.2/hadoop-2.6.2.tar.gz
2 sudo tar -xzf hadoop-2.6.2.tar.gz -C /usr/local/share
3 rm hadoop-2.6.2.tar.gz
4 sudo chown -R nasdag:nasdag /usr/local/share/hadoop-2.6.2/
5 sudo mkdir /var/local/hadoop
6 sudo chown -R nasdag:nasdag /var/local/hadoop
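The following properties go into core-site.xml (vi /usr/local/share/hadoop-2.6.2/etc/hadoop/core-site.xml):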
1 <configuration>
2
3 <property>
4 <name>hadoop.tmp.dir</name>
5 <value>/var/local/hadoop/tmp</value>
6 </property>
7
8 <property>
9 <name>fs.default.name</name>
10 <value>hdfs://localhost:54310</value>
11 </property>
12
13 </configuration>
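Next, mapred-site.xml in the same folder (copy mapred-site.xml.template to mapred-site.xml if the file does not exist yet):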
1 <configuration>
2
3 <property>
4 <name>mapred.job.tracker</name>
5 <value>localhost:54311</value>
6 </property>
7
8 </configuration>
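And finally hdfs-site.xml: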
1 <configuration>
2
3 <property>
4 <name>dfs.replication</name>
5 <value>1</value>
6 </property>
7
8 </configuration>
vi /usr/local/share/hadoop-2.6.2/etc/hadoop/hadoop-env.sh and add the following at the very end to tell Hadoop where Java 7 is:
1 export JAVA_HOME=/usr/lib/jvm/java-7-oracle
It is time to set some environment variables in .bashrc; we will do it now for everything that is coming, anticipating Spark, Zeppelin and ideaIC as well:
1 export JAVA_HOME=/usr/lib/jvm/java-7-oracle
2 export SCALA_HOME=/usr/local/share/scala-2.10.6
3 export MAVEN_HOME=/usr/local/share/maven-3.3.9
4 export PATH=$PATH:$MAVEN_HOME/bin:$SCALA_HOME/bin:/home/nasdag/idea-IC/bin/
5 export IBUS_ENABLE_SYNC_MODE=1
6
7 export HADOOP_HOME=/usr/local/share/hadoop-2.6.2
8 export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
9 unalias fs &> /dev/null
10 alias fs="hadoop fs"
11 unalias hls &> /dev/null
12 alias hls="fs -ls"
13
14 export SPARK_HOME=/usr/local/share/spark-1.5.2
15 export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
16 export HADOOP_USER_CLASSPATH_FIRST=true
17 export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/
18
19 export PYSPARK_SUBMIT_ARGS="--packages com.databricks:spark-csv_2.11:1.1.0 pyspark-shell"
20
21 export PATH=$PATH:/home/nasdag/zeppelin/bin
Exit and log in as user nasdag one more time for the settings to be applied.
Spark
1 wget http://mirrors.ircam.fr/pub/apache/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz
2 sudo tar -xzf spark-1.5.2-bin-hadoop2.6.tgz -C /usr/local/share
3 sudo mv /usr/local/share/spark-1.5.2-bin-hadoop2.6 /usr/local/share/spark-1.5.2
4 rm spark-1.5.2-bin-hadoop2.6.tgz
Allow nasdag to write to the Spark logs: sudo mkdir -p /usr/local/share/spark-1.5.2/logs; sudo chmod 777 /usr/local/share/spark-1.5.2/logs
Create hive-site.xml in the conf folder with the configuration below (sudo vi /usr/local/share/spark-1.5.2/conf/hive-site.xml):
1 <configuration>
2 <property>
3 <name>javax.jdo.option.ConnectionURL</name>
4 <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
5 <description>metadata is stored in a MySQL server</description>
6 </property>
7 <property>
8 <name>javax.jdo.option.ConnectionDriverName</name>
9 <value>com.mysql.jdbc.Driver</value>
10 <description>MySQL JDBC driver class</description>
11 </property>
12 <property>
13 <name>javax.jdo.option.ConnectionUserName</name>
14 <value>hiveuser</value>
15 <description>user name for connecting to mysql server</description>
16 </property>
17 <property>
18 <name>javax.jdo.option.ConnectionPassword</name>
19 <value>hivepassword</value>
20 <description>password for connecting to mysql server</description>
21 </property>
22 </configuration>
sudo vi /usr/local/share/spark-1.5.2/conf/spark-defaults.conf and add the following to tell Spark where the JDBC connector for MySQL is:
1 spark.driver.extraClassPath /usr/share/java/mysql-connector-java.jar
2 spark.master local[2]
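To use Spark from plain Python (as the test notebooks do), add the Spark Python libraries to the Python path: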
1 import sys
2 sys.path.append('/usr/local/share/spark-1.5.2/python/')
3 sys.path.append('/usr/local/share/spark-1.5.2/python/lib/py4j-0.8.2.1-src.zip')
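You can then create a SparkContext to check that everything is wired together: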
1 import pyspark
2 sc = pyspark.SparkContext()
Zeppelin
Get the latest source from GitHub and compile it with Maven:
1 cd ~
2 git clone http://github.com/apache/incubator-zeppelin
3 mv incubator-zeppelin zeppelin
4 cd zeppelin
5 export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=128m"
6 mvn install -DskipTests -Dspark.version=1.5.2 -Dhadoop.version=2.6.2
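Point Zeppelin to the local Spark installation and pass it the same spark-csv package and MySQL JDBC jar as before; these exports typically go into Zeppelin's conf/zeppelin-env.sh (create it from conf/zeppelin-env.sh.template if needed):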
1 export SPARK_HOME=/usr/local/share/spark-1.5.2
2 export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.11:1.1.0 --jars /usr/share/java/mysql-connector-java.jar"
You can now run the tutorial at http://host_ip_address:8080/. But first you need to start Hadoop and the Zeppelin daemon:
1 start-dfs.sh
2 zeppelin-daemon.sh start
IntelliJ IDEA
1 wget https://download.jetbrains.com/idea/ideaIC-15.0.2.tar.gz
2 tar -xzf ideaIC-15.0.2.tar.gz -C ~
3 mv ~/idea-IC-143.1184.17 ~/idea-IC
4 rm ideaIC-15.0.2.tar.gz
We will have to specify the Scala version and the Maven version that we are using in the setup.
Start Hadoop
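A sketch of the commands; the namenode must be formatted once before the very first start:

hdfs namenode -format
start-dfs.sh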
Start Thriftserver
start-thriftserver.sh
Start Jupyter
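A sketch of the command, depending on whether you configured IPython or Jupyter above:

ipython notebook
# or: jupyter notebook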
You should already have cloned the tutorials from my GitHub (git clone http://github.com/nasdag/pyspark) - browse now to the notebooks https://host_ip_address:4334/pyspark/notebooks/test1.ipynb or https://host_ip_address:4334/pyspark/notebooks/test2.ipynb and follow the self-explanatory tests …
Setting up ideaIC
Launch idea.sh. Select I do not have a previous version - Skip All and Set defaults.
Configure - Project Defaults/Project Structure - Platform Settings - SDKs - Add New SDK - /usr/lib/jvm/java-7-oracle
Configure - Project Defaults/Project Structure - Project Settings - Project - Project SDK - 1.7
Create New Project - Maven - Project SDK 1.7 - … - Open Module Settings (F4) - Add Scala Support - /usr/local/share/scala-2.10.6 - … - main / New Directory / scala / Mark Directory As Sources Root - test / New Directory / scala / Mark Directory As Test Sources Root
Swap
If your droplet or virtual machine is short on RAM, create and enable a swap file and tune the kernel's swap settings.
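A rough sketch of the swap file creation (the 4 GB size and the /swapfile path are placeholders, adjust them to your host):

sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

Then add the following to /etc/sysctl.conf: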
1 vm.swappiness=10
2 vm.vfs_cache_pressure = 50
Cheers,
Philippe
http://linkedin.com/in/nasdag