
NASDAG.org
A data scientist blog, by Philippe Dagher

Step by Step Installation of a Local Data Lake (1/3)


Dec 12th, 2015 8:35 pm

This post will guide you through a step-by-step installation and configuration of a local Data Lake on Ubuntu, with packages such as Hadoop, Hive, Spark, Thriftserver, Maven, Scala, Python, Jupyter and Zeppelin.

It is the first of a series of 3 posts that will allow you to familiarize yourself with state-of-the-art tools for practicing Data Science on Big Data.

In the first post we will set up the environment on Ubuntu using a cloud host or a virtual machine. In the second post we will crunch incoming data and expose it to data mining and machine learning tools. In the third post, we will apply machine learning and data science techniques to conventional business cases.

You will need to install Ubuntu 15.10 on a virtual machine, either locally on a PC with at least 8 GB of RAM or in the cloud with DigitalOcean, Azure or AWS with at least 4 GB of RAM. If you choose to continue on a local machine, we will also be installing and configuring IntelliJ IDEA in our Data Lake environment to code with Scala. I have tested what follows on both a DigitalOcean droplet 15.10 x64 and a VMware Workstation 12 Pro running ubuntu-15.10-desktop-amd64.iso (if you prefer to download a preconfigured virtual machine, just follow this link - password: ghghgh).

Once you get your host booted, make sure that you have access as root and as another user with root privileges - which I will call nasdag in this tutorial. Otherwise, add this user from your root account:

adduser nasdag

Grant nasdag root privileges: type visudo; this will open up nano to edit the sudoers file. Find the line that says root ALL=(ALL:ALL) ALL and add a line beneath it that says nasdag ALL=(ALL:ALL) ALL. Save by hitting Ctrl-O and then Enter when asked for the file name. Exit nano with Ctrl-X.

If you have an account with root privileges, you can execute a command as root by preceding it with sudo or operate as root with the following command: sudo su -

Now stop being root and start being nasdag, either with su - nasdag from your root account or by logging in directly to your host as nasdag.

We will start by installing some basics. Make sure that your system is up-to-date:

sudo apt-get update

Git and ssh

Let’s synchronize first with your GitHub account. Install git, configure with your e-mail and name; generate a public key:

sudo apt-get install git

git config --global user.email your@email.address
git config --global user.name "Your Username"

ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub

Copy-paste it into GitHub.com / Personal Settings / SSH Keys / Add SSH Key.

Verify your access: ssh -T git@github.com

Later, nasdag will need to connect with ssh to localhost, so let's add the key to the authorized keys and install the ssh server:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

sudo apt-get install openssh-server

Test with: ssh localhost

Python

Then we will install Python with some basic packages: sklearn, TextBlob, the Jupyter notebook, and pyhs2, which we will use to test the JDBC connection to Hive:

sudo apt-get install python-pip

sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose
sudo pip install textblob
python -m textblob.download_corpora
sudo pip install --upgrade ipython
sudo pip install jupyter
sudo apt-get install libsasl2-dev
sudo pip install sasl
sudo pip install pyhs2
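As a reference for later, here is a minimal sketch of how pyhs2 can be used to test the Thrift/JDBC connection to Hive once the Thrift server is running (see the end of this post); the host, port, user and database below are assumptions for this local setup:

import pyhs2

# assumes the Spark Thrift Server is listening on localhost, default port 10000
with pyhs2.connect(host='localhost',
                   port=10000,
                   authMechanism="PLAIN",
                   user='nasdag',
                   password='',
                   database='default') as conn:
    with conn.cursor() as cur:
        # a trivial query: list the databases known to the metastore
        cur.execute("show databases")
        for row in cur.fetch():
            print row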

Let's now secure the connection to the IPython notebook.

Prepare a hashed password with ipython:

ipython
In [1]: from IPython.lib import passwd
In [2]: passwd()
Enter password:
Verify password:
Out[2]: 'sha1:c7a1db8c1db8:67e3xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

Create a certificate valid for 365 days with both the key and certificate data written to the same file:

openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem

Generate a configuration file and input the configuration as listed below:

jupyter notebook --generate-config

mkdir -p ~/tutorials
cd ~/tutorials
git clone http://github.com/nasdag/pyspark
vi ~/.jupyter/jupyter_notebook_config.py

Add the following lines at the beginning of the jupyter_notebook_config.py file:

c = get_config()
c.IPKernelApp.pylab = 'inline'  # if you want plotting support always
c.NotebookApp.certfile = u'/home/nasdag/mycert.pem'
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'sha1:c7a1db8c1db8:67e3xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
c.NotebookApp.port = 4334
c.NotebookApp.base_url = '/pyspark/'
c.NotebookApp.webapp_settings = {'static_url_prefix':'/pyspark/static/'}
c.NotebookApp.notebook_dir = '/home/nasdag/tutorials/pyspark/'

Later we will need to edit ~/.ipython/profile_default/startup/initspark.py to include the path to pyspark.

Test the IPython notebook by browsing to https://host_ip_address:4334/pyspark/. Do not run the test notebooks that you downloaded from GitHub yet, as you first need to start other services, as explained later.

Java 7

sudo apt-get install python-software-properties

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-jdk7-installer

Test your version: java -version

MySQL

sudo apt-get install mysql-server

sudo apt-get install libmysql-java

We need to prepare a metastore database for Hive. Download Hive from https://hive.apache.org/downloads.html and get hive-schema-1.2.0.mysql.sql and hive-txn-schema-0.13.0.mysql.sql:

wget http://apache.crihan.fr/dist/hive/stable/apache-hive-1.2.1-bin.tar.gz
tar -zxvf apache-hive-1.2.1-bin.tar.gz apache-hive-1.2.1-bin/scripts/metastore/upgrade/mysql/hive-schema-1.2.0.mysql.sql apache-hive-1.2.1-bin/scripts/metastore/upgrade/mysql/hive-txn-schema-0.13.0.mysql.sql
cd apache-hive-1.2.1-bin/scripts/metastore/upgrade/mysql/

mysql -u root -p
Enter password:
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE hive-schema-1.2.0.mysql.sql;
mysql> CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'hivepassword';
mysql> GRANT all on *.* to 'hiveuser'@localhost identified by 'hivepassword';
mysql> flush privileges;
mysql> exit;

You can now delete all the downloaded Hive files as we will no longer use them: cd; rm -r apache-hive-1.2.1-bin*.

Scala

Go to http://scala-lang.org/ - Download - All downloads - and get version 2.10.6:

wget http://downloads.typesafe.com/scala/2.10.6/scala-2.10.6.tgz
sudo tar -xzf scala-2.10.6.tgz -C /usr/local/share
rm scala-2.10.6.tgz

Maven

Go to https://maven.apache.org/download.cgi - and get the latest version:

wget http://mirrors.ircam.fr/pub/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
sudo tar -xzf apache-maven-3.3.9-bin.tar.gz -C /usr/local/share
sudo mv /usr/local/share/apache-maven-3.3.9 /usr/local/share/maven-3.3.9
rm apache-maven-3.3.9-bin.tar.gz

Hadoop

Go to http://hadoop.apache.org/releases.html - and get version 2.6.2:

wget http://apache.mirrors.ovh.net/ftp.apache.org/dist/hadoop/common/hadoop-2.6.2/hadoop-2.6.2.tar.gz
sudo tar -xzf hadoop-2.6.2.tar.gz -C /usr/local/share
rm hadoop-2.6.2.tar.gz
sudo chown -R nasdag:nasdag /usr/local/share/hadoop-2.6.2/
sudo mkdir /var/local/hadoop
sudo chown -R nasdag:nasdag /var/local/hadoop

Now we have to edit the configuration files:

vi /usr/local/share/hadoop-2.6.2/etc/hadoop/core-site.xml and replace the content with:

<configuration>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/local/hadoop/tmp</value>
  </property>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>

</configuration>

vi /usr/local/share/hadoop-2.6.2/etc/hadoop/mapred-site.xml and replace the content with:

<configuration>

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>

</configuration>

vi /usr/local/share/hadoop-2.6.2/etc/hadoop/hdfs-site.xml and replace the content with:

<configuration>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

</configuration>

vi /usr/local/share/hadoop-2.6.2/etc/hadoop/hadoop-env.sh and add the following at the very end to tell Hadoop where Java 7 is:

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

It is time to set some environment variables in .bashrc - we will do it now for everything that is coming, anticipating Spark, Zeppelin and IntelliJ IDEA (ideaIC):

vi ~/.bashrc and add the following at the very end:


export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export SCALA_HOME=/usr/local/share/scala-2.10.6
export MAVEN_HOME=/usr/local/share/maven-3.3.9
export PATH=$PATH:$MAVEN_HOME/bin:$SCALA_HOME/bin:/home/nasdag/idea-IC/bin/
export IBUS_ENABLE_SYNC_MODE=1

export HADOOP_HOME=/usr/local/share/hadoop-2.6.2
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

export SPARK_HOME=/usr/local/share/spark-1.5.2
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export HADOOP_USER_CLASSPATH_FIRST=true
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/

export PYSPARK_SUBMIT_ARGS="--packages com.databricks:spark-csv_2.11:1.1.0 pyspark-shell"

export PATH=$PATH:/home/nasdag/zeppelin/bin

Exit and log in as user nasdag one more time for the settings to be applied.

Format HDFS (the Hadoop filesystem):

hdfs namenode -format

Spark

Go to http://spark.apache.org/downloads.html - and get a prebuilt version for Hadoop 2.6:

wget http://mirrors.ircam.fr/pub/apache/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz
sudo tar -xzf spark-1.5.2-bin-hadoop2.6.tgz -C /usr/local/share
sudo mv /usr/local/share/spark-1.5.2-bin-hadoop2.6 /usr/local/share/spark-1.5.2
rm spark-1.5.2-bin-hadoop2.6.tgz

Allow nasdag to write to the Spark logs directory: sudo mkdir -p /usr/local/share/spark-1.5.2/logs; sudo chmod 777 /usr/local/share/spark-1.5.2/logs

Create hive-site.xml in the conf folder with the configuration below (sudo vi /usr/local/share/spark-1.5.2/conf/hive-site.xml):

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
    <description>user name for connecting to mysql server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
    <description>password for connecting to mysql server</description>
  </property>
</configuration>

sudo vi /usr/local/share/spark-1.5.2/conf/spark-defaults.conf and add the following to tell Spark where the JDBC connector for MySQL is:

spark.driver.extraClassPath /usr/share/java/mysql-connector-java.jar
spark.master local[2]

vi ~/.ipython/profile_default/startup/initspark.py and add the following to tell IPython where pyspark is:

import sys
sys.path.append('/usr/local/share/spark-1.5.2/python/')
sys.path.append('/usr/local/share/spark-1.5.2/python/lib/py4j-0.8.2.1-src.zip')

When starting a Jupyter notebook, you need to initialize it with:

import pyspark
sc = pyspark.SparkContext()

Of course, you need to start Hadoop first: start-dfs.sh.
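As a quick sanity check, here is a minimal, assumed example of a first notebook cell once HDFS and the metastore are up; it relies only on the SparkContext shown above and on the Hive metastore configured in hive-site.xml:

import pyspark
from pyspark.sql import HiveContext

sc = pyspark.SparkContext()   # uses spark.master local[2] from spark-defaults.conf
sqlContext = HiveContext(sc)  # talks to the MySQL-backed metastore from hive-site.xml

# a trivial RDD computation to confirm that Spark itself works
print sc.parallelize(range(100)).sum()   # expected: 4950

# list the databases known to the Hive metastore (only 'default' at this point)
sqlContext.sql("SHOW DATABASES").show()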

Zeppelin

Get the latest source from GitHub and compile it with Maven:

cd ~
git clone http://github.com/apache/incubator-zeppelin
mv incubator-zeppelin zeppelin
cd zeppelin
export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=128m"
mvn install -DskipTests -Dspark.version=1.5.2 -Dhadoop.version=2.6.2

vi ~/zeppelin/conf/zeppelin-env.sh and input the following:

export SPARK_HOME=/usr/local/share/spark-1.5.2
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.11:1.1.0 --jars /usr/share/java/mysql-connector-java.jar"

You can now run the tutorial at http://host_ip_address:8080/. But first you need to start Hadoop and the Zeppelin daemon:

start-dfs.sh
zeppelin-daemon.sh start
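For a first test in the Zeppelin UI, a minimal %pyspark paragraph like the one below (an assumed example, not part of the original tutorial) confirms that the Spark interpreter and the Hive metastore are wired up; sc and sqlContext are provided by Zeppelin's pyspark interpreter:

%pyspark
# sc and sqlContext are injected by the Zeppelin Spark interpreter
print sc.parallelize(range(10)).count()      # expected: 10
sqlContext.sql("SHOW DATABASES").show()      # lists the Hive metastore databases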

Securing the access to Zeppelin is outside the scope of this post.

IntelliJ IDEA

Go to https://www.jetbrains.com/idea/download/ - and get the Community version for Linux:

wget https://download.jetbrains.com/idea/ideaIC-15.0.2.tar.gz
tar -xzf ideaIC-15.0.2.tar.gz -C ~
mv ~/idea-IC-143.1184.17 ~/idea-IC
rm ideaIC-15.0.2.tar.gz

We will have to specify the scala version and the maven version that we are using in the setup.

Start Hadoop


start-dfs.sh and check the running servers with jps.

Start Thriftserver

start-thriftserver.sh

Start Jupyter

You should have already cloned the tutorials from my GitHub (git clone http://github.com/nasdag/pyspark) - now browse to the notebooks https://host_ip_address:4334/pyspark/notebooks/test1.ipynb or https://host_ip_address:4334/pyspark/notebooks/test2.ipynb and follow the self-explanatory tests …

Setting up ideaIC

Launch idea.sh. Select I do not have a previous version - Skip All and Set defaults.

Configure - Plugins - Install Scala - Restart

Configure - Settings - Build/Build Tools - Maven - Maven Home Directory - /usr/local/share/maven-3.3.9

Configure - Project Defaults/Project Structure - Platform Settings - SDKs - Add New SDK - /usr/lib/jvm/java-7-oracle

Configure - Project Defaults/Project Structure - Project Settings - Project - Project SDK - 1.7

Create New Project - Maven - Project SDK 1.7 - … - Open Module Settings (F4) - Add Scala Support - /usr/local/share/scala-2.10.6 - … - main / New Directory / scala / Mark Directory As Sources Root - test / New Directory / scala / Mark Directory As Test Sources Root

Swap

If sudo swapon -s is empty, then I suggest creating a 4 GB swapfile:

sudo fallocate -l 4G /swapfile

sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
sudo vi /etc/fstab

Add:

/swapfile none swap sw 0 0

Continue with:

sudo sysctl vm.swappiness=10

sudo sysctl vm.vfs_cache_pressure=50
sudo vi /etc/sysctl.conf

Add:

vm.swappiness=10
vm.vfs_cache_pressure=50

Cheers,

Philippe

http://linkedin.com/in/nasdag

Posted by Philippe DAGHER Dec 12th, 2015 8:35 pm


