
NASDAG.org
A data scientist blog, by Philippe Dagher

Step by Step Installation of a Local Data Lake (1/3)


Dec 12th, 2015 8:35 pm

This post will guide you through a step-by-step installation and configuration of a local Data Lake on Ubuntu, with packages such as Hadoop, Hive, Spark, Thriftserver, Maven, Scala, Python, Jupyter and Zeppelin.

It is the first of a series of 3 posts that will allow you to familiarize yourself with state-of-the-art tools for practicing Data Science on Big Data.

In the first post we will set up the environment on Ubuntu using a cloud host or a virtual machine. In the second post we will crunch incoming data and expose it to data mining and machine learning tools. In the third post, we will apply machine learning and data science techniques to conventional business cases.

You will need to install Ubuntu 15.10 on a virtual machine, either locally on a PC with at least 8 GB of RAM or in the cloud with DigitalOcean, Azure or AWS with at least 4 GB of RAM. If you choose to continue on a local machine, we will also be installing and configuring IntelliJ IDEA in our Data Lake environment to code with Scala. I have tested what follows on both a DigitalOcean droplet 15.10 x64 and a VMware Workstation 12 Pro running ubuntu-15.10-desktop-amd64.iso (if you prefer to download a preconfigured virtual machine, just follow this link - password: ghghgh).

Once you get your host booted, make sure that you have access as root and as another user with root privileges - which I will call nasdag in this tutorial. Otherwise, add this user from your root account:

adduser nasdag

Grant nasdag root privileges: type visudo; this will open up nano to edit the sudoers file. Find the line that says root ALL=(ALL:ALL) ALL and add a line beneath it that says nasdag ALL=(ALL:ALL) ALL. Save by hitting Ctrl-O and then Enter when asked for the file name. Exit nano with Ctrl-X.

If you have an account with root privileges, you can execute a command as root by preceding it with sudo or operate as root with the following command: sudo su -

Now stop being root and start being nasdag, either with su - nasdag from your root account or by logging in directly to your host as nasdag.

We will start by installing some basics. Make sure that your system is up-to-date:

sudo apt-get update

Git and ssh

Let’s synchronize first with your GitHub account. Install git, configure with your e-mail and name; generate a public key:

sudo apt-get install git

git config --global user.email your@email.address
git config --global user.name "Your Username"

ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub

Copy-paste it into GitHub.com / Personal Settings / SSH Keys / Add SSH Key.

Verify your access: ssh -T git@github.com

Later, nasdag will need to connect with ssh to localhost, so let's add the key to the authorized keys and install the ssh server:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

sudo apt-get install openssh-server

Test with: ssh localhost

Python

Then we will install Python with some basic packages: sklearn, TextBlob, the Jupyter notebook, and pyhs2, which we will use to test the JDBC connection to Hive:

sudo apt-get install python-pip

sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose
sudo pip install textblob
python -m textblob.download_corpora
sudo pip install --upgrade ipython
sudo pip install jupyter
sudo apt-get install libsasl2-dev
sudo pip install sasl
sudo pip install pyhs2
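As a reference for later, here is a minimal sketch of how pyhs2 can be used to test the Thrift/JDBC connection to Hive once the Thrift server is running (see the end of this post); the host, port, user and database below are assumptions for this local setup:

import pyhs2

# assumes the Spark Thrift Server is listening on localhost, default port 10000
with pyhs2.connect(host='localhost',
                   port=10000,
                   authMechanism="PLAIN",
                   user='nasdag',
                   password='',
                   database='default') as conn:
    with conn.cursor() as cur:
        # a trivial query: list the databases known to the metastore
        cur.execute("show databases")
        for row in cur.fetch():
            print row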

Let's now secure the connection to the IPython notebook.

Prepare a hashed password with ipython:

ipython
In [1]: from IPython.lib import passwd
In [2]: passwd()
Enter password:
Verify password:
Out[2]: 'sha1:c7a1db8c1db8:67e3xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

Create a certificate valid for 365 days with both the key and certificate data written to the same file:

openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem

Generate a configuration file and input the configuration as listed below:

jupyter notebook --generate-config

mkdir -p ~/tutorials
cd ~/tutorials
git clone http://github.com/nasdag/pyspark
vi ~/.jupyter/jupyter_notebook_config.py

Add the following lines at the beginning of the jupyter_notebook_config.py file:

c = get_config()
c.IPKernelApp.pylab = 'inline'  # if you want plotting support always
c.NotebookApp.certfile = u'/home/nasdag/mycert.pem'
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'sha1:c7a1db8c1db8:67e3xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
c.NotebookApp.port = 4334
c.NotebookApp.base_url = '/pyspark/'
c.NotebookApp.webapp_settings = {'static_url_prefix':'/pyspark/static/'}
c.NotebookApp.notebook_dir = '/home/nasdag/tutorials/pyspark/'

Later we will need to edit ~/.ipython/profile_default/startup/initspark.py to include the path to pyspark.

Test the IPython notebook by browsing to https://host_ip_address:4334/pyspark/. Do not run the test notebooks that you downloaded from GitHub yet, as you first need to start other services, as explained later.

Java 7

sudo apt-get install python-software-properties

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-jdk7-installer

Test your version: java -version

MySQL

sudo apt-get install mysql-server

sudo apt-get install libmysql-java

We need to prepare a metastore database for Hive. Download Hive from https://hive.apache.org/downloads.html and get hive-schema-1.2.0.mysql.sql and hive-txn-schema-0.13.0.mysql.sql:

wget http://apache.crihan.fr/dist/hive/stable/apache-hive-1.2.1-bin.tar.gz
tar -zxvf apache-hive-1.2.1-bin.tar.gz apache-hive-1.2.1-bin/scripts/metastore/upgrade/mysql/hive-schema-1.2.0.mysql.sql apache-hive-1.2.1-bin/scripts/metastore/upgrade/mysql/hive-txn-schema-0.13.0.mysql.sql
cd apache-hive-1.2.1-bin/scripts/metastore/upgrade/mysql/

mysql -u root -p
Enter password:
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE hive-schema-1.2.0.mysql.sql;
mysql> CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'hivepassword';
mysql> GRANT all on *.* to 'hiveuser'@localhost identified by 'hivepassword';
mysql> flush privileges;
mysql> exit;

You can now delete all the downloaded Hive files as we will no longer use them: cd; rm -r apache-hive-1.2.1-bin*.

Scala

Go to http://scala-lang.org/ - Download - All downloads - and get version 2.10.6:

wget http://downloads.typesafe.com/scala/2.10.6/scala-2.10.6.tgz
sudo tar -xzf scala-2.10.6.tgz -C /usr/local/share
rm scala-2.10.6.tgz

Maven

Go to https://maven.apache.org/download.cgi - and get the latest version:

wget http://mirrors.ircam.fr/pub/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
sudo tar -xzf apache-maven-3.3.9-bin.tar.gz -C /usr/local/share
sudo mv /usr/local/share/apache-maven-3.3.9 /usr/local/share/maven-3.3.9
rm apache-maven-3.3.9-bin.tar.gz

Hadoop

Go to http://hadoop.apache.org/releases.html - and get version 2.6.2:

wget http://apache.mirrors.ovh.net/ftp.apache.org/dist/hadoop/common/hadoop-2.6.2/hadoop-2.6.2.tar.gz
sudo tar -xzf hadoop-2.6.2.tar.gz -C /usr/local/share
rm hadoop-2.6.2.tar.gz
sudo chown -R nasdag:nasdag /usr/local/share/hadoop-2.6.2/
sudo mkdir /var/local/hadoop
sudo chown -R nasdag:nasdag /var/local/hadoop

Now we have to edit the configuration files:

vi /usr/local/share/hadoop-2.6.2/etc/hadoop/core-site.xml and replace the content with:

<configuration>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/local/hadoop/tmp</value>
  </property>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>

</configuration>

vi /usr/local/share/hadoop-2.6.2/etc/hadoop/mapred-site.xml and replace the content with:

<configuration>

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>

</configuration>

vi /usr/local/share/hadoop-2.6.2/etc/hadoop/hdfs-site.xml and replace the content with:

<configuration>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

</configuration>

vi /usr/local/share/hadoop-2.6.2/etc/hadoop/hadoop-env.sh and add the following at the very end to tell Hadoop where Java 7 is:

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

It is time to set some environment variables in .bashrc - we will do it now for everything that is coming, anticipating Spark, Zeppelin and IntelliJ IDEA (ideaIC):

vi ~/.bashrc and add the following at the very end:


export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export SCALA_HOME=/usr/local/share/scala-2.10.6
export MAVEN_HOME=/usr/local/share/maven-3.3.9
export PATH=$PATH:$MAVEN_HOME/bin:$SCALA_HOME/bin:/home/nasdag/idea-IC/bin/
export IBUS_ENABLE_SYNC_MODE=1

export HADOOP_HOME=/usr/local/share/hadoop-2.6.2
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

export SPARK_HOME=/usr/local/share/spark-1.5.2
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export HADOOP_USER_CLASSPATH_FIRST=true
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/

export PYSPARK_SUBMIT_ARGS="--packages com.databricks:spark-csv_2.11:1.1.0 pyspark-shell"

export PATH=$PATH:/home/nasdag/zeppelin/bin

Exit and log in as user nasdag one more time for the settings to be applied.

Format HDFS (the Hadoop filesystem):

hdfs namenode -format

Spark

Go to http://spark.apache.org/downloads.html - and get a prebuilt version for Hadoop 2.6:

wget http://mirrors.ircam.fr/pub/apache/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz
sudo tar -xzf spark-1.5.2-bin-hadoop2.6.tgz -C /usr/local/share
sudo mv /usr/local/share/spark-1.5.2-bin-hadoop2.6 /usr/local/share/spark-1.5.2
rm spark-1.5.2-bin-hadoop2.6.tgz

Allow nasdag to write to the Spark logs directory: sudo mkdir -p /usr/local/share/spark-1.5.2/logs; sudo chmod 777 /usr/local/share/spark-1.5.2/logs

Create hive-site.xml in the conf folder with the configuration below (sudo vi /usr/local/share/spark-1.5.2/conf/hive-site.xml):

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
    <description>user name for connecting to mysql server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
    <description>password for connecting to mysql server</description>
  </property>
</configuration>

sudo vi /usr/local/share/spark-1.5.2/conf/spark-defaults.conf and add the following to tell Spark where the JDBC connector for MySQL is:

spark.driver.extraClassPath /usr/share/java/mysql-connector-java.jar
spark.master local[2]

vi ~/.ipython/profile_default/startup/initspark.py and add the following to tell IPython where pyspark is:

import sys
sys.path.append('/usr/local/share/spark-1.5.2/python/')
sys.path.append('/usr/local/share/spark-1.5.2/python/lib/py4j-0.8.2.1-src.zip')

When starting a Jupyter notebook, you need to initialize it with:

import pyspark
sc = pyspark.SparkContext()

Of course, you need to start Hadoop first: start-dfs.sh.
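As a quick sanity check, here is a minimal, assumed example of a first notebook cell once HDFS and the metastore are up; it relies only on the SparkContext shown above and on the Hive metastore configured in hive-site.xml:

import pyspark
from pyspark.sql import HiveContext

sc = pyspark.SparkContext()   # uses spark.master local[2] from spark-defaults.conf
sqlContext = HiveContext(sc)  # talks to the MySQL-backed metastore from hive-site.xml

# a trivial RDD computation to confirm that Spark itself works
print sc.parallelize(range(100)).sum()   # expected: 4950

# list the databases known to the Hive metastore (only 'default' at this point)
sqlContext.sql("SHOW DATABASES").show()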

Zeppelin

Get the latest source from GitHub and compile it with Maven:

cd ~
git clone http://github.com/apache/incubator-zeppelin
mv incubator-zeppelin zeppelin
cd zeppelin
export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=128m"
mvn install -DskipTests -Dspark.version=1.5.2 -Dhadoop.version=2.6.2

vi ~/zeppelin/conf/zeppelin-env.sh and input the following:

export SPARK_HOME=/usr/local/share/spark-1.5.2
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.11:1.1.0 --jars /usr/share/java/mysql-connector-java.jar"

You can now run the tutorial at http://host_ip_address:8080/. But first you need to start Hadoop and the Zeppelin daemon:

start-dfs.sh
zeppelin-daemon.sh start
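For a first test in the Zeppelin UI, a minimal %pyspark paragraph like the one below (an assumed example, not part of the original tutorial) confirms that the Spark interpreter and the Hive metastore are wired up; sc and sqlContext are provided by Zeppelin's pyspark interpreter:

%pyspark
# sc and sqlContext are injected by the Zeppelin Spark interpreter
print sc.parallelize(range(10)).count()      # expected: 10
sqlContext.sql("SHOW DATABASES").show()      # lists the Hive metastore databases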

Securing the access to Zeppelin is outside the scope of this post.

IntelliJ IDEA

Go to https://www.jetbrains.com/idea/download/ - and get the Community version for Linux:

wget https://download.jetbrains.com/idea/ideaIC-15.0.2.tar.gz
tar -xzf ideaIC-15.0.2.tar.gz -C ~
mv ~/idea-IC-143.1184.17 ~/idea-IC
rm ideaIC-15.0.2.tar.gz

We will have to specify the scala version and the maven version that we are using in the setup.

Start Hadoop


start-dfs.sh and check the running servers with jps.

Start Thriftserver

start-thriftserver.sh

Start Jupyter

You should have already cloned the tutorials from my GitHub (git clone http://github.com/nasdag/pyspark) - now browse to the notebooks https://host_ip_address:4334/pyspark/notebooks/test1.ipynb or https://host_ip_address:4334/pyspark/notebooks/test2.ipynb and follow the self-explanatory tests …

Setting up ideaIC

Launch idea.sh. Select I do not have a previous version - Skip All and Set defaults.

Configure - Plugins - Install Scala - Restart

Configure - Settings - Build/Build Tools - Maven - Maven Home Directory - /usr/local/share/maven-3.3.9

Configure - Project Defaults/Project Structure - Platform Settings - SDKs - Add New SDK - /usr/lib/jvm/java-7-oracle

Configure - Project Defaults/Project Structure - Project Settings - Project - Project SDK - 1.7

Create New Project - Maven - Project SDK 1.7 - … - Open Module Settings (F4) - Add Scala Support - /usr/local/share/scala-2.10.6 - … - main / New Directory / scala / Mark Directory As Sources Root - test / New Directory / scala / Mark Directory As Test Sources Root

Swap

If sudo swapon -s is empty, then I suggest creating a 4 GB swapfile:

sudo fallocate -l 4G /swapfile

sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
sudo vi /etc/fstab

Add:

/swapfile none swap sw 0 0

Continue with:

sudo sysctl vm.swappiness=10

sudo sysctl vm.vfs_cache_pressure=50
sudo vi /etc/sysctl.conf

Add:

vm.swappiness=10
vm.vfs_cache_pressure=50

Cheers,

Philippe

http://linkedin.com/in/nasdag

Posted by Philippe DAGHER Dec 12th, 2015 8:35 pm


