BDCC - Setting up Hadoop

Apache Hadoop can be set up in the following three different configurations:

* Developer mode: Developer mode is used to run programs in a standalone manner. This arrangement does not require any Hadoop daemons, and jars can run directly. This mode is useful when developers wish to debug their MapReduce code.
* Pseudo cluster (single-node Hadoop): A pseudo cluster is a single-node cluster with capabilities similar to those of a standard cluster; it is used for the development and testing of programs before they are deployed on a production cluster. Pseudo clusters give every developer an independent environment for coding and testing.
* Cluster mode: This mode is the real Hadoop cluster, where you set up multiple Hadoop nodes across your production environment. You should use it to solve your actual big data problems.

Best practices for Hadoop deployment

* Start small: Like other software projects, a Hadoop implementation involves risks and costs. It is always better to start with a small Hadoop cluster of about four nodes, set up as a proof of concept (POC). Before adopting any new Hadoop component, it can be added to the existing POC cluster as a proof of technology (POT). This allows the infrastructure and development teams to understand the big data project's requirements. After successful completion of the POC and POT, additional nodes can be added to the existing cluster.
* Hadoop cluster monitoring: Proper monitoring of the NameNode and all DataNodes is required to understand the health of the cluster and to take corrective action when a node has problems. If a service goes down, timely action can avoid bigger problems later. Ganglia and Nagios are popular choices for configuring alerts and monitoring. On a Hortonworks cluster, Ambari monitoring, and on a Cloudera (CDH) cluster, Cloudera Manager monitoring, are easy to set up.
* Automated deployment: Tools like Puppet or Chef are essential for Hadoop deployment. Deploying a Hadoop cluster with automated tools is far easier and more productive than manual deployment.
* Analyze more, develop less: Give importance to data analysis and data processing using the available tools/components. Prefer Hive or Pig scripts for problem solving rather than writing heavy, custom MapReduce code. The goal should be to develop less and analyze more.
* High availability and disaster recovery: In the event of a failure or crash, the system should be able to recover itself or fail over to another data center/site.
* Security: Data needs to be protected by creating users and groups, and mapping users to the groups. Each user group should be locked down with appropriate permissions and strong passwords.
* Data protection: Identifying sensitive data is critical before moving it to the Hadoop cluster. It is very important to understand privacy policies and government regulations in order to identify and mitigate compliance exposure risks.

Batch processing

* Very efficient at processing a high volume of data.
* All data processing steps (that is, data collection, data ingestion, data processing, and results presentation) are done as one single batch job.
* Throughput carries more importance than latency; latency is always more than a minute.
* Throughput directly depends on the size of the data and the available computational resources.
* Available tools include Apache Sqoop, MapReduce jobs, Spark jobs, the Hadoop DistCp utility, and so on (see the sketch after this list).
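To make the batch idea concrete, here is a minimal sketch of a batch copy using the Hadoop DistCp utility, which itself runs as a MapReduce job. The cluster addresses and paths are hypothetical placeholders, not values from this tutorial:

# Sketch: batch-copy a directory between two clusters (placeholder hosts/paths)
hadoop distcp hdfs://nn1.example.com:8020/data/events hdfs://nn2.example.com:8020/backup/events

# Within a single cluster, plain HDFS paths work as well:
hadoop distcp /data/events /backup/events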
Real-time processing

* Latency is extremely important, for example, less than one second.
* Computation is relatively simple.
* Data is processed as independent units.
* Available tools include Apache Storm, Spark Streaming, Apache Flink, Apache Kafka, and so on.

Hadoop file formats

* Text/CSV file: Text and CSV files are very common in Hadoop data processing. Each line in the file is treated as a new record.
* JSON: The JSON format is typically used in data exchange applications, and it is treated as an object, record, struct, or array. These files are text files and support schema evolution; it is very easy to add or delete attributes from a JSON file.
* Sequence file: A sequence file is a flat file consisting of binary key/value pairs. Sequence files are extensively used in MapReduce as input/output formats, and mostly for intermediate data storage between a sequence of MapReduce jobs.
* Avro: Avro is a widely used file type within the Hadoop community. It is popular because it supports schema evolution. It contains serialized data in a binary format. An Avro file is splittable and supports block compression. It contains both data and metadata: when Avro data is stored in a file, its schema is stored with it, so the file can be processed later by any program. A separate JSON document defines the schema format.
* Parquet: Parquet stores nested data structures in a flat columnar format. Parquet is more efficient in terms of storage and performance than row-level file formats. Parquet stores binary data in a column-oriented way, and new columns are added at the end of the structure.
* ORC: ORC (Optimized Record Columnar) files are the extended version of RC files. They compress very well and are best suited for Hive SQL performance, reducing access time and storage space when Hive reads, writes, and processes data.

Two of these formats can be inspected directly from the command line, as sketched below.
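A quick, hedged sketch of inspecting two of these formats from the command line. The file names are hypothetical, and the avro-tools jar name/version is an assumption (download the jar matching your Avro release):

# Decode a binary SequenceFile to readable text with the built-in -text command
hdfs dfs -text /data/intermediate/part-00000.seq

# Print the JSON schema embedded in an Avro data file (jar name is an assumption)
java -jar avro-tools-1.11.1.jar getschema events.avro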
Installing Hadoop on a single node (Ubuntu)

Install OpenJDK 11:

sudo apt install openjdk-11-jdk -y

Once the installation process is complete, verify the current Java version:

java -version; javac -version

openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
javac 11.0.10

Install OpenSSH:

sudo apt install openssh-server openssh-client -y

Create a user named hadoop and set the password for this user:

sudo adduser hadoop

Then switch to that user:

su - hadoop

Generate an SSH key pair for this user and authorize it for passwordless logins to localhost:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Download and extract Hadoop:

wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar xzf hadoop-3.3.6.tar.gz

Since we are deploying Hadoop on a single node, we apply the following settings. Open the .bashrc file, append the following to the bottom of the file, then save and exit:

nano .bashrc

#Hadoop Related Options
export HADOOP_HOME=/home/hadoop/hadoop-3.3.6
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Apply the changes to the current running environment:

source ~/.bashrc

Find the path to your JDK and note it down:

readlink -f /usr/bin/javac
/usr/lib/jvm/java-11-openjdk-amd64/bin/javac

Edit the hadoop-env.sh file. Uncomment the JAVA_HOME line (i.e., remove the # sign) and add the full path to the OpenJDK installation on your system (the readlink output without the trailing /bin/javac):

nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

# The java implementation to use. By default, this environment
# variable is REQUIRED on all platforms except OS X!
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
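At this point it is worth a quick sanity check that the environment variables and JAVA_HOME are being picked up. Assuming the steps above succeeded, the version banner should print without errors:

echo $HADOOP_HOME    # should print /home/hadoop/hadoop-3.3.6
hadoop version       # should print the Hadoop 3.3.6 version banner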
Configure core-site.xml. Add the following configuration to override the default values for the temporary directory, and add your HDFS URL to replace the default local file system setting:

nano $HADOOP_HOME/etc/hadoop/core-site.xml

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmpdata</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://127.0.0.1:9000</value>
  </property>
</configuration>

Configure hdfs-site.xml. Add the following configuration, which sets the NameNode and DataNode storage directories and a replication factor of 1 (sufficient for a single node):

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/dfsdata/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/dfsdata/datanode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Configure mapred-site.xml to set YARN as the MapReduce framework:

nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Configure yarn-site.xml. Add the following configuration to the file:

nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>127.0.0.1</value>
  </property>
  <property>
    <name>yarn.acl.enable</name>
    <value>0</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>

Format the NameNode before starting HDFS for the first time. The shutdown notification at the end of the output signifies the end of the format process:

hdfs namenode -format

Start the NameNode and DataNode:

cd ~/hadoop-3.3.6/sbin
./start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [aaron-hadoop]

Start the YARN ResourceManager and NodeManagers:

./start-yarn.sh
Starting resourcemanager
Starting nodemanagers

Verify that all the daemons are running:

jps
4611 Jps
3878 SecondaryNameNode
3451 NameNode
3645 DataNode
4094 ResourceManager
4271 NodeManager
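Besides jps, one more way to confirm that the DataNode has registered with the NameNode is the dfsadmin report (a sketch; run it as the hadoop user):

hdfs dfsadmin -report    # shows configured capacity and the live DataNode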
Note your IP address down:

ip a

In this example, the machine's address is 10.211.55.21 (interface enp0s5); the web UIs below are reachable via localhost or via this address.

Open the NameNode UI in your browser on the default port 9870 (http://localhost:9870). The Overview page shows the NameNode at localhost:9000 (active), along with a summary of configured capacity, live DataNodes, and heap usage.

The default port 9864 is used to access individual DataNodes directly from your browser (http://localhost:9864). The page shows the DataNode (on aaron-hadoop:9866), its block pools, and the last heartbeat to the NameNode at localhost:9000.

The YARN ResourceManager UI is available on the default port 8088 (http://localhost:8088). The All Applications page lists running and finished applications along with cluster memory and vcore metrics.

TESTING A PROGRAM ON THE STANDALONE NODE

Create a file named WordCount.java in a directory named programs:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: split each input line into tokens and emit (word, 1) pairs
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sum the counts for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
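The compile step was captured only as a screenshot in the original. A minimal sketch that compiles the class against the Hadoop libraries, assuming the PATH set up earlier in this tutorial, is:

cd ~/programs
javac -classpath "$(hadoop classpath)" WordCount.java    # produces the WordCount*.class files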
Now bundle the compiled classes into a jar:

jar cf wc.jar WordCount*.class

Create two files named file1 and file2 containing some text:

echo "Hello World Bye World" > file1
echo "Hello Hadoop Goodbye Hadoop" > file2

Then check the root directory of the HDFS file system:

hdfs dfs -ls /
Found 1 items
drwx------   - hadoop supergroup          0 2021-04-05 12:07 /tmp

Create an input directory in the HDFS file system:

hdfs dfs -mkdir /tmp/input

Now copy the two files you created into HDFS:

hdfs dfs -copyFromLocal file1 /tmp/input
hdfs dfs -copyFromLocal file2 /tmp/input

Now run the application:

hadoop jar wc.jar WordCount /tmp/input /tmp/output

Let us see the output of the program:

hdfs dfs -ls /tmp/output
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2021-04-05 12:35 /tmp/output/_SUCCESS
-rw-r--r--   1 hadoop supergroup         41 2021-04-05 12:35 /tmp/output/part-r-00000

hdfs dfs -cat /tmp/output/part-r-00000
Bye     1
Goodbye 1
Hadoop  2
Hello   2
World   2

Compiled by Aaron Stanislaus Johns