COVID-19 data analysis with MapReduce

Simple data processing on a COVID-19 dataset as a Hadoop/MapReduce warm-up, using a local single-node Hadoop setup (the HDP Sandbox).

Basic data analysis implementations (a minimal sketch of the first job appears after this list):

1. Count the total number of reported cases for every country/location up to April 8, 2020 (case_count_country_wise.java)
2. Report the total number of deaths for every country/location within a given range of dates (death_count.java)
3. Count the total number of cases per 1 million population for every country (cases_per_million_pop.java)
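
For orientation, below is a minimal sketch of the mapper/reducer pair for the first job. It is illustrative, not the repository's actual case_count_country_wise.java: the class names and the assumed CSV column layout (0 = date, 1 = location, 2 = new cases) are assumptions and may need adjusting to the real dataset. The other two jobs would follow the same pattern, e.g. filtering rows by date in the mapper or scaling counts by population in the reducer.

    // CaseCountSketch.java -- illustrative sketch only; the column
    // positions used below are assumptions about the CSV layout.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CaseCountSketch {

        // Emits (location, new_cases) for every data row.
        public static class CaseMapper
                extends Mapper<LongWritable, Text, Text, LongWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split(",");
                // Skip the header row and malformed lines.
                if (fields.length < 3 || fields[0].equals("date")) return;
                try {
                    long newCases = Long.parseLong(fields[2].trim());
                    context.write(new Text(fields[1].trim()),
                                  new LongWritable(newCases));
                } catch (NumberFormatException e) {
                    // Ignore rows with a missing or non-numeric case count.
                }
            }
        }

        // Sums the per-row counts for each location.
        public static class SumReducer
                extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> values,
                                  Context context)
                    throws IOException, InterruptedException {
                long total = 0;
                for (LongWritable v : values) {
                    total += v.get();
                }
                context.write(key, new LongWritable(total));
            }
        }
    }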

Instructions for running:

Run a Hadoop environment with a VM-based distribution

  • Virtual machine installation: Download and install a virtual machine client. Here, I used VirtualBox (https://www.virtualbox.org/wiki/Downloads).

  • Download a virtual machine image: HDP Sandbox (version 3.0.1 at the time of writing). Download an image matching your virtual machine client (recommended: the VirtualBox image, i.e. the .ova file).

  • Start your VM client and load the virtual machine image. Example: in VirtualBox, select File -> Import Appliance.

    • Click the folder icon and select the .ova file from the previous step.
    • Recommended: allocate more CPUs than the default (1), e.g. 4, 6, or 8+, depending on how many cores your machine has.
    • Wait for VirtualBox to load the image. (It may take some time, so sit back and relax.)
    • Start the virtual machine by double-clicking its name in the list.
  • ssh into the virtual machine

    • For UNIX (macOS/Linux):

        Launch Terminal and type the following command:
        ssh root@localhost -p 2222
        On first successful login, you'll be asked to set a new password for root.
      
  • Test the Hadoop Distributed File System: execute the "hdfs dfs -ls /" command in the terminal. You should see multiple directories listed by HDFS.

List of commands used:

  • To view the content of a file in HDFS, use:

      hdfs dfs -cat path_in_hdfs
    
  • Download a sample text file

      wget http://<file_link>
    
  • List the content of the current directory

      ls -lt
    
  • List the content of directories on HDFS

       hdfs dfs -ls /user
       hdfs dfs -ls /user/root
    
  • Put the text file into HDFS

      hdfs dfs -put <data_file>.txt /user/root/<data_file>.txt
    
  • Open a text editor and copy the program source code into it. Save it as a .java file.

  • Create a temporary build directory to store the compiled classes

       mkdir build
    
  • Transfer the .java file to the sandbox

    • For UNIX (macOS/Linux)

        scp -P 2222 <path_to_java_file> root@localhost:/root/
      
  • Compile the code into Java bytecode

      javac -cp `hadoop classpath` -d build -Xlint <filename>.java
    
  • Package the code into a JAR archive file

      jar -cvf <filename>.jar -C build/ .
      ls -lt

(You should see the newly created .jar file.)

  • Execute the MapReduce job with two parameters, the input and output paths (see the driver sketch after this list)

       hadoop jar <filename>.jar <filename> /user/root/<data_file>.txt /user/root/<output>
    
  • View the result

      hdfs dfs -cat /user/root/<output>/*
    
  • Remove the output directory

       hdfs dfs -rm -r /user/root/<output>
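
(Hadoop refuses to start a job if the output directory already exists, which is why the removal command above is needed between runs.)

To make the two command-line parameters concrete, here is a minimal driver sketch showing how args[0] (the input path) and args[1] (the output path) are consumed. The class name JobDriver is illustrative, and it wires up the classes from the CaseCountSketch sketch earlier in this README; the repository's actual drivers presumably live inside the respective .java files.

    // JobDriver.java -- illustrative sketch; not the repository's actual driver.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class JobDriver {
        public static void main(String[] args) throws Exception {
            // args[0] = HDFS input path, args[1] = HDFS output path --
            // exactly the two paths passed after the class name on the
            // `hadoop jar` command line above.
            Job job = Job.getInstance(new Configuration(), "covid case count");
            job.setJarByClass(JobDriver.class);
            job.setMapperClass(CaseCountSketch.CaseMapper.class);
            job.setCombinerClass(CaseCountSketch.SumReducer.class);
            job.setReducerClass(CaseCountSketch.SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }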
    