Hadoop-Hive Report


Hospital Database Implementation Using Apache Hive, Hadoop, and Google Cloud Dataproc
Hisham AlQanneh – 2110017

Table of Contents:
0- Introduction
1- Abstract
2- Environment Setup
3- Prerequisites
4- Database Creation
a. Tables of Dataset
b. EER Diagram
c. Relationships
d. Schema
5- Implementation
a. Creating The Relations
b. Inserting Values
c. Queries
6- Conclusion
Introduction:
In the contemporary healthcare landscape, the efficient management of hospital data is
paramount to ensuring high-quality patient care and operational effectiveness. Hospitals
generate vast amounts of data daily, ranging from patient records and staff details to billing
information and departmental logistics. To handle this complexity, robust database
management systems are essential. Apache Hive, a data warehouse software that facilitates
querying and managing large datasets residing in distributed storage, is increasingly being
adopted for such tasks. This report delves into the creation and management of a hospital
database using Apache Hive, supported by Hadoop and Google Cloud Dataproc. It covers the
design and implementation of various tables to store and retrieve critical hospital data, outlines
the SQL commands used for these operations, and provides an analysis of the system's
efficiency and areas for improvement.

Hadoop, an open-source framework, enables the distributed processing of large data sets
across clusters of computers using simple programming models. It is designed to scale up from
single servers to thousands of machines, each offering local computation and storage. Google
Cloud Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and
Apache Hadoop clusters. Dataproc automates cluster management and simplifies the process
of running big data workloads in the cloud, providing a scalable and cost-effective solution for
data processing.

Abstract:
This report presents the implementation of a hospital database using Apache Hive, aimed at
improving data management and retrieval processes within a healthcare setting. The database
encompasses various essential entities, including staff, doctors, nurses, departments, rooms,
patients, medical records, and billing information. The document details the SQL commands
executed to create and populate these tables, providing a comprehensive overview of the data
structure and relationships. The database operations are supported by Hadoop, which enables
distributed data processing, and Google Cloud Dataproc, which simplifies cluster management
and enhances scalability. Key issues such as logging configuration conflicts and illegal reflective
access warnings are identified, with recommendations provided to resolve these challenges.
Additionally, the report analyzes the performance of data queries and suggests optimizations
for storage and query execution. This analysis highlights the strengths of the current system
while proposing strategies for enhancing future scalability, performance, and maintainability.
Environment Setup:
We started by creating a virtual machine instance (cluster) in Google Cloud Dataproc.

Prerequisites:
We then started a Hive session on the cluster.
Database Creation:
Tables of Dataset:
EER Diagram:
The Relationships:
1. doctor treats patient:
(One to many) as one doctor can treat many patients at once
(Partial, total) participation as a doctor doesn't need to treat a patient, while a patient
must be treated by a doctor
2. doctor supervises nurse:
(One to many) as one doctor can supervise more than one nurse
(Partial, total) participation as a doctor doesn't need to supervise a nurse, while a
nurse must be supervised by a doctor
3. doctor works_in department:
(Many to one) as many doctors can work in the same department
(Total, total) participation as every doctor works in a department and every
department has at least one doctor who works in it
4. nurse governs room:
(One to many) as one nurse can govern more than one room
(Partial, total) participation as not every nurse governs a room, but every room is
governed by a nurse
5. patient has medical record:
(One to many) as one patient can have many medical records
(Partial, total) participation as a patient isn't required to have a medical record,
while a medical record is required to have a patient
6. patient assigned room:
(Many to one) as many patients can be assigned to the same room
(Partial, partial) participation as not every patient is assigned a room and not every
room has a patient
7. patient issued bill:
(One to one) as every patient has one bill
(Total, total) participation as every patient is issued a bill and each bill has a patient
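The cardinalities and participation constraints listed above can be sketched as keys and NOT NULL columns. The following is a minimal illustration only, not the report's actual schema: every table and column name is an assumption, and Python's built-in sqlite3 module stands in for a Hive session so the rules can be checked locally.

```python
import sqlite3

# Hypothetical table and column names; the report's real schema is not shown.
# SQLite stands in for Hive so that the constraints can be exercised locally.
SCHEMA = """
CREATE TABLE department (
    dept_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE doctor (
    doctor_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    -- works_in: many doctors per department, every doctor has one (total)
    dept_id   INTEGER NOT NULL REFERENCES department(dept_id)
);
CREATE TABLE nurse (
    nurse_id  INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    -- supervises: every nurse has a supervising doctor (total on nurse side)
    doctor_id INTEGER NOT NULL REFERENCES doctor(doctor_id)
);
CREATE TABLE room (
    room_id  INTEGER PRIMARY KEY,
    -- governs: every room is governed by one nurse (total on room side)
    nurse_id INTEGER NOT NULL REFERENCES nurse(nurse_id)
);
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    -- treats: every patient has a treating doctor (total on patient side)
    doctor_id  INTEGER NOT NULL REFERENCES doctor(doctor_id),
    -- assigned: nullable, since not every patient has a room (partial)
    room_id    INTEGER REFERENCES room(room_id)
);
CREATE TABLE medical_record (
    record_id  INTEGER PRIMARY KEY,
    -- has: every record belongs to exactly one patient
    patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
    diagnosis  TEXT
);
CREATE TABLE bill (
    bill_id       INTEGER PRIMARY KEY,
    -- issued: UNIQUE makes the relationship one to one
    patient_id    INTEGER NOT NULL UNIQUE REFERENCES patient(patient_id),
    total_charges REAL
);
"""


def build_schema():
    """Create the sketch schema in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript(SCHEMA)
    return conn
```

Two of the listed rules are not captured by this sketch: that every department has at least one doctor, and that every patient is issued a bill. Both would need a separate check query. Hive itself generally treats key constraints as informational rather than enforced, so on the cluster all of these rules would likely have to be validated by queries.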
The Database Schema:
Creating The Relations:
Inserting Values:
SQL Queries:
1. Select all doctors and their specialties:

2. Select all patients and their respective doctors:
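A query of this kind could look roughly like the following sketch. The table layout, column names, and sample rows are all hypothetical, and sqlite3 is used as a local stand-in for Hive; the join itself uses only plain SQL.

```python
import sqlite3

# Hypothetical columns and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE doctor  (doctor_id INTEGER, name TEXT, specialty TEXT);
CREATE TABLE patient (patient_id INTEGER, name TEXT, doctor_id INTEGER);
INSERT INTO doctor  VALUES (1, 'Dr. Example A', 'Cardiology'),
                           (2, 'Dr. Example B', 'Neurology');
INSERT INTO patient VALUES (10, 'Patient X', 1),
                           (11, 'Patient Y', 1),
                           (12, 'Patient Z', 2);
""")

# Each patient row carries the id of the treating doctor, so an inner join
# on doctor_id pairs every patient with that doctor.
PATIENTS_WITH_DOCTORS = """
SELECT p.name AS patient_name, d.name AS doctor_name
FROM patient p
JOIN doctor d ON p.doctor_id = d.doctor_id
ORDER BY p.patient_id
"""

rows = conn.execute(PATIENTS_WITH_DOCTORS).fetchall()
for patient_name, doctor_name in rows:
    print(patient_name, "is treated by", doctor_name)
```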


3. Select all nurses and the doctors they report to:

4. Find the total charges of each bill and their patients:
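As a rough sketch of this query, the example below assumes that each bill stores separate charge components (the hypothetical columns room_charges and treatment_charges) that are added up per bill. If the bill table already holds a single total column, the addition is unnecessary. sqlite3 again stands in for Hive, and all names and values are invented.

```python
import sqlite3

# Hypothetical columns and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (patient_id INTEGER, name TEXT);
CREATE TABLE bill (bill_id INTEGER, patient_id INTEGER,
                   room_charges REAL, treatment_charges REAL);
INSERT INTO patient VALUES (10, 'Patient X'), (11, 'Patient Y');
INSERT INTO bill VALUES (100, 10, 200.0, 350.0),
                        (101, 11, 0.0, 120.5);
""")

# Sum the charge components of each bill and attach the patient's name.
BILL_TOTALS = """
SELECT b.bill_id,
       p.name AS patient_name,
       b.room_charges + b.treatment_charges AS total_charges
FROM bill b
JOIN patient p ON b.patient_id = p.patient_id
ORDER BY b.bill_id
"""

rows = conn.execute(BILL_TOTALS).fetchall()
```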


5. Select all patients who were admitted in 2023:
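One possible form of this filter is sketched below, assuming a hypothetical admission_date column that stores ISO dates (YYYY-MM-DD) as text. With that format a plain range comparison selects the year. In Hive a `year(admission_date) = 2023` predicate would be another option; SQLite has no `year()` function, so the sketch uses the range form.

```python
import sqlite3

# Hypothetical columns and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (patient_id INTEGER, name TEXT, admission_date TEXT);
INSERT INTO patient VALUES (10, 'Patient W', '2022-12-30'),
                           (11, 'Patient X', '2023-01-01'),
                           (12, 'Patient Y', '2023-12-31'),
                           (13, 'Patient Z', '2024-01-02');
""")

# ISO-formatted date strings sort chronologically, so comparing them as text
# against the first and last day of 2023 keeps exactly that year.
ADMITTED_IN_2023 = """
SELECT patient_id, name, admission_date
FROM patient
WHERE admission_date >= '2023-01-01' AND admission_date <= '2023-12-31'
ORDER BY patient_id
"""

rows = conn.execute(ADMITTED_IN_2023).fetchall()
```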

6. Find all departments and their managers:


7. Find all rooms and their availability:

8. Find all patients who were diagnosed with 'Heart Disease':
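Because one patient can have many medical records, a query like this probably needs to join the two tables and remove duplicates. The sketch below assumes a hypothetical diagnosis column on the medical record table and uses sqlite3 as a stand-in for Hive.

```python
import sqlite3

# Hypothetical columns and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (patient_id INTEGER, name TEXT);
CREATE TABLE medical_record (record_id INTEGER, patient_id INTEGER, diagnosis TEXT);
INSERT INTO patient VALUES (10, 'Patient X'), (11, 'Patient Y');
INSERT INTO medical_record VALUES (1, 10, 'Heart Disease'),
                                  (2, 10, 'Heart Disease'),
                                  (3, 11, 'Flu');
""")

# DISTINCT keeps a patient from appearing once per matching record.
HEART_DISEASE_PATIENTS = """
SELECT DISTINCT p.patient_id, p.name
FROM patient p
JOIN medical_record m ON m.patient_id = p.patient_id
WHERE m.diagnosis = 'Heart Disease'
ORDER BY p.patient_id
"""

rows = conn.execute(HEART_DISEASE_PATIENTS).fetchall()
```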


9. Find the average age of all patients:
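The aggregate could be written along these lines, assuming the patient table has a hypothetical numeric age column. The sample ages are invented and sqlite3 stands in for Hive.

```python
import sqlite3

# Hypothetical column and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (patient_id INTEGER, name TEXT, age INTEGER);
INSERT INTO patient VALUES (10, 'Patient X', 30),
                           (11, 'Patient Y', 40),
                           (12, 'Patient Z', 50);
""")

# AVG ignores NULL ages and returns a single floating-point value.
average_age = conn.execute(
    "SELECT AVG(age) AS average_age FROM patient"
).fetchone()[0]
print(average_age)
```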

10. Find all patients who are currently admitted (haven't been discharged yet):

(No patient has been discharged yet)
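A sketch of how such a query might be expressed, assuming a hypothetical discharge_date column that stays NULL until the patient is discharged. Unlike the report's data, the invented sample below includes one discharged patient so that the filter has a visible effect. sqlite3 stands in for Hive.

```python
import sqlite3

# Hypothetical columns and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (patient_id INTEGER, name TEXT,
                      admission_date TEXT, discharge_date TEXT);
INSERT INTO patient VALUES (10, 'Patient X', '2023-03-01', NULL),
                           (11, 'Patient Y', '2023-04-15', '2023-04-20'),
                           (12, 'Patient Z', '2023-05-02', NULL);
""")

# A NULL discharge date is taken to mean the patient is still admitted.
# NULL must be tested with IS NULL; "= NULL" never matches.
CURRENTLY_ADMITTED = """
SELECT patient_id, name
FROM patient
WHERE discharge_date IS NULL
ORDER BY patient_id
"""

rows = conn.execute(CURRENTLY_ADMITTED).fetchall()
```

When no patient has a discharge date, as in the report's dataset, the same query simply returns every patient.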

Conclusion:
The implementation of the hospital database using Apache Hive, supported by Hadoop and
Google Cloud Dataproc, provides a robust foundation for managing healthcare data efficiently.
While the current setup is effective for small to medium-sized datasets, future optimizations
and updates are necessary to handle larger volumes of data and ensure long-term scalability
and performance.
