Analysis of the Biomedical Datasets CSV File
(Spring-2024)
Submitted By
Hassan Mukhtiar
(2023-BME-5)
Submitted To
Database:
A database is an organized collection of structured data that is easy to access, manage, and
update. In simple words, a database is a place where data is stored. A good analogy is a library:
the library holds a huge collection of books of different genres, so the library is the
database and the books are the data. [1]
Example:
Examples of systems that rely on databases include grocery stores, banks, e-commerce platforms,
healthcare systems, and social media platforms.
Biomedical Database:
Biomedical databases store and maintain biomedical data such as gene and protein sequences. Named
entity recognition (NER) is used extensively on biomedical data to identify genes, DNA, drug names,
and disease names. These NER experiments use conditional random fields (CRFs) with features
engineered for their domain data. [2]
Example: [3]
❖ Generic gene expression databases
❖ Nucleosome positioning region database
❖ Protein structure database
Biomedical Datasets:
Healthcare data sets are gathered from various healthcare data sources and include a vast amount of
medical data, various measurements, financial data, statistical data, demographics of specific
populations, and insurance data, to name just a few. The example below investigates how such a data
set is used in the healthcare industry.
Example: [4]
The CSV file used here contains insurance data that includes smoking status. It was downloaded from
the Kaggle website, saved locally on a laptop, and then used for several tasks. A screenshot of the
file is shown below, followed by the tasks performed on it step by step.
The statement f=pd.read_csv('OEL.csv') reads the data from a CSV file named 'OEL.csv' into a pandas
DataFrame and assigns it to the variable 'f'. This allows us to analyze the data using the pandas
library in Python.
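A minimal sketch of the loading step follows. Because 'OEL.csv' itself is not reproduced in this report, the sketch reads a few made-up rows from an in-memory string instead. Only the column names (age, sex, smoker, region, charges) come from the report; the values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for 'OEL.csv': made-up rows, real column names.
SAMPLE_CSV = """age,sex,smoker,region,charges
19,female,yes,southwest,17000.0
23,male,no,southeast,2000.0
30,male,no,northwest,4000.0
35,female,no,northeast,5000.0
"""

# pd.read_csv accepts a file path or a file-like object.
# With the real file this line would be: f = pd.read_csv('OEL.csv')
f = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(f.shape)   # (number of rows, number of columns)
print(f.head())  # first rows of the DataFrame
```

Reading from a string keeps the sketch runnable without the file; the rest of the analysis works the same way once 'f' holds the real data.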
(CS-103L- Introduction to Programming for Data Science)
Report # Open Ended Lab
The statement info = f.info() in Python prints a summary of the DataFrame 'f', including
information about its structure such as the number of entries, the column names, and the data
types. Note that f.info() returns None, so the variable 'info' holds None rather than the summary.
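The sketch below illustrates this behaviour on a small made-up DataFrame (the values are hypothetical, the column names follow the report). It also shows one way to keep the summary as text, using the 'buf' parameter of info().

```python
import io
import pandas as pd

# Made-up rows standing in for 'OEL.csv'.
f = pd.DataFrame({
    "age": [19, 23, 30],
    "sex": ["female", "male", "male"],
    "charges": [17000.0, 2000.0, 4000.0],
})

info = f.info()   # prints the summary to standard output
print(info)       # None, because info() does not return the summary

# To keep the summary, write it into a text buffer instead.
buffer = io.StringIO()
f.info(buf=buffer)
summary_text = buffer.getvalue()
print(summary_text)
```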
The line columns = f.columns in Python returns the column names (labels) of the DataFrame 'f' and
assigns them to the variable 'columns'. This gives easy access to the names of the columns within
the DataFrame, facilitating further data manipulation or analysis based on column names.
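As a small illustration, the sketch below reads the column labels of a made-up DataFrame (hypothetical values) and uses them to check for a column and to select a subset of columns.

```python
import pandas as pd

# Made-up rows standing in for 'OEL.csv'.
f = pd.DataFrame({
    "age": [19, 23, 30],
    "sex": ["female", "male", "male"],
    "region": ["southwest", "southeast", "northwest"],
    "charges": [17000.0, 2000.0, 4000.0],
})

columns = f.columns            # a pandas Index holding the labels
print(list(columns))

has_region = "region" in columns   # membership test on the labels
subset = f[["age", "charges"]]     # select columns by name
print(has_region)
print(subset)
```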
The f.tail() method of a pandas DataFrame returns the last rows of the DataFrame 'f'; by default
it returns the last 5 rows.
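A short sketch on made-up data (hypothetical values) shows the default and an explicit row count:

```python
import pandas as pd

# Made-up rows standing in for 'OEL.csv'.
f = pd.DataFrame({
    "age": [19, 23, 30, 35, 41, 52, 60, 64],
    "charges": [17000.0, 2000.0, 4000.0, 5000.0,
                22000.0, 9000.0, 30000.0, 14000.0],
})

last_five = f.tail()    # default: last 5 rows
last_two = f.tail(2)    # explicit number of rows

print(last_five)
print(last_two)
```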
The describe() method reports measures such as count, mean, standard deviation, minimum, and
maximum, providing a comprehensive summary of the distribution of the numerical data within the
DataFrame.
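The sketch below shows what such a summary looks like on a made-up DataFrame (hypothetical values). By default describe() covers only the numerical columns, so the text column 'sex' is left out.

```python
import pandas as pd

# Made-up rows standing in for 'OEL.csv'.
f = pd.DataFrame({
    "age": [19, 23, 30, 35, 41, 52, 60, 64],
    "sex": ["female", "male", "male", "female",
            "male", "female", "male", "female"],
    "charges": [17000.0, 2000.0, 4000.0, 5000.0,
                22000.0, 9000.0, 30000.0, 14000.0],
})

stats = f.describe()   # count, mean, std, min, quartiles, max
print(stats)

# Individual statistics can be read with .loc[row, column].
print(stats.loc["mean", "charges"])
```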
f['region'].value_counts():
The expression f['region'].value_counts() in pandas counts the occurrences of each unique value in
the 'region' column of the DataFrame 'f'. It returns a Series whose index holds each unique value
in the 'region' column and whose values indicate how many times each value appears in the
column.
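A minimal sketch on made-up region labels (hypothetical rows) shows the shape of the result:

```python
import pandas as pd

# Made-up rows standing in for the 'region' column of 'OEL.csv'.
f = pd.DataFrame({
    "region": ["southwest", "southeast", "northwest", "northeast",
               "southeast", "southeast", "northeast", "northwest"],
})

region_counts = f["region"].value_counts()  # sorted, most frequent first
print(region_counts)

# A single count is looked up by its label.
print(region_counts["southeast"])  # 3
```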
As age increases, charges show an increasing trend, indicating that older individuals tend to
have higher medical expenses. The markers on the line represent individual data points, showing the
specific charges associated with each age.
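The plotting code is not reproduced in this report, so the following is only a sketch of how such a line graph with markers could be drawn with matplotlib. The data values are made up, and sorting by age before plotting is an assumption made here so that the line runs from left to right.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for 'OEL.csv'.
f = pd.DataFrame({
    "age": [41, 19, 64, 30, 52],
    "charges": [22000.0, 17000.0, 14000.0, 4000.0, 9000.0],
})

# Sort by age so the line is drawn in order of increasing age.
ordered = f.sort_values("age")

fig, ax = plt.subplots()
ax.plot(ordered["age"], ordered["charges"], marker="o")  # marker = one data point
ax.set_xlabel("age")
ax.set_ylabel("charges")
ax.set_title("Charges by age")
# plt.show() would display the figure in an interactive session.
```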
The following program iterates over the 'sex' column; whenever the value i == 'male', it adds 1 to
an integer variable, and at the end of the column it prints the total number of males in the CSV
file.
The following program does the same for females: whenever the value i == 'female', it adds 1 to an
integer variable, and at the end of the column it prints the total number of females in the CSV
file.
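The two counting loops described above can be sketched as follows. The rows are made up, and the variable names male_count and female_count are hypothetical, since the original program is not reproduced here.

```python
import pandas as pd

# Made-up rows standing in for the 'sex' column of 'OEL.csv'.
f = pd.DataFrame({
    "sex": ["female", "male", "male", "female",
            "male", "female", "male", "female"],
})

male_count = 0
female_count = 0
for i in f["sex"]:          # i is one value of the 'sex' column
    if i == "male":
        male_count += 1
    elif i == "female":
        female_count += 1

print("Total males:", male_count)      # 4
print("Total females:", female_count)  # 4

# The same totals without an explicit loop.
male_count_vectorized = int((f["sex"] == "male").sum())
```

The loop mirrors the description in the text; the comparison-and-sum form at the end is the more idiomatic pandas way to get the same number.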
The given program counts the females in the file who smoke: whenever both conditions are
fulfilled, a counter variable is incremented, and the variable is then printed to show the number
of female smokers.
The given program counts the males in the file who smoke in the same way: whenever both
conditions are fulfilled, a counter variable is incremented, and the variable is then printed to
show the number of male smokers.
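A sketch of this two-condition count is shown below. The rows are made up, the counter names are hypothetical, and the labels 'yes' and 'no' in the 'smoker' column are assumed.

```python
import pandas as pd

# Made-up rows standing in for 'OEL.csv'.
f = pd.DataFrame({
    "sex": ["female", "male", "male", "female",
            "male", "female", "male", "female"],
    "smoker": ["yes", "no", "no", "no", "yes", "no", "yes", "no"],
})

female_smokers = 0
male_smokers = 0
# zip walks the two columns in step, one row at a time.
for sex, smoker in zip(f["sex"], f["smoker"]):
    if sex == "female" and smoker == "yes":
        female_smokers += 1
    elif sex == "male" and smoker == "yes":
        male_smokers += 1

print("Female smokers:", female_smokers)  # 1
print("Male smokers:", male_smokers)      # 2
```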
After reading the CSV file, the following program iterates over all rows of the 'smoker' column.
It counts the rows whose value is 'yes' (smokers) and the rows whose value is 'no' (non-smokers),
and then prints the total number of smokers and non-smokers.
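A minimal sketch of this count, on made-up rows and with hypothetical counter names, assuming the column uses the labels 'yes' and 'no':

```python
import pandas as pd

# Made-up rows standing in for the 'smoker' column of 'OEL.csv'.
f = pd.DataFrame({
    "smoker": ["yes", "no", "no", "no", "yes", "no", "yes", "no"],
})

smokers = 0
non_smokers = 0
for value in f["smoker"]:
    if value == "yes":
        smokers += 1
    elif value == "no":
        non_smokers += 1

print("Smokers:", smokers)          # 3
print("Non-smokers:", non_smokers)  # 5
```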
The following program shows how smokers are distributed across the different regions: southeast,
northeast, northwest, and southwest.
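How the original program computes this is not shown, so the sketch below offers two plausible readings on made-up rows: the number of smokers per region, and the share of each region's rows that are smokers. The labels 'yes' and 'no' are assumed.

```python
import pandas as pd

# Made-up rows standing in for 'OEL.csv'.
f = pd.DataFrame({
    "smoker": ["yes", "no", "no", "no", "yes", "no", "yes", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast",
               "southeast", "southeast", "northeast", "northwest"],
})

is_smoker = f["smoker"] == "yes"   # boolean Series, True for smokers

# Reading 1: how many smokers live in each region.
smokers_by_region = f.loc[is_smoker, "region"].value_counts()
print(smokers_by_region)

# Reading 2: fraction of each region's rows that are smokers
# (the mean of a boolean column is the share of True values).
smoker_share = is_smoker.groupby(f["region"]).mean()
print(smoker_share)
```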
• The following program shows the relationship between average charges and sex.
• In this CSV file, the average charge for females is 12569.578844 and the average charge for
males is 13956.751178.
• Another program shows the relationship between charges for smokers and charges for non-smokers.
• The average charges of non-smokers are much lower, whereas the average charges of smokers are
very high.
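Group averages like the ones listed above can be computed with groupby. The sketch below uses made-up rows, so its numbers differ from the values reported for the real file. The names Average_charges_sex and Average_charges_smoker are taken from the report, but the way they are built here is an assumption.

```python
import pandas as pd

# Made-up rows standing in for 'OEL.csv'.
f = pd.DataFrame({
    "sex": ["female", "male", "male", "female",
            "male", "female", "male", "female"],
    "smoker": ["yes", "no", "no", "no", "yes", "no", "yes", "no"],
    "charges": [17000.0, 2000.0, 4000.0, 5000.0,
                22000.0, 9000.0, 30000.0, 14000.0],
})

# Mean of 'charges' within each group of the grouping column.
Average_charges_sex = f.groupby("sex")["charges"].mean()
Average_charges_smoker = f.groupby("smoker")["charges"].mean()

print(Average_charges_sex)     # one mean per sex
print(Average_charges_smoker)  # one mean per smoking status
```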
Average_charges_sex.describe():
• The describe() method in pandas, when applied as Average_charges_sex.describe(), generates
summary statistics for the numerical values in 'Average_charges_sex'.
• These statistics include measures such as count, mean, standard deviation, minimum, and
maximum, providing a comprehensive summary of the distribution of the numerical data.
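The report does not show how 'Average_charges_sex' is constructed. One reading, sketched here on made-up rows, is that the charges are grouped by sex and describe() is called on the grouped column, which yields one row of statistics per sex. This construction is an assumption, not the report's confirmed code.

```python
import pandas as pd

# Made-up rows standing in for 'OEL.csv'.
f = pd.DataFrame({
    "sex": ["female", "male", "male", "female",
            "male", "female", "male", "female"],
    "charges": [17000.0, 2000.0, 4000.0, 5000.0,
                22000.0, 9000.0, 30000.0, 14000.0],
})

# describe() on a grouped column returns count, mean, std, min,
# quartiles and max separately for each group.
per_sex = f.groupby("sex")["charges"].describe()
print(per_sex)

print(per_sex.loc["female", "mean"])  # mean charge for females
```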
Average_charges_smoker.describe():
• This line graph shows the relationship between age and medical charges in the
OEL/insurance dataset. As age increases, there is an increasing trend in charges,
indicating that older individuals tend to have higher medical expenses.
• The markers on the line represent individual data points, showing the specific charges
associated with each age.
• The line's upward slope suggests a positive correlation between age and medical charges,
indicating that age is a significant factor influencing healthcare costs.
• This graph highlights the importance of age as a driver of rising medical expenses.
Conclusion:
We learnt how to handle CSV files and perform different operations on them. Handling the
biomedical dataset CSV file in Python involved reading the data using pandas, exploring its
structure and content using methods like head(), tail(), and describe(), and conducting various
analyses such as statistical summaries, visualization, or machine learning modeling. To establish
relationships between different columns in the file, such as sex and charges or smoking and
charges, we could use correlation analysis to identify any linear relationships between pairs of
columns.
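As a sketch of the correlation analysis suggested here: text columns such as sex and smoker first have to be turned into numbers. The example below encodes them as 0/1 indicator columns (the names is_male and is_smoker are hypothetical) and computes Pearson correlations on made-up rows.

```python
import pandas as pd

# Made-up rows standing in for 'OEL.csv'.
f = pd.DataFrame({
    "age": [19, 23, 30, 35, 41, 52, 60, 64],
    "sex": ["female", "male", "male", "female",
            "male", "female", "male", "female"],
    "smoker": ["yes", "no", "no", "no", "yes", "no", "yes", "no"],
    "charges": [17000.0, 2000.0, 4000.0, 5000.0,
                22000.0, 9000.0, 30000.0, 14000.0],
})

# Encode the two text columns as 0/1 so they can enter a correlation.
encoded = f.assign(
    is_male=(f["sex"] == "male").astype(int),
    is_smoker=(f["smoker"] == "yes").astype(int),
)

# Pairwise Pearson correlation between the numerical columns.
corr = encoded[["age", "is_male", "is_smoker", "charges"]].corr()
print(corr["charges"])  # correlation of each column with charges
```

A value near +1 or -1 indicates a strong linear relationship with charges, while a value near 0 indicates little linear relationship.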
References:
[1] https://www.edureka.co/blog/what-is-a-database/#Database
[2] https://www.sciencedirect.com/topics/computer-science/biomedical-data
[3] https://en.wikipedia.org/wiki/List_of_biological_databases
[4] https://www.kaggle.com/datasets?search=biomedical+datasets