DAL Lab File
LAB MANUAL
Session 2021-22
This is to certify that the experimental work entered in this journal as per the B.Tech. III
year syllabus prescribed by RGPV was done by Piyush Mahajan, B.Tech. III year, VI
semester, in the Data Analytics Lab of this institute during the academic year 2021-22.
PEO 1: To provide students with a solid foundation in mathematics, computer science and
engineering, basic science fundamentals required to solve the computing problems.
PEO 2: To expose students to latest computing technologies and software tools, so that they
can comprehend, analyze, design and create innovative computing products and solutions
for real life problems.
PEO 3: To inculcate in students a multidisciplinary approach, professional attitude and
ethics, communication and teamwork skills, and ability to relate computer engineering
issues with social awareness.
PEO 4: To develop professional skills in students that prepare them for immediate
employment and for life-long learning in advanced areas of computer science and related
fields which enable them to be successful entrepreneurs.
PROGRAM SPECIFIC OUTCOMES (PSO's)
A graduate of the Computer Science and Engineering Program will demonstrate:
PSO 1: Computer Science Specific Skills: The ability to identify, analyze and design
solutions for complex engineering problems in multidisciplinary areas by understanding the
core principles and concepts of computer science and thereby engage in national grand
challenges.
PSO 2: Programming and Software Development Skills: The ability to acquire programming
efficiency by designing algorithms and applying standard practices in software project
development to deliver quality software products meeting the demands of the industry.
PSO 3: Professional Skills: The ability to apply the fundamentals of computer science in
competitive research and to develop innovative products to meet the societal needs thereby
evolving as an eminent researcher and entrepreneur.
EXPERIMENT NO. 1
Objectives: To write and execute Python programs that demonstrate the basic operations of
Python's add-on library NumPy.
Outcomes: To become familiar with Python's built-in functions and data structures and to use
them for data manipulation and analysis with the add-on library NumPy.
Prerequisite: You must be comfortable with variables, linear equations, graphs of functions,
histograms, and statistical means.
You should be a good programmer. Ideally, you should have some experience programming
in Python because the programming exercises are in Python. However, experienced
programmers without Python experience can usually complete the programming exercises
anyway.
Hardware requirements: Memory and disk space required per user: 1GB RAM + 1GB of
disk + 0.5 CPU core. Server overhead: 2-4GB or 10% system overhead (whichever is larger),
0.5 CPU cores, 10GB disk space. Port requirements: Port 8000 plus 5 unique, random ports
per notebook.
Software requirements: Jupyter Notebook, the Anaconda platform, or any online platform that
can run the programs.
NumPy is a Python library for working with arrays. It also has functions for working in the
domain of linear algebra, Fourier transforms, and matrices.
NumPy was created in 2005 by Travis Oliphant. It is an open source project and you can use
it freely.
NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.
The array object in NumPy is called ndarray. It comes with many supporting functions that
make working with ndarray very easy.
Arrays are very frequently used in data science, where speed and resources are very
important.
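As a minimal sketch of the points above, the following creates two ndarray objects and applies an element-wise operation, a matrix product, a determinant, and a Fourier transform. The array values and variable names are illustrative assumptions, not taken from the lab exercises.

```python
import numpy as np

# Illustrative data; the values are assumptions chosen for this sketch.
a = np.array([[1, 2], [3, 4]])      # a 2-D ndarray
b = np.array([[5, 6], [7, 8]])

print(type(a).__name__)             # name of the array type
print(a.shape, a.ndim)              # shape and number of dimensions

elementwise = a * 2                 # vectorized: every element is doubled
product = a @ b                     # matrix product (linear algebra)
determinant = np.linalg.det(a)      # determinant of a
spectrum = np.fft.fft(np.array([1.0, 0.0, 0.0, 0.0]))  # discrete Fourier transform

print(elementwise)
print(product)
print(determinant)
print(spectrum)
```

The element-wise and matrix operations run over the whole array without an explicit Python loop, which is where most of NumPy's speed advantage over plain lists comes from.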
Data Science is a branch of computer science in which we study how to store, use and analyze
data in order to derive information from it.
Ans: The Pandas module mainly works with tabular data, whereas the NumPy module
works with numerical data. The NumPy library provides objects for multi-dimensional
arrays, whereas Pandas offers an in-memory 2D table object called
DataFrame. NumPy consumes less memory than Pandas.
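The difference between the two objects can be illustrated with a small sketch. The column names, row labels, and values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data: rows are students, columns are two test scores.
scores = np.array([[70, 85], [90, 65], [80, 75]])

# The same data as a DataFrame: a 2D table with labelled rows and columns.
df = pd.DataFrame(scores, columns=["test1", "test2"], index=["a", "b", "c"])

# NumPy addresses data by position; pandas can also address it by label.
by_position = scores[1, 0]
by_label = df.loc["b", "test1"]

column_means = df.mean()   # per-column mean, returned as a labelled Series

print(df)
print(column_means)
```

Both lookups return the same value; the DataFrame simply carries the row and column labels along with the numbers.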
0818CS191132 Pratyoosh Mishra
EXPERIMENT NO. 2
Objectives: To write and execute Python programs that demonstrate the basic operations of
Python's add-on library Pandas.
Outcomes: To become familiar with Python's built-in functions and data structures and to use
them for data manipulation and analysis with the add-on library Pandas.
Prerequisite: You must be comfortable with variables, linear equations, graphs of functions,
histograms, and statistical means.
You should be a good programmer. Ideally, you should have some experience programming
in Python because the programming exercises are in Python. However, experienced
programmers without Python experience can usually complete the programming exercises
anyway.
Hardware requirements: Memory and disk space required per user: 1GB RAM + 1GB of
disk + 0.5 CPU core. Server overhead: 2-4GB or 10% system overhead (whichever is larger),
0.5 CPU cores, 10GB disk space. Port requirements: Port 8000 plus 5 unique, random ports
per notebook.
Software requirements: Jupyter Notebook, the Anaconda platform, or any online platform that
can run the programs.
Data Science is a branch of computer science in which we study how to store, use and analyze
data in order to derive information from it.
What Can Pandas Do?
Pandas gives you answers about the data.
Pandas is also able to delete rows that are not relevant or that contain wrong values, such as
empty or NULL values. This is called cleaning the data.
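Cleaning can be sketched as follows. The column names and values are hypothetical, and filling with the column mean is shown only as one possible alternative to deleting rows.

```python
import numpy as np
import pandas as pd

# Hypothetical records; NaN marks an empty value.
df = pd.DataFrame({
    "duration": [60, 45, np.nan, 30],
    "calories": [400.0, np.nan, 300.0, 200.0],
})

cleaned = df.dropna()            # drop every row that contains an empty value
filled = df.fillna(df.mean())    # alternative: replace empty values with the column mean

print(cleaned)
print(filled)
```

dropna() keeps only the complete rows, while fillna() keeps every row at the cost of inserting estimated values.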
Ans: Pandas is one of the most popular Python libraries used for data analysis. It provides
highly optimized performance, with back-end source code written in C or
Python. We can analyze data in pandas with Series and DataFrame objects.
Q.2 What are the significant features of the pandas Library?
Ans: The key features of the pandas library are as follows:
• Memory Efficient
• Data Alignment
• Reshaping
• Time Series
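Three of the features listed above (data alignment, reshaping, and time series) can be illustrated with a short sketch. All labels, dates, and values are hypothetical.

```python
import pandas as pd

# Data alignment: arithmetic matches values by index label, not by position.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])
aligned = s1 + s2            # labels present in only one Series give NaN

# Reshaping: pivot a long table into a wide one.
long_table = pd.DataFrame({
    "day": ["mon", "mon", "tue", "tue"],
    "city": ["x", "y", "x", "y"],
    "temp": [20, 25, 22, 27],
})
wide_table = long_table.pivot(index="day", columns="city", values="temp")

# Time series: a date index supports resampling to a coarser frequency.
dates = pd.date_range("2022-01-01", periods=4, freq="D")
daily = pd.Series([1, 2, 3, 4], index=dates)
two_day_totals = daily.resample("2D").sum()

print(aligned)
print(wide_table)
print(two_day_totals)
```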
Conclusion:
EXPERIMENT NO. 3
Aim / Title: Implementation of Python's libraries for data visualisation and plotting.
Outcomes: To become familiar with Python's libraries for data visualisation and plotting.
Prerequisite: You must be comfortable with variables, linear equations, graphs of functions,
histograms, and statistical means.
You should be a good programmer. Ideally, you should have some experience programming
in Python because the programming exercises are in Python. However, experienced
programmers without Python experience can usually complete the programming exercises
anyway.
Hardware requirements: Memory and disk space required per user: 1GB RAM + 1GB of
disk + 0.5 CPU core. Server overhead: 2-4GB or 10% system overhead (whichever is larger),
0.5 CPU cores, 10GB disk space. Port requirements: Port 8000 plus 5 unique, random ports
per notebook.
Software requirements: Jupyter Notebook, the Anaconda platform, or any online platform that
can run the programs.
Theory:
Bar chart
• length/count
• category
• color
• Presents categorical data with rectangular bars with heights or lengths
proportional to the values that they represent. The bars can be plotted vertically
or horizontally.
• A bar graph shows comparisons among discrete categories. One axis of the chart
shows the specific categories being compared, and the other axis represents a
measured value.
• Some bar graphs present bars clustered in groups of more than one, showing the
values of more than one measured variable. These clustered groups can be
differentiated using color.
• For example; comparison of values, such as sales performance for several
persons or businesses in a single time period.
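A minimal sketch of such a comparison, assuming matplotlib is the plotting library in use. The names and sales figures are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical sales figures for one time period.
people = ["A", "B", "C"]
sales = [120, 95, 150]

fig, ax = plt.subplots()
bars = ax.bar(people, sales, color=["tab:blue", "tab:orange", "tab:green"])
ax.set_xlabel("Salesperson")     # categorical axis
ax.set_ylabel("Sales")           # measured-value axis
ax.set_title("Sales by person")

# Each bar's height equals the value it represents.
heights = [bar.get_height() for bar in bars]
print(heights)
```

Calling ax.barh() instead of ax.bar() draws the same data horizontally, and fig.savefig() with a file name of your choice writes the chart to disk.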
Histogram
• bin limits
• count/length
• color
• An approximate representation of the distribution of numerical data. Divide the
entire range of values into a series of intervals and then count how many values
fall into each interval; this is called binning. The bins are usually specified as
consecutive, non-overlapping intervals of a variable. The bins (intervals) must
be adjacent, and are often (but not required to be) of equal size.
• For example, determining frequency of annual stock market percentage returns
within particular ranges (bins) such as 0-10%, 11-20%, etc. The height of the bar
represents the number of observations (years) with a return % in the range
represented by the respective bin.
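The binning step can be sketched with NumPy alone. The return values below are hypothetical, and the bin edges are an assumption chosen so that the bins are adjacent and of equal width.

```python
import numpy as np

# Hypothetical annual returns in percent.
returns = np.array([3, 7, 12, 15, 18, 22, 25, 8, 14, 31])

# Four adjacent, equal-width bins: [0,10), [10,20), [20,30), [30,40]
edges = np.array([0, 10, 20, 30, 40])
counts, used_edges = np.histogram(returns, bins=edges)

# One line per bin: the count is the height of the corresponding bar.
for left, right, count in zip(used_edges[:-1], used_edges[1:], counts):
    print(f"{left}-{right}%: {count} year(s)")
```

Every observation falls into exactly one bin, so the counts add up to the number of observations.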
Scatter plot
• x position
• y position
• symbol/glyph
• color
• size
• Uses Cartesian coordinates to display values for typically two variables for a set
of data.
• Points can be coded via color, shape and/or size to display additional variables.
• Each point on the plot has an associated x and y term that determines its location
on the cartesian plane.
• Scatter plots are often used to highlight the correlation between variables (x and
y).
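A minimal sketch of a scatter plot, assuming matplotlib is the plotting library in use. The data is hypothetical; marker size and color carry additional variables, and the Pearson coefficient is computed separately to quantify the correlation that the plot shows.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical paired measurements.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
size = np.array([20, 40, 60, 80, 100])   # a third variable encoded as marker size

fig, ax = plt.subplots()
points = ax.scatter(x, y, s=size, c=y, cmap="viridis")  # color encodes y here
ax.set_xlabel("x")
ax.set_ylabel("y")

r = np.corrcoef(x, y)[0, 1]      # Pearson correlation coefficient of x and y
print(r)
```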
Scatter plot (3D)
• position x
• position y
• position z
• color
• symbol
• size
• Similar to the 2-dimensional scatter plot above, the 3-dimensional scatter plot
visualizes the relationship between typically 3 variables from a set of data.
Again, points can be coded via color, shape and/or size to display additional
variables.
Network
• nodes size
• nodes color
• ties thickness
• ties color
• spatialization
• Finding clusters in the network (e.g. grouping Facebook friends into different
clusters).
• Discovering bridges (information brokers or boundary spanners) between
clusters in the network
• Determining the most influential nodes in the network (e.g. A company wants to
target a small group of people on Twitter for a marketing campaign).
• Finding outlier actors who do not fit into any cluster or are in the periphery of a
network.
Pie chart
• color
• Represents one categorical variable which is divided into slices to illustrate
numerical proportion. In a pie chart, the arc length of each slice (and
consequently its central angle and area), is proportional to the quantity it
represents.
• For example, the proportion of English native speakers worldwide.
Line chart
• x position
• y position
• symbol/glyph
• color
• size
• Represents information as a series of data points called 'markers' connected by
straight line segments.
• Similar to a scatter plot except that the measurement points are ordered
(typically by their x-axis value) and joined with straight line segments.
• Often used to visualize a trend in data over intervals of time – a time series –
thus the line is often drawn chronologically.
3. Scatter plot
4. Scatter plot (3D)
5. Pie chart
Conclusion: We are now able to use Python's libraries for data visualization and
plotting.
EXPERIMENT NO. 4
Prerequisite: You must be comfortable with variables, linear equations, graphs of functions,
histograms, and statistical means.
You should be a good programmer. Ideally, you should have some experience programming
in R because the programming exercises are in R. However, experienced programmers
without R experience can usually complete the programming exercises anyway.
Hardware requirements: Memory and disk space required per user: 1GB RAM + 1GB of
disk + 0.5 CPU core. Server overhead: 2-4GB or 10% system overhead (whichever is larger),
0.5 CPU cores, 10GB disk space. Port requirements: Port 8000 plus 5 unique, random ports
per notebook.
Software requirements: Jupyter Notebook, the Anaconda platform, or any online platform that
can run the programs.
Theory:
R Data Structure:
To make the best of the R language, you’ll need a strong understanding of the basic data
types and data structures and how to operate on them.
Data structures are very important to understand because these are the objects you will
manipulate on a day-to-day basis in R. Dealing with object conversions is one of the most
common sources of frustration for beginners.
Vectors:
A vector is a collection of elements that are most commonly of
mode character, logical, integer or numeric.
You can create an empty vector with vector(). (By default the mode is logical; you can set
the mode explicitly, for example with vector("character", length = 5).) It is more common
to use direct constructors such as character(), numeric(), etc.
List:
In R, lists act as containers. Unlike atomic vectors, the contents of a list are not restricted to
a single mode and can encompass any mixture of data types. Lists are sometimes called
generic vectors, because the elements of a list can be of any type of R object, even lists
containing further lists. This property makes them fundamentally different from atomic
vectors. A list is a special type of vector. Each element can be a different type.
Create lists using list() or coerce other objects using as.list(). An empty list of the required
length can be created using vector().
Matrices
In these R objects, the elements are organised in a 2-dimensional layout. Matrices hold
elements of the same atomic type, which is useful when all the elements belong to a
single class. Matrices having numeric elements are created for mathematical calculations.
You can create matrices using the matrix() function. The basic syntax to create a matrix
is given below:
matrix(data, nrow, ncol, byrow, dimnames)
Factors
These R objects are used for categorizing data and storing them as levels. They are good for
statistical modelling and data analysis. Both integers and strings can be stored in factors.
You can use the factor() function for creating a factor by providing a vector as an input to
the method.
Data Frame:
A data frame is a very important data type in R. It’s pretty much the de facto data structure
for most tabular data and what we use for statistics.
A data frame is a special type of list where every element of the list has same length (i.e.
data frame is a “rectangular” list).
Data frames can have additional attributes such as rownames(), which can be useful for
annotating data, like subject_id or sample_id. But most of the time they are not used.
Some additional information on data frames:
• Usually created by read.csv() and read.table(), i.e. when importing the data into R.
• Assuming all columns in a data frame are of the same type, a data frame can be converted
to a matrix with data.matrix() (preferred) or as.matrix(). Otherwise type coercion will
be enforced and the results may not always be what you expect.
• Can also create a new data frame with the data.frame() function.
• Find the number of rows and columns with nrow(dat) and ncol(dat), respectively.
• Rownames are often automatically generated and look like 1, 2, …, n. Consistency in
numbering of rownames may not be honored when rows are reshuffled or subset.
LIST:-
MATRICES:-
DATA FRAMES:-
Conclusion:
Vector | List
It has contiguous memory. | It has non-contiguous memory.
It is synchronized. | It is not synchronized.
Vector may have a default size. | List does not have a default size.
In vector, each element requires space for itself only. | In list, each element requires extra space for the node which holds the element, including pointers to the next and previous elements in the list.
Insertion at the end requires constant time, but insertion elsewhere is costly. | Insertion is cheap no matter where in the list it occurs.
Vector is thread safe. | List is not thread safe.
Random access of elements is possible. | Random access of elements is not possible.
EXPERIMENT NO. 5
Prerequisite: You must be comfortable with variables, linear equations, graphs of functions,
histograms, and statistical means.
You should be a good programmer. Ideally, you should have some experience programming
in R because the programming exercises are in R. However, experienced programmers
without R experience can usually complete the programming exercises anyway.
Hardware requirements: Memory and disk space required per user: 1GB RAM + 1GB of
disk + 0.5 CPU core. Server overhead: 2-4GB or 10% system overhead (whichever is larger),
0.5 CPU cores, 10GB disk space. Port requirements: Port 8000 plus 5 unique, random ports
per notebook.
Software requirements: Jupyter Notebook, the Anaconda platform, or any online platform that
can run the programs.
Theory:
Pipe operator
The pipe operator is available in packages such as magrittr and dplyr and simplifies your
overall code. The operator lets you chain multiple functions together, passing the result of
one call as the first argument of the next. Denoted by the %>% symbol, it can be used with
popular functions such as summarise(), filter(), select() and group_by() when manipulating
data in R.
EXPERIMENT NO. 6
Prerequisite: You must be comfortable with variables, linear equations, graphs of functions,
histograms, and statistical means.
You should be a good programmer. Ideally, you should have some experience programming
in R because the programming exercises are in R. However, experienced programmers
without R experience can usually complete the programming exercises anyway.
Hardware requirements: Memory and disk space required per user: 1GB RAM + 1GB of
disk + 0.5 CPU core. Server overhead: 2-4GB or 10% system overhead (whichever is larger),
0.5 CPU cores, 10GB disk space. Port requirements: Port 8000 plus 5 unique, random ports
per notebook.
Software requirements: Jupyter Notebook, the Anaconda platform, or any online platform that
can run the programs.
Theory:
Data Visualization in R
R is a language that is designed for statistical computing, graphical data analysis, and
scientific research. It is usually preferred for data visualization as it offers flexibility and
minimum required coding through its packages.
BAR PLOT :-
HISTOGRAM :-
BOX PLOT:-
SCATTER PLOT : -
Conclusion:
ANS:- ggplot(data=diamonds,aes(x=price))+geom_bar()