Introduction To Data and Statistics With R

This document provides an introduction to data and statistics. It covers foundational topics like descriptive versus inferential statistics, types of data, measures of central tendency and spread, population versus sample, data visualization techniques, and exploring categorical variables. The goal is to introduce statistics as a science of understanding and analyzing data to make data-driven decisions. Key concepts covered include measures like mean, median, mode, variance, standard deviation, and interquartile range. Visualization techniques like scatter plots, histograms, box plots, and bar plots are also introduced.

Uploaded by

APPIAH ELIJAH

INTRODUCTION TO DATA & STATISTICS WITH R
2

HELLO!
I am Elijah Appiah from Ghana.
I am an Economist by profession.
I love everything about R!

You can reach me: eappiah.uew@gmail.com
[Photo caption: secret behind the smile!]
3

Lecture Series
Introduction to Data and Statistics

Foundations of Probability

Inferential Statistics

Modeling and Regression Analysis


4

Lesson Goal
Introduce statistics as a science of
understanding and analyzing data
and making data-based decisions.
5

Statistics

Practice and study of:


collecting data
analyzing data
6

Statistics

Two main branches of statistics:


Descriptive – describes and summarises data
Inferential – uses sample data to make inferences about a larger population
7

Statistics - Data

Two main types of data


Numeric (Quantitative)
Categorical (Qualitative)
8

Statistics - Data

Data
Numeric (Quantitative): Continuous, Discrete
Categorical (Qualitative): Nominal, Ordinal


9

Data
Numeric
Discrete – counts, e.g. number of cylinders of a vehicle
Continuous – measured values that can fall anywhere within an interval, e.g. height, weight
Categorical
Nominal – names, labels, or categories with no natural order, e.g. gender, countries
Ordinal – categories with a natural order, e.g. Likert scales
10

EXPLORATION & SUMMARIES


11

EXPLORATION & SUMMARIES


12

Visualizing Numeric Data


TWO VARIABLES

Correlation does not imply causation


13

Visualizing Numeric Data


TWO VARIABLES
Both Continuous

Scatter plot – geom_point()

Data Source: gapminder.com


Variables [2012]: Country, Income per person (in US$), Life Expectancy (in years); columns {country, income, lifeExp}
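A minimal sketch of the scatter plot named on this slide, assuming the ggplot2 package is installed. The data frame is simulated stand-in data, not the Gapminder figures; only the column names `country`, `income`, and `lifeExp` follow the slide, and the relationship between them is invented for illustration.

```r
library(ggplot2)

# Simulated stand-in data (NOT real Gapminder values); column names follow the slide.
set.seed(1)
toy <- data.frame(
  country = paste0("country_", 1:30),
  income  = round(exp(runif(30, log(500), log(50000))))
)
# Assumed relationship, for illustration only: life expectancy rises with log income.
toy$lifeExp <- 40 + 4 * log(toy$income) + rnorm(30, sd = 3)

# Scatter plot for two continuous variables.
p_scatter <- ggplot(toy, aes(x = income, y = lifeExp)) +
  geom_point()
print(p_scatter)
```

A pattern in such a plot shows association only; on its own it does not say that one variable causes the other.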
14

Now, let’s practice


15

Visualizing Numeric Data


ONE VARIABLE
Discrete
Bar Plot – geom_bar()
Continuous
Histogram – geom_histogram()
Density Plot – geom_density()
Dot Plot – geom_dotplot()
Box Plot – geom_boxplot()
Data Source: gapminder.com
Variables [2012]: Country, Income per person (in US$), Life Expectancy (in years); columns {country, income, lifeExp}
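The one-variable plots listed here might be sketched as follows, assuming ggplot2 is installed. To stay self-contained, this uses R's built-in `mtcars` data rather than the Gapminder file from the lecture: `cyl` (number of cylinders) plays the discrete variable and `mpg` the continuous one.

```r
library(ggplot2)

# mtcars ships with R; it is a swapped-in dataset, not the one used in the lecture.
cyl_counts <- table(mtcars$cyl)   # counts behind the bar plot

# Discrete variable: bar plot of counts per category.
p_bar <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()

# Continuous variable: histogram, density plot, and box plot.
p_hist <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)
p_dens <- ggplot(mtcars, aes(x = mpg)) + geom_density()
p_box  <- ggplot(mtcars, aes(y = mpg)) + geom_boxplot()

print(p_bar)
print(p_hist)
```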
16

Visualizing Numeric Data


[Figure panels: Left Skewed, Symmetric, Right Skewed]
17

Now, let’s practice


18

Visualizing Numeric Data


[Figure panels: Large bin width, Moderate bin width, Narrow bin width]
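The effect of bin width can be explored with a sketch like this one, assuming ggplot2 is installed. The three widths and the use of `mtcars$mpg` are arbitrary choices for illustration, not values from the lecture.

```r
library(ggplot2)

# Illustrative bin widths (assumed values, chosen for the mpg scale).
binwidths <- c(large = 10, moderate = 2.5, narrow = 0.5)

# One histogram per bin width; the same data looks coarser or noisier
# depending on the width chosen.
plots <- lapply(binwidths, function(bw) {
  ggplot(mtcars, aes(x = mpg)) +
    geom_histogram(binwidth = bw) +
    ggtitle(paste("binwidth =", bw))
})

print(plots$moderate)
```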
19

Visualizing Numeric Data


Dot Plot
20

Now, let’s practice


21

Visualizing Numeric Data


Box Plot
[Figure panels: Left Skewed, Normal, Right Skewed]
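A rough way to reproduce the three box plot shapes, using only base R and simulated data. The choice of distributions (exponential for skew, normal for symmetry) is an assumption made for illustration.

```r
set.seed(42)

# Simulated samples: a long left tail, a symmetric shape, and a long right tail.
shapes <- list(
  left_skewed  = -rexp(500),
  normal       = rnorm(500),
  right_skewed = rexp(500)
)

# Horizontal box plots make the direction of the longer whisker easy to see.
boxplot(shapes, horizontal = TRUE)

# In a skewed sample the mean is pulled toward the long tail, away from the median.
skew_check <- sapply(shapes, function(x) mean(x) - median(x))
print(skew_check)
```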
22

Now, let’s practice


23

Population vs Sample
Population – the entire group you want to draw conclusions about
e.g. income of all countries in the world
Sample – a specific group from the population used for inference
e.g. income of countries in Africa
24

Measures of Central Tendency


It is not always practical to collect every observation in the population.
Estimates from a sample may not be perfect.
A good sample (representative of the population) makes estimates good guesses.
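The idea that a representative sample gives a good guess can be sketched in base R. The "population" here is simulated (a log-normal stand-in for incomes, with assumed parameters), and the sample is drawn at random.

```r
set.seed(123)

# Simulated population of 10,000 "incomes" (assumed log-normal, illustration only).
population <- rlnorm(10000, meanlog = 9, sdlog = 1)

# Simple random sample of 200 units, drawn without replacement.
sample_ids <- sample(length(population), size = 200)
smp <- population[sample_ids]

pop_mean    <- mean(population)  # population parameter
sample_mean <- mean(smp)         # sample statistic: an estimate of pop_mean

print(c(population = pop_mean, sample = sample_mean))
```

The two means will generally differ somewhat; the sample mean is an estimate, not the parameter itself.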
25

Measures of Central Tendency


Key characteristics of a distribution
Mean
Mode
Median
Population parameters vs sample statistics
26

Measures of Central Tendency


Mean: mean()
Median: median()
Mode: table() {base}; count() {dplyr}
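A small worked example of the three measures using base R only. The vector `x` is made up for illustration; since base R has no built-in function for the statistical mode, the mode is read off a `table()` of counts, as the slide suggests.

```r
x <- c(2, 3, 3, 5, 7, 3, 9, 5)   # made-up values

x_mean   <- mean(x)     # sum 37 over 8 values = 4.625
x_median <- median(x)   # sorted: 2 3 3 3 5 5 7 9 -> (3 + 5) / 2 = 4

# Mode: tabulate the values, then keep the value(s) with the highest count.
freq   <- table(x)
x_mode <- as.numeric(names(freq)[freq == max(freq)])   # 3 appears three times

print(c(mean = x_mean, median = x_median, mode = x_mode))
```

With a data frame, `dplyr::count()` would give the same counts in tabular form.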
27

Now, let’s practice


28

Measures of Spread
Data Variability
[Figure: two distributions, one with Mean = 0, SD = 1 and one with Mean = 0, SD = 2]
29

Measures of Spread
Range: (maximum – minimum)

Variance: (average squared deviation from the mean)

Standard Deviation: (square root of the variance; the typical deviation from the mean)

Interquartile Range: (range of the middle 50% of the data; difference between the third and first quartiles)
30

Measures of Spread
Range: max(x) - min(x); diff(range(x)) (range(x) alone returns the minimum and maximum, not their difference)

Variance: var(x)

Standard Deviation: sd(x); sqrt(var(x))

Interquartile Range: IQR(x); quantile(x); boxplot(x)
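The functions on this slide can be tried on a small made-up vector. Note that `var()` and `sd()` in R use the sample formula, dividing by n - 1 rather than n.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up values; mean is 5

x_range <- diff(range(x))   # range(x) is c(2, 9), so the difference is 7
x_var   <- var(x)           # squared deviations sum to 32; 32 / (8 - 1)
x_sd    <- sd(x)            # same as sqrt(var(x))

# Quartiles and the interquartile range (R's default quantile method).
q     <- quantile(x, c(0.25, 0.75))
x_iqr <- IQR(x)             # third quartile minus first quartile

print(c(range = x_range, variance = x_var, sd = x_sd, iqr = x_iqr))
```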


31

Robust Statistics
A measure least affected by extreme
values
32

Robust Statistics
Robust measures of Center & Spread
Example:
Data              Mean     Median
1,2,3,4,5,6       3.5      3.5
1,2,3,4,5,1000    169.17   3.5

Note: While the mean depends on all observations, the median depends only on the midpoint of the distribution, and the values of the end points are irrelevant to its calculation.
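The example in the table can be checked directly in base R:

```r
clean   <- c(1, 2, 3, 4, 5, 6)
outlier <- c(1, 2, 3, 4, 5, 1000)   # same data with the last value replaced

mean_clean     <- mean(clean)       # 21 / 6 = 3.5
mean_outlier   <- mean(outlier)     # 1015 / 6, about 169.17
median_clean   <- median(clean)     # (3 + 4) / 2 = 3.5
median_outlier <- median(outlier)   # still (3 + 4) / 2 = 3.5

print(round(c(mean_clean, mean_outlier, median_clean, median_outlier), 2))
```

A single extreme value moves the mean by more than 165 units while leaving the median unchanged.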
33

Robust Statistics
Median is a more robust statistic of center than the mean.

Likewise, the IQR (which is based on quartiles) is more robust than the standard deviation (which is calculated using the mean).
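The same comparison can be made for spread. This base R sketch uses two made-up vectors that differ only in one extreme value:

```r
clean   <- c(1, 2, 3, 4, 5, 6)
outlier <- c(1, 2, 3, 4, 5, 1000)   # one extreme value added

# Standard deviation reacts strongly to the extreme value...
sd_clean   <- sd(clean)
sd_outlier <- sd(outlier)

# ...while the IQR does not change here, because the middle of the data is the same.
iqr_clean   <- IQR(clean)
iqr_outlier <- IQR(outlier)

print(c(sd_clean = sd_clean, sd_outlier = sd_outlier,
        iqr_clean = iqr_clean, iqr_outlier = iqr_outlier))
```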
34

Robust Statistics
Robust statistics like the median and IQR are most useful for describing skewed distributions.

Non-robust statistics like the mean and standard deviation are useful for describing symmetric data.
35

Data Transformation
Rescaling data
Logarithmic Transformation
Square Root Transformation
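A brief base R sketch of the three transformations. The vector is made up, and "rescaling" is interpreted here as standardising to z-scores; that interpretation is an assumption, since the slide does not say which rescaling is meant.

```r
x <- c(1, 10, 100, 1000, 10000)   # made-up, strongly right-skewed values

# Rescaling (assumed here to mean z-scores): subtract the mean, divide by the SD.
z <- as.numeric(scale(x))

# Logarithmic transformation: compresses large values; base 10 gives 0, 1, 2, 3, 4.
log_x <- log10(x)

# Square root transformation: a milder compression than the log.
sqrt_x <- sqrt(x)

print(data.frame(x, z = round(z, 2), log_x, sqrt_x = round(sqrt_x, 2)))
```

Log and square root transformations require non-negative data, and the log additionally requires strictly positive values.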
36

Now, let’s practice


37

EXPLORATION & SUMMARIES


38

Exploring Categories
Data
titanic {ggmosaic}
Passengers and crew on board the Titanic
Description
A dataset containing some demographics and survival of people
on board the Titanic
Variables: Class (1st, 2nd, 3rd, crew); Sex (Male, Female);
Age (Child, Adult); Survived (Yes, No)
39

Exploring Categories
One Categorical Variable
Frequency Table
Bar Plots
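A frequency table and bar plot for one categorical variable might look like this. To avoid an extra package, the sketch uses base R's built-in `Titanic` table (dimensions Class, Sex, Age, Survived) as a stand-in for the `titanic` data from ggmosaic named in the lecture.

```r
# Titanic is a 4-way contingency table shipped with R (datasets package).
# Summing over all dimensions except the first leaves the counts per Class.
class_counts <- margin.table(Titanic, 1)

# Relative frequencies (proportions that sum to 1).
class_props <- prop.table(class_counts)

print(class_counts)
print(round(class_props, 3))

# Bar plot of the frequency table.
barplot(class_counts, xlab = "Class", ylab = "Count")
```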
40

Now, let’s practice


41

Exploring Categories
Two Categorical Variables
Contingency Table
Stacked Bar Plots
Clustered Bar Plots
Mosaic Plots
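The four tools listed for two categorical variables can be sketched with base R graphics, again assuming the built-in `Titanic` table as a stand-in for the ggmosaic data. Base functions are used here instead of ggplot2 geoms to keep the example dependency-free.

```r
# Contingency table of Class (dimension 1) by Survived (dimension 4).
class_by_survival <- margin.table(Titanic, c(1, 4))
print(class_by_survival)

# Row proportions: share of each class that did or did not survive.
row_props <- prop.table(class_by_survival, margin = 1)
print(round(row_props, 3))

# Stacked and clustered bar plots (one bar or bar group per class).
barplot(t(class_by_survival), legend.text = TRUE, xlab = "Class")
barplot(t(class_by_survival), beside = TRUE, legend.text = TRUE, xlab = "Class")

# Mosaic plot: tile areas are proportional to the cell counts.
mosaicplot(class_by_survival, main = "Class by survival")
```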
42

Now, let’s practice


43

Exploring Categories
One Numerical and One Categorical
Box Plot
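For one numerical and one categorical variable, a grouped box plot could be drawn as follows, assuming ggplot2 is installed. The built-in `mtcars` data is used as a stand-in, with `mpg` as the numerical variable and `cyl` treated as a category.

```r
library(ggplot2)

# One box per category: mpg (numerical) split by number of cylinders (categorical).
p_grouped <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  labs(x = "cyl", y = "mpg")
print(p_grouped)

# The centre line of each box is the group median, which can also be computed directly.
group_medians <- tapply(mtcars$mpg, mtcars$cyl, median)
print(group_medians)
```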
44

Now, let’s practice


45

Any questions?

Reach me anytime!
Email
eappiah.uew@gmail.com
LinkedIn
https://www.linkedin.com/in/appiah-elijah-383231123/
