Uni T - 2 - R Programming

Uploaded by

anju.k10301
Unit 2 - R Programming

2.1. Summary Function in R


• The summary() function in R quickly summarizes the values in a vector, a data frame, or a
fitted regression model.

• The function uses the following basic syntax:

summary(data)

• For a numeric vector, the summary() function automatically calculates the following summary
statistics:

• Min: The minimum value

• 1st Qu: The value of the 1st quartile (25th percentile)

• Median: The median value

• Mean: The arithmetic mean

• 3rd Qu: The value of the 3rd quartile (75th percentile)

• Max: The maximum value

Using summary() with Vector

• The following code shows how to use the summary() function to summarize the values in a
vector
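As a minimal sketch of the vector case (the values in x are made up for illustration):

```r
# Hypothetical numeric vector
x <- c(2, 4, 6, 8, 10)

# summary() reports Min., 1st Qu., Median, Mean, 3rd Qu. and Max.
s <- summary(x)
print(s)
```

Individual statistics can be pulled out by name, for example s[["Median"]].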

Using summary() with Data Frame


• The following code shows how to use the summary() function to summarize every column in
a data frame:
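A minimal sketch of the data frame case, assuming a small made-up data frame named df (the column names and values are hypothetical):

```r
# Hypothetical data frame with one character column and three numeric columns
df <- data.frame(
  team     = c("A", "B", "C", "D", "E"),
  points   = c(99, 90, 86, 88, 95),
  assists  = c(33, 28, 31, 39, 34),
  rebounds = c(30, 28, 24, 24, 28)
)

# One block of statistics per column; the character column only reports
# its length, class and mode
df_summary <- summary(df)
print(df_summary)
```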

Using summary() with Specific Data Frame Columns

• The following code shows how to use the summary() function to summarize specific columns
in a data frame:
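A minimal sketch of summarizing selected columns, again assuming a made-up data frame df with hypothetical column names:

```r
# Hypothetical data frame
df <- data.frame(
  team     = c("A", "B", "C", "D", "E"),
  points   = c(99, 90, 86, 88, 95),
  assists  = c(33, 28, 31, 39, 34),
  rebounds = c(30, 28, 24, 24, 28)
)

# Summarize two columns selected by name
cols_summary <- summary(df[, c("points", "rebounds")])
print(cols_summary)

# Summarize a single column, which is just a vector
points_summary <- summary(df$points)
print(points_summary)
```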

2.2. Logistic Regression


• Logistic regression in R Programming is a classification algorithm used to find the probability
of event success and event failure.

• Logistic regression is a statistical method used to predict the outcome of a categorical
dependent variable (target variable) based on one or more predictor variables. It is a type of
regression analysis used for predicting the outcome of a binary dependent variable (i.e., a
variable that has only two possible outcomes, such as 0/1, yes/no, etc.).

• Logistic regression in R can be performed using the glm() function, which fits generalized
linear models.

Logistic regression is commonly used in various fields, including:

• Medical Research: Predicting disease outcomes, treatment responses, or patient survival.

• Marketing: Predicting customer satisfaction or response to marketing campaigns.

• Finance: Predicting credit risk, loan defaults, or whether a stock price will rise or fall.

• Social sciences: Predicting voter behavior, crime rates, or social network outcomes.

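• The logistic regression model can be represented mathematically as follows, where p is the
predicted probability of the positive outcome and z = b0 + b1*x1 + ... + bn*xn is a linear
combination of the predictor variables: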

p = 1 / (1 + e^(-z))
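
The formula can be checked numerically with a short sketch. The helper name sigmoid is an assumption made for this example; base R provides the same function as plogis().

```r
# Logistic (sigmoid) function: maps any real z to a probability in (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

z <- c(-4, 0, 4)
p <- sigmoid(z)
print(p)          # z = 0 gives exactly 0.5

# Base R's plogis() computes the same quantity
print(plogis(z))
```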

• The glm() function has the following basic syntax:

glm(formula, data = data, family = binomial)

• formula - A symbolic description of the relationship between the response variable and the
predictor variables, e.g. y ~ x1 + x2.

• data - The data frame containing the values of the variables.

• family - An R object which specifies the details of the model, and its value is binomial for
logistic regression.
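
A minimal sketch of fitting a logistic regression with glm(), assuming the built-in mtcars data set as a stand-in (it is not part of these notes). The binary column am is the response; hp and wt are the predictors. The 0.5 cutoff is an illustrative choice.

```r
# Fit a logistic regression on the built-in mtcars data
model <- glm(am ~ hp + wt, data = mtcars, family = binomial)
print(summary(model))

# Predicted probabilities that am == 1 for each row
probs <- predict(model, type = "response")

# Convert probabilities to class labels with a 0.5 cutoff
predicted <- ifelse(probs > 0.5, 1, 0)
print(table(Prediction = predicted, Reference = mtcars$am))
```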

2.3. Confusion Matrix


• The confusion matrix is a matrix used to compare the predicted values against the actual
values.

• The row headers in the confusion matrix represent predicted values and column headers are
used to represent actual values.

• A confusion matrix is a table used to evaluate the performance of a classification model in R
programming. It summarizes the predictions against the actual outcomes.

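• For a binary classifier, the confusion matrix contains four cells:

                      Actual Positive    Actual Negative
Predicted Positive    True Positive      False Positive
Predicted Negative    False Negative     True Negative

• True Positive – Indicates how many positive values are correctly predicted as positive by the
model.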

• False Positive – Indicates how many negative values are predicted as positive values by the
model.

• False Negative – Indicates how many positive values are predicted as negative values by the
model.

• True Negative – Indicates how many negative values are correctly predicted as negative by the
model.

confusionMatrix() function

• In R, the confusion matrix can be computed using the confusionMatrix() function, which is
provided by the caret package.

Syntax: confusionMatrix(data, reference, positive = NULL, dnn = c("Prediction", "Reference"))

where

• data - a factor of predicted classes.

• reference - a factor of classes to be used as the true results.

• positive(optional) - an optional character string naming the factor level that corresponds to
a "positive" result.

• dnn(optional) - a character vector of dimnames for the table.
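
A minimal sketch with made-up labels. The base R table() call builds the same 2x2 layout without extra packages. The confusionMatrix() call only runs if the caret package happens to be installed.

```r
# Hypothetical actual and predicted classes (1 = positive, 0 = negative)
actual    <- factor(c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0), levels = c(1, 0))
predicted <- factor(c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0), levels = c(1, 0))

# Rows are predictions, columns are actual values
cm <- table(Prediction = predicted, Reference = actual)
print(cm)

# Optional: the caret version also reports accuracy, sensitivity, etc.
if (requireNamespace("caret", quietly = TRUE)) {
  print(caret::confusionMatrix(data = predicted, reference = actual, positive = "1"))
}
```

With these labels the table holds 4 true positives, 1 false positive, 1 false negative and 4 true negatives.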

factor()

• Factors are data structures used to represent categorical data. A factor stores its values
as a fixed set of categories called levels.
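
A minimal sketch of creating and inspecting a factor (the survey responses are made up):

```r
# Hypothetical survey responses
responses <- c("yes", "no", "yes", "maybe", "no", "yes")

# Create a factor with an explicit level order
f <- factor(responses, levels = c("no", "maybe", "yes"))

print(levels(f))      # the distinct categories
print(table(f))       # how often each level occurs
print(as.integer(f))  # internal integer codes, one per element
```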


2.4. Calculate Sensitivity, Specificity in CARET


• The name ‘caret’ stands for Classification And REgression Training.

• Sensitivity and specificity are performance metrics used to evaluate the accuracy of a
classification model in R programming.

• The goal of a classification model is to learn the relationship between the input features and
the target class label, so that it can make accurate predictions on new, unseen data.

Here's how to calculate them:

Sensitivity (True Positive Rate):

• Definition: Proportion of true positives (correctly predicted instances) among all actual
positive instances.

• Formula: sensitivity = TP / (TP + FN)

• R code: sensitivity <- TP / (TP + FN)

Specificity (True Negative Rate):

• Definition: Proportion of true negatives (correctly predicted instances) among all actual
negative instances.

• Formula: specificity = TN / (TN + FP)

• R code: specificity <- TN / (TN + FP)

Where:

• TP = True Positives (correctly predicted positive instances)

• TN = True Negatives (correctly predicted negative instances)

• FP = False Positives (incorrectly predicted positive instances)

• FN = False Negatives (incorrectly predicted negative instances)
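
The two formulas can be sketched in base R with made-up labels. The variable names are assumptions for this example. The caret helpers sensitivity() and specificity() are called only if the package is installed.

```r
# Hypothetical actual and predicted classes (1 = positive, 0 = negative)
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)

# Count the four cells of the confusion matrix
TP <- sum(predicted == 1 & actual == 1)
TN <- sum(predicted == 0 & actual == 0)
FP <- sum(predicted == 1 & actual == 0)
FN <- sum(predicted == 0 & actual == 1)

sens <- TP / (TP + FN)   # 4 / 5 = 0.8
spec <- TN / (TN + FP)   # 4 / 5 = 0.8
print(c(sensitivity = sens, specificity = spec))

# Optional: the same values from caret, which expects factors
if (requireNamespace("caret", quietly = TRUE)) {
  p <- factor(predicted, levels = c(1, 0))
  a <- factor(actual, levels = c(1, 0))
  print(caret::sensitivity(p, a, positive = "1"))
  print(caret::specificity(p, a, negative = "0"))
}
```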


2.4. ROC Curve

• A Receiver Operating Characteristic (ROC) curve is a graphical representation of the
performance of a binary classification model. It plots the True Positive Rate (TPR) against the
False Positive Rate (FPR) at different threshold settings.

Here's a breakdown of the ROC curve:

• True Positive Rate (TPR): The proportion of actual positive instances correctly identified by
the model.

• False Positive Rate (FPR): The proportion of actual negative instances incorrectly identified
as positive by the model.

• Threshold: The cutoff value used to determine whether a prediction is positive or negative.
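
A minimal base R sketch of how the points of an ROC curve arise, assuming made-up labels and predicted scores. Each distinct score is tried as a threshold, and the area under the curve is approximated with the trapezoidal rule. Dedicated packages exist for this; the code below only illustrates the idea.

```r
# Hypothetical true labels (1 = positive) and predicted probabilities
actual <- c(1, 1, 0, 1, 0, 0, 1, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)

# Thresholds from strictest (nothing predicted positive) to loosest
thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))

# TPR and FPR when every score >= threshold is predicted positive
tpr <- sapply(thresholds, function(t) sum(scores >= t & actual == 1) / sum(actual == 1))
fpr <- sapply(thresholds, function(t) sum(scores >= t & actual == 0) / sum(actual == 0))

roc_points <- data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
print(roc_points)

# Area under the curve by the trapezoidal rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
print(auc)  # 0.75 for this data
```

Calling plot(fpr, tpr, type = "l") on these vectors draws the curve itself.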

ROC curves are useful for:

• Evaluating model performance

• Comparing different models

• Selecting optimal threshold values

• Identifying bias in models

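Metrics commonly reported alongside an ROC curve include:

• AUC (Area Under the Curve): A single number summarizing the whole curve; 1.0 indicates a
perfect classifier and 0.5 indicates performance no better than random guessing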


• Accuracy

• Precision

• Recall (Sensitivity)

• Specificity

• F1-score
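
The threshold-based metrics in this list can be sketched from the four confusion-matrix counts. The counts below are made up for illustration.

```r
# Hypothetical confusion-matrix counts
TP <- 40; FP <- 10; FN <- 20; TN <- 30

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)   # same as sensitivity
specificity <- TN / (TN + FP)
f1          <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, precision = precision, recall = recall,
              specificity = specificity, f1 = f1), 3))
```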

2.5. Recitation

• Iteration (reiteration): Repeating a process or operation, often using loops (e.g., for, while, repeat).

• Recursion: A function calling itself repeatedly until a base case is reached.
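
The difference can be sketched with a factorial computed both ways. The function names are assumptions made for this example.

```r
# Iteration: multiply in a loop
factorial_loop <- function(n) {
  result <- 1
  for (i in seq_len(n)) {
    result <- result * i
  }
  result
}

# Recursion: the function calls itself until the base case n <= 1
factorial_recursive <- function(n) {
  if (n <= 1) {
    return(1)
  }
  n * factorial_recursive(n - 1)
}

print(factorial_loop(5))       # 120
print(factorial_recursive(5))  # 120
```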

1. Data Types and Structures in R

• Basic Data Types:

- Numeric: Represents real numbers, e.g., 2.5, 10

- Integer: Whole numbers, e.g., 2L, 10L (where L indicates an integer)

- Character: Text or string data, e.g., "hello", "R programming"

- Logical: Boolean values, TRUE and FALSE

- Factor: Categorical variables, e.g., factor(c("yes", "no", "yes"))

• Data Structures:

- Vector: A sequence of elements of the same data type, created using c()

- Matrix: A two-dimensional array of elements of the same data type, created using matrix()

- Array: A multi-dimensional collection of elements

- Data Frame: A table where columns can have different data types, created using
data.frame()

- List: A collection that can contain elements of different types
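
A short sketch of the types and structures listed above (all values are made up):

```r
# Basic data types
print(class(2.5))                         # "numeric"
print(class(2L))                          # "integer"
print(class("hello"))                     # "character"
print(class(TRUE))                        # "logical"
print(class(factor(c("yes", "no"))))      # "factor"

# Data structures
v  <- c(1, 2, 3)                          # vector
m  <- matrix(1:6, nrow = 2)               # 2 x 3 matrix
a  <- array(1:24, dim = c(2, 3, 4))       # 3-dimensional array
df <- data.frame(id = 1:2, name = c("a", "b"))  # columns of different types
l  <- list(number = 1, text = "a", flag = TRUE) # elements of different types

print(dim(m))
print(dim(a))
str(df)
str(l)
```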

2. Basic Operators

• Arithmetic Operators: +, -, *, /, %% (modulus), %/% (integer division)

• Relational Operators: == (equals), != (not equal), <, >, <=, >=

• Logical Operators: & (and), | (or), ! (not)

• Assignment Operators: <-, = for assigning values to variables; <<- for assigning to a variable in
an enclosing (typically the global) environment from within a function
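
A few of these operators in action, as a minimal sketch with made-up values:

```r
# Arithmetic
remainder <- 7 %% 3    # modulus: 1
quotient  <- 7 %/% 3   # integer division: 2

# Relational
is_equal  <- (2 + 2 == 4)   # TRUE
not_equal <- (3 != 3)       # FALSE

# Logical (element-wise on vectors)
both   <- c(TRUE, FALSE) & c(TRUE, TRUE)    # TRUE FALSE
either <- c(TRUE, FALSE) | c(FALSE, FALSE)  # TRUE FALSE
negate <- !TRUE                             # FALSE

print(c(remainder = remainder, quotient = quotient))
print(both)
print(either)
```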


3. Control Structures

• Conditional Statements:

- if (condition) { ... }: Executes code if the condition is TRUE

- ifelse(condition, true_value, false_value): Vectorized if-else statement

• Loops:

- for (variable in sequence) { ... }: Repeats code for each element in the sequence

- while (condition) { ... }: Repeats code while the condition is TRUE

- repeat { ...; break }: Repeats code until a break condition is met
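
Each construct can be sketched in a few lines (the values and variable names are made up):

```r
# if / else
x <- 7
if (x > 5) {
  size <- "large"
} else {
  size <- "small"
}

# ifelse() works on a whole vector at once
labels <- ifelse(c(2, 9, 4) > 5, "large", "small")

# for loop: sum the numbers 1 to 5
total <- 0
for (i in 1:5) {
  total <- total + i
}

# while loop: double n until it reaches at least 100
n <- 1
while (n < 100) {
  n <- n * 2
}

# repeat loop: runs until break is called
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) break
}

print(size); print(labels); print(total); print(n); print(count)
```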

4. Functions

• Defining Functions: Functions are defined using function(), allowing code to be reused:

my_function <- function(arg1, arg2) {
  # Code block (here, an illustrative sum of the two arguments)
  result <- arg1 + arg2
  return(result)
}

• Scope of Variables: Variables defined within functions are local by default, unless assigned
globally using <<-
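
A minimal sketch of local versus global assignment. The function and variable names are assumptions made for this example.

```r
counter <- 0

# Plain <- inside a function creates a local copy; the outer counter is unchanged
increment_local <- function() {
  counter <- counter + 1
  counter
}

# <<- assigns to the counter that already exists outside the function
increment_global <- function() {
  counter <<- counter + 1
  invisible(counter)
}

local_value <- increment_local()
after_local <- counter     # still 0

increment_global()
after_global <- counter    # now 1

print(c(local_value = local_value, after_local = after_local, after_global = after_global))
```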

5. Data Manipulation

• Data Frames:

- Access columns with $ (e.g., df$column_name)

- Subset data frames with df[row, column] indexing

• dplyr Package: Provides easy-to-use functions for data manipulation:

- filter(): Select rows based on conditions

- select(): Choose specific columns

- mutate(): Add new columns or modify existing ones

- summarize(): Calculate summary statistics

- group_by(): Group data for grouped calculations
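
A minimal sketch of these operations on a made-up data frame. The base R indexing always runs. The dplyr pipeline is skipped if the package is not installed.

```r
# Hypothetical data frame
df <- data.frame(
  team   = c("A", "A", "B", "B"),
  points = c(10, 20, 30, 40)
)

# Base R access
pts       <- df$points             # one column as a vector
high_rows <- df[df$points > 15, ]  # rows matching a condition
first_two <- df[1:2, "team"]       # rows 1-2 of the team column

# Equivalent style with dplyr, if available
if (requireNamespace("dplyr", quietly = TRUE)) {
  suppressPackageStartupMessages(library(dplyr))
  by_team <- df %>%
    filter(points > 10) %>%                   # keep rows with points > 10
    mutate(double_points = points * 2) %>%    # add a new column
    group_by(team) %>%                        # group by team
    summarize(mean_points = mean(points))     # one row per team
  print(by_team)
}
```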
