
Department of Information Technology

LAB MANUAL
R20

DATA MINING LAB


[III B.TECH, I-SEM]
1. Implement all basic R commands.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
BASIC MATH:
> 1+2+3
[1] 6
> 3*7*2
[1] 42
> 4/3
[1] 1.333333
> (4*6)+5
[1] 29
> 4*(6+5)
[1] 44

> XYZ
Error: object 'XYZ' not found
DATA TYPES:
> x <- 5
> class(x)
[1] "numeric"
> is.numeric(x)
[1] TRUE
> i <- 4L
> i
[1] 4
> is.integer(i)
[1] TRUE
> class(4L)
[1] "integer"
> class(2.8)
[1] "numeric"
> 4L*2.8
[1] 11.2
> 5L/2L
[1] 2.5
> class(5L/2L)
[1] "numeric"
> TRUE*5
[1] 5
> FALSE*5
[1] 0

Data Frames:
A data frame is just like an Excel spreadsheet in that it has columns and rows. In statistical terms, each
column is a variable and each row is an observation.
> x <- 10:1
> y <- -4:3
> x
[1] 10 9 8 7 6 5 4 3 2 1
> y
[1] -4 -3 -2 -1 0 1 2 3
> y <- -4:5
> y
[1] -4 -3 -2 -1 0 1 2 3 4 5
> q <- c("hockey","football","baseball","curling","rugby","lacrosse","basketball","tennis","cricket","soccer")
> q
[1] "hockey" "football" "baseball" "curling" "rugby" "lacrosse"
[7] "basketball" "tennis" "cricket" "soccer"
> theDF<-data.frame(x,y,q)
> theDF
x y q
1 10 -4 hockey
2 9 -3 football
3 8 -2 baseball
4 7 -1 curling
5 6 0 rugby
6 5 1 lacrosse
7 4 2 basketball
8 3 3 tennis
9 2 4 cricket
10 1 5 soccer
> theDF<-data.frame(First=x, Second=y,Third=q)

> theDF
First Second Third
1 10 -4 hockey
2 9 -3 football
3 8 -2 baseball
4 7 -1 curling
5 6 0 rugby
6 5 1 lacrosse
7 4 2 basketball
8 3 3 tennis
9 2 4 cricket
10 1 5 soccer

2) Interact data through .csv files (Import from and export to .csv files).
In R, we can read data from files stored outside the R environment, and we can write data into files that
are stored and accessed by the operating system. R can read and write various file formats such as CSV,
Excel and XML.
Getting and Setting the Working Directory
You can check which directory the R workspace is pointing to using the getwd() function. You can also
set a new working directory using the setwd() function.
> print(getwd())
[1] "C:/Users/Prasanna Kumar/Documents"
> data <- read.csv("2.csv") # Reading the Data from the .csv file
> print(data) # data
id name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
6 6 Nina 578.00 2013-05-21 IT
7 7 Simon 632.80 2013-07-30 Operations
8 8 Guru 722.50 2014-06-17 Finance
Analyzing the CSV File
> print(ncol(data)) # number of columns
[1] 5
> print(nrow(data)) # number of rows
[1] 8
Get the maximum salary
>sal <- max(data$salary)
>print(sal)
[1] 843.25
Get all the people working in IT department
> retval <- subset( data, dept == "IT")
> retval
id name salary start_date dept
1 1 Rick 623.3 2012-01-01 IT
3 3 Michelle 611.0 2014-11-15 IT
6 6 Nina 578.0 2013-05-21 IT

> info <- subset(data, salary > 600 & dept == "IT") # IT dept with salary>600
> print(info)
id name salary start_date dept
1 1 Rick 623.3 2012-01-01 IT
3 3 Michelle 611.0 2014-11-15 IT
> retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))
> retval
id name salary start_date dept
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
8 8 Guru 722.50 2014-06-17 Finance
Writing into a CSV File
R can create a CSV file from an existing data frame. The write.csv() function is used to create the CSV
file, which is created in the working directory.
> write.csv(retval,"output.csv")
> newdata <- read.csv("output.csv")
> print(newdata)
X id name salary start_date dept
1 3 3 Michelle 611.00 2014-11-15 IT
2 4 4 Ryan 729.00 2014-05-11 HR
3 5 5 Gary 843.25 2015-03-27 Finance
4 8 8 Guru 722.50 2014-06-17 Finance
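Notice the extra X column above: write.csv() writes row names by default, and they come back as a column
on re-import. Passing the standard row.names = FALSE argument avoids this:
> write.csv(retval,"output.csv", row.names = FALSE) # omit the row-name column
> print(read.csv("output.csv")) # no leading X column this time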

3. Get and Clean data using swirl exercises. (Use ‘swirl’ package, library and install that
topic from swirl).

The swirl R package makes it fun and easy to learn R programming and data science.
Step 1: Get R

In order to run swirl, you must have R 3.1.0 or later installed on your computer.
Step 2 (recommended): Get RStudio

In addition to R, it’s highly recommended that you install RStudio, which will make your experience with
R much more enjoyable.
Step 3: Install swirl

Open RStudio (or just plain R if you don't have RStudio) and type the following into the console:

> install.packages("swirl")

Note that the > symbol at the beginning of the line is R's prompt for you to type something into the console.
Step 4: Start swirl

This is the only step that you will repeat every time you want to run swirl. First, you will load the package
using the library() function. Then you will call the function that starts the magic! Type the following,
pressing Enter after each line:
> library("swirl")
> swirl()
Step 5: Install an interactive course

Getting and Cleaning Data


Installation
swirl::install_course("Getting and Cleaning Data")
Manual Installation

1. Download getting_and_cleaning_data.swc file.


2. Run swirl::install_course() in the R console.

3. Select the file you just downloaded

> swirl::install_course("Getting and Cleaning Data")


|============================================================| 100%

| Course installed successfully!

> library(swirl)
> swirl()
What shall I call you? Prasanna Kumar
... <-- That's your cue to press Enter to continue
Select 1, 2, or 3 and press Enter

1: Continue.
2: Proceed.
3: Let's get going!

Selection: 1
...

| Please choose a course, or type 0 to exit swirl.

1: Getting and Cleaning Data


2: R Programming
3: Take me to the swirl course repository!

Selection: 1

| Please choose a lesson, or type 0 to return to course menu.

1: Manipulating Data with dplyr


2: Grouping and Chaining with dplyr
3: Tidying Data with tidyr
4: Dates and Times with lubridate

Selection: 1
| Attempting to load lesson dependencies...

| This lesson requires the ‘dplyr’ package. Would you like me to install it for you
| now?

1: Yes
2: No

> read.csv(path2csv,stringsAsFactors =FALSE)


| Nice try, but that's not exactly what I was hoping for. Try again. Or, type
| info() for more options.
> mydf<-read.csv(path2csv,stringsAsFactors =FALSE)
| dim(mydf) will give you the dimensions of the dataset.

> dim(mydf)
[1] 225468 11

| Great job!

|====== | 8%
| Now use head() to preview the data.

> head(mydf)
X date time size r_version r_arch r_os package version
1 1 2014-07-08 00:54:41 80589 3.1.0 x86_64 mingw32 htmltools 0.2.4
2 2 2014-07-08 00:59:53 321767 3.1.0 x86_64 mingw32 tseries 0.10-32
3 3 2014-07-08 00:47:13 748063 3.1.0 x86_64 linux-gnu party 1.0-15
4 4 2014-07-08 00:48:05 606104 3.1.0 x86_64 linux-gnu Hmisc 3.14-4
5 5 2014-07-08 00:46:50 79825 3.0.2 x86_64 linux-gnu digest 0.6.4
6 6 2014-07-08 00:48:04 77681 3.1.0 x86_64 linux-gnu randomForest 4.6-7
country ip_id
1 US 1
2 US 2
3 US 3
4 US 3
5 CA 4
6 US 3

| You are quite good my friend!

> library(dplyr)

| Your dedication is inspiring!

> packageVersion("dplyr")
[1] ‘1.0.7’

...

|=========== | 15%
| The first step of working with data in dplyr is to load the data into what the
| package authors call a 'data frame tbl' or 'tbl_df'. Use the following code to
| create a new tbl_df called cran:
|
| cran <- tbl_df(mydf).

> cran <- tbl_df(mydf)

| Nice work!

|============ | 17%
| To avoid confusion and keep things running smoothly, let's remove the original
| data frame from your workspace with rm("mydf").

> rm("mydf")

| You got it!


|============== | 18%
| From ?tbl_df, "The main advantage to using a tbl_df over a regular data frame is
| the printing." Let's see what is meant by this. Type cran to print our tbl_df to
| the console.

> cran
# A tibble: 225,468 x 11
X date time size r_version r_arch r_os package version country ip_id
<int> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <int>
1 1 2014-0~ 00:54~ 8.06e4 3.1.0 x86_64 ming~ htmltoo~ 0.2.4 US 1
2 2 2014-0~ 00:59~ 3.22e5 3.1.0 x86_64 ming~ tseries 0.10-32 US 2
3 3 2014-0~ 00:47~ 7.48e5 3.1.0 x86_64 linu~ party 1.0-15 US 3
4 4 2014-0~ 00:48~ 6.06e5 3.1.0 x86_64 linu~ Hmisc 3.14-4 US 3
5 5 2014-0~ 00:46~ 7.98e4 3.0.2 x86_64 linu~ digest 0.6.4 CA 4
6 6 2014-0~ 00:48~ 7.77e4 3.1.0 x86_64 linu~ randomF~ 4.6-7 US 3
7 7 2014-0~ 00:48~ 3.94e5 3.1.0 x86_64 linu~ plyr 1.8.1 US 3
8 8 2014-0~ 00:47~ 2.82e4 3.0.2 x86_64 linu~ whisker 0.3-2 US 5
9 9 2014-0~ 00:54~ 5.93e3 NA NA NA Rcpp 0.10.4 CN 6
10 10 2014-0~ 00:15~ 2.21e6 3.0.2 x86_64 linu~ hflights 0.1 US 7
# ... with 225,458 more rows

| Your dedication is inspiring!


|=============== | 20%
| This output is much more informative and compact than what we would get if we
| printed the original data frame (mydf) to the console.

...

|================ | 22%
| First, we are shown the class and dimensions of the dataset. Just below that, we
| get a preview of the data. Instead of attempting to print the entire dataset,
| dplyr just shows us the first 10 rows of data and only as many columns as fit
| neatly in our console. At the bottom, we see the names and classes for any
| variables that didn't fit on our screen.

...

|================= | 23%
| According to the "Introduction to dplyr" vignette written by the package authors,
| "The dplyr philosophy is to have small functions that each do one thing well."
| Specifically, dplyr supplies five 'verbs' that cover most fundamental data
| manipulation tasks: select(), filter(), arrange(), mutate(), and summarize().

...
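As a quick aside from the transcript, the other four verbs can be sketched on the cran tbl created
above (assuming dplyr is loaded as in this lesson):
> filter(cran, package == "dplyr") # keep rows where package is "dplyr"
> arrange(cran, desc(size)) # order rows by download size, largest first
> mutate(cran, size_mb = size / 2^20) # add a size-in-megabytes column
> summarize(cran, avg_bytes = mean(size)) # collapse to a single summary row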

|================== | 25%
| Use ?select to pull up the documentation for the first of these core functions.

> ?select

| Keep working like that and you'll get there!

|==================== | 27%
| Help files for the other functions are accessible in the same way.

...

|===================== | 28%
| As may often be the case, particularly with larger datasets, we are only
| interested in some of the variables. Use select(cran, ip_id, package, country) to
| select only the ip_id, package, and country variables from the cran dataset.

> select(cran, ip_id, package, country)
# A tibble: 225,468 x 3
ip_id package country
<int> <chr> <chr>
1 1 htmltools US
2 2 tseries US
3 3 party US
4 3 Hmisc US
5 4 digest CA
6 3 randomForest US
7 3 plyr US
8 5 whisker US
9 6 Rcpp CN
10 7 hflights US
# ... with 225,458 more rows

| That's correct!

|====================== | 30%
| The first thing to notice is that we don't have to type cran$ip_id, cran$package,
| and cran$country, as we normally would when referring to columns of a data frame.
| The select() function knows we are referring to columns of the cran dataset.

...

|======================= | 32%
| Also, note that the columns are returned to us in the order we specified, even
| though ip_id is the rightmost column in the original dataset.

...

|========================= | 33%
| Recall that in R, the `:` operator provides a compact notation for creating a
| sequence of numbers. For example, try 5:20.

4) Visualize all statistical measures (Mean, Mode, Median, Range, Inter Quartile Range etc.) using
Histograms, Boxplots and Scatter Plots.

MEAN
The mean of an observation variable is a numerical measure of the central location of the data values. It is
the sum of the data values divided by their count.
Hence, for a data sample of size n, the sample mean is defined as:
mean = (x1 + x2 + … + xn) / n

Problem
Find the mean eruption duration in the data set faithful.
>head(faithful)
eruptions waiting
1 3.600 79
2 1.800 54
3 3.333 74
4 2.283 62
5 4.533 85
6 2.883 55
Solution
We apply the mean function to compute the mean value of eruptions.
> duration = faithful$eruptions # the eruption durations
> mean(duration) # apply the mean function
[1] 3.4878
Answer
The mean eruption duration is 3.4878 minutes.

MEDIAN
The median of an observation variable is the value at the middle when the data is sorted in ascending
order. It is an ordinal measure of the central location of the data values.
Problem
Find the median of the eruption duration in the data set faithful.
Solution
We apply the median function to compute the median value of eruptions.
> duration = faithful$eruptions # the eruption durations
> median(duration) # apply the median function
[1] 4
Answer
The median of the eruption duration is 4 minutes.

MODE

It is the value that has the highest frequency in the given data set. The data set may have no mode if the
frequency of all data points is the same. Also, we can have more than one mode if we encounter two or
more data points having the same frequency. There is no built-in function for finding the mode in R, so we
can create our own function, or we can use the package called modeest.
>mode = function(){
  tab <- table(faithful$eruptions)
  tab[which.max(tab)]  # the most frequent value, together with its count
}
>mode()
1.867
    8

QUARTILE
There are several quartiles of an observation variable. The first quartile, or lower quartile, is the value that
cuts off the first 25% of the data when it is sorted in ascending order. The second quartile, or median, is
the value that cuts off the first 50%. The third quartile, or upper quartile, is the value that cuts off the first
75%.
Problem
Find the quartiles of the eruption durations in the data set faithful.
Solution
We apply the quantile function to compute the quartiles of eruptions.
> duration = faithful$eruptions # the eruption durations
> quantile(duration) # apply the quantile function
    0%    25%    50%    75%   100%
1.6000 2.1627 4.0000 4.4543 5.1000
Answer
The first, second and third quartiles of the eruption duration are 2.1627, 4.0000 and 4.4543 minutes
respectively.
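As an aside, the summary() function reports the minimum, the quartiles, the mean and the maximum in a
single call:
> summary(duration) # Min, 1st Qu., Median, Mean, 3rd Qu., Max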

PERCENTILE
The nth percentile of an observation variable is the value that cuts off the first n percent of the data values
when it is sorted in ascending order.
Problem
Find the 32nd, 57th and 98th percentiles of the eruption durations in the data set faithful.
Solution
We apply the quantile function to compute the percentiles of eruptions with the desired percentage ratios.
> duration = faithful$eruptions # the eruption durations
> quantile(duration, c(.32, .57, .98))
32% 57% 98%
2.3952 4.1330 4.9330
Answer
The 32nd, 57th and 98th percentiles of the eruption duration are 2.3952, 4.1330 and 4.9330 minutes
respectively.

RANGE
The range of an observation variable is the difference of its largest and smallest data values. It is a
measure of how far apart the entire data spreads in value.

Problem
Find the range of the eruption duration in the data set faithful.
Solution

We apply the max and min function to compute the largest and smallest values of eruptions, then take the
difference.
> duration = faithful$eruptions # the eruption durations
> max(duration) - min(duration) # apply the max and min functions
[1] 3.5
Answer
The range of the eruption duration is 3.5 minutes.

INTERQUARTILE RANGE
The interquartile range of an observation variable is the difference of its upper and lower quartiles. It is a
measure of how far apart the middle portion of data spreads in value.

Problem
Find the interquartile range of eruption duration in the data set faithful.
Solution
We apply the IQR function to compute the interquartile range of eruptions.
> duration = faithful$eruptions # the eruption durations
> IQR(duration) # apply the IQR function
[1] 2.2915
Answer
The interquartile range of eruption duration is 2.2915 minutes.

BOX PLOT
The box plot of an observation variable is a graphical representation based on its quartiles, as well as its
smallest and largest values. It attempts to provide a visual shape of the data distribution.
Problem
Find the box plot of the eruption duration in the data set faithful.
Solution
We apply the boxplot function to produce the box plot of eruptions.
> duration = faithful$eruptions # the eruption durations
> boxplot(duration, horizontal=TRUE) # horizontal box plot
Answer
The box plot of the eruption duration is:

HISTOGRAM
A histogram consists of parallel vertical bars that graphically show the frequency distribution of a
quantitative variable. The area of each bar is equal to the frequency of items found in each class.
Example
In the data set faithful, the histogram of the eruptions variable is a collection of parallel vertical bars
showing the number of eruptions classified according to their durations.
Problem
Find the histogram of the eruption durations in faithful.
Solution
We apply the hist function to produce the histogram of the eruptions variable.
> duration = faithful$eruptions
> hist(duration, right=FALSE) # apply the hist function; intervals closed on the left

Answer
The histogram of the eruption durations is:

SCATTER PLOT
A scatter plot pairs up values of two quantitative variables in a data set and displays them as geometric
points inside a Cartesian diagram.
Example
In the data set faithful, we pair up the eruptions and waiting values in the same observation as (x,y)
coordinates. Then we plot the points in the Cartesian plane. Here is a preview of the eruption data value
pairs with the help of the cbind function.

> duration = faithful$eruptions # the eruption durations


> waiting = faithful$waiting # the waiting interval
> head(cbind(duration, waiting))
     duration waiting
[1,] 3.600 79
[2,] 1.800 54
[3,] 3.333 74
[4,] 2.283 62
[5,] 4.533 85
[6,] 2.883 55
Problem
Find the scatter plot of the eruption durations and waiting intervals in faithful. Does it reveal any
relationship between the variables?
Solution
We apply the plot function to compute the scatter plot of eruptions and waiting.
> duration = faithful$eruptions # the eruption durations
> waiting = faithful$waiting # the waiting interval
> plot(duration, waiting, # plot the variables
+ xlab="Eruption duration", # x-axis label
+ ylab="Time waited") # y-axis label
Answer
The scatter plot of the eruption durations and waiting intervals is as follows. It reveals a positive linear
relationship between them.

Enhanced Solution

We can generate a linear regression model of the two variables with the lm function, and then draw a trend
line with abline.

> abline(lm(waiting ~ duration))

5) Create a data frame with the following structure.
EMP ID EMP NAME SALARY START DATE
1 Satish 5000 01-11-2013
2 Vani 7500 05-06-2011
3 Ramesh 10000 21-09-1999
4 Praveen 9500 13-09-2005
5 Pallavi 4500 23-10-2000

a. Extract two column names using column name.


b. Extract the first two rows and then all columns.
c. Extract 3rd and 5th row with 2nd and 4th column.

> emp_id<-1:5
> emp_name<-c("Satish","Vani","Ramesh","Praveen","Pallavi")
> Salary<-c(5000,7500,10000,9500,4500)
> d1<-as.Date("01-11-2013", format="%d-%m-%Y")
> d2<-as.Date("05-06-2011", format="%d-%m-%Y")
> d3<-as.Date("21-09-1999", format="%d-%m-%Y")
> d4<-as.Date("13-09-2005", format="%d-%m-%Y")
> d5<-as.Date("23-10-2000", format="%d-%m-%Y")
> Start_Date<-c(d1,d2,d3,d4,d5)
> theDF<-data.frame(emp_id,emp_name,Salary,Start_Date)
> theDF
emp_id emp_name Salary Start_Date
1 1 Satish 5000 2013-11-01
2 2 Vani 7500 2011-06-05
3 3 Ramesh 10000 1999-09-21
4 4 Praveen 9500 2005-09-13
5 5 Pallavi 4500 2000-10-23

a. Extract two column names using column name.


> names(theDF)
[1] "emp_id" "emp_name" "Salary" "Start_Date"
> names(theDF)[1:2]
[1] "emp_id" "emp_name"
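The columns themselves (not just their names) can also be extracted by name, for example:
> theDF[, c("emp_name","Salary")] # two columns selected by name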
b. Extract the first two rows and then all columns.
> theDF[1:2,]
emp_id emp_name Salary Start_Date
1 1 Satish 5000 2013-11-01
2 2 Vani 7500 2011-06-05

c. Extract 3rd and 5th row with 2nd and 4th column.
> theDF[c(3,5),c(2,4)]
emp_name Start_Date
3 Ramesh 1999-09-21
5 Pallavi 2000-10-23

6) Write R Program using ‘apply’ group of functions to create and apply normalization function on
each of the numeric variables/columns of iris dataset to transform them into
i. 0 to 1 range with min-max normalization.
ii. a value around 0 with z-score normalization.

#view first six rows of iris dataset
head(iris)

# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosa
#2 4.9 3.0 1.4 0.2 setosa
#3 4.7 3.2 1.3 0.2 setosa
#4 4.6 3.1 1.5 0.2 setosa
#5 5.0 3.6 1.4 0.2 setosa
#6 5.4 3.9 1.7 0.4 setosa

i. 0 to 1 range with min-max normalization.

Min-Max Normalization

The formula for a min-max normalization is:

(X – min(X))/(max(X) – min(X))

For each value of a variable, we simply find how far that value is from the minimum value, then divide
by the range.

To implement this in R, we can define a simple function and then use lapply to apply that function to
whichever columns in the iris dataset we would like:

#define Min-Max normalization function
min_max_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

#apply Min-Max normalization to first four columns in iris dataset
iris_norm <- as.data.frame(lapply(iris[1:4], min_max_norm))

#view first six rows of normalized iris dataset
head(iris_norm)

# Sepal.Length Sepal.Width Petal.Length Petal.Width
#1 0.22222222 0.6250000 0.06779661 0.04166667
#2 0.16666667 0.4166667 0.06779661 0.04166667
#3 0.11111111 0.5000000 0.05084746 0.04166667
#4 0.08333333 0.4583333 0.08474576 0.04166667
#5 0.19444444 0.6666667 0.06779661 0.04166667
#6 0.30555556 0.7916667 0.11864407 0.12500000

Notice that each of the columns now has values that range from 0 to 1. Also notice that the fifth column
"Species" was dropped from this data frame. We can easily add it back by using the following code:

#add back Species column
iris_norm$Species <- iris$Species

#view first six rows of iris_norm
head(iris_norm)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 0.22222222 0.6250000 0.06779661 0.04166667 setosa
#2 0.16666667 0.4166667 0.06779661 0.04166667 setosa
#3 0.11111111 0.5000000 0.05084746 0.04166667 setosa
#4 0.08333333 0.4583333 0.08474576 0.04166667 setosa
#5 0.19444444 0.6666667 0.06779661 0.04166667 setosa
#6 0.30555556 0.7916667 0.11864407 0.12500000 setosa

ii. a value around 0 with z-score normalization

The formula for a z-score standardization is:

(X – μ) / σ

For each value of a variable, we simply subtract the mean value of the variable, then divide by the
standard deviation of the variable.

To implement this in R, we have a few different options:

1. Standardize one variable

If we simply want to standardize one variable in a dataset, such as Sepal.Width in the iris dataset, we can
use the following code:

#standardize Sepal.Width
iris$Sepal.Width <- (iris$Sepal.Width - mean(iris$Sepal.Width)) / sd(iris$Sepal.Width)

head(iris)

# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 1.01560199 1.4 0.2 setosa
#2 4.9 -0.13153881 1.4 0.2 setosa
#3 4.7 0.32731751 1.3 0.2 setosa
#4 4.6 0.09788935 1.5 0.2 setosa
#5 5.0 1.24503015 1.4 0.2 setosa
#6 5.4 1.93331463 1.7 0.4 setosa

The values of Sepal.Width are now scaled such that the mean is 0 and the standard deviation is 1. We can
even verify this if we’d like:

#find mean of Sepal.Width
mean(iris$Sepal.Width)

#[1] 2.034094e-16 #basically zero

#find standard deviation of Sepal.Width
sd(iris$Sepal.Width)

#[1] 1

2. Standardize several variables using the scale function

To standardize several variables, we can simply use the scale function. For example, the following code
shows how to scale the first four columns of the iris dataset:

#standardize first four columns of iris dataset
iris_standardize <- as.data.frame(scale(iris[1:4]))

#view first six rows of standardized dataset
head(iris_standardize)

# Sepal.Length Sepal.Width Petal.Length Petal.Width
#1 -0.8976739 1.01560199 -1.335752 -1.311052
#2 -1.1392005 -0.13153881 -1.335752 -1.311052
#3 -1.3807271 0.32731751 -1.392399 -1.311052
#4 -1.5014904 0.09788935 -1.279104 -1.311052
#5 -1.0184372 1.24503015 -1.335752 -1.311052
#6 -0.5353840 1.93331463 -1.165809 -1.048667
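The same standardization can also be written with the apply family directly, in the spirit of this exercise;
a sketch, reloading iris first because Sepal.Width was overwritten in place above:
data(iris) # restore the original dataset
z_norm <- function(x) (x - mean(x)) / sd(x)
iris_z <- as.data.frame(sapply(iris[1:4], z_norm))
head(iris_z) # matches the scale() output above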

7) Create a data frame with 10 observations and 3 variables and add new rows and columns to it
using ‘rbind’ and ‘cbind’ function.

>df1 = data.frame(name = c("Rahul","joe","Adam","Brendon","Srilakshmi","Prasanna Kumar","Anitha","Bhanu","Rajesh","Priya"),
                  married_year = c(2016,2015,2016,2008,2007,2009,2011,2013,2014,2008),
                  Salary = c(10000,15000,12000,13000,14000,15000,12000,10000,11000,14000))
> df1
name married_year Salary
1 Rahul 2016 10000
2 joe 2015 15000
3 Adam 2016 12000
4 Brendon 2008 13000
5 Srilakshmi 2007 14000
6 Prasanna Kumar 2009 15000
7 Anitha 2011 12000
8 Bhanu 2013 10000
9 Rajesh 2014 11000
10 Priya 2008 14000
>father_name<-c("Gandhi","Jashua","God","Bush","Venkateswarlu","David",
"Anand","Bharath","Rupesh","Prem Sagar")
> father_name
[1] "Gandhi" "Jashua" "God" "Bush" "Venkateswarlu" "David"
[7] "Anand" "Bharath" "Rupesh" "Prem Sagar"
> cb_df1= cbind(df1,father_name) # cbind to add new column
> cb_df1
name married_year Salary father_name
1 Rahul 2016 10000 Gandhi
2 joe 2015 15000 Jashua
3 Adam 2016 12000 God
4 Brendon 2008 13000 Bush
5 Srilakshmi 2007 14000 Venkateswarlu
6 Prasanna Kumar 2009 15000 David
7 Anitha 2011 12000 Anand
8 Bhanu 2013 10000 Bharath
9 Rajesh 2014 11000 Rupesh
10 Priya 2008 14000 Prem Sagar

> rb<-c("Prakash",2011,15000,"Jeevan")
> rb
[1] "Prakash" "2011" "15000" "Jeevan"
> rb_df1<-rbind(cb_df1,rb) # rbind to add new row
> rb_df1
name married_year Salary father_name
1 Rahul 2016 10000 Gandhi
2 joe 2015 15000 Jashua
3 Adam 2016 12000 God
4 Brendon 2008 13000 Bush
5 Srilakshmi 2007 14000 Venkateswarlu
6 Prasanna Kumar 2009 15000 David
7 Anitha 2011 12000 Anand
8 Bhanu 2013 10000 Bharath
9 Rajesh 2014 11000 Rupesh
10 Priya 2008 14000 Prem Sagar
11 Prakash 2011 15000 Jeevan
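One caveat: rb is a character vector, so rbind() coerces the married_year and Salary columns of rb_df1 to
character. Binding a one-row data frame instead preserves the column types; a sketch:
> rb_df2 <- rbind(cb_df1, data.frame(name="Prakash", married_year=2011, Salary=15000, father_name="Jeevan"))
> str(rb_df2) # married_year and Salary stay numeric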

8) Write R program to implement linear and multiple regression on ‘mtcars’ dataset to estimate
the value of ‘mpg’ variable, with best R2 and plot the original values in ‘green’ and predicted
values in ‘red’.

Name Description
mpg Miles/(US) gallon
cyl Number of cylinders
disp Displacement (cu.in.)
hp Gross horsepower
drat Rear axle ratio
wt Weight (lb/1000)
qsec 1/4 mile time
vs V/S
am Transmission (0 = automatic, 1 = manual)
gear Number of forward gears
carb Number of carburettors

If we are interested in the relationship between fuel efficiency (mpg) and weight (wt), we may start by
plotting those variables with:
>plot(mpg ~ wt, data = mtcars, col=2)

The plot shows a (linear) relationship. If we want to perform linear regression to determine the
coefficients of a linear model, we use the lm function:

fit <- lm(mpg ~ wt, data = mtcars)

The ~ here means "explained by", so the formula mpg ~ wt means we are predicting mpg as explained by
wt. The most helpful way to view the output is with:

>summary(fit)

Which gives the output:

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10

bs <- round(coef(fit), 3)                       # rounded model coefficients
lmlab <- paste0("mpg = ", bs[1],
                ifelse(sign(bs[2])==1, " + ", " - "), abs(bs[2]), " wt ")
mtext(lmlab, 3, line=-2)                        # write the fitted equation on the plot

The result is:

Multiple Regression – Mathematical Formula

In multiple regression there are more than one predictor variable and one response variable, relation of the
variables is shown below:

Y = a + b1x1 + b2x2 + … + bnxn, where Y is the response, a, b1, …, bn are the coefficients, and
x1, …, xn are the predictor variables.

Multiple Regression using R Programming

For this tutorial on multiple regression analysis using R, I am going to use the mtcars dataset, and we will
see how the model is built for two and three predictor variables.

Case Study 1: Establishing Relationship between “mpg” as response variable and “disp”, “hp” as predictor
variables.

Step1: Load the required data

>data <- mtcars[,c("mpg","disp","hp")]
## create a new data frame with all rows and only the required columns
>head(data)

Step2: Build Model using lm() function

>model <- lm(mpg~disp+hp, data=data)
>summary(model)

Step 3: Predicting the output.
>predict(model, newdata = data.frame(disp=140, hp=80))
The predicted output mileage is 24.50022.
If you plug these values of disp and hp into the fitted equation from summary(model), you will get the
same output.
Plotting the Regression:

>plot(model)

Output:

Case Study 2: Establishing Relationship between “mpg” as response variable and “disp”, “hp” and “wt”
as predictor variables.

>model1 <- lm(mpg~disp+hp+wt, data=mtcars)
>summary(model1)

The fitted equation is:
mpg = 37.105505 + (-0.000937)disp + (-0.031157)hp + (-3.800891)wt

>predict(model1, newdata = data.frame(disp=160, hp=100, wt=2.5))

The predicted output mileage is 24.3377.
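The exercise also asks to plot the original values in green and the predicted values in red; that step is not
shown above, so here is a minimal sketch using model1:
> pred <- predict(model1) # fitted mpg for every car in mtcars
> plot(mtcars$mpg, col="green", pch=16, xlab="Car index", ylab="mpg") # original values in green
> points(pred, col="red", pch=16) # predicted values in red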

9) Implement k-means clustering using R.

K Means is a clustering algorithm that repeatedly assigns a group amongst k groups present to a data point
according to the features of the point. It is a centroid-based clustering method.

Step 1

>data("iris")
>head(iris) #will show top 6 rows only

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Step 2

>x=iris[,3:4] #using only petal length and width columns


>head(x)
Petal.Length Petal.Width
1 1.4 0.2
2 1.4 0.2
3 1.3 0.2
4 1.5 0.2
5 1.4 0.2
6 1.7 0.4

Step 3

Then the 'cluster' package is loaded. Clustering in R is done using this package, which performs all the
mathematics. The clusplot() function creates a 2D graph of the clusters.

>model=kmeans(x,3)

>library(cluster)
>clusplot(x,model$cluster)

Step 4

The next step is to assign different colors to the clusters and shading them hence we use the color and
shade parameters setting them to T which means true.

>clusplot(x,model$cluster,color=T,shade=T)
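To judge how well the three clusters recover the actual species, the cluster assignments can be
cross-tabulated against the labels:
>table(model$cluster, iris$Species) # rows are clusters, columns are species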

10) Implement k-medoids clustering using R.
K-medoids clustering is a technique in which we place each observation in a dataset into one of K
clusters. The end goal is to have K clusters in which the observations within each cluster are quite similar
to each other while the observations in different clusters are quite different from each other.
In practice, we use the following steps to perform K-medoids clustering:
1. Choose a value for K.
 First, we must decide how many clusters we’d like to identify in the data. Often we have to simply
test several different values for K and analyze the results to see which number of clusters seems to
make the most sense for a given problem.
2. Randomly assign each observation to an initial cluster, from 1 to K.
3. Perform the following procedure until the cluster assignments stop changing.
 For each of the K clusters, compute the cluster centroid. This is the vector of the p feature medians
for the observations in the kth cluster.
 Assign each observation to the cluster whose centroid is closest. Here, closest is defined using
Euclidean distance.
K-Medoids Clustering in R
Step 1: Load the Necessary Packages
First, we’ll load two packages that contain several useful functions for k-medoids clustering in R.
>library(factoextra)
>library(cluster)

Step 2: Load and Prep the Data


For this example we’ll use the USArrests dataset built into R, which contains the number of arrests per
100,000 residents in each U.S. state in 1973 for Murder, Assault, and Rape along with the percentage of
the population in each state living in urban areas, UrbanPop.
The following code shows how to do the following:
 Load the USArrests dataset
 Remove any rows with missing values
 Scale each variable in the dataset to have a mean of 0 and a standard deviation of 1
#load data
df <- USArrests

#remove rows with missing values
df <- na.omit(df)

#scale each variable to have a mean of 0 and sd of 1
df <- scale(df)

#view first six rows of dataset
head(df)

               Murder   Assault   UrbanPop         Rape
Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
Arizona    0.07163341 1.4788032  0.9989801  1.042878388
Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
California 0.27826823 1.2628144  1.7589234  2.067820292
Colorado   0.02571456 0.3988593  0.8608085  1.864967207

Step 3: Find the Optimal Number of Clusters


To perform k-medoids clustering in R we can use the pam() function, which stands for "partitioning
around medoids" and uses the following syntax:
pam(data, k, metric = “euclidean”, stand = FALSE)
where:
 data: Name of the dataset.
 k: The number of clusters.
 metric: The metric to use to calculate distance. Default is euclidean but you could also specify
manhattan.
 stand: Whether or not to standardize each variable in the dataset. Default is FALSE.
Since we don’t know beforehand how many clusters is optimal, we’ll create two different plots that can
help us decide:
1. Number of Clusters vs. the Total Within Sum of Squares
First, we’ll use the fviz_nbclust() function to create a plot of the number of clusters vs. the total within
sum of squares:
fviz_nbclust(df, pam, method = "wss")

The total within sum of squares will typically always increase as we increase the number of clusters, so
when we create this type of plot we look for an “elbow” where the sum of squares begins to “bend” or
level off.
The point where the plot bends is typically the optimal number of clusters. Beyond this number,
overfitting is likely to occur.
For this plot it appears that there is a bit of an elbow or "bend" at k = 4 clusters.
2. Number of Clusters vs. Gap Statistic
Another way to determine the optimal number of clusters is to use a metric known as the gap statistic,
which compares the total intra-cluster variation for different values of k with their expected values for a
distribution with no clustering.
We can calculate the gap statistic for each number of clusters using the clusGap() function from the cluster
package along with a plot of clusters vs. gap statistic using the fviz_gap_stat() function:
#calculate gap statistic based on number of clusters
gap_stat <- clusGap(df,
                    FUN = pam,
                    K.max = 10, #max clusters to consider
                    B = 50)     #total bootstrapped iterations

#plot number of clusters vs. gap statistic
fviz_gap_stat(gap_stat)

From the plot we can see that gap statistic is highest at k = 4 clusters, which matches the elbow method we
used earlier.
Step 4: Perform K-Medoids Clustering with Optimal K
Lastly, we can perform k-medoids clustering on the dataset using the optimal value for k of 4:
#make this example reproducible
set.seed(1)

#perform k-medoids clustering with k = 4 clusters
kmed <- pam(df, k = 4)

#view results
kmed

Medoids:
              ID     Murder    Assault   UrbanPop         Rape
Alabama        1  1.2425641  0.7828393 -0.5209066 -0.003416473
Michigan      22  0.9900104  1.0108275  0.5844655  1.480613993
Oklahoma      36 -0.2727580 -0.2371077  0.1699510 -0.131534211
New Hampshire 29 -1.3059321 -1.3650491 -0.6590781 -1.252564419
Clustering vector:
       Alabama         Alaska        Arizona       Arkansas     California
             1              2              2              1              2
      Colorado    Connecticut       Delaware        Florida        Georgia
             2              3              3              2              1
        Hawaii          Idaho       Illinois        Indiana           Iowa
             3              4              2              3              4
        Kansas       Kentucky      Louisiana          Maine       Maryland
             3              3              1              4              2
 Massachusetts       Michigan      Minnesota    Mississippi       Missouri
             3              2              4              1              3
       Montana       Nebraska         Nevada  New Hampshire     New Jersey
             3              3              2              4              3
    New Mexico       New York North Carolina   North Dakota           Ohio
             2              2              1              4              3
      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina
             3              3              3              3              1
  South Dakota      Tennessee          Texas           Utah        Vermont
             4              1              2              3              4
      Virginia     Washington  West Virginia      Wisconsin        Wyoming
             3              3              4              4              3
Objective function:
   build     swap
1.035116 1.027102

Available components:
[1] "medoids"    "id.med"     "clustering" "objective"  "isolation"
[6] "clusinfo"   "silinfo"    "diss"       "call"       "data"
Note that the four cluster medoids are actual observations in the dataset. Near the top of the output we can
see that the four medoids are the following states:
 Alabama
 Michigan
 Oklahoma
 New Hampshire

We can visualize the clusters on a scatterplot that displays the first two principal components on the axes
using the fviz_cluster() function:
#plot results of final k-medoids model
fviz_cluster(kmed, data = df)

We can also append the cluster assignments of each state back to the original dataset:
#add cluster assignment to original data
final_data <- cbind(USArrests, cluster = kmed$cluster)

#view final data
head(final_data)

Murder Assault UrbanPop Rape cluster


Alabama 13.2 236 58 21.2 1
Alaska 10.0 263 48 44.5 2
Arizona 8.1 294 80 31.0 2
Arkansas 8.8 190 50 19.5 1
California 9.0 276 91 40.6 2
Colorado 7.9 204 78 38.7 2

11) implement density based clustering on iris dataset.
The iris dataset consists of 50 samples from each of 3 species of Iris (Iris setosa, Iris virginica, Iris
versicolor). It is a multivariate dataset introduced by the British statistician and biologist Ronald Fisher in
his 1936 paper "The use of multiple measurements in taxonomic problems". Four features were measured
from each sample, i.e. the length and width of the sepals and petals, and based on the combination of these
four features Fisher developed a linear discriminant model to distinguish the species from each other.
# Loading data
>data(iris)
# Structure
>str(iris)

Performing DBScan on the Dataset

Using the DBScan clustering algorithm on the iris dataset, which includes 150 observations and four
numeric attributes (the Species label is removed below).

# Installing Packages
>install.packages("fpc")
# Loading package
>library(fpc)
# Remove label form dataset
>iris_1 <- iris[-5]
# Fitting DBScan clustering Model
# to training dataset
>set.seed(220) # Setting seed
>Dbscan_cl <- dbscan(iris_1, eps = 0.45, MinPts = 5)
>Dbscan_cl
# Checking cluster assignments
>Dbscan_cl$cluster

# Table
>table(Dbscan_cl$cluster, iris$Species)
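In fpc's dbscan output, points assigned to cluster 0 are treated as noise (outliers); they can be counted with:
>sum(Dbscan_cl$cluster == 0) # number of points labelled as noise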

# Plotting Cluster
>plot(Dbscan_cl, iris_1, main = "DBScan")

>plot(Dbscan_cl, iris_1, main = "Petal Width vs Sepal Length")

12) implement decision trees using ‘readingSkills’ dataset.

Step 1: Load the required libraries
>library(datasets)
>library(caTools)
>install.packages("party")
>library(party)
>library(dplyr)
>library(magrittr)
Step 2: Load the dataset readingSkills and execute head(readingSkills)
>data("readingSkills")
> head(readingSkills)

Format
A data frame with 200 observations on the following 4 variables.
nativeSpeaker: a factor with levels no and yes, where yes indicates that the child is a native
speaker of the language of the reading test.
age : age of the child in years.
shoeSize: shoe size of the child in cm.
score: raw score on the reading test.

Step 3: Splitting dataset into 4:1 ratio for train and test data
>sample_data = sample.split(readingSkills$nativeSpeaker, SplitRatio = 0.8)
## alternative: sample_data <- sample(c(TRUE,FALSE), nrow(readingSkills), replace=TRUE, prob=c(0.8,0.2))
>train_data <- subset(readingSkills, sample_data == TRUE)
>test_data <- subset(readingSkills, sample_data == FALSE)

Step 4: Create the decision tree model using ctree and plot the model
>model<- ctree(nativeSpeaker ~ ., train_data)
>plot(model)
The basic syntax for creating a decision tree in R is:
>ctree(formula, data)
where, formula describes the predictor and response variables and data is the data set used. In this case
nativeSpeaker is the response variable and the other predictor variables are represented by ., hence when
we plot the model we get the following output.
Output:

Step 5: Making a prediction
# test which people are native speakers and which are not
>predict_model<-predict(model, test_data)
# create a table counting how many are classified as native speakers and how many are not
>m_at <- table(test_data$nativeSpeaker, predict_model)
>m_at
The rows of this confusion matrix give the actual classes and the columns the predicted classes, so the
diagonal entries count the correctly classified test observations and the off-diagonal entries count the
misclassifications.
Step 6: Determining the accuracy of the model developed
>ac_Test <- sum(diag(m_at)) / sum(m_at)
>print(paste('Accuracy for test is found to be', ac_Test))
Here the accuracy on the test set is calculated from the confusion matrix and is found to be 0.74.
Hence this model is found to predict with an accuracy of 74%.

13) Implement decision trees using ‘iris’ dataset using package party and ‘rpart’. (Recursive
Partitioning and Regression Trees)
Fit a rpart model
Usage:
rpart(formula, data, weights, subset, na.action = na.rpart, method,
model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...)
Arguments
formula: a formula, with a response but no interaction terms. If this is a data frame, it is taken as the
model frame.
data : an optional data frame in which to interpret the variables named in the formula.
weights: optional case weights.
subset : optional expression saying that only a subset of the rows of the data should be used in the fit.
na.action: the default action deletes all observations for which y is missing, but keeps those in which one
or more predictors are missing.
method: one of "anova", "poisson", "class" or "exp". If method is missing then the routine tries to make
an intelligent guess. If y is a survival object, then method = "exp" is assumed, if y has 2 columns then
method = "poisson" is assumed, if y is a factor then method = "class" is assumed, otherwise method =
"anova" is assumed. It is wisest to specify the method directly, especially as more criteria may be added to
the function in future.
Alternatively, method can be a list of functions named init, split and eval. Examples are given in the file
‘tests/usersplits.R’ in the sources, and in the vignettes ‘User Written Split Functions’.
model : if logical: keep a copy of the model frame in the result? If the input value for model is a model
frame (likely from an earlier call to the rpart function), then this frame is used rather than constructing
new data.
x : keep a copy of the x matrix in the result.
y : keep a copy of the dependent variable in the result. If missing and model is supplied this defaults
to FALSE.
parms : optional parameters for the splitting function.
Anova splitting has no parameters.
Poisson splitting has a single parameter, the coefficient of variation of the prior distribution on the rates.
The default value is 1.
Exponential splitting has the same parameter as Poisson.
For classification splitting, the list can contain any of: the vector of prior probabilities (component prior),
the loss matrix (component loss) or the splitting index (component split). The priors must be positive and
sum to 1. The loss matrix must have zeros on the diagonal and positive off-diagonal elements. The
splitting index can be gini or information. The default priors are proportional to the data counts, the losses
default to 1, and the split defaults to gini.
control : a list of options that control details of the rpart algorithm. See rpart.control.
cost : a vector of non-negative costs, one for each variable in the model. Defaults to one for all
variables. These are scalings to be applied when considering splits, so the improvement on splitting on
a variable is divided by its cost in deciding which split to choose.
...
arguments to rpart.control may also be specified in the call to rpart. They are checked against the list of
valid arguments.
Details: This differs from the tree function in S mainly in its handling of surrogate variables. In most
details it follows Breiman et al. (1984) quite closely. The R package tree provides a re-implementation of tree.
Value: An object of class rpart.
Program:
> library(rpart)
> install.packages('rpart.plot')
> library(rpart.plot)
>data<-iris
>head(data)

> dt3 = rpart(Species ~., control = rpart.control( minsplit = 10, maxdepth = 5),data=iris , method =
"poisson")
> dt3
n= 150

node), split, n, deviance, yval
      * denotes terminal node

1) root 150 52.324810000 2.000000
  2) Petal.Length< 2.45 50 0.004869366 1.009901 *
3) Petal.Length>=2.45 100 10.068000000 2.497512
6) Petal.Width< 1.75 54 1.935982000 2.091743
12) Petal.Length< 4.95 48 0.422411100 2.020619 *
13) Petal.Length>=4.95 6 0.531330400 2.615385 *
7) Petal.Width>=1.75 46 0.372588600 2.967742 *
> dt3 = rpart(Species ~., control = rpart.control( minsplit = 10, maxdepth = 5),data=iris , method =
"class")

> dt3
n= 150

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)
  2) Petal.Length< 2.45 50 0 setosa (1.00000000 0.00000000 0.00000000) *
3) Petal.Length>=2.45 100 50 versicolor (0.00000000 0.50000000 0.50000000)
6) Petal.Width< 1.75 54 5 versicolor (0.00000000 0.90740741 0.09259259)
12) Petal.Length< 4.95 48 1 versicolor (0.00000000 0.97916667 0.02083333) *
13) Petal.Length>=4.95 6 2 virginica (0.00000000 0.33333333 0.66666667) *
7) Petal.Width>=1.75 46 1 virginica (0.00000000 0.02173913 0.97826087) *
>rpart.plot(dt3)

>rpart.plot(dt3,type=4,extra=1)
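As an optional follow-up, rpart also provides pruning diagnostics for a fitted tree:
>printcp(dt3) # complexity-parameter table from cross-validation
>plotcp(dt3) # cross-validated error against cp, to choose a pruning point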

14. Use the Corpus() function to create a data corpus, then build a term matrix and reveal word
frequencies
corpus: Text Corpus Analysis
Text corpus data analysis, with full support for international text (Unicode). Functions for reading data
from newline-delimited 'JSON' files, for normalizing and tokenizing text, for searching for term
occurrences, and for computing term occurrence frequencies, including n-grams.
> install.packages("corpus")
package 'corpus' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Prasanna Kumar\AppData\Local\Temp\Rtmps7sf41\downloaded_packages
> library(corpus)
> help(corpus)
The Corpus Package
Text corpus analysis functions
Details:
This package contains functions for text corpus analysis. To create a text object, use the read_ndjson or
as_corpus_text function. To split text into sentences or token blocks, use text_split. To specify
preprocessing behavior for transforming a text into a token sequence, use text_filter. To tokenize text or
compute term frequencies, use text_tokens, term_stats or term_matrix. To search for or count specific
terms, use text_locate, text_count, or text_detect.

term_matrix {corpus}
Description: Tokenize a set of texts and compute a term frequency matrix.
Usage:
term_matrix(x, filter = NULL, ngrams = NULL, select = NULL, group = NULL, transpose = FALSE, ...)
term_counts(x, filter = NULL, ngrams = NULL, select = NULL, group = NULL, ...)
Arguments
x : a text vector to tokenize.
filter : if non-NULL, a text filter to use instead of the default text filter for x.
ngrams: an integer vector of n-gram lengths to include, or NULL to use the select argument to determine
the n-gram lengths.
select :a character vector of terms to count, or NULL to count all terms that appear in x.
group : if non-NULL, a factor, character string, or integer vector the same length as x specifying the
grouping behavior.
transpose: a logical value indicating whether to transpose the result, putting terms as rows instead of
columns.
... : additional properties to set on the text filter.
Details:
term_matrix tokenizes a set of texts and computes the occurrence counts for each term, returning the
result as a sparse matrix (texts-by-terms). term_counts returns the same information, but in a data frame.
If ngrams is non-NULL, then multi-type n-grams are included in the output for all lengths appearing in
the ngrams argument. If ngrams is NULL but select is non-NULL, then all n-grams appearing in the
select set are included. If both ngrams and select are NULL, then only unigrams (single type terms) are
included.
If group is NULL, then the output has one set of term counts for each input text. Otherwise, we convert
group to a factor and compute one set of term counts for each level. Texts with NA values for group get
skipped.
Value:

term_matrix with transpose = FALSE returns a sparse matrix in "dgCMatrix" format with one column for
each term and one row for each input text or (if group is non-NULL) for each grouping level. If
filter$select is non-NULL, then the column names will be equal to filter$select. Otherwise, the columns
are assigned in arbitrary order.
term_matrix with transpose = TRUE returns the transpose of the term matrix, in "dgCMatrix" format.
term_counts with group = NULL returns a data frame with one row for each entry of the term matrix, and
columns "text", "term", and "count" giving the text ID, term, and count. The "term" column is a factor
with levels equal to the selected terms. The "text" column is a
factor with levels equal to names(as_corpus_text(x)); calling as.integer on the "text"
column converts from the factor values to the integer row index in the term matrix.
term_counts with group non-NULL behaves similarly, but the result instead has columns named "group",
"term", and "count", with "group" giving the grouping level, as a factor.

Examples
text <- c("A rose is a rose is a rose.",
"A Rose is red, a violet is blue!",
"A rose by any other name would smell as sweet.")
term_matrix(text)

# select certain terms
term_matrix(text, select = c("rose", "red", "violet", "sweet"))

# specify a grouping factor
term_matrix(text, group = c("Good", "Bad", "Good"))

# include higher-order n-grams
term_matrix(text, ngrams = 1:3)

# select certain multi-type terms
term_matrix(text, select = c("a rose", "a violet", "sweet", "smell"))

# transpose the result
term_matrix(text, ngrams = 1:2, transpose = TRUE)[1:10, ] # first 10 rows

# data frame
head(term_counts(text), n = 10) # first 10 rows

# with grouping
term_counts(text, group = c("Good", "Bad", "Good"))

# taking names from the input
term_counts(c(a = "One sentence.", b = "Another", c = "!!"))
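Finally, to reveal overall word frequencies as the exercise title asks, the term_stats() function mentioned
above tabulates each term with its count and support:
term_stats(text) # one row per term: term, count, support (number of texts containing it)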
