A Note On R
All this has been made possible by the years of effort that have gone into caret (Classification And REgression Training), which is possibly the biggest project in R. This package alone is all you need to know for solving almost any supervised machine learning problem. It provides a uniform interface to several machine learning algorithms and standardizes various other tasks such as data splitting, pre-processing, feature selection, variable importance estimation, etc.
To get an in-depth overview of the various functionalities provided by caret, you can refer to this article.
Today, we'll work on the Loan Prediction problem-III to show you the power of the caret package.
P.S. While caret definitely simplifies the job to a degree, it cannot take away the hard work and practice you need to put in to become a master at machine learning.
Table of Contents
1. Getting started
2. Pre-processing using Caret
3. Splitting the data using Caret
4. Feature selection using Caret
5. Training models using Caret
6. Parameter tuning using Caret
7. Variable importance estimation using Caret
8. Making predictions using Caret
1. Getting started
To put it simply, caret is essentially a wrapper for 200+ machine learning algorithms. Additionally, it provides several features which make it a one-stop solution for all the modeling needs of supervised machine learning problems.
Caret tries not to load all the packages it depends upon at the start. Instead, it loads them only when the
packages are needed. But it does assume that you already have all the algorithms installed on your
system.
To install Caret on your system, use the following command. Heads up: It might take some time:
> install.packages("caret", dependencies = c("Depends", "Suggests"))
Now, let's get started using the caret package on the Loan Prediction 3 problem:
#Loading caret package
library("caret")
In this problem, we have to predict the Loan Status of a person based on his/her profile.
2. Pre-processing using Caret
We need to pre-process our data before we can use it for modeling. Let's check if the data has any missing values:
sum(is.na(train))
#[1] 86
Next, let us use caret to impute these missing values using the KNN algorithm. We will predict these missing values based on the other attributes of that row. Also, we'll scale and center the numerical data using the convenient preProcess() function in caret.
#Imputing missing values using KNN.Also centering and scaling numerical columns
preProcValues <- preProcess(train, method = c("knnImpute","center","scale"))
library('RANN')
train_processed <- predict(preProcValues, train)
sum(is.na(train_processed))
#[1] 0
It is also very easy to use one-hot encoding in caret to create dummy variables for each level of a categorical variable. But first, we'll convert the dependent variable to numeric.
#Converting outcome variable to numeric
train_processed$Loan_Status<-ifelse(train_processed$Loan_Status=='N',0,1)
id<-train_processed$Loan_ID
train_processed$Loan_ID<-NULL
Here, fullRank=T will create only (n-1) columns for a categorical column with n different levels. This works particularly well for representing categorical predictors like gender, married, etc. where we only have two levels (Male/Female, Yes/No), because 0 can represent one class while 1 represents the other in the same column. A sketch of the corresponding dummyVars() call is shown below.
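The encoding call itself does not appear in this excerpt; a minimal sketch using caret's dummyVars(), where the name of the transformed data frame is an assumption, might look like this:
#Converting every categorical variable to numerical using dummy variables (sketch)
dmy <- dummyVars(" ~ .", data = train_processed, fullRank = T)
train_transformed <- data.frame(predict(dmy, newdata = train_processed))

#Converting the dependent variable back to a factor for classification
train_transformed$Loan_Status <- as.factor(train_transformed$Loan_Status)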
You can proceed further and tune the parameters of all these algorithms using the parameter tuning techniques below.
If the search space for the parameters is not defined, caret will use 3 random values of each tunable parameter and use the cross-validation results to find the best set of parameters for that algorithm. Otherwise, there are two more ways to tune parameters:
6.1. Using tuneGrid
To find the parameters of a model that can be tuned, you can use
modelLookup(model='gbm')
#Creating grid
grid <- expand.grid(n.trees = c(10,20,50,100,500,1000),
                    shrinkage = c(0.01,0.05,0.1,0.5),
                    n.minobsinnode = c(3,5,10),
                    interaction.depth = c(1,5,10))
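The training call that produces the output below is not shown in this excerpt; a sketch, assuming the fitControl, trainSet, predictors and outcomeName objects used in the tuneLength example later in this section, would be:
#Training a GBM over the grid defined above (sketch)
model_gbm <- train(trainSet[,predictors], trainSet[,outcomeName], method = 'gbm',
                   trControl = fitControl, tuneGrid = grid)
print(model_gbm)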
#No pre-processing
#Resampling: Cross-Validated (5 fold, repeated 5 times)
#Summary of sample sizes: 368, 370, 369, 369, 368, 369, ...
#Resampling results across tuning parameters:
Thus, for all the parameter combinations that you listed in expand.grid(), a model will be created and
tested using cross-validation. The set of parameters with the best cross-validation performance will be
used to create the final model which you get at the end.
6.2. Using tuneLength
Instead of specifying the exact values for each parameter to tune, we can simply ask caret to try any number of possible values for each tuning parameter through tuneLength. Let's try an example using tuneLength=10.
#using tune length
model_gbm <- train(trainSet[,predictors], trainSet[,outcomeName], method = 'gbm',
                   trControl = fitControl, tuneLength = 10)
print(model_gbm)
#No pre-processing
#Resampling: Cross-Validated (5 fold, repeated 5 times)
#Summary of sample sizes: 368, 369, 369, 370, 368, 369, ...
#Resampling results across tuning parameters:
#Variable Importance
varImp(object=model_gbm)
#gbm variable importance
#Overall
#Credit_History 100.000
#LoanAmount 16.633
#ApplicantIncome 7.104
#CoapplicantIncome 6.773
#Loan_Amount_Term 0.000
#Overall
#Credit_History 100.00
#ApplicantIncome 73.46
#LoanAmount 60.59
#CoapplicantIncome 40.43
#Loan_Amount_Term 0.00
#Overall
#ApplicantIncome 100.00
#LoanAmount 82.87
#CoapplicantIncome 56.92
#Credit_History 41.11
#Loan_Amount_Term 0.00
#Overall
#Credit_History 100.000
#CoapplicantIncome 17.218
#Loan_Amount_Term 12.988
#LoanAmount 5.632
#ApplicantIncome 0.000
Caret also provides a confusionMatrix function which will give the confusion matrix along with
various other metrics for your predictions. Here is the performance analysis of our GBM model:
confusionMatrix(predictions,testSet[,outcomeName])
#Confusion Matrix and Statistics
#          Reference
#Prediction   0   1
#         0  25   3
#         1  23 102
#
#               Accuracy : 0.8301
#                 95% CI : (0.761, 0.8859)
#    No Information Rate : 0.6863
#    P-Value [Acc > NIR] : 4.049e-05
#
#                  Kappa : 0.555
# Mcnemar's Test P-Value : 0.0001944
#
#            Sensitivity : 0.5208
#            Specificity : 0.9714
#         Pos Pred Value : 0.8929
#         Neg Pred Value : 0.8160
#             Prevalence : 0.3137
#         Detection Rate : 0.1634
#   Detection Prevalence : 0.1830
#      Balanced Accuracy : 0.7461
#
#       'Positive' Class : 0
Additional Resources
Caret Package Homepage
Caret Package on CRAN
Caret Package Manual (PDF, all the functions)
A Short Introduction to the caret Package (PDF)
Open source project on GitHub (source code)
Here is a webinar by the creator of the caret package himself
End Notes
Caret is one of the most powerful and useful packages ever made in R. It alone has the capability to fulfill all the needs for predictive modeling, from pre-processing to interpretation. Additionally, its syntax is also very easy to use. If you use R, I'll encourage you to use caret.
Caret is a very comprehensive package and, instead of covering all the functionalities that it offers, I thought it'll be a better idea to show an end-to-end implementation of caret on a real hackathon dataset. I have tried to cover as many functions in caret as I could, but caret has a lot more to offer. For going in depth, you might find the resources mentioned above very useful. Several of these resources have been written by Max Kuhn (the creator of the caret package) himself.
Table of Contents
1. Importance of Feature Selection
2. Filter Methods
3. Wrapper Methods
4. Embedded Methods
5. Difference between Filter and Wrapper methods
6. Walkthrough example
Next, we'll discuss various methodologies and techniques that you can use to subset your feature space and help your models perform better and more efficiently. So, let's get started.
2. Filter Methods
Filter methods are generally used as a preprocessing step. The selection of features is independent of any machine learning algorithm. Instead, features are selected on the basis of their scores in various statistical tests for their correlation with the outcome variable. Correlation is a subjective term here. For basic guidance, the following measures are commonly used to define correlation, depending on the type of the feature and the outcome variable.
Pearson's Correlation: It is used as a measure for quantifying linear dependence between two continuous variables X and Y. Its value varies from -1 to +1. Pearson's correlation is given as ρ(X, Y) = cov(X, Y) / (σ_X σ_Y).
LDA: Linear discriminant analysis is used to find a linear combination of features that
characterizes or separates two or more classes (or levels) of a categorical variable.
ANOVA: ANOVA stands for Analysis of variance. It is similar to LDA except for the fact that it
is operated using one or more categorical independent features and one continuous dependent
feature. It provides a statistical test of whether the means of several groups are equal or not.
Chi-Square: It is a statistical test applied to groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distribution.
One thing that should be kept in mind is that filter methods do not remove multicollinearity. So, you must deal with the multicollinearity of features as well before training models on your data. A minimal sketch of a simple correlation-based filter is shown below.
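As a quick illustration, here is a minimal sketch of a correlation-based filter on a generic data frame df with a numeric outcome column y (both names are hypothetical):
#Scoring numeric features by their absolute Pearson correlation with the outcome
num_features <- setdiff(names(df)[sapply(df, is.numeric)], "y")
scores <- sapply(num_features,
                 function(f) abs(cor(df[[f]], df$y, use = "complete.obs")))

#Keeping, say, the 10 highest-scoring features
selected <- names(sort(scores, decreasing = TRUE))[1:10]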
3. Wrapper Methods
In wrapper methods, we try to use a subset of features and train a model using them. Based on the inferences that we draw from the previous model, we decide to add or remove features from our subset. The problem is essentially reduced to a search problem. These methods are usually computationally very expensive.
Some common examples of wrapper methods are forward feature selection, backward feature
elimination, recursive feature elimination, etc.
Forward Selection: Forward selection is an iterative method in which we start with no features in the model. In each iteration, we keep adding the feature which best improves our model, until adding a new variable no longer improves the performance of the model.
Backward Elimination: In backward elimination, we start with all the features and remove the least significant feature at each iteration, which improves the performance of the model. We repeat this until no improvement is observed on removing features.
Recursive Feature Elimination: It is a greedy optimization algorithm which aims to find the best-performing feature subset. It repeatedly creates models and sets aside the best or the worst performing feature at each iteration. It constructs the next model with the remaining features until all the features are exhausted. It then ranks the features based on the order of their elimination.
One of the best ways of implementing feature selection with wrapper methods is to use the Boruta package, which finds the importance of a feature by creating shadow features.
It works in the following steps:
1. Firstly, it adds randomness to the given data set by creating shuffled copies of all features
(which are called shadow features).
2. Then, it trains a random forest classifier on the extended data set and applies a feature
importance measure (the default is Mean Decrease Accuracy) to evaluate the importance of
each feature where higher means more important.
3. At every iteration, it checks whether a real feature has a higher importance than the best of its
shadow features (i.e. whether the feature has a higher Z-score than the maximum Z-score of its
shadow features) and constantly removes features which are deemed highly unimportant.
4. Finally, the algorithm stops either when all features get confirmed or rejected or it reaches a
specified limit of random forest runs.
For more information on the implementation of the Boruta package, you can refer to this article.
For the implementation of Boruta in Python, you can refer to this article. A minimal sketch of a Boruta run in R is shown below.
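As a minimal sketch (assuming a data frame df with outcome column y, both hypothetical names), a Boruta run looks like this:
#Running Boruta on all features and extracting the confirmed ones (sketch)
library(Boruta)

set.seed(123)
boruta_out <- Boruta(y ~ ., data = df, doTrace = 2, maxRuns = 100)

#Features confirmed as important (tentative ones excluded)
getSelectedAttributes(boruta_out, withTentative = FALSE)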
4. Embedded Methods
Embedded methods combine the qualities of filter and wrapper methods. They are implemented by algorithms that have their own built-in feature selection methods.
Some of the most popular examples of these methods are LASSO and RIDGE regression which have
inbuilt penalization functions to reduce overfitting.
Lasso regression performs L1 regularization, which adds a penalty equivalent to the absolute value of the magnitude of the coefficients.
Ridge regression performs L2 regularization, which adds a penalty equivalent to the square of the magnitude of the coefficients; a minimal sketch of both follows below.
For more details and implementation of LASSO and RIDGE regression, you can refer to this article.
Other examples of embedded methods are Regularized trees, Memetic algorithm, Random multinomial
logit.
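As a minimal sketch (assuming a numeric predictor matrix x and an outcome vector y, both hypothetical names), LASSO and ridge can be fitted with the glmnet package:
#Cross-validated LASSO and ridge fits (sketch)
library(glmnet)

lasso_fit <- cv.glmnet(x, y, alpha = 1)  #L1 penalty (LASSO)
ridge_fit <- cv.glmnet(x, y, alpha = 0)  #L2 penalty (ridge)

#LASSO drives some coefficients exactly to zero, effectively selecting features
coef(lasso_fit, s = "lambda.min")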
6. Walkthrough example
Let's use wrapper methods for feature selection and see whether we can improve the accuracy of our model by using an intelligently selected subset of features instead of using every feature at our disposal.
We'll be using stock prediction data in which we'll predict whether the stock will go up or down based on 100 predictors in R. This dataset contains 100 independent variables from X1 to X100 representing the profile of a stock, and one outcome variable Y with two levels: 1 for a rise in the stock price and -1 for a drop.
To download the dataset, click here.
Let's start by applying a random forest using all the features on the dataset first.
library('Metrics')
library('randomForest')
library('ggplot2')
library('ggthemes')
library('dplyr')
#set random seed
set.seed(101)
#loading dataset
data<-read.csv("train.csv",stringsAsFactors= T)
#checking dimensions of data
dim(data)
## [1] 3000 101
#specifying outcome variable as factor
data$Y<-as.factor(data$Y)
data$Time<-NULL
#dividing the dataset into train and test
train<-data[1:2000,]
test<-data[2001:3000,]
#applying Random Forest
model_rf<-randomForest(Y ~ ., data = train)
preds<-predict(model_rf,test[,-101])
table(preds)
##preds
## -1 1
##453 547
#checking accuracy
auc(preds,test$Y)
##[1] 0.4522703
Now, instead of trying a large number of possible subsets through, say, forward selection or backward elimination, we'll keep it simple by using only the top 20 features to build a random forest (a sketch of this selection step follows the importance output below). Let's find out if it can improve the accuracy of our model.
Let's look at the feature importance:
importance(model_rf)
#MeanDecreaseGini
##x1 8.815363
##x2 10.920485
##x3 9.607715
##x4 10.308006
##x5 9.645401
##x6 11.409772
##x7 10.896794
##x8 9.694667
##x9 9.636996
##x10 8.609218
##x87 8.730480
##x88 9.734735
##x89 10.884997
##x90 10.684744
##x91 9.496665
##x92 9.978600
##x93 10.479482
##x94 9.922332
##x95 8.640581
##x96 9.368352
##x97 7.014134
##x98 10.640761
##x99 8.837624
##x100 9.914497
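The selection and retraining step is not shown in the text; a sketch of what it might look like (an assumption about the original workflow) is:
#Ranking features by MeanDecreaseGini and keeping the top 20 (sketch)
imp <- importance(model_rf)
top20 <- names(sort(imp[,"MeanDecreaseGini"], decreasing = TRUE))[1:20]

#Retraining the random forest on the top 20 predictors only
model_rf <- randomForest(Y ~ ., data = train[, c(top20, "Y")])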
preds<-predict(model_rf,test[,-101])
table(preds)
##preds
##-1 1
##218 782
#checking accuracy
auc(preds,test$Y)
##[1] 0.4767592
So, by just using the 20 most important features, we have improved the accuracy from 0.452 to 0.476. This is just an example of how feature selection makes a difference. Not only have we improved the accuracy, but by using just 20 predictors instead of 100, we have also:
increased the interpretability of the model.
reduced the complexity of the model.
reduced the training time of the model.
End Notes
I believe that this article has given you a good idea of how you can perform feature selection to get the best out of your models. These are the broad categories that are commonly used for feature selection. I believe you will be convinced of the potential uplift you can unlock in your models using feature selection, along with its added benefits.
lambda <- 2
mu <- 4
rho <- lambda/mu # = 2/4
# Theoretical value
mm1.N <- rho/(1-rho)
graph + geom_hline(yintercept=mm1.N)
It is also possible to visualise, for instance, the instantaneous usage of individual elements by playing with the parameters items and steps.
plot_resource_usage(mm1.env, "resource", items=c("queue", "server"), steps=TRUE) +
xlim(0, 20) + ylim(0, 4)
We may obtain the time spent by each customer in the system and compare its average with the theoretical value.
mm1.arrivals <- get_mon_arrivals(mm1.env)
mm1.t_system <- mm1.arrivals$end_time - mm1.arrivals$start_time
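# Sketch reconstructing the step that produces the two outputs below
# (assumption): the theoretical M/M/1 mean time in system is T = 1/(mu - lambda)
mm1.T <- 1/(mu - lambda)
mm1.T
mean(mm1.t_system)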
## [1] 0.5
## [1] 0.5012594
It seems to match the theoretical value pretty well. But of course we are picky, so let's take a closer look, just to be sure (and to learn more about simmer, why not). Replication can be done with standard R tools:
library(parallel)
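The replication code itself is not shown in this excerpt; a minimal sketch, assuming the mm1 trajectory and the resource name used above (the exact trajectory of the original example is not reproduced here), might be:
# simmer is assumed to be loaded earlier in the original article
library(simmer)

# Trajectory for an M/M/1 customer (assumption): seize, serve, release
mm1 <- trajectory() %>%
  seize("resource", 1) %>%
  timeout(function() rexp(1, mu)) %>%
  release("resource", 1)

# 1000 independent replications, wrapped so monitored data survive the threads
envs <- mclapply(1:1000, function(i) {
  simmer("M/M/1") %>%
    add_resource("resource", capacity = 1, queue_size = Inf) %>%
    add_generator("arrival", mm1, function() rexp(1, lambda)) %>%
    run(1000) %>%
    wrap()
})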
Et voilà! Parallelizing has the shortcoming that we lose the underlying C++ objects when each thread finishes, but the wrap function does all the magic for us, retrieving the monitored data. Let's perform a simple test:
library(dplyr)
t_system <- get_mon_arrivals(envs) %>%
mutate(t_system = end_time - start_time) %>%
group_by(replication) %>%
summarise(mean = mean(t_system))
t.test(t_system$mean)
##
## One Sample t-test
##
## data: t_system$mean
## t = 344.14, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.4953154 0.5009966
## sample estimates:
## mean of x
## 0.498156
Good news: the simulator works. Finally, an M/M/1 queue satisfies that the distribution of the time spent in the system is, in turn, an exponential random variable with mean T = 1/(mu - lambda).
qqplot(mm1.t_system, rexp(length(mm1.t_system), 1/mm1.T))
abline(0, 1, lty=2, col="red")
M/M/c/k systems
An M/M/c/k system keeps exponential arrivals and service times, but has more than one server in general and a finite queue, which is often more realistic. For instance, a router may have several processors to handle packets, and the in/out queues are necessarily finite.
This is the simulation of an M/M/2/3 system (2 servers, 1 position in the queue). Note that the trajectory is identical to the M/M/1 case; a sketch of the simulation code follows the parameters below.
lambda <- 2
mu <- 4
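# Sketch (assumption): the same trajectory as in the M/M/1 case, but with
# 2 servers and a queue of size 1, which gives the M/M/2/3 system
mm23.env <- simmer("M/M/2/3") %>%
  add_resource("resource", capacity = 2, queue_size = 1) %>%
  add_generator("arrival", mm1, function() rexp(1, lambda)) %>%
  run(until = 2000)

mm23.arrivals <- get_mon_arrivals(mm23.env)

# Proportion of arrivals rejected because the system was full
mm23.arrivals %>%
  summarise(rejection_rate = sum(!finished)/length(finished))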
## rejection_rate
## 1 0.02065614
Despite this, the time spent in the system still follows an exponential random variable, as in the M/M/1
case, but the average has dropped.
mm23.t_system <- mm23.arrivals$end_time - mm23.arrivals$start_time
# Comparison with M/M/1 times
qqplot(mm1.t_system, mm23.t_system)
abline(0, 1, lty=2, col="red")
How to create Beautiful, Interactive data visualizations using Plotly in R and Python?
Saurav Kaushik
Introduction
"The greatest value of a picture is when it forces us to notice what we never expected to see." - John Tukey
Data visualization is an art as well as a science. It takes constant practice and effort to master the art of data visualization. I always keep exploring how to make my visualizations more interesting and informative. My main tool for creating these data visualizations had been ggplot2. When I started using ggplot2, I was amazed by its power. I felt like I was now an evolved storyteller.
Then I realized that it is difficult to make interactive charts using ggplot2. So, if you want to show something in three dimensions, you cannot look at it from various angles. So my exploration started again. One of the best alternatives I found after spending hours was D3.js. D3.js is a must-know library if you really wish to excel in data visualization. Here, you can find a great resource to realize the power of D3.js and master it.
But I realized that D3.js is not as popular in the data science community as it should be, probably because it requires a different skill set (e.g. HTML, CSS and JavaScript).
Today, I am going to tell you something which will change the way you perform data visualizations in the language/tool of your choice (R, Python, MATLAB, Perl, Julia, Arduino).
Table of Contents
1. What is Plotly?
2. Advantages and Disadvantages of Plotly
3. Steps for using Plotly
4. Setting up Data
5. Basic Visualizations
Bar Charts
Box Plots
Scatter Plots
Time Series Plots
6. Advanced Visualizations
Heat Maps
3D Scatter Plots
3D Surfaces
7. Using plotly with ggplot2
8. Different versions of Plotly
1. What is Plotly?
Plotly is one of the finest data visualization tools available, built on top of the visualization library D3.js, HTML and CSS. It is created using Python and the Django framework. One can choose to create interactive data visualizations online, or use the libraries that plotly offers to create these visualizations in the language/tool of choice. It is compatible with a number of languages/tools: R, Python, MATLAB, Perl, Julia, Arduino.
Disadvantages:
The plots made using the plotly community version are always public and can be viewed by anyone.
For the plotly community version, there is an upper limit on the API calls per day.
There is also a limited number of color palettes available in the community version, which acts as an upper bound on the coloring options.
In Python:
plotly.plotly([plotly.graph_objs.type(x, y, mode, marker = dict(color, size))])
Where:
size = values of the same length as x, y and z that represent the size of datapoints or lines in the plot.
x = values for the x-axis
y = values for the y-axis
type = specifies the kind of plot that you want to create, like histogram, surface, box, etc.
mode = the format in which you want the data to be represented in the plot. Possible values are markers, lines, points.
color = values of the same length as x, y and z that represent the color of datapoints or lines in the plot.
4. Adding the layout fields like the plot title, axis titles/labels, axis title/label fonts, etc.
In R:
layout(plot, title, xaxis = list(title, titlefont), yaxis = list(title, titlefont))
In Python:
plotly.plotly.iplot(plot, plotly.graph_objs.Layout(title, xaxis = dict(title, titlefont), yaxis = dict(title, titlefont)))
Where
plot = the plotly object to be displayed
title = string containing the title of the plot
xaxis : title = title/ label for x-axis
xaxis : titlefont = font for title/ label of x-axis
yaxis : title = title/ label for y-axis
yaxis : titlefont = font for title/ label of y-axis
5. Plotly also allows you to share your plots with someone else in various formats. For this, one needs to sign in to a plotly account. For sharing your plots you'll need the following credentials: your username and your unique API key. Sharing the plots can be done as:
In R
Sys.setenv("plotly_username"="XXXX")
Sys.setenv("plotly_api_key"="YYYY")
#To plot the last plot you created, simply use this.
plotly_POST(x = last_plot(), filename = "XYZ")
In Python
#Setting plotly credentials
plotly.tools.set_credentials_file(username='XXXX', api_key='YYYY')
Since R and Python are two of the most popular languages among data scientists, I'll be focusing on creating interactive visualizations using these two languages.
4. Setting up Data
For performing a wide range of interactive data visualizations, I'll be using some of the publicly available datasets. You can use the following code to get the datasets that I'll be using during the course of this article:
In Python
from sklearn import datasets
import pandas as pd
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data)
iris_df.columns = ['Sepal.Length','Sepal.Width','Petal.Length','Petal.Width']
iris_df.columns
iris_df['Species'] = iris.target
iris_df['Species'] = iris_df['Species'].astype('category')
iris_df.dtypes
#Sepal.Length float64
#Sepal.Width float64
#Petal.Length float64
#Petal.Width float64
#Species category
#dtype: object
iris_df['Species'].replace(0,'setosa',inplace=True)
iris_df['Species'].replace(1,'versicolor',inplace=True)
iris_df['Species'].replace(2,'virginica',inplace=True)
In Python
You can get the international airline passengers dataset here.
#Loading the data
airline_data = pd.read_csv('international-airline-passengers.csv')
4.3 Volcano Dataset
In R
#Loading the data
data(volcano)
#Checking dimensions
dim(volcano)
## [1] 87 61
In Python
You can get the volcano dataset here.
#Loading the data
volcano_data = pd.read_csv('volcano.csv')
5. Basic Visualization
To get a good understanding of when you should use which plot, I'll recommend checking out this resource. Feel free to play around and explore these plots more. Here are a few things that you can try with the coming plots:
hovering your mouse over the plot to view associated attributes
selecting a particular region on the plot using your mouse to zoom
resetting the axis
rotating the 3D images
5.1 Histograms
You can view the interactive plot here.
In R
library('plotly')
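The R code for this histogram is not shown in this excerpt; a minimal sketch with plot_ly(), assuming the built-in iris dataset, might look like this:
#plotting a histogram of Sepal.Length and storing it in hist_plot (sketch)
hist_plot <- plot_ly(x = iris$Sepal.Length, type = 'histogram')
hist_plot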
In Python
import plotly.plotly as py
import plotly.graph_objs as go
data = [go.Histogram(x=iris.data[:,0])]
layout = go.Layout(
title='Iris Dataset - Sepal.Length',
xaxis=dict(title='Sepal.Length'),
yaxis=dict(title='Count')
)
In R
#plotting a histogram with Species variable and storing it in bar_chart
bar_chart<-plot_ly(x=Species,type='histogram')
In Python
data = [go.Bar(x=['setosa','versicolor','virginica'],
               y=[iris_df.loc[iris_df['Species']=='setosa'].shape[0],
                  iris_df.loc[iris_df['Species']=='versicolor'].shape[0],
                  iris_df.loc[iris_df['Species']=='virginica'].shape[0]])]
In R
#plotting a Boxplot with Sepal.Length variable and storing it in box_plot
box_plot<-plot_ly(y=Sepal.Length,type='box',color=Species)
In Python
data = [go.Box(y=iris_df.loc[iris_df["Species"]=='setosa','Sepal.Length'],name='Setosa'),
        go.Box(y=iris_df.loc[iris_df["Species"]=='versicolor','Sepal.Length'],name='Versicolor'),
        go.Box(y=iris_df.loc[iris_df["Species"]=='virginica','Sepal.Length'],name='Virginica')]
In R
#plotting a Scatter Plot with Sepal.Length and Sepal.Width variables and storing it in scatter_plot1
scatter_plot1<-plot_ly(x=Sepal.Length,y=Sepal.Width,type='scatter',mode='markers')
In Python
data = [go.Scatter(x = iris_df["Sepal.Length"],y = iris_df["Sepal.Width"],mode =
'markers')]
py.iplot(fig)
1. Let's go a step further and add another dimension (Species) using color.
You can view the interactive plot here.
In R
#plotting a Scatter Plot with Sepal.Length and Sepal.Width variables, with color representing the Species, and storing it in scatter_plot2
scatter_plot2<-plot_ly(x=Sepal.Length,y=Sepal.Width,type='scatter',mode='markers',color=Species)
In Python
data = [go.Scatter(x = iris_df["Sepal.Length"],y = iris_df["Sepal.Width"],mode =
'markers', marker=dict(color = iris_df["Species"]))]
2. We can add another dimension (Petal Length) to the plot by using the size of each data point in
the plot.
You can view the interactive plot here.
#plotting a Scatter Plot with Sepal.Length and Sepal.Width variables, with color representing the Species and size representing Petal.Length, and storing it in scatter_plot3
scatter_plot3<-plot_ly(x=Sepal.Length,y=Sepal.Width,type='scatter',mode='markers',color=Species,size=Petal.Length)
In Python
data = [go.Scatter(x = iris_df["Sepal.Length"],y = iris_df["Sepal.Width"],mode =
'markers', marker=dict(color = iris_df["Species"],size=iris_df["Petal.Length"]))]
In R
#plotting a time series of the AirPassengers dataset and storing it in time_series
time_series<-plot_ly(x=time(AirPassengers),y=AirPassengers,type="scatter",mode="lines")
In Python
data = [go.Scatter(x=airline_data.ix[:,0],y=airline_data.ix[:,1])]
layout = go.Layout(
title='AirPassengers Dataset - Time Series Plot',
xaxis=dict(title='Time'),
yaxis=dict(title='Passengers'))
6. Advanced Visualization
Till now, we have got a grasp of how plotly can be beneficial for basic visualizations. Now let's shift gears and see plotly in action for advanced visualizations.
6.1 Heat Maps
In R
plot_ly(z=~volcano,type="heatmap")
In Python
data = [go.Heatmap(z=volcano_data.as_matrix())]
fig = go.Figure(data=data)
py.iplot(fig)
In Python
data = [go.Scatter3d(x = iris_df["Sepal.Length"],y = iris_df["Sepal.Width"],z =
iris_df["Petal.Length"],mode = 'markers', marker=dict(color =
iris_df["Species"],size=iris_df["Petal.Width"]))]
fig = go.Figure(data=data)
py.iplot(fig)
6.3 3D Surfaces
In R
#Plotting the volcano 3D surface
plot_ly(z=~volcano,type="surface")
In Python
data = [go.Surface(z=volcano_data.as_matrix())]
fig = go.Figure(data=data)
py.iplot(fig)
7. Using plotly with ggplot2
ggplot2 is one of the best visualization libraries out there. The best part about plotly is that it can add interactivity to ggplots through ggplotly(), which will further enhance the plots. For learning more about ggplot2, you can check out this resource.
Let's better understand it with an example in R.
#Loading required libraries
library('ggplot2')
library('ggmap')
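The example itself is not shown in this excerpt; a minimal sketch of the ggplotly() workflow (using the iris data rather than whatever map the original example built with ggmap) might be:
#Building a ggplot and making it interactive with ggplotly() (sketch)
g <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point()
ggplotly(g)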
End Notes
After going through this article, you would have got a good grasp of how to create interactive plotly visualizations in R as well as Python. I personally use plotly a lot and find it really useful. Combining plotly with ggplot2 by using ggplotly() can give you the best visualizations in R. But keep in mind that plotly is not limited to R and Python only; there are a lot of other languages/tools that it supports as well.
I believe this article has inspired you to use plotly for your data visualization tasks.
data(iris)
str(iris)
# OUTPUT:
# 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
One data manipulation task that you need to do in pretty much any data analysis is to recode data. It's almost never the case that the data are set up exactly the way you need them for your analysis.
In R, you can re-code an entire vector or array at once. To illustrate, let's set up a vector that has missing values.
A <- c(3, 2, NA, 5, 3, 7, NA, NA, 5, 2, 6)
A
[1] 3 2 NA 5 3 7 NA NA 5 2 6
We can re-code all missing values by another number (such as zero) as follows:
A[ is.na(A) ] <- 0
A
[1] 3 2 0 5 3 7 0 0 5 2 6
However, some re-coding tasks are more complex, particularly when you wish to re-code a categorical
variable or factor. In such cases, you might want to re-code an array with character elements to numeric
elements.
gender <- c("MALE","FEMALE","FEMALE","UNKNOWN","MALE")
gender
[1] "MALE" "FEMALE" "FEMALE" "UNKNOWN" "MALE"
Let's re-code males as 1 and females as 2. The following re-coding syntax is very useful because it works in many practical situations. It involves repeated (nested) use of the ifelse() command.
ifelse(gender == "MALE", 1, ifelse(gender == "FEMALE", 2, 3))
[1] 1 2 2 3 1
The element with unknown gender was re-coded as 3. Make a note of this syntax; it's great for re-coding within R programs.
Another example, this time using a rectangular array.
A <- data.frame(Gender = c("F", "F", "M", "F", "B", "M", "M"),
                Height = c(154, 167, 178, 145, 169, 183, 176))
A
Gender Height
1 F 154
2 F 167
3 M 178
4 F 145
5 B 169
6 M 183
7 M 176
We have deliberately introduced an error where gender is misclassified as B. This one gets re-coded to
the value 99. Note that the Gender variable is located in the first column, or A[ ,1].
A[,1] <- ifelse(A[,1] == "M", 1, ifelse(A[,1] == "F", 2, 99))
A
Gender Height
1 2 154
2 2 167
3 1 178
4 2 145
5 99 169
6 1 183
7 1 176
You can use the same approach to code as many different levels as you need to. Let's re-code for four different levels.
My last example is drawn from the films of the Lord of the Rings and the Hobbit.
The sets where Peter Jackson produced these films are just a short walk from where I live, so the
example is relevant for me.
S <- data.frame(SPECIES = c("ORC", "HOBBIT", "ELF", "TROLL", "ORC", "ORC", "ELF", "HOBBIT"),
                HEIGHT = c(194, 127, 178, 195, 149, 183, 176, 134))
S
SPECIES HEIGHT
1 ORC 194
2 HOBBIT 127
3 ELF 178
4 TROLL 195
5 ORC 149
6 ORC 183
7 ELF 176
8 HOBBIT 134
We now use nested ifelse commands to re-code Orcs as 1, Elves as 2, Hobbits as 3, and Trolls as 4.
S[,1] <- ifelse(S[,1] == "ORC", 1, ifelse(S[,1] == "ELF", 2,
ifelse(S[,1] == "HOBBIT", 3, ifelse(S[,1] == "TROLL", 4, 99))))
S
SPECIES HEIGHT
1 1 194
2 3 127
3 2 178
4 4 195
5 1 149
6 1 183
7 2 176
8 3 134
As the name implies, the dummyVars function allows you to create dummy variables; in other words, it translates text data into numerical data for modeling purposes.
If you are planning on doing predictive analytics or machine learning and want to use regression or any other modeling technique that requires numerical data, you will need to transform your text data into numbers; otherwise you run the risk of leaving a lot of information on the table.
In R, there are plenty of ways of translating text into numerical data. You can do it manually, use a base function such as matrix, or a packaged function like dummyVars from the caret package. One of the big advantages of going with the caret package is that it's full of features, including hundreds of algorithms and pre-processing functions. Once your data fits into caret's modular design, it can be run through different models with minimal tweaking.
Let's look at a few examples of dummy variables. Suppose you have a survey question with 5 categorical values, such as very unhappy, unhappy, neutral, happy and very happy.
survey <- data.frame(service=c('very unhappy','unhappy','neutral','happy','very happy'))
print(survey)
## service
## 1 very unhappy
## 2 unhappy
## 3 neutral
## 4 happy
## 5 very happy
You can easily translate this into a sequence of numbers from 1 to 5, where 3 means neutral and, in the example of a linear model that thinks in fractions, 2.5 means somewhat unhappy and 4.88 means very happy. So here we have successfully transformed this survey question into a continuous numerical scale and do not need to add dummy variables - a simple rank column will do.
survey <- data.frame(service=c('very unhappy','unhappy','neutral','happy','very happy'),
                     rank=c(1,2,3,4,5))
print(survey)
## service rank
## 1 very unhappy 1
## 2 unhappy 2
## 3 neutral 3
## 4 happy 4
## 5 very happy 5
So, the above could easily be used in a model that needs numbers and still represent that data
accurately using the rank variable instead of service. But this only works in specific situations
where you have somewhat linear and continuous-like data. What happens with categorical values such
as marital status, gender, alive?
Does it make sense to be a quarter female? Or half single? Even numerical data of a categorical nature
may require transformation. Take the zip code system. Does the half-way point between two zip codes
make geographical sense? Because that is how a regression model would use it.
It may work in a fuzzy-logic way, but it won't help in predicting much; therefore we need a more precise way of translating these values into numbers so that they can be regressed by the model.
library(caret)
# check the help file for more details
?dummyVars
The dummyVars function breaks out unique values from a column into individual columns - if you have 1000 unique values in a column, dummying them will add 1000 new columns to your data set (be careful). Let's create a more complex data frame:
customers <- data.frame(
id=c(10,20,30,40,50),
gender=c('male','female','female','male','female'),
mood=c('happy','sad','happy','sad','happy'),
outcome=c(1,1,0,0,0))
And ask the dummyVars function to dummify it. The function takes a standard R formula: something ~ (broken down) by something else or groups of other things. So we simply use ~ . and dummyVars will transform all character and factor columns (the function never transforms numeric columns) and return the entire data set:
# dummify the data
dmy <- dummyVars(" ~ .", data = customers)
trsf <- data.frame(predict(dmy, newdata = customers))
print(trsf)
If you just want one column transformed, you need to include that column in the formula, and it will return a data frame based on that variable only:
customers <- data.frame(
id=c(10,20,30,40,50),
gender=c('male','female','female','male','female'),
mood=c('happy','sad','happy','sad','happy'),
outcome=c(1,1,0,0,0))
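The call that produces the output below is not shown; a sketch of it (dummifying only the gender column) would be:
# dummify just the gender column (sketch)
dmy <- dummyVars(" ~ gender", data = customers)
trsf <- data.frame(predict(dmy, newdata = customers))
print(trsf)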
## gender.female gender.male
## 1 0 1
## 2 1 0
## 3 1 0
## 4 0 1
## 5 1 0
The fullRank parameter is worth mentioning here. The general rule for creating dummy variables is to have one fewer variable than the number of categories present, to avoid perfect collinearity (the dummy variable trap). You basically want to avoid highly correlated variables, but it also saves space. If you have a factor column comprised of two levels, male and female, then you don't need to transform it into two columns; instead, you pick one of the variables and you are either female, if it's a 1, or male if it's a 0.
Let's turn on fullRank and try our data frame again:
customers <- data.frame(
id=c(10,20,30,40,50),
gender=c('male','female','female','male','female'),
mood=c('happy','sad','happy','sad','happy'),
outcome=c(1,1,0,0,0))
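The call itself is not shown; a sketch with fullRank turned on would be:
# dummify the data, dropping one level per factor (sketch)
dmy <- dummyVars(" ~ .", data = customers, fullRank = TRUE)
trsf <- data.frame(predict(dmy, newdata = customers))
print(trsf)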
As you can see, it picked male and sad; if you are 0 in both columns, then you are female and happy.
Introduction
Data and information on the web is growing exponentially. All of us today use Google as our first source of knowledge, be it finding reviews of a place or understanding a new term. All this information is available on the web already.
With the amount of data available over the web, it opens new horizons of possibility for a data scientist. I strongly believe web scraping is a must-have skill for any data scientist. In today's world, all the data that you need is already available on the internet; the only thing limiting you from using it is the ability to access it. With the help of this article, you will be able to overcome that barrier as well.
Most of the data available over the web is not readily usable. It is present in an unstructured format (HTML) and is not downloadable. Therefore, it requires knowledge and expertise to use this data.
In this article, I am going to take you through the process of web scraping in R. With this article, you will gain the expertise to use any type of data available over the internet.
Table of Contents
1. What is Web Scraping?
2. Why do we need Web Scraping?
3. Ways to scrape data
4. Pre-requisites
5. Scraping a web page using R
6. Analyzing scraped data from the web
4. Pre-requisites
The prerequisites for performing web scraping in R are divided into two buckets:
To get started with web scraping, you must have a working knowledge of the R language. If you are just starting out or want to brush up on the basics, I'll highly recommend following this learning path in R. During the course of this article, we'll be using the rvest package in R, authored by Hadley Wickham. You can access the documentation for the rvest package here. Make sure you have this package installed. If you don't have this package yet, you can install it with the following code.
install.packages('rvest')
In addition, knowledge of HTML and CSS will be an added advantage. One of the best sources I could find for learning HTML and CSS is this. I have observed that most data scientists are not very well versed in HTML and CSS. Therefore, we'll be using an open source software named Selector Gadget, which will be more than sufficient for anyone to perform web scraping. You can access and download the Selector Gadget extension here. Make sure that you have this extension installed by following the instructions from the website. I have done the same. I'm using Google Chrome and I can access the extension in the extension bar at the top right.
Using this, you can select the parts of any website and get the relevant tags that give you access to that part, simply by clicking on that part of the website. Note that this is a workaround for actually learning HTML & CSS and doing it manually. But to master the art of web scraping, I'll highly recommend learning HTML & CSS in order to better understand and appreciate what's happening under the hood.
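Step 1 of the walkthrough (loading rvest and reading the target page into R) is not shown in this excerpt; a minimal sketch might look like the following, where the exact URL is an assumption:
#Loading the rvest package (sketch)
library('rvest')

#Specifying the URL of the website to be scraped (hypothetical example URL)
url <- 'http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature'

#Reading the HTML code from the website
webpage <- read_html(url)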
Step 2: Once you are sure that you have made the right selections, you need to copy the corresponding
CSS selector that you can view in the bottom center.
Step 3: Once you know the CSS selector that contains the rankings, you can use this simple R code to get all the rankings:
#Using CSS selectors to scrape the rankings section
rank_data_html <- html_nodes(webpage,'.text-primary')
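The conversion of the scraped nodes to text, implied by the as.numeric() call in the next step, would look something like this:
#Converting the ranking data to text (sketch of the implied step)
rank_data <- html_text(rank_data_html)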
Step 4: Once you have the data, make sure that it is in the desired format. Here I am pre-processing my data to convert it to numeric format.
#Data-Preprocessing: Converting rankings to numerical
rank_data<-as.numeric(rank_data)
[1] 1 2 3 4 5 6
Step 5: Now you can clear the selector section and select all the titles. You can visually inspect that all the titles are selected. Make any required additions and deletions with the help of your cursor. I have done the same here.
Step 6: Again, I have the corresponding CSS selector for the titles: .lister-item-header a. I will use this selector to scrape all the titles using the following code.
#Using CSS selectors to scrape the title section
title_data_html <- html_nodes(webpage,'.lister-item-header a')
Step 7: In the following code, I have done the same thing for scraping the Description, Runtime, Genre, Rating, Metascore, Votes, Gross_Earning_in_Mil, Director and Actor data.
#Using CSS selectors to scrape the description section
description_data_html <- html_nodes(webpage,'.ratings-bar+ .text-muted')
[2] "\nIn Ancient Polynesia, when a terrible curse incurred by the Demigod Maui
reaches an impetuous Chieftain's daughter's island, she answers the Ocean's call to
seek out the Demigod to set things right."
[4] "\nWWII American Army Medic Desmond T. Doss, who served during the Battle of
Okinawa, refuses to kill people, and becomes the first man in American history to
receive the Medal of Honor without firing a shot."
[5] "\nA spacecraft traveling to a distant colony planet and transporting thousands
of people has a malfunction in its sleep chambers. As a result, two passengers are
awakened 90 years early."
[6] "\nAfter the Bergens invade Troll Village, Poppy, the happiest Troll ever born,
and the curmudgeonly Branch set off on a journey to rescue her friends.
[1] "In a city of humanoid animals, a hustling theater impresario's attempt to save
his theater with a singing competition becomes grander than he anticipates even as
its finalists' find that their lives will never be the same."
[2] "In Ancient Polynesia, when a terrible curse incurred by the Demigod Maui
reaches an impetuous Chieftain's daughter's island, she answers the Ocean's call to
seek out the Demigod to set things right."
[3] "A chronicle of the childhood, adolescence and burgeoning adulthood of a young,
African-American, gay man growing up in a rough neighborhood of Miami."
[4] "WWII American Army Medic Desmond T. Doss, who served during the Battle of
Okinawa, refuses to kill people, and becomes the first man in American history to
receive the Medal of Honor without firing a shot."
[5] "A spacecraft traveling to a distant colony planet and transporting thousands
of people has a malfunction in its sleep chambers. As a result, two passengers are
awakened 90 years early."
[6] "After the Bergens invade Troll Village, Poppy, the happiest Troll ever born,
and the curmudgeonly Branch set off on a journey to rescue her friends."
[1] "108 min" "107 min" "111 min" "139 min" "116 min" "92 min"
runtime_data<-gsub(" min","",runtime_data)
runtime_data<-as.numeric(runtime_data)
[1] 1 2 3 4 5 6
#Data-Preprocessing: removing \n
genre_data<-gsub("\n","",genre_data)
10 Levels: Action Adventure Animation Biography Comedy Crime Drama ... Thriller
But, I want you to closely follow what happens when I do the same thing for Metascore data.
#Using CSS selectors to scrape the metascore section
metascore_data_html <- html_nodes(webpage,'.metascore')
[1] "59 " "81 " "99 " "71 " "41 "
[1] 96
Step 8: The length of the Metascore data is 96, while we are scraping the data for 100 movies. The reason this happened is that there are 4 movies which don't have the corresponding Metascore fields.
Step 9: This is a practical situation which can arise while scraping any website. Unfortunately, if we simply add NAs as the last 4 entries, it will map NA as the Metascore for movies 96 to 100, while in reality the data is missing for some other movies. After a visual inspection, I found that the Metascore is missing for movies 39, 73, 80 and 89. I have written the following code to get around this problem.
for (i in c(39,73,80,89)){
  a <- metascore_data[1:(i-1)]
  b <- metascore_data[i:length(metascore_data)]
  metascore_data <- append(a, NA)
  metascore_data <- append(metascore_data, b)
}
length(metascore_data)
[1] 100
Step 10: The same thing happens with the Gross variable, which represents the gross earnings of the movie in millions. I have used the same solution to work my way around it:
#Using CSS selectors to scrape the gross revenue section
gross_data_html <- html_nodes(webpage,'.ghost~ .text-muted+ span')
gross_data<-substring(gross_data,2,6)
[1] 86
a<-gross_data[1:(i-1)]
b<-gross_data[i:length(gross_data)]
gross_data<-append(a,list("NA"))
gross_data<-append(gross_data,b)
}
[1] 100
summary(gross_data)
Step 11: Now we have successfully scraped all 11 features for the 100 most popular feature films released in 2016. Let's combine them to create a dataframe and inspect its structure.
#Combining all the lists to form a data frame
movies_df<-data.frame(Rank = rank_data, Title = title_data,
str(movies_df)
$ Runtime : num 108 107 111 139 116 92 115 128 111 116 ...
$ Rating : num 7.2 7.7 7.6 8.2 7 6.5 6.1 8.4 6.3 8 ...
You have now successfully scraped the IMDb website for the 100 most popular feature films released in 2016.
ggplot(movies_df,aes(x=Runtime,y=Rating))+
geom_point(aes(size=Votes,col=Genre))
Question 2: Based on the above data, in the Runtime of 130-160 mins, which genre has the highest
votes?
ggplot(movies_df,aes(x=Runtime,y=Gross_Earning_in_Mil))+
geom_point(aes(size=Rating,col=Genre))
Question 3: Based on the above data, across all genres, which genre has the highest average gross earnings in the runtime of 100 to 120 mins?
End Notes
I believe this article would have given you a complete understanding of web scraping in R. Now, you also have a fair idea of the problems you might come across and how you can make your way around them. As most of the data on the web is present in an unstructured format, web scraping is a really handy skill for any data scientist.
Comprehensive Guide on t-SNE algorithm with
implementation in R & Python
Introduction
Imagine you get a dataset with hundreds of features (variables) and have little understanding of the domain the data belongs to. You are expected to identify hidden patterns in the data, and to explore and analyze the dataset. And not just that, you have to find out whether there is a pattern in the data: is it signal or is it just noise?
Does that thought make you uncomfortable? It made my hands sweat when I came across this situation for the first time. Do you wonder how to explore a multidimensional dataset? It is one of the questions frequently asked by many data scientists. In this article, I will take you through a very powerful way to do exactly this.
Table of Contents
1. What is t-SNE?
2. What is dimensionality reduction?
3. How does t-SNE fit in the dimensionality reduction algorithm space
4. Algorithmic details of t-SNE
Algorithm
Time and Space Complexity
5. What does t-SNE actually do?
6. Use cases
7. t-SNE compared to other dimensionality reduction algorithm
8. Example Implementations
In R
Hyper parameter tuning
Code
Implementation Time
Interpreting Results
In Python
Hyper parameter tuning
Code
Implementation Time
9. Where and when to use
Data Scientist
Machine Learning Competition Enthusiast
Student
10. Common fallacies
1. What is t-SNE?
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction algorithm used for exploring high-dimensional data. It maps multi-dimensional data to two or more dimensions suitable for human observation. With the help of the t-SNE algorithm, you may have to plot fewer exploratory data analysis plots the next time you work with high-dimensional data.
2. What is dimensionality reduction?
In order to understand how t-SNE works, let's first understand what dimensionality reduction is. Well, in simple terms, dimensionality reduction is the technique of representing multi-dimensional data (data with multiple features having a correlation with each other) in 2 or 3 dimensions.
Some of you might question why we need dimensionality reduction when we can plot the data using scatter plots, histograms and boxplots, and make sense of the pattern in the data using descriptive statistics.
Well, even if you can understand the patterns in data and present it on simple charts, it is still difficult
for anyone without statistics background to make sense of it. Also, if you have hundreds of features,
you have to study thousands of charts before you can make sense of this data. (Read more about
dimensionality reduction here)
With the help of dimensionality reduction algorithm, you will be able to present the data explicitly.
4.1 Algorithm
Step 1
Stochastic Neighbor Embedding (SNE) starts by converting the high-dimensional Euclidean distances between data points into conditional probabilities that represent similarities. The similarity of datapoint x_j to datapoint x_i is the conditional probability, p_{j|i}, that x_i would pick x_j as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at x_i. For nearby datapoints, p_{j|i} is relatively high, whereas for widely separated datapoints, p_{j|i} will be almost infinitesimal (for reasonable values of the variance of the Gaussian, σ_i). Mathematically, the similarity between points is

p_{j|i} = exp(-||x_i - x_j||^2 / 2σ_i^2) / Σ_{k≠i} exp(-||x_i - x_k||^2 / 2σ_i^2)
Step 2
In the low-dimensional map, SNE defines a similar conditional probability for the counterpart points y_i and y_j, using a Gaussian with a fixed variance:

q_{j|i} = exp(-||y_i - y_j||^2) / Σ_{k≠i} exp(-||y_i - y_k||^2)

Note that p_{i|i} and q_{i|i} are set to zero, as we only want to model pairwise similarity.
In simple terms, step 1 and step 2 calculate the conditional probability of similarity between a pair of points in:
1. the high-dimensional space
2. the low-dimensional space
For the sake of simplicity, let's understand this with an example: mapping a 3D space to a 2D space. What step 1 and step 2 do is calculate the probability of similarity of points in the 3D space and the probability of similarity of points in the corresponding 2D space.
Logically, the conditional probabilities p_{j|i} and q_{j|i} must be equal for a perfect representation of the similarity of the datapoints in the different dimensional spaces, i.e. the difference between p_{j|i} and q_{j|i} must be zero for a perfect replication of the plot in high and low dimensions.
By this logic, SNE attempts to minimize this difference of conditional probabilities.
Step 3
Now here is the difference between the SNE and t-SNE algorithms.
To measure the minimization of the sum of differences of conditional probabilities, SNE minimizes the sum of Kullback-Leibler divergences over all data points using a gradient descent method:

C = Σ_i KL(P_i || Q_i) = Σ_i Σ_j p_{j|i} log(p_{j|i} / q_{j|i})

We must know that KL divergences are asymmetric in nature. In other words, the SNE cost function focuses on retaining the local structure of the data in the map (for reasonable values of the variance of the Gaussian in the high-dimensional space, σ_i).
Step 4
If we look at the equation for the conditional probability, we have left the variance out of the discussion so far. The remaining parameter to be selected is the variance σ_i of the Gaussian that is centered over each high-dimensional datapoint x_i. It is not likely that there is a single value of σ_i that is optimal for all data points in the data set, because the density of the data is likely to vary: in dense regions, a smaller value of σ_i is usually more appropriate than in sparser regions. Any particular value of σ_i induces a probability distribution, P_i, over all of the other data points. This distribution has an entropy which increases as σ_i increases. t-SNE performs a binary search for the value of σ_i that produces a P_i with a fixed perplexity that is specified by the user. The perplexity is defined as

Perp(P_i) = 2^{H(P_i)}

where H(P_i) is the Shannon entropy of P_i measured in bits:

H(P_i) = -Σ_j p_{j|i} log2 p_{j|i}
The perplexity can be interpreted as a smooth measure of the effective number of neighbors. The
performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and
50.
The minimization of the cost function is performed using gradient descent. Physically, the gradient may be interpreted as the resultant force created by a set of springs between the map point y_i and all other map points y_j. All springs exert a force along the direction (y_i - y_j). The spring between y_i and y_j repels or attracts the map points depending on whether the distance between the two in the map is too small or too large to represent the similarities between the two high-dimensional datapoints. The force exerted by the spring between y_i and y_j is proportional to its length, and also proportional to its stiffness, which is the mismatch (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) between the pairwise similarities of the data points and the map points [1].
6. Use cases
You may ask, what are the use cases of such an algorithm. t-SNE can be used on almost all high
dimensional data sets. But it is extensively applied in Image processing, NLP, genomic data and speech
processing. It has been utilized for improving the analysis of brain and heart scans. Below are a few
examples:
1. In R
The Rtsne package has an implementation of t-SNE in R. The Rtsne package can be installed in R using the following command typed in the R console:
install.packages("Rtsne")
Code
MNIST data can be downloaded from the MNIST website and can be converted into a csv file with a small amount of code. For this example, please download the following preprocessed MNIST data: link.
## calling the installed package
library(Rtsne)
## Choose the train.csv file downloaded from the link above
train <- read.csv(file.choose())
## Curating the database for analysis with both t-SNE and PCA
Labels<-train$label
train$label<-as.factor(train$label)
## for plotting
colors = rainbow(length(unique(train$label)))
names(colors) = unique(train$label)
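## Running t-SNE on the curated data (a sketch: the original call is not
## shown in this excerpt; dropping the first column and the perplexity and
## max_iter values are assumptions)
exeTimeTsne <- system.time(tsne <- Rtsne(train[,-1], dims = 2, perplexity = 30,
                                         verbose = TRUE, max_iter = 500))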
## Plotting
plot(tsne$Y, t='n', main="tsne")
text(tsne$Y, labels=train$label, col=colors[train$label])
Implementation Time
exeTimeTsne
user system elapsed
118.037 0.000 118.006
exectutiontimePCA
user system elapsed
11.259 0.012 11.360
As can be seen t-SNE takes considerably longer time to execute on the same sample size of data than
PCA.
Interpreting Results
The plots can be used for exploratory analysis. The output x and y coordinates, as well as the cost, can be used as features in classification algorithms.
2. In Python
An important thing to note is that pip install tsne produces an error; installing the tsne package is not recommended. The t-SNE algorithm can be accessed from the sklearn package instead.
Hyper parameter tuning
Code
The following code is taken from the sklearn examples on the sklearn website.
## importing the required packages
from time import time
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import offsetbox
from sklearn import (manifold, datasets, decomposition, ensemble,
discriminant_analysis, random_projection)
## Loading and curating the data
digits = datasets.load_digits(n_class=10)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
n_neighbors = 30
## Function to Scale and visualize the embedding vectors
def plot_embedding(X, title=None):
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)
    plt.figure()
    ax = plt.subplot(111)
    for i in range(X.shape[0]):
        plt.text(X[i, 0], X[i, 1], str(digits.target[i]),
                 color=plt.cm.Set1(y[i] / 10.),
                 fontdict={'weight': 'bold', 'size': 9})
    if hasattr(offsetbox, 'AnnotationBbox'):
        ## only print thumbnails with matplotlib > 1.0
        shown_images = np.array([[1., 1.]])  # just something big
        for i in range(digits.data.shape[0]):
            dist = np.sum((X[i] - shown_images) ** 2, 1)
            if np.min(dist) < 4e-3:
                ## don't show points that are too close
                continue
            shown_images = np.r_[shown_images, [X[i]]]
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r),
                X[i])
            ax.add_artist(imagebox)
    plt.xticks([]), plt.yticks([])
    if title is not None:
        plt.title(title)

#----------------------------------------------------------------------
## Plot images of the digits
n_img_per_row = 20
img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
for i in range(n_img_per_row):
    ix = 10 * i + 1
    for j in range(n_img_per_row):
        iy = 10 * j + 1
        img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))

plt.imshow(img, cmap=plt.cm.binary)
plt.xticks([])
plt.yticks([])
plt.title('A selection from the 64-dimensional digits dataset')
## Computing PCA
print("Computing PCA projection")
t0 = time()
X_pca = decomposition.TruncatedSVD(n_components=2).fit_transform(X)
plot_embedding(X_pca,
"Principal Components projection of the digits (time %.2fs)" %
(time() - t0))
## Computing t-SNE
print("Computing t-SNE embedding")
tsne = manifold.TSNE(n_components=2, init='pca', random_state=0)
t0 = time()
X_tsne = tsne.fit_transform(X)
plot_embedding(X_tsne,
"t-SNE embedding of the digits (time %.2fs)" %
(time() - t0))
plt.show()
Implementation Time
Tsne: 13.40 s
PCA: 0.01 s
9. Where and When to use t-SNE?
9.1 Data Scientist
Well, for the data scientist the main problem while using t-SNE is the black-box nature of the algorithm. This impedes the process of providing inferences and insights based on the results. Also, another problem with the algorithm is that it doesn't always produce a similar output on successive runs.
So then how could you use the algorithm? The best way to use the algorithm is for exploratory data analysis. It will give you a very good sense of the patterns hidden inside the data. It can also be used as an input for other classification and clustering algorithms.
Reference
[1] L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008.
[2] Jizheng Yi et al. Facial expression recognition based on t-SNE and AdaBoostM2. IEEE International Conference on Green Computing and Communications and IEEE Internet of Things and IEEE Cyber, Physical and Social Computing (2013).
[3] Walid M. Abdelmoula et al. Data-driven identification of prognostic tumor subpopulations using spatially mapped t-SNE of mass spectrometry imaging data. PNAS, October 25, 2016, vol. 113, no. 43, 12244-12249.
[4] Hendrik Heuer. Text comparison using word vector representations and dimensionality reduction. 8th European Conference on Python in Science (EuroSciPy 2015).
End Notes
I hope you enjoyed reading this article. In this article, I have tried to explore all the aspects to help you get started with t-SNE. I'm sure you must be excited to explore more of the t-SNE algorithm and use it at your end.
Share your experience of working with the t-SNE algorithm, and whether you think it's better than PCA. If you have any doubts or questions, feel free to post them in the comments section.