Data Analysis Course Notes
S_n^2 = p(1 - p): the margin of error reaches its maximum when p = 0.5 (the derivative of p(1 - p) is 1 - 2p, which vanishes at p = 0.5).
Systematic sampling: k = N/n; randomly select d from 1 to k, then form the sample d, d+k, d+2k, …
This can give a biased sample (e.g., when you take statistics for every Thursday across all the weeks, a weekly pattern lines up with the interval).
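A minimal sketch of this systematic draw in R; the population table and the sizes (N = 1000, n = 50) are toy assumptions, not from the course:

    population <- data.frame(id = 1:1000, value = rnorm(1000))  # toy population (assumed)
    N <- nrow(population)
    n <- 50                               # desired sample size (assumed)
    k <- floor(N / n)                     # sampling interval k = N/n
    d <- sample(1:k, 1)                   # random start between 1 and k
    idx <- seq(from = d, to = N, by = k)  # d, d+k, d+2k, ...
    my_sample <- population[idx, ]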
Stratified sampling: fix from the beginning the must-haves in each stratum. It is easy to implement; the hard part is knowing the criteria that affect the result.
Disproportionate stratified sampling (DISP): you can take different sampling rates in different strata; this creates an imbalance, which is fixed with the weight w_h = N_h / N.
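A minimal sketch of the re-weighting; the stratum sizes and sample means below are made-up numbers for illustration:

    Nh   <- c(A = 600, B = 300, C = 100)    # stratum sizes in the population (assumed)
    xbar <- c(A = 12.1, B = 9.4, C = 15.0)  # sample mean inside each stratum (assumed)
    w    <- Nh / sum(Nh)                    # weights w_h = N_h / N
    sum(w * xbar)                           # re-weighted overall mean estimate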
Cluster sampling: we do a census (a complete study of all the elements) inside each selected cluster.
- We compute the sample size with the proportion p = 0.5 (the worst case) to get an idea of the required sample size.
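A minimal sketch using the standard sample-size formula for a proportion, n = z^2 * p * (1 - p) / e^2; the 95% confidence level and 5% margin of error are assumptions, not from the notes:

    z <- qnorm(0.975)  # about 1.96 for a 95% confidence level (assumed)
    p <- 0.5           # worst case: p * (1 - p) is maximal at 0.5
    e <- 0.05          # desired margin of error (assumed)
    n <- ceiling(z^2 * p * (1 - p) / e^2)
    n                  # 385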
Quota sampling: it is like stratified sampling (similar, yet not exactly the same) but harder: you call people at random to fill the sample sizes you need; you may have filled one quota and still reach people who fit it while other quotas are still without data.
Undersampling: from the group that turned out too large, we take only what we need.
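A minimal sketch of undersampling the larger group; the labels and group sizes are toy assumptions:

    df <- data.frame(label = c(rep("big", 900), rep("small", 100)))  # toy data (assumed)
    n_small  <- sum(df$label == "small")
    keep_big <- sample(which(df$label == "big"), n_small)  # take only what we need
    balanced <- df[c(keep_big, which(df$label == "small")), ]
    table(balanced$label)  # big: 100, small: 100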
We always assume that the sample is representative (drawn at random and big enough); otherwise you can't run any hypothesis test or build any confidence interval.
In this chapter we will determine the relationship between two features, which can be:
- a mix of qualitative and quantitative (two-means comparison test / ANOVA test)
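A minimal sketch of both tests on toy data; the groups and their means are assumptions for illustration:

    df <- data.frame(group = rep(c("A", "B", "C"), each = 20),
                     value = rnorm(60, mean = rep(c(10, 12, 11), each = 20)))
    # Comparison of two means (qualitative feature with 2 levels vs a quantitative one)
    t.test(value ~ group, data = subset(df, group %in% c("A", "B")))
    # ANOVA (qualitative feature with 3 or more levels)
    summary(aov(value ~ group, data = df))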
To find the best projection for the data (project while keeping the maximum amount of information)
PCA: reduce the dimension while keeping the maximum amount of information, clearly dispersed
PCA is designed for numerical data
PCA simplifies things and makes them easier (e.g., moving from a 10-dimensional space to a 2-dimensional one).
Inertia: the sum of squared distances between the data points and the center of gravity.
rate = (inertia after projection) / (inertia before projection), expressed as a %: the bigger, the better.
We choose the projection with the highest rate and say that this projection explains rate% of the data variability.
Before PCA we have to fix the scale: normalization (needed when one feature dominates in the distance d^2 = (X_a - X_b)^2 + (Y_a - Y_b)^2).
We normalize / standardize so that all the features carry the same weight in the distance.
Scaling:
- Min-max: Z = (X - Xmin) / (Xmax - Xmin), so 0 ≤ Z ≤ 1
- Standardization: Z = (X - Xbar) / s_X, where s_X is the standard deviation of X
We unify the unit of measure of all the features.
We can choose not to scale when all the variables already share the same unit of measure.
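A minimal sketch of the two scalings on a toy vector:

    x <- c(2, 5, 9, 14, 20)                        # toy values (assumed)
    z_minmax <- (x - min(x)) / (max(x) - min(x))   # min-max: lands in [0, 1]
    z_std    <- (x - mean(x)) / sd(x)              # standardization: mean 0, sd 1
    # For a whole numeric table, scale(X) standardizes every column at once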
PCA steps:
- Data standardization
- Compute the correlation (covariance) matrix
- Compute the eigenvalues
- Each eigenvalue has its own eigenvector; sort them in decreasing order
- p eigenvalues correspond to p dimensions
How do we move to a smaller dimensional space?
We want to move from the p-dimensional space to an n-dimensional one, with n < p.
We take the n highest eigenvalues and work in the space formed by their corresponding eigenvectors.
We build the matrix "U" whose columns are these eigenvectors and multiply "Z" by it.
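A minimal sketch of the whole pipeline in base R; the toy matrix and its sizes (100 individuals, p = 5, n = 2) are assumptions:

    X <- matrix(rnorm(100 * 5), nrow = 100)  # toy data: 100 individuals, p = 5 features
    Z <- scale(X)                            # standardization
    S <- cor(Z)                              # correlation matrix
    e <- eigen(S)                            # eigenvalues and eigenvectors
    cumsum(e$values) / sum(e$values)         # explained-inertia rate per kept dimension
    U <- e$vectors[, 1:2]                    # eigenvectors of the n = 2 highest eigenvalues
    scores <- Z %*% U                        # coordinates in the new 2-D space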
Any pair of eigenvectors forms a plane (a factorial plane / factor map).
- We had individuals and their features; we can flip them so that the features become the objects described by the individuals (the roles are swapped).
Circle, sphere, hypersphere (depending on the number of dimensions).
Communality(high jump) = projection quality on the principal factor map (the map defined by comp1 and comp2)
= cos^2(high jump, comp1) + cos^2(high jump, comp2)
= the projection quality
= the sum of the two cos^2 values
Application
Communality(high jump) = 0.32 + 0.12 = 0.44: bad projection quality.
Sum > 0.6: acceptable projection quality.
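A minimal sketch of the rule; the high-jump cos^2 values come from the application above, while the long-jump row is made up for contrast:

    cos2 <- rbind(high.jump = c(0.32, 0.12),   # from the application above
                  long.jump = c(0.55, 0.20))   # assumed values for contrast
    communality <- rowSums(cos2)               # 0.44 and 0.75
    communality > 0.6                          # FALSE, TRUE: only long.jump projects well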
We look at who or what each dimension represents the most; then, for each point close to a dimension, we can conclude that it performs well on that dimension.
The relative and the absolute contributions give the same result.
- There is a command to plot other dimensions on the factor map (see the sketch below).
FactoMineR is considered the best R package for PCA.
In R: install.packages("namepackage")
To install it: install.packages("FactoMineR")
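A minimal sketch with FactoMineR; the decathlon dataset (which contains High.jump) ships with the package, and plotting axes 3 and 4 is just an example of adding other dimensions to the factor map:

    install.packages("FactoMineR")   # one-time install
    library(FactoMineR)
    data(decathlon)                  # athletes x events, includes High.jump
    res <- PCA(decathlon[, 1:10], scale.unit = TRUE, graph = FALSE)
    res$eig                          # eigenvalues and % of variability explained
    rowSums(res$var$cos2[, 1:2])     # communalities on the comp1/comp2 map
    plot(res, axes = c(3, 4), choix = "var")  # plot other dimensions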