Summary of The First Ten Chapters
Important terms:
Descriptive statistics: Methods for summarizing and analyzing the data we already have.
Inferential statistics: Methods for drawing conclusions about data we have not observed, such as future
outcomes, usually stated as the probability of a coming event.
Exit polls: Known in Danish as "valgstedsmålinger" or "udgangsmålinger". These are surveys or opinion
polls conducted with voters immediately after they have cast their vote in an election. The purpose of exit
polls is to collect data on how people voted, along with their demographic characteristics and other
relevant information. They are often used to predict election results and analyze voter behavior.
Parameter: A parameter has different meanings depending on the context in which it is used. In statistics it
is a descriptive measure of a population, such as the mean, median or standard deviation (STD).
Population: The whole group that we are trying to get information about.
Sample: A small subset drawn from the population; we take our measurements on the sample instead.
Chapter summary:
Descriptive techniques are an easy way to present information from nominal data. Nominal data are data
where the order of the categories has no meaning.
Here we can use bar charts, pie charts and frequency distributions to summarize a single set of nominal
data. They show the frequency and proportion of each category.
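As a minimal sketch of what "frequency and proportion of each category" means in practice, the snippet below tabulates a small nominal dataset. The responses are made up for illustration and do not come from the book.

```python
from collections import Counter

# Hypothetical nominal data: preferred transport mode of eight respondents.
responses = ["car", "bike", "bus", "car", "bike", "car", "train", "bus"]

frequency = Counter(responses)              # count of each category
total = sum(frequency.values())             # number of observations
proportion = {cat: count / total for cat, count in frequency.items()}

# Print a small frequency table, one row per category.
for cat in sorted(frequency):
    print(f"{cat:<6} frequency={frequency[cat]}  proportion={proportion[cat]:.3f}")
```

These counts and proportions are what a bar chart or pie chart would display.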
If the dataset has two nominal variables, we can use a cross-classification table and bar charts.
We also described the difference between time-series data and cross-sectional data. See the picture below:
Histograms are used to describe single sets of interval data. To analyze the relationship between two
interval variables we draw a scatter diagram and look for a linear relationship (correlation).
Scatter diagram:
4. Numerical descriptive techniques
The chapter extended our discussion of descriptive statistics, which deals with methods of
summarizing and presenting the essential information contained in a set of data. We can now describe
our dataset with numerical methods. The three most popular measures of central location are the
mean, median and mode.
But they do not say how much the data vary. Information about the variability of interval data is
conveyed by numerical measures such as the range, variance and standard deviation.
Range: The difference between the largest and the smallest number in the dataset.
Definition: Variance is a measure of how much individual data points deviate from the mean (average) of
the dataset. It quantifies the average of the squared differences between each data point and the mean.
Standard deviation: The square root of the variance. The standard deviation is widely used because it
provides a measure of dispersion that is in the same units as the original data. It is easier to interpret than
the variance, and a smaller standard deviation indicates less variability.
In summary, the range measures the spread by considering the difference between the highest and lowest
values, while the variance and standard deviation provide more detailed information by considering how all
data points deviate from the mean. The standard deviation is particularly useful because it is both
interpretable and commonly used in statistics to assess the variability within a dataset.
In practice the standard deviation is often used more than the variance, because it gives a more meaningful
measure of variation that is easier to interpret and compare with the original data points. The variance can
still be useful in some statistical calculations and analyses, even though it is usually not the most intuitive
measure of spread.
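The three measures of variability can be computed directly, as in this minimal sketch. The data are hypothetical. Note that Python's `statistics.variance` and `statistics.stdev` use the sample formulas (dividing by n - 1), which is an assumption about which version of the variance is wanted.

```python
import statistics

data = [4, 8, 6, 5, 3, 10]  # hypothetical sample of six observations

data_range = max(data) - min(data)            # largest minus smallest value
sample_variance = statistics.variance(data)   # squared deviations, divided by n - 1
sample_std = statistics.stdev(data)           # square root of the sample variance

print(data_range, sample_variance, round(sample_std, 3))
```

The standard deviation is in the same units as the data, while the variance is in squared units.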
For the special case in which a sample of measurements has a mound-shaped distribution, the Empirical
Rule provides a good approximation of the percentages of measurements that fall within one, two and
three standard deviations of the mean. Chebysheff's Theorem applies to all sets of data no matter the shape
of the histogram. Measures of relative standing that were presented in this chapter are percentiles and
quartiles. The linear relationship between two interval variables is measured by the covariance, the
coefficient of correlation, the coefficient of determination and the least squares line.
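To compare the two rules, here is a small sketch. It assumes the usual statement of Chebysheff's Theorem (at least 1 - 1/k² of the observations lie within k standard deviations, for k > 1) and the usual approximate Empirical Rule percentages. The function name is chosen for illustration.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("The bound is only informative for k > 1")
    return 1 - 1 / k**2

# Approximate Empirical Rule proportions, valid for mound-shaped data only.
empirical_rule = {1: 0.68, 2: 0.95, 3: 0.997}

for k in (2, 3):
    print(f"k={k}: Chebysheff at least {chebyshev_lower_bound(k):.3f}, "
          f"Empirical Rule about {empirical_rule[k]}")
```

Chebysheff's bound is weaker because it has to hold for every possible shape of histogram.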
Important terms:
5. Data collection and sampling
Summary:
- Because most populations are very large, it is extremely costly and impractical to investigate each
member of the population to determine the values of the parameters. As a practical alternative, we
take a sample from the population and use the sample statistics to draw inferences about the
parameters. Care must be taken to ensure that the sampled population is the same as the target
population.
- We can choose from among several different sampling plans, including simple random
sampling, stratified random sampling and cluster sampling. Whatever sampling plan is
used, it is important to realize that both sampling error and non-sampling error will occur
and to understand what the sources of these errors are.
Important terms:
6. Probability
Summary:
Important terms:
CHAPTER SUMMARY: There are two types of random variables. A discrete random variable is one
whose values are countable. A continuous random variable can assume an uncountable number of
values. In this chapter, we discussed discrete random variables and their probability distributions.
We defined the expected value, variance and standard deviation of a population represented by a
discrete probability distribution. Also introduced in this chapter were bivariate discrete
distributions on which an important application in finance was based. Finally, the two most
important discrete distributions (the binomial and the Poisson) were presented.
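The following sketch shows how the expected value, variance and standard deviation of a discrete distribution can be computed, using a binomial distribution as the example. The helper names and the choice of n = 4, p = 0.5 are assumptions made for illustration.

```python
from math import comb, sqrt

def expected_value(dist: dict) -> float:
    """E(X) = sum of x * P(x) over all values x."""
    return sum(x * p for x, p in dist.items())

def variance(dist: dict) -> float:
    """V(X) = sum of (x - mu)^2 * P(x) over all values x."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Hypothetical example: number of successes in n = 4 trials with p = 0.5.
dist = {x: binomial_pmf(x, 4, 0.5) for x in range(5)}
mu = expected_value(dist)   # equals n * p for a binomial
var = variance(dist)        # equals n * p * (1 - p) for a binomial
sd = sqrt(var)
print(mu, var, sd)
```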
Important terms:
8. Continuous probability distributions
Important terms:
This chapter dealt with continuous random variables and their distributions. Because a continuous
random variable can assume an infinite number of values, the probability that the random variable
equals any single value is zero. Consequently, we addressed the problem of computing the
probability of a range of values. We showed that the probability of any interval is the area in the
interval under the curve representing the density function. We introduced the most important
distribution in statistics (the normal distribution) and showed how to compute the probability
that a normal random variable falls into any interval. Additionally, we demonstrated how to use
the normal table backwards to find values of a normal random variable given a probability. Next,
we introduced the exponential distribution, a distribution that is particularly useful in several
management science applications. Finally, we presented three more continuous random variables
and their probability density functions. The Student t, chi-squared and F distributions will be used
extensively in statistical inference.
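Instead of a printed normal table, both directions of the lookup can be sketched with Python's standard library. The mean of 100 and standard deviation of 15 are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

# Hypothetical normal random variable with mean 100 and standard deviation 15.
x = NormalDist(mu=100, sigma=15)

# Forward use of the table: probability of an interval = area under the curve.
p_interval = x.cdf(115) - x.cdf(85)    # P(85 < X < 115)

# Backward use of the table: the value that has 95% of the area below it.
cutoff_95 = x.inv_cdf(0.95)

print(round(p_interval, 4), round(cutoff_95, 2))
```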
9. Sampling distributions
The sampling distribution of a statistic is created by repeated sampling from one population. In this
chapter, we introduced the sampling distribution of the mean, the proportion and the difference
between two means. We described how these distributions are created theoretically and
empirically.
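Creating a sampling distribution "empirically" can be sketched with a small simulation. The population (a fair six-sided die), the sample size and the number of repetitions are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 25
NUM_SAMPLES = 5000

def one_sample_mean() -> float:
    """Draw one sample from the die population and return its mean."""
    sample = [random.randint(1, 6) for _ in range(SAMPLE_SIZE)]
    return statistics.fmean(sample)

# Repeated sampling from the same population gives the empirical distribution.
sample_means = [one_sample_mean() for _ in range(NUM_SAMPLES)]
mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)

# Theory: population mean 3.5, and standard error sigma / sqrt(n),
# where the die has variance 35/12.
theoretical_sd = (35 / 12) ** 0.5 / SAMPLE_SIZE ** 0.5
print(round(mean_of_means, 3), round(sd_of_means, 3), round(theoretical_sd, 3))
```

The simulated mean and standard deviation of the sample means should land close to the theoretical values.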
Important terms:
10. Introduction to estimation
This chapter introduced the concepts of estimation and the estimator of a population mean when the
population variance is known. It also presented a formula to calculate the sample size necessary to estimate
a population mean.
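The notes do not write the formulas out, so the sketch below assumes the standard forms: the interval x̄ ± z(α/2)·σ/√n and the sample size n = (z(α/2)·σ / B)², where B is the desired bound on the error. The function names and the numbers are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_interval(x_bar: float, sigma: float, n: int, confidence: float = 0.95):
    """Confidence interval for the mean when the population sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * sigma / sqrt(n)
    return x_bar - half_width, x_bar + half_width

def required_sample_size(sigma: float, bound: float, confidence: float = 0.95) -> int:
    """Smallest n that estimates the mean to within +/- bound."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / bound) ** 2)

# Hypothetical numbers: sample mean 50, known sigma 10, n = 25.
lower, upper = z_interval(x_bar=50, sigma=10, n=25)
n_required = required_sample_size(sigma=10, bound=2)
print(round(lower, 2), round(upper, 2), n_required)
```

The sample size is rounded up because a fraction of an observation cannot be sampled.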
Important terms:
11. Introduction to hypothesis testing
H0: the value is 400,000 kr.
H1: the value is different from 400,000 kr. (<>, that is, either more or less).
See below:
There are two hypotheses. One is called the null hypothesis and the other the alternative or
research hypothesis. The usual notation is:
H0: µ = 350
H1: µ ≠ 350
If the data do not contradict H0, we conclude that there is not enough evidence to support the
alternative hypothesis (also stated as: not rejecting the null hypothesis in favor of the alternative).
Once the null and alternative hypotheses are stated, the next step is to randomly sample the
population and calculate a test statistic (in this example, the sample mean).
For example, if we're trying to decide whether the mean is not equal to 350, a large value of
the sample mean x̄ (say, 600) would provide enough evidence.
If x̄ is close to 350 (say, 355), we could not say that this provides a great deal of evidence to infer
that the population mean is different from 350.
11.2 TESTING THE POPULATION MEAN WHEN THE POPULATION STANDARD
DEVIATION IS KNOWN
Rejection Region
It seems reasonable to reject the null hypothesis if the sample mean is large relative to 170, say
500. But if the sample mean is close to 170, say 171, that value does not give us reason to reject
the null hypothesis.
In this example the hypotheses are:
H0: µ ≤ 170 (do not install the new system)
H1: µ > 170 (install the new system)
Information:
n = 400 (sample size)
x̄ = 178 (sample mean)
µ = 170 (population mean under H0)
σ = 65 (population standard deviation)
Rejection Region:
The rejection region is a range of values such that if the statistic falls into that range, we decide to
reject the null hypothesis in favor of the alternative hypothesis.
Suppose that we define the value of the sample mean that is just large enough to reject the null
hypothesis as x̄_L. The rejection region is
x̄ > x̄_L
We know from Section 9-1 that the sampling distribution of x̄ is normal with mean µ and standard
deviation σ/√n.
To calculate the rejection region, we need a value of α, the significance level. Suppose that the
manager chose α to be 5%. It follows that z_α = z_0.05 = 1.645. We can now calculate the value of x̄_L:
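The formula itself is missing from these notes. Assuming the usual form x̄_L = µ + z_α·σ/√n, the calculation with the numbers above can be sketched as follows (the variable names are chosen for illustration):

```python
from math import sqrt

mu_0 = 170       # mean under the null hypothesis
sigma = 65       # known population standard deviation
n = 400          # sample size
z_alpha = 1.645  # z value for a 5% significance level, from the notes

# Smallest sample mean that falls in the rejection region.
x_bar_L = mu_0 + z_alpha * sigma / sqrt(n)
print(round(x_bar_L, 4))
```

This gives roughly 175.35; the notes quote 175.34, which looks like the same value cut off after two decimals.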
The sample mean was computed to be 178. Because the test statistic (sample mean) is in the rejection
region (it is greater than 175.34), we reject the null hypothesis in favor of the research
hypothesis. Thus there is sufficient evidence to infer that the mean monthly account is greater
than €170.
An easier method specifies that the test statistic be the standardized value of x̄; that is, we use the
standardized test statistic z = (x̄ − µ) / (σ/√n).
The rejection region consists of all values of z that are greater than z_α = 1.645.
Because 2.46 is greater than 1.645 we reject the null hypothesis and conclude that there is enough
evidence to infer that the mean monthly account is greater than 170.
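The value 2.46 can be reproduced with a short calculation, sketched here with the numbers given earlier in this section (variable names are illustrative):

```python
from math import sqrt

x_bar = 178      # sample mean
mu_0 = 170       # mean under the null hypothesis
sigma = 65       # known population standard deviation
n = 400          # sample size
z_alpha = 1.645  # critical value for a 5% significance level

# Standardized test statistic: distance from mu_0 in standard errors.
z = (x_bar - mu_0) / (sigma / sqrt(n))
reject_h0 = z > z_alpha
print(round(z, 2), reject_h0)
```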
11-2c p-Value
The smaller the p-value, the more statistical evidence exists to support the alternative hypothesis.
We observe a p-value of 0.0069, hence there is overwhelming evidence to support H1: µ > 170.
Compare the p-value with the selected value of the significance level:
If the p-value is less than α, we judge the p-value to be small enough to reject the null hypothesis.
If we reject the null hypothesis, we conclude that there is enough evidence to infer that the
alternative hypothesis is true.
If we do not reject the null hypothesis, we conclude that there is not enough statistical evidence to
infer that the alternative hypothesis is true.
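As a sketch of where the p-value of 0.0069 comes from: for this right-tailed test it is the area to the right of the test statistic under the standard normal curve. The code reuses the example's numbers; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu_0, sigma, n = 178, 170, 65, 400
alpha = 0.05

z = (x_bar - mu_0) / (sigma / sqrt(n))   # standardized test statistic
p_value = 1 - NormalDist().cdf(z)        # P(Z > z) for a right-tailed test

reject_h0 = p_value < alpha
print(round(p_value, 4), reject_h0)
```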
Remember: The alternative hypothesis is the more important one. It represents what we are
investigating.
In other words, if the null hypothesis were true, it would be very unlikely to observe a sample mean of
178 or more.
The p-value of a test provides valuable information because it is a measure of the amount of statistical
evidence that supports the alternative hypothesis.
The previous two chapters introduced statistical inference through estimation and hypothesis testing.
They assumed that the population standard deviation (STD) was known, but in practice it is usually
unknown, so they are somewhat unrealistic.
In Section 12-1, we describe how to make inferences about the population mean under the more
realistic circumstance when the population standard deviation is unknown.
In Section 12-2, we continue to deal with interval data, but our parameter of interest becomes the
population variance.
The z-statistic is replaced by the t-statistic, t = (x̄ − µ) / (s/√n), where the number of "degrees of freedom", ν, is n − 1.
https://www.youtube.com/watch?v=tI6mdx3s0zk
Using the t Table to Find the P-value in One-Sample t Tests
In this section, we take a more realistic approach by acknowledging that if the population mean is
unknown, then so is the population standard deviation.
Instead, we substitute the sample standard deviation s in place of the unknown population
standard deviation σ.
We will no longer use the z-statistic and the z-estimator of μ. All future inferential problems
involving a population mean will be solved using the t-statistic and t-estimator of μ shown in the
preceding boxes.
The overall procedure: find s², take its square root to get s, find the critical t value using the
degrees of freedom, and finally put the numbers into the formula.
To find the critical value at the 1% significance level we look in a t table, where we need the degrees
of freedom. We get them as n − 1 = degrees of freedom. For example, with n = 148 (in the book, at the
start of Section 12-1) this is 148 − 1 = 147. We then look that up in the t table.
Next we need the sample mean. First we find the sum of the observations, written ∑x. Then we find
∑x², which is the sum of the squared observations.
From these numbers we find the mean by dividing the sum ∑x by the number of observations (148).
Because the variance is given as s², we need to convert it to s by taking the square root.
Now we have found s (the standard deviation) and can put the numbers into our formula.
Because 2.23 is not greater than 2.351, we cannot reject the null hypothesis in favour of the
alternative.
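The steps (∑x, ∑x², mean, s², s, then t) can be sketched as below. The data are hypothetical and deliberately small, so the numbers differ from the book's example with n = 148. The shortcut formula s² = (∑x² − (∑x)²/n) / (n − 1) and the right-tailed test are assumptions. The critical value 3.747 is the standard t-table entry for a 1% one-tail area with 4 degrees of freedom.

```python
from math import sqrt

def t_statistic_from_sums(sum_x: float, sum_x2: float, n: int, mu_0: float):
    """Return (mean, s, t, degrees of freedom) from the sums of x and x squared."""
    mean = sum_x / n
    s2 = (sum_x2 - sum_x**2 / n) / (n - 1)   # sample variance, shortcut formula
    s = sqrt(s2)                              # sample standard deviation
    t = (mean - mu_0) / (s / sqrt(n))         # t-statistic
    return mean, s, t, n - 1

# Hypothetical observations and hypothesized mean.
data = [2.1, 2.5, 1.9, 2.4, 2.6]
sum_x = sum(data)
sum_x2 = sum(x**2 for x in data)
mean, s, t, df = t_statistic_from_sums(sum_x, sum_x2, len(data), mu_0=2.0)

T_CRITICAL = 3.747  # t table: 1% one-tail area, 4 degrees of freedom
reject_h0 = t > T_CRITICAL
print(round(mean, 2), round(s, 4), round(t, 2), df, reject_h0)
```

As in the book's example, the test statistic here is below the critical value, so the null hypothesis is not rejected.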
In Section 12-1, where we presented the inferential methods about a population mean, we were
interested in acquiring information about the central location of the population. As a result, we
tested and estimated the population mean. If we are interested instead in drawing inferences
about a population's variability, the parameter we need to investigate is the population variance σ².
In short: Section 12-1 covered inference about the population mean, that is, where the middle of the
population is. Now we want to look at the population's variability.
This has many applications. In an example illustrating the use of the normal distribution in Section 8-2,
we showed why variance is a measure of risk for stocks. Quality technicians likewise attempt to ensure
that their company's products consistently meet specifications, which is also a question about variability.
We begin by identifying the best estimator. That estimator has a sampling distribution, from which
we produce the test statistic and the interval estimator.
The statistic s² has the desirable characteristics presented in Section 10-1; that is, s² is an
unbiased, consistent estimator of σ².
Statisticians have shown that the sum of squared deviations from the mean, Σ(xᵢ − x̄)² [which is
equal to (n − 1)s²], divided by the population variance is chi-squared distributed with ν = n − 1
degrees of freedom, provided that the sampled population is normal. The statistic
χ² = (n − 1)s² / σ² is called the chi-squared statistic.
Example 12.3
Remember to read the book carefully.
To find 13.85, we do the following: we want a 95% confidence level, so we look in the 0.95 column at
24 degrees of freedom, where we find approximately 13.85.
See Table 5 in the appendix, at the bottom of the page.
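A minimal sketch of how the table value 13.85 would be used, assuming the statistic χ² = (n − 1)s²/σ² and a left-tailed test (H1: σ² is smaller than the hypothesized value). The sample variance and hypothesized variance below are hypothetical; only the critical value and the 24 degrees of freedom come from the notes.

```python
n = 25                  # sample size, giving 24 degrees of freedom
s2 = 0.6333             # hypothetical sample variance
sigma2_0 = 1.0          # hypothetical variance under the null hypothesis
CHI2_CRITICAL = 13.85   # table value: 0.95 column, 24 degrees of freedom

# Chi-squared test statistic for a population variance.
chi2 = (n - 1) * s2 / sigma2_0

# Left-tailed test (assumed): reject H0 only if the statistic is below the critical value.
reject_h0 = chi2 < CHI2_CRITICAL
print(round(chi2, 2), reject_h0)
```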
Important terms:
https://www.youtube.com/watch?v=rpKzq64GA9Y
Video in case you are in doubt about what it is.
If the p-value is greater than alpha, you do not reject the null hypothesis. If it is less than alpha, you
reject the null hypothesis.
- The test statistic in a statistical test measures how much the data deviate from what is expected.
- For the chi-squared test it indicates how much the observed frequencies deviate from the
expected ones.
o Larger values indicate larger deviations. The statistic is compared with a critical value (or
its p-value with the significance level) to assess whether the deviation is significant and
whether the null hypothesis can be rejected.
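The idea in the bullets above can be sketched as follows, assuming the usual chi-squared statistic, the sum of (observed − expected)² / expected over all categories. The frequencies are hypothetical. The critical value 5.991 is the standard table entry for a 5% significance level with 2 degrees of freedom.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical frequencies for three categories (60 observations in total).
observed = [18, 22, 20]
expected = [20, 20, 20]

stat = chi_squared_statistic(observed, expected)

CHI2_CRITICAL = 5.991  # table value: 5% significance, 3 - 1 = 2 degrees of freedom
reject_h0 = stat > CHI2_CRITICAL
print(stat, reject_h0)
```

Here the observed frequencies are close to the expected ones, so the statistic is small and the null hypothesis is not rejected.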
Regression is a model that allows us to see the relationship between 2 or more variables.
With only one variable and no other information, the best prediction of the next observation is
the mean.
Residuals: A residual is the difference between the fitted line and the observed data. Residuals are also
called errors, because they show how far the observed data are from the best-fit line.
We square each residual. First, this makes all the numbers positive; second, it emphasizes larger
deviations.
See below:
If we add all the squared numbers up, we get the sum of squared residuals, also called the sum of
squared errors.
When we say sum of squared errors, we mean the sum of the squares of the errors.
The goal of a linear regression model is to minimize the sum of squared errors.
So when we introduce an independent variable, we create a different line through the data, one that
reduces the size of these squares; that line is the best-fit line for our data.
The only problem is that so far we are only using the dependent variable; we need an independent
variable.
- When we introduce the independent variable and put it in our regression model, it will
absorb some of the sum of squared errors.
When we say a regression model is good, we mean that it reduces the sum of squared errors
by a large amount. So a simple regression model is always compared with what
we would have if we only had the dependent variable.
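The comparison described above can be sketched by computing the sum of squared errors twice: once using only the mean of the dependent variable, and once using the least squares line. The data are hypothetical and the variable names are chosen for illustration.

```python
# Hypothetical data: x is the independent variable, y the dependent variable.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Model 1: predict every y with the mean of y (no independent variable).
sse_mean = sum((yi - y_mean) ** 2 for yi in y)

# Model 2: least squares line y_hat = b0 + b1 * x.
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
b1 = s_xy / s_xx              # slope
b0 = y_mean - b1 * x_mean     # intercept

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse_line = sum(r ** 2 for r in residuals)

# Share of the mean-only squared error that the line removes.
reduction = (sse_mean - sse_line) / sse_mean
print(b0, b1, sse_mean, sse_line, reduction)
```

The larger the reduction, the more the independent variable has absorbed from the sum of squared errors. This share is the coefficient of determination mentioned in the chapter 4 summary.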