Problem 1: Analyzing Educational Data
Problem Description:
In this statistical analysis assignment, we analyze a dataset (ps4data.xlsx) of educational variables. The objective is to compute descriptive statistics, construct confidence intervals, and conduct one-sample and two-sample t-tests.
Part a: Descriptive Statistics and Confidence Interval
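The code below assumes the dataset has already been loaded into a data frame named ps4data. A minimal sketch using the readxl package (the file name comes from the problem description; the path is assumed):
# Load the dataset (assumes ps4data.xlsx is in the working directory)
library(readxl)
ps4data <- read_excel("ps4data.xlsx")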
sample.mean <- mean(ps4data$educ)
print(sample.mean)
## [1] 7.044534
# Standard error
sample.n <- length(ps4data$educ)
sample.sd <- sd(ps4data$educ)
sample.se <- sample.sd/sqrt(sample.n)
print(sample.se)
## [1] 0.1061065
# t score corresponding to the 95% confidence interval
alpha <- 0.05
degrees.freedom <- sample.n - 1
t.score <- qt(p = alpha/2, df = degrees.freedom, lower.tail = FALSE)
print(t.score)
## [1] 1.963175
# Margin of error
margin.error <- t.score * sample.se
# Confidence interval
lower.bound <- sample.mean - margin.error
upper.bound <- sample.mean + margin.error
print(c(lower.bound,upper.bound))
## [1] 6.836229 7.252840
Outcome: The sample mean education level is 7.044534, and the 95% confidence interval for the population mean is (6.836229, 7.252840).
Part b: One Sample t-test
t.test(ps4data$educ, mu = 5, alternative = "two.sided")
##
## One Sample t-test
##
## data: ps4data$educ
## t = 19.269, df = 740, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 5
## 95 percent confidence interval:
## 6.836229 7.252840
## sample estimates:
## mean of x
## 7.044534
Outcome: With t = 19.269 and p < 2.2e-16, we reject the null hypothesis; the data provide strong evidence that the population mean education level is not equal to 5.
Part c: One Sample t-test with Different Hypothesis
t.test(ps4data$educ, mu = 7.2, alternative = "two.sided")
##
## One Sample t-test
##
## data: ps4data$educ
## t = -1.4652, df = 740, p-value = 0.1433
## alternative hypothesis: true mean is not equal to 7.2
## 95 percent confidence interval:
## 6.836229 7.252840
## sample estimates:
## mean of x
## 7.044534
Outcome: When testing against a hypothesized mean of 7.2, we fail to reject the null hypothesis (p = 0.1433 > 0.05); the data are consistent with a population mean of 7.2.
Part d: Two-Sample t-test
Y_t <- subset(ps4data, ps4data$abd == 1)
Y_c <- subset(ps4data, ps4data$abd == 0)
# two sided t-test
t.test(Y_t$educ, Y_c$educ, alternative = "two.sided", var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: Y_t$educ and Y_c$educ
## t = -2.6798, df = 551.58, p-value = 0.007587
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.0318702 -0.1589784
## sample estimates:
## mean of x mean of y
## 6.820346 7.415771
Outcome: The two-sample t-test finds a statistically significant difference in mean education between the two groups defined by abd (abd == 1 vs. abd == 0), with p = 0.0076; the abd == 1 group has the lower average education.
Part e: Advantages of One-Tailed Test
# Explanation of advantages
Outcome: A one-tailed test places the entire rejection region in one tail, so it has greater statistical power to detect an effect in the hypothesized direction at the same significance level; the trade-off is that it cannot detect an effect in the opposite direction.
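As an illustration (added here, not part of the original output), power.t.test can be used to compare the power of one- and two-sided tests for the same assumed effect size, group size, and significance level; the values below are assumptions chosen for the example:
# Power of a two-sample t-test for an assumed effect size of 0.3 SD,
# n = 100 per group, alpha = 0.05
power.t.test(n = 100, delta = 0.3, sd = 1, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")$power
power.t.test(n = 100, delta = 0.3, sd = 1, sig.level = 0.05,
             type = "two.sample", alternative = "one.sided")$power
The one-sided power is larger because the 5% rejection region is concentrated in a single tail.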
Part f: One-Tailed Two-Sample t-test
Y_t <- subset(ps4data, ps4data$abd == 1)
Y_c <- subset(ps4data, ps4data$abd == 0)
# one-sided t-test: is mean education lower in the abd == 1 group?
t.test(Y_t$educ, Y_c$educ, alternative = "less", var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: Y_t$educ and Y_c$educ
## t = -2.6798, df = 551.58, p-value = 0.003794
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -0.2293362
## sample estimates:
## mean of x mean of y
## 6.820346 7.415771
Outcome: Testing whether the mean education of the abd == 1 group is less than that of the abd == 0 group yields a one-sided p-value of 0.003794 (half the two-sided p-value), so we reject the null hypothesis at the 5% level.
Part g: Two-Sample t-test with Different Variable
Y_t <- subset(ps4data, ps4data$abd == 1)
Y_c <- subset(ps4data, ps4data$abd == 0)
# two sided t-test
t.test(Y_t$fthr_ed, Y_c$fthr_ed, alternative = "two.sided", var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: Y_t$fthr_ed and Y_c$fthr_ed
## t = -1.1125, df = 572.99, p-value = 0.2664
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.8408032 0.2327410
## sample estimates:
## mean of x mean of y
## 5.764069 6.068100
Outcome: For father's education (fthr_ed), the test fails to reject the null hypothesis (p = 0.2664); at this sample size there is no statistically significant difference in means between the two groups.
Part h: Minimizing Type I Error
# Explanation on minimizing Type I error
Outcome: The probability of a Type I error equals the chosen significance level, so it is reduced by lowering alpha (for example, from 0.05 to 0.01); increasing the sample size improves power (reduces Type II error) but does not change the Type I error rate.
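A small simulation (added for illustration; the sample sizes and replication count are arbitrary choices) makes this concrete: when the null is true, the rejection rate stays near the chosen alpha regardless of n, and drops only when alpha itself is lowered.
# Empirical Type I error rate under a true null (normal data, mu = 0)
set.seed(123)
type1.rate <- function(n, alpha, reps = 5000) {
  mean(replicate(reps, t.test(rnorm(n, mean = 0), mu = 0)$p.value < alpha))
}
type1.rate(n = 30, alpha = 0.05)   # close to 0.05
type1.rate(n = 300, alpha = 0.05)  # still close to 0.05 despite the larger n
type1.rate(n = 30, alpha = 0.01)   # close to 0.01 once alpha is lowered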
Part i: One Sample t-test for Wages Improvement
# Explanation and code for Part i
Outcome: A one-sample t-test is used to assess wage improvement for those with vocational training, compared with those without.
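The relevant variable names are not shown in the output above, so the following is only a sketch with hypothetical column names wage_change (post-minus-pre wages) and voc (vocational training indicator):
# Sketch only: wage_change and voc are assumed column names, not taken from ps4data
trained <- subset(ps4data, ps4data$voc == 1)
t.test(trained$wage_change, mu = 0, alternative = "greater")
A two-sample comparison against the untrained group would follow the same pattern as Part d.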
Problem 2: Simulation and Central Limit Theorem
Problem Description:
This problem examines how sample size affects the validity of the t-test when the underlying data are not normally distributed. By simulating rejection rates for small and large samples drawn from an exponential distribution, we see how the Central Limit Theorem justifies the t-test in large samples.
Part a: Small Sample Size Issue
# Explanation and code for Part a
Outcome: With small samples from a skewed exponential distribution, the t-test's empirical rejection rate under a true null exceeds the nominal 5% level, because the sampling distribution of the sample mean is not yet approximately normal. See the sketch below.
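A sketch of the simulation (the sample size, exponential rate, and number of replications are assumptions, since the original code is not shown): draw repeated small samples from an exponential distribution with true mean 1, test H0: mu = 1 at the 5% level, and record the fraction of rejections.
set.seed(123)
reject.rate <- function(n, reps = 5000, alpha = 0.05) {
  # Fraction of t-tests that reject H0: mu = 1 when the data are Exp(rate = 1),
  # whose true mean is exactly 1
  mean(replicate(reps, t.test(rexp(n, rate = 1), mu = 1)$p.value < alpha))
}
reject.rate(n = 5)   # with skewed data and a small n, typically above the nominal 0.05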
Part b: Larger Sample Size and Central Limit Theorem
# Explanation and code for Part b
Outcome: With a larger sample size (100), the empirical rejection rate falls close to the nominal 5% level: by the Central Limit Theorem, the sample mean is approximately normally distributed, so the t-test is approximately valid.
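Reusing the hypothetical reject.rate function from the Part a sketch with n = 100:
reject.rate(n = 100)  # close to the nominal 0.05 once the CLT takes effect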