티스토리 뷰

독립 변수의 범위를 제한함에 따라 회귀 계수의 표본오차가 증가함을 보았습니다. 하지만 만약 데이터가 굉장히 많다면 독립 변수의 범위가 제한되어 있음에도 굉장히 정확하게 회귀 계수를 추정할 수 있습니다.

 

위의 그래프에서 왼쪽은 표본의 독립변수의 범위가 굉장히 제한되어 있지만 표본의 크기를 50, 500, 5000으로 증가함에 따라 추정된 회귀선의 정확성이 높아지고 있음을 볼 수 있습니다.

 

R source

 

par(mfcol=c(3,2))

 

x<-rnorm(50000, mean=55, sd=10)

y<-.7*x+rnorm(50000,sd=5)

plot(x,y,xlim=c(20,90),ylim=c(0,80), main="Population, sd=10")

mylm<-lm(y~x)

abline(mylm,col="red", lwd=3)

newx<-seq(20,90)

prd<-predict(mylm,newdata=data.frame(x=newx),interval = c("confidence"),

level = 0.90,type="response")

 

x_pop=x; y_pop=y;

 

index<-sample(1:500, 50)

x<-x_pop[index]

y<-y_pop[index]

 

plot(x,y,xlim=c(20,90),ylim=c(0,80), main="Sampling without range restriction")

mylm<-lm(y~x)

abline(mylm,col="red", lwd=3)

newx<-seq(20,90)

prd<-predict(mylm,newdata=data.frame(x=newx),interval = c("confidence"),

level = 0.90,type="response")

lines(newx,prd[,2],col="red",lty=2)

lines(newx,prd[,3],col="red",lty=2)

 

#======= Small Data : samplesize 50

 

index=which(x_pop>55-2*5 & x_pop<55+2*5)

 

index<-sample(index, 50)

x<-x_pop[index]

y<-y_pop[index]

 

plot(x,y,xlim=c(20,90),ylim=c(0,80), , main="Sampling, range : mean+-2*5")

mylm<-lm(y~x)

abline(mylm,col="red", lwd=2)

newx<-seq(20,90)

prd<-predict(mylm,newdata=data.frame(x=newx),interval = c("confidence"),

level = 0.90,type="response")

lines(newx,prd[,2],col="red",lty=2)

lines(newx,prd[,3],col="red",lty=2)

 

#======= Sampling, range restriction

 

index=which(x_pop>55-2*1 & x_pop<55+2*1)

 

index<-sample(index, 50)

x<-x_pop[index]

y<-y_pop[index]

 

plot(x,y,xlim=c(20,90),ylim=c(0,80), , main="Sampling, range : mean+-2*1")

mylm<-lm(y~x)

abline(mylm,col="red", lwd=1)

newx<-seq(20,90)

prd<-predict(mylm,newdata=data.frame(x=newx),interval = c("confidence"),

level = 0.90,type="response", lwd=3)

lines(newx,prd[,2],col="red",lty=2)

lines(newx,prd[,3],col="red",lty=2)

 

#======= Big Data : samplesize 500

 

index=which(x_pop>55-2*1 & x_pop<55+2*1)

 

index<-sample(index, 500)

x<-x_pop[index]

y<-y_pop[index]

 

plot(x,y,xlim=c(20,90),ylim=c(0,80), , main="Sampling, range : mean+-2*1")

mylm<-lm(y~x)

abline(mylm,col="red", lwd=1)

newx<-seq(20,90)

prd<-predict(mylm,newdata=data.frame(x=newx),interval = c("confidence"),

level = 0.90,type="response", lwd=3)

lines(newx,prd[,2],col="red",lty=2)

lines(newx,prd[,3],col="red",lty=2)

 

#======= Big Data : samplesize 5000

 

index=which(x_pop>55-2*1 & x_pop<55+2*1)

 

index<-sample(index, 5000)

x<-x_pop[index]

y<-y_pop[index]

 

plot(x,y,xlim=c(20,90),ylim=c(0,80), , main="Sampling, range : mean+-2*1")

mylm<-lm(y~x)

abline(mylm,col="red", lwd=1)

newx<-seq(20,90)

prd<-predict(mylm,newdata=data.frame(x=newx),interval = c("confidence"),

level = 0.90,type="response", lwd=3)

lines(newx,prd[,2],col="red",lty=2)

lines(newx,prd[,3],col="red",lty=2)

 

 



공지사항
최근에 올라온 글
최근에 달린 댓글
Total
Today
Yesterday
링크
«   2025/01   »
1 2 3 4
5 6 7 8 9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29 30 31
글 보관함