The Utility of Big Data 01

Category: Next Project: Let's Learn R

quantlab 2014. 1. 5. 10:54

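In an earlier post we saw that restricting the range of the independent variable increases the sampling error of the regression coefficient. With a very large amount of data, however, the regression coefficient can still be estimated quite precisely even though the range of the independent variable is restricted.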
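Why more data can offset a narrow range follows from the textbook standard error of the slope in simple linear regression. The notation below is introduced here for illustration and is not from the original post: sigma is the standard deviation of the noise, s_x the sample standard deviation of the independent variable, and n the sample size.

```latex
\mathrm{SE}(\hat{\beta}_1)
  = \frac{\sigma}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
  = \frac{\sigma}{s_x \sqrt{n - 1}}
```

Restricting the range shrinks s_x and so inflates the standard error, while a larger n shrinks it again at the rate 1/sqrt(n - 1). As a rough check with the values used in the R source (sigma = 5, and s_x of about 1.15 when x is confined to mean ± 2, treating x as nearly uniform on that interval), the formula gives a standard error of roughly 0.62 at n = 50, 0.19 at n = 500, and 0.06 at n = 5000. For comparison, an unrestricted sample of 50 (s_x of about 10) gives roughly 0.07.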

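In the figure above, the panels drawn from the heavily restricted range (mean ± 2) show that, even though the independent variable of the sample barely varies, the estimated regression line becomes more accurate as the sample size grows from 50 to 500 to 5000.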
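The same pattern can be checked numerically rather than visually. The following is a minimal sketch, assuming the same simulated population as the R source below. The helper name `slope_summary`, the fixed seed, and the object `result` are illustrative additions, not part of the original script.

```r
# Assumption: a fixed seed, only so that the run is repeatable
set.seed(1)

# Same population as in the post: y = 0.7 * x + noise
x_pop <- rnorm(50000, mean = 55, sd = 10)
y_pop <- 0.7 * x_pop + rnorm(50000, sd = 5)

# Indices of population members inside the restricted range mean +- 2
restricted <- which(x_pop > 55 - 2 * 1 & x_pop < 55 + 2 * 1)

# Hypothetical helper: fit a regression on n restricted-range points and
# return the slope estimate together with its standard error
slope_summary <- function(n) {
  idx <- sample(restricted, n)
  fit <- lm(y ~ x, data = data.frame(x = x_pop[idx], y = y_pop[idx]))
  est <- coef(summary(fit))["x", c("Estimate", "Std. Error")]
  c(n = n, slope = est[["Estimate"]], se = est[["Std. Error"]])
}

# One row per sample size: 50, 500, 5000
result <- t(sapply(c(50, 500, 5000), slope_summary))
print(result)
```

The `se` column should shrink by roughly a factor of sqrt(10) each time the sample size grows tenfold, while the `slope` column settles near the true value of 0.7.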

R source

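# Lay out a 3 x 2 grid of panels, filled column by column
par(mfcol=c(3,2))
# Simulated population: y = 0.7*x + noise, with x ~ N(55, 10^2) and noise sd = 5
x<-rnorm(50000, mean=55, sd=10)
y<-.7*x+rnorm(50000,sd=5)
plot(x,y,xlim=c(20,90),ylim=c(0,80), main="Population, sd=10")
mylm<-lm(y~x)
abline(mylm,col="red", lwd=3)
# 90% confidence band for the fitted line (computed but not drawn in this panel)
newx<-seq(20,90)
prd<-predict(mylm,newdata=data.frame(x=newx),interval="confidence",
             level=0.90,type="response")
# Keep the full population for the sampling steps that follow
x_pop<-x; y_pop<-y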

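#======= Sampling without range restriction : samplesize 50
# Draw from the whole population rather than only its first 500 elements
index<-sample(length(x_pop), 50)
x<-x_pop[index]
y<-y_pop[index]
plot(x,y,xlim=c(20,90),ylim=c(0,80), main="Sampling without range restriction")
mylm<-lm(y~x)
abline(mylm,col="red", lwd=3)
# 90% confidence band for the fitted line, drawn as dashed red lines
newx<-seq(20,90)
prd<-predict(mylm,newdata=data.frame(x=newx),interval="confidence",
             level=0.90,type="response")
lines(newx,prd[,2],col="red",lty=2)
lines(newx,prd[,3],col="red",lty=2)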

#======= Sampling, range restriction : mean+-2*5, samplesize 50
# Keep only population members with x within 10 of the mean
index<-which(x_pop>55-2*5 & x_pop<55+2*5)
index<-sample(index, 50)
x<-x_pop[index]
y<-y_pop[index]
plot(x,y,xlim=c(20,90),ylim=c(0,80), main="Sampling, range : mean+-2*5")
mylm<-lm(y~x)
abline(mylm,col="red", lwd=2)
# 90% confidence band for the fitted line
newx<-seq(20,90)
prd<-predict(mylm,newdata=data.frame(x=newx),interval="confidence",
             level=0.90,type="response")
lines(newx,prd[,2],col="red",lty=2)
lines(newx,prd[,3],col="red",lty=2)

#======= Small Data : samplesize 50, range : mean+-2*1
# Keep only population members with x within 2 of the mean
index<-which(x_pop>55-2*1 & x_pop<55+2*1)
index<-sample(index, 50)
x<-x_pop[index]
y<-y_pop[index]
plot(x,y,xlim=c(20,90),ylim=c(0,80), main="Sampling, range : mean+-2*1")
mylm<-lm(y~x)
abline(mylm,col="red", lwd=1)
# 90% confidence band for the fitted line
newx<-seq(20,90)
prd<-predict(mylm,newdata=data.frame(x=newx),interval="confidence",
             level=0.90,type="response")
lines(newx,prd[,2],col="red",lty=2)
lines(newx,prd[,3],col="red",lty=2)

#======= Big Data : samplesize 500, range : mean+-2*1
index<-which(x_pop>55-2*1 & x_pop<55+2*1)
index<-sample(index, 500)
x<-x_pop[index]
y<-y_pop[index]
plot(x,y,xlim=c(20,90),ylim=c(0,80), main="Sampling, range : mean+-2*1")
mylm<-lm(y~x)
abline(mylm,col="red", lwd=1)
# 90% confidence band for the fitted line
newx<-seq(20,90)
prd<-predict(mylm,newdata=data.frame(x=newx),interval="confidence",
             level=0.90,type="response")
lines(newx,prd[,2],col="red",lty=2)
lines(newx,prd[,3],col="red",lty=2)

#======= Big Data : samplesize 5000, range : mean+-2*1
index<-which(x_pop>55-2*1 & x_pop<55+2*1)
index<-sample(index, 5000)
x<-x_pop[index]
y<-y_pop[index]
plot(x,y,xlim=c(20,90),ylim=c(0,80), main="Sampling, range : mean+-2*1")
mylm<-lm(y~x)
abline(mylm,col="red", lwd=1)
# 90% confidence band for the fitted line
newx<-seq(20,90)
prd<-predict(mylm,newdata=data.frame(x=newx),interval="confidence",
             level=0.90,type="response")
lines(newx,prd[,2],col="red",lty=2)
lines(newx,prd[,3],col="red",lty=2)
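
The script above repeats the same sample, fit, and draw steps for every panel. One possible way to factor out that repetition is sketched below. The function `draw_panel` and its arguments `half_width`, `center`, and `title` are hypothetical names introduced here, and the fixed seed is an assumption added only for repeatability.

```r
# Hypothetical helper (not in the original script): draw one panel.
# It samples n points with |x - center| < half_width, fits y ~ x, and
# draws the fitted line with a 90% confidence band.
draw_panel <- function(x_pop, y_pop, n, half_width = Inf, center = 55, title = "") {
  keep <- which(abs(x_pop - center) < half_width)
  idx <- sample(keep, n)
  x <- x_pop[idx]
  y <- y_pop[idx]
  plot(x, y, xlim = c(20, 90), ylim = c(0, 80), main = title)
  fit <- lm(y ~ x)
  abline(fit, col = "red", lwd = 1)
  newx <- seq(20, 90)
  prd <- predict(fit, newdata = data.frame(x = newx),
                 interval = "confidence", level = 0.90)
  lines(newx, prd[, 2], col = "red", lty = 2)  # lower bound of the band
  lines(newx, prd[, 3], col = "red", lty = 2)  # upper bound of the band
  invisible(fit)
}

# Usage: the three restricted-range panels, stacked in one column
set.seed(1)
x_pop <- rnorm(50000, mean = 55, sd = 10)
y_pop <- 0.7 * x_pop + rnorm(50000, sd = 5)
par(mfcol = c(3, 1))
fits <- lapply(c(50, 500, 5000), function(n)
  draw_panel(x_pop, y_pop, n, half_width = 2,
             title = paste0("Sampling, range : mean+-2*1, n=", n)))
```

Returning the fitted model invisibly keeps the plotting side effect while still letting the caller inspect each fit, for example with `summary(fits[[3]])`.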
