Counting Words in a Document


quantlab 2014. 2. 11. 23:53

Here is a simple example of text processing in R. The countWordsIn2 function takes one or more files or http addresses and reports how many times each word of interest appears in them.

countWordsIn2 = function(filename, wInterested, to.lower=F)

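filename is a file path or URL, and it can be a vector. wInterested is the word to count, and it can also be a vector. to.lower decides whether alphabetic case is distinguished, for example "Love" versus "love". With to.lower = TRUE, "Love" is counted as "love"; with the default, FALSE, the two are counted separately.

countWordsIn2 <- function(filename, wInterested, to.lower = FALSE) {
    # One row per file, one column per word of interest
    resultMatrix <- matrix(NA, nrow = length(filename), ncol = length(wInterested))
    if (to.lower) {
        wInterested <- tolower(wInterested)
    }
    for (i in seq_along(filename)) {
        lines <- readLines(filename[i], warn = FALSE)
        # Split each line on whitespace and flatten into one vector
        words <- unlist(strsplit(lines, "\\s+"))
        # Fold case only when requested, so to.lower = FALSE stays case-sensitive
        if (to.lower) {
            words <- tolower(words)
        }
        # Strip non-word characters (punctuation), then drop empty tokens
        words <- gsub("\\W", "", words)
        words <- words[words != ""]
        tblWords <- table(words)
        # Words that never occur are missing from the table and come back as NA
        resultMatrix[i, ] <- as.vector(tblWords[wInterested])
    }
    result <- as.data.frame(resultMatrix)
    names(result) <- wInterested
    cbind(data.frame(filename = filename), result)
}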
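To see what each step inside the loop does, here is a minimal sketch that runs the same tokenization on a few in-memory lines instead of a file. The sample sentences are made up for illustration and are not from any of the books.

```r
# Hypothetical sample text standing in for the lines of a file
lines <- c("Love is patient, love is kind.", "", "I LOVE R; do you love it too?")

words <- unlist(strsplit(lines, "\\s+"))  # split on whitespace
words <- tolower(words)                   # fold case, as with to.lower = TRUE
words <- gsub("\\W", "", words)           # strip punctuation such as "," and "?"
words <- words[words != ""]               # drop empty tokens

tbl <- table(words)

# Look up the words of interest; "hate" is absent, so its count is NA
counts <- as.vector(tbl[c("love", "hate")])
names(counts) <- c("love", "hate")
print(counts)
```

Here "love" is counted four times because "Love" and "LOVE" are folded to lower case, while "hate" comes back as NA rather than 0.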

Now let's use this function to check how many times particular words are used in books from Project Gutenberg.

book.title <- c("Alice’s Adventures in Wonderland", "Les Miserables", "Romeo and Juliet")
book.url <- c("http://www.gutenberg.org/cache/epub/28885/pg28885.txt", "http://www.gutenberg.org/cache/epub/135/pg135.txt", 
    "http://www.gutenberg.org/cache/epub/2261/pg2261.txt")

library(xtable)
# Count six love/hate word forms in each book, ignoring case
result <- countWordsIn2(book.url, c("love", "loved", "loving", "hate", "hated", 
    "hating"), to.lower = TRUE)
# Show readable titles instead of the URLs
result$filename <- book.title
print(xtable(result), type = "html")

	filename	love	loved	loving	hate	hated	hating
1	Alice’s Adventures in Wonderland	3	NA	1	1	NA	NA
2	Les Miserables	361	88	16	18	15	5
3	Romeo and Juliet	NA	NA	NA	11	1	NA
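The empty (NA) entries in the table mean the word was not found in that text after tokenization. If zeros are easier to read, something like the following sketch would do it. The data frame is retyped from the table above rather than downloaded again, and result0 is just a name made up for this example.

```r
# Counts copied from the table above; NA marks a word that was not found
result <- data.frame(
    filename = c("Alice's Adventures in Wonderland", "Les Miserables", "Romeo and Juliet"),
    love   = c(3, 361, NA),
    loved  = c(NA, 88, NA),
    loving = c(1, 16, NA),
    hate   = c(1, 18, 11),
    hated  = c(NA, 15, 1),
    hating = c(NA, 5, NA)
)

# Replace NA with 0 in the count columns only
counts <- result[, -1]
counts[is.na(counts)] <- 0
result0 <- cbind(result["filename"], counts)
print(result0)
```

Whether to show NA or 0 is a matter of taste: NA makes it obvious that the lookup found nothing at all, while 0 is more convenient for sums and plots.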
