Success is a long continuous journey.

Data Analysis | 2016. 5. 12. 18:49

Kaggler's Day #1

Three times a week (!!!!), I'm going to follow along with a popular script posted on Kaggle, then write up a summary and my impressions.

1. Topic

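On Kaggle, users vote on the scripts and reports that other users write to analyze public datasets. Among these, I'm working through a review analysis for one of the most popular datasets, Amazon Fine Food Reviews. The data is provided by Stanford and contains 568,454 reviews written by Amazon users, collected through October 2012.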

The data comes as a CSV with the following columns.

Id
ProductId - unique identifier for the product
UserId - unique identifier for the user
ProfileName
HelpfulnessNumerator - number of users who found the review helpful
HelpfulnessDenominator - number of users who indicated whether they found the review helpful
Score - rating between 1 and 5
Time - timestamp for the review
Summary - brief summary of the review
Text - text of the review
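
To get a feel for how the two helpfulness columns fit together, here is a minimal sketch in base R. The sample rows are made up for illustration (they are not from the real dataset), and helpfulness_ratio is a hypothetical helper name, not something from the Kaggle script.

```r
# Hypothetical sample rows that mimic some of the columns listed above
# (not real data from the dataset).
reviews <- data.frame(
  Id = 1:4,
  HelpfulnessNumerator = c(1, 0, 3, 0),
  HelpfulnessDenominator = c(1, 0, 4, 2),
  Score = c(5, 1, 4, 2),
  stringsAsFactors = FALSE
)

# Share of voters who found a review helpful; NA when nobody voted,
# so that 0/0 does not show up as NaN.
helpfulness_ratio <- function(numerator, denominator) {
  ifelse(denominator > 0, numerator / denominator, NA_real_)
}

reviews$HelpfulRatio <- helpfulness_ratio(
  reviews$HelpfulnessNumerator,
  reviews$HelpfulnessDenominator
)

# Number of reviews per star rating.
score_counts <- table(reviews$Score)

print(reviews)
print(score_counts)
```

Returning NA for reviews with no votes keeps them distinguishable from reviews that were voted on but found unhelpful.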

The script I picked is one posted by Ben Hamner. For reference, it received 8 votes.

library(RSQLite)
library(tm)
library(wordcloud)

# Connect to the SQLite database (path follows the Kaggle input layout).
db <- dbConnect(dbDriver("SQLite"), "../input/database.sqlite")

# Pull the first 10,000 reviews.
reviews <- dbGetQuery(db, "
SELECT *
FROM Reviews
LIMIT 10000")
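make_word_cloud <- function(documents) {
  # Lowercase the text, then strip punctuation and English stopwords.
  corpus <- Corpus(VectorSource(tolower(documents)))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))

  # One row per document, one column per term.
  frequencies <- DocumentTermMatrix(corpus)
  word_frequencies <- as.data.frame(as.matrix(frequencies))

  # Total count of each term across all documents.
  words <- colnames(word_frequencies)
  freq <- colSums(word_frequencies)
  # Draw only the 400 most frequent terms (needs at least 400 distinct terms).
  wordcloud(words, freq,
            min.freq=sort(freq, decreasing=TRUE)[[400]],
            colors=brewer.pal(8, "Dark2"),
            random.color=TRUE)
}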

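# png() sends the plot to a file, so nothing is drawn on screen;
# the image is written to wordcloud.png in the working directory.
png("wordcloud.png")
make_word_cloud(reviews$Text)
dev.off()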
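
To see what the counting and min.freq steps inside make_word_cloud are doing, here is a rough base-R sketch that needs no extra packages. It is a simplified stand-in, not the tm pipeline itself: the three documents, the tiny stopword list, and the count_terms helper are all made up for illustration, and tm's tokenization differs in its details.

```r
# Hypothetical mini-corpus standing in for reviews$Text.
documents <- c(
  "Great coffee, great price!",
  "The coffee was stale.",
  "Good tea and good coffee."
)

# Tiny illustrative stopword list; tm's stopwords("english") is much longer.
stop_words <- c("the", "was", "and")

# Lowercase, strip punctuation, split on whitespace, drop stopwords,
# then count how often each remaining term occurs.
count_terms <- function(documents, stop_words) {
  cleaned <- gsub("[[:punct:]]", "", tolower(documents))
  tokens <- unlist(strsplit(cleaned, "[[:space:]]+"))
  tokens <- tokens[nchar(tokens) > 0 & !(tokens %in% stop_words)]
  sort(table(tokens), decreasing = TRUE)
}

freq <- count_terms(documents, stop_words)

# Same idea as min.freq = sort(freq, decreasing = TRUE)[[400]], except that
# top_n is capped at the number of distinct terms so the index cannot run
# past the end of the vector.
top_n <- min(2, length(freq))
min_freq <- freq[[top_n]]
top_terms <- freq[freq >= min_freq]

print(top_terms)
```

Capping top_n matters on small samples: with fewer than 400 distinct terms, the `[[400]]` lookup in the original script would stop with a subscript error.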

The data came out of SQLite just fine, the corpus got munged, and the wordcloud function was defined without errors... and yet, even though this is an easy example, no figure showed up.

So this one goes down as a failure. Anyway, I copied the script over from the original source and I'm counting it as something I did.. ㅡ.ㅡ


Posted by 억사마
