Kaggler's Day #9

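This is the ninth Kaggler's Day, posted later than planned.

This round I am not stopping at EDA on a Kaggle competition; I plan to go all the way to fitting a predictive model and producing predictions. Today I will only get as far as exploring the dataset.

The topic is house price prediction (here).

I am going to predict house prices in Ames, a city in Iowa. The competition provides about 80 housing-related features, and the goal is to predict the sale price from them. The US housing market seems quite different from the Korean one. In Korea, what matters is things like whether a home is near a subway station and how far away it is, or the distance to schools in a good school district. A few of the features here look related to that, but most of them describe the condition of the house itself.

Let's take a look at the data.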

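LotFrontage

Street: of the two types, gravel homes are cheap, while Pave homes span a wide price range.

Alley: likewise, Pave generally commands higher prices than Gravel.

LotShape: homes with a Reg lot are priced somewhat lower than the IR series.

LandContour: likewise, prices differ by type.

Utilities: AllPub covers a wide price range, while NoSeWa sits almost fixed at around 1.5(?).

Neighborhood: price levels clearly differ depending on the neighborhood.

YearBuilt: clearly, none of the old homes are expensive.

HouseStyle: one-story and two-story houses are priced higher.

OverallQual: unsurprisingly, the better the quality, the more expensive the house.

OverallCond: oddly, homes rated 5 span the entire price range; the other ratings follow the expected proportional relationship.

YearRemodAdd: the remodel date. More recent is more expensive, as expected, but the relationship is weaker than I thought, and most homes cluster in the lower price band.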
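Comparisons like the Street, Alley, and Neighborhood notes above can be sketched as a per-category price summary. This is only a minimal sketch, assuming pandas; the helper name `price_summary` is hypothetical, and the rows are toy values invented for illustration, not taken from the competition's train.csv.

```python
import pandas as pd

# Toy rows invented for illustration only; the real data would be loaded
# from the competition file, e.g. pd.read_csv("train.csv").
toy = pd.DataFrame({
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 88000, 250000, 102000, 140000],
})


def price_summary(df, feature, target="SalePrice"):
    """Return count, median, min and max of the target for each category."""
    return (
        df.groupby(feature)[target]
        .agg(["count", "median", "min", "max"])
        .sort_values("median", ascending=False)
    )


summary = price_summary(toy, "Street")
print(summary)
```

Reading the min and max columns next to the median gives roughly the same information as eyeballing a box plot: a wide min-to-max span corresponds to "spans a wide price range", and a low median to "cheap".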

RoofStyle: Type of roof. Not much difference; Hip is priced slightly higher, if anything.

Flat Flat

Gable Gable

Gambrel Gambrel (Barn)

Hip Hip

Mansard Mansard

Shed Shed

RoofMatl: Roof material. Wood Shingles is the most expensive, followed by Gravel & Tar and Wood Shakes. Notably, Membrane, Metal, and Roll homes fall in a very narrow price range.

ClyTile Clay or Tile

CompShg Standard (Composite) Shingle

Membran Membrane

Metal Metal

Roll Roll

Tar&Grv Gravel & Tar

WdShake Wood Shakes

WdShngl Wood Shingles

Exterior1st: Exterior covering on house. Prices differ depending on the exterior finish. PreCast and Vinyl Siding tend to be on the expensive side.

AsbShng Asbestos Shingles

AsphShn Asphalt Shingles

BrkComm Brick Common

BrkFace Brick Face

CBlock Cinder Block

CemntBd Cement Board

HdBoard Hard Board

ImStucc Imitation Stucco

MetalSd Metal Siding

Other Other

Plywood Plywood

PreCast PreCast

Stone Stone

Stucco Stucco

VinylSd Vinyl Siding

Wd Sdng Wood Siding

WdShing Wood Shingles

Exterior2nd: Exterior covering on house (if more than one material). Prices look similar to the above.

AsbShng Asbestos Shingles

AsphShn Asphalt Shingles

BrkComm Brick Common

BrkFace Brick Face

CBlock Cinder Block

CemntBd Cement Board

HdBoard Hard Board

ImStucc Imitation Stucco

MetalSd Metal Siding

Other Other

Plywood Plywood

PreCast PreCast

Stone Stone

Stucco Stucco

VinylSd Vinyl Siding

Wd Sdng Wood Siding

WdShing Wood Shingles

MasVnrType: Masonry veneer type. Stone > Brick Face > Brick Common, None

BrkCmn Brick Common

BrkFace Brick Face

CBlock Cinder Block

None None

Stone Stone

MasVnrArea: Masonry veneer area in square feet. Surprisingly little relationship with price.

ExterQual: Evaluates the quality of the material on the exterior. Excellent is priced highest.

Ex Excellent

Gd Good

TA Average/Typical

Fa Fair

Po Poor

ExterCond: Evaluates the present condition of the material on the exterior. Prices rank Excellent > Average/Typical.

Ex Excellent

Gd Good

TA Average/Typical

Fa Fair

Po Poor
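The Ex/Gd/TA/Fa/Po scale shows up again for the basement, heating, kitchen, fireplace, garage, and pool features, sometimes with an extra NA level meaning the feature is absent. One possible way to make it numeric is an ordinal mapping. This is a minimal sketch, assuming pandas; the name `encode_quality` and the choice of 0 for NA are my assumptions, not something from the competition.

```python
import pandas as pd

# Ordered from worst to best, following the data description.
QUALITY_ORDER = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}


def encode_quality(series, missing_value=0):
    """Map Po..Ex to 1..5; missing values (feature absent) become missing_value."""
    return series.map(QUALITY_ORDER).fillna(missing_value).astype(int)


# Toy values invented for illustration; None stands in for "NA".
toy = pd.Series(["Gd", "TA", "Ex", None, "Fa"], name="BsmtQual")
encoded = encode_quality(toy)
print(encoded.tolist())  # [4, 3, 5, 0, 2]
```

Note that pd.read_csv treats the string "NA" as a missing value by default, which is why the sketch handles missing values rather than a literal "NA" key.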

Foundation: Type of foundation. Poured Concrete, Stone, and Wood are priced higher.

BrkTil Brick & Tile

CBlock Cinder Block

PConc Poured Concrete

Slab Slab

Stone Stone

Wood Wood

BsmtQual: Evaluates the height of the basement. Apparently the taller the basement, the higher the quality rating.

Ex Excellent (100+ inches)

Gd Good (90-99 inches)

TA Typical (80-89 inches)

Fa Fair (70-79 inches)

Po Poor (<70 inches)

NA No Basement

BsmtCond: Evaluates the general condition of the basement. No Excellent values appear in the data; otherwise the same pattern holds.

Ex Excellent

Gd Good

TA Typical - slight dampness allowed

Fa Fair - dampness or some cracking or settling

Po Poor - Severe cracking, settling, or wetness

NA No Basement

BsmtExposure: Refers to walkout or garden level walls. A walkout is a passage from inside the house that leads directly outside. Here too, each category forms its own price range.

Gd Good Exposure

Av Average Exposure (split levels or foyers typically score average or above)

Mn Minimum Exposure

No No Exposure

NA No Basement

BsmtFinType1: Rating of basement finished area. This also seems to describe the condition of the basement.

GLQ Good Living Quarters

ALQ Average Living Quarters

BLQ Below Average Living Quarters

Rec Average Rec Room

LwQ Low Quality

Unf Unfinished

NA No Basement

BsmtFinSF1: Type 1 finished square feet. Proportional to price.

BsmtFinType2: Rating of basement finished area (if multiple types). Not much of a relationship.

GLQ Good Living Quarters

ALQ Average Living Quarters

BLQ Below Average Living Quarters

Rec Average Rec Room

LwQ Low Quality

Unf Unfinished

NA No Basement

BsmtFinSF2: Type 2 finished square feet. Likewise, not much of a relationship.

BsmtUnfSF: Unfinished square feet of basement area. No relationship.

TotalBsmtSF: Total square feet of basement area. Proportional to price.
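For the square-footage features, "proportional" versus "no relationship" can be roughly quantified with a correlation coefficient. This is a minimal sketch, assuming pandas and Pearson correlation; `correlation_with_price` is a hypothetical helper, and the numbers are toy values made up for illustration, so they do not reproduce my observations on the real data.

```python
import pandas as pd

# Toy rows invented for illustration only.
toy = pd.DataFrame({
    "TotalBsmtSF": [850, 920, 1100, 1500, 600, 1300],
    "BsmtUnfSF": [150, 700, 400, 300, 500, 200],
    "SalePrice": [140000, 155000, 181500, 250000, 102000, 208500],
})


def correlation_with_price(df, target="SalePrice"):
    """Pearson correlation of every numeric column with the target."""
    numeric = df.select_dtypes("number")
    return numeric.corr()[target].drop(target).sort_values(ascending=False)


corr = correlation_with_price(toy)
print(corr)
```

A value near 1 matches what I have been calling a proportional relationship, while a value near 0 matches "no relationship". Pearson correlation only captures linear trends, so a scatter plot is still worth a look.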

Heating: Type of heating. Gas heating goes with higher house prices; GasA and GasW both rank near the top.

Floor Floor Furnace

GasA Gas forced warm air furnace

GasW Gas hot water or steam heat

Grav Gravity furnace

OthW Hot water or steam heat other than gas

Wall Wall furnace

HeatingQC: Heating quality and condition. Here too, better condition pushes the price up.

Ex Excellent

Gd Good

TA Average/Typical

Fa Fair

Po Poor

CentralAir: Central air conditioning. Homes with central air are priced higher.

N No

Y Yes

Electrical: Electrical system. SBrkr > FuseA > FuseF > FuseP > Mix

SBrkr Standard Circuit Breakers & Romex

FuseA Fuse Box over 60 AMP and all Romex wiring (Average)

FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)

FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)

Mix Mixed

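1stFlrSF: First Floor square feet. Proportional to price.

2ndFlrSF: Second floor square feet. Proportional, excluding homes with no second floor.

LowQualFinSF: Low quality finished square feet (all floors). No effect.

GrLivArea: Above grade (ground) living area square feet. Proportional to price.

BsmtFullBath: Basement full bathrooms. As the count increases, the low end of the price range rises slightly, but there is effectively no effect.

BsmtHalfBath: Basement half bathrooms. As the count increases, the low end of the price range rises slightly, but there is effectively no effect.

FullBath: Full bathrooms above grade. As the count increases, the low end of the price range rises slightly, but there is effectively no effect.

HalfBath: Half baths above grade. As the count increases, the low end of the price range rises slightly, but there is effectively no effect.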

Bedroom: Bedrooms above grade (does NOT include basement bedrooms) ??

Kitchen: Kitchens above grade ??

KitchenQual: Kitchen quality. Likewise proportional.

Ex Excellent

Gd Good

TA Typical/Average

Fa Fair

Po Poor

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms). Slightly proportional, perhaps?

Functional: Home functionality (Assume typical unless deductions are warranted)

Typ Typical Functionality

Min1 Minor Deductions 1

Min2 Minor Deductions 2

Mod Moderate Deductions

Maj1 Major Deductions 1

Maj2 Major Deductions 2

Sev Severely Damaged

Sal Salvage only

Fireplaces: Number of fireplaces

FireplaceQu: Fireplace quality. Also proportional.

Ex Excellent - Exceptional Masonry Fireplace

Gd Good - Masonry Fireplace in main level

TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement

Fa Fair - Prefabricated Fireplace in basement

Po Poor - Ben Franklin Stove

NA No Fireplace

GarageType: Garage location. BuiltIn and Attchd are more expensive.

2Types More than one type of garage

Attchd Attached to home

Basment Basement Garage

BuiltIn Built-In (Garage part of house - typically has room above garage)

CarPort Car Port

Detchd Detached from home

NA No Garage

GarageYrBlt: Year garage was built. Homes with a more recently built garage are, unsurprisingly, more expensive.

GarageFinish: Interior finish of the garage. Proportional: Fin > RFn > Unf

Fin Finished

RFn Rough Finished

Unf Unfinished

NA No Garage

GarageCars: Size of garage in car capacity. This also looks proportional: the more cars it holds, the higher the price range.

GarageArea: Size of garage in square feet. Proportional up to about 800 square feet, after which the relationship breaks down.

GarageQual: Garage quality. No effect on price; Ex and Gd look the same.

Ex Excellent

Gd Good

TA Typical/Average

Fa Fair

Po Poor

NA No Garage

GarageCond: Garage condition. Not significant; Excellent homes are actually priced lower than Good ones.

Ex Excellent

Gd Good

TA Typical/Average

Fa Fair

Po Poor

NA No Garage

PavedDrive: Paved driveway. The more paved, the higher the price: Y > P > N

Y Paved

P Partial Pavement

N Dirt/Gravel

WoodDeckSF: Wood deck area in square feet. Slightly proportional; there is little data beyond 400 square feet.

OpenPorchSF: Open porch area in square feet. Little data beyond 200 square feet.

EnclosedPorch: Enclosed porch area in square feet. Not significant.

3SsnPorch: Three season porch area in square feet. Little data.

ScreenPorch: Screen porch area in square feet. Too little data to judge whether it is significant.

PoolArea: Pool area in square feet. Far too little data to judge whether it is significant.
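"Little data" for the porch, deck, and pool areas above means that most rows are zero. One way to put a number on that is the share of homes with a non-zero value. This is a minimal sketch, assuming pandas; `nonzero_share` is a hypothetical helper, and the rows are toy values invented for illustration.

```python
import pandas as pd


def nonzero_share(df, columns):
    """Fraction of rows in which each listed column is greater than zero."""
    return (df[columns] > 0).mean().sort_values()


# Toy rows invented for illustration only.
toy = pd.DataFrame({
    "PoolArea": [0, 0, 0, 0, 0, 0, 0, 512],
    "ScreenPorch": [0, 0, 176, 0, 0, 0, 0, 0],
    "WoodDeckSF": [0, 298, 0, 192, 40, 0, 255, 0],
})

share = nonzero_share(toy, ["PoolArea", "ScreenPorch", "WoodDeckSF"])
print(share)
```

A feature with a very small share could be turned into a simple has/has-not flag, or dropped, but that decision belongs to the preprocessing step.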

PoolQC: Pool quality. Excellent clearly has a high price level, but Gd is actually priced lower than Fa, so I doubt this is significant.

Ex Excellent

Gd Good

TA Average/Typical

Fa Fair

NA No Pool

Fence: Fence quality. Good Privacy is high, but the rest are hard to compare. I am considering collapsing this into a categorical factor of GdPrv versus everything else.

GdPrv Good Privacy

MnPrv Minimum Privacy

GdWo Good Wood

MnWw Minimum Wood/Wire

NA No Fence
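The GdPrv-versus-the-rest idea for Fence could look something like this. It is a minimal sketch, assuming pandas; the name `fence_is_good_privacy` is hypothetical, and whether this encoding actually helps the model is something I have not checked yet.

```python
import pandas as pd


def fence_is_good_privacy(series):
    """1 where Fence is GdPrv, 0 for every other level including missing."""
    return (series == "GdPrv").astype(int)


# Toy values invented for illustration; None stands in for "NA" (no fence).
toy = pd.Series(["GdPrv", "MnPrv", None, "GdWo", "GdPrv", "MnWw"], name="Fence")
flag = fence_is_good_privacy(toy)
print(flag.tolist())  # [1, 0, 0, 0, 1, 0]
```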

MiscFeature: Miscellaneous feature not covered in other categories. Homes with a Tennis Court are expensive, and so are homes with a second garage.

Elev Elevator

Gar2 2nd Garage (if not described in garage section)

Othr Other

Shed Shed (over 100 SF)

TenC Tennis Court

NA None

MiscVal: $Value of miscellaneous feature. Little data; not significant.

MoSold: Month Sold (MM). Months 3, 4, and 7 show higher sale prices.

YrSold: Year Sold (YYYY). Not significant; every year looks similar.

SaleType: Type of sale. New and Con look more expensive; the rest are similar.

WD Warranty Deed - Conventional

CWD Warranty Deed - Cash

VWD Warranty Deed - VA Loan

New Home just constructed and sold

COD Court Officer Deed/Estate

Con Contract 15% Down payment regular terms

ConLw Contract Low Down payment and low interest

ConLI Contract Low Interest

ConLD Contract Low Down

Oth Other

SaleCondition: Condition of sale. Partial appears to have a higher price level.

Normal Normal Sale

Abnorml Abnormal Sale - trade, foreclosure, short sale

AdjLand Adjoining Land Purchase

Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit

Family Sale between family members

Partial Home was not completed when last assessed (associated with New Homes)

These notes record what I saw when reading the feature descriptions alongside SalePrice. Next time I will move on to the actual data preprocessing.


Posted by 억사마

PyconKR 2016 Recap

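Last year my schedule kept me from going, but this year a once-in-a-lifetime chance let me attend PyCon, and for all three days at that. My expectations were so high that I was also quite disappointed, but I still took a lot away. ML topics seemed to take up most of the program, and some of the other sessions had many empty seats. Data analysis seems to be the trend. The last day was tutorial day: I built a simple web service with Django and deployed it to the cloud, which was a good chance to get to know Django Girls and Django.

Below are the sessions I missed (because I was attending something else). They are worth watching once the videos are up on YouTube.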

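How to Make News Fun; NewsJam (link)

Let's Build a Shopping Mall with Django (link)

Basic Statistics with Python (link)

TOROS: Python Framework for Recommender System (link)

Solving Basic Arithmetic with Python (this was a tutorial, so I will update it later)

Using the IoT, Cognitive, and Machine Learning Trio with Python (link)

Django vs Flask, Let's Dig In! (link)

Search Log System with Python (link)

Decision making with Genetic Algorithms using DEAP (link)

The Python Data Analysis Trio: statsmodels, scikit-learn, theano (link)

For the sessions I did attend, I plan to rewatch the talks and post a summary later.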


Posted by 억사마

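Fetching a list and screenshots for a given topic (Python)

With PyCon coming up, I was browsing fun Python code snippets and tried following one. (source)

Pasted from a Jupyter notebook. webkit2png must be installed beforehand, and a folder named images must exist in the directory you run it from.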

import requests  # HTTP client
import bs4  # Beautiful Soup
import re  # regular expressions
import subprocess  # runs webkit2png to capture the screen
import json  # JSON utilities
import os  # OS utilities


# Requests
BASE_URL_SDS = "http://search.daum.net/search?nil_suggest=btn&w=news&DA=SBC&cluster=y&q=%EC%82%BC%EC%84%B1sds"
response = requests.get(BASE_URL_SDS)

# Check the number of result rows.
# Use a regular expression to pull totalCount out of the raw page source;
# capture group 1 holds the number. Searching the raw HTML avoids depending
# on how Beautiful Soup exposes script text.
match = re.search(r"totalCount: ([0-9]+)", response.text)
total_count = int(match.group(1)) if match else 0


# Total number of pages (10 results per page), rounded up
pages = (total_count + 9) // 10
article_data = []  # list that collects the articles

# Neat syntax, a bit different from Java: it is similar to for-in, but uses range
for page in range(1, pages + 1):
    TARGET_URL = BASE_URL_SDS + "&p=" + str(page)
    data = requests.get(TARGET_URL)
    data = bs4.BeautifulSoup(data.text, "html.parser")
    articles = data.findAll("div", attrs={'class': 'cont_inner'})

    for article in articles:
        title_and_link = article.findAll("a")[0]
        title = title_and_link.text  # already a str in Python 3, so no encode
        link = title_and_link["href"]

        date_and_media = str(article.findAll("span", attrs={'class': 'date'})[0])
        date = date_and_media.split("\n")[1]
        media = date_and_media.split("\n")[2].split("</span> ")[1]

        article_data.append(
            {
                "title": title,
                "link": link,
                "date": date,
                "media": media,
            }
        )

        # webkit2png (http://www.paulhammond.org/webkit2png/) must be installed to run the part below
        # ScreenShot
        subprocess.call([
            "webkit2png",
            "-F",   # only create fullsize screenshot
            "--filename=temporary",
            "--dir=" + os.path.join(os.getcwd(), "images"),
            link
        ])
        # Rename Screenshot
        # Passing a Korean name to webkit2png --filename=FILENAME garbles the Hangul, so rename afterwards
        for filename in os.listdir("./images/"):
            if filename.startswith("temporary"):
                os.rename(
                    os.path.join(os.getcwd(), "images", filename),
                    os.path.join(os.getcwd(), "images",
                                "Screenshot_" + date + "_" + media + "_" + title.replace(" ", "_") + ".png")
                )

# Result as JSON
# ensure_ascii=False writes UTF-8 so the Korean text stays readable
with open('result.json', 'w', encoding='utf-8') as outfile:
    json.dump(article_data, outfile, ensure_ascii=False)
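The two fiddly bits of the script above, pulling totalCount out of the page and turning it into a page count, can be tried in isolation without hitting the network. This is a minimal sketch; the helper names `extract_total_count` and `page_count` are hypothetical, and the sample string is made up rather than real Daum markup.

```python
import re


def extract_total_count(html):
    """Return the integer after 'totalCount: ' in the page source, or 0 if absent."""
    match = re.search(r"totalCount: ([0-9]+)", html)
    return int(match.group(1)) if match else 0


def page_count(total_count, per_page=10):
    """Ceiling division: number of result pages needed for total_count items."""
    return (total_count + per_page - 1) // per_page


# Made-up snippet for illustration, not actual Daum search output.
sample = "var opts = { totalCount: 25, page: 1 };"
total = extract_total_count(sample)
print(total, page_count(total))  # 25 3
```

Ceiling division avoids requesting one extra empty page when the result count is an exact multiple of 10, which total_count / 10 + 1 would do.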


Posted by 억사마
