특정 주제 목록 캡쳐그림/목록 가져오기 (Python)

파이선 컨을 앞두고 파이선 재미난 코드 snippet 을 보다가 하나 따라해봄. (출처)

주피터에 있는걸 복붙. 사전에 webkit2png 설치가 필요하며 그리고 실행하는 경로에 images라는 폴더 생성을 해야한다.

import requests # http 
import bs4  # beautiful soup
import re # reqular expression
import subprocess # capture screen
import json  #json util
import os # os util 사용하기 위

In [39]:

# Requests
BASE_URL_SDS = "http://search.daum.net/search?nil_suggest=btn&w=news&DA=SBC&cluster=y&q=%EC%82%BC%EC%84%B1sds"
data = requests.get(BASE_URL_SDS)

# row개수 확인
data = bs4.BeautifulSoup(data.text)
# 아래는 reqular expression을 이용하여 totalCount를 가져오는 예제
match = re.search("totalCount: [0-9]+", data.text)
# total Count를 가져오는 부분이며 두번째 인덱스에 숫자가들어있것지
total_count = int(match.group(0).split("totalCount: ")[1])

/Users/jouk/Workspace/python/test/images

In [50]:

# 총 페이지 개수 (페이지당 10개)
pages = total_count / 10 + 1
article_data = [] #아티클 보관할 배열 생성

# 오호라 이 문법은 정말 신기하구먼 자바랑 좀 다른건가 for in 하고 비슷하긴한데 range라는게 있구먼..
for page in range(1, pages+1):
    TARGET_URL = BASE_URL_SDS + "&p=" + str(page)
    data = requests.get(TARGET_URL)
    data = bs4.BeautifulSoup(data.text)
    articles = data.findAll("div", attrs={'class': 'cont_inner'})

    for article in articles:
        title_and_link = article.findAll("a")[0]
        title = title_and_link.text.encode('utf-8')
        link = title_and_link["href"]

        date_and_media = str(article.findAll("span", attrs={'class': 'date'})[0])
        date = date_and_media.split("\n")[1]
        media = date_and_media.split("\n")[2].split("</span> ")[1]

        article_data.append(
            {
                "title": title,
                "link": link,
                "date": date,
                "media": media,
            }
        )
        
        # 아래를 실행하기 위해서는 http://www.paulhammond.org/webkit2png/ 에서 우선 webkit2png가 필요!!
        # ScreenShot
        subprocess.call([
            "webkit2png",
            "-F",   # only create fullsize screenshot
            "--filename=temporary",
            "--dir=" + os.path.join(os.getcwd(), "images"),
            link
        ])
        # Rename Screenshot
        # webkit2png --filename=FILENAME 옵션을 사용하면 한글깨짐 문제 발생
        for filename in os.listdir("./images/"):
            if filename.startswith("temporary"):
                os.rename(
                    os.path.join(os.getcwd(), "images", filename),
                    os.path.join(os.getcwd(), "images",
                                "Screenshot_" + date + "_" + media + "_" + title.replace(" ", "_") + ".png")
                )

# Result as JSON
# 단, ensure_ascii 옵션으로 UTF-8 ( 한글로 보이도록 ) 출력한다.
with open('result.json', 'w') as outfile:
    json.dump(article_data, outfile, ensure_ascii=False)

저작자표시

'데이터분석 > R & Python' 카테고리의 다른 글

주식 회복 탄력성 지수 구하는 모듈 ver 1.0 (0)	2017.03.09

Posted by 억사마

«이전 1 다음»

일	월	화	수	목	금	토
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Success is a long continuous journey.

'2016/08/11'에 해당되는 글 1건

특정 주제 목록 캡쳐그림/목록 가져오기 (Python)

'데이터분석 > R & Python' 카테고리의 다른 글

카테고리

공지사항

태그목록

최근에 올라온 글

최근에 달린 댓글

최근에 받은 트랙백

글 보관함

달력

링크

티스토리툴바