Python BeautifulSoup (Web Scraping Tools Series)

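Notes on the BeautifulSoup library, part of a series on automation and web scraping tools, covering how automation and scraping techniques can improve work efficiency.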

Overview

Quick Start

from bs4 import BeautifulSoup

with open('index.html', 'r', encoding='utf-8') as file:
    html = file.read()

# type(soup) is <class 'bs4.BeautifulSoup'>
soup = BeautifulSoup(html, 'html.parser')

# A soup can also be built directly from an HTML string
soup = BeautifulSoup('<p>Analyze <span>html</span> with requests</p>', 'html.parser')

Using with Requests

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

url = 'https://zh.wikipedia.org/wiki/ISO_3166-1'

headers = {
    'User-Agent': UserAgent().random,
}

response = requests.get(url, headers=headers)

if response.status_code == 200:
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table')
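
Once a table Tag is in hand, a typical next step is to flatten its rows into plain lists. The following is a minimal sketch, not part of the original notes: it uses a small inline table (hypothetical data) so it runs without a network call, and the names html and rows are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical inline table, used here so the example runs offline.
html = """
<table>
  <thead><tr><th>Item</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>Sword</td><td>50 gold</td></tr>
    <tr><td>Shield</td><td>30 gold</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')

# Collect the text of every header or data cell, one list per row.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    for tr in table.find_all('tr')
]

print(rows)
```

The same loop should work on a table fetched with requests, provided the page actually returned one; soup.find('table') returns None when nothing matches, so a None check before iterating is worth adding.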

bs4.element.Tag

The most common bs4 type. It represents an HTML tag, and tags can be nested.

title_tag = soup.find('h1')

title_tag.text
# All text, including text from child elements

title_tag.get_text()
# Same result as .text, in method form

title_tag.string
# The tag's single string; None when the tag has more than one child

title_tag.name
# Tag name, e.g. 'h1'

title_tag.attrs
# Dict of the tag's attributes
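
To make name and attrs concrete, here is a small sketch that reads attributes off a single tag. The table markup mirrors the opening tag used in the sample HTML; the variable names are illustrative assumptions.

```python
from bs4 import BeautifulSoup

html = '<table class="table" id="shop" data-table="shop"></table>'
tag = BeautifulSoup(html, 'html.parser').find('table')

print(tag.name)            # the tag name as a string
print(tag.attrs)           # all attributes as a dict

# Dictionary-style access raises KeyError for a missing attribute,
# while .get() returns None instead.
shop_id = tag['id']
classes = tag.get('class')     # class is multi-valued, so this is a list
missing = tag.get('missing')
```

Note that class comes back as a list because HTML allows several class names on one element.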

To handle a string separated by <br/> tags and turn it into a list, use get_text with the separator argument, then split.

BeautifulSoup('<p>lorem<br>ipsum</p>', 'html.parser').get_text(separator='\n').split('\n')
# ['lorem', 'ipsum']
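
When the text around each <br/> carries stray whitespace, the stripped_strings generator may be a simpler alternative to splitting on a separator. A minimal sketch, with made-up input:

```python
from bs4 import BeautifulSoup

html = '<p> lorem <br/> ipsum <br/> dolor </p>'
p = BeautifulSoup(html, 'html.parser').find('p')

# stripped_strings yields each text node with surrounding whitespace removed
# and skips nodes that are whitespace only.
lines = list(p.stripped_strings)
print(lines)
```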

Find Element

Elements can be searched by id, CSS selector, or attributes.

By Tag

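# Calling the soup object directly is shorthand for find_all
soup(['a', 'div'])

# A list matches any of the given tag names
soup.find_all(['a', 'div'])

soup.find_all({'a', 'div'})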

By Id 🐧

p = soup.find(id='p2')
p = soup.select_one('#p2')
p = soup.find_all(attrs={'id': 'p2'})[0]
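
The three lookups differ mainly in how they behave when the id is absent. The sketch below, using a made-up two-paragraph snippet, shows that find and select_one return None, whereas indexing the result of find_all with [0] would raise IndexError on an empty list.

```python
from bs4 import BeautifulSoup

html = '<p id="p1">first</p><p id="p2">second</p>'
soup = BeautifulSoup(html, 'html.parser')

by_find = soup.find(id='p2')
by_select = soup.select_one('#p2')

# A missing id gives None rather than an exception.
missing = soup.find(id='nope')
missing_css = soup.select_one('#nope')

# find_all returns an empty list when nothing matches.
empty = soup.find_all(attrs={'id': 'nope'})

print(by_find.text, by_select.text, missing, missing_css, empty)
```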

By CSS Selector 🐧

soup.find('p', class_='text-primary')
soup.find_all('p', class_='text-primary')
soup.select('p.text-primary')
soup.select_one('p.text-primary')

By Attr (Href) 🐧

soup.find_all('a', href=lambda href: href and 'foo.bar' in href)
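
The href and ... guard in the lambda is there because the function receives None for an <a> tag that has no href. A compiled regular expression is another way to express the same filter; the sketch below assumes a small made-up set of links.

```python
import re
from bs4 import BeautifulSoup

html = '''
<a href="https://foo.bar/a">A</a>
<a href="https://example.com/b">B</a>
<a>no href</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# Function filter: guard against None before testing the substring.
by_lambda = soup.find_all('a', href=lambda href: href and 'foo.bar' in href)

# Regex filter: the pattern is searched within each href value.
by_regex = soup.find_all('a', href=re.compile(r'foo\.bar'))

hrefs = [a['href'] for a in by_regex]
print(hrefs)
```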

By Attr (data-*) 🐧

soup.find_all('div', {'data-val': ['3', '4']})
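
Because data-val is not a valid Python identifier, it cannot be passed as a keyword argument and has to go through the attrs dictionary. The sketch below, on generated sample markup, compares that form with an equivalent CSS attribute selector.

```python
from bs4 import BeautifulSoup

# Five hypothetical boxes with data-val="1" through data-val="5".
html = ''.join(f'<div data-val="{n}">box {n}</div>' for n in range(1, 6))
soup = BeautifulSoup(html, 'html.parser')

# A list of values matches any one of them.
by_attrs = soup.find_all('div', attrs={'data-val': ['3', '4']})

# The same condition written as a CSS selector group.
by_css = soup.select('div[data-val="3"], div[data-val="4"]')

texts = [d.text for d in by_attrs]
print(texts)
```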

Or Conditions 🐧

rs = soup.select('p, td')
for e in rs:
    print(e.text)
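
A comma in a CSS selector acts as an OR, which is the same idea as passing a list of tag names to find_all. A short sketch on made-up markup:

```python
from bs4 import BeautifulSoup

html = '<p>intro</p><table><tr><td>cell</td></tr></table><p>outro</p>'
soup = BeautifulSoup(html, 'html.parser')

# Selector group: matches <p> or <td>.
via_select = [e.text for e in soup.select('p, td')]

# Equivalent tag-name list for find_all.
via_find_all = [e.text for e in soup.find_all(['p', 'td'])]

print(via_select)
print(via_find_all)
```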

Str(), Prettify

A bs4.element.Tag can be rendered back to its markup.

str(soup.find('p'))
# <p><span>John Doe</span></p>

soup.find('p').prettify()
# <p>
#  <span>
#   John Doe
#  </span>
# </p>

Text, String, Strings, Contents

To read an element's value, several accessors are available: text, string, strings, and contents.

<p><span>John Doe</span></p>

print(soup.find('p').text)
# John Doe

print(soup.find('p').string)
# John Doe

print(soup.find('p').contents)
# [<span>John Doe</span>]

for s in soup.find('p').strings:
    print(s)
    # John Doe

<p>Author: <span>John Doe</span></p>

print(soup.find('p').text)
# Author: John Doe

print(soup.find('p').string)
# None

print(soup.find('p').contents)
# ['Author: ', <span>John Doe</span>]

for s in soup.find('p').strings:
    print(s)
    # Author:
    # John Doe

The value returned by the string property is of type bs4.element.NavigableString.
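
NavigableString is a subclass of str, so it works with ordinary string methods, but it also keeps a reference to the parse tree. A minimal sketch of checking the type and converting to a plain str when the value is stored elsewhere:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup('<p><span>John Doe</span></p>', 'html.parser')
value = soup.find('p').string

print(type(value))
print(isinstance(value, str))

# str() gives a plain string that no longer points back into the tree.
plain = str(value)
print(type(plain))
```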

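Removing Noise: Decompose, Extract

Unwanted elements can be removed with decompose or extract. A common case is the sub and sup tags inside tables on Wikipedia; style and script tags, which contribute nothing to content analysis, are also frequently removed.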

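tags_to_remove = soup.select('sup, sub')

# Option 1: decompose() removes each tag from the tree and destroys it
for tag in tags_to_remove:
    tag.decompose()

# Option 2 (use instead of option 1): extract() removes each tag and returns it
for tag in tags_to_remove:
    removed = tag.extract()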
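
The practical difference between the two is whether the removed element is still needed afterwards. The sketch below uses a made-up table cell in the style of a Wikipedia footnote marker to show both calls side by side.

```python
from bs4 import BeautifulSoup

html = '<td>Taiwan<sup>[1]</sup> TW<sub>note</sub></td>'
soup = BeautifulSoup(html, 'html.parser')
cell = soup.find('td')

# extract() detaches the tag and hands it back, so its content stays usable.
removed = cell.find('sup').extract()

# decompose() detaches the tag and destroys it; nothing is returned.
cell.find('sub').decompose()

print(removed.text)
print(cell.text)
```

extract is the one to reach for when the footnote text should be kept somewhere; decompose is enough when the element is pure noise.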

Sample HTML for Experiments

<html>
<head>
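    <title>Sample Article</title>
</head>
<body>
    <h1>Title: This Is a Sample Article</h1>
    <p>Author: <span>John Doe</span></p>
    <div class="content">
        <p>This is the content of the article.</p>
        <p>This is the second paragraph of the article.</p>
        <p>This is the third paragraph of the article.</p>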
    </div>

    <!-- Shops of Warrior -->

    <table class="table" id="shop" data-table="shop">
        <thead>
            <tr>
                <th>Item</th>
                <th>Price</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>Sword</td>
                <td>50 gold</td>
            </tr>
            <tr>
                <td>Shield</td>
                <td>30 gold</td>
            </tr>
            <tr>
                <td>Magic Potion</td>
                <td>10 gold</td>
            </tr>
        </tbody>
    </table>
</body>
</html>