Python Programming/Web

beautifulsoup4
Screen-scraping library
PyPI link	https://pypi.python.org/pypi/beautifulsoup4
Pip command	pip install beautifulsoup4
Import command	import bs4

requests
Python HTTP for Humans
PyPI link	https://pypi.python.org/pypi/requests
Pip command	pip install requests

Requesting and parsing web pages in Python is straightforward, and a few essential modules help with the job.

Urllib

Urllib is Python's built-in module for HTTP requests; the main article is Python Programming/Internet.

try:
    import urllib2  # Python 2
except ImportError:  # Python 3: urlopen lives in urllib.request
    import urllib.request as urllib2

url = 'https://www.google.com'
u = urllib2.urlopen(url)
content = u.read()  # content now holds the raw HTML of google.com (bytes in Python 3)
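In Python 3, read() returns bytes rather than text, so the content usually needs decoding before it can be treated as a string. The following is a minimal sketch and not part of the original example: fetch_text is a hypothetical helper, and a data: URL (which urlopen handles without touching the network) stands in for a real page so the snippet runs offline.

```python
from urllib.request import urlopen


def fetch_text(url, default_charset="utf-8"):
    """Read a URL and decode the body using the charset the response declares."""
    with urlopen(url) as response:
        raw = response.read()  # bytes in Python 3
        # Fall back to default_charset when the response names no charset.
        charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset)


# A data: URL embeds its own content, so no server is contacted.
demo_url = "data:text/html;charset=utf-8,%3Ch1%3EHello%3C%2Fh1%3E"
html = fetch_text(demo_url)
print(html)  # <h1>Hello</h1>
```

Replacing demo_url with an http or https address should work the same way, provided the network is reachable.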

Requests

The Python requests library simplifies HTTP requests. It provides a function for each HTTP method:

GET (requests.get)
POST (requests.post)
HEAD (requests.head)
PUT (requests.put)
DELETE (requests.delete)
OPTIONS (requests.options)
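To see what these functions put on the wire, a request can be built without being sent. This is a small sketch using requests.Request and its prepare() method; the example.com URL, the /api path, and the parameter names are placeholders chosen for illustration.

```python
import requests

# Placeholder URL: nothing is sent, the requests are only prepared.
url = "http://example.com/api"

get_req = requests.Request("GET", url, params={"page": 2}).prepare()
post_req = requests.Request("POST", url, data={"key": "value"}).prepare()

print(get_req.method, get_req.url)       # GET http://example.com/api?page=2
print(post_req.method, post_req.body)    # POST key=value
print(post_req.headers["Content-Type"])  # application/x-www-form-urlencoded
```

Calling requests.get(url, params=...) or requests.post(url, data=...) builds the same kind of request and sends it in one step.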

Basic request

import requests

url = 'https://www.google.com'
r = requests.get(url)

Response object

The response returned by the previous function call has many attributes and methods for retrieving data.

>>> import requests
>>> r = requests.get('https://www.google.com')
>>> print(r)
<Response [200]>
>>> dir(r)  # dir lists every attribute and method available on r
['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']

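r.content and r.text both hold the response body: r.content is the raw bytes, while r.text is the same data decoded to a string, which is usually what you want for HTML.
r.encoding is the encoding used to decode r.text, normally taken from the response headers.
r.headers holds the headers returned by the server.
r.is_redirect and r.is_permanent_redirect indicate whether this response is itself a redirect. Because requests follows redirects by default, the responses that were followed along the way are listed in r.history.
r.iter_content iterates over the body in chunks of bytes (one byte at a time by default). To convert the bytes to a string, decode them with the encoding in r.encoding.
r.iter_lines is similar to r.iter_content, but iterates over the body line by line. It also yields bytes.
r.json parses the body as JSON and returns the corresponding Python object (for example, a dictionary), provided the response is JSON.
r.raw returns the underlying urllib3.response.HTTPResponse object.
r.status_code is the HTTP status code sent by the server. Codes in the 200 range mean success, while 4xx and 5xx codes indicate errors. r.raise_for_status raises an exception if the status code is a 4xx or 5xx error.
r.url is the final URL of the response.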
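The status handling described above can be tried without a network connection. This sketch builds bare requests.Response objects by hand, which is something real code rarely does, purely to show how r.ok and r.raise_for_status treat different codes; describe is a hypothetical helper name.

```python
import requests


def describe(status_code):
    """Return (r.ok, whether raise_for_status raised) for a given status code."""
    r = requests.Response()      # empty response, no request is made
    r.status_code = status_code
    try:
        r.raise_for_status()     # raises requests.HTTPError for 4xx and 5xx
        raised = False
    except requests.HTTPError:
        raised = True
    return r.ok, raised


for code in (200, 204, 404, 500):
    print(code, describe(code))
```

In a real script the usual pattern is to call r.raise_for_status() right after requests.get(...) and handle requests.HTTPError where a failure can be dealt with.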

Authentication

Requests has authentication built in. Here is an example using basic authentication.

import requests

r = requests.get('http://example.com', auth = requests.auth.HTTPBasicAuth('username', 'password'))

For basic authentication, you can simply pass a tuple.

import requests

r = requests.get('http://example.com', auth = ('username', 'password'))

All other types of authentication are covered in the requests documentation.
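Basic authentication itself is simple: the username and password are joined with a colon, Base64-encoded, and sent in an Authorization header. The sketch below reproduces that header with the standard library only; basic_auth_header is a hypothetical helper, and the choice of UTF-8 for the credentials is an assumption, not necessarily what requests does internally.

```python
import base64


def basic_auth_header(username, password):
    """Return the Authorization header value for HTTP Basic authentication."""
    credentials = f"{username}:{password}".encode("utf-8")  # assumed encoding
    return "Basic " + base64.b64encode(credentials).decode("ascii")


header = basic_auth_header("username", "password")
print(header)  # Basic dXNlcm5hbWU6cGFzc3dvcmQ=
```

Base64 is an encoding, not encryption, so basic authentication should only be used over HTTPS.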

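Queries

Queries in a URL pass values. For example, when you do a Google search, the search URL looks like https://www.google.com/search?q=My+Search+Here&.... Everything after the question mark is the query. A query has the form url?name1=value1&name2=value2.... Requests can build these queries automatically.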

>>> import requests
>>> query = {'q':'test'}
>>> r = requests.get('https://www.google.com/search', params = query)
>>> print(r.url) #prints the final url
https://www.google.com/search?q=test

The real power shows when there are multiple entries.

>>> import requests
>>> query = {'name':'test', 'fakeparam': 'yes', 'anotherfakeparam': 'yes again'}
>>> r = requests.get('http://example.com', params = query)
>>> print(r.url) #prints the final url
http://example.com/?name=test&fakeparam=yes&anotherfakeparam=yes+again

Not only does it pass the values, it also converts special characters and spaces into their URL-encoded form.
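The same kind of encoding can be reproduced with the standard library, which makes it easy to see what happens to awkward characters. This is an illustrative sketch with made-up parameter names; it shows urllib.parse, not the internals of requests.

```python
from urllib.parse import urlencode, parse_qs

# '&', '=' and spaces would break the query if they were left as-is.
query = {'name': 'test', 'tricky': 'a&b=c d'}

encoded = urlencode(query)
print(encoded)  # name=test&tricky=a%26b%3Dc+d

# Decoding recovers the original values (each as a list of strings).
decoded = parse_qs(encoded)
print(decoded)  # {'name': ['test'], 'tricky': ['a&b=c d']}
```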

BeautifulSoup4

BeautifulSoup4 is a powerful HTML parsing library. Let's try it on some sample HTML.

>>> import bs4
>>> example_html = """<!DOCTYPE html>
... <html>
... <head>
... <title>Testing website</title>
... <style>.b{color: blue;}</style>
... </head>
... <body>
... <h1 class='b' id = 'hhh'>A Blue Header</h1>
... <p> I like blue text, I like blue text... </p>
... <p class = 'b'> This text is blue, yay yay yay!</p>
... <p class = 'b'>Check out the <a href = '#hhh'>Blue Header</a></p>
... </body>
... </html>
... """
>>> bs = bs4.BeautifulSoup(example_html)
>>> print(bs)
<!DOCTYPE html>
<html><head><title>Testing website</title><style>.b{color: blue;}</style></head><body><h1 class="b" id="hhh">A Blue Header</h1><p> I like blue text, I like blue text... </p><p class="b"> This text is blue, yay yay yay!</p><p class="b">Check out the <a href="#hhh">Blue Header</a></p></body></html>
>>> print(bs.prettify()) #adds in newlines
<!DOCTYPE html>
<html>
 <head>
  <title>
   Testing website
  </title>
  <style>
   .b{color: blue;}
  </style>
 </head>
 <body>
  <h1 class="b" id="hhh">
   A Blue Header
  </h1>
  <p>
   I like blue text, I like blue text...
  </p>
  <p class="b">
   This text is blue, yay yay yay!
  </p>
  <p class="b">
   Check out the
   <a href="#hhh">
    Blue Header
   </a>
  </p>
 </body>
</html>

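Getting elements

There are two ways to access elements. The first is to type out the tags by hand, walking down the tree in order until you reach the tag you want.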

>>> print(bs.html)
<html><head><title>Testing website</title><style>.b{color: blue;}</style></head><body><h1 class="b" id="hhh">A Blue Header</h1><p> I like blue text, I like blue text... </p><p class="b"> This text is blue, yay yay yay!</p><p class="b">Check out the <a href="#hhh">Blue Header</a></p></body></html>
>>> print(bs.html.body)
<body><h1 class="b" id="hhh">A Blue Header</h1><p> I like blue text, I like blue text... </p><p class="b"> This text is blue, yay yay yay!</p><p class="b">Check out the <a href="#hhh">Blue Header</a></p></body>
>>> print(bs.html.body.h1)
<h1 class="b" id="hhh">A Blue Header</h1>

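However, this is inconvenient for large HTML documents. The find_all function finds every instance of a particular element. It takes an HTML tag name, such as h1 or p, and returns all instances of it.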

>>> p = bs.find_all('p')
>>> p
[<p> I like blue text, I like blue text... </p>, <p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]
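A few variations on find_all are worth knowing. The sketch below assumes bs4 is installed and names the built-in "html.parser" explicitly; the sample HTML is a trimmed copy of the example above, and the variable names are illustrative.

```python
import bs4

html = """
<h1 class="b" id="hhh">A Blue Header</h1>
<p> I like blue text, I like blue text... </p>
<p class="b"> This text is blue, yay yay yay!</p>
<p class="b">Check out the <a href="#hhh">Blue Header</a></p>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

first_p = soup.find("p")                  # only the first match, or None
two_ps = soup.find_all("p", limit=2)      # stop after two matches
headers_and_ps = soup.find_all(["h1", "p"])  # a list matches any of the names
missing = soup.find("table")              # no <table> in this document

print(first_p.get_text(strip=True))
print(len(two_ps), len(headers_and_ps), missing)
```

Because find returns None when nothing matches, it is worth checking the result before accessing attributes on it.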

This is still inconvenient on large sites, since there can be thousands of entries. It can be narrowed down by searching by class or ID.

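>>> blue = bs.find_all('p', class_ = 'b')
>>> blue
[<p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]

Note the trailing underscore: class is a reserved word in Python, so the keyword argument is class_. A misspelling such as _class is treated as a filter on an attribute literally named _class, and returns an empty list. The same filtering can also be done by hand.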

>>> p = bs.find_all('p')
>>> p
[<p> I like blue text, I like blue text... </p>, <p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]
>>> blue = [tag for tag in p if 'b' in tag.get('class', [])]
>>> blue
[<p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]

This checks each element for a class attribute and keeps it only if b is among its classes. From the list, we can do something with each element, such as retrieving the text inside it.
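Beautiful Soup also offers shorter ways to filter by class or ID. This is a brief sketch, assuming bs4 is installed together with its CSS selector support; the sample HTML mirrors the example above and the variable names are illustrative.

```python
import bs4

html = """
<h1 class="b" id="hhh">A Blue Header</h1>
<p> I like blue text, I like blue text... </p>
<p class="b"> This text is blue, yay yay yay!</p>
<p class="b">Check out the <a href="#hhh">Blue Header</a></p>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

blue_paragraphs = soup.select("p.b")     # CSS selector: <p> tags with class b
anything_blue = soup.find_all(class_="b")  # any tag with class b, h1 included
header = soup.find(id="hhh")             # look up a single element by its id

print(len(blue_paragraphs), len(anything_blue), header.name)
```

Selectors are handy when the filter combines several conditions, while find and find_all keep simple lookups readable.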

>>> b = blue[0].text
>>> print(b)
 This text is blue, yay yay yay!
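Besides text, attributes such as a link's href are often what a scraper is after. The following sketch assumes bs4 is installed; the one-paragraph HTML is taken from the example above, and the variable names are illustrative.

```python
import bs4

html = '<p class="b">Check out the <a href="#hhh">Blue Header</a></p>'
soup = bs4.BeautifulSoup(html, "html.parser")

paragraph = soup.find("p")
link = paragraph.a                 # first <a> inside the paragraph

text = paragraph.get_text()        # text of the tag and all of its children
href = link["href"]                # raises KeyError if the attribute is missing
title = link.get("title")          # returns None instead of raising

print(text)
print(href, title)
```

Using get instead of square brackets is the safer choice when an attribute may be absent on some elements.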