[Python] How to restore escaped symbols such as "&", "<", and ">" in HTML with BeautifulSoup

Beautiful Soup

2022.09.02


Hi, ミナピピン (@python_mllover) here!

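When you use Beautiful Soup's replace_with() to swap a tag for a string that contains characters such as <, those characters are escaped and show up in the output as entities like &lt;. This post is a quick note on how to deal with that.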

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0'
}

# Target page URL
url = 'https://swallow.5ch.net/test/read.cgi/livejupiter/1627502377/'

# Fetch the page
res = requests.get(url, headers=headers)

# Parse the HTML
soup = BeautifulSoup(res.content, "html.parser")
post = soup.find_all('div', class_='post')[3]
text = post.find_all('span', class_='escaped')[0]

# Replace each image link with an <img> tag written as a plain string.
# replace_with() treats the string as text, so "<" and ">" are escaped on output.
for tag in text.find_all('a', class_='image'):
    tag.replace_with('<img src="' + tag.text + '" width="400" height="400"></img>')

print(post)  # the inserted markup appears as &lt;img ... &gt;
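
To see the problem without fetching a live page, the behavior can be reproduced on a small inline HTML string. This is only a minimal sketch: the markup and the example.com URL are made-up stand-ins, not the actual page scraped above.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the scraped markup (assumption, not the real page)
html = '<span class="escaped"><a class="image" href="#">https://example.com/a.png</a></span>'
soup = BeautifulSoup(html, "html.parser")

for tag in soup.find_all("a", class_="image"):
    # A plain str is inserted as a text node, not parsed as a tag
    tag.replace_with('<img src="' + tag.text + '">')

# Serializing the tree escapes "<" and ">" inside text nodes
escaped_output = str(soup)
print(escaped_output)
```

The printed string contains &lt;img and &gt; instead of a real img tag, which is the symptom described above.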

Solution

The fix is to pass the string through unescape() from xml.sax.saxutils, which converts the entities back to the original symbols.

from xml.sax.saxutils import unescape
text = '「&amp;」「&lt;」「&gt;」'
print(unescape(text))  # => 「&」「<」「>」
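
Applied to the replace_with() case, the same call can be run on the serialized soup. The sketch below assumes a made-up HTML snippet and URL rather than the real page. It also shows an alternative that is not part of the original approach: building a real Tag with soup.new_tag(), so nothing gets escaped in the first place.

```python
from xml.sax.saxutils import unescape
from bs4 import BeautifulSoup

# Made-up stand-in for the scraped markup (assumption, not the real page)
HTML = '<span class="escaped"><a class="image" href="#">https://example.com/a.png</a></span>'

# Approach from this post: insert the markup as a string, then unescape the output
soup = BeautifulSoup(HTML, "html.parser")
for tag in soup.find_all("a", class_="image"):
    tag.replace_with('<img src="' + tag.text + '" width="400" height="400">')
restored = unescape(str(soup))
print(restored)

# Alternative (not from the post): insert a real Tag so nothing is escaped
soup2 = BeautifulSoup(HTML, "html.parser")
for tag in soup2.find_all("a", class_="image"):
    img = soup2.new_tag("img", src=tag.text, width="400", height="400")
    tag.replace_with(img)
print(soup2)
```

Note that unescape() on the whole output also converts entities that were legitimately escaped in the original text, so it is safer to apply it only to fragments whose content you control. By default it handles only &amp;, &lt;, and &gt;.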


Reference: https://www.takasay.com/entry/2015/07/07/095739

た says:

April 29, 2023, 12:42 AM

Shouldn't
# => '& < >'
be
# => '「&」「<」「>」'
instead?

- ミナピピン@データアナリスト says:
  
  May 9, 2023, 8:06 PM
  
  Ah, you're right. Thank you for pointing that out!