Python Crawler Tools, Part 2: BeautifulSoup (an HTML parser)
Official download page: http://www.crummy.com/software/BeautifulSoup/
Installation on Linux: pip install beautifulsoup4
Test the installation:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import bs4
print(bs4)
Import the library and create a parser object:
from bs4 import BeautifulSoup
soup = BeautifulSoup(
    html_doc,               # the HTML document, as a string
    'html.parser',          # which HTML parser to use
    from_encoding='utf-8'   # encoding of the HTML text (only used when the input is bytes)
)
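To illustrate the constructor above, here is a minimal, runnable sketch. Note that `from_encoding` only takes effect when the markup is passed in as bytes; for a `str` it is ignored:

```python
from bs4 import BeautifulSoup

# from_encoding matters only for bytes input; str input is already decoded.
raw = '<p>你好</p>'.encode('utf-8')
soup = BeautifulSoup(raw, 'html.parser', from_encoding='utf-8')
print(soup.p.get_text())  # -> 你好
```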
Searching for nodes with find_all(name, attrs, string):
# Find all nodes with tag a
soup.find_all('a')
# Find all a nodes whose href is exactly /view/123.html
soup.find_all('a', href='/view/123.html')
# Or match the href against a regular expression (requires import re)
soup.find_all('a', href=re.compile(r'/view/\d+\.html'))
# Find all div nodes with class "abc" and text "python"
# (class is a Python keyword, so the argument is spelled class_)
soup.find_all('div', class_='abc', string='python')
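The three find_all variants above can be exercised end to end against a small hypothetical document (the tags and values below are made up for illustration):

```python
import re
from bs4 import BeautifulSoup

html = '''
<div class="abc">python</div>
<a href="/view/123.html">Link A</a>
<a href="/other.html">Link B</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# Exact attribute match
print(len(soup.find_all('a', href='/view/123.html')))                 # 1

# Regular-expression match on href
print(len(soup.find_all('a', href=re.compile(r'/view/\d+\.html'))))   # 1

# class is a Python keyword, so BeautifulSoup uses class_
print(len(soup.find_all('div', class_='abc', string='python')))       # 1
```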
Accessing a matched node:
# Given a node: <a href='1.html'>Python</a>
# Tag name of the node
node.name
# Value of the node's href attribute
node['href']
# Text of the link
node.get_text()
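The three accessors above can be sketched as a runnable snippet, using the same example node:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a href='1.html'>Python</a>", 'html.parser')
node = soup.find('a')

print(node.name)        # a
print(node['href'])     # 1.html
print(node.get_text())  # Python
```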
A complete example:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html_doc= """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print("All links:")
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())