Python Crawler Tools, Part 2: BeautifulSoup (an HTML parser)
Official site: http://www.crummy.com/software/BeautifulSoup/
Installation (Linux): pip install beautifulsoup4
Quick test that the installation worked:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import bs4
print(bs4)
Import the library and create a parser object:
from bs4 import BeautifulSoup
soup = BeautifulSoup(
    html_doc,               # the HTML document, as a string
    'html.parser',          # which HTML parser to use
    from_encoding='utf-8'   # encoding of the HTML text (only needed for bytes input)
)
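As a minimal sketch of the constructor above, the following parses a small hypothetical HTML snippet with the built-in `html.parser` (no extra install needed) and prints the first link:

```python
from bs4 import BeautifulSoup

# A tiny hypothetical document, just to exercise the constructor
html_doc = "<html><body><a href='/view/123.html'>Python</a></body></html>"

# html.parser ships with the Python standard library
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.a)  # the first <a> tag in the document
```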
Usage: find_all(name, attrs, string)
# Find all <a> nodes
soup.find_all('a')
# Find all <a> nodes whose href is exactly /view/123.html
soup.find_all('a', href='/view/123.html')
# Or match hrefs of the form /view/<digits>.html with a regular expression (requires import re)
soup.find_all('a', href=re.compile(r'/view/\d+\.html'))
# Find all <div> nodes with class "abc" and text "python"
# (class is a Python keyword, so BeautifulSoup uses the class_ parameter)
soup.find_all('div', class_='abc', string='python')
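The following runnable sketch exercises each of the find_all forms above against a small hypothetical document:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document covering each find_all variant
html = """
<div class="abc">python</div>
<a href="/view/123.html">link1</a>
<a href="/view/456.html">link2</a>
<a href="/other.html">link3</a>
"""
soup = BeautifulSoup(html, 'html.parser')

all_a = soup.find_all('a')                                         # every <a> tag
exact = soup.find_all('a', href='/view/123.html')                  # exact href match
pattern = soup.find_all('a', href=re.compile(r'/view/\d+\.html'))  # regex href match
# class is a Python keyword, so bs4 uses class_ instead
divs = soup.find_all('div', class_='abc', string='python')

print(len(all_a), len(exact), len(pattern), len(divs))
```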
# Given a matched node: <a href='1.html'>Python</a>
# Get the node's tag name
node.name
# Get the <a> node's href attribute
node['href']
# Get the <a> node's link text
node.get_text()
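Putting the three accessors above together on the node from the example:

```python
from bs4 import BeautifulSoup

# Build the example node: <a href='1.html'>Python</a>
soup = BeautifulSoup("<a href='1.html'>Python</a>", 'html.parser')
node = soup.find('a')

print(node.name)        # tag name
print(node['href'])     # href attribute
print(node.get_text())  # link text
```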
Full example:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# from_encoding is only needed when passing bytes; html_doc is already a str
soup = BeautifulSoup(html_doc, 'html.parser')
print("All links:")
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())
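Combining the example with the regex form of find_all shown earlier, the sketch below (reusing an abridged copy of the sample document) selects only the links whose href matches an example.com pattern:

```python
import re
from bs4 import BeautifulSoup

# Abridged copy of the sample document above
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# Only links whose href matches the example.com pattern
sisters = soup.find_all('a', href=re.compile(r'^http://example\.com/'))
for link in sisters:
    print(link['id'], link.get_text())
```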