Python Crawler Tools, Part 2: BeautifulSoup (an HTML parser)
Official site: http://www.crummy.com/software/BeautifulSoup/
Installation (Linux): pip install beautifulsoup4
Quick test that the installation worked:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import bs4
print(bs4)
Import the library and create a parser object:
from bs4 import BeautifulSoup
soup = BeautifulSoup(
    html_doc,               # the HTML document, as a string
    'html.parser',          # which HTML parser to use
    from_encoding='utf-8'   # encoding of the HTML text (only needed for bytes input)
)
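As a minimal sketch of the constructor above, the following parses a small hypothetical HTML snippet with the built-in `html.parser` (no extra install needed) and prints the first link:

```python
from bs4 import BeautifulSoup

# A tiny hypothetical document, just to exercise the constructor
html_doc = "<html><body><a href='/view/123.html'>Python</a></body></html>"

# html.parser ships with the Python standard library
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.a)  # the first <a> tag in the document
```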
Usage: find_all(name, attrs, string)
# Find all <a> nodes
soup.find_all('a')
# Find all <a> nodes whose href is exactly /view/123.html
soup.find_all('a', href='/view/123.html')
# Or match hrefs of the form /view/<digits>.html with a regular expression (requires import re)
soup.find_all('a', href=re.compile(r'/view/\d+\.html'))
# Find all <div> nodes with class "abc" and text "python"
# (class is a Python keyword, so BeautifulSoup uses the class_ parameter)
soup.find_all('div', class_='abc', string='python')
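The following runnable sketch exercises each of the find_all forms above against a small hypothetical document:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document covering each find_all variant
html = """
<div class="abc">python</div>
<a href="/view/123.html">link1</a>
<a href="/view/456.html">link2</a>
<a href="/other.html">link3</a>
"""
soup = BeautifulSoup(html, 'html.parser')

all_a = soup.find_all('a')                                         # every <a> tag
exact = soup.find_all('a', href='/view/123.html')                  # exact href match
pattern = soup.find_all('a', href=re.compile(r'/view/\d+\.html'))  # regex href match
# class is a Python keyword, so bs4 uses class_ instead
divs = soup.find_all('div', class_='abc', string='python')

print(len(all_a), len(exact), len(pattern), len(divs))
```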
# Given a matched node: <a href='1.html'>Python</a>
# Get the node's tag name
node.name
# Get the <a> node's href attribute
node['href']
# Get the <a> node's link text
node.get_text()
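Putting the three accessors above together on the node from the example:

```python
from bs4 import BeautifulSoup

# Build the example node: <a href='1.html'>Python</a>
soup = BeautifulSoup("<a href='1.html'>Python</a>", 'html.parser')
node = soup.find('a')

print(node.name)        # tag name
print(node['href'])     # href attribute
print(node.get_text())  # link text
```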
Full example:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# from_encoding is only needed when passing bytes; html_doc is already a str
soup = BeautifulSoup(html_doc, 'html.parser')
print("All links:")
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())
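Combining the example with the regex form of find_all shown earlier, the sketch below (reusing an abridged copy of the sample document) selects only the links whose href matches an example.com pattern:

```python
import re
from bs4 import BeautifulSoup

# Abridged copy of the sample document above
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# Only links whose href matches the example.com pattern
sisters = soup.find_all('a', href=re.compile(r'^http://example\.com/'))
for link in sisters:
    print(link['id'], link.get_text())
```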