Beautiful Soup is a Python module for parsing HTML and XML.
The official Beautiful Soup page
(1) Downloading and installing Beautiful Soup
Download link
Installation is simple: Beautiful Soup is a single file, BeautifulSoup.py; just copy that file into your working directory and import it:
from BeautifulSoup import BeautifulSoup # For processing HTML
from BeautifulSoup import BeautifulStoneSoup # For processing XML
import BeautifulSoup # To get everything
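BeautifulStoneSoup, the XML-oriented class, is imported the same way. As a quick sanity check that the module was found, here is a minimal sketch (the XML snippet is made up for illustration):

# Minimal sanity check, assuming BeautifulSoup.py (the 3.x single-file module)
# has been copied into the working directory. The XML below is illustrative.
from BeautifulSoup import BeautifulStoneSoup

xml = '<doc><entry id="1">hello</entry><entry id="2">world</entry></doc>'
xml_soup = BeautifulStoneSoup(xml)
for entry in xml_soup.findAll('entry'):
    print entry['id'], entry.string   # prints "1 hello", then "2 world"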
(2) Creating a Beautiful Soup object
A BeautifulSoup object only needs a piece of HTML text to be created.
The following code creates a BeautifulSoup object:
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>PythonClub.org</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
# <html>
#  <head>
#   <title>
#    PythonClub.org
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>
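Beautiful Soup is also tolerant of badly formed, real-world markup. A small sketch (not from the original article; the broken snippet is made up) of what happens with unclosed tags:

# Beautiful Soup infers the missing </li> tags for us. The exact tree built
# from broken markup can vary slightly between 3.x releases.
from BeautifulSoup import BeautifulSoup

broken = '<ul><li>one<li>two<li>three</ul>'
fixed = BeautifulSoup(broken)
print len(fixed.findAll('li'))   # prints 3 -- each <li> is properly closed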
Some ways to navigate the soup:
>>> soup.contents[0].name
u'html'
>>> soup.contents[0].contents[0].name
u'head'
>>> head = soup.contents[0].contents[0]
>>> head.parent.name
u'html'
>>> head.next
<title>PythonClub.org</title>
>>> head.nextSibling.name
u'body'
>>> head.nextSibling.contents[0]
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
>>> head.nextSibling.contents[0].nextSibling
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
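The same navigation attributes can be combined into a small recursive walk over the tree. A sketch (not from the original article), reusing the soup built above; Tag is the Beautiful Soup 3 class for elements, and Tag.attrs is a list of (name, value) pairs:

# A sketch reusing the `soup` object built above. NavigableString children
# (the text nodes) are skipped by the isinstance check.
from BeautifulSoup import Tag

def walk(node, depth=0):
    if isinstance(node, Tag):
        print '  ' * depth + node.name, node.attrs
        for child in node.contents:
            walk(child, depth + 1)

walk(soup.html)   # prints html, head, title, body, p, b, p, b with their attrs, indented by depth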
Here are some ways to search the soup for specific tags, or for tags with specific attributes:
>>> titleTag = soup.html.head.title
>>> titleTag
<title>PythonClub.org</title>
>>> titleTag.string
u'PythonClub.org'
>>> len(soup('p'))
2
>>> soup.findAll('p', align="center")
[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>]
>>> soup.find('p', align="center")
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
>>> soup('p', align="center")[0]['id']
u'firstpara'
>>> import re
>>> soup.find('p', align=re.compile('^b.*'))['id']
u'secondpara'
>>> soup.find('p').b.string
u'one'
>>> soup('p')[1].b.string
u'two'
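findAll can also take a list of tag names, a regular expression, True (which matches every tag), or a text= argument that searches the text nodes instead. A short sketch of these forms, continuing the same session (expected results shown for the example document above):

>>> soup.findAll(['title', 'b'])                 # a list matches any of the names
[<title>PythonClub.org</title>, <b>one</b>, <b>two</b>]
>>> [tag.name for tag in soup.findAll(True)]     # True matches every tag
[u'html', u'head', u'title', u'body', u'p', u'b', u'p', u'b']
>>> soup.findAll(text=re.compile('paragraph'))   # search the text nodes
[u'This is paragraph ', u'This is paragraph ']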
Modifying the soup is just as simple:
>>> titleTag['id'] = 'theTitle'
>>> titleTag.contents[0].replaceWith("New title")
>>> soup.html.head
<head><title id="theTitle">New title</title></head>
>>> soup.p.extract()
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
>>> print soup.prettify()
# <html>
#  <head>
#   <title id="theTitle">
#    New title
#   </title>
#  </head>
#  <body>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>
>>> soup.p.replaceWith(soup.b)
>>> print soup.prettify()
# <html>
#  <head>
#   <title id="theTitle">
#    New title
#   </title>
#  </head>
#  <body>
#   <b>
#    two
#   </b>
#  </body>
# </html>
>>> soup.body.insert(0, "This page used to have ")
>>> soup.body.insert(2, " <p> tags!")
>>> soup.body
<body>This page used to have <b>two</b> <p> tags!</body>
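New elements can also be built from scratch with the Tag and NavigableString classes (the soup2 and tag names below are just for illustration). A sketch:

>>> from BeautifulSoup import Tag, NavigableString
>>> soup2 = BeautifulSoup('<body></body>')
>>> tag = Tag(soup2, 'p')                      # a brand-new, empty <p> element
>>> tag['id'] = 'generated'
>>> tag.insert(0, NavigableString('Hello, world'))
>>> soup2.body.insert(0, tag)
>>> soup2.body
<body><p id="generated">Hello, world</p></body>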
A practical example: fetch the ICC Commercial Crime Services weekly piracy report page and use Beautiful Soup to parse it and pull out the piracy incidents that occurred:
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print
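The live page layout (and URL) may well have changed since this was written; the same parsing logic can also be pointed at a saved copy of the page, with a small guard for cells that do not have the expected three children (the filename below is hypothetical):

from BeautifulSoup import BeautifulSoup

# Hypothetical offline variant: parse a locally saved copy of the report page.
soup = BeautifulSoup(open('piracyreport.html').read())
for incident in soup('td', width="90%"):
    parts = incident.contents
    if len(parts) >= 3:                 # skip cells that are laid out differently
        where, linebreak, what = parts[:3]
        print where.strip(), '-', what.strip()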
For more, see the official Chinese tutorial. :)