Beautiful Soup - Quick Start

16 Aug 2011

Beautiful Soup is a Python module for parsing HTML and XML.
Beautiful Soup's official page

(1) Downloading and installing Beautiful Soup
Download link
Installation is simple: Beautiful Soup 3 is just one file, BeautifulSoup.py. Copy it into your working directory and you can import it:
from BeautifulSoup import BeautifulSoup # For processing HTML
from BeautifulSoup import BeautifulStoneSoup # For processing XML
import BeautifulSoup # To get everything

(2) Creating a Beautiful Soup object
All a BeautifulSoup object needs is a piece of HTML text.
The following code creates a BeautifulSoup object:

from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>PythonClub.org</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b> of pythonclub.org.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
# <html>
#  <head>
#   <title>
#    PythonClub.org
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    of pythonclub.org.
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    of pythonclub.org.
#   </p>
#  </body>
# </html>
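
The `''.join(doc)` step simply concatenates the list of HTML fragments into a single string before handing it to the parser. A quick standard-library-only check (the fragment list here is a shortened copy of the one above):

```python
# The document is built as a list of HTML fragments, then joined into one string.
doc = ['<html><head><title>PythonClub.org</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph one',
       '</html>']
html = ''.join(doc)
print(html.startswith('<html><head><title>PythonClub.org</title>'))  # True
print(len(html) == sum(len(part) for part in doc))                   # True
```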

Some methods for navigating the soup:

>>> soup.contents[0].name
u'html'
>>> soup.contents[0].contents[0].name
u'head'
>>> head = soup.contents[0].contents[0]
>>> head.parent.name
u'html'
>>> head.next
<title>PythonClub.org</title>
>>> head.nextSibling.name
u'body'
>>> head.nextSibling.contents[0]
<p id="firstpara" align="center">This is paragraph <b>one</b> of pythonclub.org.</p>
>>> head.nextSibling.contents[0].nextSibling
<p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>

Here are some ways to search the soup for specific tags, or for tags with specific attributes:

>>> titleTag = soup.html.head.title
>>> titleTag
<title>PythonClub.org</title>
>>> titleTag.string
u'PythonClub.org'
>>> len(soup('p'))
2
>>> soup.findAll('p', align="center")
[<p id="firstpara" align="center">This is paragraph <b>one</b> of pythonclub.org.</p>]
>>> soup.find('p', align="center")
<p id="firstpara" align="center">This is paragraph <b>one</b> of pythonclub.org.</p>
>>> soup('p', align="center")[0]['id']
u'firstpara'
>>> import re
>>> soup.find('p', align=re.compile('^b.*'))['id']
u'secondpara'
>>> soup.find('p').b.string
u'one'
>>> soup('p')[1].b.string
u'two'
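
The `re.compile('^b.*')` pattern matches attribute values that start with the letter "b" — which is why it finds the second paragraph (align="blah") rather than the first (align="center"). A quick check with just the stdlib `re` module:

```python
import re

# '^b.*' matches strings beginning with 'b'; this is what selects align="blah".
pattern = re.compile('^b.*')
print(bool(pattern.match('blah')))    # True  -> matches the second paragraph
print(bool(pattern.match('center')))  # False -> skips the first paragraph
```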

Modifying the soup is also easy:

>>> titleTag['id'] = 'theTitle'
>>> titleTag.contents[0].replaceWith("New title")
>>> soup.html.head
<head><title id="theTitle">New title</title></head>
>>> soup.p.extract()
<p id="firstpara" align="center">This is paragraph <b>one</b> of pythonclub.org.</p>
>>> soup.prettify()
# <html>
#  <head>
#   <title id="theTitle">
#    New title
#   </title>
#  </head>
#  <body>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    of pythonclub.org.
#   </p>
#  </body>
# </html>
>>> soup.p.replaceWith(soup.b)
# <html>
#  <head>
#   <title id="theTitle">
#    New title
#   </title>
#  </head>
#  <body>
#   <b>
#    two
#   </b>
#  </body>
# </html>
>>> soup.body.insert(0, "This page used to have ")
>>> soup.body.insert(2, " <p> tags!")
>>> soup.body
# <body>This page used to have <b>two</b> <p> tags!</body>
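
Note that the strings passed to `insert` are treated as plain text, not markup. If you ever need to turn literal angle brackets into HTML entities yourself, the stdlib does it — shown here with Python 3's `html` module (in the Python 2 era this was `cgi.escape`):

```python
import html  # Python 3 stdlib; the Python 2 equivalent was cgi.escape

# Escaping converts markup characters into entities so they display literally.
escaped = html.escape('<p> tags!')
print(escaped)  # &lt;p&gt; tags!
```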

A practical example: scraping the ICC Commercial Crime Services weekly piracy report page, parsing it with Beautiful Soup, and extracting the reported piracy incidents:

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
# Each incident sits in a <td width="90%"> cell: location, a <br>, then the description.
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print
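
The URL above dates from 2011 and the page layout has almost certainly changed since, so here is a self-contained sketch of the same extraction pattern that runs offline, using the stdlib `html.parser` (Python 3) in place of Beautiful Soup — the sample HTML and incident text are invented for illustration:

```python
from html.parser import HTMLParser

# Hypothetical sample mimicking the report's table layout (invented for illustration).
SAMPLE = ('<table><tr><td width="90%">GULF OF ADEN<br>'
          'Armed pirates boarded a tanker.</td></tr></table>')

class IncidentParser(HTMLParser):
    """Collects the text chunks found inside <td width="90%"> cells."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td' and ('width', '90%') in attrs:
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.chunks.append(data.strip())

parser = IncidentParser()
parser.feed(SAMPLE)
where, what = parser.chunks[:2]
print(where)  # GULF OF ADEN
print(what)   # Armed pirates boarded a tanker.
```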

For more, see the official Chinese tutorial :)
