
Overview of Python

web scraping tools

Maik Röder
Barcelona Python Meetup Group
17.05.2012

Data Scraping

An automated process:
Explore and download pages
Grab content
Store it in a database or in a text file
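
A minimal end-to-end sketch of that process (Python 2, matching the rest of these slides; the URL and output file name are only placeholders):

import urllib2

f = urllib2.urlopen('http://example.com/')  # explore / download a page
page = f.read()                             # grab the content
f.close()
out = open('page.txt', 'w')                 # store it in a text file
out.write(page)
out.close()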

urlparse

Manipulate URL strings

urlparse.urlparse()
urlparse.urljoin()
urlparse.urlunparse()
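
A quick sketch of these three helpers in an interactive session (the example URL is illustrative):

>>> import urlparse
>>> parts = urlparse.urlparse('http://www.wunderground.com/history/?q=BCN')
>>> parts.netloc
'www.wunderground.com'
>>> urlparse.urljoin('http://www.wunderground.com/history/', 'airport/BCN')
'http://www.wunderground.com/history/airport/BCN'
>>> urlparse.urlunparse(parts)
'http://www.wunderground.com/history/?q=BCN'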

urllib

Download data through different protocols

HTTP, FTP, ...
urllib.urlopen()
urllib.urlretrieve()
urllib.urlencode()
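
A short sketch of the two download helpers (the local file name is illustrative):

>>> import urllib
>>> f = urllib.urlopen('http://www.wunderground.com/')
>>> page = f.read()
>>> f.close()
>>> urllib.urlretrieve('http://www.wunderground.com/', 'index.html')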

Scrape a web site
Example: http://www.wunderground.com/

Preparation

>>> from StringIO import StringIO
>>> from urllib2 import urlopen
>>> f = urlopen('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
>>> p = f.read()
>>> d = StringIO(p)
>>> f.close()

BeautifulSoup

HTML/XML parser designed for quick turnaround projects like screen-scraping
http://www.crummy.com/software/BeautifulSoup

BeautifulSoup

from BeautifulSoup import BeautifulSoup

a = BeautifulSoup(d).findAll('a')
[x['href'] for x in a]

Faster BeautifulSoup

from BeautifulSoup import BeautifulSoup, SoupStrainer

p = SoupStrainer('a')
a = BeautifulSoup(d, parseOnlyThese=p)
[x['href'] for x in a]

Inspect the Element

Inspect the element that displays the maximum temperature

Find the node

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(d)
>>> attrs = {'class': 'nobr'}
>>> nobrs = soup.findAll(attrs=attrs)
>>> temperature = nobrs[3].span.string
>>> print temperature
23

htmllib.HTMLParser

Interesting only for historical reasons
based on sgmllib
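
For completeness, a small sketch of the historical API: the parser collects the href of every <a> tag in its anchorlist attribute (p is the page downloaded in the Preparation step):

>>> import htmllib, formatter
>>> parser = htmllib.HTMLParser(formatter.NullFormatter())
>>> parser.feed(p)
>>> parser.close()
>>> parser.anchorlist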

html5lib

Using the custom simpletree format:
a built-in DOM-ish tree type (pythonic idioms)

from html5lib import parse
from html5lib import treebuilders
e = treebuilders.simpletree.Element
i = parse(d)
a = [x for x in i if isinstance(x, e) and x.name == 'a']
[x.attributes['href'] for x in a]

lxml

Library for processing XML and HTML
Based on the C libraries libxml2 and libxslt:
sudo aptitude install libxml2-dev
sudo aptitude install libxslt-dev

Extends the ElementTree API
e.g. with XPath

lxml

from lxml import etree

t = etree.parse('t.xml')
for node in t.xpath('//a'):
    node.tag
    node.get('href')
    node.items()
    node.text
    node.getparent()
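
lxml also ships a dedicated HTML parser; a minimal sketch applying it to the page downloaded in the Preparation step (p holds the raw HTML):

from lxml import html

root = html.fromstring(p)
links = [a.get('href') for a in root.xpath('//a[@href]')]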

twill
Simple
No JavaScript
http://twill.idyll.org
Some more interesting concepts
Pages, Scenarios
State Machines

twill

Commonly used methods:

go()
code()
show()
showforms()
formvalue() (or fv())
submit()

Twill

>>> from twill import commands as twill
>>> from twill import get_browser
>>> twill.go('http://www.google.com')
>>> twill.showforms()
>>> twill.formvalue(1, 'q', 'Python')
>>> twill.showforms()
>>> twill.submit()
>>> get_browser().get_html()

Twill - acknowledge_equiv_refresh

>>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
...
twill.errors.TwillException: infinite refresh loop discovered; aborting.
Try turning off acknowledge_equiv_refresh...

Twill

>>> twill.config("acknowledge_equiv_refresh", "false")
>>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
==> at http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html
'http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html'

mechanize

Stateful programmatic web browsing:
navigation history
HTML form state
cookies
ftp:, http: and file: URL schemes
redirections
proxies
Basic and Digest HTTP authentication

mechanize - robots.txt

>>> import mechanize
>>> browser = mechanize.Browser()
>>> browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt

mechanize - robots.txt

Do not handle robots.txt:
browser.set_handle_robots(False)

Do not handle equiv:
browser.set_handle_equiv(False)

browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
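
Once a page is open, mechanize can also fill and submit forms statefully; a minimal sketch (the form index and field name are hypothetical):

>>> browser.select_form(nr=0)        # pick the first form on the page
>>> browser['query'] = 'Barcelona'   # hypothetical field name
>>> response = browser.submit()
>>> page = response.read()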

Selenium

http://seleniumhq.org
Support for JavaScript

Selenium

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import time

Selenium

>>> browser = webdriver.Firefox()
>>> browser.get("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
>>> a = browser.find_element_by_xpath("(//span[contains(@class,'nobr')])[position()=2]/span").text
>>> browser.close()
>>> print a
23

PhantomJS

http://www.phantomjs.org/
Headless WebKit, scriptable via its JavaScript API
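
PhantomJS itself is scripted in JavaScript, but later Selenium releases added a GhostDriver-based PhantomJS driver; a hedged sketch assuming such a Selenium version and the phantomjs binary on the PATH:

from selenium import webdriver

browser = webdriver.PhantomJS()  # needs the phantomjs binary installed
browser.get('http://www.wunderground.com/')
print browser.title
browser.quit()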
