
Overview of Python

web scraping tools

Maik Röder
Barcelona Python Meetup Group
17.05.2012

Data Scraping

An automated process:
Explore and download pages
Grab content
Store it in a database or in a text file
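
A minimal end-to-end sketch of that process (Python 2, matching the rest of these slides; the URL and output file name are only placeholders):

import urllib2

f = urllib2.urlopen('http://example.com/')  # explore / download a page
page = f.read()                             # grab the content
f.close()
out = open('page.txt', 'w')                 # store it in a text file
out.write(page)
out.close()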

urlparse

Manipulate URL strings

urlparse.urlparse()
urlparse.urljoin()
urlparse.urlunparse()
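
A quick sketch of these three helpers in an interactive session (the example URL is illustrative):

>>> import urlparse
>>> parts = urlparse.urlparse('http://www.wunderground.com/history/?q=BCN')
>>> parts.netloc
'www.wunderground.com'
>>> urlparse.urljoin('http://www.wunderground.com/history/', 'airport/BCN')
'http://www.wunderground.com/history/airport/BCN'
>>> urlparse.urlunparse(parts)
'http://www.wunderground.com/history/?q=BCN'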

urllib

Download data through different protocols

HTTP, FTP, ...
urllib.urlopen()
urllib.urlretrieve()
urllib.urlencode()
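
A short sketch of the two download helpers (the local file name is illustrative):

>>> import urllib
>>> f = urllib.urlopen('http://www.wunderground.com/')
>>> page = f.read()
>>> f.close()
>>> urllib.urlretrieve('http://www.wunderground.com/', 'index.html')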

Scrape a web site
Example: http://www.wunderground.com/

Preparation

>>> from StringIO import StringIO
>>> from urllib2 import urlopen
>>> f = urlopen('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
>>> p = f.read()
>>> d = StringIO(p)
>>> f.close()

BeautifulSoup

HTML/XML parser designed for quick turnaround projects like screen-scraping
http://www.crummy.com/software/BeautifulSoup

BeautifulSoup

from BeautifulSoup import BeautifulSoup

a = BeautifulSoup(d).findAll('a')
[x['href'] for x in a]

Faster BeautifulSoup

from BeautifulSoup import BeautifulSoup, SoupStrainer

p = SoupStrainer('a')
a = BeautifulSoup(d, parseOnlyThese=p)
[x['href'] for x in a]

Inspect the Element

Inspect the element that displays the maximum temperature

Find the node

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(d)
>>> attrs = {'class': 'nobr'}
>>> nobrs = soup.findAll(attrs=attrs)
>>> temperature = nobrs[3].span.string
>>> print temperature
23

htmllib.HTMLParser

Interesting only for historical reasons
based on sgmllib
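
For completeness, a small sketch of the historical API: the parser collects the href of every <a> tag in its anchorlist attribute (p is the page downloaded in the Preparation step):

>>> import htmllib, formatter
>>> parser = htmllib.HTMLParser(formatter.NullFormatter())
>>> parser.feed(p)
>>> parser.close()
>>> parser.anchorlist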

html5lib

Using the custom simpletree format:
a built-in DOM-ish tree type (pythonic idioms)

from html5lib import parse
from html5lib import treebuilders
e = treebuilders.simpletree.Element
i = parse(d)
a = [x for x in i if isinstance(x, e) and x.name == 'a']
[x.attributes['href'] for x in a]

lxml

Library for processing XML and HTML
Based on the C libraries libxml2 and libxslt:
sudo aptitude install libxml2-dev
sudo aptitude install libxslt-dev

Extends the ElementTree API
e.g. with XPath

lxml

from lxml import etree

t = etree.parse('t.xml')
for node in t.xpath('//a'):
    node.tag
    node.get('href')
    node.items()
    node.text
    node.getparent()
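
lxml also ships a dedicated HTML parser; a minimal sketch applying it to the page downloaded in the Preparation step (p holds the raw HTML):

from lxml import html

root = html.fromstring(p)
links = [a.get('href') for a in root.xpath('//a[@href]')]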

twill
Simple
No JavaScript
http://twill.idyll.org
Some more interesting concepts
Pages, Scenarios
State Machines

twill

Commonly used methods:

go()
code()
show()
showforms()
formvalue() (or fv())
submit()

Twill

>>> from twill import commands as twill
>>> from twill import get_browser
>>> twill.go('http://www.google.com')
>>> twill.showforms()
>>> twill.formvalue(1, 'q', 'Python')
>>> twill.showforms()
>>> twill.submit()
>>> get_browser().get_html()

Twill - acknowledge_equiv_refresh

>>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
...
twill.errors.TwillException: infinite refresh loop discovered; aborting.
Try turning off acknowledge_equiv_refresh...

Twill

>>> twill.config("acknowledge_equiv_refresh", "false")
>>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
==> at http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html
'http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html'

mechanize

Stateful programmatic web browsing:
navigation history
HTML form state
cookies
ftp:, http: and file: URL schemes
redirections
proxies
Basic and Digest HTTP authentication

mechanize - robots.txt

>>> import mechanize
>>> browser = mechanize.Browser()
>>> browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt

mechanize - robots.txt

Do not handle robots.txt:
browser.set_handle_robots(False)

Do not handle equiv:
browser.set_handle_equiv(False)

browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
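
Once a page is open, mechanize can also fill and submit forms statefully; a minimal sketch (the form index and field name are hypothetical):

>>> browser.select_form(nr=0)        # pick the first form on the page
>>> browser['query'] = 'Barcelona'   # hypothetical field name
>>> response = browser.submit()
>>> page = response.read()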

Selenium

http://seleniumhq.org
Support for JavaScript

Selenium

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import time

Selenium

>>> browser = webdriver.Firefox()
>>> browser.get("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
>>> a = browser.find_element_by_xpath("(//span[contains(@class,'nobr')])[position()=2]/span").text
>>> browser.close()
>>> print a
23

PhantomJS

http://www.phantomjs.org/
Headless WebKit, scriptable via its JavaScript API
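
PhantomJS itself is scripted in JavaScript, but later Selenium releases added a GhostDriver-based PhantomJS driver; a hedged sketch assuming such a Selenium version and the phantomjs binary on the PATH:

from selenium import webdriver

browser = webdriver.PhantomJS()  # needs the phantomjs binary installed
browser.get('http://www.wunderground.com/')
print browser.title
browser.quit()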
