0.2 Lab 2a - Warmup
The IPython (Jupyter) notebooks will have a PDF attached (an export of the notebook).
The advantage of the notebooks (vs. the terminal) is that you can interactively read the questions, write the answers and see the results.
During this tutorial we will cover:
Web scraping with BeautifulSoup4 (unstructured format)
Working with CSV, JSON, XML and HTML documents
Extracting and analyzing tweets using an API wrapper called Twython (structured format)
In addition, we will briefly work with Pandas, Matplotlib and SciPy (not the topic of this tutorial)
0.2.1 An XML/HTML primer
(started for the next practice DIY tasks. Docs here:)
The following line contains the basic elements of an XML/HTML document.
<tag attribute="value">content</tag>
The following HTML code has a table tag containing rows (tr), and each row has three data (td) columns. The style attribute of the
table tells the browser to make the table as wide as the browser window. Jill, Smith and 50 are the contents (usually
called the text) of the tags (in this case the td tags).
<table style="width:100%">
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
</table>
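As a quick check, the table above can be parsed with BeautifulSoup itself. This is a small sketch using Python's built-in html.parser (so no extra parser library is needed); it pulls out the attribute value and the td texts described above:

```python
from bs4 import BeautifulSoup

html = """<table style="width:100%">
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
# attribute access works like a dict
print(table["style"])                      # width:100%
# the text of each td tag, in document order
cells = [td.text for td in table.find_all("td")]
print(cells)                               # ['Jill', 'Smith', '50']
```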
In the next code block, we load BeautifulSoup 4 and parse the HTML for the front page of the University of Melbourne's webpage:
In [ ]: from bs4 import BeautifulSoup as bsoup
        import requests, html5lib
        base_url = "http://unimelb.edu.au"
        try:
            # the parser name is passed as a string
            soup = bsoup(requests.get(base_url).text, "html5lib")
        except Exception as ex:
            print("Exception occurred:", ex)
        print(soup)
We'll typically work with Tag objects, which correspond to the tags representing the structure of an HTML page. For example, to
find the first <p> tag (and its contents) you can use find. You can get the text contents of a Tag using its text property.
In [ ]: first_paragraph = soup.find("p")  # or just soup.p
        print(first_paragraph)
        first_paragraph_text = soup.p.text
        print(first_paragraph_text)
        first_paragraph_words = soup.p.text.split()
        print(first_paragraph_words)
        try:
            first_paragraph_id = soup.p["id"]  # raises KeyError if there is no id
        except KeyError:
            print("No ID found")
        first_paragraph_id2 = soup.p.get("id")  # returns None if no id
        print(first_paragraph_id2)
        # (definition lost in the export; a typical version:)
        span_inside_divs = [span for div in soup.find_all("div")
                            for span in div.find_all("span")]
        print(*span_inside_divs)
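Beyond find, the find_all method returns every matching Tag, and select_one accepts CSS selectors. A minimal, self-contained sketch on a small hypothetical snippet (so it runs without fetching a live page):

```python
from bs4 import BeautifulSoup

html = """<div id="content">
  <p class="intro">Welcome to the <b>site</b>.</p>
  <a href="/about">About</a>
  <a href="/contact">Contact</a>
</div>"""

doc = BeautifulSoup(html, "html.parser")
# find_all returns a list of every matching Tag
links = doc.find_all("a")
hrefs = [a.get("href") for a in links]
print(hrefs)                    # ['/about', '/contact']
# CSS-style selection also works
intro = doc.select_one("p.intro")
print(intro.text)               # Welcome to the site.
```

Note that the text property concatenates the text of nested tags (the b tag above), which is usually what you want when scraping.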
0.2.2 File Formats
Many websites and web services provide application programming interfaces (APIs), which allow you to explicitly request data in a
structured format. This saves you the trouble of having to scrape them.
XML
Sometimes an API provider only provides responses in XML. You can use BeautifulSoup or the Python ElementTree module to get data
from XML, similarly to how we used it to get data from HTML. Just like HTML, an XML file can have attributes. If you think about
it, XHTML documents (the XML serialization of HTML) are a very specific subset of XML documents. More on namespaces: http://www.xml.com/pub/a/1999/01/namespaces.html and
http://www.xmlmaster.org/en/article/d01/c10/
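To make the namespace links above concrete, here is a small sketch of how ElementTree handles a namespaced document. The namespace URI and tag names are invented for illustration; the point is that ElementTree rewrites prefixed tags into {uri}tag form:

```python
import xml.etree.ElementTree as ET

# hypothetical snippet using an XML namespace prefix
ns_xml = """<lib:Books xmlns:lib="http://example.com/library">
  <lib:Book><lib:Title>Designing for Sustainability</lib:Title></lib:Book>
</lib:Books>"""

root = ET.fromstring(ns_xml)
# the lib: prefix is expanded to the full namespace URI
print(root.tag)   # {http://example.com/library}Books
# searches must use the expanded form too
titles = [t.text for t in root.iter("{http://example.com/library}Title")]
print(titles)     # ['Designing for Sustainability']
```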
<?xml version="1.0" encoding="UTF-8"?>
<Books>
<Book>
<Title>Designing for Sustainability</Title>
<Author style="font-weight: bold">Tim Frick</Author>
<PublicationYear>2016</PublicationYear>
<Topics>
<Topic>web development</Topic>
<Topic>design</Topic>
</Topics>
</Book>
<Book>
<Title>Building Microservices</Title>
<Author style="font-weight: bold">Sam Newman</Author>
<PublicationYear>2015</PublicationYear>
<Topics>
<Topic>web development</Topic>
<Topic>autonomous services</Topic>
<Topic>distributed systems</Topic>
</Topics>
</Book>
</Books>
In [ ]: xml_string = """<?xml version="1.0" encoding="UTF-8"?>
<Books>
<Book>
<Title>Designing for Sustainability</Title>
<Author style="font-weight: bold">Tim Frick</Author>
<PublicationYear>2016</PublicationYear>
<Topics>
<Topic>web development</Topic>
<Topic>design</Topic>
</Topics>
</Book>
<Book>
<Title>Building Microservices</Title>
<Author style="font-weight: bold">Sam Newman</Author>
<PublicationYear>2015</PublicationYear>
<Topics>
<Topic>web development</Topic>
<Topic>autonomous services</Topic>
<Topic>distributed systems</Topic>
</Topics>
</Book>
</Books>"""
# parsing using ElementTree
import xml.etree.ElementTree as ET
root = ET.fromstring(xml_string)
# iterate over all elements; print (tag, text) pairs only for the tags nested inside each Book tag
all_books = list(root.iter())
print([(elem.tag, elem.text) for elem in all_books if elem.tag not in ["Books", "Book", "Topics"]])
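Instead of iterating over everything and filtering, ElementTree's find and findall also accept a limited form of XPath. A sketch on the same book data (abbreviated here to keep the block self-contained):

```python
import xml.etree.ElementTree as ET

xml_string = """<Books>
  <Book>
    <Title>Designing for Sustainability</Title>
    <PublicationYear>2016</PublicationYear>
  </Book>
  <Book>
    <Title>Building Microservices</Title>
    <PublicationYear>2015</PublicationYear>
  </Book>
</Books>"""

root = ET.fromstring(xml_string)
# './Book/Title' selects the Title child of each Book child of the root
titles = [t.text for t in root.findall("./Book/Title")]
print(titles)
# find returns only the first match; './/' searches at any depth
first_year = root.find(".//PublicationYear").text
print(first_year)   # 2016
```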
JSON
We can parse JSON using Python's json module, namely the loads function, which deserializes a string representing a JSON object
into a Python object:
In [ ]: import json
serialized = """
[
{
"title" : "Designing for Sustainability",
"author" : "Tim Frick",
"publicationYear" : 2016,
"topics" : [ "web development", "design"]
},
{
"title" : "Building Microservices",
"author" : "Sam Newman",
"publicationYear" : 2015,
"topics" : [ "web development", "autonomous services", "distributed systems"]
}
]
"""
# parse the JSON to create a Python list of dicts
deserialized = json.loads(serialized)
# print items that have "design" as a topic
print(*[item for item in iter(deserialized) if "design" in item["topics"]])
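The inverse direction works too: json.dumps serializes a Python object back into a JSON string, so a loads/dumps round trip preserves the data. A minimal sketch with one of the books from above:

```python
import json

book = {"title": "Building Microservices", "author": "Sam Newman",
        "publicationYear": 2015}
# dumps serializes to a JSON string; indent and sort_keys make it readable
as_json = json.dumps(book, indent=2, sort_keys=True)
print(as_json)
# loads gives back an equal Python object
restored = json.loads(as_json)
print(restored == book)   # True
```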
Now we can normalize it using Pandas:
In [ ]: import pandas as pd
from pandas.io.json import json_normalize
table_data = json_normalize(deserialized)
print(table_data)
We can also import the JSON data using Pandas as a DataFrame and automatically normalize (flatten) it into a two-dimensional
table:
In [ ]: # first, type the url in a browser and take a look at the response
        url = "https://maps.googleapis.com/maps/api/geocode/json?address=melbourne%20university%20australia"
        # create a DataFrame object using the read_json function in Pandas
        data_frame = pd.read_json(url)
        print("Raw DataFrame:", type(data_frame), data_frame)
        # normalize the data automatically (flatten)
        table_data = json_normalize(data_frame["results"])
        # we drop the first column since it's redundant
        table_data.drop("address_components", axis=1, inplace=True)
        table_data
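A file called normed_data.csv is read later in the tutorial; one plausible way such a file is produced is by saving a normalized table like this one with to_csv. The column values below are illustrative stand-ins, not real geocoding output:

```python
import pandas as pd

# a small stand-in for the normalized geocoding table; column names are
# assumed to match the flattened JSON keys used later in the tutorial
table_data = pd.DataFrame({
    "formatted_address": ["Parkville VIC 3010, Australia"],
    "geometry.location.lat": [-37.7963],
    "geometry.location.lng": [144.9614],
})
# index=False drops the row-number column so the CSV header matches the keys
table_data.to_csv("normed_data.csv", index=False)
print(open("normed_data.csv").read().splitlines()[0])
```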
Now, let's try to plot the Euclidean distance between coordinates using the SciPy pdist function: $d(p,q) = \sqrt{(q-p)\cdot(q-p)} = \sqrt{\sum_{i=1}^{n}(q_i-p_i)^2}$
(You don't have to worry about SciPy and Matplotlib during this tutorial.)
In [ ]: import numpy as np
from scipy.spatial.distance import pdist, squareform
import matplotlib.pyplot as plt
%matplotlib inline
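The cell above only shows the imports; the body of the distance computation did not survive the export. As a sketch of how pdist and squareform fit together, on a few hypothetical coordinate pairs:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# hypothetical coordinate pairs (chosen so the distances are easy to verify)
coords = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 4.0]])
# pdist returns the condensed pairwise distance vector (Euclidean by default)
condensed = pdist(coords)
print(condensed)            # [5. 4. 3.]
# squareform expands it into the full symmetric n x n distance matrix
dist_matrix = squareform(condensed)
print(dist_matrix.shape)    # (3, 3)
```

The resulting matrix could then be visualized with, e.g., plt.imshow(dist_matrix).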
In [ ]: import csv
        with open("normed_data.csv") as csvfile:
            reader = csv.DictReader(csvfile)
            print("Header:", *reader.fieldnames)
            for row in reader:
                print(row["geometry.location.lng"], row["geometry.location.lat"],
                      "->", row["formatted_address"])