
Workshop 2a: Warmup Exercises

COMP20008 Elements of Data Processing

0.1 Elements Of Data Processing

0.2 Lab 2a - Warmup

The iPython (Jupyter) notebooks will have a PDF attached (an export of the notebook).
The advantage of the notebooks (vs the terminal) is that you can interactively read the questions, write the answers and see the results.
During this tutorial we will cover:
Web scraping with BeautifulSoup4 (unstructured format)
Working with CSV, JSON, XML and HTML documents
Extracting and analyzing tweets from the Twitter API using the Twython library (structured format)
In addition we will briefly work with Pandas, Matplotlib and Scipy (not the topic of this tutorial)

0.2.1 BeautifulSoup web scraping

The examples in this section should get you started for the next practice DIY tasks. Docs here:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
An XML/HTML primer
The following line contains the basic elements of an XML/HTML document:
<tag attribute="value">content</tag>
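As a minimal sketch (not part of the lab code, and using Python's built-in 'html.parser' backend rather than html5lib), here is how those three pieces map onto BeautifulSoup's Tag API:
from bs4 import BeautifulSoup as bsoup

# parse the one-element example above; 'html.parser' is Python's built-in backend
tiny = bsoup('<tag attribute="value">content</tag>', 'html.parser')
element = tiny.tag        # the first (and only) <tag> element
print(element.name)       # tag name   -> 'tag'
print(element.attrs)      # attributes -> {'attribute': 'value'}
print(element.text)       # content    -> 'content'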
The following is HTML code with a table tag containing rows (tr), where each row has three data (td) cells. The style attribute of the
table tells the browser to make the table as wide as the browser window. Jill, Smith and 50 are the contents (usually
called text) of the tags (in this case td).
<table style="width:100%">
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
</table>
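As a quick sketch (assuming the snippet above is stored in a Python string called table_html), BeautifulSoup can pull the attribute and cell text out of such a table like this:
from bs4 import BeautifulSoup as bsoup

table_html = """<table style="width:100%">
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
</table>"""
table_soup = bsoup(table_html, 'html.parser')
print(table_soup.table['style'])                    # the style attribute -> 'width:100%'
for row in table_soup.find_all('tr'):
    print([td.text for td in row.find_all('td')])   # -> ['Jill', 'Smith', '50']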
In the next code block, we load BeautifulSoup 4 and parse the HTML for the front page of the University of Melbourne's webpage:
In [ ]: from bs4 import BeautifulSoup as bsoup
import requests, html5lib

base_url = "http://unimelb.edu.au"
try:
    soup = bsoup(requests.get(base_url).text, 'html5lib')
except Exception as ex:
    print("Exception occurred:", ex)
print(soup)
We'll typically work with Tag objects, which correspond to the tags representing the structure of an HTML page. For example, to
find the first <p> tag (and its contents) you can use find. You can get the text contents of a Tag using its text property.
In [ ]: first_paragraph = soup.find('p') # or just soup.p
print(first_paragraph)

first_paragraph_text = soup.p.text
print(first_paragraph_text)

first_paragraph_words = soup.p.text.split() # by default split works on whitespace
print(first_paragraph_words)

And you can extract a tag's attributes by treating it like a dict:


In [ ]: try:
    first_paragraph_id = soup.p['id'] # raises KeyError if no id
except KeyError:
    print("No ID found")

first_paragraph_id2 = soup.p.get('id') # returns None if no id
print(first_paragraph_id2)

In [ ]: # You can get multiple tags at once using find_all:

all_paragraphs = soup.find_all('p') # or just soup('p')
print("First 5 paragraphs:\n")
print(*(p.text for p in all_paragraphs[:5])) # printing the text of just the first 5 paragraphs

paragraphs_with_ids = [p for p in soup('p') if p.get('id')]
print("paragraphs_with_ids:", paragraphs_with_ids) # this will be empty

# Frequently you'll want to find tags with a specific class:
prominent_link = soup('a', {'class': 'prominent'})
print("prominent_link:", prominent_link)
prominent_link2 = soup('a', 'prominent')
print("prominent_link2:", prominent_link2)
prominent_link3 = [a for a in soup('a') if 'fs' in a.get('id', [])] # substring match on the id attribute
print("prominent_link3:", prominent_link3)
And you can combine these to implement more elaborate logic. For example, if you want to find every <span> element that is
contained inside a <div> element, you could do this (warning: this will return the same span multiple times if it sits inside multiple divs;
be more clever if that's the case):
In [ ]: span_inside_divs = (span
                            for div in soup('div')    # for each <div> on the page
                            for span in div('span'))  # find each <span> inside it

print(*span_inside_divs)
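One way to "be more clever" about the duplicate problem, sketched here under the assumption that you only care about distinct Tag objects, is to deduplicate by object identity:
# sketch: collect each <span> only once, keyed on Python object identity (id()),
# so a span nested inside several <div> elements is reported a single time
seen = set()
unique_spans = []
for div in soup('div'):
    for span in div('span'):
        if id(span) not in seen:
            seen.add(id(span))
            unique_spans.append(span)
print(len(unique_spans), "distinct <span> elements inside <div> elements")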

0.2.2 File Formats

Many websites and web services provide application programming interfaces (APIs), which allow you to explicitly request data in a
structured format. This saves you the trouble of having to scrape them.
XML
Sometimes an API provider only provides responses in XML. You can use BeautifulSoup or the Python ElementTree module to get data
from XML in much the same way we used it to get data from HTML. Just like HTML elements, XML elements can have attributes. If you think about
it, HTML documents are essentially a very specific kind of XML document. More on namespaces: http://www.xml.com/pub/a/1999/01/namespaces.html and
http://www.xmlmaster.org/en/article/d01/c10/
<?xml version="1.0" encoding="UTF-8"?>
<Books>
<Book>
<Title>Designing for Sustainability</Title>
<Author style="font-weight: bold">Tim Frick</Author>
<PublicationYear>2016</PublicationYear>
<Topics>
<Topic>web development</Topic>
<Topic>design</Topic>
</Topics>
</Book>
<Book>
<Title>Building Microservices</Title>
<Author style="font-weight: bold">Sam Newman</Author>
<PublicationYear>2015</PublicationYear>
<Topics>
<Topic>web development</Topic>
<Topic>autonomous services</Topic>
<Topic>distributed systems</Topic>
</Topics>
</Book>
</Books>
In [ ]: xml_string = """<?xml version="1.0" encoding="UTF-8"?>
<Books>
<Book>
<Title>Designing for Sustainability</Title>
<Author style="font-weight: bold">Tim Frick</Author>
<PublicationYear>2016</PublicationYear>
<Topics>
<Topic>web development</Topic>
<Topic>design</Topic>
</Topics>
</Book>
<Book>
<Title>Building Microservices</Title>
<Author style="font-weight: bold">Sam Newman</Author>
<PublicationYear>2015</PublicationYear>
<Topics>
<Topic>web development</Topic>
<Topic>autonomous services</Topic>
<Topic>distributed systems</Topic>
</Topics>
</Book>
</Books>"""
# parsing using ElementTree
import xml.etree.ElementTree as ET
root = ET.fromstring(xml_string)

# iterate over every element; print the tag and text only for elements nested inside a Book
all_books = list(root.iter())
print([(elem.tag, elem.text) for elem in all_books if elem.tag not in ['Books', 'Book', 'Topics']])

# using find and findall to get sub-elements

print("\n\nUsing find and findall:")
for book in root.findall('Book'):
    title = book.find('Title').text
    topics = book.find('Topics').findall('Topic')
    print('Title:', title, '\nTopics:', [topic.text for topic in topics])
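As mentioned above, BeautifulSoup can read XML as well; here is a rough sketch of the same extraction with bs4, reusing the xml_string from the previous cell (note that the built-in 'html.parser' backend lowercases tag names, so we search for 'book' rather than 'Book'; an XML-aware parser such as lxml would preserve case):
xml_soup = bsoup(xml_string, 'html.parser')
for book in xml_soup.find_all('book'):
    title = book.find('title').text
    topics = [topic.text for topic in book.find_all('topic')]
    style = book.find('author')['style']   # attributes work just like in HTML
    print('Title:', title, '\nTopics:', topics, '\nAuthor style:', style)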
JSON
Because HTTP is a protocol for transferring text, the data you request through a web API needs to be serialized into a string format.
Often this serialization uses JavaScript Object Notation (JSON). JavaScript objects look quite similar to Python dicts, which makes their
string representations easy to interpret:
[
{
"title" : "Designing for Sustainability",
"author" : "Tim Frick",
"publicationYear" : 2016,
"topics" : [ "web development", "design"]
},
{
"title" : "Building Microservices",
"author" : "Sam Newman",
"publicationYear" : 2015,
"topics" : [ "web development", "autonomous services", "distributed systems"]
}
]
We can parse JSON using Python's json module, namely the loads function, which deserializes a string representing a JSON object
into a Python object:
In [ ]: import json
serialized = """
[
{
"title" : "Designing for Sustainability",
"author" : "Tim Frick",
"publicationYear" : 2016,
"topics" : [ "web development", "design"]
},
{
"title" : "Building Microservices",
"author" : "Sam Newman",
"publicationYear" : 2015,
"topics" : [ "web development", "autonomous services", "distributed systems"]
}
]
"""
# parse the JSON to create a Python list of dicts
deserialized = json.loads(serialized)
# print items that have "design" as a topic
print(*[item for item in iter(deserialized) if "design" in item["topics"]])
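Serialization also works in the other direction: json.dumps turns Python objects back into a JSON string (a quick sketch using the objects we just deserialized):
# re-serialize the Python list of dicts back into a JSON string
reserialized = json.dumps(deserialized, indent=2)
print(reserialized[:60], "...")   # show just the start of the string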
Now we can normalize it using Pandas

In [ ]: import pandas as pd
from pandas.io.json import json_normalize
table_data = json_normalize(deserialized)
print(table_data)
We can also import the JSON data using Pandas as a DataFrame and automatically normalize (flatten) it into a two-dimensional
table:
In [ ]: # first, type the url in a browser and take a look at the response
url = "https://maps.googleapis.com/maps/api/geocode/json?address=melbourne%20university%20australia"
# create a DataFrame object using the read_json function in Pandas
data_frame = pd.read_json(url)
print("Raw DataFrame:", type(data_frame), data_frame)
# normalize the data automatically (flatten)
table_data = json_normalize(data_frame['results'])
# we drop the first column (address_components) since it's redundant
table_data.drop('address_components', axis=1, inplace=True)
table_data
Now, let's try to plot the Euclidean distance between coordinates using the Scipy pdist function: $d(p,q) = \sqrt{(q-p) \cdot (q-p)} = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$
(You don't have to worry about Scipy and Matplotlib during this tutorial)
In [ ]: import numpy as np
from scipy.spatial.distance import pdist, squareform
import matplotlib.pyplot as plt
%matplotlib inline


import seaborn as sns; sns.set()


# stacking the latitude and longitude columns of all records
dst = np.vstack((table_data['geometry.location.lat'], table_data['geometry.location.lng']))
# computing pairwise Euclidean distances and converting the condensed result into a square matrix
matt = squareform(pdist(dst.T, 'euclidean'))
# setting up a mask for diagonal elements and upper part of the matrix
mask = np.zeros_like(matt, dtype=bool)
mask[np.triu_indices_from(matt)] = True
# setting up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# generating a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# drawing the heatmap with the mask and correcting the aspect ratio
sns.heatmap(matt, mask=mask, cmap=cmap, vmax=.3,
square=True, xticklabels=8, yticklabels=8,
linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)
Now we can save the results as a CSV file. Loading CSV files is as trivial as saving:
In [ ]: # we can save from numpy directly as a csv (the distance matrix)
np.savetxt("distances.csv", matt, delimiter=",")
# saving the entire JSON data that we imported into pandas
table_data.to_csv("normed_data.csv")
# using the csv python module to load the file we saved previously
import csv
with open('normed_data.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    print("Header:", *reader.fieldnames)
    for row in reader:
        print(row['geometry.location.lng'], row['geometry.location.lat'], "->", row['formatted_address'])
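If you only need the numeric distance matrix back, numpy can read the CSV directly; a small sketch, assuming distances.csv was written by the savetxt call above:
# load the distance matrix we saved with np.savetxt
loaded = np.loadtxt("distances.csv", delimiter=",")
print(loaded.shape)   # should match matt.shape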

