Ch 1 - Web Scraping

Basic Example

from urllib.request import urlopen 
from bs4 import BeautifulSoup 
html = urlopen('http://www.pythonscraping.com/pages/page1.html') 
bs = BeautifulSoup(html.read(), 'html.parser') 
print(bs.h1)

The output is as follows:

<h1>An Interesting Title</h1>

Connecting Reliably and Handling Exceptions

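Two things can go wrong when urlopen requests a page: the page is not found on the server (or an error occurred while retrieving it), or the server is not found at all. In the first situation, the server returns an HTTP error such as “404 Page Not Found” or “500 Internal Server Error.” In all of these cases, the urlopen function raises the generic exception HTTPError. In the second situation, no server responds at all, and urlopen raises a URLError instead. Because HTTPError is a subclass of URLError, the HTTPError clause must come first. You can handle both exceptions in the following way: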

from urllib.request import urlopen 
from urllib.error import HTTPError 
from urllib.error import URLError 
try: 
	html = urlopen('https://pythonscrapingthisurldoesnotexist.com') 
except HTTPError as e: 
	print(e) 
except URLError as e: 
	print('The server could not be found!') 
else: 
	print('It Worked!')
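The same pattern can be wrapped in a small helper, which may make the two failure cases easier to tell apart. This is only a sketch: the function name fetch and its opener parameter are hypothetical (they are not part of urllib); the opener argument exists so the error paths can be exercised without a network connection.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return (response, None) on success or (None, reason) on failure.

    `opener` is a hypothetical hook that defaults to urlopen; passing a
    stand-in function makes the error handling testable offline.
    """
    try:
        return opener(url), None
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        # This clause must come before URLError, its parent class.
        return None, f'HTTP error {e.code}'
    except URLError as e:
        # No usable response at all (DNS failure, refused connection, ...).
        return None, f'server not reached: {e.reason}'
```

A caller would then check the second element of the returned pair before touching the response, for example: response, problem = fetch('http://www.pythonscraping.com/pages/page1.html').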

The following line (where nonExistentTag is a made-up tag, not the name of a real BeautifulSoup function)

print(bs.nonExistentTag)

returns a None object. This object is perfectly reasonable to handle and check for. The trouble comes if you don’t check for it, but instead go on and try to call another function on the None object, as illustrated in the following:

print(bs.nonExistentTag.someTag)

This raises an exception:

AttributeError: 'NoneType' object has no attribute 'someTag'
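One way to avoid this error is to test for None before going one level deeper. The sketch below assumes a small inline HTML string in place of a downloaded page, so it runs without a network connection; the variable names missing and result are illustrative only.

```python
from bs4 import BeautifulSoup

# A stand-in for a downloaded page (assumption: inline HTML, no network).
html = '<html><body><h1>An Interesting Title</h1></body></html>'
bs = BeautifulSoup(html, 'html.parser')

# Accessing a tag that is not in the document yields None, not an exception.
missing = bs.nonExistentTag

# Check for None explicitly before accessing anything on the result.
if missing is None:
    result = None
    print('Tag was not found')
else:
    result = missing.someTag
    print(result)
```

The explicit "is None" test keeps the failure visible at the point where the tag is missing, rather than surfacing later as an AttributeError.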

This checking and handling of every error does seem laborious at first, but it’s easy to add a little reorganization to this code to make it less difficult to write (and, more important, much less difficult to read). This code, for example, is our same scraper written in a slightly different way:

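from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup
def getTitle(url):
	# Return the first <h1> tag inside <body>, or None if it cannot be retrieved.
	try:
		html = urlopen(url)
	except (HTTPError, URLError):
		# The server returned an error status or could not be reached at all.
		return None
	try:
		bs = BeautifulSoup(html.read(), 'html.parser')
		title = bs.body.h1
	except AttributeError:
		# bs.body was None, so the page has no <body> tag.
		return None
	return title
title = getTitle('http://www.pythonscraping.com/pages/page1.html')
# title is also None when <body> exists but contains no <h1>.
if title is None:
	print('Title could not be found')
else:
	print(title)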