Beautiful Soup
BeautifulSoup [1] is a Python library for getting data out of HTML, XML, and other markup languages. Say you have found some web pages that display data relevant to your research, such as date or address information, but that do not provide any way to download the data directly. BeautifulSoup helps you pull particular content from a web page, remove the HTML markup, and save the information. It is a tool for web scraping that helps you clean up and parse the documents you have pulled down from the web, which commonly saves programmers hours or days of work. It creates a parse tree for parsed pages that can be used to extract data from HTML/XML. It is available for Python 2.6+ and Python 3.
How To Install?
BeautifulSoup 4 is published through PyPI [2], so you can install it with easy_install or pip as follows:
$ easy_install beautifulsoup4
$ pip install beautifulsoup4
How It Works?
Making The Soup
It's straightforward: just pass the document to the BeautifulSoup constructor. It accepts a string or an open file handle:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"))
soup = BeautifulSoup("<html>data</html>")
soup = BeautifulSoup("<xml>data</xml>")
Kinds Of Objects
Beautiful Soup converts the HTML/XML document into a complex tree of Python objects, but you will only need to deal with about four kinds of objects (illustrated in the sketch after this list):
- Tag : Corresponds to an XML or HTML tag in the original document. Tags have a lot of attributes and methods.
- NavigableString : A string that corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text. It behaves much like a Python string.
- BeautifulSoup : The BeautifulSoup object itself represents the document as a whole. For most purposes, you can treat it as a Tag object.
- Comment : The Comment object is just a special type of NavigableString for accessing comments within a document.
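A minimal sketch of how these four kinds of objects show up in practice; the HTML snippet below is made up purely for illustration:
from bs4 import BeautifulSoup

html_doc = "<html><head><title>Page Title</title></head><body><p>Some text</p><b><!--a comment--></b></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

print(type(soup))           # <class 'bs4.BeautifulSoup'> - the document as a whole
print(type(soup.p))         # <class 'bs4.element.Tag'> - the first <p> tag
print(type(soup.p.string))  # <class 'bs4.element.NavigableString'> - the text inside <p>
print(type(soup.b.string))  # <class 'bs4.element.Comment'> - the comment inside <b>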
BeautifulSoup provides a lot of different attributes for navigating and iterating over a tag's children.
The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the <head> tag, just say soup.head. You can use this trick again and again to zoom in on a certain part of the parse tree. Using a tag name as an attribute will give you only the first tag by that name. If you need to get all the <a> tags, or anything else more complicated than the first tag with a certain name, you'll need to use the find_all() method.
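A short sketch of this kind of navigation, again on a made-up snippet:
from bs4 import BeautifulSoup

html_doc = '<html><head><title>My Page</title></head><body><a href="/one">One</a><a href="/two">Two</a></body></html>'
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.head)           # the first (and only) <head> tag
print(soup.head.title)     # zoom in further: the <title> inside <head>
print(soup.a)              # only the first <a> tag
print(soup.find_all('a'))  # a list of every <a> tag in the document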
Code Example
# assuming you have the BeautifulSoup library installed
# One common task is extracting all the URLs found within a page's <a> tags
from bs4 import BeautifulSoup
from urllib.request import urlopen  # urllib2 in Python 2 became urllib.request in Python 3

webpage_url = "http://verify.wiki/wiki/Python_(programming_language)"
# use urllib to download the webpage source code
webpage = urlopen(webpage_url)
soup = BeautifulSoup(webpage, "html.parser")
# find all <a> tags and print the value of the href attribute, which is the link to another webpage
for anchor in soup.find_all('a'):
    print(anchor.get('href'))