pseudoramble
The Briefest Introduction to BeautifulSoup
Published on 2015-07-22 Modified on 2015-07-22

I love using Python's BeautifulSoup library. It's a front end to markup parsers that makes the kind of hack jobs I enjoy quite simple. I use it for my website and a few other things I've been working on.

For example, on this blog I use it to smash the markup generated by Markdown into the template I like. I also use it to fill in the list of previous entries that you'll see in the links section.

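from bs4 import BeautifulSoup

def add_previous_entries(view, prev_entries):
    for prev_entry in prev_entries:
        # Parse the earlier entry only to read its title.
        with prev_entry.open() as prev_entry_fd:
            # Name the parser explicitly; otherwise bs4 guesses one and warns.
            prev_soup = BeautifulSoup(prev_entry_fd.read(), 'html.parser')
            title = prev_soup.find('div', class_='article-header').string

        # New tags are created from the soup they will be inserted into.
        link = view.new_tag('a', href=prev_entry.name)
        link.string = title

        li = view.new_tag('li')
        li.append(link)

        view.find('ul', class_='previous-entries').append(li)

    return view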

In the code above, the view parameter is just the entry being modified, and prev_entries is a set of Path objects pointing to the previous entries to link to. For each one, we create a new bowl-o-soup (known as prev_soup in this case) and find the title of the entry, which is listed under the div with a class of 'article-header'. We then hook it into the view's list of previous entries with a new link and li, and move on.
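To see the same steps without any files on disk, here is a minimal sketch that runs on in-memory strings. The two markup snippets and the 'older-post.html' href are made up for illustration; only the class names come from the function above, and 'html.parser' is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the blog template and one previous entry.
VIEW_HTML = '<html><body><ul class="previous-entries"></ul></body></html>'
ENTRY_HTML = '<html><body><div class="article-header">An Older Post</div></body></html>'

view = BeautifulSoup(VIEW_HTML, 'html.parser')
prev_soup = BeautifulSoup(ENTRY_HTML, 'html.parser')

# .string returns the text here because the div holds a single text node.
title = prev_soup.find('div', class_='article-header').string

# Build <li><a href="...">title</a></li> out of tags owned by the view soup.
link = view.new_tag('a', href='older-post.html')  # hypothetical file name
link.string = title

li = view.new_tag('li')
li.append(link)
view.find('ul', class_='previous-entries').append(li)

# Print the updated list to inspect the result.
print(view.find('ul', class_='previous-entries'))
```

One thing to watch for: .string is None when the tag contains more than one child, so a header div with nested markup would need get_text() instead.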

BeautifulSoup also handles XML quite easily, as the next example shows.

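from urllib import request

from bs4 import BeautifulSoup

# SHOW_IDS_URL is assumed to be defined elsewhere in the module.

def get_show_ids():
    def identifier_strings(tag):
        # Match only tags whose name attribute is exactly 'identifier'.
        return tag.has_attr('name') and tag['name'] == 'identifier'

    # The "xml" parser requires the lxml package to be installed.
    soup = BeautifulSoup(request.urlopen(SHOW_IDS_URL).read(), "xml")
    show_ids = [x.string for x in soup.find_all(identifier_strings)]

    return show_ids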

This demonstrates how to tell BeautifulSoup what counts as a valid tag when calling find or find_all. In this case, I provided a function that checks that the tag has a name attribute and that its value is 'identifier'. I then use the Tag's string attribute to access the text value contained within that node.

Of course, there is plenty more you can do with BeautifulSoup. I recommend reading the pretty solid documentation that comes with the library.
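Here is a small sketch of that function filter that runs without a network call. The sample document is invented (the real feed behind SHOW_IDS_URL isn't shown here), and it assumes 'html.parser' so that lxml isn't needed to try it out.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data; element names and values are made up.
SAMPLE = """
<results>
  <str name="identifier">show-one</str>
  <str name="title">Not an ID</str>
  <str name="identifier">show-two</str>
  <int>3</int>
</results>
"""

def identifier_strings(tag):
    # find_all calls this once per tag and keeps those where it returns True.
    return tag.has_attr('name') and tag['name'] == 'identifier'

soup = BeautifulSoup(SAMPLE, 'html.parser')
show_ids = [str(tag.string) for tag in soup.find_all(identifier_strings)]

# The same match written as an attribute filter. The attrs dict is needed
# because 'name' as a keyword argument already means the tag name.
same_ids = [str(tag.string) for tag in soup.find_all(attrs={'name': 'identifier'})]

print(show_ids)
print(same_ids)
```

For a check this simple the attrs form is shorter, but a function is handy once the test grows beyond a single attribute comparison.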

Now go forth, and commit crimes against markup everywhere!