
Python 2.x: Parsing an HTML Page From a String With html5lib
For Python 2.x there is a well-known library for parsing HTML pages: html5lib. This library takes a file object as the parsing source, but sometimes the raw HTML of a page is held in a string variable. So how do we access a string through a file object? Use StringIO!
When you create a StringIO object, you can treat it exactly like a file object: writing, seeking and reading with all the standard functions.
    from StringIO import StringIO

    data = "A whole bunch of information"

    # Create a stream on the string called 'data'.
    dataStream = StringIO()
    dataStream.write(data)
    # Rewind so that later reads start at the beginning of the data.
    dataStream.seek(0)
Now you can pass dataStream to any function expecting a file object!
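As a minimal sketch of what "treating it like a file object" means in practice, the snippet below passes a StringIO stream to a small reader function. The helper first_line is hypothetical and exists only for illustration, and the try/except import is an assumption added so the sketch also runs on Python 3. It also shows one detail that is easy to miss: after write() the stream position is at the end of the buffer, so the stream has to be rewound before it is read.

```python
# Import StringIO in a way that works on Python 2 and Python 3.
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO        # Python 3 (assumption, not in the original)


def first_line(file_obj):
    """Hypothetical helper: return the first line of any file-like object."""
    return file_obj.readline().rstrip("\n")


data = u"first line\nsecond line\n"

stream = StringIO()
stream.write(data)

# After write(), the position sits at the end of the buffer,
# so reading here yields an empty string.
at_end = stream.read()

# Rewind to the start before handing the stream to a reader.
stream.seek(0)
line = first_line(stream)

# Passing the string to the constructor leaves the position at 0,
# so no explicit seek is needed in that case.
stream2 = StringIO(data)
same_line = first_line(stream2)
```

Either approach should work with a function that only calls read() or readline(); constructing the stream directly from the string simply avoids the extra seek.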
Combined with html5lib we can parse an HTML page like this:
    from html5lib import html5parser, treebuilders

    treebuilder = treebuilders.getTreeBuilder("simpleTree")
    parser = html5parser.HTMLParser(tree=treebuilder)
    document = parser.parse(dataStream)
Now the variable document contains the tree representation of the HTML contained in dataStream.…