For Python 2.x there is a well-known library for parsing HTML pages: html5lib. This library expects a file object as the parsing source, but sometimes the raw HTML of a page is held in a string variable. So how do we read a string through a file object? Use StringIO!
When you create a StringIO object, you can treat it exactly like a file object: writing, seeking and reading with all the standard methods.
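    data = "A whole bunch of information"

    # Create a stream on the string called 'data'.
    from StringIO import StringIO
    dataStream = StringIO()
    dataStream.write(data)
    # Rewind to the start; after write() the position is at the end,
    # so a reader would otherwise get nothing back.
    dataStream.seek(0)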
Now you can pass dataStream to any function expecting a file object!
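One detail worth seeing in action is the stream position. The following is a minimal sketch, not part of the original example: count_lines is a hypothetical consumer that only uses the file-object interface, and the try/except import is an assumption added so the snippet also runs on Python 3, where StringIO lives in the io module.

```python
# The StringIO module is Python 2 only; Python 3 moved the class into io.
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3


def count_lines(file_obj):
    """Hypothetical consumer: relies only on iterating over a file object."""
    return sum(1 for _ in file_obj)


data = "first line\nsecond line\nthird line\n"

# Option 1: write into an empty stream, then rewind before reading.
written = StringIO()
written.write(data)
at_end = written.read()   # position is at the end, so this reads nothing
written.seek(0)           # rewind to the start of the stream
line_count = count_lines(written)

# Option 2: hand the string to the constructor; the position starts at 0.
preloaded = StringIO(data)
first_line = preloaded.readline()
```

If a function that reads from the stream seems to receive no data, a missing seek(0) after writing is a likely cause. Passing the string to the constructor avoids the issue when the content is known up front.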
Combined with html5lib we can parse an HTML page like this:
    from html5lib import html5parser, treebuilders

    treebuilder = treebuilders.getTreeBuilder("simpleTree")
    parser = html5parser.HTMLParser(tree=treebuilder)
    document = parser.parse(dataStream)
Now the variable document contains the tree representation of the HTML contained in