=========
lxml.html
=========

:Author: Ian Bicking

Since version 2.0, lxml comes with a dedicated package for dealing with
HTML: ``lxml.html``.  It provides a special Element API for HTML elements,
as well as a number of utilities for common tasks.

.. contents::
..
   1  Parsing HTML
     1.1  Parsing HTML fragments
     1.2  Really broken pages
   2  HTML Element Methods
   3  Running HTML doctests
   4  Creating HTML with the E-factory
     4.1  Viewing your HTML
   5  Working with links
     5.1  Functions
   6  Forms
     6.1  Form Filling Example
     6.2  Form Submission
   7  Cleaning up HTML
     7.1  autolink
     7.2  wordwrap
   8  HTML Diff
   9  Examples
     9.1  Microformat Example

The main API is based on the `lxml.etree`_ API, and thus, on the
ElementTree_ API.

.. _`lxml.etree`: tutorial.html
.. _ElementTree: http://effbot.org/zone/element-index.htm


Parsing HTML
============

Parsing HTML fragments
----------------------

There are several functions available to parse HTML:

``parse(filename_url_or_file)``:
    Parses the named file or URL, or, if the object has a ``.read()``
    method, parses from that.

    If you give a URL, or if the object has a ``.geturl()`` method (as
    file-like objects returned by ``urllib.urlopen()``, or
    ``urllib.request.urlopen()`` in Python 3, have), then that URL is
    used as the base URL.  You can also provide an explicit ``base_url``
    keyword argument.

``document_fromstring(string)``:
    Parses a document from the given string.  This always creates a
    correct HTML document, which means the parent node is ``<html>``,
    and there is a body and possibly a head.

``fragment_fromstring(string, create_parent=False)``:
    Returns an HTML fragment from a string.  The fragment must contain
    just a single element, unless ``create_parent`` is given; e.g.,
    ``fragment_fromstring(string, create_parent='div')`` will wrap the
    element in a ``<div>``.

``fragments_fromstring(string)``:
    Returns a list of the elements found in the fragment.

``fromstring(string)``:
    Returns the result of ``document_fromstring`` or
    ``fragment_fromstring``, based on whether the string looks like a
    full document or just a fragment.

Really broken pages
-------------------

The normal HTML parser is capable of handling broken HTML, but for
pages that are far enough from HTML to call them 'tag soup', it may
still fail to parse the page.  A way to deal with this is ElementSoup_,
which deploys the well-known BeautifulSoup_ parser to build an lxml
HTML tree.

.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
.. _ElementSoup: elementsoup.html


HTML Element Methods
====================

HTML elements have all the methods that come with ElementTree, but
also include some extra methods:

``.drop_tree()``:
    Drops the element and all its children.  Unlike
    ``el.getparent().remove(el)``, this does *not* remove the tail
    text; with ``drop_tree`` the tail text is merged with the previous
    element.

``.drop_tag()``:
    Drops the tag, but keeps its children and text.

``.find_class(class_name)``:
    Returns a list of all the elements with the given CSS class name.
    Note that class names are space separated in HTML, so
    ``doc.find_class('highlight')`` will find an element like ``