This project was created to transform HTML into well-formed XML as part of a Maven build. This is required, for example, when HTML has been generated from DocBook content and needs to be included in a Maven-generated project site.
The transformation of HTML into well-formed XML is done with HTML Cleaner.
When a Maven project site is generated, it incorporates static pages and generated reports. Apart from the static pages in the standard Maven site directory ${project}/src/site/${format}, it also includes dynamically generated static pages, which are placed in the directory ${projectDir}/target/generated-site/${format}. Here ${format} must be one of the formats supported by the Maven Site Plug-in (xdoc, apt, fml, xhtml, twiki, confluence). Both sets of static pages are picked up automatically by the Maven Site Plug-in and converted, together with the generated reports, into a project site.
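The directory layout described above can be sketched in a few lines of Java. This is only an illustration: the class GeneratedSiteDirs and its methods are hypothetical names, not part of Maven or of this plug-in, and the sketch assumes that both directories are resolved against the same project base directory.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
final class GeneratedSiteDirs {

    // The formats listed in the text above.
    static final List<String> FORMATS =
            Arrays.asList("xdoc", "apt", "fml", "xhtml", "twiki", "confluence");

    // Directory for hand-written static pages: <projectDir>/src/site/<format>
    static Path staticDir(Path projectDir, String format) {
        requireSupported(format);
        return projectDir.resolve("src").resolve("site").resolve(format);
    }

    // Directory for generated static pages: <projectDir>/target/generated-site/<format>
    static Path generatedDir(Path projectDir, String format) {
        requireSupported(format);
        return projectDir.resolve("target").resolve("generated-site").resolve(format);
    }

    private static void requireSupported(String format) {
        if (!FORMATS.contains(format)) {
            throw new IllegalArgumentException("Unsupported site format: " + format);
        }
    }

    public static void main(String[] args) {
        Path projectDir = Paths.get("my-project"); // assumed example location
        System.out.println(staticDir(projectDir, "xhtml"));
        System.out.println(generatedDir(projectDir, "xhtml"));
    }
}
```

A tool that writes its cleaned output into the directory returned by generatedDir for the chosen format would, under these assumptions, have its pages picked up alongside the hand-written ones.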
Not all tools create well-formed (X)HTML, which causes a problem when these documents need to be included in a Maven-generated project site. One of the great Maven plug-ins for generating HTML (and PDF) documents from DocBook content is the Maven DocBkx Plug-in, but the HTML it generates is not well-formed XML. This is where the Maven HTML Cleaner Plug-in comes in: it transforms the generated HTML into well-formed XML. This is just one example; the Maven HTML Cleaner Plug-in can transform any document that HTML Cleaner can handle.
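To make the well-formedness problem concrete, the following minimal sketch checks whether a piece of markup parses as XML. It assumes the XML parser that ships with the JDK as the yardstick; the class WellFormedCheck and the sample strings are hypothetical and are not part of this plug-in or of HTML Cleaner.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
final class WellFormedCheck {

    // Returns true if the markup parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default error handler, which prints to stderr;
            // DefaultHandler rethrows fatal errors without printing.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Parse errors such as unclosed or mismatched tags end up here.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Typical HTML: <br> and <p> are never closed.
        String html = "<html><body><p>one<br>two</body></html>";
        // The same content written as well-formed XML.
        String xml = "<html><body><p>one<br/>two</p></body></html>";
        System.out.println(isWellFormed(html));
        System.out.println(isWellFormed(xml));
    }
}
```

The first sample is acceptable to a browser but is rejected by an XML parser, which is the kind of input that needs cleaning before it can be included in a project site.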
HTML Cleaner is a nice Java API that specializes in transforming HTML documents into well-formed XML documents. It ships with an Ant task, but a Maven2 Plug-in is missing.
Once this plug-in is ready, HTML documents can be transformed into well-formed XML documents as part of the Maven build process. This immediately adds value when cleaning up incorrectly generated HTML files for inclusion in Maven project sites (HTML can only be included in a Maven-generated project site if it is well-formed).
The deliverables
The Maven2 Plug-in maven-html-cleaner-plugin.
Next to the plug-in itself, a User Guide describing the setup and usage of the plug-in.
The plan is to get a first release, 1.0.0 beta-1, out in April 2010 to see whether it fulfills the required needs.
This document is based on the Project Overview template from Ready-to-use Software Engineering Templates.