Mar 16
htmlcleaner

Sometimes you get messy HTML from a program (e.g. Word) and then have to separate the wheat from the chaff, i.e. turn it into a well-formed XML document for further processing.

Anyone who has seen the HTML code generated by Word knows what I'm talking about. If you don't want to clean up each document by hand with tools such as HTML Tidy, you will quickly wish for a library that you can build into your own Java programs to do the cleaning for you.

I have had good experiences with HtmlCleaner. Here is a small code snippet:

  import org.htmlcleaner.CleanerProperties;
  import org.htmlcleaner.HtmlCleaner;
  import org.htmlcleaner.PrettyXmlSerializer;
  import org.htmlcleaner.TagNode;

  String html = "<html><body><unsaubererInhalt /></body></html>";

  // Read the configuration and clean the HTML
  HtmlCleaner cleaner = new HtmlCleaner();
  CleanerProperties props = cleaner.getProperties();
  TagNode node = cleaner.clean(html);

  // Extract the body and serialize it as XML
  PrettyXmlSerializer xmlSerializer = new PrettyXmlSerializer(props);
  String xml = xmlSerializer.getXmlAsString(node.findElementByName("body", true));

The whole thing also works very well with dirty XML.

written by gklinkmann

10 Comments on "Clean HTML with Java"

  1. Recommendations from Tuesday 17 M | Biggle's Blog Says:

    [...] Clean HTML with Java [...]

  2. Hama Says:

    Hi, I've been trying to clean up an HTML page using HtmlCleaner, but it does not work! Can you maybe help me? How can I use Java to clean a page loaded from a URL?
    I tried what you wrote and, as I said, it does not work!
    Thank you in advance.
    Kind regards

  3. Hama Says:

    Hello again,
    So, my problem is the "Win Latin" character set (Cp1252)!
    After creating an XML file with the encoding "Cp1252", I cannot process the XML file with an XSLT processor, or even read it!
    The error message is as follows:

    ERROR: 'Invalid encoding name "Cp1252".'
    ERROR: 'com.sun.org.apache.xml.internal.utils.WrappedRuntimeException: Invalid encoding name "Cp1252".'
    ...
    etc.

    Thank you in advance for a positive response :-)

  4. gklinkmann Says:

    As a workaround, try it with UTF-8. There seem to be problems with this in older versions of Xerces (see bugs.sun.com).

  5. Hama Says:

    Workaround
    Open the EAR file in the deploytool GUI and then save it.
    The tool will automatically change the encoding from Cp1252
    to UTF-8.

    I do not understand this part! -> (the EAR file in the deploytool GUI).

  6. Hama Says:

    Hello,

    I could not solve the problem!
    I have searched the Internet a lot for "Cp1252 encoding to UTF8 with Java" and did not find any solution!
    Can you please explain the idea behind a solution to me?
    I would be very grateful.
    Kind regards

  7. Hama Says:

    It seems there is a further workaround:
    An additional JAR file must be stored in the EAR file!

    In Eclipse, under EAR Libraries, I have (Access rules) and (Native library location).

    I do not know which file to store, or where!?

  8. Hama Says:

    hello,

    I tried reading the file in as a test file, replacing "encoding Cp1252" in the first line with "encoding UTF-8", and then saving it as an .xml file again! But the file still cannot be processed as XML!!
    Can you please help me?
    Thank you.

  9. Hama Says:

    Hello,

    as you asked me, I've sent you an email last week.
    I'll be very grateful if you can give me a solution :-)

    Kind regards,
    Hama Baker

  10. Hama Says:

    Hello,

    Have you perhaps found a solution to this problem?
    I need it desperately! If you can give me a solution, I would be very grateful.
    Kind regards,
    Hama
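
A note on the Cp1252 question from the comments above: "Cp1252" is Java's legacy name for the encoding, whereas XML tooling generally expects the IANA name "windows-1252" (or simply UTF-8, as suggested in comment 4). Changing only the text of the first line, as tried in comment 8, presumably fails because the bytes of the file are still Cp1252 and no longer match the declaration. Below is a minimal sketch of the UTF-8 workaround using only the JDK. The class name Cp1252ToUtf8 and both method names are made up for illustration and are not part of HtmlCleaner, and the sketch assumes the XML declaration sits at the very start of the file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HtmlCleaner or any other library.
class Cp1252ToUtf8 {

    /** Decodes Cp1252 bytes and relabels the XML declaration as UTF-8. */
    static String relabelAsUtf8(byte[] cp1252Bytes) {
        // "windows-1252" is the IANA name of the encoding Java also calls "Cp1252".
        Charset cp1252 = Charset.forName("windows-1252");
        String xml = new String(cp1252Bytes, cp1252);
        // Rewrite only the encoding value inside the leading XML declaration.
        return xml.replaceFirst(
                "^(<\\?xml[^>]*?encoding=[\"'])[^\"']*([\"'])",
                "$1UTF-8$2");
    }

    /** Returns the bytes to write to disk so that declaration and content agree. */
    static byte[] toUtf8Bytes(byte[] cp1252Bytes) {
        return relabelAsUtf8(cp1252Bytes).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Sample input with one non-ASCII character (\u00e4), encoded as Cp1252.
        byte[] input = "<?xml version=\"1.0\" encoding=\"Cp1252\"?><body>\u00e4</body>"
                .getBytes(Charset.forName("windows-1252"));
        System.out.println(relabelAsUtf8(input));
    }
}
```

In a real program the input would come from something like java.nio.file.Files.readAllBytes and the result of toUtf8Bytes would be written back with Files.write. The important point is that the declaration and the actual bytes are changed together.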
