Mar 16

Sometimes, as a programmer, you are handed messy HTML (e.g. generated by Word) and have to separate the wheat from the chaff, i.e. turn it into a well-formed XML document for further processing.
Anyone who has seen the HTML code generated by Word knows what I'm talking about. If you don't want to clean up each document by hand with tools such as HTML Tidy, you quickly start wishing for a library that you can embed in your own Java programs to do the cleaning for you. I have had good experience with HtmlCleaner. Here is a small code snippet:
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyXmlSerializer;
import org.htmlcleaner.TagNode;

String html = "<html><body><unsaubererInhalt /></body></html>";

// Read the configuration and clean the HTML
HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
TagNode node = cleaner.clean(html);

// Extract the body and serialize it as XML
PrettyXmlSerializer xmlSerializer = new PrettyXmlSerializer(props);
String xml = xmlSerializer.getXmlAsString(node.findElementByName("body", true));
The whole thing also works very well with dirty XML.




September 26th, 2009 at 11:41 am
Hi, I've been trying to clean up an HTML page using HtmlCleaner, but it does not work. Could you help me? How can I clean the page behind a URL with Java?
I tried what you wrote, but as I said, it does not work.
Thank you in advance.
Regards
September 26th, 2009 at 12:17 pm
Hello again,
So, my problem is the "Win Latin" (Cp1252) character set.
Once an XML file has been created with the encoding "Cp1252", I cannot process it with an XSLT processor, or even read it.
The error message is as follows:
ERROR: 'Invalid encoding name "Cp1252".'
ERROR: 'com.sun.org.apache.xml.internal.utils.WrappedRuntimeException: Invalid encoding name "Cp1252".'
...
etc.
Thank you in advance for a positive response.
September 26th, 2009 at 6:19 pm
As a workaround, try using UTF-8 instead. There seem to be problems with this in older versions of Xerces (see bugs.sun.com).
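A minimal sketch of what "using UTF-8" can mean in code, using only the standard library. One likely reason for the error (an assumption, not something confirmed for this particular setup) is that "Cp1252" is a Java-internal charset name, whereas XML parsers generally expect the registered name "windows-1252" in the XML declaration. The class and method names below are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingNames {

    // Returns the canonical java.nio name of a charset. For the Java alias
    // "Cp1252" this is the registered name "windows-1252".
    public static String xmlEncodingName(String javaName) {
        return Charset.forName(javaName).name();
    }

    // Decodes Cp1252 bytes into text and re-encodes that text as UTF-8.
    public static byte[] cp1252ToUtf8(byte[] cp1252Bytes) {
        String text = new String(cp1252Bytes, Charset.forName("windows-1252"));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(xmlEncodingName("Cp1252"));
        // 0xE4 is the single Cp1252 byte for a-umlaut; UTF-8 needs two bytes.
        byte[] utf8 = cp1252ToUtf8(new byte[] {(byte) 0xE4});
        System.out.println(utf8.length);
    }
}
```

Either option should avoid the problematic name: write the XML as UTF-8 from the start, or keep the Cp1252 bytes but declare them as "windows-1252".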
September 26th, 2009 at 7:43 pm
Workaround
Open the EAR file in the deploytool GUI and then save it.
The tool will automatically change the encoding from Cp1252
to UTF-8.
I do not understand this part: "the EAR file in the deploytool GUI".
September 27th, 2009 at 6:05 pm
Hello,
I could not solve the problem.
I have searched the Internet a lot for "Cp1252 encoding to UTF8 with Java" and did not find a solution.
Could you please explain the idea behind a solution?
I would be very grateful.
Regards
September 27th, 2009 at 6:37 pm
It seems there is another workaround:
An additional JAR file must be stored in the EAR file.
In Eclipse, under EAR Libraries, I have (Access rules) and (Native library location).
I do not know which file has to be stored, or where.
September 29th, 2009 at 1:07 pm
Hello,
As a test, I read the file as a text file, replaced "encoding Cp1252" in the first line with "encoding UTF-8", and then saved it as an .xml file again. But the file still cannot be processed as XML.
Can you please help me?
Thank you.
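A possible explanation, offered as a guess: if only the declaration is edited while the bytes stay in Cp1252, the file claims to be UTF-8 but is not, and any non-ASCII character then makes it invalid. A rough sketch of converting both together with the standard library follows. The class name and method are hypothetical, and it assumes the file starts with an XML declaration that contains an encoding attribute.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XmlTranscoder {

    // Rewrites a Cp1252-encoded XML file as UTF-8: decode the bytes,
    // change the declared encoding, then re-encode the text.
    public static void toUtf8(Path source, Path target) throws IOException {
        String text = new String(Files.readAllBytes(source), Charset.forName("windows-1252"));
        // Replaces the first encoding="..." it finds; this assumes that
        // first match is the one inside the XML declaration.
        String fixed = text.replaceFirst(
                "encoding\\s*=\\s*([\"'])[^\"']*\\1", "encoding=\"UTF-8\"");
        Files.write(target, fixed.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        Path in = Files.createTempFile("cp1252", ".xml");
        Path out = Files.createTempFile("utf8", ".xml");
        String xml = "<?xml version=\"1.0\" encoding=\"Cp1252\"?><body>Gr\u00fc\u00dfe</body>";
        Files.write(in, xml.getBytes(Charset.forName("windows-1252")));

        toUtf8(in, out);

        // The converted file can now be parsed by the JDK's XML parser.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(out.toFile());
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

The converted file can then be handed to the XSLT step in place of the original.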
October 6th, 2009 at 12:29 pm
Hello,
As you asked, I sent you an email last week.
I would be very grateful if you could give me a solution.
Regards,
Hama Baker
October 11th, 2009 at 7:46 pm
Hello,
Have you perhaps found a solution to this problem?
I need it desperately! If you could give me a solution, I would be very grateful.
Regards,
Hama