Wednesday, September 24, 2014

HTML Parsing and Querying with JSoup

Nowadays, working on the web means dealing with a lot of HTML pages.

If you need to parse or query an HTML document, or add its missing tags, then in my view JSoup is the answer.

Here, I am going to show a few small examples of working with an HTML document using the JSoup library.

Please download the JSoup jar file from the JSoup download page and add it to your project's lib directory.

Below are the sample HTML file and the code snippets for the discussion above.

HTML File :

<!DOCTYPE html>
<html>
<body bgcolor="#FFFFFF" text="#000000">
    <div align="center"  id="mainDiv"><font size="6">Welcome to JSoup Test Demo</font> 
    </div>
    <form name="testForm" method="post" action="controller_context_path">
        <table width="90%" border="0" cellspacing="4" cellpadding="4">
            <tr>
                <td width="50%">
                    <div align="right">
                        User Id:
                    </div>
                </td>
                <td width="50%">
                    <div align="left">
                        <input type="text" name="userId" value="">
                    </div>
                </td>
            </tr>
            <tr>
                <td width="50%">
                    <div align="right"  id="div1">
                        Name:
                    </div>
                </td>
                <td width="50%">
                    <div align="left" id="div2">
                        <input type="text" name="orgId" value="">
                    </div>
                </td>
            </tr>
            <tr>
                <td width="50%">
                    <div align="right"></div>
                </td>
                <td width="50%">
                    <div align="left"  id="div3">
                        <input type="submit" name="Submit" value="Login">
                    </div>
                </td>
            </tr>
        </table>
        <p>&nbsp;</p>
    </form>
</body>
</html>
-----------------------------------------------------------
Java Code:
Below is the code snippet to parse the above document and query a specific tag in the HTML page:

File htmlFile = new File("HTML file path"); // java.io.File
org.jsoup.nodes.Document document = Jsoup.parse(htmlFile, "UTF-8"); // org.jsoup.Jsoup; throws IOException

This creates an HTML document containing all the tags. JSoup also repairs the markup while parsing: if the file is broken or some tags are missing their closing tags, the resulting document is still well formed.
Now, if we want the body section of the HTML page:
Element body = document.body();
To convert the body element into a string, simply call the body.toString() method of Element.
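
As a rough illustration of the repair behaviour, here is a minimal sketch. It assumes the markup comes from a string rather than a file (so it uses Jsoup.parse(String) instead of the File overload), and the class name ParseDemo and the helper parseBody are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseDemo {
    // Hypothetical helper: parses the given markup and returns the <body> element
    // of the document that JSoup builds from it.
    static Element parseBody(String html) {
        Document document = Jsoup.parse(html);
        return document.body();
    }

    public static void main(String[] args) {
        // Deliberately incomplete input: no <html> or <body>, and the <div> and <p> are never closed.
        String broken = "<div id=\"mainDiv\"><p>Welcome to JSoup Test Demo";
        Element body = parseBody(broken);
        // Prints the body with the missing structure and closing tags filled in.
        System.out.println(body.outerHtml());
    }
}
```

Even though the input has no body tag, document.body() still returns an element, because JSoup builds the html, head and body structure around whatever it is given.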

For traversing or querying the HTML document, below is the code snippet.

Elements elements = document.select("div#div1"); // tagName#id of the tag

This returns all the div tags in the HTML document that have the id div1.
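
In JSoup's CSS-style selector syntax, # matches an id and . matches a class, so the two are easy to mix up. The sketch below compares them on a cut-down copy of the sample markup; the class name SelectorDemo and the helper countMatches are only illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {
    // A reduced version of the sample page: two divs identified by id, no class attributes.
    static final String HTML =
        "<div align=\"right\" id=\"div1\">Name:</div>"
      + "<div align=\"left\" id=\"div2\"><input type=\"text\" name=\"orgId\" value=\"\"></div>";

    // Hypothetical helper: how many elements does the query match?
    static int countMatches(String cssQuery) {
        Document document = Jsoup.parse(HTML);
        return document.select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(countMatches("div#div1")); // id selector
        System.out.println(countMatches("div.div1")); // class selector; no div here has class="div1"
        // For a single id lookup, getElementById is a direct alternative to select.
        System.out.println(Jsoup.parse(HTML).getElementById("div1").text());
    }
}
```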

To query for an exact element in the HTML document, below is the code snippet:

Element element = document.select("div#div3").select("input[name$=Submit]").first();

This returns the first input element inside div3 whose name attribute ends with the text Submit.
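
The attribute operators are worth spelling out: [attr$=x] matches values ending with x, [attr^=x] values starting with x, and [attr*=x] values containing x. Here is a small sketch against a trimmed copy of the sample form; the class name AttributeSelectorDemo and the helper firstValue are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class AttributeSelectorDemo {
    // Trimmed copy of the sample form: one text input and the submit button.
    static final String HTML =
        "<form name=\"testForm\">"
      + "<div id=\"div2\"><input type=\"text\" name=\"orgId\" value=\"\"></div>"
      + "<div id=\"div3\"><input type=\"submit\" name=\"Submit\" value=\"Login\"></div>"
      + "</form>";

    // Hypothetical helper: value attribute of the first match, or null when nothing matches.
    static String firstValue(String cssQuery) {
        Element match = Jsoup.parse(HTML).select(cssQuery).first();
        return match == null ? null : match.attr("value");
    }

    public static void main(String[] args) {
        System.out.println(firstValue("div#div3 input[name$=Submit]")); // name ends with "Submit"
        System.out.println(firstValue("input[name*=ubmi]"));            // name contains "ubmi"
        System.out.println(firstValue("input[name^=org]"));             // name starts with "org"
        System.out.println(firstValue("input[name=missing]"));          // no match, so first() gave null
    }
}
```

Since first() returns null when the selection is empty, checking for null before reading attributes avoids a NullPointerException.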

For in-depth query patterns, visit http://jsoup.org/cookbook/extracting-data/selector-syntax

That covers the things I wanted to mention here. You can visit http://jsoup.org for more information.

Here is a comparison of the currently available HTML parsers, which is useful when choosing one: http://en.wikipedia.org/wiki/Comparison_of_HTML_parsers

This might help you get started on work involving HTML documents.

Please don't forget to add a comment or feedback if you liked this post or found it useful in any way.

Cheers,
Ashish Mishra