Unit-III Introduction To XML
• XML (eXtensible Markup Language) is a meta language used to write markup languages, which are
used to describe data. There are two major merits to using XML.
i. XML is written in a plain text format. This allows it to be compatible with existing computing
environments.
ii. The extensibility of XML: developers can create their own markup tags, or elements, to best represent
the structure and nature of the data.
• HTML is a markup language that is very useful for specifying how data should be displayed, whereas
XML is very powerful for specifying the structure and context of data.
• XML excels as a format for describing data in a way that can be shared by multiple applications on
many platforms.
• XML can be used as a universal data format, and for the exchange of information between systems on
intranets or the Internet using Web browsers and Java.
• XML is beginning to play an important role in e-business and B2Bs as a universal data format.
XML and HTML
• XML and HTML are not the same. While both have their roots in SGML, they differ in purpose
and in the level at which they operate.
• HTML is a specific markup language and is an application of SGML (or is supposed to be, if properly
written). XML itself operates at the same level as SGML, not that of HTML.
• It is interesting to note that HTML was not originally designed to visually present information. It
started as an abstract document markup language, breaking things up into paragraphs and similar
conceptual units, which the browser was supposed to decide how to render visually.
• XML itself only describes the structure of the data. XML facilitates the separation of data structure and
rendering description, but that's in large part because XML is itself neutral, which allows tools to span
that gap without having to step outside the XML boundaries.
• The following is an HTML sample.
<HTML> <HEAD> <TITLE>Information</TITLE> </HEAD>
<BODY>
<H2>Customer ID Search Results</H2>
<TABLE border=1>
<TR><TD>CustomerID</TD><TD>0000002150</TD></TR>
<TR><TD>Last Name:</TD><TD>SMITH</TD></TR>
<TR><TD>First Name</TD><TD>AUBREY</TD></TR>
<TR><TD>Company</TD><TD>HILLSIDE DR</TD></TR>
<TR><TD>Address</TD><TD>PO BOX 2134</TD></TR>
<TR><TD>Zip</TD><TD>75034</TD></TR>
<TR><TD>Zone</TD><TD>Old Town</TD></TR> </TABLE> <BR>
</BODY> </HTML>
• Now let’s look at a simple example of XML:
<?xml version="1.0" encoding="UTF-8" ?>
<Customer>
<ID>0000002150</ID>
<Lastname>SMITH</Lastname>
<Firstname>AUBREY</Firstname>
<Company>HILLSIDE DR</Company>
<Address>PO BOX 2134</Address>
<Zip>75034</Zip>
<Zone>Old Town</Zone>
</Customer>
Why use XML: There are two main reasons for using XML:
i. For data exchange: XML is a universal data format. If any system needs to exchange data
with another system, there must be an agreed upon common data format. Using XML, any
system can use data directly without any data format conversion.
ii. To present to a browser client: XML does not include any formatting information for the
purpose of display by a browser client. However, by using extensible style sheet
transformation (XSLT) technology, we can transform any XML data format for presentation
on any browser platform. The browser only needs to support XML and XSLT.
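The XSLT approach described above can be sketched as follows. This is a minimal illustration, assuming the standard javax.xml.transform API that ships with the JDK. The class name CustomerToHtml and the small stylesheet are hypothetical and were written only for this example; the input follows the Customer sample shown earlier.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class CustomerToHtml {
    // Hypothetical stylesheet: turns a <Customer> element into an HTML table.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"html\" indent=\"no\"/>"
      + "<xsl:template match=\"/Customer\">"
      + "<table>"
      + "<tr><td>Last Name</td><td><xsl:value-of select=\"Lastname\"/></td></tr>"
      + "<tr><td>Zip</td><td><xsl:value-of select=\"Zip\"/></td></tr>"
      + "</table>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the given XML text and returns the HTML result.
    static String transform(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Customer><ID>0000002150</ID>"
                   + "<Lastname>SMITH</Lastname><Zip>75034</Zip></Customer>";
        System.out.println(transform(xml));
    }
}
```

The XML data carries no display information; all presentation decisions live in the stylesheet, so the same data could be rendered differently by swapping in another stylesheet.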
Connecting with XML: The real value of XML technology is realized when data is shared
among various systems and applications. Since XML is a universal data format, the best way
to use it is to share data between systems. There are two types of connections using XML:
i. Connecting server to server: XML is often used as the data format for connecting data from
server to server. For exchanging data, it is sometimes used to transfer data between servers
for batch processing, for example, transferring files with FTP.
The other use for XML is in referring to data in real time. In this case, the application
programs handle XML directly between each other. For example, an application program can
access XML data from data storage and do some processing on it. The main transport
technologies used to connect servers are FTP, HTTP, SMTP and IBM MQSeries.
ii. Connecting server to client: The other way of using XML is for connecting client to server,
or server to client. Since XML does not contain the information necessary to display or
present itself, we need to use XSL with XML. In this case we need to use clients which can
view XML and XSL, or introduce middleware that uses XSL to “style” the data.
XML document structure
• Using XML, we can describe many types of data which we are used to handling in a database or relational
database.
XML syntax: Let’s start with the following simple example of XML.
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<!DOCTYPE doc SYSTEM "doc.dtd">
<doc>
<title>This is the most simple XML</title>
</doc>
Example:
<?xml version = "1.0"?>
<contact-info>
<name>Tanmay Patil</name>
<company>TutorialsPoint</company>
<phone>(011) 123-4567</phone>
</contact-info>
An XML document has three parts: 1. XML declaration 2. An optional DTD reference 3. XML entities
1. XML declaration: The XML declaration does the following things:
Declares that the document type is XML
Specifies the XML version
Specifies the character encoding
Indicates whether this document is logically complete or references an external entity
• The current version of XML is “1.0”; it is declared in the following part:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
• In this example, the encoding is set to “UTF-8”. As a universal data format, XML can handle Unicode
character encodings.
• The last attribute, “standalone,” declares whether this XML document depends on external markup
declarations, such as an external DTD. If we set “yes” in this attribute, the document does not rely on any
external declarations. When we omit this attribute and the document has external markup declarations, the
value “no” is assumed.
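The information carried by the XML declaration can be read back after parsing. The following is a small sketch, assuming the DOM Level 3 accessors on org.w3c.dom.Document that are available in the JDK; the class name XmlDeclInfo is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlDeclInfo {
    // Parses XML text supplied as UTF-8 bytes, so the declared encoding matches the data.
    static Document parse(String xml) throws Exception {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(bytes));
    }

    public static void main(String[] args) throws Exception {
        Document d = parse(
            "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\" ?><doc/>");
        System.out.println("version    = " + d.getXmlVersion());
        System.out.println("encoding   = " + d.getXmlEncoding());
        System.out.println("standalone = " + d.getXmlStandalone());
    }
}
```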
2. An optional DTD reference:
• The next part after the declaration is a DTD part. “DTD” means Document Type Definition; it
is used to define the structure of an XML document. DTD is defined in either of the following
two ways:
• include the DTD in the XML document itself, or create it in an external file and point to it
from the XML document.
• We call the first the internal DTD and the second is the external DTD. The following is an
example of placing the DTD outside of the XML document. (In this case, DTD is another file
named “doc.dtd”)
<!DOCTYPE doc SYSTEM "doc.dtd">
• In this example, the XML document has the “doc” element as a root element and will use
“doc.dtd” to validate the data.
• “SYSTEM” means that the DTD is located by a system identifier (here, the file name
“doc.dtd”) and is not a publicly registered one. If the DTD used in an XML document is a
public one, we would use “PUBLIC” in this declaration.
3. XML entities:
• The last part, XML entities, contains the real body of the XML document. All the data
elements are defined here. To define this data, we use elements and attributes in XML entities.
An element is a unit of data.
• Every element consists of a “start tag,” “contents,” and “end tag.” The following is an
example of an element.
<Redbook>XML powered by Domino</Redbook>
• In an XML document, we can add some additional information to this element as an attribute.
Attributes must be written inside of the start tag. An example of using attributes is as follows:
<Redbook BookID="LO-0053-R">XML powered by Domino</Redbook>
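To make the difference between element content and attributes concrete, here is a brief sketch that parses the Redbook element above and reads both. It assumes the JDK's built-in javax.xml.parsers DOM parser; the class name RedbookReader is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class RedbookReader {
    // Parses XML text and returns its root element.
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)))
            .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element e = parseRoot(
            "<Redbook BookID=\"LO-0053-R\">XML powered by Domino</Redbook>");
        System.out.println("element   : " + e.getTagName());
        System.out.println("attribute : " + e.getAttribute("BookID"));
        System.out.println("contents  : " + e.getTextContent());
    }
}
```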
Syntax rules: XML documents need a very strict syntax structure. The following are the key points of
the syntax structure.
• Every start tag needs to have a counterpart end tag. The end tag is the same as the start tag with a / character
before it. For example, if you open an element with <document>, you must put </document> at the end of
that element.
• Capital and small letters are regarded as different. Case matters. For example, the following is incorrect
syntax:
<Message> Hello World! </message>
• The XML tree structure must be nested perfectly. For example, the following is incorrect:
<Tree> <NestedTree> leaf </Tree> </NestedTree>
This must be defined as follows.
<Tree> <NestedTree> leaf </NestedTree> </Tree>
• There can be only one root element in an XML document. For example, the <doc> tag is the root element in
the following:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<!DOCTYPE doc SYSTEM "doc.dtd">
<doc>
<title>This is the most simple XML</title>
</doc>
So, the following document is incorrect because there are two root elements in a single XML
document:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<!DOCTYPE doc SYSTEM "doc.dtd">
<doc>
<title>doc for one</title>
</doc>
<doc>
<title>doc for two</title>
</doc>
• Empty elements are allowed. For example, <element></element> is allowed. We can also write
this as <element />.
• Each element can have any number of attributes. The following is an example of using an
attribute: <employee EmployeeID="00125">Tetsuya Miwa</employee>
Syntax Rules for XML Attributes:
Attribute names are written without quotation marks, whereas attribute values must always
appear in quotation marks, either double quotes (") or single quotes ('). The following example
demonstrates incorrect XML syntax:
<employee EmployeeID=00125>Tetsuya Miwa</employee>
Attribute names in XML (unlike HTML) are case sensitive. That is, HREF and href are
considered two different XML attributes.
The same attribute cannot appear twice in one start tag. The following example shows incorrect
syntax because the attribute b is specified twice: <a b = "x" c = "y" b = "z">....</a>
• Comment statements are allowed using the following tags “<!--” “-->”. The following is an
example of a comment.
<!-- This is a sample for comment. We must close the comment tag -->
XML Comments Rules: The following rules should be followed for XML comments −
Comments cannot appear before the XML declaration.
Comments may appear anywhere else in a document, outside other markup.
Comments must not appear within attribute values.
Comments cannot be nested inside other comments.
• XML References: References allow us to add or include additional text or markup in an
XML document. References always begin with the symbol "&", which is a reserved character,
and end with the symbol ";". XML has two types of references −
Entity References − An entity reference contains a name between the start and the end
delimiters. For example, &amp; where amp is the name. The name refers to a predefined string of
text and/or markup.
Character References − These, such as &#65;, contain a hash mark (“#”)
followed by a number. The number always refers to the Unicode code of a character. In this
case, 65 refers to the letter "A".
• XML Text: The names of XML elements and XML attributes are case-sensitive, which means
the names in start and end tags need to be written in the same case. To avoid character
encoding problems, all XML files should be saved as Unicode UTF-8 or UTF-16 files.
• Whitespace characters like blanks, tabs and line-breaks between XML attributes are ignored.
Whitespace between XML elements is usually not significant, although a parser may still
report it to the application.
• Some characters are reserved by the XML syntax itself. Hence, they cannot be used directly.
To use them, the following replacement entities are used −
< is written as &lt;
> is written as &gt;
& is written as &amp;
' is written as &apos;
" is written as &quot;
Well-formed XML documents: Every XML document must satisfy all of these syntax requirements.
XML documents which satisfy the syntax rules are referred to as well-formed XML documents. The
following example is a more complex XML document.
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE customerlist SYSTEM "sample.dtd">
<customerlist>
<customer customerID="00123">
<name>
<last>smith</last>
<first>aubrey</first>
</name>
</customer>
<company>HILLSIDE DR</company>
<customer customerID="00125">
<name>
<last>Miwa</last>
<first>Tetsuya</first>
</name>
</customer>
<company>JOHNSON Co.Ltd</company>
</customerlist>
• This is a well-formed XML document. Every start tag has a counterpart end tag and all the elements
compose a tree structure. The “customerlist” element is the root element for this document and
attributes are also described properly.
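The syntax rules above can be checked mechanically, because a conforming parser must reject any document that is not well-formed. The following is a rough sketch, assuming the JDK's built-in DOM parser; the class name WellFormedCheck and the helper isWellFormed are hypothetical. The test strings are the correct and incorrect examples used in this section.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {
    // Returns true when the parser accepts the text, false on a well-formedness error.
    static boolean isWellFormed(String xml) throws Exception {
        DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to the console.
        b.setErrorHandler(new DefaultHandler());
        try {
            b.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String[] samples = {
            "<Tree> <NestedTree> leaf </NestedTree> </Tree>",   // properly nested
            "<Tree> <NestedTree> leaf </Tree> </NestedTree>",   // bad nesting
            "<Message> Hello World! </message>",                // case mismatch
            "<doc><title>one</title></doc><doc/>",              // two root elements
            "<employee EmployeeID=00125>Tetsuya Miwa</employee>", // unquoted value
            "<a b=\"x\" c=\"y\" b=\"z\">....</a>",              // duplicate attribute
            "<element />"                                       // empty element
        };
        for (String s : samples) {
            System.out.println(isWellFormed(s) + "  " + s);
        }
    }
}
```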
Well-formed and valid XML documents: XML documents can be classified into two
categories.
• The first category is “Well-formed XML documents” and the other is “Valid XML
documents.” Every XML document must be a well-formed XML document.
• Well-formed means the XML document conforms to the XML syntax rules. Well-formed
documents do not always need a DTD.
• On the other hand, Valid XML documents must have a DTD and must strongly adhere to the
structure defined in the DTD. Table 2 summarizes the difference between these two
categories.
• When we use XML Documents to transfer data between client and server, we usually are only
concerned that the document is well-formed.
• On the other hand, when we use XML for exchanging data between two servers, using valid
XML is very important as it enables us to verify that the document we are receiving is in the
correct format before we process it.
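The difference between well-formed and valid can be demonstrated with a validating parser. The following is a sketch under stated assumptions: it uses the JDK's built-in DOM parser with validation switched on, and a small internal DTD invented for this illustration (a doc element containing exactly one title). The class name DtdValidation is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class DtdValidation {
    // Hypothetical internal DTD: <doc> must contain exactly one <title>.
    static final String DTD =
        "<!DOCTYPE doc [\n"
      + "  <!ELEMENT doc (title)>\n"
      + "  <!ELEMENT title (#PCDATA)>\n"
      + "]>\n";

    // Parses DTD + body with validation on and returns the validity errors found.
    static List<String> validate(String body) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setValidating(true); // check the document against its DTD
        DocumentBuilder b = f.newDocumentBuilder();
        List<String> errors = new ArrayList<>();
        b.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            // Validity violations are reported here; parsing continues.
            public void error(SAXParseException e) { errors.add(e.getMessage()); }
            // Well-formedness violations are fatal and stop the parse.
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        b.parse(new InputSource(new StringReader(DTD + body)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(validate("<doc><title>This is the most simple XML</title></doc>"));
        System.out.println(validate("<doc><author>unexpected</author></doc>"));
    }
}
```

Both inputs are well-formed, so neither throws; only the second one produces validity errors, because it does not follow the structure the DTD defines.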
XML Character Entities:
• The XML specification describes the document entity as follows: "The document entity serves
as the root of the entity tree and a starting-point for an XML processor".
• This means that entities are placeholders in XML. These can be declared in the document
prolog or in a DTD. There are different types of entities.
• Both HTML and XML have some symbols reserved for their use, which cannot be used as
content in XML code. For example, < and > signs are used for opening and closing XML tags.
To display these special characters, character entities are used.
• There are a few special characters or symbols which cannot be typed directly from
the keyboard. Character entities can also be used to display those symbols/special characters.
• Types of Character Entities: There are three types of character entities −
i. Predefined Character Entities
ii. Numeric Character Entities
iii. Named Character Entities
• Predefined Character Entities: They are introduced to avoid ambiguity while using
some symbols. For example, the less than ( < ) and greater than ( > ) symbols are used to delimit
tags in XML, so they need a substitute when they appear in content. Following is a list of
pre-defined character entities from the XML specification. These can be used to express
characters without ambiguity.
Ampersand − &amp;
Single quote − &apos;
Greater than − &gt;
Less than − &lt;
Double quote − &quot;
• Numeric Character Entities: A numeric reference refers to a character by its number in the
Unicode character set, so thousands of numeric references are available. A numeric reference
can be written in either decimal or hexadecimal format. General syntax for decimal numeric
reference is −
&# decimal number ; (written without spaces, for example &#65;)
General syntax for hexadecimal numeric reference is −
&#x hexadecimal number ; (written without spaces, for example &#x41;)
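• Named Character Entities: The most preferred type of character entity is the named character
entity. Here, each entity is identified with a name. For example −
'Aacute' represents the capital letter A with acute accent (Á).
'ugrave' represents the small letter u with grave accent (ù).
Unlike the five predefined entities, such names must be declared in a DTD before they can be
used in an XML document.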
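The three kinds of character entities can be tried out with a short sketch. It assumes the JDK's built-in DOM parser; the class name EntityDemo, the helper names, and the internal DTD that declares Aacute are hypothetical, written only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class EntityDemo {
    // Replaces the reserved characters with the five predefined entities.
    static String escape(String s) {
        return s.replace("&", "&amp;")   // must run first, or it would re-escape the others
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Parses XML text and returns the root element's text with all references resolved.
    static String text(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)))
            .getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Predefined and numeric (decimal and hexadecimal) references.
        System.out.println(text("<t>&lt;b&gt; &amp; &#65;&#x41;</t>"));
        // A named entity has to be declared first, here in an internal DTD.
        String dtd = "<!DOCTYPE t [<!ENTITY Aacute \"&#193;\">]>";
        System.out.println(text(dtd + "<t>&Aacute;</t>"));
        // Escaping arbitrary text before placing it in a document.
        System.out.println(escape("Tom & \"Jerry\" <cat>"));
    }
}
```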
• As this figure shows, the node object is the primary data type in this model. Using the DOM API, we can
handle node objects in many ways. There are two ways of handling node objects:
Referring and setting a value to each node object
Accessing the tree structure with a node object
• Referring and setting a value to each node object: For the first way, referring and setting
a value to each node object, we often use the following four methods:
getNodeName(): This method returns the name of a node. The name depends on the type of
node it is. Sometimes it is an element name and sometimes it is an attribute name.
getNodeValue(): This method returns the value of a node. The value depends on the type of
node it is.
getNodeType(): This method returns the type of a node.
setNodeValue(arg): This method sets a value to the node.
• Accessing the tree structure with a node object: For the second way, accessing the tree
structure with a node object, we often use the following methods:
getParentNode(): This method gets the parent node.
getPreviousSibling(), getNextSibling(): These methods get the neighboring nodes at the same tree level.
getFirstChild(), getLastChild(), getChildNodes(), getElementsByTagName(): These methods
get child nodes (getElementsByTagName() also returns deeper descendants).
appendChild(), removeChild(), replaceChild(): These methods are used for adding,
deleting and replacing a child node.
• In this way, we can control XML documents easily with the DOM API. The following code is a
simple example of a Java program that uses the DOM API in a Notes Java Agent.
import lotus.domino.*;
import org.w3c.dom.*;

public class JavaAgent extends AgentBase {
    public void NotesMain() {
        try {
            Session ns = getSession();
            AgentContext ac = ns.getAgentContext();
            lotus.domino.Document doc = ac.getDocumentContext();
            Item rawXML = doc.getFirstItem("XMLDATA");
            // Parse the XML text held in the item into a DOM tree (false = do not validate)
            org.w3c.dom.Document xDoc = rawXML.parseXML(false);
            Element el = xDoc.getDocumentElement();
            // Using DOM API
            String rootTag = el.getTagName();
            System.out.println("The Root Element is " + rootTag);
            NodeList nl = xDoc.getElementsByTagName(rootTag);
            System.out.println("There is " + nl.getLength() + " node in the Root Node List");
            Node n = nl.item(0);
            nl = n.getChildNodes();
            System.out.println("The " + rootTag + " Root Tag has " + nl.getLength() + " child nodes");
            // Loop over the actual number of child nodes
            for (int i = 0; i < nl.getLength(); i++) {
                n = nl.item(i);
                // Skip text and comment nodes; report element nodes only
                if (n.getNodeType() == Node.ELEMENT_NODE) {
                    System.out.println("Node Name is: " + n.getNodeName());
                    System.out.println("Node Value is: " + n.getNodeValue());
                    System.out.println("Node Type is: " + n.getNodeType());
                    System.out.println("This node has child nodes: " + n.hasChildNodes());
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
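The agent above needs a Notes/Domino environment to run. As a standalone sketch of the same navigation and editing methods, the following uses only the JDK's javax.xml.parsers and org.w3c.dom packages; the class name DomWalk and the helper childElementNames are hypothetical. It also illustrates a point that is easy to miss: for an element node, getNodeValue() returns null, because the text is held in a child text node.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomWalk {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // Walks the children with getFirstChild()/getNextSibling() and collects element names.
    static List<String> childElementNames(Node parent) {
        List<String> names = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                names.add(n.getNodeName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        Document d = parse(
            "<Customer><ID>0000002150</ID><Lastname>SMITH</Lastname></Customer>");
        Element root = d.getDocumentElement();
        System.out.println("children before: " + childElementNames(root));

        // The text of an element lives in its child text node.
        Node last = root.getLastChild();
        System.out.println("element value: " + last.getNodeValue());
        System.out.println("text value:    " + last.getFirstChild().getNodeValue());
        last.getFirstChild().setNodeValue("JONES");

        // Add a new child element and remove the first one.
        Element zip = d.createElement("Zip");
        zip.appendChild(d.createTextNode("75034"));
        root.appendChild(zip);
        root.removeChild(root.getFirstChild());

        System.out.println("children after:  " + childElementNames(root));
        System.out.println("parent of Zip:   " + zip.getParentNode().getNodeName());
    }
}
```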
Simple API for XML (SAX): Another API for handling XML documents is the Simple API for
XML (SAX).
• While DOM is a W3C recommendation, SAX was developed by the people on the xml-dev mailing
list. SAX was originally written by this group because they wanted a simple and lightweight
API for processing XML documents.
• SAX is becoming the de-facto standard for server to server XML processing. SAX does not
create an object tree; instead, it is an event-driven lightweight API.
• The XML document is processed just once, passing each element event to the event handler.
• Every application which uses SAX must register SAX event handlers to a parser object. If we
write this in Java code, the following shows this registration.
parser.setDocumentHandler(new myDochandler() );
• SAX provides three handler interfaces: 1. DocumentHandler 2. DTDHandler 3. ErrorHandler
• The most important interface is “DocumentHandler,” which handles elements. The other two are
handlers for DTD and XML errors. Table 3 shows the methods of the DocumentHandler interface.
• The following is a simple example of a SAX application that is used in a Notes/Domino application as a
Java Agent.
import org.xml.sax.*;
import org.xml.sax.Parser;
import lotus.domino.*;
import org.xml.sax.helpers.ParserFactory;
import java.io.*;

public class JavaAgent extends AgentBase {
    public void NotesMain() {
        try {
            Session ns = getSession();
            AgentContext ac = ns.getAgentContext();
            lotus.domino.Document doc = ac.getDocumentContext();
            Item rawXML = doc.getFirstItem("XMLDATA");
            Reader r = rawXML.getReader();
            // The following line of code instantiates the parser directly
            // com.ibm.xml.parsers.SAXParser p = new com.ibm.xml.parsers.SAXParser();
            // or use this to create the parser through ParserFactory
            Parser p = ParserFactory.makeParser("com.ibm.xml.parsers.SAXParser");
            // Register the event handler, then parse; events are delivered during parse()
            p.setDocumentHandler(new myDochandler());
            p.parse(new InputSource(r));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
myDochandler.java
import org.xml.sax.*;

public class myDochandler extends HandlerBase {
    public void startDocument() {
        System.out.println("Begining of Document");
    }

    public void endDocument() {
        System.out.println("End of Document");
    }

    public void endElement(String name) throws SAXException {
        System.out.println("End of Element - " + name);
    }

    public void startElement(String name, AttributeList attbs) throws SAXException {
        System.out.println("Start of Element - " + name);
    }

    public void processingInstruction(String t, String d) throws SAXException {
        System.out.println("PI target - " + t);
    }

    public void characters(char c[], int start, int length) throws SAXException {
        String s = new String(c, start, length);
        // Skip single-character runs such as the line breaks between elements
        if (s.length() > 1) System.out.println(s);
    }
}
This program will parse the XML document and print the result of its parsing with SAX events.
For example, we show the sample data and its result.
<?xml version="1.0" encoding="UTF-8" ?>
<Customer>
<CustomerID>0000002150</CustomerID>
<Lastname>SMITH</Lastname>
<Firstname>AUBREY</Firstname>
<Company>HILLSIDE DR</Company>
<Address>PO BOX 2134</Address>
<Zip>75034</Zip>
<Zone>Old Town</Zone>
</Customer>
The result of this document is the following:
Begining of Document
Start of Element - Customer
Start of Element - CustomerID
0000002150
End of Element - CustomerID
Start of Element - Lastname
SMITH
End of Element - Lastname
Start of Element - Firstname
AUBREY
End of Element - Firstname
Start of Element - Company
HILLSIDE DR
End of Element - Company
Start of Element - Address
PO BOX 2134
End of Element - Address
Start of Element - Zip
75034
End of Element - Zip
Start of Element - Zone
Old Town
End of Element - Zone
End of Element - Customer
End of Document
Figure 6 shows the image of SAX programming.
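The listings above use the original SAX 1 interfaces (Parser, DocumentHandler, HandlerBase), which were later deprecated in favor of SAX 2. As a standalone sketch of the same event-driven idea, the following uses the SAX 2 DefaultHandler and the JDK's SAXParserFactory, with no Domino classes. The class name Sax2Events and the choice to collect events in a list are assumptions made for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class Sax2Events {
    // Parses the XML text once and returns the events in the order they were fired.
    static List<String> events(String xml) throws Exception {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() {
                log.add("Beginning of Document");
            }
            @Override public void endDocument() {
                log.add("End of Document");
            }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("Start of Element - " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("End of Element - " + qName);
            }
            @Override public void characters(char[] c, int start, int length) {
                // Ignore runs that contain only whitespace
                String s = new String(c, start, length).trim();
                if (!s.isEmpty()) {
                    log.add(s);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return log;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Customer><CustomerID>0000002150</CustomerID>"
                   + "<Lastname>SMITH</Lastname></Customer>";
        for (String event : events(xml)) {
            System.out.println(event);
        }
    }
}
```

No object tree is built here: each event is handled as the parser reaches it, which is why SAX suits large documents that only need to be read once.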