Contents:
The URL Class
Web Browsers and Handlers
Writing a Content Handler
Writing a Protocol Handler
Talking to CGI Programs
A URL points to an object on the Internet. It's a collection of information that identifies an item, tells you where to find it, and specifies a method for communicating with it or retrieving it from its source. A URL refers to any kind of information source. It might point to static data, such as a file on a local filesystem, a Web server, or an FTP archive; or it can point to a more dynamic object such as a news article on a news spool or a record in a WAIS database. URLs can even refer to less tangible resources such as Telnet sessions and mailing addresses.
The Java URL classes provide an
API for accessing well-defined networked resources,
like documents and applications on servers. The classes use an
extensible set of prefabricated protocol and content handlers to
perform the necessary communication and data conversion for accessing
URL resources. With URLs, an
application can fetch a complete file or database record from a server
on the network with just a few lines of code. Applications like Web
browsers, which deal with networked content, use the
URL class to simplify the task
of network programming. They also take advantage of the dynamic
nature of Java, which allows handlers for new types of
URLs to be added on the fly. As new types of
servers and new formats for content evolve, additional
URL handlers can be supplied to retrieve and
interpret the data without modifying the original application.
A URL is usually presented as a string of text, like an address.[1] Since there are many different ways to locate an item on the Net, and different mediums and transports require different kinds of information, there are different formats for different kinds of URLs. The most common form specifies three things: a network host or server, the name of the item and its location on that host, and a protocol by which the host should communicate:
    protocol://hostname/location/item
protocol is an identifier such as "http," "ftp," or "gopher"; hostname is an Internet hostname; and the location and item components form a path that identifies the object on that host. Variants of this form allow extra information to be packed into the URL, specifying things like port numbers for the communications protocol and fragment identifiers that reference parts inside the object.
[1] The term URL was coined by the Uniform Resource Identifier (URI) working group of the IETF to distinguish URLs from the more general notion of Uniform Resource Names or URNs. URLs are really just static addresses, whereas URNs would be more persistent and abstract identifiers used to resolve the location of an object anywhere on the Net. URLs are defined in RFC 1738 and RFC 1808.
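As a rough illustration of the general form and its variants, the sketch below parses an address that also carries a port number and a fragment identifier, then prints each piece. The address itself (port 8080, fragment "section2") and the class and method names are made up for the example; no connection is opened. Recent JDKs mark the URL constructors as deprecated in favor of building a URI first, but the constructors are used here to match the text.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlParts {

    // Builds a one-line summary of the components of a URL (hypothetical helper).
    static String summarize(URL url) {
        return "protocol=" + url.getProtocol()
             + " host=" + url.getHost()
             + " port=" + url.getPort()
             + " file=" + url.getFile()
             + " ref=" + url.getRef();
    }

    public static void main(String[] args) throws MalformedURLException {
        // Made-up address with an explicit port and a fragment identifier.
        URL url = new URL("http://foo.bar.com:8080/documents/homepage.html#section2");

        // Parsing only; nothing is fetched from the network here.
        System.out.println(summarize(url));
        // prints: protocol=http host=foo.bar.com port=8080 file=/documents/homepage.html ref=section2
    }
}
```

When no port is written in the address, getPort() returns -1 and the protocol's default port is used at connection time.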
We sometimes speak of a URL that is relative to a base URL. In that case we are using the base URL as a starting point and supplying additional information. For example, the base URL might point to a directory on a Web server; a relative URL might name a particular file in that directory.
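A minimal sketch of resolving relative URLs, assuming the two-argument URL constructor that takes a base URL and a specification string. The directory and file names are hypothetical, and the helper name resolve() is ours.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class RelativeUrls {

    // Resolves a relative specification against a base URL (hypothetical helper).
    static URL resolve(String base, String relative) throws MalformedURLException {
        return new URL(new URL(base), relative);
    }

    public static void main(String[] args) throws MalformedURLException {
        // Base names a directory; the relative URL names a file inside it.
        System.out.println(resolve("http://foo.bar.com/documents/", "homepage.html"));
        // prints: http://foo.bar.com/documents/homepage.html

        // Base names a file; the relative URL replaces the last path segment.
        System.out.println(resolve("http://foo.bar.com/documents/index.html", "homepage.html"));
        // prints: http://foo.bar.com/documents/homepage.html

        // ".." steps up one directory from the base.
        System.out.println(resolve("http://foo.bar.com/documents/", "../images/logo.gif"));
        // prints: http://foo.bar.com/images/logo.gif
    }
}
```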
A URL is represented by an instance of the
java.net.URL class. A URL
object manages all information in a URL string
and provides methods for retrieving the object it identifies. We can
construct a URL object from a
URL specification string or from its component
parts:
    try {
        URL aDoc = new URL( "http://foo.bar.com/documents/homepage.html" );
        // The file component needs its leading slash to name the same document.
        URL sameDoc = new URL( "http", "foo.bar.com", "/documents/homepage.html" );
    }
    catch ( MalformedURLException e ) { }
The two URL objects above point to the same network
resource, the homepage.html document on the
server foo.bar.com. Whether or not the resource
actually exists and is available isn't known until we try to
access it. At this point, the URL object just
contains data about the object's location and how to access
it. No connection to the server has been made. We can examine the
URL's components with the
getProtocol(), getHost(), and
getFile() methods. We can also compare it to
another URL with the sameFile()
method. sameFile() determines if two
URLs point to the same resource. It can be fooled,
but sameFile() does more than compare the
URL strings for equality; it takes into account the
possibility that one server may have several names, and other factors.
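The following sketch exercises sameFile() on a few hand-built URLs. The host "localhost" is used only so the comparison does not depend on looking up an outside hostname; the paths and fragments are invented, and the class name is ours.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class SameFileDemo {

    // Convenience wrapper around URL.sameFile() (hypothetical helper).
    static boolean same(String first, String second) throws MalformedURLException {
        return new URL(first).sameFile(new URL(second));
    }

    public static void main(String[] args) throws MalformedURLException {
        // Fragment identifiers are ignored: both name the same document.
        System.out.println(same("http://localhost/documents/homepage.html#top",
                                "http://localhost/documents/homepage.html#bottom")); // true

        // An explicit default port matches an omitted one.
        System.out.println(same("http://localhost:80/documents/homepage.html",
                                "http://localhost/documents/homepage.html"));        // true

        // A different path is a different resource.
        System.out.println(same("http://localhost/documents/homepage.html",
                                "http://localhost/documents/other.html"));           // false
    }
}
```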
When a URL is created, its specification is
parsed to identify the protocol component. If the protocol
doesn't make sense, or if Java can't find a protocol
handler for it, the URL constructor throws a
MalformedURLException. A protocol handler is a Java
class that implements the communications protocol for accessing the
URL resource. For example, given an
"http" URL, Java prepares to use the
HTTP protocol handler to retrieve documents from
the specified server.
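To see the constructor's check in action, here is a small sketch that tries to build a few URLs and reports which ones are rejected. The scheme "bogus" is a made-up protocol for which we assume no handler is installed; the helper name isConstructible() is ours.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ProtocolCheck {

    // Returns true if a URL object can be built from the specification,
    // false if the constructor throws MalformedURLException.
    static boolean isConstructible(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A protocol with a standard handler.
        System.out.println(isConstructible("http://foo.bar.com/index.html"));  // true

        // Assumed to have no handler available, so the constructor refuses it.
        System.out.println(isConstructible("bogus://foo.bar.com/index.html")); // false

        // No protocol component at all.
        System.out.println(isConstructible("foo.bar.com/index.html"));         // false
    }
}
```

Note that a successful construction says nothing about whether the host or document exists; only the specification and the availability of a handler are checked.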
The lowest-level way to get data back from a
URL is to ask for an
InputStream from the
URL by calling
openStream(). Currently, if you're writing
an applet that will be running under Netscape or Internet Explorer,
this is about your only choice. It's particularly useful if you want
to receive continuous updates from a dynamic information source. The
drawback is that you have to parse the contents of the object
yourself. Not all types of URLs support the
openStream() method; you'll get an
UnknownServiceException if yours doesn't.
The following code prints the contents of an HTML file:
    try {
        URL url = new URL("http://server/index.html");
        BufferedReader bin = new BufferedReader (
            new InputStreamReader( url.openStream() ));
        String line;
        while ( (line = bin.readLine()) != null )
            System.out.println( line );
    } catch (Exception e) { }
We ask for an InputStream with
openStream() and wrap it in a
BufferedReader to read the lines of text.
Because we specify the "http" protocol in the
URL, we still require the services of an
HTTP protocol handler. As we'll discuss later,
that raises some questions about what handlers we have
available. This example partially works around those
issues because no content handler is involved; we read the data and
interpret it as a content handler would. However, there are even more
limitations on what applets can do right now. For the time being, if
you construct URLs relative to the URL returned by the applet's
getCodeBase() method, you should be able to use them in
applets as in the above example. This should guarantee that the needed
protocol is available and accessible to the applet. (If you are
just trying to get data associated with an applet, there are better ways;
see the discussion of getResource() in Chapter 10.)
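For experimenting with openStream() without a server, one option is to point a "file" URL at a local document. The sketch below is a self-contained variant of the earlier example under that assumption: it writes a small temporary file, converts its path to a URL, and reads it back line by line. The file contents and the names StreamLines and readLines() are invented for the illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StreamLines {

    // Reads every line available from the URL's input stream.
    static List<String> readLines(URL url) throws IOException {
        List<String> lines = new ArrayList<>();
        // try-with-resources closes the stream even if reading fails.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a document on a server: a temporary local file.
        Path tmp = Files.createTempFile("index", ".html");
        try {
            Files.write(tmp,
                Arrays.asList("<html>", "<body>Hello</body>", "</html>"),
                StandardCharsets.UTF_8);

            URL url = tmp.toUri().toURL();   // a "file:" URL
            for (String line : readLines(url)) {
                System.out.println(line);
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The same readLines() helper should work unchanged with an "http" URL, since only the protocol handler behind openStream() differs.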
openStream() operates at a lower level than the
more general content-handling mechanism implemented by the
URL class. We showed it first because, until some
things are settled, you'll be limited as to when you can use
URLs in their more powerful role. When a proper
content handler is available to Java, you'll be able to retrieve the
object the URL addresses as a complete object, by
calling the URL's
getContent() method.
(Currently, this only works if you supply one with your application or
install one in the local classpath for HotJava, as we'll discuss later.)
getContent() initiates a connection to the
host, fetches the data for you, determines the MIME (Multipurpose Internet Mail Extensions) type
of the contents, and invokes a content handler to turn the bytes into a
Java object.
MIME is a standard that was developed to facilitate multimedia email, but it has become widely used as a general way to specify how to treat data; Java uses MIME to help it pick the right content handler.
For example, given the URL
http://foo.bar.com/index.html, a call to
getContent() would use the HTTP
protocol handler to retrieve data and an HTML
content handler to turn the data into an appropriate document object. A
URL that points to a plain-text file might use a
text-content handler that returns a String
object.
Similarly, a GIF file might be turned into an
Image object using a
GIF content handler. If we accessed the
GIF file using an "ftp"
URL, Java would use the same content handler but
would use the FTP protocol handler to receive the
data.
getContent() returns the output of the
content handler. Now we're faced with a problem: exactly what
did we get? Since the content handler has to be able to return almost
anything, the return type of getContent() is
Object.
In a moment we'll describe how we can ask the protocol handler for
the MIME type it discovered for the object.
Based on this, and whatever other knowledge we have about the kind of object
we are expecting, we can cast the Object to
its appropriate, more specific type.
For example, if we expect a
String, we'll cast the result of
getContent() to a
String:
    try {
        String content = (String)myURL.getContent();
    } catch ( ClassCastException e ) { ... }
If we're wrong about the type, we'll get a
ClassCastException.
As an alternative, we could check the type of the returned object using the
instanceof operator:
    if ( content instanceof String ) {
        String s = (String)content;
        ...
    }
Various kinds of errors can occur when trying to retrieve the
data. For example, getContent() can throw an
IOException if there is a communications error.
Other kinds of errors can occur at the application level:
some knowledge of how the application-specific content and protocol handlers
deal with errors is necessary.
One problem that could arise is that a content handler
for the data's MIME type wouldn't
be available. In this case, getContent() just
invokes an "unknown type" handler and
returns the data as a raw InputStream. A
sophisticated application might specialize this behavior to try to
decide what to do with the data on its own.
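One way an application might sort out what getContent() handed back is to test the result against the types it knows how to use, treating a raw InputStream as the "no handler" case. The sketch below is only an illustration of that idea: the describe() helper and its messages are ours, and which types actually come back depends on the content handlers installed in a given Java runtime.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

public class ContentKinds {

    // Classifies an object of the kind getContent() might return (hypothetical helper).
    static String describe(Object content) {
        if (content == null) {
            return "no content";
        }
        if (content instanceof String) {
            return "text, " + ((String) content).length() + " characters";
        }
        if (content instanceof InputStream) {
            // Typical fallback when no specific content handler applied.
            return "raw stream";
        }
        return "object of type " + content.getClass().getName();
    }

    public static void main(String[] args) throws IOException {
        // Stand-ins for handler results, so the example runs without a network.
        System.out.println(describe("hello"));                                // text, 5 characters
        System.out.println(describe(new ByteArrayInputStream(new byte[0]))); // raw stream

        // Optionally classify the content of a real URL given on the command line.
        if (args.length > 0) {
            System.out.println(describe(new URL(args[0]).getContent()));
        }
    }
}
```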
In some situations we may also need knowledge of the protocol handler.
For example, consider a URL that refers to a
nonexistent file on an HTTP server. When
requested, the server probably returns a valid HTML
document that contains the familiar "404 Not Found" message.
In a naive implementation, an
HTML content handler might be invoked to interpret this
message and return it as it would any other HTML document.
To check the validity of protocol-specific operations like this, we may need
to talk to the protocol handler.
The openStream() and
getContent() methods both implicitly create the
connection to the remote URL object.
When the connection is set up, the protocol handler is consulted to
create a URLConnection object. The
URLConnection manages the
protocol-specific communications.
We can get a URLConnection for our URL with the
openConnection() method.
One of the things we can do with the
URLConnection is ask for the object's
content type. For example:
    URLConnection connection = myURL.openConnection();
    String mimeType = connection.getContentType();
    ...
    Object contents = myURL.getContent();
We can also get protocol-specific information. Different protocols provide different types of
URLConnection
objects. The HttpURLConnection object,
for instance, can interpret the "404 Not Found" message
and tell us about the problem.
We'll examine URLConnections further when
we start writing protocol handlers.
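As a sketch of asking the connection about both the content type and a protocol-level failure, the example below starts a throwaway HTTP server on the local machine that answers every request with a "404 Not Found" page, then queries it through an HttpURLConnection. The server (built on the JDK's com.sun.net.httpserver package), the path /missing.html, and the names NotFoundCheck, probe(), and demo() are all assumptions made for the illustration, not part of the URL classes described here.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class NotFoundCheck {

    // Asks the protocol-specific connection for the status code and MIME type.
    static String probe(URL url) throws IOException {
        // NO_PROXY keeps the request on the local machine regardless of proxy settings.
        HttpURLConnection conn = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
        try {
            int status = conn.getResponseCode();      // e.g. 404
            String mimeType = conn.getContentType();  // e.g. text/html
            return status + " " + mimeType;
        } finally {
            conn.disconnect();
        }
    }

    // Runs probe() against a local server that always reports "not found".
    static String demo() throws IOException {
        HttpServer server = HttpServer.create(
            new InetSocketAddress(InetAddress.getByName("127.0.0.1"), 0), 0);
        server.createContext("/", exchange -> {
            byte[] body = "<html><body>404 Not Found</body></html>"
                .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/html");
            exchange.sendResponseHeaders(404, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        try {
            int port = server.getAddress().getPort();  // port chosen by the system
            return probe(new URL("http", "127.0.0.1", port, "/missing.html"));
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo()); // prints: 404 text/html
    }
}
```

The error page is a perfectly valid HTML document as far as its content type is concerned; only the response code on the HttpURLConnection reveals that the requested file was missing.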