12. Working with URLs

Contents:
The URL Class
Web Browsers and Handlers
Writing a Content Handler
Writing a Protocol Handler
Talking to CGI Programs

A URL points to an object on the Internet. It's a collection of information that identifies an item, tells you where to find it, and specifies a method for communicating with it or retrieving it from its source. A URL refers to any kind of information source. It might point to static data, such as a file on a local filesystem, a Web server, or an FTP archive; or it can point to a more dynamic object such as a news article on a news spool or a record in a WAIS database. URLs can even refer to less tangible resources such as Telnet sessions and mailing addresses.

The Java URL classes provide an API for accessing well-defined networked resources, like documents and applications on servers. The classes use an extensible set of prefabricated protocol and content handlers to perform the necessary communication and data conversion for accessing URL resources. With URLs, an application can fetch a complete file or database record from a server on the network with just a few lines of code. Applications like Web browsers, which deal with networked content, use the URL class to simplify the task of network programming. They also take advantage of the dynamic nature of Java, which allows handlers for new types of URLs to be added on the fly. As new types of servers and new formats for content evolve, additional URL handlers can be supplied to retrieve and interpret the data without modifying the original application.

A URL is usually presented as a string of text, like an address.[1] Since there are many different ways to locate an item on the Net, and different mediums and transports require different kinds of information, there are different formats for different kinds of URLs. The most common form specifies three things: a network host or server, the name of the item and its location on that host, and a protocol by which the host should communicate:

[1] The term URL was coined by the Uniform Resource Identifier (URI) working group of the IETF to distinguish URLs from the more general notion of Uniform Resource Names or URNs. URLs are really just static addresses, whereas URNs would be more persistent and abstract identifiers used to resolve the location of an object anywhere on the Net. URLs are defined in RFC 1738 and RFC 1808.

protocol://hostname/location/item

protocol is an identifier such as "http," "ftp," or "gopher"; hostname is an Internet hostname; and the location and item components form a path that identifies the object on that host. Variants of this form allow extra information to be packed into the URL, specifying things like port numbers for the communications protocol and fragment identifiers that reference parts inside the object.

We sometimes speak of a URL that is relative to a base URL. In that case we are using the base URL as a starting point and supplying additional information. For example, the base URL might point to a directory on a Web server; a relative URL might name a particular file in that directory.

12.1 The URL Class

A URL is represented by an instance of the java.net.URL class. A URL object manages all information in a URL string and provides methods for retrieving the object it identifies. We can construct a URL object from a URL specification string or from its component parts:

try { 
    URL aDoc = new URL( "http://foo.bar.com/documents/homepage.html" ); 
    URL sameDoc = new URL("http","foo.bar.com","documents/homepage.html"); 
}  
catch ( MalformedURLException e ) { }

The two URL objects above point to the same network resource, the homepage.html document on the server foo.bar.com. Whether or not the resource actually exists and is available isn't known until we try to access it. At this point, the URL object just contains data about the object's location and how to access it. No connection to the server has been made. We can examine the URL's components with the getProtocol(), getHost(), and getFile() methods. We can also compare it to another URL with the sameFile() method. sameFile() determines if two URLs point to the same resource. It can be fooled, but sameFile does more than compare the URLs for equality; it takes into account the possibility that one server may have several names, and other factors.

When a URL is created, its specification is parsed to identify the protocol component. If the protocol doesn't make sense, or if Java can't find a protocol handler for it, the URL constructor throws a MalformedURLException. A protocol handler is a Java class that implements the communications protocol for accessing the URL resource. For example, given an "http" URL, Java prepares to use the HTTP protocol handler to retrieve documents from the specified server.

12.1.1 Stream Data

The lowest level way to get data back from URL is to ask for an InputStream from the URL by calling openStream(). Currently, if you're writing an applet that will be running under Netscape or Internet Explorer, this is about your only choice. It's particularly useful if you want to receive continuous updates from a dynamic information source. The drawback is that you have to parse the contents of the object yourself. Not all types of URLs support the openStream() method; you'll get an UnknownServiceException if yours doesn't.

The following code prints the contents of an HTML file:

try { 
    URL url = new URL("http://server/index.html"); 

    BufferedReader bin = new BufferedReader (
        new InputStreamReader( url.openStream() ));

    String line;
    while ( (line = bin.readLine()) != null )  
        System.out.println( line );
} catch (Exception e) { }

We ask for an InputStream with openStream() and wrap it in a BufferedReader to read the lines of text. Because we specify the "http" protocol in the URL, we still require the services of an HTTP protocol handler. As we'll discuss later, that raises some questions about what handlers we have available. This example partially works around those issues because no content handler is involved; we read the data and interpret it as a content handler would. However, there are even more limitations on what applets can do right now. For the time being, if you construct URLs relative to the applet's codeBase(), you should be able to use them in applets as in the above example. This should guarantee that the needed protocol is available and accessible to the applet. (If you are just trying to get data associated with an applet, there are better ways; see the discussion of getResource() in Chapter 10.)

12.1.2 Getting the Content as an Object

openStream() operates at a lower level than the more general content-handling mechanism implemented by the URL class. We showed it first because, until some things are settled, you'll be limited as to when you can use URLs in their more powerful role. When a proper content handler is available to Java, you'll be able to retrieve the object the URL addresses as a complete object, by calling the URL's getContent() method. (Currently, this only works if you supply one with your application or install one in the local classpath for HotJava, as we'll discuss later.) getContent() initiates a connection to the host, fetches the data for you, determines the MIME (Multipurpose Internet Mail Extensions) type of the contents, and invokes a content handler to turn the bytes into a Java object. MIME is a standard that was developed to facilitate multimedia email, but it has become widely used as a general way to specify how to treat data; Java uses MIME to help it pick the right content handler.

For example, given the URL http://foo.bar.com/index.html, a call to getContent() would use the HTTP protocol handler to retrieve data and an HTML content handler to turn the data into an appropriate document object. A URL that points to a plain-text file might use a text-content handler that returns a String object. Similarly, a GIF file might be turned into an Image object using a GIF content handler. If we accessed the GIF file using an "ftp" URL, Java would use the same content handler but would use the FTP protocol handler to receive the data.

getContent() returns the output of the content handler. Now we're faced with a problem: exactly what did we get? Since the content handler has to be able to return almost anything, the return type of getContent() is Object. In a moment we'll describe how we could ask the protocol handler about the object's MIME type, which it discovered. Based on this, and whatever other knowledge we have about the kind of object we are expecting, we can cast the Object to its appropriate, more specific type. For example, if we expect a String, we'll cast the result of getContent() to a String:

try  {
    String content = (String)myURL.getContent(); 
} catch ( ClassCastException e ) { ... }

If we're wrong about the type, we'll get a ClassCastException. As an alternative, we could check the type of the returned object using the instanceof operator:

if ( content instanceof String ) { 
    String s = (String)content; 
    ...

Various kinds of errors can occur when trying to retrieve the data. For example, getContent() can throw an IOException if there is a communications error. Other kinds of errors can occur at the application level: some knowledge of how the application-specific content and protocol handlers deal with errors is necessary.

One problem that could arise is that a content handler for the data's MIME type wouldn't be available. In this case, getContent() just invokes an "unknown type" handler and returns the data as a raw InputStream. A sophisticated application might specialize this behavior to try to decide what to do with the data on its own.

In some situations we may also need knowledge of the protocol handler. For example, consider a URL that refers to a nonexistent file on an HTTP server. When requested, the server probably returns a valid HTML document that contains the familiar "404 Not Found" message. In a naive implementation, an HTML content handler might be invoked to interpret this message and return it as it would any other HTML document. To check the validity of protocol-specific operations like this, we may need to talk to the protocol handler.

The openStream() and getContent() methods both implicitly create the connection to the remote URL object. When the connection is set up, the protocol handler is consulted to create a URLConnection object. The URLConnection manages the protocol-specific communications. We can get a URLConnection for our URL with the openConnection() method. One of the things we can do with the URLConnection is ask for the object's content type. For example:

	URLConnection connection = myURL.openConnection();
	String mimeType = connection.getContentType();
	...
	Object contents = myURL.getContents();

We can also get protocol-specific information. Different protocols provide different types of URLConnection objects. The HttpURLConnection object, for instance, can interpret the "404 Not Found" message and tell us about the problem. We'll examine URLConnections further when we start writing protocol handlers.


11.4 Remote Method Invocation		12.2 Web Browsers and Handlers