4 The Anatomy of a URL
Web addresses are Uniform Resource Locators (URLs). By default, a URL is interpreted as a request for some resource (i.e., file) on the target server.
An example web address is shown below, followed by a template that labels its protocol, domain name, optional port number, path, and data portions:
https://www.google.com:443/search?q=php+and+javascript
protocol://domain_name:port/path?data
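To make the five portions concrete, here is a minimal sketch that splits the example address using Python's standard `urllib.parse` module. Python is used here only for illustration; the variable names are arbitrary.

```python
from urllib.parse import urlsplit

# The example address from the text.
url = "https://www.google.com:443/search?q=php+and+javascript"
parts = urlsplit(url)

print(parts.scheme)    # protocol:    https
print(parts.hostname)  # domain name: www.google.com
print(parts.port)      # port:        443
print(parts.path)      # path:        /search
print(parts.query)     # data:        q=php+and+javascript
```

Note that the library calls the protocol the "scheme" and the data the "query"; these are the more formal names for the same portions.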
The Protocol
The first part of the URL specifies the protocol. A protocol is a set of rules that govern communication between two computers. In the web address above, the protocol is HTTPS (Hypertext Transfer Protocol Secure). HTTPS specifies the rules for secure, encrypted transmission between the browser and the web server. Virtually all public web sites use this protocol now. Most browsers will warn you before loading a public web site that does not use it.
If you are using XAMPP to host your own web sites locally (see Get Your Own Web Server) you may have noticed the protocol there is HTTP. This is the older, unencrypted protocol that HTTPS builds on. It’s fine for developing on your own machine, but it’s not safe for anything else. Developers usually do not have to worry about this: if you host your website somewhere public, the service provider will typically make sure it is served over HTTPS.
There are other protocols as well. If you created the “hello world” program following the instructions in Get Your Own Web Server, you can load your index.html file through the server with an address something like this:
http://localhost/hello/index.html
But if you double-click on the file from within your file explorer, your browser will likely open it using the file scheme instead:
file:///C:/xampp/htdocs/hello/index.html
This indicates that the browser is reading the file directly from the hard drive, without going through a web server. Addresses like this are often described with the broader term Uniform Resource Identifier (URI): a URI identifies a resource, whereas a URL is a URI that also gives directions on how to retrieve it. The distinction between URLs and URIs doesn’t matter all that much for now, but the distinction between the http protocol and the file scheme in an address will become important in Programming on the Server Side.
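The difference between the two addresses above shows up as soon as they are split into parts. The sketch below, again using Python's standard `urllib.parse` purely for illustration, shows that the http address names a server to contact while the file address does not.

```python
from urllib.parse import urlsplit

# The two addresses for the same index.html file, as given in the text.
served = urlsplit("http://localhost/hello/index.html")
opened = urlsplit("file:///C:/xampp/htdocs/hello/index.html")

print(served.scheme, served.hostname, served.path)
print(opened.scheme, opened.hostname, opened.path)
```

For the http address the host is `localhost`, so the request goes to a web server. For the file address there is no host at all (`hostname` is `None`), and the path refers to a location on the local disk.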
The Domain Name
The second part of the URL above specifies a domain name. The domain name system is coordinated by a non-profit organization known as the Internet Corporation for Assigned Names and Numbers (ICANN), and individual names are registered through registrars that it accredits. Each domain name is mapped to an Internet Protocol (IP) address, which is a series of numbers (online lookup tools will show you the IP address for a domain name). The IP address is used to route the request to the appropriate server computer somewhere on the Internet. The mappings between domain names and IP addresses are stored on public Domain Name System (DNS) servers around the world, which browsers consult to figure out where to send their HTTP request messages.
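The lookup described above can also be done from a program. The following is a minimal sketch using Python's standard `socket` module; the function name `lookup_ip` is made up for this example, and the address returned for a real domain depends on your network and can change over time.

```python
import socket


def lookup_ip(domain_name: str) -> str:
    """Ask the operating system's resolver for one IPv4 address for a name."""
    # gethostbyname consults DNS when needed; if it is handed an IPv4
    # address instead of a name, it returns that address unchanged.
    return socket.gethostbyname(domain_name)
```

Calling `lookup_ip("www.google.com")` on a machine with Internet access should return one of the IPv4 addresses currently registered for that name.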
The (Optional) Port Number
The third part of the URL is the port number. It tells the operating system on the server machine which app it should deliver an incoming message to. Many port numbers are reserved for particular purposes. For example, port 80 is for HTTP messages and 443 is for HTTPS messages. Incoming messages using these ports will be sent to a web server app such as Apache, if there happens to be one running on the machine that receives the message. Similarly, messages on port 25 will be sent to an email app, 21 will be sent to an FTP app, and so on.
In theory, any application can “claim” any port number. If you are running XAMPP on your local machine (see the Get Your Own Web Server chapter) you can configure Apache to use any port you like, but then you have to include it in the URL. For most web addresses, there’s no need to specify a port number – the default for the protocol will be used.
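The default-port rule can be sketched in a few lines. In this illustration the function name `effective_port` and the `DEFAULT_PORTS` table are hypothetical, and port 8080 is just an example of a non-default port that Apache might be configured to use.

```python
from urllib.parse import urlsplit

# Default ports for the two web protocols discussed in the text.
DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(url):
    """Return the port a request to this URL would be sent to."""
    parts = urlsplit(url)
    if parts.port is not None:
        # The URL names a port explicitly, so that one is used.
        return parts.port
    # Otherwise fall back to the default for the protocol, if known.
    return DEFAULT_PORTS.get(parts.scheme)


print(effective_port("https://www.google.com/search"))           # 443
print(effective_port("http://localhost/hello/index.html"))       # 80
print(effective_port("http://localhost:8080/hello/index.html"))  # 8080
```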
The Path
The fourth part of a URL is the path to the requested resource. If the computer receiving the message is running an Apache server, it will go to the top level folder where public HTML documents are stored (htdocs on Windows, public_html on Linux, and so on) and then apply the path from there. So by default, Apache would interpret search in the URL above as a subfolder of the htdocs or public_html folder. Then it would look for an index file there (index.html, index.php, etc.) and load or run that file to create the response.
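The mapping from a URL path to a location on disk can be sketched as follows. This is a simplified illustration of the default behavior described above, not Apache's actual implementation: the function names and the `INDEX_NAMES` list are assumptions, the real list of index files is set in the server's configuration, and a real server also guards against paths that try to escape the top level folder.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Assumed index file names, tried in this order.
INDEX_NAMES = ["index.html", "index.php"]


def map_to_disk(doc_root, url):
    """Apply the URL's path to the top level document folder."""
    relative = urlsplit(url).path.lstrip("/")
    return PurePosixPath(doc_root) / relative


def index_candidates(folder):
    """List the index files a server might look for inside a folder."""
    return [folder / name for name in INDEX_NAMES]


folder = map_to_disk(
    "C:/xampp/htdocs",
    "https://www.google.com:443/search?q=php+and+javascript",
)
print(folder)  # C:/xampp/htdocs/search
for candidate in index_candidates(folder):
    print(candidate)
```

The data after the question mark plays no part in this mapping; only the path is used to find the folder.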
The (Optional) Data
The final part of the URL is a set of data that goes along with the resource request. In the example URL above, the data includes text that was typed into a Google search bar. The user typed “php and javascript”, and the data indicates that the request is sending a parameter named “q” (which presumably stands for “query”) that holds the user’s text, with each space encoded as a plus sign. This part of the URL is separated from the routing information by a question mark, and is ignored when routing the request.
If Apache sees data attached to a request, it will make that data available to the requested resource. If the requested resource is a PHP program, that program can access the data if it needs to. If the resource is a flat file (HTML, CSS, images, etc.), the data will be ignored. We will have a lot more to say about the data attached to requests when we get to the Programming on the Server Side section of the book.
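As a rough illustration of what a server-side program sees, the sketch below decodes the data portion of the example URL with Python's standard `urllib.parse`. PHP has its own mechanism for this, which is not shown here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = "https://www.google.com:443/search?q=php+and+javascript"

# Pull out the data portion and decode it into named parameters.
query = urlsplit(url).query
params = parse_qs(query)
print(params)  # {'q': ['php and javascript']}

# Going the other way: encode a parameter for use in a URL.
encoded = urlencode({"q": "php and javascript"})
print(encoded)  # q=php+and+javascript
```

Each parameter maps to a list because the same name may appear more than once in the data.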