Vive la HTML

30 Jan 2005

HTML - otherwise known as HyperText Markup Language - is the simplest, the most powerful, the most accessible, and the most convertible electronic document format on the planet. Invented in 1991 by Tim Berners-Lee, who is considered to be the father of the World Wide Web, HTML is the language used to write virtually every web page in existence today.

HTML, however, is not what you see when you open a web page in your browser (and I hope that you're using one of the many good browsers out there, rather than the one bad browser). When you open a web page, the HTML is transformed into a (hopefully) beautiful layout of fonts, images, colours, and all the other elements that make up a visually pleasing document. However, try viewing the source code of a web page (this one, for example). You can usually do this by going to the 'view' pull-down menu in your browser, and selecting 'source' or 'page source'.

What you'll see is a not-so-beautiful plain text document. You may notice that many funny words in this document are enclosed in little things called chevrons (greater-than signs and less-than signs), like so:

<p><strong>Greetings</strong>, dear reader!</p>

The words in chevrons are called tags. In HTML, to make anything look remotely fancy, you need to use tags. In the example above, the word "greetings" is surrounded by a 'strong' tag, to make it appear bold. The whole sentence is enclosed in a 'p' tag, to indicate that those words form a single paragraph. The result of this HTML, when transformed using a web browser, is:

Greetings, dear reader!

So now you all know what HTML is (in a nutshell - a very small nutshell). It is a type of document that you create in plain text format. This is different to other formats, such as Microsoft Word (where you need a special program, i.e. Word, to produce a document, because the document is not stored as plain text). You can use any text editor - even one as simple as Windows Notepad - to write an HTML document. HTML uses special elements, called tags, to describe the structure and (in part) the styling of a document. When you open an HTML document using a web browser, the plain text is transformed into what is commonly known as a 'web page'.

Now, what would be your reaction if I said that everyone, from this point onwards, should write (almost) all of their documents in raw HTML? What would you say if I told you to ditch Word, where you can make text bold, italic, or underlined by pushing a button, and instead to write documents like this? Would you think I'm nuts? Probably. Obsessive plain-text geeky purist weirdo? I don't deny it. If you've lost hope already, feel free to leave. Or if you think perhaps - just perhaps - there could be a light at the end of this tunnel, then keep reading.

<matrixramble>

Morpheus: You take the blue pill, the story ends, you wake up in your bed and believe whatever you want to believe. [Or] you take the red pill, you stay in Wonderland, and I show you how deep the rabbit hole goes.

Remember, all I'm offering is the truth, nothing more.

The Matrix (1999)

You may also leave now if you've never heard that quote before, or if you've never quoted it yourself. Or if you don't believe that this world has, indeed, "been pulled over your eyes to blind you from the truth... ([and] that you are a slave)".

</matrixramble>

Just kidding. Please kindly ignore the Matrix ramble above.

Anyway, back to the topic. At the beginning of this article, I briefly mentioned a few of the key strengths of HTML. I will now go back to these in greater detail, as a means of convincing you that HTML is the most appropriate format in which to write (almost) all electronic documents.

It's simple

As far as text-based, no-nonsense computer languages go, HTML is really simple. If you want to write plain text, just write it. If you want to do something fancier (e.g. make the plain text look nice, embed an image, structure the text as a table), then you use tags. All tags have a start (e.g. <strong>), and a finish (e.g. </strong>) - although some tags have their start and their finish together (e.g. <br />). There are roughly a hundred tags, but you don't need to memorise them - you can just look them up, or use special editors to insert them for you. Most tags are self-explanatory.
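
As a rough illustration of the tag shapes described above, here is a small hand-written sketch. The file name photo.jpg and all of the text content are made up for the example; only the tags themselves are standard HTML.

```html
<!-- A paired tag: a start, some content, and a finish -->
<p>This paragraph contains <em>emphasised</em> text.</p>

<!-- A tag with its start and finish together (written XHTML-style) -->
<p>First line<br />second line</p>

<!-- Embedding an image; photo.jpg is a placeholder file name -->
<p><img src="photo.jpg" alt="A placeholder photo" /></p>

<!-- Structuring text as a table -->
<table>
  <tr>
    <th>Tag</th>
    <th>Purpose</th>
  </tr>
  <tr>
    <td>p</td>
    <td>Paragraph</td>
  </tr>
  <tr>
    <td>strong</td>
    <td>Strong emphasis (usually bold)</td>
  </tr>
</table>
```

Paste this into any text editor, save it with a .html extension, and open it in a browser to see how each tag is rendered.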

HTML is not only simple to write, it is also simple to read. You'd be surprised how easy it is to read and to edit an HTML document in its raw text form, if you just know the incredibly simple format of a tag (which I've already told you). And unlike with non-text-based formats, such as Word and PDF, anyone can edit the HTML that you write, and vice versa. Have you ever been unable to open a Word document, because you're running the wrong version of Microsoft Office? How about not being able to open a PDF document, because your copy of Adobe Acrobat is out of date? Problems such as these simply do not happen with HTML: it's plain text, you can open it with the oldest and most basic programs in existence!

As far as simplicity goes, there are no hidden catches. HTML is not a programming language (something that can only be used by short guys with big glasses in dark smelly rooms). It is a markup language. It requires no maths (luckily for me), no logic or problem-solving skills, and very little general technical knowledge. All you need to know is a few tags, and where to write them amidst the plain text of your document, and you're set to go!

It's powerful

The Golden Rule of Geekdom is to never, ever, underestimate the power of plain text. Anyone who considers themself to 'be in computers', or 'in IT', will tell you that:

Text-based things are almost always more powerful than their graphical equivalents, in the world of computers (e.g. Unix vs Windows/Mac);
All graphical tools - whether they be for document writing, or for anything else - are generating code beneath their shiny exterior, and this code is usually stored as text (characters); and
Generated code is seldom to be trusted - (wholly or partly) hand-written code is (if the hand-writer knows what he/she is doing) cleaner, more efficient, and more reliable.

HTML is no exception to these rules. It is as powerful as other document formats in most ways (although not in all ways, even I admit). It is far cleaner and more efficient than most other formats with similar capabilities (e.g. Rich Text Format - try reading that in plain text!). And best of all, it leaves no room for fear or paranoia that the underlying code of your document is wretched, because you can read that code yourself!

If you're worried that HTML is not powerful enough to meet your needs, go visit a web page. Any web page will do: you're looking at one now, but as any astronomer can tell you, there are plenty of stars in the sky to choose from. Look at the text formatting, the page layout, the use of images, the input forms, and everything else that makes up a modern piece of the Internet. Not bad, huh?

Now look at the source code for that web page. That's right: the whole thing was written with HTML.

Note that many sites embed other technologies, such as Flash, JavaScript, and Java applets within their HTML - but the backbone of the page is almost always HTML. Also note that almost all modern web sites use HTML in conjunction with CSS - that's Cascading Style Sheets, a topic beyond the scope of this article - to produce meticulously crafted designs by controlling how each tag renders itself. When HTML, CSS, and JavaScript are combined, they form a technology known as DHTML (Dynamic HTML), the power of which is far beyond anything possible in formats such as Word and PDF.
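
To give a rough idea of how CSS controls the way a tag renders, here is a minimal sketch. The fonts, colours, and page title are arbitrary choices for illustration, not taken from any real site.

```html
<html>

<head>
<title>Styled greeting</title>
<style type="text/css">
  /* Every p tag on the page picks up these rules */
  p {
    font-family: Georgia, serif;
    color: #333333;
  }
  /* strong tags are shown in dark red, on top of the browser's usual bold */
  strong {
    color: #990000;
  }
</style>
</head>

<body>
<p><strong>Greetings</strong>, dear reader!</p>
</body>

</html>
```

The HTML in the body is untouched by the styling: remove the style block and the same paragraph still displays, just with the browser's default look.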

It's accessible

The transition from paper-based to online documents is one of the biggest, and potentially most beneficial, changes in what has been dubbed the 'information revolution'. Multiple copies of documents can now be made electronically, saving millions of sheets of paper every year. Backup is as easy as pushing a button, to copy a document from one electronic storage device to another. Information can now be put online, and read by millions of people around the world in literally a matter of seconds. But unless we make this transition the right way, we will reap only a fraction of the benefits that we could.

Electronic documents are potentially the most accessible pieces of information the world has ever seen. When designed and written properly, not only can they be distributed globally in a matter of seconds, they can also be viewed by anyone, using any device, in any form, and in any language. Unfortunately, just because a document is in electronic form, that alone does not guarantee this Utopian level of accessibility. In fact, as with anything, perfection can never be a given. But by providing a solid foundation with which to write accessible documents, this goal becomes much more plausible. And the best foundation for accessible electronic documents is an accessible electronic document format. Enter HTML.

HTML was designed from the ground up as an accessible language. By its very definition - as the language used to construct the World Wide Web - it is essential that the exact same HTML document is able to be viewed easily by different people from all around the world, using different hardware and software, and sometimes with radically different presentation requirements.

The list below describes some of the key issues concerning accessibility, as well as how HTML caters for these issues, compared with its two main rivals, Word and PDF.

Screen size: Even amongst regular PC users - the group that forms the large majority in the online world - everyone has a different screen size. An accessible document format needs to be able to display equally well on an 11" little squeak as on a 23" big boy. This issue has become even more crucial in recent years, with the advent of Internet-enabled pocket devices such as PDAs and mobile (cell) phones. HTML handles this fine, as it is able to display a document with variable margins, user-defined font sizes, and so on. Word handles this reasonably well, largely due to its different 'views' such as a 'layout view' and 'online view', many of which are suited to variable screen sizes. PDF fails miserably in this regard, because it is designed specifically to display a document as it appears in printed form, and is totally incapable of handling variable margins.
Cross-platform compatibility: Electronic documents should be usable on any operating system, as well as on any hardware. HTML can be viewed on absolutely any desktop system (e.g. Windows, Mac, Linux, Solaris, FreeBSD, OS/2), and also on PDAs, mobile phones, WebTVs, Internet fridges... you name it, HTML works on it. PDF - as its name suggests, with the 'P' for 'Portable' - is also pretty good in this regard, since it's able to run on all major systems, although it isn't as easy to view on smaller devices as HTML is. Word is compatible only with Windows and Mac PCs - which is so lame it's not even worth commenting on, really.
Language barriers: This is one of the more frustrating aspects of accessibility, as it is one where technology has not yet caught up to the demands of the globalised world. Automated translation is far from perfect at the moment; however, it is there, and it does work (but could work better). HTML documents are translated on-the-fly every day, through services such as AltaVista's Babel Fish, and Google's Language Tools. This is made easier through HTML's ability to store the language of a document in a special metadata format. PDF and Word documents are not as readily translatable, although plug-ins are available to give them translating capabilities.
Users with poor eyesight: Many computer users require all documents to be displayed to them with large font sizes, due to their poor vision. HTML achieves this through a mechanism known as relative font sizes. This means that instead of specifying the exact size of text, authors of HTML documents are instead able to use 'fuzzy' terms, such as 'medium' and 'large' (as well as other units of size, e.g. ems). Users can then set the base size themselves: if their vision is impaired, they can increase the base size, and all text on the page will be enlarged accordingly. Word and PDF do not offer this, although they make up for it somewhat with their zoom functionality. However, this can be cumbersome to use, and often results in horizontal scrolling.
Blind users: This is where HTML really shines above all other document formats. HTML is the only format that, when used properly, can actually be narrated to blind users through a screen reader, with little or no loss in meaning compared with users who are absorbing the document visually. Because HTML has so much structure and semantic information, screen readers are able to determine what parts of the document to emphasise, they can process complex tabular information in auditory form, and they can interpret metadata such as page descriptions and keywords. And if the page uses proper navigation by way of hyperlinks, blind users can even skip straight to the content they want, by performing the equivalent of a regular user's 'click' action. Word and PDF lag far behind in this regard: screen readers can do little more than read out their text in a flat monotone.
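
Several of the points in the list above can be seen in one small page. The following is only a sketch under my own assumptions: the opening-hours content, the id name 'content', and the particular sizes are all invented for illustration. It shows a declared document language, relative font sizes, a skip link, and a table whose header cells are marked up for screen readers.

```html
<!-- lang declares the document's language for translators and screen readers -->
<html lang="en">

<head>
<title>Opening hours</title>
<!-- Page description metadata, available to search engines and other tools -->
<meta name="description" content="Example page listing opening hours">
<style type="text/css">
  /* Relative sizes: scaled from whatever base size the reader has set */
  body { font-size: medium; }
  h1   { font-size: 1.5em; }
</style>
</head>

<body>
<!-- Lets screen reader and keyboard users jump straight to the content -->
<p><a href="#content">Skip to content</a></p>

<h1 id="content">Opening hours</h1>

<!-- th cells with scope tell a screen reader which header applies to each cell -->
<table>
  <caption>Opening hours by day</caption>
  <tr>
    <th scope="col">Day</th>
    <th scope="col">Hours</th>
  </tr>
  <tr>
    <td>Monday</td>
    <td>9am to 5pm</td>
  </tr>
</table>
</body>

</html>
```

None of this markup changes how the page looks very much in a regular browser; it mostly adds information that other software can make use of.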

It's convertible

Just like a Porsche Boxster... only not quite so sexy. This final advantage of HTML is one that I've found particularly useful, and is - in my opinion - the fundamental reason why all documents should be written in HTML first.

HTML documents can be converted to Word, PDF, RTF, and many other formats, really easily. You can open an HTML document directly in Word, and then just 'Save As...' it in whatever format takes your fancy. The reverse, however, is not nearly so simple. If you were to type up a document in Word, and then use Word's 'Save as HTML' function to convert it to a web page, you would be greeted with an ugly sight indeed. Well, perhaps not such an ugly sight if viewed in a modern web browser; but if you look at the source code that Word generates, you might want to have a brown paper bag (or a toilet) very close by. Word generates revolting HTML code. Remember what I said about never trusting generated code?

Have a look at the following example. The two sets of HTML code below will both display the text "Hello, world!" when viewed in a web browser. Here is the version generated by Word:

<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 9">
<meta name=Originator content="Microsoft Word 9">
<title>Hello, world</title>
<xml>
 <o:DocumentProperties>
  <o:Author>Jeremy Epstein</o:Author>
  <o:LastAuthor>Jeremy Epstein</o:LastAuthor>
  <o:Revision>1</o:Revision>
  <o:TotalTime>1</o:TotalTime>
  <o:Created>2005-01-31T01:33:00Z</o:Created>
  <o:LastSaved>2005-01-31T01:34:00Z</o:LastSaved>
  <o:Pages>1</o:Pages>
  <o:Company>GreenAsh Services</o:Company>
  <o:Lines>1</o:Lines>
  <o:Paragraphs>1</o:Paragraphs>
  <o:Version>9.2720</o:Version>
 </o:DocumentProperties>
</xml>
<style>
 /* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
	{mso-style-parent:"";
	margin:0cm;
	margin-bottom:.0001pt;
	mso-pagination:widow-orphan;
	font-size:12.0pt;
	font-family:"Times New Roman";
	mso-fareast-font-family:"Times New Roman";}
@page Section1
	{size:595.3pt 841.9pt;
	margin:72.0pt 90.0pt 72.0pt 90.0pt;
	mso-header-margin:35.4pt;
	mso-footer-margin:35.4pt;
	mso-paper-source:0;}
div.Section1
	{page:Section1;}
</style>
</head>

<body lang=EN-AU style='tab-interval:36.0pt'>

<div class=Section1>

<p class=MsoNormal>Hello, world!</p>

</div>

</body>

</html>

And here is the hand-written HTML version:

<html>

<head>
<title>Hello, world</title>
</head>

<body>
<p>Hello, world!</p>
</body>

</html>

Slight difference, don't you think?

This is really important. The ability to convert a document from one format to another, cleanly and efficiently, is something that everyone needs to be able to do, in this modern day and age. This is relevant to everyone, not just to web designers and computer professionals. Sooner or later, your boss is going to ask you to put your research paper online, so he/she can tell his/her friends where to go if they want to read it. And chances are, that document won't have been written originally in HTML. So what are you going to do? Will you convert it using Word, and put up with the nauseating filth that it outputs? Will you just convert it to PDF, and whack that poorly accessible file on the net? Why not just save yourself the hassle, and write it in HTML first? That way, you can convert it to any other format at the click of a button (cleanly and efficiently), and when the time comes to put it online - and let me tell you, it will come - you'll be set to go.

Useful links (to get you started)

Notepad++: A great (free) text editor for those using Micro$oft Windows. Fully supports HTML (as well as many other markup and programming languages), with highlighting of tags, automatic indenting, and expandable/collapsible hierarchies. Even if you never plan to write a line of HTML in your life, this is a great program to have lying around anyway.
HTML-kit: Another great (and free) plain-text HTML editor for Windows, although this one is much more complex and powerful than Notepad++. HTML-kit aims to be more than just a text editor (although it does a great job of that too): it is a full-featured development environment for building a respectable web site. It has heaps of cool features, such as markup validation, automatic code tidying, and seamless file system navigation. Oh yeah, and speaking of validation...
W3C Markup Validator: The World Wide Web Consortium - or W3C - are the folks that actually define what HTML is, as in the official standard. Remember I mentioned Tim Berners-Lee, the bloke who invented HTML? Yeah, well he's the head of the W3C. Anyway, they provide an invaluable tool here that checks your HTML code, and validates it to make sure that it's... well, valid! A must for web developers, and for anyone who's serious about learning and using HTML.
Webmonkey: The place to go if you want to start learning HTML and other cool skills. Full of tutorials, code snippets, cheat sheets, and heaps more.
Sizzling HTML Jalfrezi: This site is devoted to teaching you HTML, and lots of it. The tutorials are pretty good, but I find the A-Z tag reference the most useful thing about this site. It's run by Richard Rutter, one of the world's top guns in web design and all that stuff.
