http://invisible-island.net/scripts/
Copyright © 2015-2022,2024 by Thomas E. Dickey


Synopsis

Here is an improved version of Earl Hood's man2html Perl script,
along with a discussion of why I came to make those improvements.

Background

Initially (in the mid-1990s) all of the documentation for the programs I work on was either plain-text READMEs or nroff (manual pages). HTML came later.

ncurses and info

While ncurses has had some HTML documentation since 1.9.2d in mid-1995, this started with just a few which were intended to be part of a website:

At that point in time, there was in fact no website. But just in case.
The hacker's guide followed at the end of 1995. Still no website.

As these files were added, the dist.mk makefile was updated to provide plain-text versions using lynx dumps.
I made improvements to the makefile, but largely ignored the HTML files, for a while.

Juergen Pfeifer started things going, by providing HTML-ized versions of the Ada95 binding starting in October 1996, and adding the adahtml rule in dist.mk to generate the files. He used the gnathtml script (which appears to be orphaned — I have maintained a copy for ncurses). Also, in Juergen's initial changes, he added HTML-ized manual pages under the Ada95/html directory.

Later, in March 2000, Juergen Pfeifer added the manhtml rule to dist.mk to generate HTML-ized manual pages using man2html.
At that time, he moved the HTML manual pages to their present location in doc/html/man.

Because they were not really part of the Ada95 binding (which Juergen developed and maintained for several years),
I took more of an interest in being able to rebuild those files. I started making improvements.

Both gnathtml and man2html are Perl scripts which process the output from other programs (such as gnat and nroff) to provide HTML. Besides regular bug-fixes, I made improvements (in dist.mk) so that the generated pages would validate, e.g., on w3.org's test pages.
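
To make "process the output from nroff" concrete: nroff's traditional terminal output marks bold text by overstriking a character with itself (X, backspace, X) and underlined text by overstriking an underscore (_, backspace, X). The following is a minimal sketch of turning that into HTML. It is written in Python rather than Perl, it is not taken from man2html, and the function name overstrike_to_html is purely illustrative.

```python
import html
import itertools


def overstrike_to_html(line: str) -> str:
    """Convert nroff-style backspace overstrikes to <B>/<I> markup."""
    cells = []  # list of (style, character); style is "B", "I" or None
    i = 0
    while i < len(line):
        ch = line[i]
        if i + 2 < len(line) and line[i + 1] == "\b":
            shown = line[i + 2]
            # "_\bX" is underlining; anything else ("X\bX") is treated as bold.
            style = "I" if ch == "_" and shown != "_" else "B"
            cells.append((style, shown))
            i += 3
            # Skip repeated overstrikes of the same cell ("X\bX\bX").
            while i + 1 < len(line) and line[i] == "\b":
                i += 2
        else:
            cells.append((None, ch))
            i += 1

    # Merge neighbouring cells with the same style into one tagged run.
    parts = []
    for style, run in itertools.groupby(cells, key=lambda cell: cell[0]):
        text = html.escape("".join(c for _, c in run))
        parts.append(f"<{style}>{text}</{style}>" if style else text)
    return "".join(parts)
```

For example, passing the overstruck heading "N\bNA\bAM\bME\bE" yields a single bold run, and "see _\bl_\bs(1)" yields italic markup around "ls" only. The uppercase tags simply mirror the style of the generated pages discussed later; that choice is cosmetic.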

Throughout this discussion, the reader may notice that I do these conversions to HTML using automatic tools. I spend a lot of time making them work, and do not manually adjust the result.

As an alternate view, a few people would like to rely upon GNU info format and not use manpage format. In response to a request (at the end of May 2000) to convert to Texinfo, I pointed out that

The lack of tools is two-fold:

A one-way manual conversion (based on comments made in a similar request the previous month) would take a few weeks of full-time work. My response to this led to a followup from some people who had volunteered to do the work. I saw no result from that.

The justification given for the request to convert all of the documentation to Texinfo was based on its ease of conversion to other formats (which is dubious, given the lack of tools for converting to/from manual pages). Texinfo of course is not the only possible solution to the problem of producing multiple formats. Groff can do that.

There also is SGML. During the next few years (beginning on the mailing list in 2002, as well as in private email), Pradeep Padala (who wrote an ncurses HOWTO) discussed converting ncurses' documentation to SGML. Ultimately that ran into problems with tools. Given a good DTD, xmlto could produce something, but getting a workable toolset can rapidly become an end in itself, requiring continual maintenance to work around tool breakage. In writing this, for example, I came across Greg Woods' comment on a NetBSD mailing list in 2007 entitled why XML? (was: Proposition for Releases page changes) which you may find amusing.

On the other hand, the HOWTO is useful and was already converted. After some discussion on licensing, we added it (as HTML) to ncurses in mid-2005. Later, I noticed that Pradeep was no longer actively maintaining the HOWTO or the accompanying programming examples, and collected those for the ncurses FAQ.

XFree86 and rman

XFree86 used a different tool for generating HTML documentation: rman. Formerly known as "RosettaMan" (a name which became problematic due to infringement alleged by Rosetta, Inc.) and later renamed to "PolyglotMan", the tool reads nroff files directly and attempts to construct formatted HTML manual pages.

XTerm's manual page was the longest manual page in XFree86. It also had some problems which I (mostly) fixed by improvements to rman (see CVS):

I sent those changes to Tom Phelps at the end of 2003, but beyond email acknowledgement, saw no further updates from him. Here is part of the conversation.

From dickey@his.com Wed Dec 17 20:58:41 2003 -0500
Date: Wed, 17 Dec 2003 20:58:41 -0500 (EST)
From: Thomas Dickey <dickey@his.com>
To: Tom Phelps <phelps@eecs.berkeley.edu>
Subject: Re: rman
In-Reply-To: <F42DDF86-30FB-11D8-9B95-000A959C28F4@cs.berkeley.edu>
Message-ID: <Pine.BSI.4.53.0312172055440.17997@mail.his.com>
References: <F42DDF86-30FB-11D8-9B95-000A959C28F4@cs.berkeley.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: RO
Content-Length: 381
Lines: 15

On Wed, 17 Dec 2003, Tom Phelps wrote:

> >  I think it'd be nice to cleanup the 3.2 (indent it,
> > fix the compiler warnings, improve the HTML validation), add options
> > for
> > some of the features that are needed for XFree86.
>
> Sounds good.  I'm always happy to receive patches.  Good luck!

ok.

-- 
Thomas E. Dickey
http://invisible-island.net
ftp://invisible-island.net

I followed up over the next week and mailed a patch to resync XFree86 and Phelps. The changes are in the XFree86 CVS (rev 1.19), of course, but were absent from X.Org's copy. Reading comments such as this, I pointed out the missing updates, e.g., in the X.Org mailing list more than once (in 2005 and 2006) as well as to packagers, e.g., on the Cygwin mailing list. Ultimately X.Org dropped the program (as noted here in 2008).

This raises a two-part question:

While I have seen occasional mailing-list discussion of the X documentation files, there has not been much. Occasionally there are comments about converting to XML, discarding obsolete formats, etc. That would give the casual observer the impression that the manual pages are converted to XML, and then (as implied by the people discussing ncurses documentation) that the content could be "easily" converted to a variety of formats. Perhaps.

To get an answer, you have to look at where the installed manual pages come from. Xorg has little to do with the end result which you would install on a computer; that is done by packagers such as Red Hat. Inspecting recent copies of these source packages gives an answer:

vi-like-emacs doc-files

Aside from its manual page, all of vile's documentation started out as plain-text files. This was awkward, since only the help-file (also plain-text) was readily accessible from the editor.

In 1999, Martin Lemburg suggested providing different formats for the help-file. Finding that vile.hlp was not a Microsoft help-file, he provided an example where this was converted to RTF, suggesting that I add it to the distribution. However, I pointed out that this could be hard to maintain:

> if you want to you could include the file(s) inside the next
> distributions!

Perhaps as a separate zip file (I agree that people will find this useful).
This issue has come up before, and I don't have a good solution:  how can
we maintain alternate forms of the documentation when there's a manual
stage required.

vile's documentation has had suffixes which conflict with Microsoft's types for quite a while (vile.hlp since October 1990, macros.doc since April 1994), but converting the documentation to other formats did not seem like a solution to the problem. We continued to add documentation.

Finally, at the end of 2009, I changed things, making all of the documentation except the manual page in HTML format, and generating the “.hlp” and “.doc” files from HTML. Interestingly enough, one of the reasons was to improve packaging for winvile:

20091228 (y)
        ...
        > Tom Dickey:
        + change Inno Setup script to install html files rather than doc files
          for documentation; added a shortcut to the table of contents.
        + convert all ".doc" files to html, generate ".doc" from those files.
        ...
20091114 (x)
        > Tom Dickey:
        ...
        + add doc/vile-toc.html, a table of contents for vile-hlp.html
        + generate vile.hlp from doc/vile-hlp.html
        ...
        + add doc/vile-hlp.html, which is a marked-up version of vile.hlp
        + change atr2html to not use <pre>, since <font> inside <pre> is
          nonstandard.

There were other reasons:

There was no point in constructing an HTML version of the manual page, and translating it back into nroff. groff could do an adequate job of going from nroff to HTML.

Why not groff?

Because I had accumulated many manpages, and because people requested it, I set up my website to provide documentation in HTML.

For most programs other than ncurses (where I was already generating HTML using man2html), I initially relied upon groff to do this in makefiles. That seemed simpler than maintaining another tool. For instance, this check-in for xterm in April 2006:

REV:1.129               Makefile.in         2006/04/21 18:57:58       tom

   add a rule to make ctlseqs.html using groff's _very_ crude -Thtml output.
   Don't use that - it works better to make a pdf using ghostscript.

--- Makefile.in 2006/04/10 00:34:36     1.128
+++ Makefile.in 2006/04/21 18:57:58     1.129
@@ -1,4 +1,4 @@
-## $XTermId: man2html.html,v 1.59 2024/03/04 09:21:54 tom Exp $
+## $XTermId: man2html.html,v 1.59 2024/03/04 09:21:54 tom Exp $
 ##
 ## $XFree86: xc/programs/xterm/Makefile.in,v 3.54 2006/04/10 00:34:36 dickey Exp $ ##
 ##
@@ -242,6 +242,9 @@
 maintainer-clean : realclean
        -$(RM) 256colres.h 88colres.h
 
+ctlseqs.html : ctlseqs.ms
+       GROFF_NO_SGR=stupid $(SHELL) -c "tbl ctlseqs.ms | groff -Thtml -ms" >$@
+
 ctlseqs.txt : ctlseqs.ms
        GROFF_NO_SGR=stupid $(SHELL) -c "tbl ctlseqs.ms | nroff -Tascii -ms" >$@

The GROFF_NO_SGR variable is of course a workaround for a misfeature added to groff a couple of years earlier, but was not actually needed for HTML. By "Don't use that", I reminded myself that there were problems with groff's conversion to HTML but that the PDF and PostScript conversions worked well enough. I added PostScript to xterm's makefile in patch 118 (1999) and PDF in patch 226 (2007). Around 2007, I decided to make the documentation for my programs available in various formats (HTML, PDF, PostScript and plain text).
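
For readers unfamiliar with the workaround: when GROFF_NO_SGR is not set, groff's terminal output marks bold and underline with SGR escape sequences (of the general form ESC [ ... m) rather than with backspace overstrikes, and those sequences end up in a file meant to be plain text. The sketch below, in Python, shows what filtering them out afterwards would look like. The helper name strip_sgr is an assumption for illustration and is not part of xterm's makefile.

```python
import re

# An SGR sequence: ESC, "[", optional numeric parameters separated by ";", "m".
SGR_SEQUENCE = re.compile(r"\x1b\[[0-9;]*m")


def strip_sgr(text: str) -> str:
    """Remove SGR escape sequences, leaving only the printable text."""
    return SGR_SEQUENCE.sub("", text)


def has_sgr(text: str) -> bool:
    """Report whether any SGR escape sequence is present."""
    return SGR_SEQUENCE.search(text) is not None
```

Setting the environment variable, as the makefile rule does, avoids the need for any such post-processing.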

Converting to HTML with groff has a few advantages:

However (especially with xterm), there were a number of problems:

It was the poor appearance of resized pages which made me decide to stop using groff's HTML conversion.

Hood's man2html script

Here are a few links showing the script's early history:

My fixes

The script itself is written in Perl. Typical of many scripts written during the 1990s, it did not use Perl's strict checking features. I added some strict checking, and reformatted it. It has grown by about a third, going from 607 to 823 lines.

Some of the improvements are of more general interest:

There are areas where the current (July 2015) script could be improved: Using a locale with UTF-8 encoding, groff will produce UTF-8 output. The current script does not handle that, but if it were modified to do this, it could display nicer line-drawing for the tables, and nicer bullet characters.
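
One possible shape for that improvement, sketched in Python rather than Perl and not implemented in the script: decode the formatter's output as UTF-8, escape the HTML metacharacters, and emit every non-ASCII character (bullets, line-drawing) as a numeric character reference so the page displays correctly whatever charset it is served with. The function name utf8_to_html is hypothetical.

```python
import html


def utf8_to_html(raw: bytes) -> str:
    """Decode UTF-8 formatter output and make it safe to embed in HTML."""
    text = raw.decode("utf-8")
    out = []
    for ch in text:
        if ch in "&<>":
            out.append(html.escape(ch))     # HTML metacharacters
        elif ord(ch) > 127:
            out.append(f"&#{ord(ch)};")     # e.g. bullet or box-drawing
        else:
            out.append(ch)                  # plain ASCII passes through
    return "".join(out)
```

A bullet (U+2022) followed by text would come out as "&#8226;" followed by that text. Overstrike handling would still have to happen on the decoded characters, not on the raw bytes, since a multi-byte character counts as one cell.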

Documentation

Download

Other Tools

Other “man2html” tools

There are several with the same name, but not well-known:

There is one well-known variant: the one originally written by Richard Verhoeven sometime in the early 1990s. Exactly when is obscure: he was apparently at Eindhoven for a few years involved with MathSpad, but the man2html program is mentioned only when someone else started working on it in 1996.

Both Verhoeven's and Hood's programs generate all of the HTML tags in uppercase. The resemblance ends there:

Other manpage/html converters

Other Versions

Aside from my changes, the script seems to have been relatively unmaintained. There was one exception.

Immediately after I released ncurses 6.0 in August 2015, Jörg Schilling noticed the mention of man2html in the release notes and added his version of this script the following day. The timestamp for that file itself was in 2013, but the makefile and announcement are dated August 11, 2015 (no previous mention was made of the script in “schily-tools”).

Schilling's script differs from the original by only a few lines (3 aside from the hashbang line) of the 607 total. Here is a diff created from his changes, showing the relevant parts:

--- man2html    1997-08-12 13:19:18.000000000 -0400
+++ man2html.new        2013-06-27 17:36:26.000000000 -0400
@@ -1,4 +1,4 @@
-#!/usr/local/bin/perl
+#!/usr/bin/perl
 ##---------------------------------------------------------------------------##
 ##  File:
 ##      @(#) man2html 1.2 97/08/12 12:57:30 @(#)
@@ -184,7 +184,7 @@
 
            ## Create anchor links for manpage references
            s/((((.\010)+)?[\+_\.\w-])+\(((.\010)+)?
-             \d((.\010)+)?\w?\))
+             \d((.\010)+)?\w*\))
             /make_xref($1)
             /geox  if $see_also;
 
@@ -442,7 +442,8 @@
 
     if ($CgiUrl) {
        my($title,$section,$subsection) =
-           ($str =~ /([\+_\.\w-]+)\((\d)(\w?)\)/);
+           ($str =~ /([\+_\.\w-]+)\((\d)(\w*)\)/);
+           my($subsection) = lc($subsection);
 
        $title =~ s/\+/%2B/g;
        my($href) = (eval $CgiUrl);
@@ -474,7 +475,7 @@
     while ($line = <$InFH>) {
        next if $line !~ /\(\d\w?\)\s+-\s/; # check if line can be handled
        ($refs,$section,$subsection,$desc) =
-           $line =~ /^\s*(.*)\((\d)(\w?)\)\s*-\s*(.*)$/;
+           $line =~ /^\s*(.*)\((\d)(\w*)\)\s*-\s*(.*)$/;
 
        if ($Solaris) {
            $refs =~ s/^\s*([\+_\.\w-]+)\s+([\+_\.\w-]+)\s*$/$1/;

The comments in Schilling's ANNOUNCEMENTS/AN-2015-08-11 summarize his changes:

-       man2html: This was added to schilytools as the original man2html
        command has a bug with processing sub-sections and as the original
        man2html is completely unmaintained since August 12 1997.

-       man2html: subsections are now handled correctly and may be
        longer than a single character.
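
The effect of changing \w? to \w* in the diff above can be checked with a small experiment. This sketch uses Python's re module, whose handling of these constructs matches Perl's for this purpose. The pattern is the cross-reference pattern from the diff, while the helper name parse_xref and the sample page names are illustrative only.

```python
import re

# Original pattern: the subsection is at most one word character.
OLD_XREF = re.compile(r"([+_.\w-]+)\((\d)(\w?)\)")
# Changed pattern: the subsection may be any number of word characters.
NEW_XREF = re.compile(r"([+_.\w-]+)\((\d)(\w*)\)")


def parse_xref(pattern, text):
    """Return (title, section, subsection) or None if the text is no match.

    The subsection is lowercased, as the lc() call in the diff does.
    """
    match = pattern.search(text)
    if match is None:
        return None
    title, section, subsection = match.groups()
    return title, section, subsection.lower()
```

With both patterns, "curses(3X)" parses to a title of "curses", section "3" and subsection "x". Only the changed pattern accepts a longer suffix such as "curses(3ncurses)"; the original fails to match it at all, so no link would be generated for that reference.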

All of the subsequent mentions of man2html in his announcements alluded to workarounds for using the script as is, rather than to improving it.

Later that year, someone asked a question about converting manpages to html, which I answered, pointing to this page. Schilling downvoted my answer a few minutes later.