MAN2HTML

http://invisible-island.net/scripts/
Copyright © 2015-2022,2024 by Thomas E. Dickey

Synopsis
Background
Why not groff?
Hood's man2html script
My fixes
Documentation
Download
Other Tools
- Other “man2html” tools
- Other manpage/html converters
Other Versions

Synopsis

Here is an improved version of Earl Hood's man2html Perl script,
along with a discussion of why I came to make those improvements.

Background

Initially (in the mid-1990s) all of the documentation for the programs I work on was either plain text READMEs or nroff (manual pages). HTML came later.

ncurses and info

While ncurses has had some HTML documentation since 1.9.2d in mid-1995, this started with just a few files which were intended to be part of a website:

At that point in time, there was in fact no website. But just in case.
The hacker's guide followed at the end of 1995. Still no website.

As these files were added, the dist.mk makefile was updated to provide plain-text versions using lynx dumps.
I made improvements to the makefile, but largely ignored the HTML files, for a while.

Juergen Pfeifer started things going, by providing HTML-ized versions of the Ada95 binding starting in October 1996, and adding the adahtml rule in dist.mk to generate the files. He used the gnathtml script (which appears to be orphaned — I have maintained a copy for ncurses). Also, in Juergen's initial changes, he added HTML-ized manual pages under the Ada95/html directory.

Later, in March 2000, Juergen Pfeifer added the manhtml rule to dist.mk to generate HTML-ized manual pages using man2html.
At that time, he moved the HTML manual pages to their present location in doc/html/man.

Because they were not really part of the Ada95 binding (which Juergen developed and maintained for several years),
I took more of an interest in being able to rebuild those files. I started making improvements

to gnathtml in June 2000, and
to man2html in July 2001

Both gnathtml and man2html are Perl scripts which process the output from other programs (such as gnat and nroff) to provide HTML. Besides regular bug-fixes, I made improvements (in dist.mk) so that the generated pages would validate, e.g., on w3.org's test pages.
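
To illustrate what processing nroff output involves: when nroff writes plain terminal output without SGR escapes, it marks bold text by overstriking a character with itself (c, backspace, c) and underlined text by overstriking an underscore (_, backspace, c). The following Python sketch converts one such line to HTML. It is only an illustration under those assumptions, not code from man2html, and the name overstrike_to_html is made up for the example.

```python
import html
import re

# One output cell: a character followed by one or more "backspace + character"
# pairs (an overstrike), or else a single plain character.
_CELL = re.compile(r"(.)((?:\x08.)+)|(.)", re.DOTALL)


def overstrike_to_html(line: str) -> str:
    """Convert nroff backspace overstrikes in a line to <B>/<I> markup."""
    out = []
    current = None  # tag currently open: "B", "I", or None
    for m in _CELL.finditer(line):
        if m.group(3) is not None:
            ch, style = m.group(3), None      # plain character
        else:
            first = m.group(1)
            last = m.group(2)[-1]             # the character finally printed
            if first == "_" and last != "_":
                ch, style = last, "I"         # underscore overstruck: underline
            else:
                ch, style = last, "B"         # character struck twice: bold
        if style != current:
            if current:
                out.append(f"</{current}>")
            if style:
                out.append(f"<{style}>")
            current = style
        out.append(html.escape(ch))           # protect <, > and & in the text
    if current:
        out.append(f"</{current}>")
    return "".join(out)
```

A bold underscore and an underlined underscore look identical in this encoding, so the sketch simply treats that case as bold.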

Throughout this discussion, the reader may notice that I do these conversions to HTML using automatic tools. I spend a lot of time making them work, and do not manually adjust the result.

As an alternate view, a few people would like to rely upon GNU info format and not use manpage format. In response to a request (at the end of May 2000) to convert to Texinfo, I pointed out that

I would do this if there were a tool,
There was no tool (and likely still is none), and
I would not manually convert formats (and stop distributing manpages).

The lack of tools is two-fold:

There were no tools for converting the syntax, and
Texinfo documents are organized differently from manual pages.

A one-way manual conversion (based on comments made in a similar request the previous month) would take a few weeks of full-time work. My response to this led to a followup from some people who had volunteered to do the work. I saw no result from that.

The justification given for the request to convert all of the documentation to Texinfo was based on its ease of conversion to other formats (which is dubious, given the lack of tools for converting to/from manual pages). Texinfo of course is not the only possible solution to the problem of producing multiple formats. Groff can do that.

There also is SGML. During the next few years (beginning on the mailing list in 2002, as well as in private email), Pradeep Padala (who wrote an ncurses HOWTO) discussed converting ncurses' documentation to SGML. Ultimately that ran into problems with tools. Given a good DTD, xmlto could produce something, but getting a workable toolset can rapidly become an end in itself, requiring continual maintenance to work around tool breakage. In writing this, for example, I came across Greg Woods' comment on a NetBSD mailing list in 2007 entitled why XML? (was: Proposition for Releases page changes) which you may find amusing.

On the other hand, the HOWTO is useful and was already converted. After some discussion on licensing, we added it (as HTML) to ncurses in mid-2005. Later, I noticed that Pradeep was no longer actively maintaining the HOWTO or the accompanying programming examples, and collected those for the ncurses FAQ.

XFree86 and rman

XFree86 used a different tool for generating HTML documentation: rman. Formerly known as "RosettaMan" (but problematic due to infringement alleged by Rosetta, Inc.), renamed to "PolyglotMan", the tool reads nroff files directly and attempts to construct formatted HTML manual pages.

XTerm's manual page was the longest manual page in XFree86. It also had some problems which I (mostly) fixed by improvements to rman (see CVS):

fixes to address weblint and tidy warnings
modified to not reformat comments, so that copyright notices in the original files would appear as-is in the HTML.
imported rman 3.2 and merged XFree86's changes to it.

I sent those changes to Tom Phelps at the end of 2003, but beyond email acknowledgement, saw no further updates from him. Here is part of the conversation.

From dickey@his.com Wed Dec 17 20:58:41 2003 -0500
Date: Wed, 17 Dec 2003 20:58:41 -0500 (EST)
From: Thomas Dickey <dickey@his.com>
To: Tom Phelps <phelps@eecs.berkeley.edu>
Subject: Re: rman
In-Reply-To: <F42DDF86-30FB-11D8-9B95-000A959C28F4@cs.berkeley.edu>
Message-ID: <Pine.BSI.4.53.0312172055440.17997@mail.his.com>
References: <F42DDF86-30FB-11D8-9B95-000A959C28F4@cs.berkeley.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: RO
Content-Length: 381
Lines: 15

On Wed, 17 Dec 2003, Tom Phelps wrote:

> >  I think it'd be nice to cleanup the 3.2 (indent it,
> > fix the compiler warnings, improve the HTML validation), add options
> > for
> > some of the features that are needed for XFree86.
>
> Sounds good.  I'm always happy to receive patches.  Good luck!

ok.

-- 
Thomas E. Dickey
http://invisible-island.net
ftp://invisible-island.net

I followed up over the next week and mailed a patch to resync XFree86 and Phelps. The changes are in the XFree86 CVS (rev 1.19), of course, but were absent from X.Org's copy. Reading comments such as this, I pointed out the missing updates, e.g., in the X.Org mailing list more than once (in 2005 and 2006) as well as to packagers, e.g., on the Cygwin mailing list. Ultimately X.Org dropped the program (as noted here in 2008).

This raises a two-part question:

how does Xorg format manual pages,
how are they converted to HTML?

While I have seen occasional mailing-list discussion of the X documentation files, there has not been much. Occasionally there are comments about converting to XML, discarding obsolete formats, etc. That would give the casual observer the impression that the manual pages are converted to XML, and then (as implied by the people discussing ncurses documentation) that the content could be "easily" converted to a variety of formats. Perhaps.

To get an answer, you have to look at where the installed manual pages come from. Xorg has little to do with the end result which you would install on a computer; that is done by packagers such as Red Hat. Inspecting recent copies of these source packages gives an answer:

xorg-x11-docs, i.e., the "X" manual page.

Interestingly, the RPM says its source is

http://ftp.x.org/pub/individual/doc/

and a quick look there shows that the size of the tarballs decreases steadily with time.

The Xorg developers have reduced the scope of their effort to make some progress. Release 1.7 is twenty times smaller than release 1.0, partly due to removing "obsolete" formats, but also to removing entire specifications such as Xaw, XIM, Xt from the release, putting them in per-component locations (currently here: Xaw, XIM, Xt). Little attention has been paid to packaging those specifications. For instance, Fedora simply provides some of the XML-files in the development packages without attempting to convert them to PDF (none are given in libX11-devel).

In the libxcb-devel package, Fedora provides a set of manual pages generated (according to the RPM spec file) from libxcb source code. The work is far from complete, with 80% containing "TODO" and "NOT DOCUMENTED" tags.

xorg-x11-utils, e.g., for xev.

These provide the same files (nroff) with some minor changes and improvements. While the X specs have been partially converted to SGML, the state of the manual pages as of mid-2015 (ten years into the process) is still "someday".

xorg-sgml-doctools (this, at least, is new).

vi-like-emacs doc-files

Aside from its manual page, all of vile's documentation started out as plain-text files. This was awkward, since only the help-file (also plain-text) was readily accessible from the editor.

In 1999, Martin Lemburg suggested providing different formats for the help-file. Finding that vile.hlp was not a Microsoft help-file, he provided an example where this was converted to RTF, suggesting that I add it to the distribution. However, I pointed out that this could be hard to maintain:

> if you want to you could include the file(s) inside the next distributions!

Perhaps as a separate zip file (I agree that people will find this useful).
This issue has come up before, and I don't have a good solution:  how can
we maintain alternate forms of the documentation when there's a manual
stage required.

vile's documentation has had suffixes which conflict with Microsoft's types for quite a while (vile.hlp since October 1990, macros.doc since April 1994), but converting the documentation to other formats did not seem like a solution to the problem. We continued to add documentation.

Finally, at the end of 2009, I changed things, making all of the documentation except the manual page in HTML format, and generating the “.hlp” and “.doc” files from HTML. Interestingly enough, one of the reasons was to improve packaging for winvile:

20091228 (y)
        ...
        > Tom Dickey:
        + change Inno Setup script to install html files rather than doc files
          for documentation; added a shortcut to the table of contents.
        + convert all ".doc" files to html, generate ".doc" from those files.
        ...
20091114 (x)
        > Tom Dickey:
        ...
        + add doc/vile-toc.html, a table of contents for vile-hlp.html
        + generate vile.hlp from doc/vile-hlp.html
        ...
        + add doc/vile-hlp.html, which is a marked-up version of vile.hlp
        + change atr2html to not use <pre>, since <font> inside <pre> is
          nonstandard.

There were other reasons:

I had made improvements to vile's text-to-html conversion, and (as part of improving my website) intended to use the generated text in my ncurses and xterm FAQs. But I did the changes first in vile's help-file.
By converting vile's documentation to HTML, I could put colored examples (of macros, etc) in the documentation.
HTML had become so prevalent that it effectively displaced most uses of plain-text READMEs.

There was no point in constructing an HTML version of the manual page, and translating it back into nroff. groff could do an adequate job of going from nroff to HTML.

Why not groff?

Because I had accumulated many manpages, and because people requested it, I set up my website to provide documentation in HTML.

For most programs other than ncurses (where I was already generating HTML using man2html), I initially relied upon groff to do this in makefiles. That seemed simpler than maintaining another tool. For instance, this check-in for xterm in April 2006:

REV:1.129               Makefile.in         2006/04/21 18:57:58       tom

   add a rule to make ctlseqs.html using groff's _very_ crude -Thtml output.
   Don't use that - it works better to make a pdf using ghostscript.

--- Makefile.in 2006/04/10 00:34:36     1.128
+++ Makefile.in 2006/04/21 18:57:58     1.129
@@ -1,4 +1,4 @@
-## $XTermId: Makefile.in,v 1.128 2006/04/10 00:34:36 tom Exp $
+## $XTermId: Makefile.in,v 1.129 2006/04/21 18:57:58 tom Exp $
 ##
 ## $XFree86: xc/programs/xterm/Makefile.in,v 3.54 2006/04/10 00:34:36 dickey Exp $ ##
 ##
@@ -242,6 +242,9 @@
 maintainer-clean : realclean
        -$(RM) 256colres.h 88colres.h
 
+ctlseqs.html : ctlseqs.ms
+       GROFF_NO_SGR=stupid $(SHELL) -c "tbl ctlseqs.ms | groff -Thtml -ms" >$@
+
 ctlseqs.txt : ctlseqs.ms
        GROFF_NO_SGR=stupid $(SHELL) -c "tbl ctlseqs.ms | nroff -Tascii -ms" >$@

The GROFF_NO_SGR is of course a workaround for a misfeature added to groff a couple of years earlier, but was not actually needed for HTML. By "Don't use that", I reminded myself that there were problems with groff's conversion to HTML but that the PDF and PostScript conversions worked well enough. I added PostScript to xterm's makefile in patch 118 (1999) and PDF in patch 226 (2007). Around 2007, I decided to make the documentation for my programs available in various formats (HTML, PDF, PostScript and plain text).

Converting to HTML with groff has a few advantages:

It is someone else's tool, and may require less work from me.
The font choices look reasonably nice.
It generates a list of sections at the top, which made it lynx-friendly.
The pages are resizable (which seemed initially to be a good thing).

However (especially with xterm), there were a number of problems:

groff's conversion to HTML is incomplete; all of the tables are converted to images (sections of the PDF file).
It is not possible to select text from a table.
Often, groff dumps core when generating these images. By trial and (many) errors as well as using particular versions of groff, I tuned the nroff files to reduce the core dumps.
Even when it does not dump core, groff is prone to clipping the images.
xterm's control sequences document uses boxed characters (see the PDF), which do not render in HTML. This might be an advantage, given groff's problems with images.
There is no way to tell groff to generate hyperlinks to other HTML files. In contrast, man2html recognizes the manpage convention with name(X) and generates links automatically.
While resizable pages can be nice, sometimes they detract from readability. In a browser which covers most of the screen, paragraphs can stretch across the screen, making it hard to find the beginning of sentences. I use explicit breaks between sentences when (I think) it aids readability. That makes the result look ragged. This paragraph intentionally uses no line-breaks.

Cascading style-sheets can amend this, placing limits on the width of a paragraph. Given a suitable style-sheet, it solves the problem. The max-width property uses the full screen-width up to a point, and then stops growing. Lynx does not support style-sheets however, and initially my interest in HTML-ized documentation was for Lynx.
Resizing a page of groff's HTML does not behave well, since it does not build in CSS definitions to keep hanging indents from stretching in odd ways.
Even without the problem with resizing, hanging indents and bullets tend to convert in odd ways using groff unless one uses a particular style — much more restrictive than the normal manpage rules. If not, the bullets end up on a different line from the text.

It was the poor appearance of resized pages which made me decide to stop using groff's HTML conversion.
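
The max-width remedy mentioned above can be sketched as follows, using Python to emit the page for illustration. The function name and the 50em limit are assumptions chosen for the example; they are not taken from any of the pages discussed here.

```python
def page_with_max_width(body_html: str, max_width: str = "50em") -> str:
    """Wrap an HTML fragment in a page whose paragraphs stop growing at max_width."""
    # Paragraphs use the available width up to the limit, then stop widening.
    style = f"p {{ max-width: {max_width}; }}"
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><title>example</title>\n'
        f"<style>{style}</style>\n"
        "</head><body>\n"
        f"{body_html}\n"
        "</body></html>\n"
    )
```

A browser without style-sheet support, such as Lynx, ignores the style element and shows the paragraphs at full width.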

Hood's man2html script

Here are a few links showing the script's early history:

New man2html (Tool to convert nroff text to HTML) (21 July 1994)
man2html in Tcl, forum comments from 1998 mentioning rman as an alternative.
man2html homepage (01 March 1999)
CVE-1999-1565
Earl Hood / man2html – search.cpan.org
man2html homepage
3.01 package on FreshPorts

My fixes

The script itself is written in Perl. Typical of many written during the 1990s, it did not use Perl's strict checking features. I added some strict checking, and reformatted it. It has grown by about a third, going from 607 to 823 lines.

Some of the improvements are of more general interest:

new options “-aliases”, “-index”, “-toc”
fix a bug where the word marked by an HREF is hyphenated; the HREF was broken.
recognize manpage subsections, i.e., from .SS lines (something not done by groff).
improved section-title logic to work with xterm's ctlseqs.ms file.
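
The hyphenation problem can be illustrated with a short Python sketch: a reference such as terminfo(5) that nroff split across two lines is rejoined before the name(X) pattern is turned into a link. This is not the fix made in the Perl script, only one way the idea could work; the function name link_xrefs and the name.section.html link scheme are assumptions for the example.

```python
import re

# name(section): the section is a digit plus optional letters, e.g. ls(1), curs_util(3x)
_XREF = re.compile(r"\b([A-Za-z_][\w.+-]*)\((\d\w*)\)")
# A word fragment ending in "-" at the end of a line, whose continuation on the
# next line completes a name(section) reference.
_SPLIT = re.compile(r"(\w)-\n[ \t]*(?=[\w.+-]*\(\d\w*\))")


def link_xrefs(text: str) -> str:
    """Rejoin hyphenated manpage references, then wrap each one in an anchor."""
    joined = _SPLIT.sub(r"\1", text)  # drop the hyphen, newline and indent
    return _XREF.sub(
        lambda m: f'<A HREF="{m.group(1)}.{m.group(2)}.html">{m.group(0)}</A>',
        joined,
    )
```

The sketch cannot tell a hyphen inserted by nroff from one that belongs to the name, so a reference whose real name contains a hyphen at the break would lose it. Joining the lines also changes the line layout, which matters for preformatted output.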

There are areas where the current (July 2015) script could be improved: Using a locale with UTF-8 encoding, groff will produce UTF-8 output. The current script does not handle that, but if it were modified to do this, it could display nicer line-drawing for the tables, and nicer bullet characters.
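
One possible approach to UTF-8 input, sketched in Python rather than Perl: decode the formatter's output as UTF-8, escape the HTML metacharacters, and write every non-ASCII character (bullets, line-drawing) as a numeric character reference so that the page stays plain ASCII. The function name is hypothetical, and the sketch does not deal with overstrikes.

```python
import html


def utf8_to_ascii_html(raw: bytes) -> str:
    """Decode UTF-8 text and return ASCII-only HTML text."""
    text = raw.decode("utf-8", errors="replace")
    escaped = html.escape(text, quote=False)  # escape &, < and >
    # Replace each non-ASCII character with a reference of the form &#NNNN;
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

For example, a bullet character (U+2022) comes out as &#8226;.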

Documentation

Download

Other Tools

Other “man2html” tools

There are several with the same name, but not well-known:

man2html v0.7, by Rik Harris (rik@daneel.rdt.monash.edu.au) posted to WWW-Talk (17 August 1993) (see source).
man2html mentioned in 1999. The same program is listed in SGI's IRIX64 manpages.
man2html by Nelson Beebe (27 April 2001, Version 2.04).
man2html by Norman Hardy (9 November 2002)
gawk script found on Andreas Schoenberg's wiki (8 September 2007)
demo program mentioned in AutoGen 5.18.4 (writing this page on 9 July 2015, the latest release is 5.16.2).
The comment about the Perl script appears to be self-promotional rather than factual.

There is one well-known variant, i.e., the one originally written by Richard Verhoeven sometime in the early 1990s. Exactly when is obscure: he was apparently at Eindhoven for a few years involved with MathSpad, but the man2html program is mentioned only when someone else started working on it in 1996.

manual page on linuxcommand.org (web content generated with Hood's Perl script).
The reference to xmosaic implies it is an old version (early/mid-1990s).
However, the footer of the manual page says January 1, 1998.
Also, the reference to Verhoeven is in the third person.
VH-Man2html, UNIX man page to html converter, is Michael Hamilton's page on the topic.

This is based on Verhoeven's program. According to an LSM listing, and Debian mailing list comment:
- Version 1.1 was announced April 6, 1996,
- Version 1.2 was announced April 17, 1996,
- Version 1.3 was available May 4, 1996.
- Version 1.4 was available August 4, 1996.
- Version 1.5 was available March 9, 1997.
Some users consider this one to be man2html, e.g., quoting comments from http://fossies.org/linux/ftnchek/configure.in:
```
dnl configure.in   Process with autoconf to produce a configure script.
dnl  This autoconf input file is for ftnchek version 2.9, April 1996
...
dnl Look for man-to-html filter.  The scripts for converting the raw html
dnl into the files in the html directory depend on the specific
dnl filter's output style, so another converter probably won't do.
dnl At present the scripts require man2html, a.k.a. vh-man2html, which
dnl is now part of the standard RedHat distribution.  We won't worry much
dnl about this since users generally won't be messing with the docs.
AC_PATH_PROGS(MANtoHTMLPROG,man2html rman)

case "$MANtoHTMLPROG" in
dnl There are at least two man2html's out there, and probably many more,
dnl so we try to detect whether we have the one that works with the converters.
```
Michael Hamilton's page was last updated October 22, 2002. It says:

This page is no longer being maintained. VH-Man2html has been folded into the Linux man-1.4* and man-1.5 packages by Andries Brouwer (aeb@cwi.nl). The current version is usually found in http://ftp.win.tue.nl:/pub/linux/util/man.

Hamilton's page mentions Verhoeven's page at http://wsinwp01.win.tue.nl:1234/maninfo.html (which is long gone). However, there are several older versions of Verhoeven's page on the Internet Archive.

In the oldest available version (December 8, 1996), Verhoeven's page lists Earl Hood's Perl script as one of ten other variations on the theme.

The same page has a copy of Verhoeven's program, but it is undated. At 3043 lines (1700 statements), it is about three times the size of Hood's Perl script (but then, C is usually more verbose than Perl).
manual page on manned.org citing May 3, 1996 in the footer, and (according to manned.org) October 24, 1996 for the original file.

The web content was generated by manned.org's custom front-end to grotty (see source repository).
Post subject: app-text/man2html and app-misc/glimpse: bug report needed? (forum comments in 2011 about 1.5.2)
The gist of the comments is that one of the package maintainers added dependencies which a user did not want.
man2html, Debian packager Robert Luberda, vulnerability in man2html 1.6 in 2011:

Cross-site scripting (XSS) vulnerability in man2html.cgi.c in man2html 1.6, and possibly other version, allows remote attackers to inject arbitrary web script or HTML via unspecified vectors related to error messages.

Both Verhoeven's and Hood's programs generate all of the HTML tags in uppercase. The resemblance ends there:

Verhoeven's program generates an index on the end (using DL tags) which works with nroff subsections.
It also uses DL for hanging indents (such as I would use for bullets).
Perhaps a clever use of CSS could make those look as intended, but a more natural translation would use UL.
It generates regular HTML (not Hood's PRE sections to preserve nroff formatting).
A clever use of CSS as done by manned.org can prevent resizing, though.
Testing vh-man2html (the Debian package) with xterm's manual page shows some of the indentation and line-spacing problems which also affected rman. See examples:
- using rman
- using vh-man2html
Because it (like rman) interprets the nroff source files, it is limited to man macros. It does not work with xterm's control sequences (ctlseqs.ms) file which uses ms macros. This is what vh-man2html says:
```
Status: 403 Forbidden
Content-type: text/html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML><HEAD><TITLE>Invalid Man Page</TITLE></HEAD>
<BODY>
<H1>Invalid Man Page</H1>
The requested file ctlseqs.ms is not a valid (unformatted) man page.</BODY></HTML>
```
rman is a little more flexible. Actually, the Debian packaged rman (3.2-4) dumped core when I tested it. Using my improved version, it complains about a handful of unrecognized lines, but produces a halfway-usable HTML output (see example). Neither rman nor vh-man2html could be expected to deal with the macros in that file.

Other manpage/html converters

manServer - convert manual pages to HTML for viewing with a web browser
tcsh.man2html, mentioned in Beyond Linux From Scratch Chapter 7. Shells: Tcsh-6.18.01, this is part of TCSH(1) Cornell 6.05.00 (19 June 1994). tcsh's AUTHORS file says

Dave Schweisguth, Yale University, 1993-4
New man page and tcsh.man2html
man2html.tcl (11 April 1996) bundled with tcl.
According to a 1997 Tcl/Tk 8.0 README, this was based on a similar script (man2tcl.tcl) by Raymond Johnson,
in turn on a C program named man2tcl. At the time, Johnson was working on SunScript (Tcl written in Java).

Other Versions

Aside from my changes, the script seems to have been relatively unmaintained. There was one exception.

Immediately after I released ncurses 6.0 in August 2015, Jörg Schilling noticed the mention of man2html in the release notes and added his version of this script the following day. The timestamp for that file itself was in 2013, but the makefile and announcement are dated August 11, 2015 (no previous mention was made of the script in “schily-tools”).

Schilling's script differs from the original by only a few lines (3 aside from the hashbang line) of the 607 total. Here is a diff created from his changes, showing the relevant parts:

--- man2html    1997-08-12 13:19:18.000000000 -0400
+++ man2html.new        2013-06-27 17:36:26.000000000 -0400
@@ -1,4 +1,4 @@
-#!/usr/local/bin/perl
+#!/usr/bin/perl
##---------------------------------------------------------------------------##
##  File:
##      @(#) man2html 1.2 97/08/12 12:57:30 @(#)
@@ -184,7 +184,7 @@

            ## Create anchor links for manpage references
            s/((((.\010)+)?[\+_\.\w-])+\(((.\010)+)?
-             \d((.\010)+)?\w?\))
+             \d((.\010)+)?\w*\))
             /make_xref($1)
             /geox  if $see_also;

@@ -442,7 +442,8 @@

     if ($CgiUrl) {
        my($title,$section,$subsection) =
-           ($str =~ /([\+_\.\w-]+)\((\d)(\w?)\)/);
+           ($str =~ /([\+_\.\w-]+)\((\d)(\w*)\)/);
+           my($subsection) = lc($subsection);

        $title =~ s/\+/%2B/g;
        my($href) = (eval $CgiUrl);
@@ -474,7 +475,7 @@
     while ($line = <$InFH>) {
        next if $line !~ /\(\d\w?\)\s+-\s/; # check if line can be handled
        ($refs,$section,$subsection,$desc) =
-           $line =~ /^\s*(.*)\((\d)(\w?)\)\s*-\s*(.*)$/;
+           $line =~ /^\s*(.*)\((\d)(\w*)\)\s*-\s*(.*)$/;

        if ($Solaris) {
            $refs =~ s/^\s*([\+_\.\w-]+)\s+([\+_\.\w-]+)\s*$/$1/;

The comments in Schilling's ANNOUNCEMENTS/AN-2015-08-11 summarize his changes:

-       man2html: This was added to schilytools as the original man2html
        command has a bug with processing sub-sections and as the original
        man2html is completely unmaintained since August 12 1997.

-       man2html: subsections are now handled correctly and may be
        longer than a single character.
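
The effect of changing \w? to \w* can be shown with a small Python sketch. The two patterns below are translations of the title/section/subsection pattern from the diff, assuming that the parentheses around the section are matched literally. The helper split_ref and the sample references are made up for the illustration.

```python
import re

# Subsection limited to at most one character, as in the 1997 script.
ONE_CHAR = re.compile(r"([+_.\w-]+)\((\d)(\w?)\)")
# Subsection of any length, as in the modified script.
ANY_LEN = re.compile(r"([+_.\w-]+)\((\d)(\w*)\)")


def split_ref(pattern, ref):
    """Return (title, section, subsection) for a reference, or None if no match."""
    m = pattern.search(ref)
    return m.groups() if m else None
```

With these definitions, split_ref(ONE_CHAR, "curs_util(3x)") yields the three parts, but split_ref(ONE_CHAR, "curs_util(3ncurses)") finds no match at all; ANY_LEN handles both.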

All of the subsequent mentions of man2html in announcements alluded to his workarounds to use the script as is rather than improve it.

Later that year, someone asked a question about converting manpages to html, which I answered, pointing to this page. Schilling downvoted my answer a few minutes later.