http://invisible-island.net/scripts/
Copyright © 2015-2022,2024 by Thomas E. Dickey
Here is an improved version of Earl Hood's
man2html
Perl script,
along with a discussion of why I came to make those
improvements.
Initially (in the mid-1990s) all of the documentation for the programs I work on was either plain-text READMEs or nroff (manual pages). HTML came later.
While ncurses has had some HTML documentation since 1.9.2d in mid-1995, this started with just a few files which were intended to be part of a website:
At that point in time, there was in fact no website. But just
in case.
The hacker's guide followed
at the end of 1995. Still no website.
As these files were added, the dist.mk
makefile
was updated to provide plain-text versions using lynx dumps.
I made improvements to the makefile, but largely ignored the HTML
files, for a while.
Juergen Pfeifer started things going, by providing HTML-ized
versions of the Ada95 binding starting in October 1996, and adding the
adahtml
rule in dist.mk
to generate the
files. He used the gnathtml
script (which appears to be orphaned — I have maintained a
copy for ncurses). Also, in Juergen's initial changes, he added
HTML-ized manual pages under the Ada95/html
directory.
Later, in March 2000, Juergen Pfeifer added the manhtml rule to
dist.mk to generate HTML-ized manual pages using man2html.
At that time, he moved the HTML manual pages to their present
location in doc/html/man.
Because they were not really part of the Ada95 binding (which
Juergen developed and maintained for several years),
I took more of an interest in being able to rebuild those files.
I started making improvements
to gnathtml in June 2000, and
to man2html in July 2001.
Both gnathtml
and man2html
are Perl
scripts which process the output from other programs (such as
gnat and nroff) to provide HTML. Besides regular bug-fixes, I
made improvements (in dist.mk
) so that the generated
pages would validate, e.g., on w3.org's test pages.
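To make "process the output from nroff" concrete: nroff's terminal output conventionally marks bold text by overstriking a character with itself (c, backspace, c) and underlined text by overstriking an underscore (_, backspace, c). The following is a minimal sketch of turning such a line into HTML. It is written in Python for illustration and is not taken from either Perl script; the function name `overstrike_to_html` is hypothetical.

```python
import html
import itertools


def overstrike_to_html(line: str) -> str:
    """Convert one line of nroff output to HTML (illustrative helper)."""
    cells = []  # (character, style) pairs; style is "B", "I" or ""
    i = 0
    while i < len(line):
        if i + 2 < len(line) and line[i + 1] == "\b":
            first, second = line[i], line[i + 2]
            if first == second:
                style = "B"      # c BS c: bold
            elif first == "_":
                style = "I"      # _ BS c: underline, shown as italic
            else:
                style = ""       # unknown overstrike: keep the last character
            cells.append((second, style))
            i += 3
            # Some formatters overstrike more than once (c BS c BS c).
            while i + 1 < len(line) and line[i] == "\b":
                i += 2
        else:
            cells.append((line[i], ""))
            i += 1
    out = []
    # Group adjacent characters with the same style into one tag pair.
    for style, group in itertools.groupby(cells, key=lambda cell: cell[1]):
        text = html.escape("".join(ch for ch, _ in group))
        out.append(f"<{style}>{text}</{style}>" if style else text)
    return "".join(out)


print(overstrike_to_html("N\bNA\bAM\bME\bE"))
```

A real converter also has to deal with page headers and footers, blank-line squeezing and section titles; this sketch covers only the character-level markup.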
Throughout this discussion, the reader may notice that I do these conversions to HTML using automatic tools. I spend a lot of time making them work, and do not manually adjust the result.
As an alternate view, a few people would like to rely upon GNU info format and not use manpage format. In response to a request (at the end of May 2000) to convert to Texinfo, I pointed out that
I would do this if there were a tool,
There was no tool (and likely still is none), and
I would not manually convert formats (and stop distributing manpages).
The lack of tools is two-fold:
There were no tools for converting the syntax, and
Texinfo documents are organized differently from manual pages.
A one-way manual conversion (based on comments made in a similar request the previous month) would take a few weeks of full-time work. My response to this led to a followup from some people who had volunteered to do the work. I saw no result from that.
The justification given for the request to convert all of the documentation to Texinfo was based on its ease of conversion to other formats (which is dubious, given the lack of tools for converting to/from manual pages). Texinfo of course is not the only possible solution to the problem of producing multiple formats. Groff can do that.
There also is SGML. During the next few years (beginning on
the mailing list in
2002, as well as in private email), Pradeep Padala (who wrote
an ncurses
HOWTO) discussed converting ncurses' documentation to SGML.
Ultimately that ran into problems with tools. Given a
good DTD, xmlto
could
produce something, but getting a workable toolset can
rapidly become an end in itself, requiring continual maintenance
to work around tool breakage. In writing this, for example, I
came across Greg Woods' comment on a NetBSD mailing list in 2007
entitled why
XML? (was: Proposition for Releases page changes) which
you may find amusing.
On the other hand, the HOWTO is useful and was already converted. After some discussion on licensing, we added it (as HTML) to ncurses in mid-2005. Later, I noticed that Pradeep was no longer actively maintaining the HOWTO or the accompanying programming examples, and collected those for the ncurses FAQ.
XFree86 used a different tool for generating HTML documentation: rman. Originally named "RosettaMan" and later renamed "PolyglotMan" (the first name was problematic due to infringement alleged by Rosetta, Inc.), the tool reads nroff files directly and attempts to construct formatted HTML manual pages.
XTerm's manual page was the longest manual page in XFree86. It
also had some problems which I (mostly) fixed by improvements to
rman
(see
CVS):
fixes to address weblint and tidy warnings
modified to not reformat comments, so that copyright notices in the original files would appear as-is in the HTML.
imported rman 3.2 and merged XFree86's changes to it.
I sent those changes to Tom Phelps at the end of 2003, but beyond email acknowledgement, saw no further updates from him. Here is part of the conversation.
From dickey@his.com Wed Dec 17 20:58:41 2003 -0500
Date: Wed, 17 Dec 2003 20:58:41 -0500 (EST)
From: Thomas Dickey <dickey@his.com>
To: Tom Phelps <phelps@eecs.berkeley.edu>
Subject: Re: rman
In-Reply-To: <F42DDF86-30FB-11D8-9B95-000A959C28F4@cs.berkeley.edu>
Message-ID: <Pine.BSI.4.53.0312172055440.17997@mail.his.com>
References: <F42DDF86-30FB-11D8-9B95-000A959C28F4@cs.berkeley.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: RO
Content-Length: 381
Lines: 15

On Wed, 17 Dec 2003, Tom Phelps wrote:

> > I think it'd be nice to cleanup the 3.2 (indent it,
> > fix the compiler warnings, improve the HTML validation), add options
> > for
> > some of the features that are needed for XFree86.
>
> Sounds good. I'm always happy to receive patches. Good luck!

ok.

--
Thomas E. Dickey
http://invisible-island.net
ftp://invisible-island.net
I followed up over the next week and mailed a patch to resync XFree86 and Phelps. The changes are in the XFree86 CVS (rev 1.19), of course, but were absent from X.Org's copy. Reading comments such as this, I pointed out the missing updates, e.g., in the X.Org mailing list more than once (in 2005 and 2006) as well as to packagers, e.g., on the Cygwin mailing list. Ultimately X.Org dropped the program (as noted here in 2008).
This raises a two-part question:
how does Xorg format manual pages,
how are they converted to HTML?
While I have seen occasional mailing-list discussion of the X documentation files, there has not been much. Occasionally there are comments about converting to XML, discarding obsolete formats, etc. That would give the casual observer the impression that the manual pages are converted to XML, and then (as implied by the people discussing ncurses documentation) that the content could be "easily" converted to a variety of formats. Perhaps.
To get an answer, you have to look at where the installed manual pages come from. Xorg has little to do with the end result which you would install on a computer; that is done by packagers such as Red Hat. Inspecting recent copies of these source packages gives an answer:
xorg-x11-docs, i.e., the "X" manual page.
Interestingly, the RPM says its source is
and a quick look there shows that the size of the tarballs decreases steadily with time.
The Xorg developers have reduced the scope of their effort to make some progress. Release 1.7 is twenty times smaller than release 1.0, partly due to removing "obsolete" formats, but also to removing entire specifications such as Xaw, XIM, Xt from the release, putting them in per-component locations (currently here: Xaw, XIM, Xt). Little attention has been paid to packaging those specifications. For instance, Fedora simply provides some of the XML-files in the development packages without attempting to convert them to PDF (none are given in libX11-devel).
In the libxcb-devel
package, Fedora provides
a set of manual pages generated (according to the RPM spec
file) from libxcb
source code. The work is far from complete, with 80%
containing "TODO" and "NOT DOCUMENTED" tags.
xorg-x11-utils, e.g., for xev.
These provide the same files (nroff) with some minor changes and improvements. While the X specs have been partially converted to SGML, the state of the manual pages as of mid-2015 (ten years into the process) is still "someday".
xorg-sgml-doctools (this, at least, is new).
Aside from its manual page, all of vile's documentation started out as plain-text files. This was awkward, since only the help-file (also plain-text) was readily accessible from the editor.
In 1999, Martin Lemburg suggested providing different formats
for the help-file. Finding that vile.hlp
was not a
Microsoft help-file, he provided an example where this was
converted to RTF, suggesting that I add it to the distribution.
However, I pointed out that this could be hard to maintain:
> if you want to you could include the file(s) inside the next =
> distributions!

Perhaps as a separate zip file (I agree that people will find this useful).
This issue has come up before, and I don't have a good solution: how can
we maintain alternate forms of the documentation when there's a manual stage
required.
vile's documentation has had suffixes which conflict with
Microsoft's types for quite a while (vile.hlp
since
October 1990, macros.doc
since April 1994), but
converting the documentation to other formats did not seem like a
solution to the problem. We continued to add documentation.
Finally, at the end of 2009, I changed things, converting all of
the documentation except the manual page to HTML format,
and generating the “.hlp” and “.doc”
files from HTML. Interestingly enough, one of the reasons was to
improve packaging for winvile:
20091228 (y)
	...
	> Tom Dickey:
	+ change Inno Setup script to install html files rather than doc files
	  for documentation; added a shortcut to the table of contents.
	+ convert all ".doc" files to html, generate ".doc" from those files.
	...
20091114 (x)
	> Tom Dickey:
	...
	+ add doc/vile-toc.html, a table of contents for vile-hlp.html
	+ generate vile.hlp from doc/vile-hlp.html
	...
	+ add doc/vile-hlp.html, which is a marked-up version of vile.hlp
	+ change atr2html to not use <pre>, since <font> inside <pre> is
	  nonstandard.
There were other reasons:
I had made improvements to vile's text-to-html conversion, and (as part of improving my website) intended using the generated text in my ncurses and xterm FAQs. But I did the changes first in vile's help-file.
By converting vile's documentation to HTML, I could put colored examples (of macros, etc) in the documentation.
HTML had become so prevalent that it effectively displaced most uses of plain-text READMEs.
There was no point in constructing an HTML version of the manual page, and translating it back into nroff. groff could do an adequate job of going from nroff to HTML.
Because I had accumulated many manpages, and because people requested it, I set up my website to provide documentation in HTML.
For most programs other than ncurses (where I was already
generating HTML using man2html
), I initially relied
upon groff to do this in makefiles. That seemed simpler than
maintaining another tool. For instance, this check-in for xterm
in April 2006:
REV:1.129 Makefile.in 2006/04/21 18:57:58 tom
    add a rule to make ctlseqs.html using groff's _very_ crude -Thtml output.
    Don't use that - it works better to make a pdf using ghostscript.
--- Makefile.in 2006/04/10 00:34:36 1.128
+++ Makefile.in 2006/04/21 18:57:58 1.129
@@ -1,4 +1,4 @@
-## $XTermId: man2html.html,v 1.59 2024/03/04 09:21:54 tom Exp $
+## $XTermId: man2html.html,v 1.59 2024/03/04 09:21:54 tom Exp $
 ##
 ## $XFree86: xc/programs/xterm/Makefile.in,v 3.54 2006/04/10 00:34:36 dickey Exp $
 ##
 ##
@@ -242,6 +242,9 @@
 maintainer-clean : realclean
 	-$(RM) 256colres.h 88colres.h

+ctlseqs.html : ctlseqs.ms
+	GROFF_NO_SGR=stupid $(SHELL) -c "tbl ctlseqs.ms | groff -Thtml -ms" >$@
+
 ctlseqs.txt : ctlseqs.ms
 	GROFF_NO_SGR=stupid $(SHELL) -c "tbl ctlseqs.ms | nroff -Tascii -ms" >$@
The GROFF_NO_SGR
is of course a workaround for a
misfeature added to groff a couple of years earlier, but was not
actually needed for HTML. By "Don't use that", I reminded myself
that there were problems with groff's conversion to HTML but that
the PDF and PostScript conversions worked well enough. I added
PostScript to xterm's makefile in patch 118 (1999) and PDF in
patch 226 (2007).
Around 2007, I decided to make the documentation for my programs
available in various formats (HTML, PDF, PostScript and plain
text).
Converting to HTML with groff has a few advantages:
It is someone else's tool, and may require less work from me.
The font choices look reasonably nice.
It generates a list of sections at the top, which made it lynx-friendly.
The pages are resizable (which seemed initially to be a good thing).
However (especially with xterm), there were a number of problems:
groff's conversion to HTML is incomplete; all of the tables are converted as images (sections of the PDF file).
It is not possible to select text from a table.
Often, groff dumps core when generating these images. By trial and (many) errors as well as using particular versions of groff, I tuned the nroff files to reduce the core dumps.
Even when it does not dump core, groff is prone to clipping the images.
xterm's control sequences document uses boxed characters (see the PDF), which do not render in HTML. This might be an advantage, given groff's problems with images.
There is no way to tell groff to generate hyperlinks to
other HTML files. In contrast, man2html
recognizes the manpage convention with name(X) and
generates links automatically.
While resizable pages can be nice, sometimes they detract from readability. In a browser which covers most of the screen, paragraphs can stretch across the screen, making it hard to find the beginning of sentences. I use explicit breaks between sentences when (I think) it aids readability. That makes the result look ragged. This paragraph intentionally uses no line-breaks.
Cascading style-sheets can amend
this, placing limits on the width of a paragraph. Given a
suitable style-sheet, it solves the problem. The
max-width
property uses the full screen-width
up to a point, and then stops growing. Lynx does not support style-sheets,
however, and initially my interest in HTML-ized documentation
was for Lynx.
Resizing a page of groff's HTML does not behave well, since it does not build in CSS definitions to keep hanging indents from stretching in odd ways.
Even without the problem with resizing, hanging indents and bullets tend to convert in odd ways using groff unless one uses a particular style — much more restrictive than the normal manpage rules. If not, the bullets end up on a different line from the text.
It was the poor appearance of resized pages which made me decide to stop using groff's HTML conversion.
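The name(X) convention mentioned in the list above is what makes automatic cross-linking possible. Here is a rough sketch, in Python rather than the script's Perl, of how such references can be turned into links. The URL layout (`name.section.html`) and the helper name `link_xrefs` are assumptions made for this example, and the input is assumed to be already HTML-escaped.

```python
import re

# A page name, then "(", a section digit, an optional suffix, and ")",
# e.g. "printf(3)" or "curs_util(3x)".
XREF = re.compile(r"([+_.\w-]+)\((\d)(\w*)\)")


def link_xrefs(text: str, url_template: str = "{name}.{section}.html") -> str:
    """Wrap each name(N) reference in an anchor (illustrative helper)."""

    def make_link(match: re.Match) -> str:
        name, digit, suffix = match.groups()
        href = url_template.format(name=name, section=digit + suffix.lower())
        return f'<A HREF="{href}">{match.group(0)}</A>'

    return XREF.sub(make_link, text)


print(link_xrefs("See curs_util(3x) and printf(3)."))
```

The real script also has to cope with references that nroff has emboldened or underlined with backspace sequences, which this sketch ignores.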
Here are a few links showing the script's early history:
New man2html (Tool to convert nroff text to HTML) (21 July 1994)
man2html in Tcl, forum comments from 1998 mentioning
rman
as an alternative.
man2html homepage (01 March 1999)
Earl Hood / man2html – search.cpan.org
3.01 package on FreshPorts
The script itself is written in Perl. Typical of many written during the 1990s, it did not use Perl's strict checking features. I added some strict checking, and reformatted it. It has grown by about a third, going from 607 to 823 lines.
Some of the improvements are of more general interest:
new options “-aliases”, “-index”, “-toc”
fix a bug where the HREF was broken if the word it marked was hyphenated.
recognize manpage subsections, i.e., from
.SS
lines (something not done by groff).
improved section-title logic to work with xterm's ctlseqs.ms file.
There are areas where the current (July 2015) script could be improved: Using a locale with UTF-8 encoding, groff will produce UTF-8 output. The current script does not handle that, but if it were modified to do this, it could display nicer line-drawing for the tables, and nicer bullet characters.
There are several other programs with the same name, though none is well known:
man2html v0.7, by Rik Harris (rik@daneel.rdt.monash.edu.au) posted to WWW-Talk (17 August 1993) (see source).
man2html mentioned in 1999. The same program is listed in SGI's IRIX64 manpages.
man2html by Norman Hardy (9 November 2002)
gawk script found on Andreas Schoenberg's wiki (8 September 2007)
demo
program mentioned in AutoGen 5.18.4 (writing this page on
9 July 2015, the latest release is 5.16.2).
The comment about the Perl script appears to be
self-promotional rather than factual.
There is one well-known variant, namely the one originally written by Richard Verhoeven sometime in the early 1990s. Exactly when is obscure: he was apparently at Eindhoven for a few years involved with MathSpad, but the man2html program is mentioned only when someone else started working on it in 1996.
manual page on linuxcommand.org (web content generated
with Hood's Perl script).
The reference to xmosaic implies it is an
old version (early/mid-1990s).
However, the footer of the manual page says January 1,
1998.
Also, the reference to Verhoeven is in the third person.
VH-Man2html, UNIX man page to html converter, is Michael Hamilton's page on the topic.
This is based on Verhoeven's program. According to an LSM listing, and Debian mailing list comment:
Some users consider this one to be
man2html
, e.g., quoting comments from http://fossies.org/linux/ftnchek/configure.in:
dnl configure.in Process with autoconf to produce a configure script.
dnl This autoconf input file is for ftnchek version 2.9, April 1996
...
dnl Look for man-to-html filter. The scripts for converting the raw html
dnl into the files in the html directory depend on the specific
dnl filter's output style, so another converter probably won't do.
dnl At present the scripts require man2html, a.k.a. vh-man2html, which
dnl is now part of the standard RedHat distribution. We won't worry much
dnl about this since users generally won't be messing with the docs.
AC_PATH_PROGS(MANtoHTMLPROG,man2html rman)

case "$MANtoHTMLPROG" in
dnl There are at least two man2html's out there, and probably many more,
dnl so we try to detect whether we have the one that works with the converters.
Michael Hamilton's page was last updated October 22, 2002. It says:
This page is no longer being maintained. VH-Man2html has been folded into the Linux man-1.4* and man-1.5 packages by Andries Brouwer (aeb@cwi.nl). The current version is usually found in http://ftp.win.tue.nl:/pub/linux/util/man.
Hamilton's page mentions Verhoeven's page at
http://wsinwp01.win.tue.nl:1234/maninfo.html
(which is long gone). However, there are several
older versions of Verhoeven's page on the Internet
Archive.
In the oldest available version (December 8, 1996), Verhoeven's page lists Earl Hood's Perl script as one of ten other variations on the theme.
The same page has a copy of Verhoeven's program, but it is undated. At 3043 lines (1700 statements), it is about three times the size of Hood's Perl script (but then, C is usually more verbose than Perl).
manual page on manned.org citing May 3, 1996 in the footer, and (according to manned.org) October 24, 1996 for the original file.
The web content was generated by manned.org's custom front-end to grotty (see source repository).
Post
subject: app-text/man2html and app-misc/glimpse: bug report
needed? (forum comments in 2011 about 1.5.2)
The gist of the comments is that one of the package
maintainers added dependencies which a user did not want.
man2html, Debian packager Robert Luberda, vulnerability in man2html 1.6 in 2011:
Cross-site scripting (XSS) vulnerability in man2html.cgi.c in man2html 1.6, and possibly other version, allows remote attackers to inject arbitrary web script or HTML via unspecified vectors related to error messages.
Both Verhoeven's and Hood's programs generate all of the HTML tags in uppercase. The resemblance ends there:
Verhoeven's program generates an index at the end (using DL tags) which works with nroff subsections.
It also uses DL for hanging indents (such as I would use for bullets).
Perhaps a clever use of CSS could make those look as
intended, but a more natural translation would use UL.
It generates regular HTML (not Hood's PRE
sections to preserve nroff formatting).
A clever use of CSS as done by manned.org can prevent
resizing, though.
Testing vh-man2html
(the Debian package) with
xterm's manual page shows some of the indentation and
line-spacing problems which also affected
rman
. See examples:
Because it (like rman
) interprets
the nroff source files, it is limited to man macros.
It does not work with xterm's control sequences
(ctlseqs.ms) file which uses ms macros. This is what
vh-man2html
says:
Status: 403 Forbidden
Content-type: text/html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML><HEAD><TITLE>Invalid Man Page</TITLE></HEAD>
<BODY>
<H1>Invalid Man Page</H1>
The requested file ctlseqs.ms
is not a valid (unformatted) man page.</BODY></HTML>
rman
is a little more flexible. Actually, the
Debian packaged rman
(3.2-4) dumped core when I
tested it. Using my improved version, it complains about a
handful of unrecognized lines, but produces a halfway-usable
HTML output (see example).
Neither rman
nor vh-man2html
could
be expected to deal with the macros in that file.
tcsh.man2html, mentioned in Beyond Linux From Scratch Chapter 7. Shells: Tcsh-6.18.01. This is part of TCSH(1) Cornell 6.05.00 (19 June 1994). tcsh's AUTHORS file says
Dave Schweisguth, Yale University, 1993-4
New man page and tcsh.man2html
man2html.tcl (11 April 1996) bundled with tcl.
According to a 1997 Tcl/Tk 8.0
README, this was based on a similar script (man2tcl.tcl)
by Raymond Johnson,
which was in turn based on a C program named man2tcl. At the time, Johnson
was working on SunScript (Tcl written in
Java).
Aside from my changes, the script seems to have been relatively unmaintained. There was one exception.
Immediately after I released ncurses 6.0 in August 2015, Jörg Schilling noticed the mention of man2html in the release notes and added his version of this script the following day. The timestamp for that file itself was in 2013, but the makefile and announcement are dated August 11, 2015 (no previous mention was made of the script in “schily-tools”).
Schilling's script differs from the original by only a few lines (3 aside from the hashbang line) of the 607 total. Here is a diff created from his changes, showing the relevant parts:
--- man2html 1997-08-12 13:19:18.000000000 -0400
+++ man2html.new 2013-06-27 17:36:26.000000000 -0400
@@ -1,4 +1,4 @@
-#!/usr/local/bin/perl
+#!/usr/bin/perl
##---------------------------------------------------------------------------##
## File:
## @(#) man2html 1.2 97/08/12 12:57:30 @(#)
@@ -184,7 +184,7 @@
## Create anchor links for manpage references
s/((((.\010)+)?[\+_\.\w-])+\(((.\010)+)?
- \d((.\010)+)?\w?\))
+ \d((.\010)+)?\w*\))
/make_xref($1)
/geox if $see_also;
@@ -442,7 +442,8 @@
if ($CgiUrl) {
my($title,$section,$subsection) =
- ($str =~ /([\+_\.\w-]+)\((\d)(\w?)\)/);
+ ($str =~ /([\+_\.\w-]+)\((\d)(\w*)\)/);
+ my($subsection) = lc($subsection);
$title =~ s/\+/%2B/g;
my($href) = (eval $CgiUrl);
@@ -474,7 +475,7 @@
while ($line = <$InFH>) {
next if $line !~ /\(\d\w?\)\s+-\s/; # check if line can be handled
($refs,$section,$subsection,$desc) =
- $line =~ /^\s*(.*)\((\d)(\w?)\)\s*-\s*(.*)$/;
+ $line =~ /^\s*(.*)\((\d)(\w*)\)\s*-\s*(.*)$/;
if ($Solaris) {
$refs =~ s/^\s*([\+_\.\w-]+)\s+([\+_\.\w-]+)\s*$/$1/;
The comments in Schilling's
ANNOUNCEMENTS/AN-2015-08-11
summarize his
changes:
- man2html: This was added to schilytools as the original man2html command has a bug with processing sub-sections and as the original man2html is completely unmaintained since August 12 1997.
- man2html: subsections are now handled correctly and may be longer than a single character.
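The effect of changing `\w?` to `\w*` in the diff above can be checked with a small experiment. This sketch ports the two regular expressions to Python (whose syntax for these patterns matches Perl's) and lowercases the subsection as the added `lc($subsection)` line does. The sample references, including the multi-character suffix `3ssl`, are chosen for illustration, and `parse` is a hypothetical helper.

```python
import re

# Original pattern: the section digit may be followed by at most one character.
OLD = re.compile(r"([+_.\w-]+)\((\d)(\w?)\)")
# Patched pattern: the section digit may be followed by any number of characters.
NEW = re.compile(r"([+_.\w-]+)\((\d)(\w*)\)")


def parse(pattern: re.Pattern, ref: str):
    """Return (title, section, subsection) or None if ref is not recognized."""
    match = pattern.search(ref)
    if match is None:
        return None
    title, section, subsection = match.groups()
    return title, section, subsection.lower()


for ref in ("printf(3)", "curs_util(3x)", "EVP_DigestInit(3ssl)"):
    print(ref, parse(OLD, ref), parse(NEW, ref))
```

Both patterns accept references with no suffix or a one-letter suffix; only the patched pattern recognizes the longer one.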
All of the subsequent mentions of man2html in his announcements alluded to workarounds for using the script as is, rather than improvements to it.
Later that year, someone asked a question about converting manpages to html, which I answered, pointing to this page. Schilling downvoted my answer a few minutes later.