http://invisible-island.net/
Copyright © 1996-2019,2022 by Thomas E. Dickey

C_COUNT – C/C++ Line Counter

Synopsis

c_count counts lines, statements, and other simple measures of C/C++ source programs. It isn't lex/yacc based, and is easily portable to a variety of systems.

History

I originally wrote c_count in mid-1983, calling it lincnt after an earlier metrics utility. The current name is easier to remember.

However, this copy dates to the end of 1985: I had moved and, though I had the program on tape, I had no tape drive. So I entered it from a listing.

In case someone wishes to remind me, I am already aware of various code-metrics. Early on (1977-1978), when I started to evolve the notion of a complexity measure for microprocessors, I had in mind the other side, e.g., development effort. I spent some time gathering numbers to show how much effort (time and steps) was needed to develop programs. What I found was:

editing the program took almost all of the effort (say 90%).
most of the steps (say 90%) in editing were simply moving around, inspecting the program before deciding what to change.
most changes were correct; some part (say 4%) introduced an error which required rework.

By the way, gathering the metrics took more time than developing the programs. While it might be possible to construct an environment which did the measurements, that was out of scope. For making spot-checks and assessments, a simple tool showing progress was needed. I did that for my project outside the scope of the research, starting in 1976.

To put it another way, tools that tell how high a mountain is are useful, even if there are various methods of climbing it which differ in cost.

Fancier tools (function points, McCabe, Halstead) all have their pitfalls. For instance, I computed data for Halstead's measure and found it strongly correlated with SLOC. This link is not talking about me (and the dates given are unlikely—Halstead published his work in 1977), but the conclusion does match mine.

Simple Measures

I wrote the 1983-version of lincnt to track progress on my project. I found that I was adding about 2000 lines per week. SLOC was then not part of my vocabulary.

To me, it was obvious that the way to count C was to count the semicolons used for delimiters. True, that gives two for a for-loop. But that is a minor inconsistency. And it is simple.
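
As a rough illustration of the semicolon rule (a minimal sketch in Python, not c_count's actual code), the hypothetical function `count_statements` below counts semicolons that fall outside comments, string literals and character literals. Skipping those regions is an assumption about what "used for delimiters" implies; the sketch also ignores preprocessor lines and line continuations.

```python
def count_statements(source: str) -> int:
    """Count semicolons outside comments, strings and character literals."""
    count = 0
    i = 0
    n = len(source)
    while i < n:
        ch = source[i]
        if source.startswith("/*", i):
            # Block comment: jump past the closing delimiter (or to the end).
            end = source.find("*/", i + 2)
            i = n if end < 0 else end + 2
        elif source.startswith("//", i):
            # Line comment: jump past the end of the line.
            end = source.find("\n", i)
            i = n if end < 0 else end + 1
        elif ch in "\"'":
            # String or character literal: skip to the matching quote,
            # stepping over backslash escapes.
            i += 1
            while i < n and source[i] != ch:
                i += 2 if source[i] == "\\" else 1
            i += 1  # skip the closing quote
        else:
            if ch == ";":
                count += 1
            i += 1
    return count


if __name__ == "__main__":
    # The two semicolons in the for-header are counted, plus the body.
    print(count_statements("for (i = 0; i < n; i++) total += i;"))
```

With this rule a `for` statement with a one-statement body counts as three: two semicolons from the header and one from the body, which is the minor inconsistency mentioned above.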

My associates argued about that (it was not obvious to them), and were leery that management might use that as a measure of our performance. (That was not obvious to me, but I concede it could be a hazard).

A little later (same program, early 1985), I encountered comments by someone talking about his "great programmer". That struck me as odd, since I had read some of that person's work, and was unimpressed. Just to check, I started by running my code-counter. It surprised me, saying that the program was about 40% comments (hinting that the programmer was doing thorough work). Going back to inspect the program, I realized that my code-counter was misled. Most of the comments were asterisk characters. After adjusting the count to exclude punctuation, the counter showed less than 3% comments. I refined that measure to compare comments to code (ignoring whitespace—and of course punctuation within "comments").
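
The refined measure can be sketched as follows (Python, for illustration only, not c_count's actual code). The function names and the rule "alphanumeric characters inside comments are comment text, everything else inside a comment is nontext" are assumptions; they happen to reproduce the comment-chars, nontext-comment-chars and statement-chars figures of the report in the Good Numbers section for the sample shown there. The sketch does not handle `//` comments or comment delimiters inside string literals.

```python
SAMPLE = """\
/* set up a buffer for this file */
bp = getfile2bp(param, FALSE, TRUE);
if (bp) {
    bp->b_flag |= BFARGS;   /* treat this as an argument */
    make_current(bp);       /* pull it to the front */
    if (!havebp) {
        havebp = bp;
        havename = param;
    }
}
"""


def char_counts(source: str) -> dict:
    """Split characters into comment text, comment punctuation, whitespace, code."""
    counts = {"comment": 0, "nontext": 0, "whitespace": 0, "code": 0}
    i, n = 0, len(source)
    in_comment = False
    while i < n:
        ch = source[i]
        if not in_comment and source.startswith("/*", i):
            in_comment = True
            counts["nontext"] += 2      # the opening delimiter is punctuation
            i += 2
        elif in_comment and source.startswith("*/", i):
            in_comment = False
            counts["nontext"] += 2      # so is the closing delimiter
            i += 2
        else:
            if ch.isspace():
                counts["whitespace"] += 1
            elif not in_comment:
                counts["code"] += 1
            elif ch.isalnum():
                counts["comment"] += 1  # real comment text
            else:
                counts["nontext"] += 1  # asterisk borders and the like
            i += 1
    return counts


def comment_code_ratio(source: str) -> float:
    """Ratio of comment text to code, ignoring whitespace and punctuation."""
    counts = char_counts(source)
    return counts["comment"] / counts["code"] if counts["code"] else 0.0


if __name__ == "__main__":
    print(char_counts(SAMPLE))
    print(round(comment_code_ratio(SAMPLE), 2))
```

A comment made up mostly of asterisks adds to the nontext count only, so it no longer inflates the ratio.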

That is a simple measurement, which gives me a figure of merit for a program. For the same programmer, there are interesting stylistic flaws which would probably require a complex measurement. For example, the program which I was reading used preprocessor macros ineffectively. It defined a report's columns as a set of constants, but did not use arithmetic expressions. That detracted from its maintainability: if one wished to change the width of a column, that would require changing all of the #define's for the columns after the altered column. On reflection, that 3% comment:code ratio told me enough about that program.

There are other simple measures which help to gauge code quality. In a different analysis, I was interested in how much of a program was simply pasted in multiple places rather than factored into suitable functions. The motivation was that I was working to undo this (calling it dump-truck code) for a program which was in two parts that should have shared data. I analyzed this by stripping comments and extra whitespace, sorting the lines and measuring the number of duplicate lines. In my project I had reduced the duplication from about 30% to less than 20%. On the other hand, another program in the project (not mine) had 46% duplication.
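
The duplicate-line analysis described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the original procedure: the function name `duplicate_fraction` is hypothetical, a "duplicate" is taken to be any line that repeats an earlier identical line after normalization, and the percentage is taken relative to the non-blank lines that remain. The regular expressions do not know about string literals.

```python
import re


def duplicate_fraction(source: str) -> float:
    """Fraction of normalized, non-blank lines that repeat another line."""
    # Strip block comments, then line comments.
    text = re.sub(r"/\*.*?\*/", " ", source, flags=re.S)
    text = re.sub(r"//[^\n]*", " ", text)
    # Collapse runs of whitespace and drop lines that end up empty.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    lines = sorted(line for line in lines if line)
    # After sorting, identical lines are adjacent: count the repeats.
    duplicates = sum(1 for prev, cur in zip(lines, lines[1:]) if prev == cur)
    return duplicates / len(lines) if lines else 0.0


if __name__ == "__main__":
    example = """
    x = read();   /* first */
    check(x);
    y = read();
    check(x);
    """
    print(f"{100 * duplicate_fraction(example):.0f}% duplication")
```

Sorting first keeps the comparison simple, much as one might do with a pipeline of standard text utilities.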

Adding Features

The newer version (starting at the end of 1985) evolved over several years, as I found new issues to deal with. The change-log, by the way, shows the first check-in for March 1986. That was using the SCCS wrappers which I wrote to support the project I was working on.

Part of that project was developing and maintaining a Unix kernel driver for a networking card. The person who had started the driver had written macros with strings that lacked the terminating quote. I added an option to make c_count deal with that, rather than always accept the odd syntax. (This feature does not work with standard C).

I added some features based on suggestions by others. Most of those were in a later project (starting at the end of 1987):

One suggestion (in 1988) was to rename the program. One of my coworkers found some potential in lincnt, but noted that another person (who was doing some metrics work) was using one of the programs called ccount, and that part of his reluctance to use lincnt was that the purpose of the program was not apparent from its name.
I pointed out that renaming it ccount as he had suggested would not improve matters—it would only increase confusion. In discussion, I more or less agreed that renaming it c_count (which did not appear to be used) would be a suitable compromise.
Other suggestions were more concrete. For instance, in 1990 I added the -t option to generate a spreadsheet (csv) report.

Shortly after, I added SLOC to my vocabulary, along with PSS (physical source lines) and LSS (logical source lines). We had some people doing metrics, and they had their own language. I encountered this while developing a.count. Like most metrics people, these did not write programs. Rather, they made models (such as an S-curve) and occasionally collected data to validate the models. I wrote a.count to satisfy my curiosity about the project that I was working on. They learned about the program, and after much discussion requested that I modify the report, changing

"lines" to "physical source lines" and
"statements" to "logical source lines".

Not only that, but they requested that I do the same for lincnt (as it was then called). I did that, but made it optional (-j, for "jargon"). Doing that made my code-counters part of the establishment, so to speak, and they referred to the programs in the papers they were writing.

Publishing...

I renamed the program a few years later (May, 1995), having left that project and started to publish the programs I had written on my own during the previous decade. This was around the time that the comp.sources.misc newsgroup died, as I see in my email:

From dickey Wed Jul 12 06:13:07 1995
Subject: recent postings
To: comp-sources-misc@uunet.uu.net, sources-misc@uunet.uu.net,
        comp-sources-unix@uunet.uu.net (comp.sources.unix)
Date: Wed, 12 Jul 1995 06:13:07 -0400 (EDT)

Are you guys still there?  I sent a copy of

        diffstat 1.7 comp.sources.misc (may 21, 1995)
        c_count 7.0 comp.sources.misc (may 21, 1995)

and corrected up with a message to comp.sources.unix indicating that diffstat
should be in _that_ group. Aside from the auto-reply from comp.sources.unix,
I've seen no response.

-- 
Thomas E. Dickey
dickey@clark.net

While diffstat showed up in the index for comp.sources.unix (volume 28, ending May 23, 1995) as the 42nd of 58 entries in that volume, c_count did not show up in either group. For what it's worth, here is a list of successful postings for programs that I worked on during that era:

ncftp in comp.sources.misc volume 39 (August 26, 1993).
par in comp.sources.misc volume 40 (October 10, 1993).
ncftp in comp.sources.misc volume 40 (November 2, 1993 citing January 17, 1993).
diffstat in comp.sources.unix volume 28 (June 15, 1994).
cproto 4.0 in comp.sources.misc volume 44 (October 12, 1994).
cproto 4.2 in comp.sources.misc volume 45 (October 24, 1994).
cproto 4.3 in comp.sources.misc volume 47 (January 6, 1995).
conflict in comp.sources.misc volume 47 (April 15, 1995), one of the last postings.
diffstat 1.17 in comp.sources.unix volume 29 (October 10, 1995), not seen here, though there are postings for another year or so.

At the same time, I put a copy on Sunsite.

Good Numbers

A nonobvious aspect of counting C source is what to do about inline comments. For example, in this chunk:

/* set up a buffer for this file */
bp = getfile2bp(param, FALSE, TRUE);
if (bp) {
    bp->b_flag |= BFARGS;   /* treat this as an argument */
    make_current(bp);       /* pull it to the front */
    if (!havebp) {
        havebp = bp;
        havename = param;
    }
}

there are two inline comments. Some counters ignore them, some do not. c_count does both. In showing the sum of line-types to 100%, it counts inline comments as a negative value, since those lines are already counted as code:

    10     5   |/tmp/foo.c
----------------
    10     5    total lines/statements

     3  lines had comments        30.0 %
     2  comments are inline      -20.0 %
     0  lines were blank           0.0 %
     0  lines for preprocessor     0.0 %
     9  lines containing code     90.0 %
    10  total lines              100.0 %

    60  comment-chars             22.1 %
    12  nontext-comment-chars      4.4 %
    86  whitespace-chars          31.6 %
     0  preprocessor-chars         0.0 %
   114  statement-chars           41.9 %
   272  total characters         100.0 %

    18  tokens, average length 4.83

  0.53  ratio of comment:code

     2  top-level blocks/statements
     3  maximum blocklevel
  1.89  ratio of blocklevel:code

Doing it this way accounts for all of the categories. Incidentally, the format is chosen so that roundoff is accounted for. The numbers are supposed to add up exactly. When I was developing this around 1990, I used both Sun and Apollo workstations. The latter required adjustment, since it rounded differently from Sun. It turns out that rounding problems are far less common with standard C.
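
The line-type accounting in the report above can be sketched as follows (Python, for illustration only; the function names and classification rules are assumptions, not c_count's actual code). Each line is tagged as comment, code, preprocessor or blank; a line carrying both a comment and code is also tallied as inline, and that tally is subtracted so that the categories sum to the total. The sketch ignores `//` comments, string literals and preprocessor continuation lines, and does not attempt the roundoff adjustment.

```python
SAMPLE = """\
/* set up a buffer for this file */
bp = getfile2bp(param, FALSE, TRUE);
if (bp) {
    bp->b_flag |= BFARGS;   /* treat this as an argument */
    make_current(bp);       /* pull it to the front */
    if (!havebp) {
        havebp = bp;
        havename = param;
    }
}
"""


def classify_lines(source: str) -> dict:
    """Tally line types; 'inline' counts lines holding both comment and code."""
    keys = ("comment", "inline", "blank", "preprocessor", "code", "total")
    tally = dict.fromkeys(keys, 0)
    in_comment = False
    for line in source.splitlines():
        is_preprocessor = not in_comment and line.lstrip().startswith("#")
        has_comment, has_code = in_comment, False
        i = 0
        while i < len(line):
            if in_comment:
                end = line.find("*/", i)
                if end < 0:
                    break               # comment continues on the next line
                in_comment, i = False, end + 2
            elif line.startswith("/*", i):
                in_comment = has_comment = True
                i += 2
            else:
                has_code = has_code or not line[i].isspace()
                i += 1
        tally["total"] += 1
        tally["comment"] += has_comment
        tally["inline"] += has_comment and has_code
        tally["blank"] += not (has_comment or has_code)
        tally["preprocessor"] += has_code and is_preprocessor
        tally["code"] += has_code and not is_preprocessor
    return tally


def percentages(tally: dict) -> dict:
    """Percent of total per category, with inline comments as a negative value."""
    total = tally["total"]
    if total == 0:
        return {}
    signed = {**tally, "inline": -tally["inline"]}
    return {k: 100.0 * v / total for k, v in signed.items() if k != "total"}


if __name__ == "__main__":
    for name, pct in percentages(classify_lines(SAMPLE)).items():
        print(f"{name:>12} {pct:6.1f} %")
```

For the sample chunk this gives 3 comment lines, 2 inline, 9 code lines and 10 lines in total, matching the line-type section of the report.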

What Next?

There are other interesting measures that I could add to c_count. Or I could develop a different tool.

In later metrics work, I have developed different tools. For example, in 2005 I developed two different tools, but (as in c_count and a.count) kept the same general reporting style:

a set of perl scripts and modules which I used to analyze and mark copyrights on a large project.
a set of lex-based counters.

Both of those dealt with a dozen or so file-types.

The latter was based on the syntax-highlighters which I have developed for vile (vi-like-emacs). Generalizing from code-counting, this tool also made the same measurements for a few data file-types such as HTML and XML.

One drawback to the way in which I developed it was that I could not reuse syntax highlighters easily enough. If I were to revisit this tool, I would use vile directly by parsing the colorized output from the syntax highlighters. I added an option (-F) to vile at the end of 2009 which makes this simple.

Not Useful...

By the way, I have noticed sloccount of course, but have no use for it:

The initial versions (in 2001) dumped core when I tested them.
There are differences in the counts reported (not due to the way inline comments are counted, nor due to differences in counting blank or commented lines).
There is a filename conflict with c_count.

For instance, this reported on things that I maintain. It shows a lot of differences. It also overlooked things like the M4-macros for autoconf (but likely found the M4 sources which made up 3719 lines of ncurses' Ada95 binding). In the comparison below, I have omitted the autoconf-generated "configure" script and the utilities "config.guess" and "config.sub". Also (since sloccount ignored those files), I have omitted counts for the various ".in" templates.

Note: c_count of course counts only C programs, and directly shows LSS (logical source statements). I wrote a script called lex-metrics which uses the -F option of vile to compute SLOCs for the various files.

Actual lines

Program                  ada   ansic  awk   cpp  lex  perl  sed    sh  yacc
cproto-4.6                 -    7600    -     -  766     -    -   279   761
dialog-0.9a-20001217       -    5419    -     -    -   350    -  2316     -
diffstat-1.27              -     615    -     -    -     -    -   170     -
lynx2-8-4                  -  118534    -     -    -   583  107   231     -
ncurses-5.2            11354   47489  606  3727    -   124  137  1583     -

SlocCount lines

Program                  ada   ansic  awk   cpp  lex  perl  sed    sh  yacc
cproto-4.6                 -    7600    -     -  985     -    -   279   761
dialog-0.9a-20001217       -    5321    -     -    -   350    -  3145     -
diffstat-1.27              -     616    -     -    -     -    -     -     -
lynx2-8-4                  -  116438    -     -    -   583    -   206     -
ncurses-5.2            12937   48144  552  3726    -   126  136  2323     -

The license information shown in the report also is misleading (unsurprising given the source). The MIT-X11 license is listed as "distributable".

The filename conflict is like the other problems noted. It is customary when designing a program to avoid conflict with existing programs. c_count had been on the main Linux ftp server (sunsite.unc.edu) for six years before sloccount was released in 2001.

Changes

See the changelog for details.

Documentation

c_count program (pdf) (postscript) (plain text)

inn-workers mailing list Configuration parsing code landed thread (June 2001).
UNIX Application Migration Guide (Patterns & Practices), Microsoft Press, 2003, ISBN 0735618380
c_count is used as an example in this guide, originally by Interix, now Microsoft.
An Empirical Analysis of C Preprocessor Use, by Ernst, Badros and Notkin, IEEE Transactions on Software Engineering, Vol 28, No. 12, December 2002.
c_count was one of twelve metrics programs mentioned (to illustrate that none did exactly what the authors wanted for analyzing cpp). Incidentally, cproto would have been more suitable for their use.
C∀, a Study in Evolutionary Design in Programming Languages, by Rodolfo Gabriel Esteves Jaramillo (2004, University of Waterloo).
Esteves used c_count to compute SLOCs for a set of programs.
Contract in Electronic Commerce, dissertation by Douglas Steves (2005, University of Texas).
Steves used c_count as a tool in presenting his conclusions.
CMU-ISRI-05-121 Finding Predictors of Field Defects for Open Source Software Systems in Commonly Available Data Sources: a Case Study of OpenBSD, by Li, Shaw and Herbsleb (June 2005).
c_count was one of four tools chosen to compute product metrics.
Eclipse GEF3D: bringing 3D to existing 2D editors (April 2009), and related blog
c_count was used to measure the size of the program.

Other programs

This includes programs named ccount. The one that I referred to from 1988 is not freely available, so I will not cite it here.

Metrics tools for C/C++ Chris Lott's page
Softpanorama links to, and copies, Chris Lott's page
c_count.awk by Dan Kozak (1988)
computer-programming-forum.com AWK script to count LOC (lines of code) (2002)
ccount, Ted Shapin (1989), used in Debian 2.2 (1999)
comp.sources.misc volume 42, issue 20, Lutz Prechelt (March 29, 1994).
CCOUNT by Joerg Lawrenz, in CCOUNT readability metrics tool for C programs available (1993)
CCount by Shane Hartman, in Using Software Metrics Tools for Maintenance Decisions (1996).
This is a different ccount than the one used in the Debian package (which does not mention Hartman). Hartman is mentioned in a mailing list discussion (July 2001).
ccount "freeware" unspecified in Assessing the Maintainability of C++ Source Code by Marius Sundbakken (December 2001)
CCount in Beyond Bug-Finding: Sound Program Analysis for Linux a "C-to-C compiler" (2007)