http://invisible-island.net/
Copyright © 1996-2019,2022 by Thomas E. Dickey
c_count counts lines, statements, and other simple measures of C/C++ source programs. It isn't lex/yacc based, and is easily portable to a variety of systems.
I originally wrote c_count in mid-1983, calling it lincnt after an earlier metrics utility. The current name is easier to remember.
However, this copy dates to the end of 1985: I had moved, and though I had the program on tape, I had no tape drive, so I re-entered it from a listing.
In case someone wishes to remind me, I am already aware of various code-metrics. Early on (1977-1978), when I started to evolve the notion of a complexity measure for microprocessors, I had in mind the other side, e.g., development effort. I spent some time gathering numbers to show how much effort (time and steps) was needed to develop programs.
What I found, by the way, was that gathering the metrics took more time than developing the programs. While it might be possible to construct an environment which did the measurements, that was out of scope. For making spot-checks and assessments, a simple tool showing progress was needed. I did that for my project outside the scope of the research, starting in 1976.
To put it another way, tools that tell how high a mountain is are useful, even if there are various methods of climbing it which differ in cost.
Fancier tools (function points, McCabe, Halstead) all have their pitfalls. For instance, I computed data for Halstead's measure and found it strongly correlated with SLOC. The linked discussion is not talking about me (and the dates given are unlikely, since Halstead published his work in 1977), but its conclusion does match mine.
I wrote the 1983-version of lincnt to track progress on my project. I found that I was adding about 2000 lines per week. SLOC was then not part of my vocabulary.
To me, it was obvious that the way to count C was to count the semicolons used for delimiters. True, that gives two for a for-loop. But that is a minor inconsistency. And it is simple.
My associates argued about that (it was not obvious to them), and were leery that management might use that as a measure of our performance. (That was not obvious to me, but I concede it could be a hazard).
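The semicolon-counting idea is easy to sketch. The toy counter below is only an illustration (not c_count's actual code): it counts semicolons while skipping string literals, character constants, and block comments, and ignores many cases a real counter must handle (line comments, preprocessor lines, trigraphs, and so on):

```c
#include <stdio.h>

/* Toy statement counter: count semicolons that appear outside of
 * strings, character constants, and block comments.  A sketch of
 * the idea only -- not c_count's implementation.
 */
static long count_statements(FILE *fp)
{
    long count = 0;
    int ch, prev = 0;
    enum { CODE, STRING, CHARLIT, COMMENT } state = CODE;

    while ((ch = fgetc(fp)) != EOF) {
        switch (state) {
        case CODE:
            if (ch == ';')
                ++count;
            else if (ch == '"')
                state = STRING;
            else if (ch == '\'')
                state = CHARLIT;
            else if (prev == '/' && ch == '*')
                state = COMMENT;
            break;
        case STRING:            /* skip until an unescaped quote */
            if (ch == '"' && prev != '\\')
                state = CODE;
            break;
        case CHARLIT:
            if (ch == '\'' && prev != '\\')
                state = CODE;
            break;
        case COMMENT:           /* skip until the closing delimiter */
            if (prev == '*' && ch == '/')
                state = CODE;
            break;
        }
        /* treat a doubled backslash as consumed, so "\\" ends a string */
        prev = (prev == '\\' && ch == '\\') ? 0 : ch;
    }
    return count;
}
```

Feeding it a for-loop shows the minor inconsistency mentioned above: the two semicolons in the loop header are both counted.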
A little later (same program, early 1985), I encountered comments by someone talking about his "great programmer". That struck me as odd, since I had read some of that person's work, and was unimpressed. Just to check, I started by running my code-counter. It surprised me, saying that the program was about 40% comments (hinting that the programmer was doing thorough work). Going back to inspect the program, I realized that my code-counter was misled. Most of the comments were asterisk characters. After adjusting the count to exclude punctuation, the counter showed less than 3% comments. I refined that measure to compare comments to code (ignoring whitespace—and of course punctuation within "comments").
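That refined measure is also simple to sketch. The toy function below (again an illustration, not c_count's algorithm; the name comment_code_ratio is invented here) counts only alphanumeric characters on each side, so a comment made of asterisks and dashes contributes almost nothing:

```c
#include <ctype.h>
#include <stdio.h>

/* Toy comment:code ratio -- count only alphanumeric characters,
 * inside and outside of block comments, so that punctuation-only
 * "decoration" comments score near zero.  A sketch of the idea,
 * not c_count's actual algorithm.
 */
static double comment_code_ratio(const char *text)
{
    long comment_chars = 0, code_chars = 0;
    int in_comment = 0;

    for (const char *p = text; *p != '\0'; ++p) {
        if (!in_comment && p[0] == '/' && p[1] == '*') {
            in_comment = 1;
            ++p;                /* skip the opening delimiter */
        } else if (in_comment && p[0] == '*' && p[1] == '/') {
            in_comment = 0;
            ++p;                /* skip the closing delimiter */
        } else if (isalnum((unsigned char) *p)) {
            if (in_comment)
                ++comment_chars;
            else
                ++code_chars;
        }
    }
    return (code_chars != 0)
        ? (double) comment_chars / (double) code_chars
        : 0.0;
}
```

A line of asterisks scores 0.0 here, while a comment with real words scores in proportion to its text.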
That is a simple measurement, which gives me a figure of merit for a program. For the same programmer, there are interesting stylistic flaws which would probably require a more complex measurement. For example, the program which I was reading used preprocessor macros ineffectively. It defined a report's columns as a set of constants, but did not use arithmetic expressions. That detracted from its maintainability: if one wished to change the width of a column, that would require changing all of the #define's for the columns after the altered column.
On reflection, that 3% comment:code ratio told me enough about that program.
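To illustrate the point, here is a hypothetical report layout (the names are invented, not taken from the program I was reading). With the first set of #define's, widening one column means renumbering every later column by hand; with the second, each position is derived, so a single width change suffices:

```c
/* Brittle: widening the NAME column means renumbering every
 * #define that follows it, by hand. */
#define OLD_COL_NAME 1
#define OLD_COL_SIZE 21
#define OLD_COL_DATE 31

/* Maintainable: positions are arithmetic expressions, so changing
 * one width ripples through automatically. */
#define NAME_WIDTH 20
#define SIZE_WIDTH 10
#define COL_NAME 1
#define COL_SIZE (COL_NAME + NAME_WIDTH)
#define COL_DATE (COL_SIZE + SIZE_WIDTH)
```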
There are other simple measures which help to gauge code quality. In a different analysis, I was interested in how much of a program was simply pasted into multiple places rather than organized into suitable functions. The motivation was that I was working to undo this (calling it dump-truck code) for a program which was in two parts that should have shared data. I analyzed this by stripping comments and extra whitespace, sorting the lines, and measuring the number of duplicate lines. In my project I had reduced the duplication from about 30% to less than 20%. On the other hand, another program in the project (not mine) had 46% duplication.
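That analysis amounts to a sort-and-compare over normalized lines. Here is a sketch of the same idea (assuming comments have already been stripped; duplicate_fraction and trim are names invented for this example, not tools from the project):

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Comparison callback for qsort over an array of strings. */
static int cmp(const void *a, const void *b)
{
    return strcmp(*(const char *const *) a, *(const char *const *) b);
}

/* Normalize whitespace in place: strip leading blanks, collapse
 * interior runs of whitespace to a single space. */
static void trim(char *s)
{
    char *src = s, *dst = s;
    while (*src) {
        if (isspace((unsigned char) *src)) {
            while (isspace((unsigned char) *src))
                ++src;
            if (*src && dst != s)
                *dst++ = ' ';
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}

/* Fraction of non-blank lines that duplicate an earlier line. */
static double duplicate_fraction(char **lines, size_t n)
{
    size_t dups = 0, kept = 0;

    for (size_t i = 0; i < n; ++i)
        trim(lines[i]);
    qsort(lines, n, sizeof(char *), cmp);   /* identical lines become adjacent */
    for (size_t i = 0; i < n; ++i) {
        if (lines[i][0] == '\0')
            continue;                       /* ignore blank lines */
        ++kept;
        if (i > 0 && strcmp(lines[i], lines[i - 1]) == 0)
            ++dups;
    }
    return kept ? (double) dups / (double) kept : 0.0;
}
```

Sorting makes duplicates adjacent, so the whole measurement is a single pass over the sorted array.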
The newer version (starting at the end of 1985) evolved over several years, as I found new issues to deal with. The change-log, by the way, shows the first check-in in March 1986. That was using the SCCS wrappers which I wrote to support the project I was working on.
Part of that project was developing and maintaining a Unix kernel driver for a networking card. The person who had started the driver had written macros with strings that lacked the terminating quote. I added an option to make c_count deal with that, rather than always accept the odd syntax. (This feature does not work with standard C).
I added some features based on suggestions by others. Most of those were in a later project (starting at the end of 1987):
One suggestion was to rename it ccount; I pointed out that this would not improve matters, only increase confusion. In discussion, I more or less agreed that renaming it c_count (which did not appear to be in use) would be a suitable compromise.
Another suggestion was a -t option to generate a spreadsheet (CSV) report.
Shortly after, I added SLOC to my vocabulary, along with PSS (physical source lines) and LSS (logical source lines). We had some people doing metrics, and they had their own language. I encountered this while developing a.count. Like most metrics people, these did not write programs. Rather, they made models (such as an S-curve) and occasionally collected data to validate the models. I wrote a.count to satisfy my curiosity about the project that I was working on. They learned about the program, and after much discussion requested that I modify the report, changing its terminology to match theirs.
Not only that, but they requested that I do the same for lincnt (as it was then called). I did that, but made it optional (-j, for "jargon"). Doing that made my code-counters part of the establishment, so to speak, and they referred to the programs in the papers they were writing.
I renamed the program a few years later (May 1995), having left that project and begun to publish the programs I had written on my own during the previous decade. This was around the time that the comp.sources.misc newsgroup died, as I see in my email:
From dickey Wed Jul 12 06:13:07 1995
Subject: recent postings
To: comp-sources-misc@uunet.uu.net, sources-misc@uunet.uu.net,
    comp-sources-unix@uunet.uu.net (comp.sources.unix)
Date: Wed, 12 Jul 1995 06:13:07 -0400 (EDT)

Are you guys still there?  I sent a copy of

	diffstat 1.7	comp.sources.misc (may 21, 1995)
	c_count 7.0	comp.sources.misc (may 21, 1995)

and followed up with a message to comp.sources.unix indicating that
diffstat should be in _that_ group.  Aside from the auto-reply from
comp.sources.unix, I've seen no response.

-- 
Thomas E. Dickey
dickey@clark.net
While diffstat showed up in the index for comp.sources.unix (volume 28, ending May 23, 1995), as the 42nd of 58 entries in that volume, c_count did not show up in either group. For what it's worth, here is a list of successful postings for programs that I worked on during that era:
At the same time, I put a copy on Sunsite.
A nonobvious aspect of counting C source is what to do about inline comments. For example, in this chunk:
	/* set up a buffer for this file */
	bp = getfile2bp(param, FALSE, TRUE);
	if (bp) {
	    bp->b_flag |= BFARGS;	/* treat this as an argument */
	    make_current(bp);		/* pull it to the front */
	    if (!havebp) {
		havebp = bp;
		havename = param;
	    }
	}
there are two inline comments. Some counters ignore them, some do not; c_count does both. To make the sum of line-types come to 100%, it counts inline comments as a negative value, since those lines are already counted as code:
	  10	   5	|/tmp/foo.c
	----------------
	  10	   5	total lines/statements

	   3	lines had comments        30.0 %
	   2	comments are inline      -20.0 %
	   0	lines were blank           0.0 %
	   0	lines for preprocessor     0.0 %
	   9	lines containing code     90.0 %
	  10	total lines              100.0 %

	  60	comment-chars             22.1 %
	  12	nontext-comment-chars      4.4 %
	  86	whitespace-chars          31.6 %
	   0	preprocessor-chars         0.0 %
	 114	statement-chars           41.9 %
	 272	total characters         100.0 %

	  18	tokens, average length 4.83

	0.53	ratio of comment:code

	   2	top-level blocks/statements
	   3	maximum blocklevel
	1.89	ratio of blocklevel:code
Doing it this way accounts for all of the categories. Incidentally, the format is chosen so that roundoff is accounted for: the numbers are supposed to add up exactly. When I was developing this around 1990, I used both Sun and Apollo workstations. The latter required adjustment, since it rounded differently from the Sun. It turns out that rounding problems are far less common with standard C.
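One common way to make rounded percentages add up exactly is to round all but one category and let the last absorb the roundoff. The sketch below (a general technique, not necessarily what c_count does) reproduces the character-count percentages from the sample report above:

```c
/* Make one-decimal percentages sum to exactly 100.0: round the
 * first n-1 categories, then compute the last as the remainder.
 * A common trick for reports, not necessarily c_count's method.
 */
static void exact_percentages(const long counts[], double pct[],
                              int n, long total)
{
    double used = 0.0;
    for (int i = 0; i < n - 1; ++i) {
        /* round counts[i]/total to one decimal place, as a percentage */
        pct[i] = (double) ((long) ((counts[i] * 1000.0 / total) + 0.5)) / 10.0;
        used += pct[i];
    }
    pct[n - 1] = 100.0 - used;  /* the last category absorbs the roundoff */
}
```

With the sample's character counts (60, 12, 86, 0, 114 out of 272), this yields 22.1, 4.4, 31.6, 0.0, and 41.9, which sum to exactly 100.0.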
There are other interesting measures that I could add to c_count. Or I could develop a different tool.
In later metrics work, I have developed different tools. For example, in 2005 I developed two different tools, but (as in c_count and a.count) kept the same general reporting style:
Both of those dealt with a dozen or so file-types.
The latter was based on the syntax-highlighters which I have developed for vile (vi-like-emacs). Generalizing from code-counting, this tool also made the same measurements for a few data file-types such as HTML and XML.
One drawback to the way in which I developed it was that I could not reuse syntax highlighters easily enough. If I were to revisit this tool, I would use vile directly, by parsing the colorized output from the syntax highlighters. I added an option (-F) to vile at the end of 2009 which makes this simple.
By the way, I have noticed sloccount of course, but have no use for it:
For instance, this report covers things that I maintain, and it shows a lot of differences. It also overlooked things like the M4 macros for autoconf (but likely found the M4 sources which made up 3719 lines of ncurses' Ada95 binding). In the comparison below, I have omitted the autoconf-generated "configure" script and the utilities "config.guess" and "config.sub". Also (since sloccount ignored those files), I have omitted counts for the various ".in" templates.
Note: c_count of course counts only C programs, and directly shows LSS (logical source statements). I wrote a script called lex-metrics which uses the -F option of vile to compute SLOC for the various files.
Each cell lists the per-language counts for the languages present in each program, drawn from: ada, ansic, awk, cpp, lex, perl, sed, sh, yacc.

Program | Actual lines | SlocCount lines
---|---|---
cproto-4.6 | 7600, 766, 279, 761 | 7600, 985, 279, 761
dialog-0.9a-20001217 | 5419, 350, 2316 | 5321, 350, 3145
diffstat-1.27 | 615, 170 | 616
lynx2-8-4 | 118534, 583, 107, 231 | 116438, 583, 206
ncurses-5.2 | 11354, 47489, 606, 3727, 124, 137, 1583 | 12937, 48144, 552, 3726, 126, 136, 2323
The license information shown in the report is also misleading (unsurprising, given the source): the MIT-X11 license is listed as "distributable".
The filename conflict is like the other problems noted. It is customary when designing a program to avoid conflict with existing programs. c_count had been on the main Linux ftp server (sunsite.unc.edu) for six years before sloccount was released in 2001.
See the changelog for details.
There are other metrics programs, of course.
c_count is used as an example in this guide, originally by Interix, now Microsoft.
c_count was one of twelve metrics programs mentioned (to illustrate that none did exactly what the authors wanted for analyzing cpp). Incidentally, cproto would have been more suitable for their use.
Esteves used c_count to compute SLOCs for a set of programs.
Steves used c_count as a tool in presenting his conclusions.
c_count was one of four tools chosen to compute product metrics.
c_count was used to measure the size of the program.
This includes programs named ccount. The one that I referred to from 1988 is not freely available, so I will not cite it here.
This is a different ccount than the one used in the Debian package (which does not mention Hartman). Hartman is mentioned in a mailing list discussion (July 2001).