Bioinfoxy: Programming Style

Programming style, or "coding conventions" as it is sometimes called, is about how you write code so that it is easily readable and understandable by a human. It concerns especially the part of making code readable that doesn't interest your compiler in the least: naming and formatting. Your compiler doesn't need whitespace and descriptive variable names to execute code correctly. But you do.

Basics

The most important element of programming style is consistency: do things the same way everywhere, so that the same things look the same and different things look different. If the same things look different, it becomes impossible to guess what something is or does just from its looks. Also, the code will look more complicated than it actually is.

Style is always a matter of personal taste. In the following, I lay out the style I have found most usable, with explanations of its advantages. Generally, there are three things to shoot for when it comes to style:

style that matches code logic. First and foremost, your style should match the logic of the code. Style is not about aesthetics, although good style tends to be aesthetically pleasing. You can draw great ASCII art within your code, but as long as it's not helping to show the code logic, it's useless.
style that is easy to maintain. If your style looks good but takes a lot of work and extra typing to keep it up, sooner or later you won't keep it up, and it's gonna be useless. Also, if making insertions, deletions or changes in your code leads to a lot of extra typing to keep the style intact, you won't do it. Or if you do, you are wasting your time. Finally, if you need to remember a lot of rules because the style is convoluted, you will forget some of them.
style that is easy to read. This is a less important point but still, if you can't easily read the code, the other two won't do you much good.

To be able to be consistent and achieve these goals, you have to decide about your style up front. This also has the added benefit that you won't be wasting time and brain power on how to format an expression later on, brain power and time you could much better use to think about your program or go out with your girl.

Being consistent also means that when you edit someone else's code, you should adopt their style. Don't waste your time on reformatting someone else's source code. Also, some communities have traditional styles that you should respect. For example, there are documented coding conventions for the Java language, and suggestions for friendly Perl style.

Good tools, as always, can make your life a lot easier when it comes to style, too. For example, emacs has modes that handle the formatting and indenting of your code automatically for many languages. You can even choose between various styles. There are other tools, called beautifiers, that specialize in doing just that, but of course it's better to have it built right into your editor. Also, a good editor should have syntax highlighting. It quickly gives you an idea of which parts of a program are literals, which are comments, which are keywords and so on, and takes away the burden of expressing all that by formatting alone. It also helps to catch typos, as a 'retrun' doesn't light up like a 'return'.

In my opinion, the most important elements of style are proper formatting and good names. Less important are comments. So let's get to the details:

Formatting

Most of your lines should be shorter than 80 characters. Ancient computer equipment doesn't handle more than 80 characters per line very well, and long lines are also hard to read. If 80 characters are not enough, most of the time you have too much stuff on one line or too many nesting levels with indentation. You should improve your code by putting part of it into a separate function.

Use lowercase letters or a natural mix of lower and Upper case. DO NOT SHOUT IN ALL CAPS. Shouting clutters up your screen and makes the text harder to read. In the days of syntax highlighting, there is no need to capitalize language keywords anymore.

Declaration

Declare any variables right on top of the scope for which they'll be used, so you know what is used there. Do not declare them on a higher scope than is needed, so everything that belongs together sticks together. This makes the code easier to understand and easier to maintain, as it is hard to insert some code later on that accidentally changes the value of the variable before it is used.

In practice this means: declare class variables at the top of your class, local variables at the top of your function, and temporary helper variables at the top of their loop. If you use comments, put them on your declarations; they are much more helpful there than comments on control constructs. Initialize class variables where the class is initialized and local variables at the top of the function, if possible during declaration. Put only one statement or declaration per line. It encourages commenting declarations, and makes complicated statements look complicated, while simple ones look simple.

White space

White space can be used to show parallelism and grouping and to ease up dense statements. If in doubt, apply liberally. I put a single space

after keywords like while, if, for. But not after function names. This helps to clearly distinguish them.
after commas. This helps to tell parameters apart.

I put an empty line

before comments. Also not so needed with syntax highlighting
between groups of statements. This makes the statements in a group appear to belong closer together than to statements from another group, for example groups of declarations

I put two empty lines to separate functions. By doing so instead of using a line of fancy comment (like /*********/), you still get the optical separation between functions, without the typing work. Also, you can then use a fancy comment to group functions within a file, for example parsing functions, printing functions etc.
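The spacing rules above can be sketched in a few lines. The example is in Python purely for illustration, and the function names and the input format are made up:

```python
def parse_record(line, separator=","):
    # Group 1: split the raw input (space after the comma, none before "(").
    fields = line.split(separator)

    # Group 2: clean up each field. The empty line above sets the groups apart.
    names = [field.strip() for field in fields]
    return names


def print_record(names):
    # A space after the keyword "for", but not after the function name "print".
    for name in names:
        print(name)
```

Two empty lines separate the functions, so no banner comment is needed between them.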

Code legibility is increased by aligning parallel constructs. Be careful not to align stuff just because it looks neat. Alignment should only be used for assignments that have functional coupling. Compare

         my $filename =    $args{PATHNAME};
         my @names    = @{ $args{FIELDNAMES} };
         my $tab      =    $args{SEPARATOR};
with
         my $filename = $args{PATHNAME};
         my @names = @{$args{FIELDNAMES}};
         my $tab = $args{SEPARATOR};

[Example by Tom Christiansen]

Indentation

Indentation is used to show two things: first, indented code can mean the line continues the statement of the last line, whereas nonindented code would suggest the start of a new statement. Second, indented code can mean that it is at a deeper nested scope, for example the contents of a while loop. There are common indentation styles, among them:

1. K&R/One True Brace
while (condition) {
    dependent code;
}

2. BSD/Allman
while (condition)
{
    dependent code;
}

3. "Trueblock"
while (condition)
    {
    dependent code;
    }

4. GNU
while (condition)
  {
    dependent code;
  }

I am a proponent of style one, which has been called K&R style after Kernighan and Ritchie, who use it in their classic book about C. It has also been called the One True Brace Style; read all about it in the Jargon File. It has the advantages that it doesn't "waste" vertical space by putting the opening brace on a separate line, and it has a cool name. If you're coding in Java, it also complies with the Java coding convention.

This style can lead to a problem when you have a long conditional statement that you have to break across lines. The dependent code would use the same indentation as the conditional in this case, which would make it hard to see where the conditional ends. You can avoid this by (a) not indenting the second line of the conditional, or (b) indenting it twice instead of once. Both solutions are kludgy. Or just use the next style for these cases.

Style two, called Allman or BSD style, avoids this problem (see below), since the blank line introduced by the opening brace sets conditional and dependent code apart. It keeps the nice feature of aligning the closing brace with the statement, as in K&R style. These two are the most often used styles.

Style three is suggested by Steve McConnell in his otherwise excellent book. In his opinion, since the whole block is dependent on the condition, the whole block should be indented. I've never seen that used. Style four is a strange mix between this and BSD-style, and is used as the official style of the GNU project.

The notion that saving vertical space is good is based on the assumption that you can better understand the code by having more of it on your screen at the same time. This is wrong. You'd get the maximum amount of code on one screen by doing away with indentation and whitespace altogether. Saving that one line will just make your screen look more crowded. It will actually make it harder to read the code and get an overview. I realized this when I was wondering why the code of a colleague looked so much clearer and more orderly than mine, even when he was sloppy about formatting. He was using Allman style and I wasn't.

$my_list = (item one,
            item two,
            item three
           );

The form above, endline indentation, is practically only used for declarations. I don't use it, as the indentation depends on the length of the name and thus makes each declaration look different. Also, for long names, it pushes everything to the right border, and it's harder to maintain. I either fill the first line, break, indent once and fill the second line, or, if there are a lot of arguments, break directly and put one per line, indented:

boolean insert_animal_record (
    Animal type,
    String name,
    integer age,
    boolean furry)
{
  ...

The decision how far to indent is a tradeoff between lost horizontal space and clear separation of nesting levels. With large indents you wander off the right side of the page fast; with small indents it becomes more difficult to visually keep different levels separate. I feel that a 4-column indent is enough to keep levels apart, and it also saves you typing if you are typing spaces instead of tabs. Never use tabs, because when someone else opens your code with an editor that has another tab width set, nothing looks like it used to any more.
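The tab problem is easy to reproduce. The following sketch uses Python's built-in str.expandtabs to render one tab-indented line the way editors with different tab widths would show it. The sample line itself is made up:

```python
# One line indented with a tab, with a second tab before the trailing comment.
tabbed_line = "\tx = 1\t# counter"

# expandtabs replaces each tab with spaces up to the next multiple of tab_width.
for tab_width in (2, 4, 8):
    print(repr(tabbed_line.expandtabs(tab_width)))
```

The indentation depth and the gap before the comment both change with the tab width, which is why text that was aligned with tabs falls apart in another editor.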

Expressions

Use the natural form for expressions and function calls, like you would in normal speech. Good code reads almost like English text. Negations are hard to understand in conditionals: go for the positive form. Use extra parentheses to resolve ambiguity, and break up complex expressions into simpler, smaller ones; better yet, put them in a small subroutine. Clarity is much more valuable than cleverness. Be careful with side effects. Use else-ifs for multiway decisions instead of nested if statements, keeping the flow of tested conditions easy and the indentation low. Use character constants, not numbers (write '0', not 48). To test for character ranges, even better, use library functions. [tpop]
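As a small illustration of the last points, here is a sketch in Python of a multiway decision written as a flat else-if chain that relies on library predicates instead of hand-written range tests. The function name and the category labels are made up. Note that Python's str.isdigit and str.isalpha also accept non-ASCII characters, which may or may not be what you want:

```python
def classify_char(ch):
    # A flat elif chain: every condition sits at the same indentation level.
    # Library predicates replace range tests such as '0' <= ch <= '9'.
    if ch.isdigit():
        return "digit"
    elif ch.isalpha():
        return "letter"
    elif ch.isspace():
        return "space"
    else:
        return "other"
```

All conditions are stated in the positive form, so each branch can be read on its own.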

Nested expressions with parentheses are difficult to indent so that you can read them easily. Here is my method: put operators leading on new lines, indented so they are below the opening parenthesis. This way, all statements on the same level are on the same indentation depth. Don't do so for the leaf lists if it's not necessary. Don't put closing parentheses on new lines. If the expression is within a conditional (which often is the case) you may want to put the opening brace of the conditional on a separate line, to get an optical setoff from the indented expression - even when you use K&R style.

if ((a and b)
   or (c
      and (d or e)
      and (f or g)))
{
    make_it_so();
}

When an expression is too long to fit on one line, the usual recommendation is to break it so that it is clear it continues on the next line: after a comma or operator. Unfortunately, this is inconsistent with the previous method, and does not work well to show nesting. So better do it like this:

if (horkingLongStatementWhichEvalsToBooleanMaybeWithLotsaParenthesesToo
   || (theNextHorkingLongStatementWhichEvalsToBoolean
      && soOnWithAnotherHorkingLongStatementWhichEvalsToBoolean))
{
    doThis();
}

The preferred form of the ?: operator is similar:

variable = (condition which usually is long and has some comparison operator)
  ? it_was_true()
  : no_it_wasnt();
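The leading-operator layout for long boolean expressions is not tied to one language. A small sketch in Python, with made-up flag names, where the enclosing parentheses are what allow the line breaks:

```python
def should_archive(is_old, is_large, is_readonly, is_orphaned):
    # Each operator leads its continuation line. Deeper nesting of the
    # parentheses is mirrored by deeper indentation.
    return (is_old
            or (is_large
                and (is_readonly or is_orphaned)))
```

Reading down the left edge shows the structure: one "or" at the top level, one "and" nested inside it.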

Naming

Good source code should be self-documenting. The easiest way to achieve this is with well-chosen names.
Always give names to strings and to magic numbers (anything other than a 0 or 1).[tpop] By doing so, you can change all instances of the number in one place, and conditionals using symbolic names are easier to read. Compare:

/* Check if the record is valid */
if (record == 0)
with
if (record == IS_VALID)
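The same idea as a runnable sketch in Python. The constant IS_VALID, its value and the describe function are hypothetical and only serve the illustration:

```python
# Hypothetical status code: defined once, changed in one place.
IS_VALID = 0


def describe(record_status):
    # The symbolic name makes the explanatory comment unnecessary.
    if record_status == IS_VALID:
        return "valid"
    return "invalid"
```

If the status code ever changes, only the constant's definition has to be touched.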

Variables

In your name, you can try to convey three different things:

what the variable stands for: this is the base name.
the scope of the variable: is it global, packagewide, or local?
the type of the variable: is it a parameter, a constant, a user-defined type or a native type?

For the base name, use descriptive names for globals, shorter names for locals. Be consistent: name things the same way everywhere. Be accurate: don't use misleading names. Names should always be application-specific; don't describe the coding construct. Use @fruit instead of @list. Choose names so that they can be easily read like English text.

Base names should be about 8 to 20 characters long. Shorter names tend to be not very descriptive; longer names eat up too much space and are tiring to read. Since a name should describe what it stands for, a complicated subroutine that tries to do too many things will call for a very long name. This is a hint that you should redesign the subroutine, not cheat on the name. Sometimes long names cannot be avoided. In those cases there are several techniques to shorten them:

dropping nonleading vowels and double consonants (customer_application_form => cstmr_aplctn_frm). These tend to be hard to pronounce, which sucks when you have to talk about them with someone else.
cutting off after a certain number of characters (customer_application_form => cust_appl_form). This is better because it is pronounceable.

Just keep in mind it's never worth shortening a name when you save only one character by doing so.
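Both shortening techniques are mechanical enough to write down. The following Python sketch is one possible reading of the two rules. The function names are made up, and the cutoff of four characters is simply what reproduces the example above:

```python
import re


def drop_vowels(name):
    """Drop nonleading vowels and double consonants from each word."""
    words = []
    for word in name.split("_"):
        # Keep the first letter, strip vowels from the rest.
        kept = word[:1] + re.sub(r"[aeiou]", "", word[1:])
        # Collapse any doubled letter that remains (pp -> p).
        words.append(re.sub(r"(.)\1", r"\1", kept))
    return "_".join(words)


def truncate(name, length=4):
    """Cut each word off after `length` characters."""
    return "_".join(word[:length] for word in name.split("_"))


print(drop_vowels("customer_application_form"))  # cstmr_aplctn_frm
print(truncate("customer_application_form"))     # cust_appl_form
```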

The methods you can use to modify the base name are different capitalization (if your language allows it), and pre- and postfixes. For capitalisation, there is often a tradition in naming. For example Java uses internalCaps, Perl prefers internal_underscore. Underscore makes reading a wee bit easier, but takes one character more. Use whatever is commonly used in the given language. Here are the conventions I use as defaults:

ALL_CAPS for constants.
Firstcaps for type names or name_t for type names if the conventions of the language mandate it. I do not use Hungarian notation to indicate language types.
innerCaps or no_caps for local object names or variables.
varNameG or var_name_g for global variables.
p_name_p or pName for parameters, although I do not indicate function parameters.
namep for Boolean predicates (taken from Lisp)

Postfixes are better than prefixes, because they let you see the important thing first and can be autoexpanded. Common uses for postfixes are qualifiers, for example userCnt, userIdx. It also helps to always use the same pairs. I use:

idx, cnt for index and count. (Don't use num or number as a qualifier, since it is never clear if it refers to the overall count or the current index.)
beg, end for beginning and end. As array slice indices, beg points to the first element, end after the last, so you can use the canonical while (array_idx < array_end)
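A minimal sketch of the beg/end convention in Python, assuming end is an index one past the last element. The function name is made up:

```python
def sum_slice(values, beg, end):
    """Sum the elements from index beg up to, but not including, end."""
    total = 0
    idx = beg
    # The canonical loop: run while the index is still before end.
    while idx < end:
        total += values[idx]
        idx += 1
    return total
```

With this convention, end - beg is the element count, and beg == end denotes an empty slice.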

Subroutines

Name subroutines so that their calls read well in English. Use action names. `Procedure names should reflect what they do; function names should reflect what they return.' --Rob Pike. Don't use abstract words like process_stuff, transform_input or handle_event if possible, because you can't tell from such a name what the subroutine does. Again, if there is no good name, probably it's a bad subroutine.

Functions (subroutines that return a value) are named after what they return. Example: getCanonicalVersion();.
Predicate functions (subroutines that return a boolean value) are named with `is', sometimes with `does', `can', or `has'. Example: is_canonical() is better than canonical() for the same function, because it reads well in a conditional: if (song.is_canonical())
Procedures (subroutines that don't return a value) are named after what they do. Example: canonize(Song song); A classic here is setName(String name);
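The three kinds of subroutine can be put side by side in one small class. This is a sketch in Python. What counts as "canonical" here (single spaces, title case) is an assumption made up for the example:

```python
class Song:
    def __init__(self, title):
        self.title = title

    def canonical_title(self):
        # Function: named after what it returns.
        return " ".join(self.title.split()).title()

    def is_canonical(self):
        # Predicate: reads well in a conditional.
        return self.title == self.canonical_title()

    def canonize(self):
        # Procedure: named after what it does; returns nothing.
        self.title = self.canonical_title()


song = Song("  hey   jude ")
if not song.is_canonical():
    song.canonize()
```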

The notation this2that() is well established for conversion functions or hash mappings. Hashes usually express some property of the keys, and are used with the English word `of' or the possessive form. Name hashes for their values, not their keys, e.g. $color{"apple"}. [Tom Christiansen; this does not work as well for object notation such as color.get("apple");]
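Both conventions carry over to other languages with hash-like types. A short sketch in Python, where the dictionary contents and the conversion function are made up for illustration:

```python
# Named for its values: color["apple"] reads as "the color of apple".
color = {"apple": "red", "lime": "green"}


def celsius2fahrenheit(celsius):
    # this2that naming for a conversion function.
    return celsius * 9 / 5 + 32
```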

Comments

Don't belabour the obvious, uselessly duplicating the source code on an atomic level. Comment entire blocks, not single lines. Comments as a brief summary should give insight into what happens, the intent of the code. Eschew gaudy block banners; they just take a lot of maintenance work and are not needed any more with syntax highlighting. Use comments for classes, functions, global variables and constants. `Comments on data are usually much more helpful than on algorithms.' (Rob Pike)

Try to avoid commenting code where something tricky happens. Rather repair the code so it is not tricky any more. `Basically, avoid comments. If your code needs a comment to be understood, it would be better to rewrite it so it's easier to understand.' (Rob Pike) Don't comment bad code, rewrite it. [tpop]

Valid comments are an overall introductory paragraph that explains what happens, and interface documentation for your library routines and packages, so that the user can use them without looking at the source code. There are tools for producing this user or API documentation: for Perl programs, you can provide a manpage in POD (look at the Perl documentation for examples); for Java, an API reference with javadoc. Such documentation should cover:

Valid input formats and output. Functions which parse messy input should at least contain some example of it.
Environment (What does it need to run)
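As a sketch of what such interface documentation might look like, here is a Python function whose docstring names the input format, shows an example of it, and states what it needs to run. The function and the tab-separated format are hypothetical:

```python
def parse_sample_line(line):
    r"""Parse one tab-separated sample line into (name, count).

    Input example (hypothetical format):
        "liver_01\t42\n"

    Output: a tuple such as ("liver_01", 42).
    Environment: needs nothing beyond the Python standard library.
    """
    name, count = line.rstrip("\n").split("\t")
    return name, int(count)
```

The docstring is a raw string so that the escapes in the input example are shown literally instead of being turned into real tabs and newlines.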

Normalization

This subject is borderline for a style guide. It tells you that you should never do cut-and-paste programming. Rather, if you need the same functionality at two places in your code, put it into a small subroutine and call that. If you have two very similar routines, merge them into a parametrized one. Creating small, simple subs is easier, more foolproof and makes you feel like you are coding Lisp. Well, not quite. In any case it improves your code. It makes the code shorter, easier to maintain as you only have to change one place, easier to expand as you can call all the little general routines, and somehow easier to read. Using well-named functions makes their calls read nearly like English text. The disadvantage is that when you really want to know what the program does, you'll have to scroll a lot chasing down all these little subroutine rascals.

You also can put complicated boolean tests into subroutines, even if they are called only from one place so one could put them inline without losing normalisation, just to make the code in the calling routine easier to read. It has a similar effect as writing a comment to explain what the condition does, using the subroutine name as the comment.
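A small sketch of that last idea in Python. The leap-year test is a standard example of a boolean expression that is easier to read once it has a name. The function names are made up:

```python
def is_leap_year(year):
    # The complicated test lives in one place, and its name explains it.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def days_in_february(year):
    # The caller now reads almost like English text.
    if is_leap_year(year):
        return 29
    return 28
```

Even if is_leap_year were called from only this one place, the calling routine reads better than it would with the raw arithmetic inline.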

SHS SQL style

Use internal_underscore. 4-column indent, BSD-style.

_cur  Cursor variable
_row  Row variable
_rec  Record variable
trg_  Trigger name
pck_  Package name
i,j   Simple loop counters

These are helpful to guard against clashes in SQL statements, but suck because you always have to type i_ or something like that in addition to the name, so they eat up space and clog up your screen:

i_    in     parameter name
o_    out    parameter name
io_   in/out parameter name
v_    local variable name
t_    type name
c_    constant name (usually global and type defined)
(g_   global variable name)


In triggers, prefix with a subset of:
b     before
a     after
i     insert
u     update
d     delete
s     statement

Method names
set_   set, overwriting any existing state
fill_  set, if empty
is_    boolean test

In declarations, put simple ones first.

Function Header Template (translated from German):

--Install under schemaname
create or replace package pck_ as
/** Summary sentence - Frank is a lazy dog.

Description.

<img src="pckname_img.gif">

Schema: schemaname

$Logfile: $
$History: $

%version $Revision: 51 $ $Date: $
%author  Frank Schacherer, SHS
*/

References

Steve McConnell: Code Complete.
[tpop] Brian W. Kernighan and Rob Pike: The Practice of Programming.

2012-09-04
