Bioinfoxy: Coding

The art of Computer Programming as seen by other people with experience.

Start with something that's simple and useful, and that gets real work done now. Learn from the experience of doing that work. Enhance if/when needed. Don't worry about defining the ultimate cosmic architecture. The roadway is littered with the corpses of ultimate cosmic architectures.
--Jon Udell

General

Be predictable.	Data make code.	Normalize data & code.
Write a spec, then implement.	Separate frontend from library.	Make libraries general.

Some of the important fundamentals for code in my opinion.
Principle of Least Astonishment. This principle states that: A system and its commands should behave the way most people would predict, that is, the system should operate with "least astonishment."[Mike Sperber]
The main problem you face when programming bigger systems ist handling complexitiy. Most software is far too complicated for a human mind to understand whole. So one need tricks to handle this complexity. One needs to get help for what the brain cannot handle.[ooa/d]
Using Diagrams as thought and design aids helps thinking and designing tremendously, as man is an optical animal. But unfortunately software is "invisible", i.e. impossible draw, even in 3 dimensions, as a whole. There are just too many layers and views. [mmm]
Abstraction is what allows humans to handle the huge complexity of the world and life. You ignore all details about something and just concentrate on the few things that are important. (Of course the trick is in determining what is important.) Abstraction allows you to ignore the complicated inner workings of a piece of code and just use its' simple interface for interacting with other code. [wb, mmm]
It is often simpler to solve the general problem instead of the specific one. Like, solving some algorithm for n cases, then apply it to 73, then solving it for 73 cases. The reason is that to solve it in general on has to understand what is immutable about the problem, and what are just parameters that can change. [pps]
Always try to find out what is general about your problem from what is special. Then decompose your problem in a library of general tasks (backend) and the special case logic needed for glueing these together and interfacing to the user (frontend). This way you get a more elegant, clean system and can reuse the general components for something else. Eg. the skript for making the external version of this homepage has to walk the tree of files and dirs, removing unwanted files and links to them. Also, you have to make a copy of the old homepage first, so you will not lose the full version. You could do all that in one big skript, that could be used for nothing else, then. Decomposing it you get:

a component to make a copy of an file tree. (this exists - use a system xcopy).
a component that walks a tree of files and dirs, invoking a function on each file - or better each file with a name fitting a regex, where .* would do it on all files. (this exists - use perl File::depthfind).
a function that finds tags matching a regex, replacing the found tags, here with nothing.
a function that deletes a file matching the regex.
a config file, which can contain the list of regexen, so you can define them once then reuse it. you can use a perl source here and load it with do or require;
an interface skript, that takes as parameters the source and target dirs and the list of regexen and maybe something like a verboseness flag, or a non-default name for the kill list.

When working with perl or shells, most elemental operations already exist to be glued together, like the xcopy or File::Find in our example. For number two and three, you could write a small library that maybe later can be expanded with other useful functions for batch processing of tree files. Each time there is some amount of glue code that has to be written and will not be reusable.
Data structures are more important than code.

        "The programmer at wit's end for lack of space can often do best
        by disentangeling himself from his code, rearing back, and
        contemplating his data. Representation is the essence
        of programming." [mmm]

Rob Pike says: `Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming. {This is SO true} (See Brooks p. 102.: "Show me your flowcharts and conceal your tables, and I'll continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they'll be obvious.")' Capture regularity with data, irregularity with code. (Kernighan) If you see similar functionality in two places, unify it. That is called a `subroutine'. Consider making a hash of function pointers to represent a state table or switch statement.[Tom Christiansen]

Schedule

How does a program get to be a year late? ... One day at a time.
So use a chart for planning your milestones. Even better use inchpebbles, as they are more directly achievable, shooting for a daily goal. And use exact milestones, no 90% done, nearly debugged etc - programs are these half of their development time! Never miss two deadlines in a row, it kills morale. [mmm]

Documentation

Communicate in many ways: informally, formally in meetings, via a shared workbook or via email. A project workbook, documenting the work done should be built as WWW pages. All documents produced should be part of this workbook. Only interfaces should be exposed, you see only your own internal implementation.[mmm]
The critical documents for a software project are: objectives, user manual, internal documentation, schedule, budget, organization chart. Writing these also focusses thought by forcing hundreds of mini-decisions. [mmm]
Source code documentation is the best way as It's a folly to try and maintain independent files in synchronism. This is true for every kind of code and data. If you have it redundantly in two places, you will change it in one of them, and forget or be too lazy to change it in the other. Then the system becomes messy and unstable. Keep it in one place! Good source code documentation, preferably literal programming makes it easy to keep docu up to source-date and forces you to think about what you are doing!

Style

The source code should be self-documenting. Several points can be used for this, and the way they are used makes the style of a programmer:

well-chosen names

Use descriptive names for globals, short names for locals. Be consistent (use the same way to name things everywhere) and accurate (don't use misleading names). Name functions so that their calls read well in English, and use action names.
Use ALL CAPS for constants, Firstcaps for class names or global variables, firstLower for object names or local variables. Often, for short local variables lowall is even preferable.
Functions are named after what they return, procedures after what they do. For example, predicate functions should usually be named with `is', `does', `can', or `has'. Thus, isReady() is better than ready() for the same function. Therefore, canonize() as a void function (procedure), getCanonicalVersion as a value-returning function, and isCanonical for a boolean check.

boolean isActive() => if (cat.isActive()) { ...
String getName() => default to get obj value
void setName(String name) => default to set obj value

The abc2xyz() is well established for conversion functions or hash mappings. Hashes usually express some property of the keys, and are used with the English word `of' or the possessive form. Name hashes for their values, not their keys. eg. $color{"apple"}. [Tom Christiansen] Works not as good for the Java Collection classes color.get("apple"); `Procedure names should reflect what they do; function names should reflect what they return.' --Rob Pike. Give names to magic numbers (anything other than a 0 or 1). [tpop]

expressions

Use the natural form for expressions, like you would do in normal speech. Negations are hard to understand in conditionals: go for the positive form. Use extra parantheses to resolve ambiguity, and break up complex expessions into simpler, smaller ones. Clarity is much more valuable than cleverness here. Be careful with side effects. Use else-ifs for multiway decisions, keeping the flow of tested conditions easy and indentation low, instead of nested if statements.
Use character constants not (encoding-dependent) numbers. To test for ranges, even better use library functions. [tpop]

declaration, initialisation

Declare any variables right on top of the scope for which they'll be used, so you know what is used there. Do not declare them on a higher scope than is needed, so everything that belongs together sticks together.
In practice this means declare class/global variables at the top of your class, local variables at the top of your function, and temporary help variables at the top of their loop.
Initialize variables for classes where the class is initialized, for local functions at the top, if possible during declaration.

indentation

Use always the same source code formatting and naming scheme to save lots of time you otherwise would spend thinking about these. Code legibility is dramatically increased by consistency and indenting parallelism.
Compare

         my $filename =    $args{PATHNAME};
         my @names    = @{ $args{FIELDNAMES} };
         my $tab      =    $args{SEPARATOR};

with

         my $filename = $args{PATHNAME};
         my @names = @{$args{FIELDNAMES}};
         my $tab = $args{SEPARATOR};

[Tom Christiansen]
Indentation is used to show two things: first, indented code can mean the line continues the statement of the last line, whereas nonindent code would suggest the start of a new statement. Second, indent code can mean it is at a deeper nested scope, for example the contents of a while loop.
This can lead to a problem, when you have a long statement, and deeper nested dependent code, which would use the same indentation. You can avoid this by (a) not indenting the second line of the long statement, (b) indenting it twice instead of once.
I use standard Java layout. Using the generally accepted idiom makes it easier for you to read others stuff, too. I indent for easy readability as described there or for a single tab. The only exceptions are long boolean concatenations in conditionals clauses. There I use:

if (horkingLongStatementWhichEvalsToBooleanMaybeWithLotsaParenthesesToo
|| theNextHorkingLongStatementWhichEvalsToBoolean
&& soOnWithAnotherHorkingLongStatementWhichEvalsToBoolean) {
        doThis();
}

The preferred form of the ?: operator is

variable = (condition which usually is long and has some comparison operator)
                   ? it as true
                   : no it wasn't;

I also do not indent a tab in class bodies, as you the loose an autoamtic extra four characters on every line.

comments

Don't belabour the obvious, uselessly duplicating the source code on an atomic level. Comment entire blocks, not single lines. Comments as a brief summary should give insight in what happens (or if something tricky happens). Eschew gaudy block banners. Use them for classes, functions, global variables and constants. Don't comment bad code, rewrite it (!). [tpop]
`Comments on data are usually much more helpful than on algorithms.' (Rob Pike) `Basically, avoid comments. If your code needs a comment to be understood, it would be better to rewrite it so it's easier to understand.' (Rob Pike)
Especially important is the introductory paragraph that explains what happens, best done as a manpage, containing:

NAME and Purpose
SYNOPSIS, Syntax
DESCRIPTION (plain text)
Options
Valid input formats and output. Functions which parse messy input should at least contain some example of it.
Environment (What does it need to run)

In perl use the POD, in java, Javadoc.

Basic Algorithms and Datastructures

Searching

In ordered arrays: binary search O(log n). In trees and heaps: recursive search O(log n). In unordered arrays: linear search. O(n);

Sorting

Quicksort, Heapsort, Mergesort O(n log n). Simple sorts. O(n^2).

Growing Arrays (Vectors)

Linked Lists

Trees

Hash Tables

Use a versioning system like CVS to document version and change development. [mmm]

Good Programmers

35. Everyone can be taught to sculpt: Michelangelo would have had
  to be taught how not to. So it is with the great programmers.

-Alan Perlis
Sharp professional programmers are ten times as productive as normal ones. A small sharp team is best - as few minds as possible. Such teams created nearly all of the great software that exists (Unix, Linux, C, Perl, Java, etc). A team of two, with one leader is often the best use of minds. The really good programmers have strong spatial senses and usually geometric models of time. Often they are also talented with words and music. And you can use them to clean your furniture, too. [mmm]

Idioms

You can learn the idioms used by others from their source. It usually is the best way to learn. HTML (with View Source) and Perl (where you can look at the lib modules) are great for this reason.

System Design

The essence of design is to balance competing goals and constraints. Issues in design are Interfaces, Ressource Management, Error Handling
Conceptual integrity is the most important thing in system design. Ease of use for both simple and difficult problems is the designs ultimate test. Separation of architecture from implementation is a powerful way to get conceptual integrity. External provision of an architecture enhances, not cramps, creativity of implementation. Architecture and implementation can develop in parallel. [mmm]
Do the same thing the same everywhere.[tpop] This is very true. To be easy to understand and use (also for yourself), things should work just like you expect (or guess) them to do. By doing everything the same every time, if you know how it works once, you know it for all cases (see also: getting into a rut early). You know what to expect and don't have to look things up and it all seems simple. If you'd do it different every time, you can not expect anything, and the system is hard to understand, seems complicated and its easy to make errors. Consider for example how programs that do not use standard key binding (like ^X and ^V for cut and paste) on Windows suck, or how irritating it is if some command line programs had the params after the target file(s) and some in front, or if in some classes a method is called getURL() in others getUrl() and yet in others geturl().
Differentiate between frontend and backend. The front-end interfaces with the user and here you set all parameters. The backend consists of libraries. They need clean interfaces. They do not set any defaults other than elementals like zero, null or stdout and recieve everything else handed down from the front end (or a config file specified there). Imagine the library becomes part of some larger program whose specification changes over time. Program it so that it will work robust for any input.
The best way to develop a system is to grow it, like a plant. Start out with a small, simple core, get it working. Then, slowly bud part after part, growing it more complex over time. All the time you have a working system.
Even better than developing it yourself: buy prefabricated shrink-wrap components someone else developed. This is the major thrust of object orientation, with its API's, frameworks and office packages. So the vendor put in thousands of hours of development, and you can use the results just like -snap-. High level languages isolated programmers from the "accidental" complexities and problems of machine language and thus enhanced productivity manyfold. They also killed a host of opportunities to produce bugs. This is abstraction. Moving the language farther from the drudgery of the machine and nearer to the ideas a programmer wants to express, the things he wants to model. Components do the same on the next level: moving from atomic high level language expressions up to the objects the programmer thinks about, be it URLs, spreadsheets or parsers. [mmm]
The most important thing is to look at the data you have to work with and how it is structured. This should make you chosse your data structures and interfaces, and if these are chosen fittingly, expansion will be easily accomodated.

A library for others

You will be one of those others if you want to reuse that code in a month or even a week from now.
Interfaces. What services and access are provided? Hide implementation details. (Aka "encapsulation", "abstraction", "information hiding", "modularization"). Avoid global variables, wherever possible pass data as function arguments. Avoid publicly visible data, exept for class defined constants used to parametrize the classes behavior by using them as arguments in function calls. Chose a small, orthogonal set of primitives. Resist to provide multiple ways of doing the same thing - ist harder to maintain and learn. Prefer narrow interfaces, which do one thing, but do it well. Don't fix the interface when the implementation is broken. Dont reach behind the user's back. Libraries should not write secret files or change global data, nor modify data in its caller. Make interfaces self-containd (or at least be explicit wich external services are needed, to avoid placing the maintence burden on the client). Try to return as much useful information as convenient from function calls, in a form that is easy for the caller to use.
Interfaces are used to hide implementation detail. Use a specification to make clear what the object implementing the interface does (assumptions, input/output). The best approach is to write the spec early and revise it as you learn from the implementation.
Ressource Management. Who manages limited ressources like files, memory, how are shared copies of information handled.
Free a ressource on the same layer that allocated it. If you have a routine working on files, either open and close them in there or take a handle to an open file, and do not close it in there. Errors and misunderstandings about shared responibilities are a frequent source of bugs.[tpop]
Initialization of objects and fields is also a ressource question. Data that is to be shared by function calls to an object should be stored in a class variable and preferrably be assigned upon initialization. (To further differentiate in Java this should be done in direct assignment for default values instead of a in constructor, which should be reserved to set values given as parameters.) Data which differs for each function invoctaion should be stored local to that function.
Error Handling. Who detects errors, who reports them and how?
Catch errors as early as possible, handle them as high up in the hierarchy as possible. Use exceptions only for exceptional situations. Library routines should not exit the system when hitting an error, pop up dialog boxes or print messages. They should return error statements to the caller. Either through "funny" return values or better through exceptions (which are like funny values with a lot of context information attached which bubble up the call stack automatically). Try to keep the library usable after an error has occured.[tpop]
Try to find a module to print/log errors or design a simple one yourself. It unifies error behaviour and its existence encourages you to use it. You could even try to store all error messages centrally. Logging has the advantage to work in environs, where no standard output will be seen. Provide as much context as possible in error messages. A simple estrdup failed gives the user no clue, who said that and why it was caused, compared to

Markov: estrdup ("Longtext") failed: Memory
  limit reached

. Also give an example of valid input, or size limit.[tpop]

Debugging, Testing and Profiling

Debugging

Examine the most recent change. Debug it now, not later. Get a stack trace. Read before typing. Explain your code to someone else. Single step. For hard bugs: make the bug reproducible. Divide and conquer (binary search prune inputs or put print statements). Display output to localize the search. Write self checking code. Write a log file (and flush buffer, so abnormal termination will not lose the last statements) Draw a picture (annotate your data structures with statisics and plot the result). Use tools like diff, grep to examine lage outputs. If debugging takes longer, keep record of what you tried (it will hel you to remember the next time something similar comes up).[tpop]

Testing

Test code mentally as you write it, thus you will not make some mistakes in the first place.

Test code at its boundaries. This will catch fencepost errors. For loops or conditions try what they will do with empty input, one item only, two items or the maximum number of items. What happens if there are too many? Can it happen?	If code fails it ususally does at the boundaries, if it works at the boundaries it will work everywhere else, too.
Test pre- and post-conditions. Use assertions. Program defensively. An assertion is simply a test for a pre-condition which will abort the program if it fails. Defensive programming is simply code to handle "can't happen" cases, that is cases outside of the expected value range. Typical examples are null pointers, out-of-range array subscripts, divisions by zero.	Defensive programming can protect the piece of code against incorrect use or incorrect data. These might happen as a consequence of an error somewhere else. So do asertions, but with defensive programming, the system will not exit, which only makes sense if there is some sensible way to recover fro the error. Probably the value returned from a defensive clause should indicate the erronous state or throw an exeption to be handled further up, so the client code can make sense of what happened.
Test incementally. Write a small unit of code that does something well defined and simple. Then test it. Then add more. Test that. Etc.	It is faster to do it this way, than with on big testing bang in the end. Doing it incrementally, at each step you can be pretty sure that the library functions already implemented work like they shoud, and if you get an error, it probably is in the few lines of new code, easy to find. If you test only at the end and get mistakes they are much harder to find.
Test simple parts first.	Thus, you can buid confidence in the elemental parts and then go on to more complicated features.
Check error returns.	Exceptions force you to do this.

Also, there should be some test cases for each program, ideally a simple one, a complicated ones at the edge of allowed input (empty or maximum values etc), some barely illegal ones. since it is boring and time consuming to test by hand, you will end up not doing it. Therefore, you have to automate testing to run a set of test cases by pushing a single button, and reporting on mistakes.

Know what output to expect. Verify conservation properties. If the output has an inverse, see if you can reconstruct the initial input. You can use tools like `cmp, diff, wc` to compare outputs.	This is sometimes difficult for complex programs, GUI apps etc.
Automated regression testing. The idea behind regression testing is that if you have a program you know to work, and extend or change it, the old tests should still work with the new version and produce the same, correct output. If they do, you again have a working version. Usually you hav a test harness, some kind of shell or perl skript, which runs your program for a large number of input files (test cases). These should run silently, producing output only if a mistake occurs.	If you fix some mistake, you tend to check if only if the fix worked. But if your fix introduced a new mistake, you will miss it. Regression testing should help you to prevent that.
Create a test scaffold.	This will alow you to simply incorporate new tests.

What NOT to do

If you have big switch-case or if-then-else statements, you're doing something wrong. If you use a lot of cut and paste, you are doing something wrong.Anti Patterns
11. If you have a procedure with 10 parameters, you probably missed some. Alan Perlis, Epigramms

Design Patterns

There are several techinical patterns one can use in programming:

Top Down Design (This means a divide and conquer approach: split the problem into smaller ones until they are so small you can easily handle them. This is mainly working with procedural approaches, and is essentialy the same as building up a complex program from less complex components - just seen the other way round. Or maybe it is a way to find out what the simple components should be.
Recursion. Mainly for handling recusive data structures like trees.
Divide & Conquer. This is again the technique of seperating a big problem into smaller ones and solve these.
Preselection
Backtracking

General Object Oriented Wisdom:

Program for an interface, not an implementation. Let your implementation classes extend abstract supercalsses or implement defined interfaces. Client code accesses the interface only. This way, you can change the implementation without changing client code.
Prefer object composition to inheritance. Since subilassing kind of breaks encapsulation (changing the superclass will change the subilass), it is better to use composition. This will make the classes more easily reuseable in another context. Delegation makes composition as mighty as subilassing. Instead of delegating request execution to the superclass, it is delegated to the component object.
Templates (Generic Programming). This is a technique to define a type without chosing the types it works on in advance. You then can parametrize it with the types it works on, eg. a collection which can be parametrized with the class type it holds. [dp]

Rut values

Standard random numbers:

17 (the 'smallest random number')
42 (Doug Adams said so)
105 (Octal for decimal 69, or decimal for Hexadecimal 69)

Standard variable Names. You shouldn't need more than three. Actually it is said about numbers in computer programs: Use none, one or unlimited numbers.

foo
bar
baz
blah
blub

Rules for catching exceptions

If an exception is thrown, it should be allowed to bubble up until you reach some part of the program which is on a high enough level, to competently do something about it. So for example if the exception is not fatal to the further execution of the code, it could be caught right away and a message written to the log. On the other hand, if the exception proves fatal for higher levels of the code, it has to be allowed to bubble up and caught there. Always remember that Exceptions are just used instead of returning funny values (which will not be passed up through the stack automatically).
When we want to shield the higher levels from proprietary Exceptions by some low level service, like the DBMS, we would have to use our own Exceptions to wrap these instead.
The same - in the opposite direction - holds true for default values from the configuration file. These should be read in only at the highest possible level, and then be passed on as parameters to other classes which do need them.

etc...

Separate the things that change from the things that stay the same. [tij]
If in trouble, make more objects. [ooa/d]
Read the technical specification for something if you want to know it (eg. XML, HRML etc). It's no use wasting time in second rate places, as soon as you have an overview.

References

(see my Bibliographies for more literature.)

Tag	Author	Title	Year	Publisher	ISBN
mmm	Jr. Frederic P. Brooks	The Mythical Man Month, Anniversary Edition	1995	AW	0-201-83595-9
This is the classical text about software project management with some annotations and the ``No Silver Bullet'' Essay added twenty years later. There seems to be an unspoken law stating that this must be cited in any other book about software projects or any computer book at all, with the following sentence: ``The programmer at wit's end for lack of space can often do best by disentangeling himself from his code, rearing back, and contemplating his data. Representation is the essence of programming.'' But there is much more practical wisdom in it, and it's half-entertaining to read, too.
pps	Jon Bentley	Programming pearls	1986	AW	0-201-10331-1
A real pearl. This is a pleasure to read! It teaches some basic concepts of coding, like back-of-the-envelope calculations, the use of data structures to get elegant code and some useful algorithmic tricks (heapsort, quicksort, binary search, hashing). It's a bit old already, but the essence of what he is saying is still true in the age of OO programming.
tij	Bruce Eckel	Thinking in Java	1998	PH	0-136-59723-8
It's the best book about Java. Period.
tpop	Brian W. Kernighan and Rob Pike	The Practice of Programming	1999	AW	0-201-61586-X
An excellent introduction to the various aspects of programming, from style and interface design to debugging, testing, porting and little languages. When you only buy one book about an overview on actual programming - buy this one. It's only about 250 pages long, and covers a lot of terrain, with much sound advice, further-leading suggested reading and a lot of example code. Of course you can only learn by doing, so there are exercises.
dp	Erich Gamma and Richard Helm and Ralph Johnson und John Vlissides	Design Patterns: Elements of reusable object-oriented software	1986	AW	0-201-63361-2
Cited in nearly every other book about object-oriented proramming, this is probably the book that will teach you more about the subject than any other book out there. Unfortunately its boring to read and a touch too academic for my taste. Still a must read.
ooa/d	Grady ooa/d	Object-Oriented Analysis and design	1994	AW	0-805-35340-2
This is a standard work for Object Oriented Analysis and Design (OOA/OOD), also touching the iterative development process. Too much hype, too many buzzwords, too much object-religious statements - the getting-a-better-prgrammer-per-page-of-text ratio is rather low.
cc	Steve McConnell	Code Complete: A Practical Handbook of Software Construction	1993	Microsoft Press	1-55615-484-4
This is the Elder and Big Brother of 'The practice of programming'.

Bioinfoxy

Pages

2012-09-04

Coding