Bioinfoxy: coding


2014-07-14

MySQL join syntax

http://dev.mysql.com/doc/refman/5.7/en/join.html
http://dev.mysql.com/doc/refman/5.7/en/nested-join-optimization.html

In the classical way to write a query
SELECT a.x, b.y
FROM a, a2b, b
WHERE a.a_no = a2b.a_no
AND a2b.b_no = b.b_no
AND a.z = 123

the WHERE clause mixes join conditions with selection conditions. For that reason, and especially once one gets to things like outer joins, it is better to be explicit: use the joins to restrict rows by matches between tables, and the WHERE clause to restrict the results by value.

As always, tables in joins can be actual tables, or subqueries returning tables, and we can define aliases for tables. I am ignoring partitions here.

A join merges two tables into one. It starts from the cartesian product of all possible combinations of rows from these tables and restricts it to those combinations that have matching values. There are several kinds of joins:

JOIN ... needs to have matching values in tables on both sides
LEFT JOIN ... needs values in left table, and tacks on values from the right, or null otherwise

You use left joins if you have a table that only has values for some records in your main table, and you want to see them if available, but you do not want to filter out rows from the main table, if they are absent.
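To make the difference between the two kinds of join concrete outside of SQL, here is a minimal JavaScript sketch over arrays of row objects. The helper names innerJoin and leftJoin and the sample data are made up for illustration; a real database does this far more efficiently.

```javascript
// Inner join: keep only pairs of rows whose key values match.
function innerJoin(left, right, key) {
    var out = [];
    left.forEach(function (l) {
        right.forEach(function (r) {
            if (l[key] === r[key]) out.push(Object.assign({}, l, r));
        });
    });
    return out;
}

// Left join: keep every left row; when nothing matches,
// fill the right-hand columns with null.
function leftJoin(left, right, key, rightColumns) {
    var out = [];
    left.forEach(function (l) {
        var matches = right.filter(function (r) { return l[key] === r[key]; });
        if (matches.length === 0) {
            var padded = Object.assign({}, l);
            rightColumns.forEach(function (c) { padded[c] = null; });
            out.push(padded);
        } else {
            matches.forEach(function (r) { out.push(Object.assign({}, l, r)); });
        }
    });
    return out;
}

// Hypothetical sample data: only sample 1 has a note.
var samples = [{ id: 1, name: "s1" }, { id: 2, name: "s2" }];
var notes = [{ id: 1, note: "checked" }];

console.log(innerJoin(samples, notes, "id").length);   // 1
console.log(leftJoin(samples, notes, "id", ["note"])); // 2 rows, second has note: null
```

The inner join drops sample 2 because it has no note; the left join keeps it and tacks on a null.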

One can optionally insert an OUTER between LEFT and JOIN. One can optionally call the simple, bidirectional join INNER JOIN or CROSS JOIN (in other dialects INNER requires an ON clause, here not), or use a comma between table names to imply it with the join condition in the where clause. Be careful when mixing table lists (i.e. implied inner joins) and left or explicit joins: because the explicit joins have precedence over the implied inner joins, the column you may want to join on in your outer join may not yet have been joined in, resulting in an error message. (See below).

There are several ways to describe which fields of the left and right operands should be joined on.

ON ... followed by an explicit statement which field from which table equals which field from which other (one actually can use anything one could use in a where clause in ON, but then one loses the advantage of cleanly splitting joins from selection criteria).

USING ... followed by a parenthesis enclosed list of fields which need to be present in both tables under the same name, and need to have matching content for a match.

NATURAL JOIN of two tables does not even need the field(s) in USING any more, as it will join on all fields of the same name between the tables.

The SELECT * output of USING and NATURAL is not quite identical to that of ON: the matching columns are listed only once, before the other columns (so-called "coalesced" common columns), in order of appearance in the first table, then the columns unique to the first table in order, then those unique to the second table in order. ON just lists all columns from all joined tables.

For example, assume you have the tables
t1(id,x) and t2(id,y). Then


SELECT * FROM t1 JOIN t2 ON t1.id=t2.id;


SELECT * FROM t1 JOIN t2 USING (id);


SELECT * FROM t1 NATURAL JOIN t2;



will all match the same rows; only the column list of SELECT * differs, as described above.



If you have the three tables t1(a,b), t2(c,b), and t3(a,c), then the natural join SELECT * FROM t1 NATURAL JOIN t2 NATURAL JOIN t3; really means that after joining t1 and t2, the resulting table has the columns b, a, c. Natural joining this with t3 will thus join on both a and c, not just on c, equivalent to this: SELECT ... FROM t1, t2, t3 WHERE t1.b = t2.b AND t2.c = t3.c AND t1.a = t3.a; This may not be what you intended. So while NATURAL is nice for the lazy, it is also dangerous because of its side effects. It is better to be explicit and use USING.
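The side effect can be simulated in a few lines of JavaScript. This is only a sketch of the matching rule (join on every column name the two sides share), not of how MySQL implements it; the helper names and the row values are invented for the example.

```javascript
// Columns present on both sides (read from the first row of each).
function sharedColumns(left, right) {
    if (left.length === 0 || right.length === 0) return [];
    return Object.keys(left[0]).filter(function (col) {
        return col in right[0];
    });
}

// NATURAL JOIN sketch: rows match if ALL shared columns are equal.
function naturalJoin(left, right) {
    var cols = sharedColumns(left, right);
    var out = [];
    left.forEach(function (l) {
        right.forEach(function (r) {
            var match = cols.every(function (c) { return l[c] === r[c]; });
            if (match) out.push(Object.assign({}, l, r));
        });
    });
    return out;
}

// Hypothetical contents for t1(a,b), t2(c,b), t3(a,c).
var t1 = [{ a: 1, b: 10 }, { a: 2, b: 20 }];
var t2 = [{ c: 7, b: 10 }, { c: 8, b: 20 }];
var t3 = [{ a: 1, c: 7 }, { a: 9, c: 8 }];

var t12 = naturalJoin(t1, t2);        // joined on b only
console.log(sharedColumns(t12, t3));  // [ 'a', 'c' ]
console.log(naturalJoin(t12, t3));    // [ { a: 1, b: 10, c: 7 } ]
```

Had the second step joined on c alone, the row with c = 8 would have matched as well; because a is also shared, it is silently dropped.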



As long as all of them are inner joins, the order of joins is unimportant, and no parentheses are needed.

However, this is not true for outer joins.

Let's say we have

CREATE TABLE t1 (i1 INT, j1 INT);
CREATE TABLE t2 (i2 INT, j2 INT);
CREATE TABLE t3 (i3 INT, j3 INT);
INSERT INTO t1 VALUES(1,1);
INSERT INTO t2 VALUES(1,1);
INSERT INTO t3 VALUES(1,1);
SELECT * FROM t1, t2 JOIN t3 ON (t1.i1 = t3.i3);

will create an error Unknown column 't1.i1' in 'on clause', because the explicit join has higher precedence than the comma and is evaluated first: the statement is read as t1, (t2 JOIN t3), so the ON clause does not know anything about i1. SELECT * FROM (t1, t2) JOIN t3 ON (t1.i1 = t3.i3); would fix that, as would SELECT * FROM t1 JOIN t2 JOIN t3 ON (t1.i1 = t3.i3); because there the joins are worked off left to right in order. That means you can first list all your joins (making a huge cartesian table in theory), then all the ON conditions combined with AND. Likewise, in an ON clause you can only refer to tables that were mentioned before it (to the left), not to tables that are joined after it (to the right of the clause).

Referring to joins

Joins generate one big table on which the WHERE clause can work. You can supply aliases for the various tables contributing to a join.

Finally, here are multiple joins in a mix of inner and outer, with the outer ones nested via linking tables. That is what you deal with in the real world:

select *
from finding f
join acq_source s using (acq_source_id)
join finding2treatment f2t using (finding_id)
join treatment t using (treatment_id)
join treatment2genotype t2y using (treatment_id)
join genotype y using (genotype_id)
join genotype2variant y2v using (genotype_id)
join gene g using (gene_id)
left join (treatment2disease t2d join disease d) on (t2d.treatment_id = t.treatment_id and t2d.disease_id = d.disease_id)
left join (treatment2drug t2u join drug u) on (t2u.treatment_id = t.treatment_id and t2u.drug_id = u.drug_id)
left join (reference2finding r2e join reference r) on (r2e.finding_id = f.finding_id and r2e.reference_id = r.reference_id)
where f.finding_id in ( )

2014-03-26

Perl Moose notes

Learn about constructors. Non-obvious gotchas:

Do not define new() for your classes. Moose provides it.
Attributes are initialized from their defaults, or set from parameters that match their name when calling new.
There are hooks, the methods BUILDARGS and BUILD, that you can implement to customize pre- and post-construction.
BUILDARGS receives the arguments passed to new.
BUILD is called after an object is instantiated. It is used mostly for assertions and logging.

Learn how attributes work in Moose. Non-obvious gotchas:

If you want to provide a reference as a default, it has to be returned from an anonymous sub, so that each instance gets its own reference instead of all instances sharing one and the same.
You can provide an anonymous sub to initialize the default, or you can give the name of a builder subroutine. A builder can be composed in by a role or overridden in a subclass, which makes it preferable for anything but trivial cases.
Make attributes with a builder lazy. Your subroutine to initialize the attribute may depend on the values of other attributes, and these may not yet have been set. You must define the attribute as lazy to avoid issues. (What about circular refs?)
Lazy builders are only called when the value is accessed. This can happen anywhere, including inside a map or grep. $_ may have strange values at those times. Make it local $_; if you use it.
has really is just a function call.

2014-03-09

tmux

tmux, the terminal multiplexer. It's shiny. Here are some useful commands; all of them follow the prefix

Ctrl-b

c - new window
n - next window
p - prev window
d - detach, so it runs in the background even if you close the session, and you can re-attach with "tmux attach"
other people can attach to it too, creating a kind of interactive workspace
, (comma) - rename the window
PgUp/PgDown - scroll through the buffer. When not in Ctrl-b mode, you cannot scroll. q to quit.

tmux attach - attach to a tmux server and session running on the current machine

2014-02-27

MySQL useful tidbits

show privileges; (overall)
show grants; (your own privileges)

show databases;

show processlist;
show full processlist; same as select * from information_schema.processlist

kill ;

2012-09-27

Creating a table of contents from headings in a page with JavaScript

Here is a HOW-TO for dynamically creating a linked table of contents from the headings of a webpage. This really owes the core code to Janko's blog.

Introduction

Writing a table of contents to jump to each heading on a longer page manually is excise, boring work. Solution: have the page create its own table of contents, by using JavaScript to inspect the document structure.

This is especially important on Blogger, because it messes up your HTML if you switch the editor mode from "HTML" to "Compose", whether accidentally or for more convenient editing. It will turn all your paragraph tags into hard line breaks and vomit div tags all over the place. It'll massacre HTML or JavaScript code displays such as the ones on this page. The worst effect, however, is on name anchors, where it actually destroys their functionality by inserting href attributes. Strong recommendation: if you are on Blogger, stick to HTML mode.

Setting up

First, we insert an element into the page that acts as the placeholder for the table of contents. This might also be done dynamically, if all the pages have the same structure, but for now I'll just do this manually, so I can put the table where I want it:

<div id="toc"></div>

Second, because Blogger also uses heading tags in gadgets and other places, we wrap the actual page content in a second div element, so we can limit ourselves to headings that are part of the actual post. Otherwise, all kinds of things will be listed. If you are not on Blogger, you can omit this step. The wireframe for this page looks as follows:

<div id="content">
    <h2>Setting Up</h2>
        <h3>Creating the table</h3>
            <h4>jQuery</h4>
            <h4>toContent.js</h4>
        <h3>Formatting the table</h3>
</div>

Third, we have to include the JavaScript that transforms this into a table of contents. Put this code at the bottom of the page body, so all the elements it refers to have already loaded. Also, it does not further delay the loading of the page. I put the script into a file, to make it possible to centrally change the code for many pages. There are also ways to do more dynamic loading of scripts, see for example here.

<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js" type="text/javascript"></script>
<script src="http://www.schacherer.de/js/tocontent.js" type="text/javascript"></script>

To be honest, the pages on Blogger are already crammed to the hilt with dynamically loaded scripts by Blogger itself, so where you put this should not make much of a difference in speed. The way the script is written, it will execute only once the entire document is ready anyway, so you could even put this in the header. But putting it into the header on Blogger would mean modifying the fundamental template code, and it would then show up on all pages, not just those where you need it.

Creating the Table

Now let's talk about the scripts that actually do the work.

jQuery

The first script just loads the jQuery API, which allows you to manipulate and search page elements much more concisely than plain JavaScript.

tocontent.js

This script expands your toc div into a hyperlinked table.

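$(document).ready(function() {
    // Heading line of the table of contents
    $("#toc").append("<p>Contents<\/p>");
    // Only look at headings inside the content wrapper
    $("#content").find("h1, h2, h3, h4").each(function(i) {
        var current = $(this);
        // Give each heading an id the link can jump to
        current.attr("id", "title" + i);
        // this.tagName is "H1" to "H4"; the CSS selects on it via the title attribute
        $("#toc").append("<a id='link" + i +
                         "' href='#title" + i +
                         "' title='" + this.tagName + "'>" +
                         current.html() + "<\/a><br>");
    });
});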

You can modify this script to include other elements, or for example not display the level four headings, by modifying the element list in the find clause. If you do not need to encapsulate your tags in a content section, replace $("#content").find("h1, h2, h3, h4") with $("h1, h2, h3, h4"). The title attribute is used later in formatting the table with CSS.
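As a rough sketch of what changing the element list does, the markup building can be pulled into a plain function that runs without jQuery or a browser. The name buildTocHtml, its levels parameter and the sample headings are hypothetical and not part of the script above.

```javascript
// Hypothetical helper: build the link markup for the wanted heading levels.
// `headings` is a list of { tag, html } objects in document order,
// `levels` lists the tag names to include, e.g. ["H1", "H2", "H3"].
function buildTocHtml(headings, levels) {
    // Same effect as narrowing the element list in the find clause
    var wanted = headings.filter(function (heading) {
        return levels.indexOf(heading.tag) !== -1;
    });
    var html = "";
    wanted.forEach(function (heading, i) {
        html += "<a id='link" + i + "' href='#title" + i +
                "' title='" + heading.tag + "'>" + heading.html + "</a><br>";
    });
    return html;
}

// Made-up sample input mirroring part of the wireframe
var headings = [
    { tag: "H2", html: "Setting up" },
    { tag: "H3", html: "Creating the table" },
    { tag: "H4", html: "jQuery" }
];

// Leave out the level four headings
console.log(buildTocHtml(headings, ["H1", "H2", "H3"]));
```

The index i counts only the headings that were kept, just as it does in the each() callback, so the headings themselves would need their ids assigned from the same filtered list.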

A feature, or rather a bug, of JavaScript is that it may insert semicolons into your code when you break long statements across multiple lines. Two things follow. First, it does not insert one where you might expect it: you have to close a statement with a semicolon yourself if the following line starts with an opening parenthesis, a square bracket or one of the arithmetic operator tokens, because otherwise the two lines are read as one statement. This is not an issue if you close your statements with semicolons anyway. Second, and this is an issue, you may not insert a line break for clarity directly after return, break, continue or throw, or directly before a postfix ++ or --, as JavaScript will insert a semicolon there and break up your statement.
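Both effects are easy to reproduce. The following sketch uses made-up function names and runs in any JavaScript engine.

```javascript
// A line break directly after `return` ends the statement.
function brokenSum(a, b) {
    return
        a + b;   // never reached: parsed as `return;` followed by `a + b;`
}

// Opening a parenthesis on the same line as `return` keeps the statement together.
function workingSum(a, b) {
    return (
        a + b
    );
}

// No semicolon is inserted before a line that starts with a square bracket.
function mergedLines() {
    var list = [1, 2, 3]
    var last = list
    [2]          // read together with the line above as list[2]
    return last
}

console.log(brokenSum(1, 2));   // undefined
console.log(workingSum(1, 2));  // 3
console.log(mergedLines());     // 3, not the array
```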

Formatting the table

To format the table, you have many options with CSS. I use the following to get a compact, indented list:

#toc a { display:inline-block; }
#toc a[title=H1] { text-indent:0em; font-size:12pt;}
#toc a[title=H2] { text-indent:1em; font-size:11pt;}
#toc a[title=H3] { text-indent:2em; font-size:10pt;}
#toc a[title=H4] { text-indent:3em; font-size:9pt;}

You can put this into a CSS file to load in the header. In Blogger, you can instead go to the advanced layout options, select "Custom CSS" and paste this in, to be included with every page. That's it. Have fun with dynamic TOCs.

2012-09-22

A list of short Coding Book reviews

Reviews on coding and software development books.

Author	Title	Ranked	ISBN
Jon Bentley	Programming pearls	*****	0201103311
A real pearl. This is a pleasure to read! It teaches some basic concepts of coding, like back-of-the-envelope calculations, the use of data structures to get elegant code and some useful algorithmic tricks (heapsort, quicksort, binary search, hashing). It's a bit old already, but the essence of what he is saying is still true in the age of OO programming.
Brian W. Kernighan and Rob Pike	The Practice of Programming	*****	020161586X
An excellent introduction to the various aspects of programming, from style and interface design to debugging, testing, porting and little languages. When you only buy one book about an overview on actual programming - buy this one. It's only about 250 pages long, and covers a lot of terrain, with much sound advice, further-leading suggested reading and a lot of example code. Of course you can only learn by doing, so there are exercises.
Steve McConnell	Code Complete	****	1556154844
This is the Elder and Big Brother of 'The practice of programming'.
Bruce Eckel	Thinking in Java	****	0136597238
It's the best book about Java. Period.
Randal L. Schwartz and Tom Christiansen	Learning Perl	****	1565922840
('Llama Book') This is the best introduction into hands-on programming for an absolute beginner I've seen. Amazing.
Scott Guelich and Shishir Gundavaram and Gunther Birznieks	CGI Programming with Perl	****	1565924193
This book is a gem on CGI programming and perl. It's hard to believe, but this one is even better than the other perl books from O'Reilly I've read. It covers everything you need to know: CGI proper, CGI.pm, templates, security, database backends, maintaining state, creating graphics, debugging CGI apps and more. It even touches neighboring areas like HTTP, JavaScript, indexing and sending mail. This is the book for tackling the practical, real-world CGI problems you're going to face. Make sure you get the second edition.
Frederick P. Brooks Jr.	The Mythical Man Month	****	0201835959
This is the classical text about software project management with some annotations and the "No Silver Bullet" essay added twenty years later. There seems to be an unspoken law stating that this must be cited in any other book about software projects or any computer book at all, with the following sentence: "The programmer at wit's end for lack of space can often do best by disentangling himself from his code, rearing back, and contemplating his data. Representation is the essence of programming." But there is much more practical wisdom in it, and it's half-entertaining to read, too.
Steve Krug	Don't Make Me Think	****	0321344758
A blissfully well designed book about web interface design, slim, with a bucketload of useful tips, and entertaining to boot. Covers how to set up a cheap usability testing rig, too.
Harold Abelson and Gerald Jay Sussman and Julie Sussman	Structure and Interpretation of Computer Programs	****	0262011530
('Wizard Book') Uses LISP (a functional programming language) to introduce basic concepts of coding. It is not an easy read.
William H. Press and Saul A. Teukolsky and William T. Vetterling and Brian P. Flannery	Numerical Recipes in C++ : The Art of Scientific Computing	***	0521750334
This is a wonderful book to look up ready-to-use algorithms on a wide range of numerical problems. Also, the mathematical introductions are understandable, to the point and written with style. I really like this book, although I never read it cover to cover. It is a classic, too. There are many versions for different computer languages, but the language is not important actually. They could just as well provide pseudocode.
Grady Booch	Object-Oriented Analysis and design	***	0805353402
This is a standard work for Object Oriented Analysis and Design (OOA/OOD), also touching the iterative development process. Too much hype, too many buzzwords, too many object-religious statements - the getting-a-better-programmer-per-page-of-text ratio is rather low.
Robert Sedgewick	Algorithms in C	***	0201514257
The classic overview on classic algorithms.
Alfred V. Aho and Jeffrey Ullman and Ravi Sethi	Compilers: principles, techniques and tools	***	0201100886
('Dragon Book') This is a GREAT introduction to compilers, lexers, parsers etc. It's a bit older, but if you are not Bjarne Stroustrup, it'll probably be all you'll ever need. Even the paper it's printed on smells good!
Andrew S. Tanenbaum	Computer Networks	***	0133499456
A thorough and highly entertaining introduction to the basics of the field, from the physical layer up through the network protocols.
Alan Cooper	About Face: The essentials of user interface design	***	1568843224
A smattering of smart observations about user interfaces by the father of Visual Basic. If it were only half as big, it would be great. Cooper's picture asking for feedback is pretty cool.
Alex Martelli and David Ascher (eds.)	Python Cookbook	***	0596001673
Many people say this is the best Python book. It is the only book I have needed in addition to the online documentation so far, and it is quite good if you want to get a feel for the idioms and idiosyncrasies of that little language. I haven't read it cover to cover, but use it to look up stuff. Still, the recipes are entertaining enough that you could read it like a normal book.
Jay Yellen and Jonathan L. Gross	Graph Theory & Its Applications	***	0849339820
What's nice about this book is that they give you all the term definitions, and a lot of drawings to boot, so you can visually understand what they are talking about. This usually makes it very easy to grasp the point. Much more so, in my opinion at least, than reading page after page of math formulas. There are also the algorithms, even though you can get most of those from an algorithm book, if you already have one.
William J. Brown and Raphael C. Malveau and Hays W. McCormick III and Thomas J. Mowbray	Anti Patterns: Refactoring Software, Architectures and Projects in Crisis	***	0471197130
This is a book about all the awful things you should shun, and it gives them entertaining names, too: 'lava code' (old crud that accumulated in your classes), 'The blob' (putting the whole program old-school-style into one big class) etc. - it's actually useful.
Thomas H. Cormen and Charles E. Leiserson and Ronald L. Rivest and Clifford Stein	Introduction to Algorithms	***	0262032937
A standard algorithm textbook, covering all the bases. I would have liked Sedgewick's new edition better, but I make a lot of use of graph algorithms, and in that edition they are published in a separate volume that was not out at the time. This one is solid, although there are too many proofs and exercises taking up space for my taste.
Clinton Wong	Web Client Programming with Perl	***	156592214X
A nice little book that teaches you how to automate web page retrieval and processing with Perl's LWP. Also has a chapter on using pTk. I use a syscall to GNU's wget to do stuff like this; it's faster to put together for most small tasks. A bit dated.
Steven Feuerstein and Bill Pribyl	Oracle PL/SQL Programming	***	1565923359
Allegedly the best introduction to PL/SQL, this book is a mix of explaining the language features of that stone-age language and programming style 101. Probably PL/SQL is still the best if you want to suck out and work with data from your Oracle db in 2001. It's just such a mess syntactically and there's a bazillion exceptions to when you can use stuff and when you can't. The explanation of the language features is ok, especially hints when to use which feature, but I could have done without the programming style elaborations, which are better and more thoroughly treated in 'The practice of Programming' and 'Code complete'.
Jeffrey E. F. Friedl	Regular Expressions	***	3930673622
A thorough introduction to regular expressions, and the concepts behind them.
George Polya	How to solve it	***	0140124993
A classic book on mathematical problem solving. Reputedly, they hand out a copy to each new programmer at Microsoft Corp. It basically consists of a list of questions to ask yourself, and a detailed dictionary that explains each concept and each question. Good tools for problem solving thus are: analogy, generalization and specialization. Examining the data, the condition, and the unknown. Decomposing and reconstructing, changing or dropping parts of the condition or data. Looking for symmetry. Examining definitions. Working backwards. Introduction of good notation, and of supporting elements. Induction and indirect proof or reductio ad absurdum. Checking the result, trying to derive it differently. And perseverance. What you learn from this is that you need experience from lots of work, so that you will be able to see analogies and remember similar problems and how they could be attacked. A worthwhile book.
Sriram Srinivasan	Advanced Perl Programming	***	3897211076
('Panther Book'). Very nicely explains some advanced topics of Perl, like object-orientation and references, and is the book to read to learn advanced Perl after Learning Perl.
Donald E. Knuth	The Art of Computer Programming : Fundamental Algorithms	***	0201896834
This is an absolute classic, but I have to admit, I found it extremely hard and dry to read, and never really got into it. That the code is some kind of weird assembler instead of typical pseudocode also is not helping -- but that may just be my lack of genius for these things.
Nick Heinle	Designing with JavaScript: Creating Dynamic Web Pages	**	1565923006
This book is a mind-numbing pain to read, because Heinle wrote it for the bleeding programming novice. The same source code is repeated as often as three times, and each line is separately commented. Still it is a good book, since it has some usable cookbook-style scripts and explores some quite advanced topics towards the end. And it's a practical book for practical use of JavaScript in a multi-browser, bugridden world, and this makes it good - Heinle knows what is the important stuff and doesn't bother you with 300 pages of DOM documentation for methods you'll never use, like all the other books I've seen.
Larry Wall and Tom Christiansen and Randal L. Schwartz	Programming Perl	**	3930673487
('Camel Book') This is the standard perl reference. Good for looking things up, but not good for learning. Most of what it says is also found in the perl online documentation, you don't really need it.
Erich Gamma and Richard Helm and Ralph Johnson and John Vlissides	Design Patterns: Elements of reusable object-oriented software	**	0201633612
Cited in nearly every other book about object-oriented programming, this is probably the book that will teach you more about the subject than any other book out there. Unfortunately it's boring to read and a touch too academic for my taste. Still a must read.
Rainer Hellmich (eds.)	Einführung in intelligente Softwaretechniken	**	3827295467
A superficial introduction to finite automata, Petri nets, compilers, compiler-compilers, object-orientation, rule-based systems, expert systems, predicate logic, Prolog, fuzzy logic, genetic algorithms, other heuristic optimization methods and neural nets. All interesting areas, but the writing is only so-so.

2012-09-04

Writing

The Rule

Think of the reader.

Make it easy for the reader, and do not bore him. If the reader likes what he reads, he will continue reading. If he is bored, disoriented, or can't follow your thoughts, he will stop reading. Your effort is then wasted. And you show that you are inept or lazy. To see if your text reads well, read it aloud. And if possible have someone else read it and give you feedback.

The Structure

The structure of your work shall match its content. To understand, one must order and relate ideas. If you organized your ideas well, the reader can understand them without effort. If not, you instead put that burden on your reader, and he will not be amused.

One thought per paragraph. The paragraph is the basic optical unit of text. The thought is the basic conceptual unit. Keep them in parallel and build your text from logical units. Put several thoughts into one paragraph and you'll see how hard it gets to figure out what point you are trying to make.

Don't rip a thought apart. Keep to your thought until it is done. Avoid ellipses, insertions, asides, parentheses, footnotes. An average reader can keep seven words in mind without effort. He cannot know where you want to go and must follow you. Don't force him to hold your unfinished thought in mind through tiring false leads. You make your argument hard to follow and waste your reader's concentration.

Keep it short, remove chaff. Leave out the superfluous from your argument. Cut unnecessary paragraphs, sentences and words. You make your argument more clear and forceful, save space for important statements, and you save your reader's time.

Parallel form for parallel thoughts. Here form makes content easy to see. For long lists of parallel information, use a list.

Use sections. In longer texts, plain text paragraphs are not enough to convey the overall logical structure. Use chapters and sections to organize your text. Headings help to make them visible.

Main point at the start, emphasis at the end. The reader will always read the beginning of your text, so put your main message there. The start is distinct from the main body of text, so the reader remembers it more strongly. The same goes for the end, which leaves the last and strongest impression -- if he reads it.

No monotony. Finally, don't apply any rule without reprieve. One short, positive, active sentence after the other gets boring, too. Spice up your text by variety.

The Sentences

Put main points in main clauses. Avoid dependent clauses that carry the main point or an additional main point, carry on the action, or insert some unrelated thought. Use dependent clauses sparingly to break the monotony of chains of main clauses. Appending them is better than prepending them, and inserting them is bad.

Keep to one tense. Usually use the present tense. It feels uncoordinated and thoughtless if you switch tenses all the time.

Use positive statements. They are more specific than negative ones, and easier to understand. Above all, avoid double negations. "They don't make it less difficult." -- see?

Use active voice. Say "He showed", not "It was demonstrated". Action is more alive than passivity. The sentences will be shorter and easier to understand. It forces you to know the actor, too, so it helps you think things through.

Keep split-able verbs together. If you don't, the reader has to keep in mind half of the verb, while skipping over your sentence to find the other half. He will have less attention for what you say, and tire faster.

The Words

Short words. Short words are easy, long words are hard. It is that simple. The more syllables, the harder to understand. Short words often are less abstract than long ones, too. "Short words are the best and old words when short are best of all." -- Winston Churchill

Gripping, concrete words. Abstract words print no image in the reader's mind. Narrow, concrete words are more exact and less judging. Say hen, not chicken, say chicken, not poultry. Use pars pro toto to stay brief and vivid if you talk about general things.

Use action verbs. Verbs drive the action, they are stronger than nouns. Therefore, don't replace them with nouns if you can avoid it: don't say "hold a meeting", say "meet". It is shorter, too.

No adjectives. Adjectives are fat on lean sentences. Adjectives soften the impact of their noun. Get rid of them. Try cutting all adjectives from a text, and you will be surprised how much more toned it will be. Adjectives distract from the idea the noun expresses. If you feel that you need an adjective, try first to find a more fitting noun.

No fillers. Get rid of filler words like little, pretty, quite, rather, very and of phrases like in this context or in my opinion. They add no information and weaken the statement.

The Orthography, Grammar, Form

Your text has to be correct, before you can worry if it is gripping and easy to understand. In the age of spell checkers, this is no real concern. Spell checkers may not be perfect, but they come close enough. Use them. Even more than the syntax, the form is nothing to worry about anymore in the age of computer typesetting.

Closing Remarks

These hard-and-fast rules for non-fiction writing do not always hold, but break them consciously to achieve a certain effect, not out of laziness.

In scientific writing, abstraction, big words, and hard-to-follow sentences are all too common. People either think this is required to appear "professional", or have not understood their subject well enough, or just do not care. I think there is no substance to it: "If you can't explain it simply, you don't understand it well enough." --Albert Einstein

Many ideas in this document have been taken from "Deutsch für Profis", and partly from "The Elements of Style". Both books are excellent. To anyone who understands German I highly recommend the first one - it is more thorough and more entertaining to read.

References

[Sch00] Wolf Schneider. Deutsch für Profis. Goldmann, 1999. ISBN: 3-442-16175-4.
[Str79] William Strunk Jr. and E. B. White. The Elements of Style. 3rd edition. MacMillan Publishing Co., Inc., 1979. ISBN: 0-024-18200-1.

Tools

"Ein Mann, der recht zu wirken denkt,
muss auf das beste Werkzeug halten."
- J. W. Goethe

This page holds a list of useful tools, mostly free software, with short descriptions. The non-free entries are things that I have used.

Shell

One of the main problems when developing on Windows and Unix at the same time is the different shells. Therefore Cygwin has to be the best thing since sliced bread. Cygwin allows you to use the Unix toolkit (make, less ..) and command names (rm ..) on your Win32 box. I'm not using the Bash shell that is supplied with it, though, but the normal "MS-DOS" shell with filename completion active (by setting HKCU/software/microsoft/command_processor/CompletionChar to 9). It is not as smart as the Unix completion: it just grabs the first match it finds instead of asking about ambiguous ones. For testing it might be worth using bash and avoiding learning the DOS style. A comparison between syntax and toolnames for Windows/Unix shells.

Editors

Your editor is where you will spend a lot of your time at the keyboard - or maybe all of it if you are using Emacs. GNU Emacs. This is my editor. Probably the most powerful editor of them all, because it is fully extendable and programmable. For every need there is a specialized mode - you can see some that I use here. Unfortunately it has a steep learning curve and other key-bindings than standard Windows. Vi. The Unix editor. This comes with almost every Unix flavor, but is even more arcane to use than Emacs and not as easily customized. UltraEdit. This shareware editor is the best I have seen on Windows to date. It is extremely intuitive, user-friendly, configurable yet mighty. It has syntax coloring and a hex-mode, plus all the standard Shortcuts, Find in files, Regex, File-Compare etc. TextPad is also nice.

IDE/CASE Tool

Graphical CASE tools usually have an integrated default editor, which you might be able to switch to your favourite one. They integrate your work needs for editor, compiler, program execution, reference and debugging, and write some of the code for you. The single-step debugging is the main reason I'd use one. These tend to be commercial. JPadPro is shareware and the best and most honest IDE for Java I have seen to date. It just uses the JDK, does not do some voodoo behind the scenes you don't control, and is easily configurable, with comment folding and a very nice tree for your packages/classes. The only disadvantage is that you cannot use another editor. Nowadays, Eclipse is probably by far the most popular and powerful IDE platform, not only for Java -- and it is free.

Languages

Perl, the 800-pound gorilla of scripting languages. For small jobs gluing several programs together there are many mighty tools on Unix, like awk and sed. I use Perl to avoid learning them all, as you can do it all and more in Perl. It's also the favourite scripting language for CGI. And it's available on Windows, too.

Java. Sun's javac and java. Java is object-oriented and portable, with ample GUI support. It also used to be slow, but apparently not any more. I used to code Java a lot during my Ph.D. work.

Python. Python is a hell of a cool language. A fully object-oriented scripting language, complete with regular expressions, ultra-clean syntax and many built-in Lisp-like functions, it combines the best of Java, Perl and Lisp into one neat package. Here are some notes.

The shells. Often a small shell script can do what a perl script could. I find the syntax of sh annoying, so I have a cheat sheet.

Lisp/Scheme. It's built into Emacs, so knowing a little is useful if you use Emacs.

Source Control

Source control makes it possible to work with several people on the same source tree, or to back up to a state where everything worked. Together with bug tracking/change request software, you can develop in a coordinated way. A common system is CVS, which recently has been replaced more and more by Subversion.

Git on GitHub is another popular and more modern tool. I feel it is more complicated to use than CVS, but it has the advantage of making it easier to create and merge branches. Through this, it allows a review step before code is added to the central repository: you can make a branch, make changes, then ask a maintainer to merge it in (even when you have no rights to do so), and he will be able to diff and see what you did before it goes in.

Documentation

Good documentation is important, because it helps you to think about your code and helps you to understand it later. I normally use what the language supplies as default. Javadoc with Java, or Perl's POD (plain old documentation) are both O.K.

Build

The good thing about dedicated build tools instead of plain scripts is that they allow you to enter at different places in the process without commenting out code or changing your script, and that they figure out everything that needs to be re-done based on explicit dependencies.
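
The core idea of "figure out everything that needs to be re-done" can be sketched in a few lines of Python. This is only an illustration of the principle, not how make is implemented; the function name needs_rebuild, the two dictionaries and the integer timestamps are all made up for the example.

```python
def needs_rebuild(target, deps_of, mtime_of):
    """Return True if target is missing or older than something it depends on."""
    if target not in mtime_of:
        return True  # never built (or a missing source file)
    for dep in deps_of.get(target, []):
        if needs_rebuild(dep, deps_of, mtime_of):
            return True  # a stale dependency makes the target stale too
        if mtime_of[dep] > mtime_of[target]:
            return True  # dependency changed after the target was built
    return False


# Hypothetical project: app is linked from main.o, which is compiled from main.c.
deps_of = {"app": ["main.o"], "main.o": ["main.c"]}
mtime_of = {"main.c": 3, "main.o": 2, "app": 5}
```

Here main.c was edited after main.o was compiled, so both main.o and app count as stale, while main.c itself does not.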

make. There are multiple versions of this tool: the normal make, GNU make (which is the most widely used) and, under Windows NT, nmake. Since you can use make instead of nmake with Cygwin, and Windows has stuff like InstallShield, it's probably not worth learning nmake. There are big books about make, but for 80% of your needs you can get away with 20% of the syntax. Here is a simple annotated example of a makefile. I actually only use it under UNIX - it's more of a UNIX tool, I guess. Nowadays, Ant seems to be the build tool of choice for Java types. I did not find it offering much of an advantage, forcing you to write Java classes.

Debugging, Bug Tracing, Profiling

Each language tends to have its own tools. The default profiler that goes with Perl is pretty ok. I usually keep to Kernighan's advice and think about my data and how the error can occur, then use a few interspersed print statements to track it down.

Testing

Here, you have to construct your own test cases and scaffold. The main point is that you have to automate this, or you will not do it enough. To automate it, you have to write test input and output files and scripts that compare the output of a new version to the prepared one, to ensure nothing has changed (regression testing). Your test script should be silent as long as everything is ok, and only complain if errors are found. To do this you have to learn a bit of shell programming, mainly conditionals, loops, file comparison and length checks, and testing whether a program returned the ok signal. Modern languages like Python have embedded unit test support.
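
A minimal sketch of such a silent regression check, written in Python rather than shell. The function name check_regression and the one-line program under test are assumptions made for illustration; in practice the expected output would be read from a prepared file.

```python
import subprocess
import sys


def check_regression(command, expected_output):
    """Run command and return a list of complaints; empty means all is well."""
    result = subprocess.run(command, capture_output=True, text=True)
    complaints = []
    if result.returncode != 0:
        complaints.append(f"exit status {result.returncode}, expected 0")
    if result.stdout != expected_output:
        complaints.append("output differs from the prepared output")
    return complaints


if __name__ == "__main__":
    # Hypothetical program under test: a one-liner that prints a greeting.
    program = [sys.executable, "-c", "print('hello')"]
    for complaint in check_regression(program, "hello\n"):
        print(complaint)  # nothing is printed as long as everything is ok
```

The check looks at both the exit status and the output, so a program that crashes after printing the right text is still reported.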

Design and Modeling

I used Rational Rose, a commercial tool now owned by IBM for object oriented modeling in UML (and Booch etc, if you're so inclined).

Programming Style

Programming style, or "coding conventions" as it is sometimes called, is about how you code so that the code is easily readable and understandable by a human. It covers especially the part of making code readable that doesn't interest your compiler in the least: naming and formatting. Your compiler doesn't need whitespace and descriptive variable names to execute code correctly. But you do.

Basics

The most important element of programming style is consistency: do things the same everywhere, so that the same things do look the same and different things look different. If the same things look different, it becomes impossible to guess what something is or does just from the looks of it. Also, the code will look more complicated than it actually is.

Style always is a matter of personal taste. In the following, I lay out the style I have found most usable, with explanations what the advantages of this style are. Generally, there are three things to shoot for when it comes to style:

style that matches code logic. First and foremost, your style should match the logic of the code. Style is not about aesthetics, although good style tends to be aesthetically pleasing. You can draw great ASCII art within your code, but as long as it's not helping to show the code logic, it's useless.
style that is easy to maintain. If your style looks good but takes a lot of work and extra typing to keep it up, sooner or later you won't keep it up, and it's gonna be useless. Also, if making insertions, deletions or changes in your code leads to a lot of extra typing to keep the style intact, you won't do it. Or if you do, you are wasting your time. Finally, if you need to remember a lot of rules because the style is convoluted, you will forget some of them.
style that is easy to read. This is a less important point but still, if you can't easily read the code, the other two won't do you much good.

To be able to be consistent and achieve these goals, you have to decide about your style up front. This also has the added benefit that you won't be wasting time and brain power on how to format an expression later on, brain power and time you could much better use to think about your program or go out with your girl.

Being consistent also means that when you edit someone else's code, you should adopt his style. Don't waste your time on reformatting someone else's source code. Also, for some communities there are traditional styles which you should respect. For example, there are documented coding conventions for the Java language, or suggestions for friendly Perl style.

Good tools, as always, can make your life a lot easier when it comes to style, too. For example, Emacs has modes that handle the formatting and indenting of your code for many languages automatically. You can even choose between various styles. There are other tools, called beautifiers, that are specialized in doing just that, but of course it's better to have it built right into your editor. Also, a good editor should have syntax highlighting. It helps to give you a fast idea about which parts of a program are literals, which are comments, which are keywords and so on, and takes away the burden of expressing all that just by formatting. It will help to catch typos too, as a 'retrun' doesn't light up like a 'return'.

In my opinion, the most important elements of style are proper formatting and good names. Less important are comments. So let's get to the details:

Formatting

Most of your lines should be shorter than 80 characters. Ancient computer equipment doesn't handle more than 80 characters per line very well, and long lines are also hard to read. If 80 characters are not enough, most of the time you have too much stuff on one line or too many nesting levels with indentation. You should improve your code by putting part of it into a separate function.

Use lowercase letters or a natural mix of lower and Upper case. DO NOT SHOUT IN ALL CAPS. Shouting clutters up your screen and makes the text harder to read. In the days of syntax highlighting, there is no need to capitalize language keywords anymore.

Declaration

Declare any variables right on top of the scope for which they'll be used, so you know what is used there. Do not declare them on a higher scope than is needed, so everything that belongs together sticks together. This makes the code easier to understand and easier to maintain, as it is hard to insert some code later on that accidentally changes the value of the variable before it is used.

In practice this means declare class variables at the top of your class, local variables at the top of your function, and temporary help variables at the top of their loop. If you use comments, put them on your declarations, they are much more helpful here than comments on control constructs. Initialize variables for classes where the class is initialized, for local functions at the top, if possible during declaration. Put only one statement or declaration per line. It encourages commenting for declarations, and makes complicated statements look complicated, while simple ones look simple.

White space

White space can be used to show parallelism and grouping and to ease up dense statements. If in doubt, apply liberally. I put a single space

after keywords like while, if, for. But not after function names. This helps to clearly distinguish them.
after commas. This helps to tell parameters apart.

I put an empty line

before comments. Also not so needed with syntax highlighting
between groups of statements. This makes the statements in a group appear to belong closer together than to statements from another group, for example groups of declarations

I put two empty lines to separate functions. By doing so instead of using a line of fancy comment (like /*********/), you still get the optical separation between functions, without the typing work. Also, you then can use a fancy comment to group functions within a file, for example parsing functions, printing functions etc.

Code legibility is increased by indenting parallelism. Be careful not to align stuff just because it looks neat. Alignment should only be used for assignments that have functional coupling. Compare

         my $filename =    $args{PATHNAME};
         my @names    = @{ $args{FIELDNAMES} };
         my $tab      =    $args{SEPARATOR};
with
         my $filename = $args{PATHNAME};
         my @names = @{$args{FIELDNAMES}};
         my $tab = $args{SEPARATOR};

[Example by Tom Christiansen]

Indentation

Indentation is used to show two things: first, indented code can mean the line continues the statement of the last line, whereas nonindented code would suggest the start of a new statement. Second, indented code can mean that it is at a deeper nested scope, for example the contents of a while loop. There are common indentation styles, among them:

1. K&R/One True Brace
while (condition) {
    dependent code;
}
2. BSD/Allman
while (condition)
{
    dependent code;
}

3. "Trueblock"
while (condition)
    {
    dependent code;
    }

4. GNU
while (condition)
  {
    dependent code;
  }

I am a proponent of style one (which has been called K&R style after Kernighan and Ritchie who use it in their classic book about C. It also has been called the One True Brace Style, read all about it in the Jargon file). It has the advantages that it doesn't "waste" vertical space by putting the opening brace on a separate line and it has a cool name. If you're coding in Java, it also is compliant to the Java coding convention.

This style can lead to a problem when you have a long conditional statement that you have to line break. The dependent code would use the same indentation as the conditional in this case, which would make it hard to see where the conditional ends. You can avoid this by (a) not indenting the second line of the conditional, or (b) indenting it twice instead of once. Both solutions are kludgy. Or just use the next style for these cases.

Style two, called Allman or BSD style, avoids this problem (see below), since the blank line introduced by the opening brace sets conditional and dependent code apart. It keeps the nice feature of aligning the closing brace with the statement, as in K&R style. These two are the most often used styles.

Style three is suggested by Steve McConnell in his otherwise excellent book. In his opinion, since the whole block is dependent on the condition, the whole block should be indented. I've never seen that used. Style four is a strange mix between this and BSD-style, and is used as the official style of the GNU project.

The notion that saving vertical space is good is based on the assumption that you can better understand the code by having more of it on your screen at the same time. This is wrong. You'd get the maximum amount of code on one screen by doing away with indentation and whitespace altogether. Saving that one line will just make your screen look more crowded. It will actually make it harder to read the code and get an overview. I realized this when I was wondering why the code of a colleague looked so much more clear and orderly than mine, even when he was sloppy about formatting. He was using Allman style and I wasn't.

$my_list = (item one,
            item two,
            item three
           );

The form above, endline indentation, is practically only used for declarations. I don't use it, as the indentation depends on the length of the name and thus makes each declaration look different. Also, for long names, it pushes everything to the right border, and it's harder to maintain. I either fill the first line, break, indent once and fill the second line, or, if there are a lot of arguments, directly break and put one per line, indented:

boolean insert_animal_record (
    Animal type,
    String name,
    integer age,
    boolean furry)
{
  ...

The decision how far to indent is a tradeoff between lost horizontal space and clear separation of nesting levels. With large indents you wander off the right side of the page fast; with small indents it becomes more difficult to visually keep different levels separate. I feel that a 4-column indent is enough to keep levels apart and also saves you typing if you are typing spaces instead of tabs. Never use tabs, because when someone else opens your code with an editor that has another tab-width set, nothing looks like it used to any more.

Expressions

Use the natural form for expressions and function calls, like you would do in normal speech. Good code reads almost like English text. Negations are hard to understand in conditionals: go for the positive form. Use extra parentheses to resolve ambiguity, and break up complex expressions into simpler, smaller ones - better yet, put them in a small subroutine. Clarity is much more valuable than cleverness. Be careful with side effects. Use else-ifs for multiway decisions instead of nested if statements, keeping the flow of tested conditions easy and indentation low. Use character constants, not numbers. To test for ranges, even better use library functions. [tpop]
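
As a small illustration in Python (the function name classify_char is made up for this example), a flat chain of else-ifs with library functions for the range tests might look like this:

```python
def classify_char(ch):
    """Classify a single character with a flat if/elif chain."""
    if ch.isdigit():      # library function instead of comparing character codes
        return "digit"
    elif ch.isalpha():
        return "letter"
    elif ch.isspace():
        return "space"
    else:
        return "other"
```

Every test is in the positive form and sits on the same indentation level, so the order of the decisions can be read top to bottom.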

Nested expressions with parentheses are difficult to indent so that you can read them easily. Here is my method: put operators leading on new lines, indented so they are below the opening parenthesis. This way, all statements on the same level are on the same indentation depth. Don't do so for the leaf lists if it's not necessary. Don't put closing parentheses on new lines. If the expression is within a conditional (which often is the case) you may want to put the opening brace of the conditional on a separate line, to get an optical setoff from the indented expression - even when you use K&R style.

if ((a and b)
   or (c
      and (d or e)
      and (f or g)))
{
    make_it_so();
}

When an expression is too long to fit on one line, there is the recommendation to break it so it is clear it does continue on the next line - after a comma or operator. Unfortunately, this is inconsistent with the previous method, and will not work well to show nesting. So better do it like this:

if (horkingLongStatementWhichEvalsToBooleanMaybeWithLotsaParenthesesToo
   || (theNextHorkingLongStatementWhichEvalsToBoolean
      && soOnWithAnotherHorkingLongStatementWhichEvalsToBoolean))
{
    doThis();
}

The preferred form of the ?: operator is similar:

variable = (condition which usually is long and has some comparison operator)
  ? it_was_true()
  : no_it_wasnt();

Naming

Good source code should be self-documenting. The easiest way to achieve this is well chosen names.
Always give names to strings and to magic numbers (anything other than a 0 or 1).[tpop] By doing so, you can change all instances of the number in one place, and conditionals using symbolic names are easier to read. Compare:

/* Check if the record is valid */
if (record == 0)
if (record == IS_VALID)
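
The same idea as a runnable Python sketch. The constant names and their values are hypothetical; the point is only that each number is defined in exactly one place and the conditionals read like text.

```python
# Hypothetical codes and limits; the values are made up for illustration.
IS_VALID = 0
MAX_RETRIES = 3


def is_valid(record_status):
    """Return True if the status code marks a valid record."""
    return record_status == IS_VALID


def should_retry(attempt_cnt):
    """Return True while the number of attempts is below the limit."""
    return attempt_cnt < MAX_RETRIES
```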

Variables

In your name, you can try to convey three different things:

what the variable stands for: this is the base name.
the scope of the variable: is it global, packagewide, or local?
the type of the variable: is it a parameter, a constant, a user-defined type or a native type?

For the base name, use descriptive names for globals, shorter names for locals. Be consistent: use the same way to name things everywhere. Be accurate: don't use misleading names. Names should always be application-specific; don't describe the coding construct. Use @fruit instead of @list. Choose names so that they can be easily read like English text.

Base names should have a length between about 8-20 characters. Shorter names tend to be not very descriptive, longer names eat up too much space and are tiring to read. Since a name should describe what it stands for, a complicated subroutine that tries to do too many things will call for a very long name. This is a hint that you should redesign the subroutine, not cheat on the name. Sometimes long names cannot be avoided. In those cases there are several techniques to shorten them:

dropping nonleading vowels and double consonants (customer_application_form => cstmr_aplctn_frm). These tend to be hard to pronounce, which sucks when you have to talk about them with someone else.
cutting off after a certain number of characters (customer_application_form => cust_appl_form). This is better because it is pronounceable.

Just keep in mind it's never worth shortening a name when you save only one character by doing so.

The methods you can use to modify the base name are different capitalization (if your language allows it), and pre- and postfixes. For capitalisation, there is often a tradition in naming. For example Java uses internalCaps, Perl prefers internal_underscore. Underscore makes reading a wee bit easier, but takes one character more. Use whatever is commonly used in the given language. Here are the conventions I use as defaults:

ALL_CAPS for constants.
Firstcaps for type names or name_t for type names if the conventions of the language mandate it. I do not use hungarian notation to indicate language types.
innerCaps or no_caps for local object names or variables.
varNameG or var_name_g for global variables.
p_name_p or pName for parameters, although I do not indicate function parameters.
namep for Boolean predicates (taken from lisp)

Postfixes are better than prefixes, because they let you see the important thing first and can be autoexpanded. Common uses for postfixes are qualifiers, for example userCnt, userIdx. It also helps to always use the same pairs. I use:

idx, cnt for index and count. (Don't use num or number as a qualifier, since it is never clear if it refers to the overall count or the current index.)
beg, end for beginning and end. As array slice indices, beg points to the first element, end after the last, so you can use the canonical while (array_idx < array_end)
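
A short Python sketch of the beg/end convention, where end points one past the last element. The function names sum_slice and slice_cnt are hypothetical and chosen to match the postfix pairs above.

```python
def sum_slice(values, beg, end):
    """Sum the elements from index beg up to, but not including, index end."""
    total = 0
    idx = beg
    while idx < end:  # the canonical loop condition for this convention
        total += values[idx]
        idx += 1
    return total


def slice_cnt(beg, end):
    """Number of elements between beg (inclusive) and end (exclusive)."""
    return end - beg
```

With this convention the count is simply end - beg, and an empty slice is the case beg == end, so no special handling or off-by-one correction is needed.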

Subroutines

Name subroutines so that their calls read well in English. Use action names. `Procedure names should reflect what they do; function names should reflect what they return.' --Rob Pike. Don't use abstract words like process_stuff, transform_input or handle_event if possible, because you can't tell from such a name what the subroutine does. Again, if there is no good name, probably it's a bad subroutine.

Functions (subroutines that return a value) are named after what they return. Example: getCanonicalVersion();.
Predicate functions (subroutines that return a boolean value) are named with `is', sometimes with `does', `can', or `has'. Example: is_canonical() is better than canonical() for the same function, because it reads well in a conditional: if (song.is_canonical())
Procedures (subroutines that don't return a value) are named after what they do. Example: canonize(Song song); A classic here is setName(String name);
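
The three naming patterns side by side, as a Python sketch. The Song class and its notion of a "canonical" title (single spaces, capitalized words) are invented purely for this example.

```python
class Song:
    """Hypothetical class, used only to show the three naming patterns."""

    def __init__(self, title):
        self.title = title

    def is_canonical(self):
        # Predicate: named with 'is', returns a boolean.
        return self.title == self.get_canonical_title()

    def get_canonical_title(self):
        # Function: named after what it returns.
        return " ".join(self.title.split()).title()

    def canonize(self):
        # Procedure: named after what it does, returns nothing.
        self.title = self.get_canonical_title()
```

A call site then reads almost like a sentence: if not song.is_canonical(): song.canonize()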

The notation this2that() is well established for conversion functions or hash mappings. Hashes usually express some property of the keys, and are used with the English word `of' or the possessive form. Name hashes for their values, not their keys, e.g. $color{"apple"}. [Tom Christiansen; this does not work as well for object notation: color.get("apple");]
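
In Python, where dictionaries play the role of hashes, both conventions might look like the sketch below. The names color_of and celsius2fahrenheit are examples, not taken from any library.

```python
# Named for the values it holds, so a lookup reads as "color of apple".
color_of = {"apple": "red", "banana": "yellow"}


def celsius2fahrenheit(celsius):
    """Conversion function named in the this2that style."""
    return celsius * 9 / 5 + 32
```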

Comments

Don't belabour the obvious, uselessly duplicating the source code on an atomic level. Comment entire blocks, not single lines. Comments as a brief summary should give insight in what happens, the intent of the code. Eschew gaudy block banners, they just take a lot of maintenance work and are not needed any more with syntax highlighting. Use comments for classes, functions, global variables and constants. `Comments on data are usually much more helpful than on algorithms.' (Rob Pike)

Try to avoid commenting code where something tricky happens. Rather repair the code so it is not tricky any more. `Basically, avoid comments. If your code needs a comment to be understood, it would be better to rewrite it so it's easier to understand.' (Rob Pike) Don't comment bad code, rewrite it. [tpop]

A valid comment is an overall introductory paragraph that explains what happens, and interface documentation for your library routines and packages, so the user can use them without looking at the source code. There are tools for doing this user or API documentation, for Perl programs, you can provide a manpage in POD (look at the Perl documentation for examples), for Java an API with javadoc.

Valid input formats and output. Functions which parse messy input should at least contain some example of it.
Environment (What does it need to run)

Normalization

This subject is borderline for a styleguide. It tells you that you never should do cut-and-paste programming. Rather, if you need the same functionality at two places in your code, put it into a small subroutine and call that. If you have two very similar routines, merge them into a parametrized one. Creating small, simple subs is easier, more foolproof and makes you feel like you code Lisp. Well, not quite. In any case it improves your code. It makes the code shorter, easier to maintain as you only have to change one place, easier to expand as you can call all the little general routines, and somehow easier to read. Using well-named functions makes their calls read nearly like English text. The disadvantage is that when you really want to know what the program does, you'll have to scroll a lot chasing down all these little subroutine rascals.

You also can put complicated boolean tests into subroutines, even if they are called only from one place so one could put them inline without losing normalisation, just to make the code in the calling routine easier to read. It has a similar effect as writing a comment to explain what the condition does, using the subroutine name as the comment.
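
A familiar case, sketched in Python: the leap year rule is a complicated boolean test, and giving it a name makes the calling code explain itself. Both function names are chosen for this example.

```python
def is_leap_year(year):
    """Name the condition once instead of spelling it out inline."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def days_in_february(year):
    # The call reads like a comment explaining the condition.
    if is_leap_year(year):
        return 29
    return 28
```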

SHS SQL style

Use internal_underscore. 4-column indent, BSD-style.

_cur  Cursor variable
_row  Row variable
_rec  Record variable
trg_  Trigger name
pck_  Package name
i,j   Simple loop counters

these are helpful to guard vs clashes in SQL statements, but suck
because you always have to type i_ or something like that in addition
to the name so they eat up space and clog up your screen

i_    in     parameter name
o_    out    parameter name
io_   in/out parameter name
v_    local variable name
t_    type name
c_    constant name (usually global and type defined)
(g_   global variable name)


In triggers prefix with subset of
b     before
a     after
i     insert
u     update
d     delete
s     statement

Method names
set_   set, overwriting any existing state
fill_  set, if empty
is_    boolean test

In declarations simple ones first.

Function Header Template (translated from German):

--Install under schemaname
create or replace package pck_ as
/** Summary sentence - Frank is a lazy dog.

Description.

<img src="pckname_img.gif">

Schema: schemaname

$Logfile: $
$History: $

%version $Revision: 51 $ $Date: $
%author  Frank Schacherer, SHS
*/

References

Code Complete.
The Practice of Programming.

Coding

The art of Computer Programming as seen by other people with experience.

Start with something that's simple and useful, and that gets real work done now. Learn from the experience of doing that work. Enhance if/when needed. Don't worry about defining the ultimate cosmic architecture. The roadway is littered with the corpses of ultimate cosmic architectures.
--Jon Udell

General

Be predictable.
Data make code.
Normalize data & code.
Write a spec, then implement.
Separate frontend from library.
Make libraries general.

Some of the important fundamentals for code in my opinion.
Principle of Least Astonishment. This principle states that: A system and its commands should behave the way most people would predict, that is, the system should operate with "least astonishment."[Mike Sperber]
The main problem you face when programming bigger systems is handling complexity. Most software is far too complicated for a human mind to understand whole. So one needs tricks to handle this complexity. One needs to get help for what the brain cannot handle.[ooa/d]
Using diagrams as thought and design aids helps thinking and designing tremendously, as man is an optical animal. But unfortunately software is "invisible", i.e. impossible to draw, even in 3 dimensions, as a whole. There are just too many layers and views. [mmm]
Abstraction is what allows humans to handle the huge complexity of the world and life. You ignore all details about something and just concentrate on the few things that are important. (Of course the trick is in determining what is important.) Abstraction allows you to ignore the complicated inner workings of a piece of code and just use its simple interface for interacting with other code. [wb, mmm]
It is often simpler to solve the general problem instead of the specific one. For example, solving some algorithm for n cases and then applying it to 73 is often simpler than solving it for 73 cases directly. The reason is that to solve it in general one has to understand what is immutable about the problem, and what are just parameters that can change. [pps]
Always try to separate what is general about your problem from what is special. Then decompose your problem into a library of general tasks (backend) and the special case logic needed for gluing these together and interfacing to the user (frontend). This way you get a more elegant, clean system and can reuse the general components for something else. E.g. the script for making the external version of this homepage has to walk the tree of files and dirs, removing unwanted files and links to them. Also, you have to make a copy of the old homepage first, so you will not lose the full version. You could do all that in one big script, which then could be used for nothing else. Decomposing it you get:

a component to make a copy of a file tree. (this exists - use a system xcopy).
a component that walks a tree of files and dirs, invoking a function on each file - or better each file with a name fitting a regex, where .* would do it on all files. (this exists - use Perl's File::Find, which provides find and finddepth).
a function that finds tags matching a regex, replacing the found tags, here with nothing.
a function that deletes a file matching the regex.
a config file, which can contain the list of regexen, so you can define them once then reuse it. you can use a perl source here and load it with do or require;
an interface script, that takes as parameters the source and target dirs and the list of regexen and maybe something like a verboseness flag, or a non-default name for the kill list.

When working with perl or shells, most elemental operations already exist to be glued together, like the xcopy or File::Find in our example. For number two and three, you could write a small library that maybe later can be expanded with other useful functions for batch processing of tree files. Each time there is some amount of glue code that has to be written and will not be reusable.
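
The second component, a tree walker that invokes a function on every file whose name fits a regex, can be sketched in Python with the standard os.walk. The function name walk_matching and its return value are assumptions for this example; the text above uses Perl's File::Find for the same job.

```python
import os
import re


def walk_matching(root_dir, name_pattern, action):
    """Call action(path) for every file under root_dir whose name matches."""
    name_re = re.compile(name_pattern)
    visited = []
    for dir_path, _dir_names, file_names in os.walk(root_dir):
        for file_name in sorted(file_names):
            if name_re.search(file_name):
                path = os.path.join(dir_path, file_name)
                action(path)
                visited.append(path)
    return visited
```

Because the action is passed in as a parameter, the same walker serves for deleting files (pass os.remove), for editing them, or just for listing them (pass print). Only the short glue call is specific to one task.
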
Data structures are more important than code.

        "The programmer at wit's end for lack of space can often do best
        by disentangling himself from his code, rearing back, and
        contemplating his data. Representation is the essence
        of programming." [mmm]

Rob Pike says: `Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming. {This is SO true} (See Brooks p. 102.: "Show me your flowcharts and conceal your tables, and I'll continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they'll be obvious.")' Capture regularity with data, irregularity with code. (Kernighan) If you see similar functionality in two places, unify it. That is called a `subroutine'. Consider making a hash of function pointers to represent a state table or switch statement.[Tom Christiansen]
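
A hash of function pointers is a dictionary of functions in Python. The sketch below replaces a switch statement over commands with a table lookup; the command names, the handlers and the state strings are all invented for illustration.

```python
def start(state):
    return "running"


def pause(state):
    return "paused"


def stop(state):
    return "stopped"


# Hypothetical command names mapped to their handler functions.
HANDLER_OF = {
    "start": start,
    "pause": pause,
    "stop": stop,
}


def next_state(state, command):
    """Look up the handler in the table; unknown commands keep the state."""
    handler = HANDLER_OF.get(command)
    if handler is None:
        return state
    return handler(state)
```

Adding a command now means adding one entry to the table. The regular part lives in the data, and next_state itself never has to change.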

Schedule

How does a program get to be a year late? ... One day at a time.
So use a chart for planning your milestones. Even better, use inchpebbles, as they are more directly achievable, shooting for a daily goal. And use exact milestones, no "90% done", "nearly debugged" etc. - programs are in that state for half of their development time! Never miss two deadlines in a row; it kills morale. [mmm]

Documentation

Communicate in many ways: informally, formally in meetings, via a shared workbook or via email. A project workbook documenting the work done should be built as WWW pages. All documents produced should be part of this workbook. Only interfaces should be exposed; you see only your own internal implementation. [mmm]
The critical documents for a software project are: objectives, user manual, internal documentation, schedule, budget, organization chart. Writing these also focusses thought by forcing hundreds of mini-decisions. [mmm]
Documentation inside the source code is the best way, as it is folly to try to keep independent files in sync. This is true for every kind of code and data. If you have it redundantly in two places, you will change it in one of them, and forget or be too lazy to change it in the other. Then the system becomes messy and unstable. Keep it in one place! Good source code documentation, preferably literate programming, makes it easy to keep the documentation up to date with the source and forces you to think about what you are doing!

Style

The source code should be self-documenting. Several points can be used for this, and the way they are used makes the style of a programmer:

well-chosen names

Use descriptive names for globals, short names for locals. Be consistent (use the same way to name things everywhere) and accurate (don't use misleading names). Name functions so that their calls read well in English, and use action names.
Use ALL CAPS for constants, Firstcaps for class names or global variables, firstLower for object names or local variables. Often, for short local variables lowall is even preferable.
Functions are named after what they return, procedures after what they do. For example, predicate functions should usually be named with `is', `does', `can', or `has'. Thus, isReady() is better than ready() for the same function. Therefore, use canonize() for a void function (procedure), getCanonicalVersion() for a value-returning function, and isCanonical() for a boolean check.

boolean isActive() => if (cat.isActive()) { ...
String getName() => default to get obj value
void setName(String name) => default to set obj value

The abc2xyz() form is well established for conversion functions or hash mappings. Hashes usually express some property of the keys, and are used with the English word `of' or the possessive form. Name hashes for their values, not their keys, e.g. $color{"apple"}. [Tom Christiansen] This does not work as well for the Java Collection classes: color.get("apple"). `Procedure names should reflect what they do; function names should reflect what they return.' --Rob Pike. Give names to magic numbers (anything other than a 0 or 1). [tpop]

expressions

Use the natural form for expressions, like you would in normal speech. Negations are hard to understand in conditionals: go for the positive form. Use extra parentheses to resolve ambiguity, and break up complex expressions into simpler, smaller ones. Clarity is much more valuable than cleverness here. Be careful with side effects. Use else-ifs for multiway decisions instead of nested if statements, keeping the flow of tested conditions easy and the indentation low.
Use character constants not (encoding-dependent) numbers. To test for ranges, even better use library functions. [tpop]
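A small Java sketch of both points: an else-if chain for a multiway decision, and library functions instead of numeric character codes. The class CharClassifier and its category names are invented for this example.

```java
final class CharClassifier {
    // One flat else-if chain; every condition is in positive form.
    static String classify(char c) {
        if (Character.isDigit(c)) {
            return "digit";
        } else if (Character.isLetter(c)) {
            return "letter";
        } else if (Character.isWhitespace(c)) {
            return "space";
        } else {
            return "other";
        }
    }
}
```

Compare this with a test like c >= 48 && c <= 57, which only works if you know the encoding and silently misses non-ASCII digits.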

declaration, initialisation

Declare any variables right on top of the scope for which they'll be used, so you know what is used there. Do not declare them on a higher scope than is needed, so everything that belongs together sticks together.
In practice this means declare class/global variables at the top of your class, local variables at the top of your function, and temporary help variables at the top of their loop.
Initialize variables for classes where the class is initialized, for local functions at the top, if possible during declaration.

indentation

Always use the same source code formatting and naming scheme; this saves a lot of time you would otherwise spend thinking about them. Code legibility is dramatically increased by consistency and indenting parallelism.
Compare

         my $filename =    $args{PATHNAME};
         my @names    = @{ $args{FIELDNAMES} };
         my $tab      =    $args{SEPARATOR};

with

         my $filename = $args{PATHNAME};
         my @names = @{$args{FIELDNAMES}};
         my $tab = $args{SEPARATOR};

[Tom Christiansen]
Indentation is used to show two things. First, indented code can mean the line continues the statement of the previous line, whereas non-indented code would suggest the start of a new statement. Second, indented code can mean it is at a deeper nested scope, for example the contents of a while loop.
This can lead to a problem when you have a long statement followed by deeper nested dependent code, which would use the same indentation. You can avoid this by (a) not indenting the second line of the long statement, or (b) indenting it twice instead of once.
I use standard Java layout. Using the generally accepted idiom makes it easier for you to read others' code, too. I indent for easy readability as described there, or by a single tab. The only exceptions are long boolean concatenations in conditional clauses. There I use:

if (horkingLongStatementWhichEvalsToBooleanMaybeWithLotsaParenthesesToo
|| theNextHorkingLongStatementWhichEvalsToBoolean
&& soOnWithAnotherHorkingLongStatementWhichEvalsToBoolean) {
        doThis();
}

The preferred form of the ?: operator is

variable = (condition which usually is long and has some comparison operator)
                   ? it was true
                   : no it wasn't;

I also do not indent by a tab in class bodies, as you then lose an automatic extra four characters on every line.

comments

Don't belabour the obvious, uselessly duplicating the source code on an atomic level. Comment entire blocks, not single lines. As a brief summary, a comment should give insight into what happens (or point out that something tricky happens). Eschew gaudy block banners. Use comments for classes, functions, global variables and constants. Don't comment bad code, rewrite it (!). [tpop]
`Comments on data are usually much more helpful than on algorithms.' (Rob Pike) `Basically, avoid comments. If your code needs a comment to be understood, it would be better to rewrite it so it's easier to understand.' (Rob Pike)
Especially important is the introductory paragraph that explains what happens, best done as a manpage, containing:

NAME and Purpose
SYNOPSIS, Syntax
DESCRIPTION (plain text)
Options
Valid input formats and output. Functions which parse messy input should at least contain some example of it.
Environment (What does it need to run)

In Perl use POD, in Java use Javadoc.
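As a sketch of such an introductory comment in Javadoc form: the class ConfigLine and its key = value input format are assumptions made up for this example, not part of any real library. Note that the comment contains an example of the input it parses.

```java
final class ConfigLine {
    /**
     * Splits one configuration line into its key and its value.
     *
     * <p>Valid input looks like {@code color = red}. Whitespace around the
     * key and the value is ignored; the value may be empty.
     *
     * @param line a single line of the form {@code key = value}
     * @return a two-element array holding the key and the value, in that order
     * @throws IllegalArgumentException if the line has no '=' or an empty key
     */
    static String[] parse(String line) {
        int separator = line.indexOf('=');
        if (separator < 0) {
            throw new IllegalArgumentException("ConfigLine.parse: no '=' in \"" + line + "\"");
        }
        String key = line.substring(0, separator).trim();
        String value = line.substring(separator + 1).trim();
        if (key.isEmpty()) {
            throw new IllegalArgumentException("ConfigLine.parse: empty key in \"" + line + "\"");
        }
        return new String[] {key, value};
    }
}
```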

Basic Algorithms and Datastructures

Searching

In ordered arrays: binary search, O(log n). In balanced search trees: recursive search, O(log n); a heap only gives quick access to its top element. In unordered arrays: linear search, O(n).
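A minimal binary search over a sorted int array, as a sketch in Java. The class name Search is arbitrary, and for real code the library's java.util.Arrays.binarySearch does the same job. Each pass halves the remaining range, which is where the O(log n) comes from.

```java
final class Search {
    // Returns the index of target in the ascending array, or -1 if it is absent.
    static int binarySearch(int[] sorted, int target) {
        int low = 0;
        int high = sorted.length - 1;
        while (low <= high) {
            // Written this way to avoid integer overflow of (low + high).
            int middle = low + (high - low) / 2;
            if (sorted[middle] < target) {
                low = middle + 1;
            } else if (sorted[middle] > target) {
                high = middle - 1;
            } else {
                return middle;
            }
        }
        return -1;
    }
}
```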

Sorting

Quicksort (on average), Heapsort, Mergesort: O(n log n). Simple sorts: O(n^2).

Growing Arrays (Vectors)

Linked Lists

Trees

Hash Tables

Use a versioning system like CVS to document version and change development. [mmm]

Good Programmers

35. Everyone can be taught to sculpt: Michelangelo would have had
  to be taught how not to. So it is with the great programmers.

-Alan Perlis
Sharp professional programmers are ten times as productive as normal ones. A small sharp team is best - as few minds as possible. Such teams created nearly all of the great software that exists (Unix, Linux, C, Perl, Java, etc). A team of two, with one leader is often the best use of minds. The really good programmers have strong spatial senses and usually geometric models of time. Often they are also talented with words and music. And you can use them to clean your furniture, too. [mmm]

Idioms

You can learn the idioms used by others from their source. It usually is the best way to learn. HTML (with View Source) and Perl (where you can look at the lib modules) are great for this reason.

System Design

The essence of design is to balance competing goals and constraints. Issues in design are Interfaces, Resource Management and Error Handling.
Conceptual integrity is the most important thing in system design. Ease of use for both simple and difficult problems is the design's ultimate test. Separation of architecture from implementation is a powerful way to get conceptual integrity. External provision of an architecture enhances, not cramps, creativity of implementation. Architecture and implementation can develop in parallel. [mmm]
Do the same thing the same way everywhere. [tpop] This is very true. To be easy to understand and use (also for yourself), things should work just as you expect (or guess) them to. By doing everything the same way every time, once you know how it works in one case, you know it for all cases (see also: getting into a rut early). You know what to expect, you don't have to look things up, and it all seems simple. If you do it differently every time, you cannot expect anything, and the system is hard to understand, seems complicated, and it's easy to make errors. Consider for example how programs that do not use the standard key bindings (like ^X and ^V for cut and paste) on Windows suck, or how irritating it would be if some command line programs took their params after the target file(s) and some in front, or if a method is called getURL() in some classes, getUrl() in others and geturl() in yet others.
Differentiate between frontend and backend. The frontend interfaces with the user, and here you set all parameters. The backend consists of libraries. They need clean interfaces. They do not set any defaults other than elementary ones like zero, null or stdout, and receive everything else handed down from the frontend (or from a config file specified there). Imagine the library becomes part of some larger program whose specification changes over time. Program it so that it works robustly for any input.
The best way to develop a system is to grow it, like a plant. Start out with a small, simple core, get it working. Then, slowly bud part after part, growing it more complex over time. All the time you have a working system.
Even better than developing it yourself: buy prefabricated shrink-wrap components someone else developed. This is the major thrust of object orientation, with its APIs, frameworks and office packages. So the vendor put in thousands of hours of development, and you can use the results just like -snap-. High level languages isolated programmers from the "accidental" complexities and problems of machine language and thus enhanced productivity manyfold. They also killed a host of opportunities to produce bugs. This is abstraction: moving the language farther from the drudgery of the machine and nearer to the ideas a programmer wants to express, the things he wants to model. Components do the same on the next level: moving from atomic high level language expressions up to the objects the programmer thinks about, be it URLs, spreadsheets or parsers. [mmm]
The most important thing is to look at the data you have to work with and how it is structured. This should guide your choice of data structures and interfaces, and if these are chosen fittingly, expansion will be easily accommodated.

A library for others

You will be one of those others if you want to reuse that code in a month or even a week from now.
Interfaces. What services and access are provided? Hide implementation details. (Aka "encapsulation", "abstraction", "information hiding", "modularization"). Avoid global variables; wherever possible pass data as function arguments. Avoid publicly visible data, except for class-defined constants that parametrize the class's behavior when used as arguments in function calls. Choose a small, orthogonal set of primitives. Resist the urge to provide multiple ways of doing the same thing - it is harder to maintain and learn. Prefer narrow interfaces, which do one thing, but do it well. Don't fix the interface when the implementation is broken. Don't reach behind the user's back. Libraries should not write secret files or change global data, nor modify data in their caller. Make interfaces self-contained (or at least be explicit about which external services are needed, to avoid placing the maintenance burden on the client). Try to return as much useful information as convenient from function calls, in a form that is easy for the caller to use.
Interfaces are used to hide implementation detail. Use a specification to make clear what the object implementing the interface does (assumptions, input/output). The best approach is to write the spec early and revise it as you learn from the implementation.
Resource Management. Who manages limited resources like files and memory, and how are shared copies of information handled?
Free a resource on the same layer that allocated it. If you have a routine working on files, either open and close them in there, or take a handle to an open file and do not close it in there. Errors and misunderstandings about shared responsibilities are a frequent source of bugs. [tpop]
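The rule about freeing a resource on the layer that allocated it can be sketched in Java like this. LineCounter and its two methods are hypothetical; a StringReader stands in for a file so the example stays self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class LineCounter {
    // Takes a handle to an already open reader and does NOT close it:
    // whoever opened it is responsible for closing it.
    static int countLines(BufferedReader reader) throws IOException {
        int count = 0;
        while (reader.readLine() != null) {
            count++;
        }
        return count;
    }

    // Opens and closes the reader on the same layer (try-with-resources
    // closes it even if counting fails).
    static int countLinesIn(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return countLines(reader);
        } catch (IOException e) {
            throw new UncheckedIOException("LineCounter: reading text failed", e);
        }
    }
}
```

Neither method has to guess whether the other one closes the reader, which is exactly the kind of shared responsibility that otherwise breeds bugs.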
Initialization of objects and fields is also a resource question. Data that is to be shared by function calls to an object should be stored in a class variable and preferably be assigned upon initialization. (To differentiate further: in Java this should be done by direct assignment for default values instead of in a constructor, which should be reserved for setting values given as parameters.) Data which differs for each function invocation should be stored local to that function.
Error Handling. Who detects errors, who reports them and how?
Catch errors as early as possible, handle them as high up in the hierarchy as possible. Use exceptions only for exceptional situations. Library routines should not exit the system when hitting an error, pop up dialog boxes or print messages. They should return error statements to the caller, either through "funny" return values or better through exceptions (which are like funny values with a lot of context information attached, which bubble up the call stack automatically). Try to keep the library usable after an error has occurred. [tpop]
Try to find a module to print/log errors or design a simple one yourself. It unifies error behaviour and its existence encourages you to use it. You could even try to store all error messages centrally. Logging has the advantage that it works in environments where no standard output will be seen. Provide as much context as possible in error messages. A bare "estrdup failed" gives the user no clue who said it and what caused it, compared to

Markov: estrdup ("Longtext") failed: Memory
  limit reached

Also give an example of valid input, or the size limit. [tpop]

Debugging, Testing and Profiling

Debugging

Examine the most recent change. Debug it now, not later. Get a stack trace. Read before typing. Explain your code to someone else. Single step. For hard bugs: make the bug reproducible. Divide and conquer (prune the inputs by binary search, or put in print statements). Display output to localize the search. Write self-checking code. Write a log file (and flush the buffer, so abnormal termination will not lose the last statements). Draw a picture (annotate your data structures with statistics and plot the result). Use tools like diff and grep to examine large outputs. If debugging takes longer, keep a record of what you tried (it will help you remember the next time something similar comes up). [tpop]

Testing

Test code mentally as you write it, thus you will not make some mistakes in the first place.

Test code at its boundaries. This will catch fencepost errors. For loops or conditions, try what they will do with empty input, one item only, two items or the maximum number of items. What happens if there are too many? Can it happen? If code fails, it usually does so at the boundaries; if it works at the boundaries, it will work everywhere else, too.
Test pre- and post-conditions. Use assertions. Program defensively. An assertion is simply a test for a pre-condition which will abort the program if it fails. Defensive programming is simply code to handle "can't happen" cases, that is, cases outside of the expected value range. Typical examples are null pointers, out-of-range array subscripts, divisions by zero. Defensive programming can protect the piece of code against incorrect use or incorrect data. These might happen as a consequence of an error somewhere else. So do assertions, but with defensive programming the system will not exit, which only makes sense if there is some sensible way to recover from the error. Probably a defensive clause should return a value that indicates the erroneous state, or throw an exception to be handled further up, so the client code can make sense of what happened.
Test incrementally. Write a small unit of code that does something well defined and simple. Then test it. Then add more. Test that. Etc. It is faster to do it this way than with one big testing bang at the end. Doing it incrementally, at each step you can be pretty sure that the library functions already implemented work like they should, and if you get an error, it probably is in the few lines of new code, easy to find. If you test only at the end and get mistakes, they are much harder to find.
Test simple parts first. Thus, you can build confidence in the elementary parts and then go on to more complicated features.
Check error returns. Exceptions force you to do this.
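To make the difference between defensive code and assertions concrete, here is a small Java sketch. The class Average is invented for this example. The "can't happen" inputs are handled defensively with an exception the caller can catch, while the post-condition is an assertion, which Java only checks when run with -ea.

```java
final class Average {
    // Returns the arithmetic mean of the given values.
    static double of(int[] values) {
        // Defensive clause: null or empty input is outside the expected range.
        // Throwing lets the client code decide how to recover.
        if (values == null || values.length == 0) {
            throw new IllegalArgumentException("Average.of: need at least one value");
        }
        long sum = 0;
        for (int value : values) {
            sum += value;
        }
        double result = (double) sum / values.length;
        // Post-condition as an assertion: a failure here is a bug in this method.
        assert !Double.isNaN(result) : "average must be a number";
        return result;
    }
}
```

The boundary cases to try are the ones named above: empty input, a single item, two items.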

Also, there should be some test cases for each program: ideally a simple one, complicated ones at the edge of the allowed input (empty or maximum values etc.), and some barely illegal ones. Since it is boring and time-consuming to test by hand, you will end up not doing it. Therefore, you have to automate testing, so that pushing a single button runs a set of test cases and reports on mistakes.

Know what output to expect. Verify conservation properties. If the output has an inverse, see if you can reconstruct the initial input. You can use tools like `cmp, diff, wc` to compare outputs.	This is sometimes difficult for complex programs, GUI apps etc.
Automated regression testing. The idea behind regression testing is that if you have a program you know to work, and extend or change it, the old tests should still work with the new version and produce the same, correct output. If they do, you again have a working version. Usually you have a test harness, some kind of shell or Perl script, which runs your program for a large number of input files (test cases). These should run silently, producing output only if a mistake occurs. If you fix some mistake, you tend to check only whether the fix worked. But if your fix introduced a new mistake, you will miss it. Regression testing should help you to prevent that.
Create a test scaffold. This will allow you to incorporate new tests easily.
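A test scaffold does not have to be big. The following Java sketch, with the made-up name RegressionHarness, runs a function over a table of inputs with known outputs and stays silent unless something differs. Adding a test means adding a table entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;

final class RegressionHarness {
    // Runs the program on every input and returns one message per mismatch.
    // An empty list means all old tests still produce the known output.
    static <I, O> List<String> run(Function<I, O> program, Map<I, O> expectedByInput) {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<I, O> testCase : expectedByInput.entrySet()) {
            O actual = program.apply(testCase.getKey());
            if (!Objects.equals(testCase.getValue(), actual)) {
                failures.add("input '" + testCase.getKey() + "': expected '"
                        + testCase.getValue() + "', got '" + actual + "'");
            }
        }
        return failures;
    }
}
```

In a real project the table would be read from files of recorded input and output, and a unit testing framework would do most of this for you; the principle stays the same.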

What NOT to do

If you have big switch-case or if-then-else statements, you're doing something wrong. If you use a lot of cut and paste, you are doing something wrong. See also: Anti Patterns.
11. If you have a procedure with 10 parameters, you probably missed some. Alan Perlis, Epigrams

Design Patterns

There are several technical patterns one can use in programming:

Top Down Design. This means a divide and conquer approach: split the problem into smaller ones until they are so small you can easily handle them. This mainly works with procedural approaches, and is essentially the same as building up a complex program from less complex components - just seen the other way round. Or maybe it is a way to find out what the simple components should be.
Recursion. Mainly for handling recursive data structures like trees.
Divide & Conquer. This is again the technique of separating a big problem into smaller ones and solving these.
Preselection
Backtracking

General Object Oriented Wisdom:

Program for an interface, not an implementation. Let your implementation classes extend abstract superclasses or implement defined interfaces. Client code accesses the interface only. This way, you can change the implementation without changing client code.
Prefer object composition to inheritance. Since subclassing kind of breaks encapsulation (changing the superclass will change the subclass), it is better to use composition. This will make the classes more easily reusable in another context. Delegation makes composition as powerful as subclassing. Instead of delegating request execution to the superclass, it is delegated to the component object.
Templates (Generic Programming). This is a technique to define a type without choosing the types it works on in advance. You can then parametrize it with the types it works on, e.g. a collection which can be parametrized with the class type it holds. [dp]
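The three rules above fit into one short Java sketch. SimpleStack and DequeStack are names invented for this example: the client programs against the interface, the implementation is composed of a Deque and delegates to it instead of inheriting from a collection class, and the type parameter T is chosen by the client.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Client code depends on this interface only.
interface SimpleStack<T> {
    void push(T item);

    T pop();

    boolean isEmpty();
}

// Composition: the class owns a Deque and delegates to it. The Deque's
// other operations stay hidden, and it can be swapped out without
// touching client code.
final class DequeStack<T> implements SimpleStack<T> {
    private final Deque<T> items = new ArrayDeque<>();

    @Override
    public void push(T item) {
        items.addFirst(item);
    }

    @Override
    public T pop() {
        return items.removeFirst();
    }

    @Override
    public boolean isEmpty() {
        return items.isEmpty();
    }
}
```

A client writes SimpleStack<String> stack = new DequeStack<>(); and never mentions the implementation class again. Had DequeStack extended ArrayDeque, every Deque method would have become part of its public interface.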

Rut values

Standard random numbers:

17 (the 'smallest random number')
42 (Doug Adams said so)
105 (Octal for decimal 69, or decimal for Hexadecimal 69)

Standard variable Names. You shouldn't need more than three. Actually it is said about numbers in computer programs: Use none, one or unlimited numbers.

foo
bar
baz
blah
blub

Rules for catching exceptions

If an exception is thrown, it should be allowed to bubble up until you reach some part of the program which is on a high enough level, to competently do something about it. So for example if the exception is not fatal to the further execution of the code, it could be caught right away and a message written to the log. On the other hand, if the exception proves fatal for higher levels of the code, it has to be allowed to bubble up and caught there. Always remember that Exceptions are just used instead of returning funny values (which will not be passed up through the stack automatically).
When we want to shield the higher levels from proprietary Exceptions by some low level service, like the DBMS, we would have to use our own Exceptions to wrap these instead.
The same - in the opposite direction - holds true for default values from the configuration file. These should be read in only at the highest possible level, and then be passed on as parameters to other classes which do need them.
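Wrapping a low level exception can look like the following Java sketch. StorageException, CustomerStore and queryName are hypothetical names, and queryName is a stand-in that always fails, just to show the wrapping. The higher levels only ever see StorageException, with the original exception attached as the cause and some context added to the message.

```java
import java.sql.SQLException;

// Application-level exception that shields callers from the DBMS's own types.
final class StorageException extends RuntimeException {
    StorageException(String message, Throwable cause) {
        super(message, cause);
    }
}

final class CustomerStore {
    // Stand-in for a real database call; it always fails in this sketch.
    private static String queryName(int customerId) throws SQLException {
        throw new SQLException("connection refused");
    }

    // Catches the proprietary exception at the service boundary and rethrows
    // it wrapped, so it can bubble up without leaking SQLException upwards.
    static String loadName(int customerId) {
        try {
            return queryName(customerId);
        } catch (SQLException e) {
            throw new StorageException(
                    "CustomerStore: loading customer " + customerId + " failed", e);
        }
    }
}
```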

etc...

Separate the things that change from the things that stay the same. [tij]
If in trouble, make more objects. [ooa/d]
Read the technical specification for something if you want to know it (e.g. XML, HTML etc). Once you have an overview, it's no use wasting time on second-rate sources.

References

(see my Bibliographies for more literature.)

Tag	Author	Title	Year	Publisher	ISBN
mmm	Frederick P. Brooks, Jr.	The Mythical Man-Month, Anniversary Edition	1995	AW	0-201-83595-9
This is the classical text about software project management with some annotations and the ``No Silver Bullet'' Essay added twenty years later. There seems to be an unspoken law stating that this must be cited in any other book about software projects or any computer book at all, with the following sentence: ``The programmer at wit's end for lack of space can often do best by disentangling himself from his code, rearing back, and contemplating his data. Representation is the essence of programming.'' But there is much more practical wisdom in it, and it's half-entertaining to read, too.
pps	Jon Bentley	Programming pearls	1986	AW	0-201-10331-1
A real pearl. This is a pleasure to read! It teaches some basic concepts of coding, like back-of-the-envelope calculations, the use of data structures to get elegant code and some useful algorithmic tricks (heapsort, quicksort, binary search, hashing). It's a bit old already, but the essence of what he is saying is still true in the age of OO programming.
tij	Bruce Eckel	Thinking in Java	1998	PH	0-136-59723-8
It's the best book about Java. Period.
tpop	Brian W. Kernighan and Rob Pike	The Practice of Programming	1999	AW	0-201-61586-X
An excellent introduction to the various aspects of programming, from style and interface design to debugging, testing, porting and little languages. If you buy only one book that gives an overview of actual programming - buy this one. It's only about 250 pages long, and covers a lot of terrain, with much sound advice, suggestions for further reading and a lot of example code. Of course you can only learn by doing, so there are exercises.
dp	Erich Gamma and Richard Helm and Ralph Johnson and John Vlissides	Design Patterns: Elements of Reusable Object-Oriented Software	1994	AW	0-201-63361-2
Cited in nearly every other book about object-oriented programming, this is probably the book that will teach you more about the subject than any other book out there. Unfortunately it's boring to read and a touch too academic for my taste. Still a must read.
ooa/d	Grady Booch	Object-Oriented Analysis and Design	1994	AW	0-805-35340-2
This is a standard work for Object Oriented Analysis and Design (OOA/OOD), also touching the iterative development process. Too much hype, too many buzzwords, too many object-religious statements - the getting-a-better-programmer-per-page-of-text ratio is rather low.
cc	Steve McConnell	Code Complete: A Practical Handbook of Software Construction	1993	Microsoft Press	1-55615-484-4
This is the Elder and Big Brother of 'The practice of programming'.
