Bioinfoxy: Architecture

Understand your problem

Coding is not done for its own sake. Usually there is a problem you want to solve with the system you are building. It may be a big problem, like integrating heterogeneous database systems in a big company, or a little one, like producing a formatted list of records. In any case, before you start to code, you have to understand this problem and be interested in it. You have to find out what the right way to solve it is, and that is hard when you are uninterested. If you work in bioinformatics, for example, you have to be interested in the biology. Do not make the mistake of sitting down and hacking away when you do not know what you are doing. The results will be ugly and cost you time. Examine your problem, think it through, and draw sketches on paper until you know how to solve it. Then the code writes itself much more easily.

Don't repeat yourself

DRY is an easy concept to grok, but an extremely powerful one. You want to program in a way that avoids having two data objects holding the same data, two different parsers for the same file type, or two different libraries doing the same thing. You want a single source of truth. That means you have to identify the things that are used repeatedly, and isolate them. You have to think about what your central data structures are, and how they will be accessed. You have to organize the call interfaces and document things, so that others, or you yourself, can reuse what you did.

I have seen a lot of large systems full of lava-flow code, cut-and-paste programming, and duplicated development. Systems where basically every programmer went in and redid everything from scratch, because what was already there was too difficult, undocumented and cumbersome to understand. In the end the whole system starts to collapse, because it is much too complicated to understand. Introducing new features or fixing bugs in one place leads to unforeseen and arcane errors somewhere else. For most of the code nobody even knows whether it is still needed, but people are afraid to get rid of it.

Build large systems from components

The most important and most difficult thing is to find the right segmentation. Make it so that each module bundles the stuff you'd intuitively expect to find there, and you'll be about right. Build the components as truly general components. Make them libraries with a defined interface that is robust to malformed input and misuse. You can think of them as little servers that get client requests. Then, once you have tested the library, you can trust it and no longer need to waste brain power looking for mistakes there. Another crucial thing is to limit the number of dependencies. If every part depends on a lot of other parts, you cannot change anything in isolation without affecting a lot of other things. Libraries should have few or no dependencies.
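As a rough sketch of what "a defined interface that is robust to malformed input" could look like, here is a tiny library function in Python. The function name, the accepted alphabet and the choice of exceptions are my own assumptions for illustration, not part of any existing library.

```python
# Hypothetical mini-library: one public function, no dependencies.
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence.

    Accepts upper or lower case and surrounding whitespace.
    Raises TypeError for non-string input and ValueError for
    characters outside A, C, G, T, N.
    """
    if not isinstance(sequence, str):
        raise TypeError("sequence must be a string")
    cleaned = sequence.strip().upper()
    bad = sorted(set(cleaned) - set(_COMPLEMENT))
    if bad:
        raise ValueError("unexpected characters: " + "".join(bad))
    return "".join(_COMPLEMENT[base] for base in reversed(cleaned))
```

Callers either get a correct answer or a clear error. Once this is tested, nobody has to look inside it again.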

Programming is about abstraction

Even a medium-sized computer program is much too complicated to understand fully. You can never keep all the variables, parameters and whatnot in your head. Your brain is too weak, so you have to find a way to give it only a few things to remember at the same time. That's where abstraction, object-orientation, information hiding, modularisation and so on come from. Keep it simple, because you are not smart enough to make it complicated. It gets more complicated all by itself all the time, and keeping it simple is a tough fight, and an art. If it is elegant, then usually it is good.

Programming is separating the parts that change from the parts that don't

Put the things that do not change in the code, and the things that change in databases, data tables and configuration files. Many programs could be simplified with a table-driven approach: the general code is in the program, and all the specific quirks are in the tables or in config files.
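A minimal sketch of the table-driven idea in Python follows. The format names and their quirks are made up for illustration. In a real system the table could just as well live in a config file or a database.

```python
# The table: everything that differs between formats (hypothetical values).
FORMATS = {
    "csv": {"delimiter": ",", "comment": "#"},
    "tsv": {"delimiter": "\t", "comment": "#"},
    "psv": {"delimiter": "|", "comment": ";"},
}


def parse_records(text, fmt):
    """The general code: split text into records using the quirks of fmt."""
    rules = FORMATS[fmt]
    records = []
    for line in text.splitlines():
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith(rules["comment"]):
            continue
        records.append(line.split(rules["delimiter"]))
    return records
```

Supporting a new format then means adding one row to the table, not writing another parser.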

Programming is data representation

The most important thing is to have good data structures and objects that reflect the parts of the program in intuitive and simple ways. Then the program logic is easy to write. If the data is badly organized, the logic is a nightmare and the program is hard to understand.

Some useful design patterns

Streams. This is a very simple model for viewing all kinds of data: just a stream of characters (or bytes), which you can examine to build organized structures from their meaning (which depends on the program reading the stream), and into which you can insert things or from which you can delete them. It is also the common model for files, images etc.
Client : Server. This beaten-to-death model makes it possible to distribute responsibilities in your application and forces you to think clearly about the interface between the parts (the protocol). It can be used to build multi-tiered applications and then to implement things like load-balancing between several servers, making the whole design quite modular and scalable.
Centralized information exchange format (Protocol). If you have n possible clients which will exchange data, the number of interfaces between them grows quadratically if everyone connects directly to everyone else: there are n(n-1)/2 pairs, and n(n-1) one-way converters. You can avoid this by having one central exchange format, to which all the clients write and from which they read, so you just need two gateways per client (one for writing, one for reading), no matter how many clients you have.
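To make the arithmetic concrete: counting one-way converters, direct connections between n clients need n(n-1) of them, while a central format needs 2n. The sketch below shows both counts and a toy hub in Python. The two client formats ("kv" and "csv") and the use of a plain dict as the central format are assumptions made up for this example.

```python
def direct_converters(n):
    """One-way converters needed if every client talks to every other one."""
    return n * (n - 1)


def hub_converters(n):
    """One-way converters needed with a central format: 2 gateways per client."""
    return 2 * n


# Hypothetical gateways. The central exchange format is a plain dict.
def read_kv(text):
    return dict(pair.split("=", 1) for pair in text.split(";"))


def write_kv(record):
    return ";".join(f"{key}={value}" for key, value in record.items())


def read_csv(text):
    header, values = (row.split(",") for row in text.split("\n"))
    return dict(zip(header, values))


def write_csv(record):
    return ",".join(record.keys()) + "\n" + ",".join(record.values())


READERS = {"kv": read_kv, "csv": read_csv}
WRITERS = {"kv": write_kv, "csv": write_csv}


def convert(text, source, target):
    """Convert between any two formats by going through the central dict."""
    return WRITERS[target](READERS[source](text))
```

With 10 clients that is 90 direct converters against 20 gateways, and adding an eleventh client costs two new functions instead of twenty.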
Event Handlers and Event Firing. For this model you have to build a handler registry into your application objects. Upon any interesting state change (in GUIs for example onClick, onMouseOver, onMouseOut etc.) they notify the registered handlers about what happened by sending an event object, which should carry all the contextual information (like coordinates, keys pressed, a pointer back to the firing object).
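A bare-bones version of such a handler registry might look like this in Python. The class and method names (Event, Button, on, fire) are invented for the sketch and do not come from any particular GUI toolkit.

```python
class Event:
    """Carries the context of what happened, including the firing object."""

    def __init__(self, name, source, **context):
        self.name = name
        self.source = source    # pointer back to the firing object
        self.context = context  # e.g. coordinates, keys pressed


class Button:
    def __init__(self, label):
        self.label = label
        self._handlers = {}     # the registry: event name -> list of handlers

    def on(self, name, handler):
        """Register a handler for the given event name."""
        self._handlers.setdefault(name, []).append(handler)

    def fire(self, name, **context):
        """Notify every handler registered for this event name."""
        event = Event(name, self, **context)
        for handler in self._handlers.get(name, []):
            handler(event)


clicks = []
button = Button("OK")
button.on("click", lambda e: clicks.append((e.source.label, e.context["x"], e.context["y"])))
button.fire("click", x=10, y=20)
button.fire("mouseover", x=11, y=20)  # no handler registered, nothing happens
```

The firing object knows nothing about its handlers except that they can be called with an event, which keeps the two sides decoupled.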
Model, View, Controller. Decoupling Data, Logic and View makes it possible to have several independent ways to work upon and visualize your data, instead of being chained to just one built right into the data.
Multithreading. Multithreading is a very powerful concept, but also quite easy to fuck up. What you have to be careful about here is shared variables: any class or global variable in your normal code becomes poison in multithreaded code if it is used by several threads, since one thread will change its contents behind the back of the other one. There are three ways to go about this:
- Don't use global or class vars. Pass all data as parameters from function to function. This will lead to long parameter lists, and since there is no state, it also cannot be selectively modified. Not a very good solution.
- Use a central repository which holds several copies of the state variables, one for each thread, so the threads don't mess around with their fellows' vars. This works ok and is not too hard to implement (with servlets you can stuff your vars into the Session object, for example). It is a bit wasteful on memory, especially if you have a lot of threads.
- Use locks on the variables (mutexes, the simplest kind of semaphore), i.e. for each var there is an additional lock var. If a thread wants to access the var it has to "get" the lock: it has to check whether the lock is free (no other thread holds it). If not, it waits a bit and tries again. If so, it claims the lock, goes about its work, and releases it again. This locking stuff is tricky as hell, and you can get into all kinds of deadlocks.
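The second and third options can be sketched with Python's standard threading module. The function and variable names are made up for the example. threading.Lock and threading.local are the standard-library pieces doing the real work, and "with lock" blocks until the lock is free instead of polling.

```python
import threading

# Option 3: one shared variable, guarded by its own lock.
counter = 0
counter_lock = threading.Lock()


def add_many(times):
    global counter
    for _ in range(times):
        with counter_lock:       # claim the lock, do the work, release it
            counter += 1


# Option 2: every thread gets its own copy of the state.
per_thread = threading.local()
results = {}
results_lock = threading.Lock()


def count_privately(name, times):
    per_thread.count = 0         # visible to this thread only
    for _ in range(times):
        per_thread.count += 1    # no lock needed, nobody else can see it
    with results_lock:           # the shared results dict still needs one
        results[name] = per_thread.count


def run(target, arg_list):
    """Start one thread per argument tuple and wait for all of them."""
    threads = [threading.Thread(target=target, args=args) for args in arg_list]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


run(add_many, [(10000,)] * 4)
run(count_privately, [(f"t{i}", 1000 * (i + 1)) for i in range(3)])
```

Without the lock in add_many, the four threads could overwrite each other's increments and the final count could come out too low. Taking several locks in different orders in different threads is the classic way to end up in a deadlock.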


2012-09-04
