
2014-07-14

MySQL join syntax

http://dev.mysql.com/doc/refman/5.7/en/join.html
http://dev.mysql.com/doc/refman/5.7/en/nested-join-optimization.html

In the classical way to write a query
SELECT a.x, b.y
  FROM a, a2b, b
 WHERE a.a_no = a2b.a_no
   AND a2b.b_no = b.b_no
   AND a.z = 123

the WHERE clause mixes join conditions with selection conditions. Once one gets to things like outer joins, it is therefore better to be explicit: use the WHERE clause to restrict the results by value, and the joins to restrict them by matches between tables.

As always, tables in joins can be actual tables, or subqueries returning tables, and we can define aliases for tables. I am ignoring partitions here.

A join merges two tables into one. It uses matching values between them to restrict the cartesian product of all possible combinations of rows from these tables to those rows that have the matching values. There are several kinds of joins:

JOIN ... needs to have matching values in tables on both sides
LEFT JOIN ... needs values in left table, and tacks on values from the right, or null otherwise

You use left joins if you have a table that only has values for some records in your main table, and you want to see them if available, but you do not want to filter out rows from the main table, if they are absent.
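A minimal sketch of the difference, using Python's built-in sqlite3 module as a stand-in for MySQL (the inner and left join behave the same way here). The tables person and phone and their contents are made up for illustration.

```python
import sqlite3

# Hypothetical tables: every person exists, but only some have a phone.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (person_id INTEGER, name TEXT);
    CREATE TABLE phone  (person_id INTEGER, number TEXT);
    INSERT INTO person VALUES (1, 'Ann');
    INSERT INTO person VALUES (2, 'Bob');
    INSERT INTO phone  VALUES (1, '555-0100');
""")

# Inner join: Bob has no phone row, so he is filtered out.
inner_rows = conn.execute(
    "SELECT name, number FROM person JOIN phone USING (person_id) ORDER BY name"
).fetchall()

# Left join: Bob is kept, with NULL (None in Python) for the phone number.
left_rows = conn.execute(
    "SELECT name, number FROM person LEFT JOIN phone USING (person_id) ORDER BY name"
).fetchall()

print(inner_rows)
print(left_rows)
```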

One can optionally insert an OUTER between LEFT and JOIN. One can optionally call the simple, bidirectional join INNER JOIN or CROSS JOIN (in other dialects INNER requires an ON clause, here not), or use a comma between table names to imply it with the join condition in the where clause. Be careful when mixing table lists (i.e. implied inner joins) and left or explicit joins: because the explicit joins have precedence over the implied inner joins, the column you may want to join on in your outer join may not yet have been joined in, resulting in an error message. (See below).

There are several ways to describe on which fields from the left and right operands should be joined.

ON ... followed by an explicit statement which field from which table equals which field from which other (one actually can use anything one could use in a where clause in ON, but then one loses the advantage of cleanly splitting joins from selection criteria).

USING ... followed by a parenthesis enclosed list of fields which need to be present in both tables under the same name, and need to have matching content for a match.

NATURAL JOIN of two tables does not even need the field(s) in USING any more, as it will join on all fields of the same name between the tables.

The SELECT * output of USING and NATURAL is not quite identical to that of ON: the matching columns are listed only once, before the other columns (so called "coalesced" common columns), in order of appearance in the first table, then columns unique to the first table in order, then those unique to the second table in order. ON just lists all columns from all joined tables.

For example, assume you have the tables
t1(id,x), t2(id,y), then
SELECT * FROM t1 JOIN t2 ON t1.id=t2.id;
SELECT * FROM t1 JOIN t2 USING (id);
SELECT * FROM t1 NATURAL JOIN t2;
will all match the same rows.
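To see the column coalescing for yourself, here is a small sketch with Python's sqlite3 module. SQLite is only a stand-in for MySQL here; for these two tables it lists the columns in the same way. The row values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, x TEXT);
    CREATE TABLE t2 (id INTEGER, y TEXT);
    INSERT INTO t1 VALUES (1, 'x1');
    INSERT INTO t2 VALUES (1, 'y1');
""")

def columns(sql):
    """Return the column names that a query produces."""
    return [d[0] for d in conn.execute(sql).description]

# ON keeps the id column of both tables.
on_cols = columns("SELECT * FROM t1 JOIN t2 ON t1.id = t2.id")
# USING and NATURAL list the common column only once.
using_cols = columns("SELECT * FROM t1 JOIN t2 USING (id)")
natural_cols = columns("SELECT * FROM t1 NATURAL JOIN t2")

print(on_cols, using_cols, natural_cols)
```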
If you have the three tables t1(a,b), t2(c,b), and t3(a,c), then the natural join SELECT * FROM t1 NATURAL JOIN t2 NATURAL JOIN t3; really means that after joining t1 and t2 the resulting table will have the columns b, a, c. Natural joining this with t3 will thus join on both a and c, not just on c, equivalent to this: SELECT ... FROM t1, t2, t3 WHERE t1.b = t2.b AND t2.c = t3.c AND t1.a = t3.a; This may not be what you intended. So while NATURAL is nice for the lazy, it is also dangerous because of its side effects. It is better to be explicit and use USING.
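The NATURAL JOIN trap can be reproduced with a few rows. This is a sketch using sqlite3 as a stand-in for MySQL, with invented data in which t1.a and t3.a deliberately differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (a INTEGER, b INTEGER);
    CREATE TABLE t2 (c INTEGER, b INTEGER);
    CREATE TABLE t3 (a INTEGER, c INTEGER);
    INSERT INTO t1 VALUES (1, 10);
    INSERT INTO t2 VALUES (5, 10);
    INSERT INTO t3 VALUES (2, 5);   -- c matches t2, but a differs from t1
""")

# NATURAL silently joins t3 on a AND c, so the differing a removes the row.
natural = conn.execute(
    "SELECT * FROM t1 NATURAL JOIN t2 NATURAL JOIN t3"
).fetchall()

# USING states the intended join columns explicitly: b, then only c.
explicit = conn.execute(
    "SELECT * FROM t1 JOIN t2 USING (b) JOIN t3 USING (c)"
).fetchall()

print(len(natural), len(explicit))
```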
As long as all of them are inner joins, the order of joins is unimportant, and no parentheses are needed. 
However, this is not true for outer joins.

Let's say we have
CREATE TABLE t1 (i1 INT, j1 INT);
CREATE TABLE t2 (i2 INT, j2 INT);
CREATE TABLE t3 (i3 INT, j3 INT);
INSERT INTO t1 VALUES(1,1);
INSERT INTO t2 VALUES(1,1);
INSERT INTO t3 VALUES(1,1);
SELECT * FROM t1, t2 JOIN t3 ON (t1.i1 = t3.i3);
will create an error Unknown column 't1.i1' in 'on clause', because the explicit join is evaluated first, and does not know anything about i1. SELECT * FROM (t1, t2) JOIN t3 ON (t1.i1 = t3.i3); would fix that, as would SELECT * FROM t1 JOIN t2 JOIN t3 ON (t1.i1 = t3.i3); because there the joins are worked off left to right in order. That means you can first list all your joins (making a huge cartesian table in theory), then all the ON conditions combined with AND. Likewise, in an ON clause you can only refer to tables that were mentioned before it (to the left), not to tables that are joined after it (to the right).

Referring to joins

Joins generate one big table on which the WHERE clause can work. You can supply aliases for the various tables contributing to a join.

Finally, multiple joins, in a mix of inner and outer, with the outer joins nested via linking tables, i.e. what you deal with in the real world:

 select * 
   from finding f 
        join acq_source s using (acq_source_id)
        join finding2treatment f2t using (finding_id)
        join treatment t using (treatment_id)
        join treatment2genotype t2y using (treatment_id)
        join genotype y using (genotype_id)
        join genotype2variant y2v using (genotype_id)
        join gene g using (gene_id)
   left join (treatment2disease t2d join disease d) on (t2d.treatment_id = t.treatment_id and t2d.disease_id = d.disease_id)
   left join (treatment2drug t2u join drug u) on (t2u.treatment_id = t.treatment_id and t2u.drug_id = u.drug_id)
   left join (reference2finding r2e join reference r) on (r2e.finding_id = f.finding_id and r2e.reference_id = r.reference_id)
 where f.finding_id in ( )

2012-09-03

Useful Unix Commands

To prepend a line to a file under Unix, do echo "This is my line" | cat - filename > newfile. You can't redirect the output to filename itself, though.

A little shell trick to remove the first lines from a file is tail -n +x file where x is the number of the first line you want in your output. So to remove the first line from a file you would set it to +2.

Restarting apache: /etc/init.d/httpd restart

Restarting samba: The documentation of Samba tells you it'll reload its config file (usually /etc/samba/smb.conf) when sent a SIGHUP signal, which translates to a kill -1 pid (see man 7 signal for the list of signals). You can also restart it with the usual /etc/init.d/smb restart, just like you'd do with other daemons.

To zip/unzip .zip files under Unix, use zip/unzip instead of gzip/gunzip. For cleaning up your directory tree under KDE, kdirstat is nice.

Using rpm:
rpm -U packagefile install (really update)

rpm -e package erase

rpm -qa query: list all installed packages

rpm -ql package query: list all files installed for package


When the installer complains that a library is missing, e.g.

error: failed dependencies:

libblas.so.3 is needed by octave-2.1.35-4

go to rpmfind.net, and put in the name of the library to find the RPM that provides it, download that and install it. When looking for file names, try leaving out the .number at the end, to find the more up to date versions, too. Apache, by the way, is called httpd here.
Basic I/O redirection: see Unix Shell.
tee file pipe input to output and to file (like a 'T' crossing: one way in, two out)
cmd1 | xargs cmd2 use the output of cmd1 as args for cmd2
Finding Stuff
find dir -name 'pattern' files in subtree with pattern in the name. find . -name '*.txt' -exec grep -l 'myword' '{}' \; lists all text files that contain myword. The -l grep option prints the filename instead of the line. The quotes protect the *.txt, the {} and the search term from shell interpolation. Alternatively, one can use grep 'myword' '{}' /dev/null, because /dev/null will be treated as a second file name, and when there is more than one filename, the name is printed by default (the -H option would do the same). The {} is the placeholder for the filename, the \; is needed to pass a ; through to find, so it knows where the -exec arguments end.
find dir -iregex 'pattern' files in subtree whose path matches pattern, using case-independent real regexes (a GNU find extension, not available in every find).
locate pattern prints from a nightly built index of the filesystem all names containing pattern. (fast)
which name print full path to name prog. Useful if there are several versions installed, and you want to know which one is first on the path.
man [-k] [-sn] name print the manual entry for name. -k searches full text ("keyword"). -sn forces the search in a section. Section1 is the commands you usually need. Man pages are usually stored under /usr/(local/)man/man[1-6]
System information

cat /proc/cpuinfo Find processor information under linux on the command line.
alias all aliases
showrev, uname [-p] processor and kernel information. showrev -p option prints list of installed patches. uname -a gives all kernel information. To find out which linux release is installed, on Redhat/Fedora, you can also try less /etc/redhat-release
df [-k] free disk space [in kbytes] for all file systems. du dir prints the used space for a dir.
netstat [-n] [f] [ip] addresses and status of the network
vmstat cpu usage and memory
iostat io information
nslookup ip-address domain name bound to the ip-address
ps [-elf] active processes. -elf prints nearly all of them in long form (and is easy to remember).
top eternally print active processes until ^C
crontab -l cron jobs for current user
Displaying and formatting
Whenever input is expected, it can be given as a filename afterwards (command file.txt) or fed from another command via a pipe (command-1 | command-2).
grep [-i] [-v] pattern print [case-independently] all lines in input [not] matching the pattern. Patterns . * [] ^ $ behave normally, + ? | () have to be escaped with \. \w is word (alphanumeric), \W non-word, \b word-boundary, \B non-boundary. Inside [] [:alpha:]=a-zA-Z [:digit:]=0-9.
more or less paginate input
head [-n] prints first few [n] lines of file.
tail [-n] prints last few [n] lines of file.
cat print input
wc [-l] count bytes, words, lines [only]
dos2unix remove all \r from file (these are in there because on DOS, newline is \r\n instead of Unix's \n).
strings extract ascii strings from binary files. Take a look at Word or kernel error messages.
Compressing and expanding
Usually, since FTP cannot recursively get or put dir trees, these are tape-archived (tar'd) into one file, compressed with compress(.Z) or gzip (.gz), so the result is a filename.tar.gz file. This in turn has to be uncompressed and extracted to be used. Beware: DOS clients do not like two dots and tend to save this as filename_tar.tar or filename_tar.gz.
tar c[v]f filename.tar dir [verbosely] roll dir into tar ball. Be careful that dir does not end in an / as it does with auto-completion.
tar x[v]f file.tar [verbosely] unroll tar ball.
compress file compresses the file into its file.Z version.
uncompress file.Z inflates back to the file version. You can also use gunzip for this.
gzip file compresses the file into its file.gz version.
gunzip file inflates back to the file version.
gunzip < file | tar xf - unzip and untar. I think the < is needed as then the data is read from a stream, and thus no intermediate file is written?
unzip file.zip unzips a .zip file
gtar xzf file unzip and untar, even shorter.
File and directory handling
ln [-s] real symbol create a [sym]link to another file. (Symlinks may span file systems). If you provide a dot for the symbol, the last name in the dirlist will be used, eg: ln -s /usr/local/pub . creates a link named pub
rm [-R] expr remove all files matching expr. The recursive version also removes dirs.
cp file target copy a file.
mv source target moves (or renames, depending on your pov) file or directory. To move a directory tree into the working dir type (eg): mv ~/texts/books . To move everything from a dir to another: mv * another/
ls [dir] list files in dir. common options are -F (append symbols for: dir, link etc), -A (list all including ones starting with .) -l (long with all info).
mkdir name / rmdir name make or remove directories.
touch file update the date on the file or create a new one with 0 bytes if it did not exist.
pwd print present working directory
Process handling
kill [-9] PID [thoroughly] kill process with PID.
command & run process in background.
^Z suspend process
bg send suspended process to background
fg [%n] fetch process with id n back to foreground.
Shell scripting
eval Read the arguments as input to the shell and execute them as commands. Usually used to execute commands generated by command or variable substitution (because the result is the input string that is eval'd).
test Implicitly used to evaluate conditionals.
expr Calculate mathematical expressions.
exec Read the input and execute it instead of the shell in the current process.

DOS vs Unix commandline

I have had to work on Windows and Unix in parallel for pretty much my entire coding life, with a Windows machine as workstation and servers running some dialect of Unix. In this situation, the following may be useful. It is not meant to be complete.
Operation Windows Unix csh-based (csh, tcsh)
initialisation NT: set to the values in "System Properties/Environment", DOS: set in autoexec.bat set in /usr/local/env/.cshrc (sometimes /etc/.cshrc), followed by ~user/.login if login shell.
I/O redirection cmd <in-file >out-file-new
cmd >>out-file-append
stderr cannot be redirected, always goes to screen.
ditto.
filename expansion ? single char, * any number of any (including the dot at the beginning of filenames). eg dir test*.doc will find all files starting with test and the extension doc. dir test* and dir test*.* are the same. ditto, but * doesn't match a dot when it's the first char in a filename (as such files are used as system resources). Mask meta-chars with \.
whitespace protection cmd "with blank" cmd 'with blank'
Process piping with cmd1 | cmd2 ditto
Version winver showrev
environment setting set [var=[string]] sets env vars.
Note: no space behind =. Without var, all vars are shown on stdout. set vars locally for a script between setlocal and endlocal. Each setlocal must be freed by an endlocal before the script ends.
set var = text with mandatory spaces. setenv name text sets and exports variable. set ! (bangs) in text are replaced by incremental numbers. array vars are defined as set var = ( foo bar baz )
path path=newpath;%path% set path = ( /bin /usr/bin /usr/local/bin ) path is defined as an array.
directory listing dir ls
help option command /h man command or command --help
file identity check comp cmp
file difference comparison comp f1 f2 for same sized file (default binary)
fc f1 f2 for text files.
cmp
file length
grepping find
findstr with regexen
grep
timing
size check
view users who -al, whoami, groups
view host hostname hostname, showrev
view user on remote host finger usrname@hostname ditto
processes ps -elf
jobs jobs
rights chown, chgrp
file attribs chmod
scheduled execution at cron
Sorting sort sort
Output paging more use piping to page output from other programs, file redir and name expansion to page contents of files. more
html downloading wget wget
free memory mem
show text file contents type f1 [f2 ..] cat
route tracing tracert traceroute
internet IP-name lookup nslookup nslookup
show net connections netstat
ping if computer is on network ping ping
printing lpr lp
prompt style prompt $p$g ($p = path, $g = >) # superuser, % normal, set prompt = "`hostname`:` pwd`>"
variables args for batches are stored in %1 to %9.
environment var contents are accessed like this: %VARNAME% Expansion works here too, i.e %* means a list of all %1 to %9.
$var for normal vars or full arrays, $array[2-4] for array slices, $#array number of elements, $?var if var is defined 1, else 0.
executable search path path without param shows path. entries separated by ;(semicolon).
append old path with path newdir;%path%

shell

Since /bin/sh is available on every Unix system, it is the shell of choice for writing scripts, even though it has much less power than the other shells. Therefore it is a good idea to learn the basic syntax. While man sh will give you the full story, here are some of the most common constructs. Probably you'd be more portable by just learning perl and using that everywhere. The shell can be used interactively or as a script.
Flow control
for and if
for var [in word-list]
do
  cmds
done
            
if cmds
then 
  cmds
[elif 
  cmds
then  
  elif_cmds]
[else 
  else_cmds]
fi
            
case and while
case word in
  pattern1) cmds;;
  ..
  patternN) cmds;;
esac
            
while cmds
do
  cmds
done
            
Tests
What irks me most is the shitty syntax for conditions. Here my personal back-breakers:
  • conditions test expressions with the syntax [ expression ] (note the spaces inside the brackets!)
  • logical operators: and is -a, or is -o, not is !. All have to be set apart by space! Or can also be emulated by [ test1 ] || [ test2 ]
  • example of a negated test: if [ ! -e file ]
  • operators for file tests: -r file (readable), -d file (directory), -e file (exists)
  • string comparisons: = (strings equal), != (strings different). They must have spaces around them, otherwise the whole expression is read as one non-empty string, which is always true.
  • numeric comparisons: -eq (numbers equal). Note the string/number thing is the other way round than it is in Perl, also note the - in -eq
  • there must be a semicolon before the keyword that indicates the start of the dependent block, if both are on the same line: like if [ -e file -a -r file ]; then echo "Yep"; fi or for x in ABC*.txt; do sort $x; done
Your usual constructs, with fi etc instead of blocks. Cmds means a list of commands. Bracketed parts are optional. Keywords must start on a fresh line, or after a semicolon. Patterns can use filename expansion (without the exceptions), word is usually a variable. A for without in uses the positional params. You can use continue and break with loops.

Basics
Commands
Each command can be given arguments during its invocation. It turns into a process, reads data from standard input, outputs results to standard output, errors to standard error. Upon finishing it returns an errorlevel into $?, which is 0 if successful, positive otherwise.
Filenames
Any filename is possible, as long as the script has the magical #!/bin/sh as the first line. Conventional is name.sh
For mathematical expressions, use the expr command. Remember to escape special characters (like * for multiplication as \*).

Special characters and Names
I/O redirection
cmd<handle redirect input to handle
cmd>handle redirect output to handle
cmd>>file append to file
&0 handle for STDIN
&1 handle for STDOUT
&2 handle for STDERR
stream>&stream redirect stream to stream
2>file redirect STDERR to file
>&2 redirect output to STDERR
2>&1 redirect STDERR to STDOUT
cmd1|cmd2 pipe output of cmd1 to input of cmd2
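
The effect of 2>&1 is easier to see when you capture the streams from a program. Here is a sketch in Python's subprocess module, which is my stand-in for the shell here; the child command is a made-up one-liner that writes one line to each stream.

```python
import subprocess
import sys

# A tiny child process that writes one line to STDOUT and one to STDERR.
child = [sys.executable, "-c",
         "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"]

# Streams kept apart, like redirecting 1 and 2 to two different files.
separate = subprocess.run(child, capture_output=True, text=True)

# Like `cmd 2>&1`: STDERR is sent to wherever STDOUT goes.
merged = subprocess.run(child, stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT, text=True)

print(repr(separate.stdout), repr(separate.stderr))
print(repr(merged.stdout))
```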

Regexen for filename expansion
* any number of chars, except a leading . (also directly after a /) and the / itself
? one char
[ac-e] char a, c, d, or e.
[!ab] not char a or b

User-defined variables
To export variable names from the script to the environment use export (instead of setenv like in csh).
varname='value' Assignment of value to variable. No space around equals sign! The quotes are optional for simple words. varname= assigns "" (empty string). Example for setting and exporting DISPLAY: DISPLAY=10.203.1.101:0.0; export DISPLAY
$varname Using a variable.
${varname} Using a variable, braced form if end of varname is ambiguous. Is undef if no assignment happened.
${varname:-word} Using a variable, or word, if it is undef or empty.
${varname:=word} Using a variable, after assigning word, if it is undef or empty.
${varname/pattern/string} Replace pattern with string in the value of varname (a bash extension, not available in plain sh).

Automatic variables
$1 to $9 args (params) 1-9 for the script.
$0 the script name
$# number of parameters
$@ the list of all parameters as a list
$* the list of all parameters as one string
$? errorlevel of last command
$! PID of last background process
$$ PID of current process
Quoting
'...' noninterpolated string
"..." $name, `cmd` and \ interpolated string
`cmd` command substitution replaces command with its output

Various
# turns the rest of the line into a comment
cmd& run cmd in background
cmd1 && cmd2 cmd2 only executes, when cmd1 returns success
cmd1 || cmd2 cmd2 only executes, when cmd1 returns failure
\ Escape. Concatenates lines.

Examples
Quoting/Redirection
chmod 755 `find . -type d` change rights for dirs under .
rm `find . -name "*.html~"` remove all .html~ files in dir tree
 find . -name "*.html~" -exec rm {} \; ditto, but can handle whitespace in file names 

bash

To write special characters under bash, you can use the usual shortcuts, like \t for tab, when you quote them with single quotes preceded by a dollar sign, such as this: $'\t'. This will then generate a tab character.

Scilab

Find out all about Scilab at the Scilab website. This is just for me for starters. Scilab is a free environment for numerical computations that is very similar to the commercial MATLAB, although there are some differences.

Useful Functions

For the full monty, look at the online doc!
Interactive: disp who whos help apropos
I/O: exec mopen mclose mgetl fscanf fprintf error warning
Caveat: getl cuts lines after 4096 chars.
Libraries: genlib
Graphics: xinit xend xset xgrid xtitle/titlepage xstring/xstringl xclear/xbasc hotcolormap plot2d contour2d grayplot plot3d plot3d1 geom3d matplot locate driver
Strings: length xegrep strcat strindex stripblanks strsubst tokens part evstr execstr
Statistics: correl/covar geomean ftest mean/median center regress variance functions size typeof union intersect isdef exists contr
Elementary math: log cos sin diag max/min prod round/int/floor/ceil sign sqrt sum rand fft
Sorting: gsort sort empty eye matrix ones zeros expm trianfml
Linear algebra: inv bdiag spec schur syslin xdscr ss2tf
Polynomials: coeff freq horner poly roots
Spline: interp interpln
ODE Solvers: dassl odedc
Optimisation: optim quapro linpro lmitool

Environment

scilab.star contains the general init file for loading default libs etc
.scilab in your home dir can contain further initialisation code

Interactive

^-C               // pause
...               // line continues on next
;                 // separate commands in line ...
                  // and suppress output if at end
[return]          // eval and print 
who               // list vars
whos()            // more detailed
help              // online help
help('fname')     // ... about function fname
apropos('name')   // help on anything named like name
resume            // resume after pause
pause             // new env inheriting vars; return vars with return
unix_s('cmds')    // execute cmds in unix shell
unix_w('cmds')    // execute cmds in unix shell, write output to window
write('name', x)  // write object x to file
clear             // clear env
clear('name')     // clear var name
lib               // load lib
disp(name)        // show functions in lib, call syntax for func
link('file.o', 'name', 'C') // link in external C code
call()            // call linked code

xset()            // panel to mod display settings

save('name')      // save vars/env to binary file
load('name')      // load vars/env from binary file
write('fname', o) // write o to file fname 
read('fname',2,3) // read part of matrix stored in fname
mopen('fname','w')// open handle to file fname for write
mfprintf(fd, 'format', o) // printf to filehandle fd
x=mfscanf(fd, 'format')   // scanf from filehandle fd
mclose(fd)        // close file handle

deff('[x]=fact(n)', 'if n==0 then x=1, else x=n*fact(n-1), end') //function def

Scope

All variables not defined in a function are considered global. Functions are objects that can be given as args to other functions.

Literals

Scalars are constants, booleans, polynomials and their quotients and strings.
They can be used as elements in matrices.

'Hi' or "World"    // String
%t                 // true
%f                 // false
%pi                // Pi
%e                 // Euler's
%i                 // sqrt(-1) 
%s                 // poly(0,'s'), polynomial p(s) w/ var s, roots(s) = 0
%eps               // machine epsilon (floating-point relative precision)
%inf               // infinity
%nan               // Not a Number
[]                 // empty matrix, zero rows, cols
%io(2)             // scilab window output file handle
SCI                // path to scilab main directory

Syntax

Operators:
// comment
a = 7              // Assignment/Initialisation
[1,2,3]            // row vector
[1;2;3]            // col vector
M'                 // transpose (or complex conjugate) row <=> col 
0:0.1:1            // vector from 0 to 1 (inclusive) in 0.1 steps
-(1:4)             // vector from -1 to -4 in 1.0 steps

*                  // matrix product
+                  // element-wise addition, string concatenation
-                  // element-wise subtraction
/                  // right division
\                  // left division
^                  // exponent
.*                 // element-wise multiplication
./                 // element-wise division
.^                 // element-wise exponent
.*.                // kronecker product
==                 // equals
<>                 // not equal
< > <= >=          // smaller, greater, smaller or equal, greater or equal
~                  // not
&                  // and
|                  // or

Lists: elements can be mixed, lists, even matrices can be elements
l = list('this',3,m,-(1:3))   // list. elems may be scalars, matrices, lists
tl = tlist(['mylist','color',value], 'red', 3) //typed list aka hash
l(3)(2)            // dereferencing a subentry in list
tl('color')        // dereferencing a hash

Matrices: all elements should have same type
M = [a+1 2 3
       0 0 atan(1)
       5 9 -1]     // constant matrix
M = [p,1-z;
     1,z*p]        // polynomial matrix
M = ['Hi','World';
     'Whats','Up?']// string matrix
M2 = matrix(M,2,3) // create matrix M2 with elements of M and given dims
M(2,3)             // element in row 2 column 3
M(1,:)             // extracting the first row
M(:,$)             // extracting the last column
I=(1:3);M(I,:)     // extracting the first three rows

Hypermatrices: more than 2 dimensions
M = hypermat([2,3,2],1:12) // filled 3-dimensional matrix
M(2,2,2)           // corner of hyper matrix
M.dims             // dimensions of hyper matrix
M.entries          // entries of hyper matrix


Functions:
exec('filename',-1)   // load functions from file, -1 means it isn't echoed
y = poly(0,'z')    // function call for variable z
function [r1,...,rn]=name(p1,...,pn) //has to be 1st line in file
   ...
endfunction
argn      // return number of input/output arguments
error     // print error and exit function
warning
pause
return   // return to calling env
resume   // ...and pass on local vars

Control structures:
for elem=vector/matrix/list, body, end
while condition, body, end
break // exits from innermost loop
if condition then, if-body, else, else-body, end
select var, case cond, case-body, case cond2, case-body2,...[,else, catchall-body],end 

Python

Find out all about python at the python website.
Some useful idioms:
for line in open(file):
(my, list) = mystring.split() or mystring.split(',')
[ (func(x), func2(x)) for x in list if x > cond ]

Libraries

from libname import *
from libname import func1, func2
import libname

Regexen

import re
m = re.compile(r'^16_est(\d+)').search(s) # r'foo' is noninterpolated raw string
m = re.search(r'^16_est(\d+)', s) # implicit compile
print(m.group(0)) # group 0 is whole match, parenthesis groups start from 1 
substituted = re.sub(pattern, repl, string[, count])
list = re.split(pattern, string)
  • search returns a MatchObject or None if there was no match.
  • sub returns the substituted string. repl can be a function.
  • re.split always needs a pattern; only the string method mystring.split() splits on whitespace by default.
Regex syntax is like in perl.
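
A small runnable sketch of the calls above, written for Python 3. The input strings are made up.

```python
import re

s = "16_est42_more"                      # made-up input
m = re.search(r'^16_est(\d+)', s)        # MatchObject, since s matches
whole = m.group(0)                       # the whole match
number = m.group(1)                      # first parenthesis group

no_match = re.search(r'^16_est(\d+)', "other")   # None: no match

# repl can be a function that receives the match object.
doubled = re.sub(r'\d+', lambda mo: str(int(mo.group(0)) * 2), "a1 b20")

# Split on a comma followed by optional whitespace.
parts = re.split(r',\s*', "a, b,c")

print(whole, number, no_match, doubled, parts)
```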

Data structures

Sequences (tuple, list, string)

Lists, tuples and strings are all sequences and can be accessed via slicing. Lists [] are mutable, tuples () and strings '', "" are not.
Initialize empty lists with list = [].
In slices a[x:y] indexes start from 0:
  • a[1:3] are the elements with index 1 and 2
  • a[2:] all from (and including) the third.
  • a[:3] all until (and including) the third
  • a[-1] the last
  • a[:-2] all except the last two
  • a[-3:] the last three
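
The same slices on a concrete list, as a quick sketch (the list contents are arbitrary):

```python
a = ['a', 'b', 'c', 'd', 'e']

middle = a[1:3]        # elements with index 1 and 2
from_third = a[2:]     # all from (and including) the third
first_three = a[:3]    # all until (and including) the third
last = a[-1]           # the last
all_but_two = a[:-2]   # all except the last two
last_three = a[-3:]    # the last three

print(middle, from_third, first_three, last, all_but_two, last_three)
```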
In lists list.remove(item) remove item, del list[index] remove item at index.
Other built-in functions for sequences: len(s) min(s) max(s) del s[1:3] for x in s:
Cool functions on sequences, for functional programming:
  • lambda, lambda functions may only contain one expression (but are still cool for simple anonymous functions)
  • map, map is like in perl
  • apply, (Python 2 only; later versions call func(*args) directly)
  • filter, filter is perl's grep
  • reduce, reduce applies a function to the first two items of a list, then to the result and the third and so on, ideal for summing up.
  • zip, allows looping over multiple lists in parallel, by interleaving them into tuples.
Even cooler are list comprehensions like [(x,x*2) for x in range(1,11) if x % 2 == 0]
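
A short sketch of these functions, written for Python 3. Note the assumptions that come with that: map, filter and zip return iterators there (hence the list() calls), and reduce has to be imported from functools.

```python
from functools import reduce   # a built-in in Python 2, in functools in Python 3

numbers = [1, 2, 3, 4]

squares = list(map(lambda x: x * x, numbers))         # apply a function to each item
evens = list(filter(lambda x: x % 2 == 0, numbers))   # keep items passing the test
total = reduce(lambda acc, x: acc + x, numbers)       # ((1+2)+3)+4
pairs = list(zip(numbers, "abcd"))                    # walk two sequences in parallel

comprehension = [(x, x * 2) for x in range(1, 11) if x % 2 == 0]

print(squares, evens, total, pairs, comprehension)
```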

Dictionaries (dict)

  • Initialize empty dicts: dict = {}
  • Loop over keys: for x in dict:
  • Remove a key-value pair with: del dict[key]
  • Test for key: if 'key' in dict
  • Count keys (not all items): len(mydict)
  • Just the keys: dict.keys()
  • Just the values: dict.values()
  • (key,value) pairs: dict.items()
  • Get a value, or, if there is no such value, set it: mydict.setdefault('key', 'defaultvalue')
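
A quick sketch that puts these dict operations together by counting words. The word list is invented.

```python
counts = {}
for word in ["a", "b", "a"]:
    counts.setdefault(word, 0)   # set to 0 only if the key is missing
    counts[word] += 1

number_of_keys = len(counts)           # counts keys, not the summed values
items = sorted(counts.items())         # (key, value) pairs
has_b_before = 'b' in counts           # test for key

del counts['b']                        # remove a key-value pair
has_b_after = 'b' in counts

print(number_of_keys, items, has_b_before, has_b_after)
```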

Strings

Type conversion to string: Enclose in backticks "`" (Python 2 only) or use str(). This switches off interpretation of escaped characters when done on a string. Formatted printing with print "%s ... %s" % (s1, s2). If you do not want the auto-appended newline, append a comma. Raw strings (without escape interpolation) are written r"rawstring".

Control structures

Syntax:
  • Conditionals: if, elif, else
  • Loops: for x in seq:, while
  • Loop control: break, continue
  • Empty statement: pass
  • Exceptions:try:, except FooError:, else: and raise
Truth: empty lists, dictionaries, strings, the number zero and None (the undefined, void object) are false. Everything else is true. String comparison with ==, !=. None is smaller than anything except None. is checks for object identity (two pointers to the same object.)
Cool expressions for conditions: in checks if an item is in a list or a key in a dict.
Operators: ++ and -- are missing

Functions and Methods

Functions may not have the same name as data fields in classes, each member needs a unique name, or you end up with a 'str' object is not callable error. A function definition must have been parsed before its call, so you cannot call a function that is defined later in the same file.
Parameter passing: all parameters are passed by reference. Of course immutable objects cannot be changed, so they might just as well be by value. You can assign other objects to the parameter names inside the called function without consequences. When calling methods without parameters, remember to put the parentheses behind the method: object.method(), otherwise you get the method object back, instead of calling it. Argument syntax for caller func(value) or named func(name=value), for definition def func(name) or optional args: def func(name=default) for defaults, def func(*name), def func(**name) to take rest of args into a tuple or dict.
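
A sketch of what this means in practice. The function names mutate, rebind and describe are made up for the example.

```python
def mutate(items):
    items.append(99)          # changes the caller's list object

def rebind(items):
    items = [0]               # only rebinds the local name, caller sees nothing

def describe(first, second="default", *rest, **named):
    # rest collects extra positional args, named collects extra keyword args
    return (first, second, rest, named)

data = [1, 2]
mutate(data)
after_mutate = list(data)
rebind(data)
after_rebind = list(data)

full = describe(1, 2, 3, 4, key="v")
minimal = describe(first="x")

print(after_mutate, after_rebind, full, minimal)
```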
Names in functions have local scope, overriding globals with the same name. To use a global as such, declare it again inside the function with global theName. Variables are searched LGB (local, global, built-in). If the local fails, it looks through enclosing local scopes, too. Note that the class scope of a class inside a module is neither enclosing local, nor global, for the class's methods. Therefore, imports at class level are not seen in the methods.
A gotcha: if you only reference it, a global variable that is not locally defined is searched and found as a global, and no exception is thrown. But if you assign a value to that global anywhere in the function, the name is interpreted as local throughout the function. This causes references to the variable before the assignment to throw exceptions (UnboundLocalError). You must declare it as global in this case. Built-in names are things like len(), open() etc.
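The gotcha in runnable form, as a sketch for Python 3. The names counter, read_only, broken and fixed are made up.

```python
counter = 0

def read_only():
    return counter            # only referenced: found as a global

def broken():
    value = counter           # fails here, because ...
    counter = value + 1       # ... this assignment makes counter local
    return counter

def fixed():
    global counter            # declared global, so the assignment hits the global
    counter = counter + 1
    return counter

try:
    broken()
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__

first = read_only()
second = fixed()
```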
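join and other string functions can be called as methods of the string in question, for example ", ".join(items). This is better than importing the string module and calling string.join().
lambda anonymous functions may only contain a single expression, no statements. Since nested scopes arrived in Python 2.2 they can read the variables of the enclosing function, but they cannot assign to them, so they are still not full closures. (Sniff.)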

Documentation

Python comes with built-in documentation support in the form of docstrings. The pydoc tool can be used to automatically extract this documentation. A docstring is a string literal that occurs as the first statement in a module, function, class, or method definition. Such a docstring becomes the __doc__ special attribute. By convention, docstrings use triple quotes """. One-liners have the quotes all on the same line and end with a period. They should contain an explanation, not a restatement of the Python code, because you can get the parameters and member names via introspection.
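A short sketch of what that looks like, assuming Python 3. The function area and the class Parser are invented for illustration.

```python
import inspect

def area(width, height):
    """Return the area of a rectangle with the given sides."""
    return width * height

class Parser:
    """Parse one file format.

    Longer docstrings start with a one-line summary, then a blank line.
    """

doc_line = area.__doc__
# The parameter names come from introspection, so the docstring need not repeat them.
params = list(inspect.signature(area).parameters)
```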

Modularisation

In Python there are three levels of bundling things. On the basic level, you have the class, which bundles its methods. This class can reside in a file together with other classes, a so-called module. The file, or module, is the second level of packaging. This is very much like a Package in Perl.
Modules are namespaces. Loaded modules are objects of type module, but not classes. You can access names defined in them via modulename.__dict__ or dir(). If you want to import MyClass, it is not enough to put it into a file called MyClass.py in the import path and use import MyClass. This will only import the module, not the class object from the module, and you'll get a 'module' object is not callable error. Instead use from MyClass import MyClass, or reference it as x = MyClass.MyClass().
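To see the error without creating files, here is a sketch for Python 3 that fakes the module with types.ModuleType. The stand-in module and the class name MyClass are assumptions for the demo; a real project would of course have a MyClass.py file.

```python
import types

# Stand-in for a file MyClass.py that defines a class MyClass.
module = types.ModuleType("MyClass")
exec("class MyClass:\n    pass\n", module.__dict__)

MyClass = module              # what "import MyClass" would bind: the module
try:
    MyClass()                 # calling the module, not the class
    message = ""
except TypeError as exc:
    message = str(exc)

instance = MyClass.MyClass()  # module.attribute reaches the class
names = [n for n in dir(MyClass) if not n.startswith("__")]
```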
Modules have additional attributes like __author__ __builtins__ __date__ __file__ __name__ .
For larger projects, putting everything in the project into one file will not do. So you create a bunch of files and put them into a common directory, each file/module representing one larger logical part of your application, and add an __init__.py file to make that directory a package. The directory, or package, is the third level of packaging.
For really large projects, you can even create hierarchies of directories/packages, with each directory holding modules pertaining to a certain logical part of your application.
For example, you start with a simple app, MyApp, all in one file. When you realize it is too big, you split it into several modules in a MyApp directory, let's call them parser, config, viewer and engine. When you realize that each of them in turn is getting too big again, you turn it into a directory; for example, the parser directory contains various parsers for the various file formats.

Persistence

cPickle is the module to serialize arbitrary datastructures as ASCII text or in binary form.
shelve is a dbm-based approach that creates persistent hashes, where the values can be any python object. The pickled version of this object must abide by the limitations of the dbm system. Of course you can also use dbm or gdbm directly.
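A minimal round trip through both, as a sketch. It assumes Python 3, where cPickle has been folded into the plain pickle module; the record and the file name demo_shelf are made up.

```python
import os
import pickle
import shelve
import tempfile

record = {"method": "BLAST", "hits": [1, 2, 3]}

# pickle: serialize to bytes and back (protocol 0 is the old ASCII-style format).
blob = pickle.dumps(record, protocol=0)
restored = pickle.loads(blob)

# shelve: a persistent dict with string keys; values are pickled for you.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "demo_shelf")
with shelve.open(path) as shelf:
    shelf["run1"] = record
with shelve.open(path) as shelf:      # reopen to show the data survived
    from_shelf = shelf["run1"]
```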
MySQLdb, which conforms to the Python DB API interface:
>>> import MySQLdb
>>> db=MySQLdb.connect(user='bioinfo',host='biserv',passwd='',db='yoh')
>>> c = db.cursor()
>>> table = "bd_method"
>>> c.execute("select * from %s" % (table))
2L
>>> c.fetchone()
('BLAST',)
>>> c.fetchone()
('manual',)
>>> c.fetchone()


Debugging, Profiling

Creating the profile file
import profile
profile.run('foo()', 'profile_filename')
Evaluating the file (best done interactively in interpreter):
import pstats
p = pstats.Stats('profile_filename')
p.strip_dirs().sort_stats('time').print_stats(10)
strip_dirs() removes the pathnames from the names.
sort_stats('time') sorts stats in decreasing order of time used by each routine. Among other possibilities are 'calls' and 'name'.
print_stats(10) will print the top ten of the sorted list. print_stats('substr') will print the stats for functions whose name contains substr. Both filters can be used ('substr', 10) and are applied in order.
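The whole cycle in one self-contained sketch for Python 3. Two things differ from the lines above and are my own choices: I use cProfile, the faster C implementation with the same interface as profile, and runcall() instead of run(), so the example does not depend on the function being visible in __main__. The function slow_sum is invented.

```python
import cProfile
import io
import os
import pstats
import tempfile

def slow_sum(n):
    # deliberately naive work so the profiler has something to measure
    total = 0
    for i in range(n):
        total += i * i
    return total

stats_file = os.path.join(tempfile.mkdtemp(), "profile_filename")

# Creating the profile file
profiler = cProfile.Profile()
profiler.runcall(slow_sum, 100000)
profiler.dump_stats(stats_file)

# Evaluating the file; the report goes into a string instead of stdout
report = io.StringIO()
p = pstats.Stats(stats_file, stream=report)
p.strip_dirs().sort_stats("time").print_stats("slow_sum", 10)
text = report.getvalue()
```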

Objects, Types, Classes

Namespaces: module global (including modules __builtin__ for built-in functions, and __main__ for the default module that is used when invoking the python interpreter, for example by invoking a script), function local. Scope is searched function-local (ascending through enclosing functions), module global, builtins. Plain blocks like if or for do not open a scope of their own.
Everything is an object, even types are type objects. And objects are instances of a type. Weird circular definition. The most general type is object.
dir() on a class or module gives all its and its superclasses' __dict__ members. On an instance it shows all its own and inherited variables and members. Without argument it shows the local variables in scope. dir replaces the deprecated __methods__ and __members__ attributes.

Types

Python has multiple inheritance, therefore no interfaces (like in Java).
All types have a few common attributes:
  • __doc__ a documentation string describing the type.
  • __dict__ a dictionary of all attributes of the object
  • __class__ and __bases__ the class and tuple of base classes
  • __getattribute__ and __setattr__ are called on every attribute read and every attribute assignment; __getattr__ is called only when you try to access an attribute that is not defined
  • __hash__ __init__ __new__ __reduce__ __repr__ __str__ __delattr__ are built-in special attributes that are bound to wrappers for construction, hashing, pickling, attribute deletion and built in functions like str() and repr(). They can be overridden
There is no type checking of arguments. That means:
  1. you can pass any object that can perform the operations (has the members to call), no matter what class it is. It is generally expected in python that you do not "look before you leap" (LBYL), but practice "it's easier to ask forgiveness than permission" (EAFP), meaning you do not check the type before, but handle exceptions that are thrown when an object doesn't support the required operation
  2. There is no way to overload methods by argument type: a second definition with the same name simply replaces the first. You can only override (for example the builtins)
  3. Class-wide variables are only initialized once (when the class is loaded). To init for every instance, put initialisation into the __init__() method
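Points 1 and 3 of the list above as a sketch in Python 3. All class and function names here are invented for the example.

```python
class Duck:
    def speak(self):
        return "Quack"

class Robot:
    def speak(self):
        return "Beep"

def make_it_speak(thing):
    # EAFP: just try the call and handle the failure afterwards.
    try:
        return thing.speak()
    except AttributeError:
        return "(silence)"

sounds = [make_it_speak(x) for x in (Duck(), Robot(), 42)]

class Shared:
    items = []                 # class-wide, created once when the class is defined

class PerInstance:
    def __init__(self):
        self.items = []        # a fresh list for every instance

a, b = Shared(), Shared()
a.items.append("x")            # b sees it too, there is only one list
c, d = PerInstance(), PerInstance()
c.items.append("x")            # d keeps its own empty list
```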
tuple, list, dict, file, int, float, str, property are all built in types and can be inherited and instantiated. Some built in object types that are not directly instantiable are module, class, method, function, traceback, frame, code, builtin.
type(x) returns x's type. You can also do type(object) is type. x.__class__ is the same as type(x) for instances.

Classes

What's the difference between types and classes? Basically, types are built in, and may represent things that cannot be instantiated, whereas classes are user (or library) defined. They are both of type 'type'. Classes can have some additional attributes:
  • __slots__ lists the only attribute names that instances of the class may have; such instances get no __dict__ of their own (normally you can create members just by assigning to them)
foo = staticmethod(foo) creates a static method foo that does not need a reference to self. There's also a weird classmethod, which furnishes a reference to the calling class as first argument.
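Both in a small sketch, assuming Python 3. The classes Temperature and Kelvin are made up; the static method uses the assignment spelling from above, the class method the decorator spelling, which does the same wrapping.

```python
class Temperature:
    def __init__(self, degrees):
        self.degrees = degrees

    def to_fahrenheit(degrees):
        # no self: works on its argument only
        return degrees * 9 / 5 + 32
    to_fahrenheit = staticmethod(to_fahrenheit)

    @classmethod
    def freezing(cls):
        # cls is the class the method was called on, not necessarily Temperature
        return cls(0)

class Kelvin(Temperature):
    pass

f = Temperature.to_fahrenheit(100)
t = Kelvin.freezing()
```

Because freezing receives the calling class, Kelvin.freezing() builds a Kelvin instance rather than a Temperature.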
The function inspect.getmembers(object[,predicate]) returns a list of all members of the object. Everything in python is an object, like object itself; even types and classes are.
Mutable sequence objects like lists also have special attributes for indexing and slicing [:], and for the operators <,>,==,!=,>=,<=,+,* as well as built in methods like len() or object specific methods like append or remove. The distribution between object methods and built in functions is unfortunately arbitrary. There are many, many more special attributes for dictionaries, strings and various number types. You can see the attributes for any object by executing
import inspect
obj = []  # put your object here
members = inspect.getmembers(obj)
# the % form prints the same in Python 2 and 3
for n, v in members: print("%s => %s" % (n, v))

Lisp

I always wanted to learn the language that needs syntax highlighting to be intelligible at all. Lisp is a functional language instead of an imperative one, for a change. Here are some scribbled notes on EMACS Lisp, which conveniently comes with my editor.

Useful lisp control structures and idioms

Control structures
(if CONDITION THEN-BODY ELSE-BODY)
(when CONDITION WHEN-BODY)
(cond                       ;; lisps switch
   (CONDITION BODY)
   (CONDITION2 BODY2) ...)

(eq A B) ;; true if the same object (char literals, numbers etc)
(equal A B) ;; true if objects have same components (lists, strings)
and, or, not ;; combine conditions
everything that evals to nil or () (empty list, is nil, too) is false
everything else is true; t is true


Function definition
(defun NAME (ARG_LIST)
  "Commentary,
which may span several lines
and respects indentation."
   (interactive ["r"]) ;; if it is an interactive function
                       ;; * signal error if buffer is read-only 
                       ;; p take numeric prefix arg
                       ;; r take point and mark as params, smallest first (region) 
   ;;; here comes the actual function body
   )
Saving the current environ(buffer, point, mark)
(save-excursion
  ;;; code here
  )
Useful to wrap around the body of a function, so you
do not have to restore values by hand (and also works
if function crashes)

Local vars in functions 
(let ((varname1 (defining-function1))
      (varname2 (defining-function2)));; end of definition list
   ;;;code using local vars here
   )
(set 'symbol 'value) ;; quote value unless you want it eval'd
(setq symbol 'value) ;; the same, q stands for quoted symbol


Working with datatypes 
'(1 (2 (3))) ;; literal list
(list 1 2 3) ;; create a list
(concat SEQUENCES) ;; return a string; a SEQUENCE is a LIST, STRING or ARRAY 

numbers in a sequence are interpreted as the character with that
number. to get them as literals use (number-to-string)
For regexen that are represented as strings, a literal \
has to be escaped twice, once for the regex, once for the
string, yielding "\\\\"


Useful Emacs variables and functions

Square brackets indicate optional arguments here. For full arguments and descriptions see the emacs documentation.
(point) ;; current position of point
(newline) ;; insert a newline
(goto-char POS) ;; move point to POS
(beginning-of-buffer), (beginning-of-line)
(end-of-buffer), (end-of-line)
(re-search-forward "re-string" [max-buffer-pos ret-nil-if-t repeat]) returns nil if no match found
(replace-match "newtext")
(kill-region BEG END)

(((vintage lisp)))

(progn A B C ...) ;; execute A, B, C in that order
                  ;; function bodies are an implicit progn
(cons A B) ;; build a cons cell from A and B
(car '(list)) ;; get the car of the first cons cell
(cdr '(list)) ;; get the cdr of the first cons cell

CSS

Using CSS is a clean way to separate the formatting of your web pages from the logical content, which is written in html paragraphs and headings.

Including CSS information

You can include CSS stylesheets in three major ways:
  • You define them in an exterior CSS file, and include it with the line <LINK REL=STYLESHEET TYPE="text/css" HREF="your/url.css"> in the header. This helps you to get a consistent style in all your pages, by including a single line. The URL can be fully qualified to a css file somewhere on the web, so you do not need to worry about relative paths.
  • You define the style for the page in the head section of the page. Enclose it in a multi-line comment for older browsers. This approach has the advantage that you do not need access to a separate CSS file, but only helps you get consistent style for one page:
    <style type="text/css">
    <!-- /* ... Style-Sheet-Directives, eg */
    p.nice { font-family:"Century Schoolbook",Times,serif; }
    p { font-size:10pt; }
    -->
    </style>
  • You define mini-style sheets for single elements, with the STYLE attribute. If you don't have an element for your text, use the generic SPAN element (which is inline and doesn't linebreak like DIV does): <SPAN STYLE="color:#FF0000">Like this!</SPAN>

CSS syntax

If you define styles in the head or a dedicated CSS file you have to define which elements to format. The syntax is simple: you just name the lowercase element name, or a comma separated list of them, and, enclosed in braces append the semicolon separated list of name:value pairs. If there are several values for a name, separate them by comma. You can quote strings with spaces with quotation marks. It doesn't hurt to append a semicolon after the last name:value pair. For example:

p, th, td { font-family:"Courier New",Courier,monospace; font-size:10pt; }

Often you do not want all your paragraphs or table cells to look the same way. For this, you can define classes, by following the element name with a dot and a class name you invent, for example:

p.emphasis { font-family:"Century Schoolbook",Times,serif; }

Then, you identify the elements to which this applies with the matching class attribute, for example:
<p class="emphasis">I promise nothing but blood, sweat, toil and tears</p>

If you leave out the element name, you define a general style open to any html element, for example:

.weeny { font-size:6pt; }

When formatting single elements with the STYLE attribute, you do of course not need the tagname or the braces. It's clear you refer to just that element.

IDs and classes

In addition to the class attribute, there is also an ID attribute. IDs are defined using a hash symbol instead of a dot, (like #myid). If an html element is given (like p#myid), then the ID is defined only for this element type. The element to use this formatting is then designated with the ID attribute, e.g.

<p ID="myid">.

What is the difference between the id and the class attribute in html elements for CSS? The id attribute is supposed to be used for elements that are unique in the page, appearing only once. Because of this, it is typically used for div sections that occur only once each per page, like navigation bars, sidebars, main text area. The class attribute is supposed to be used for elements that show up multiple times. That being said: I found that browsers uncomplainingly render multiple elements with the same ID attribute, and it is shorter to type.

Borders and colored areas

Styles make it possible to draw borders around a paragraph or tint its background without having to resort to a table. This is a demonstration of setting the padding between the border and the box representing the element's content to 10 pt (note there is still space between element content and border). background: yellow; border-color: black; border-style: solid; border-width: medium; padding: 10pt;

Free positioning of Elements on the page

Width together with position can be used to format columns (instead of blind tables). Caution: long words take precedence over width! Use float:left to put the element on the left and make text flow around it, or use float:right. Default is no float. Normally, the last element in the source prints over the ones before. With z-index: you can change that (the higher value wins the front; works for positioned elements only, not for position:static!):
float:left; border:none; background-color: #AAEEFF; position:static; width:320px; padding: 5pt;
To position elements you define a DIV element, to which formatting is applied, and enclose your paragraphs and so on in it. position:absolute can be used to place the box absolutely on the page (at top: and left: pixels or cm from the upper left corner). position:relative shifts the box by top: and left: relative to the place it would normally occupy, and position:static is the normal flow and the default (but you can still use the width property).
Here a DIV <div>...</div> has been defined, formatted with the following values:
z-index:1; border:none; background-color:#EEEEFF; padding:5mm; position:absolute; top:2090px; left:90px; width:360px;
A second DIV area is in front of the other one because the z-index is higher, independent of the order they appear in the HTML source:
z-index:2; background-color:#FFEEDD; position:absolute; top:1000px; left:400px; width:260px;

Java Refcard

This page about Java is quite old, from days when Java was new and kinda slow. Today there surely are all kinds of goodies and much better libraries.

/* ... */    multiline comment          /** ... */ javadoc comment, common tags
//           until end of line comment  @see @version @return @param           
\u1234       UNICODE character          @author @exception @deprecated
                                                                               
public       access for all classes                                            
protected    access for classes from same package and subclasses               
<default>    access for classes from same package                              
private      access for no-one but the class itself                            
                                                                               
static       class variable, access by class not instance, inited in class init
final        class can not be subclassed, field can not be changed             
static final CONSTANT, is inlined during compilation                          
{ ... }      block delimiters                                                 
" ... "      String literal,                 
                                                  
void         no return value from method    ++, --  increment/decrement                       
boolean      true/false                     +, -    unary                                     
char         16 bit (UNICODE character)     ~       bitwise NOT                               
byte          8 bit                         !       logical NOT                               
short        16 bit integer                 (type)  typecast                                  
int          32 bit integer                 *, /, % mult, div, remainder                      
long         64 bit integer                 +, -    add, subtract
float        32 bit IEEE754 floating point  +       String concat                             
double       64 bit IEEE754 floating point  <<      left shift                          
Object       any kind of non-primitive type >>      right shift sign ext.                                           
null         null reference to an Object    >>>     right shift no ext                                           
                                            <, <=, >=, > numeric comparison                                   
field        instance or class variable     instanceof   type comparison                  
method       function, subroutine           ==, !=  equality, inequality of               
member       field or method                        value or reference                    
constructor  class instance initializer     &       bitwise or boolean AND                       
                                            ^       bitwise or boolean XOR                             
\ is escape character                       |       bitwise or boolean OR               
\b backspace       ^H 08                    &&      conditional AND             
\t tab             ^I 09                    ||      conditional OR                                   
\n newline            0A                    ?:      conditional ternary op                                              
\r carriage return ^M 0D                    =       assignment                            
\f form feed                                *=, /=, %=, +=, -=, <<=, >>=,     
\" "                                        >>>=, &=, ^=, |= assign with op  
\' '                                        .equals() obj content comparison              
\\ \                                          
\nnn octal number (max. 377)                  

Reference class header

                                                                              
// for a class in a package, the .class file has to be in a subdir of the     
// classpath or of a jar file: for example, let's look at
// <classpath-dir>/pack/subpack/MyClass.class
package pack.subpack;                                                         
                                                                              
public   // visible outside of package. Only one class/file may be public.    
final    // cannot be subclassed                                              
abstract // contains abstract methods (methods without implementation block)  
class MyClass // instanceof MyClass, MySuperClass, MyInterface        
extends MySuperClass   // is a subclass of MySuperClass:                      
                       // inheritance of non-private members of MySuperClass
                       // overriding of methods with same name                
                       // shadowing  of fields  with same name                
implements MyInterface // is a MyInterface, implements methods declared there 
{                                                                             
//fields
    // these modifiers can be used only with instance or class    
    // variables, not with vars local to a block (which always    
    // have only block-wide scope and visibility)                                                                        
    public or protected or private or <default>  
    final       // can not be reassigned once initialized
    static      // "global", shared by all instances of the class             
    MyType myVar; // declaration. If init missing, Java supplies default init 
                  // to 0 or  0.0 or false or null.                           
    static {   // static constructor for init of static vars. executed once   
    }          // when class is loaded.                                       
                                                                              
//constructors                                                                
    public or protected or private or <default>                   
    MyClass() {  // has no return type, since always returns                  
    }            // constructed object. If none specified, Java supplies      
                 // default MyClass() { super(); } constructor.               
    // this(args)  call another constructor of this class or                 
    // super(args) call another constructor of superclass, both have to be   
    // first statement in block. If missing Java supplies super()             
                                                                              
//methods                                                                     
    public or protected or private or <default>                   
    //if class final or method final or private, no overriding is possible,   
    //no dynamic binding needed, less overhead, calls are faster              
    final  // can not be overridden                                           
    static // class method not bound to instance, is also inherited,
           // thus implicitly final          
    native // platform dependent, makes use of native code
    synchronized // acquires monitor lock before executing (per class if
                 // static, per object otherwise). used in multithreading
    abstract // ; instead of {...} block                                      
    methodName(params) // overloading: same name with different param list    
    throws AnException // overriding methods must throw same exc. or subtype  
    

Classpath

The classpath is a list of places where Java classes can be found.
In the following path names jre refers to your java home directory, also specified in the java environment variable java.home, which is the top-level directory of your JRE installation. When the VM is searching for a class of a particular name
  1. it will first look among the bootstrap classes. These are core classes like the java.lang package. They are located in jre/lib/rt.jar and jre/lib/i18n.jar.
  2. it will next look for the class among any installed extensions. Installed extensions are jar files in the directory jre/lib/ext. An installed extension may contain executables and shared libraries (such as .dll files). Executables are placed in jre/bin on Windows, or jre/lib/[arch] on Unix. Native libraries may also be placed in jre/lib/ext/[arch], searched afterwards.
  3. it will then look for download extensions. A download extension is a jar file, and its location is explicitly specified in the Class-Path header field in the manifest of another jar file. Download extensions cannot have any native code.
  4. finally, it searches the classpath, which is stored in java.class.path. You can set it explicitly using the option -classpath or -cp. Otherwise, the CLASSPATH environment variable sets it. If this also is not available, then it defaults to the current directory (which most of the time will not work). When using the -jar option, the classpath is set to the jar file, which overrides any other definition. The JAR file's manifest is consulted to find the name of the application's main class specified via the Main-Class manifest attribute.
When you install both JRE and JDK, you should use the directories of the JRE, which are also set in your registry on Windows by the installer. If you have both a JDK with its own JRE and a standalone JRE, the system should use the latter. If it still looks under the JDK-JRE instead, rename the "jre" folder in the JDK, which will force it to use the actual JRE.

Windows Batch Refcard

Syntax, commands, control structures and conventions for Windows batch files. Handy if you want to automate something on your Windows machine. Although, to be honest, the last time I did so seriously was 20 years ago during my civil service.
filename                        Batch files end with .BAT; there is no magic string like in Un*x.
                                Batch file commands are case insensitive.
comments                        rem at line start is a comment line
in/output redirection           cmd <in-file >out-file-new
                                cmd >>out-file-append
                                In COMMAND.COM stderr cannot be redirected and always goes to
                                the screen; cmd.exe on NT accepts cmd 2>err-file.
parameters                      %1 to %9  arguments number 1 to 9 given to the batch file
                                %*        all arguments given to the batch file (max. 9)
environment variables           %varname%
conditions                      if [not] "string1" == "string2" command
                                Enclose vars and strings in "". Idiom for empty check:
                                if "%var%" == "" command ("" needed for empty string).
                                if [not] exist filename command
                                if errorlevel 1 command would test for any abnormal exit.
                                errorlevel is the number the last program executed returned.
                                The condition is true if that number is >= the number given.
                                Normal exit returns 0.
logical operators               && execute only when last prog returned ok
                                || execute only when last prog returned error
script loops                    for %%var in (group) do command [params]
                                var takes on each file name given in group,
                                e.g. for %%f in (*.*) do type %%f
script jumps                    goto mark
                                The target mark itself has to be prepended with : so NT can
                                skip it during execution.
                                :mark
script chaining                 call blah.bat calls another batch like a subroutine, i.e. the
                                current batch's execution is resumed after the other one ends.
                                Recursion possible.
script messages                 echo [on|off][message]
                                @ at line start suppresses echoing of that single line.
                                @echo off suppresses echoing of all lines (including this one).
                                Echo with message prints message.
getting user input for scripts  pause [message] halts until any key is pressed.
...                             shift shifts the params in a batch one down, e.g. %9 to %8.
                                %0 is lost and if there were more than 10 params, the current
                                number 11 becomes %9. Example loop with shift (it tests %1,
                                because %0 initially holds the name of the batch file itself):
        :next
        if "%1" == "" goto end
                type %1
                shift
        goto next
        :end

R

Find out all about R at the R website, which has really exhaustive wonderful documentation. The Language Reference is better than the Manual. This is just for me for starters so I can document what ground I covered and have the help available in another window. Online help is faster via the help function.

Note: in Blogger's dynamic template, unfortunately name anchors do not work, so you cannot use the above list to jump to the section of interest.

Scribbled Notes

This section contains raw scribbled notes that have to be revised.
return(x) - write as a function

matrix^-1 with solve(matrix)
x'A^-1x as x %*% solve(A,x)


Info

search() lists the search path, i.e. the environments R looks through for names: the global environment first, then mostly attached packages.
The contents of packages in the environments listed by search may then be listed by ls(index) or ls('name'). Just ls() is like ls(1), which refers to ".GlobalEnv". For listing the contents of a package, use ls('package:libname').
dir() instead lists objects in directories on the file system, by default the current directory.
library() lists all available packages, or loads one when called with a package name.
help(name) and apropos(name) search through the documentation, for exact matches or any item that somewhere contains the word. A shortcut for help(name) is ?name.
args(name) shows the arguments and default values of a function.
Typing the name of any function without parentheses lists the source code for this function. This is great to find out in detail what it does, and to learn programming in R.

Input/Output

Save data with save(obj, file="filename") and load it back with load("filename"). The data file is binary, and should end in .rda.
When you use data() to load a dataset, R searches for data files in data subdirs of the working directory or directories of loaded packages.
  • .R and .r files are source()ed as R source code
  • .RData and .rda are loaded as binary files
  • .tab .txt .csv are read with read.table().
Load data frames from the typical tab separated tables with a leading header row and column with read.table("filename", header=TRUE, row.names=1, sep="\t"). Note that by default no row may contain a #, since R interprets it as starting a comment and ignores the rest of the line. Also ' screws up the reading, because it is interpreted as a quotation mark. Both can be switched off with comment.char="" and quote="".

Operators

<-,=    assignment
==,<=   comparison
%o%     outer product
%*%     matrix multiplication
:       sequence generation
*,/,+,- elementwise multiplication, division, addition and subtraction
|, &    elementwise (vector) or, and
||,&&   short-circuiting or, and for single values

Datastructures

The most irritating thing for me as a beginner with R is the datastructures, which vary quite a bit from other programming languages, seem redundant and sometimes not very, well, structured.
For starters, INDEXES START FROM 1. Not from zero, like any well-behaved index should.
There are vectors, arrays, matrices, factors, lists, and data frames. R knows no scalars. Most of the basic indexing and naming stuff that applies to all these datastructures is covered under Vector.
                linear   rectangular
all same type   vector   matrix
mixed type      list     data frame

Literals and Names

TRUE, FALSE, NA
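Names are case sensitive, must start with a letter (or a dot) and may contain digits, letters and the dot. Old versions of R did not allow the underscore, because it was an assignment operator; current versions do.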

Vector

Vectors are the simplest kind of list object. All elements must be of the same type (logical, integer, real, complex or character). Even they can be indexed via name. Note that literal vectors are created by the c() function, not just by parentheses. Missing values are represented by NA.
Creation
c(2,3,4)
1:10
seq(-5,5,by=.2)
rep(x,times=5)
a>2
Names
names(x) = c("Frodo", "Bilbo", "Sam")
c("Frodo"="Ringbearer", "Bilbo"="Old One", "Sam"="Sidekick")
Indexing
a[2] single element
b[1:3] range
b[3:1] range, reverse order
c[-(2:3)] everything except the elements in the range
d[c("Frodo", "Sam")] named elements
e[!is.na(e)] selection by boolean vector
f[f<17] ditto
Useful funcs
sum
mean
var
length
sort
Notes on indexing: especially interesting is the possibility to provide a vector of booleans as index, as this vector can be generated by a test on the original vector, thus selecting all elements that pass the test.
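A small sketch of the indexing variants, with made-up values. Note that a test on a vector containing NA yields NA at that position, so the NA is filtered out first here:

```r
a <- c(5, 20, NA, 12)

# Boolean index generated by a test on the vector itself
small <- a[!is.na(a) & a < 17]

# Negative index: everything except elements 2 and 3
rest <- a[-(2:3)]

# Named elements
hobbits <- c("Frodo" = "Ringbearer", "Bilbo" = "Old One", "Sam" = "Sidekick")
picked <- hobbits[c("Frodo", "Sam")]
```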

Factors

Factors are vectors that fall into discrete classes. Levels are the different unique values of a factor.
Creation factor(c("Man", "Orc", "Orc", "Elf", "Man"))
Levels levels(x)
Useful funcs tapply(vector, factor, function)
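As a sketch of what tapply does with a factor: it applies the function to the vector separately for each level. The height values are invented for the example:

```r
race <- factor(c("Man", "Orc", "Orc", "Elf", "Man"))
height <- c(180, 160, 170, 190, 176)   # hypothetical measurements

lev <- levels(race)                    # unique values, sorted alphabetically
means <- tapply(height, race, mean)    # one mean per level
print(means)
```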

List

Lists are like vectors, but can contain mixed elements of any kind of object, especially other lists. So you can build up complex data structures from them (hello, Lisp!).
Creation list(elements) as.list(vector)
Indexing
L[2] a sublist, (shown as a list including names)
L[[2]] a single element (shown as vector without the name)
L$a element named a (points to the same as L[[]])
L[["a"]] the same
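The difference between single and double brackets is easiest to see by trying it out; the list content here is arbitrary:

```r
L <- list(a = 1:3, b = "text", c = list(x = TRUE))

sub  <- L[2]      # still a list, of length 1, keeping the name "b"
elem <- L[[2]]    # the element itself, here a character vector

same <- identical(L$a, L[["a"]])   # both forms reach the same element
nested <- L$c$x                    # lists inside lists
```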

Array

Arrays are vectors with a dim attribute, so they can have one or more dimensions. A matrix is the two-dimensional special case.
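A quick sketch suggesting that an array is just a vector that carries dimensions; the shape 2x3x4 is an arbitrary choice:

```r
v <- 1:24
A <- array(v, dim = c(2, 3, 4))   # build an array from a vector
last <- A[2, 3, 4]                # one index per dimension

dim(v) <- c(2, 3, 4)              # giving a plain vector a dim turns it into an array
```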

Matrix

A matrix is a two dimensional vector.
Creation matrix(data,nrow,ncol)
as.matrix(object)
rbind(vec1, vec2) row-wise
Useful funcs dim
Indexing For indexing matrices there are two ways: one, treating the matrix as one large vector. This method is used if an index of only one dimension is given. Elements are counted running through cols top to bottom, then left to right, compare as.vector() and the indexing under vector. Two, treating the matrix as two-dimensional. This is used if a two dimensional index is given (using a comma):
M[13] 13th element (as a vector of length 1)
M[[13]] 13th element
M[1:3,4:5] rows 1-3, col 4-5 of matrix
M[-(1:3),] rows 4 to end
M[1,] row 1 all cols
M[,2] all rows, col 2
M full matrix
M[,c("n","m")] cols "n" and "m"


Notes on indexing: Unlike in data frames, indexing with only a single index returns a single element, not a whole column.
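A sketch of both ways of indexing on a small example matrix. The matrix is filled with 1:20 so that each value equals its position in column order; the column names are made up:

```r
M <- matrix(1:20, nrow = 4, ncol = 5,
            dimnames = list(NULL, c("k", "l", "m", "n", "o")))

one_dim <- M[13]            # counted down the columns: row 1 of column 4
two_dim <- M[1, 4]          # the same element via row and column
block   <- M[1:3, 4:5]      # rows 1-3, cols 4-5
tail_rows <- M[-(1:3), ]    # everything except rows 1-3, i.e. row 4
by_name <- M[, c("n", "m")] # columns selected by name
```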

Data Frame

A data frame looks like a matrix but may have different types in different columns. Each column is a vector.
Creation data.frame(...) read.table(...)
Useful funcs names row.names as.matrix
Indexing
DF[1:3,1:2] upper left corner 3 rows x 2 cols of data frame
DF[1] col 1 as a one-column data frame (a list)
DF[[1]] col 1 as vector/factor
DF[,1] col 1 as vector/factor
DF[1,] row 1 as a one-row data frame
DF['n'] col 'n' as a one-column data frame (a list)
DF[['n']] col 'n' as vector/factor
DF[,'n'] col 'n' as vector/factor
DF$n col 'n' as vector/factor
DF['n',] row named 'n' as a one-row data frame
DF[c("n","m")] cols named "n" and "m"
Notes about indexing: For data frames x[,1] (or x[[1]]) returns the first column as a vector (x$myname returns the same if the column was named myname), which prints as a long list of values, as any vector would. Now, x[1] returns the first column as a one-column data frame, which prints as a nice single column. This is because data frames are implemented as a list of vectors, with each vector a column. So the nth element is the sublist holding the nth column. It puzzled me at first how x[7,] then selects the seventh row: the two-index form is handled by the data frame method of [, which indexes the frame like a matrix.
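The different return types can be checked directly. The data frame below is a made-up example with columns n and m and row names r1 to r3:

```r
DF <- data.frame(n = c(1.5, 2.5, 3.5), m = c("a", "b", "c"),
                 row.names = c("r1", "r2", "r3"), stringsAsFactors = FALSE)

col_as_df  <- DF[1]        # list-style index: a one-column data frame
col_as_vec <- DF[[1]]      # the column vector itself
col_comma  <- DF[, 1]      # matrix-style index: also the vector
row_by_pos <- DF[2, ]      # a one-row data frame
row_by_nm  <- DF["r2", ]   # the same row, selected by row name
```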

Plotting

plot() for general plotting. pch='.' to use dots as plotting characters.
abline(intercept, slope) draws a line into the existing plot.
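A minimal sketch combining the two calls. The data are simulated, and writing to a temporary PDF file via pdf() is only an assumption to make the example run without a screen device:

```r
set.seed(1)
x <- 1:100
y <- 2 + 0.5 * x + rnorm(100)    # noisy points around a known line

out <- tempfile(fileext = ".pdf")
pdf(out)                         # open a file device instead of a window
plot(x, y, pch = ".")            # scatter plot with dots
abline(2, 0.5)                   # intercept 2, slope 0.5
dev.off()                        # close the device, which writes the file
```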

Syntax

# comments

Lexical (static) scoping

All vars that are params or are assigned to in a function are local. All
others are free variables, which are looked up in the enclosing
environments (where the function was defined), up to the global one.
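A small sketch of what lexical scoping means in practice; the function names f and g are made up. The free variable x in f is looked up where f was defined, not where it is called from:

```r
x <- 10

f <- function(y) {
  z <- y * 2     # z and y are local
  x + z          # x is free: found in the environment where f was defined
}

g <- function() {
  x <- 1         # local to g, invisible to f
  f(1)
}

result <- g()    # uses x = 10, not g's x = 1
```

With dynamic scoping f would have picked up the x of its caller g instead.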


Objects


Access (indices count from 1 not from 0)
A[A==2]         # all elems that are == 2

Function definition
A { } block is also an expression; it evaluates to the value of the last statement within.

funcname <- function(param,..,defparam=expr) expr
the expression ... may be used for pass-through argument lists
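A sketch of a function with a default parameter and a pass-through argument list. scaled_mean is a hypothetical name; the extra arguments in ... are simply handed on to mean():

```r
scaled_mean <- function(x, factor = 2, ...) {
  # the block evaluates to its last statement, which is the return value
  mean(x, ...) * factor
}

r1 <- scaled_mean(c(1, 2, 3))                  # default factor = 2
r2 <- scaled_mean(c(1, 2, NA), na.rm = TRUE)   # na.rm is passed through to mean()
r3 <- scaled_mean(c(1, 2, 3), factor = 10)
```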

if (expr1) expr2 else expr3
for (var in vector) expr
break,next

switch (
    var,
    key1 = statement,
    key2 = statement)
while (cond) expr
repeat expr # must be broken by break from within
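The control structures above in a short sketch. The keys and values are invented; the unnamed last argument of switch acts as a default when no key matches:

```r
describe <- function(key) {
  switch(key,
         elf = "immortal",
         orc = "grumpy",
         "unknown")            # default for any other key
}

# repeat runs forever unless broken from within
i <- 0
repeat {
  i <- i + 1
  if (i >= 3) break
}

# next skips to the following iteration
evens <- c()
for (v in 1:6) {
  if (v %% 2 == 1) next
  evens <- c(evens, v)
}
```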

is.null(item) # Method calls


Useful Functions

Packages update.packages package.contents library/require search
Object creation c vector array matrix data.frame list environment rep seq
Lists/Vectors unlist
Hashes/Environments environment ls get exists
Vectors c vector names
Arrays (Vectors with dim) array aperm dim outer
Matrices (2D-arrays) matrix t crossprod diag cbind rbind solve det eigen svd lsfit dist nrow ncol row col scale cor var cov
Lists list attach detach
Data Frames data.frame names row.names methods as.matrix
Interactive getwd edit
Coding dir mode any all lapply substitute eval table iter length unique as.function as.numeric
Debugging/Optimizing system.time
Regexen grep sub match
Info help apropos/find example search ls/objects methods data library
I/O data source load cat write.table read.table library/require
Math sqrt prod sum cumsum/cumprod density
Visualisation heatmap image plot rug boxplot pairs coplot qqplot hist dotchart persp Lowlevel: points lines text axis title legend General Params: par
Stat sd var mean median stem hist qqnorm qqline qqplot ecdf norm (dnorm=density, pnorm=cumulative distribution, qnorm=quantile function, rnorm=simulation)
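The four functions of the norm family, sketched for the standard normal distribution (mean 0, sd 1 are the defaults); the seed is set only to make the simulation repeatable:

```r
d <- dnorm(0)         # density at 0
p <- pnorm(0)         # probability of a value <= 0
q <- qnorm(0.975)     # value below which 97.5% of the mass lies

set.seed(42)
sims <- rnorm(5)      # five simulated draws
```

Other distributions follow the same d/p/q/r naming pattern, e.g. dbinom, punif, rexp.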

Libraries

Rcmd INSTALL pkgs # where pkgs is a tar.gz file or dir location
libraries are installed under .Library in the following structure:
mylib                               lib name
|   CONTENTS
|   DESCRIPTION
|   INDEX                           created by Rdindex man > INDEX
|   TITLE                           deprecated, put it in Title: under DESCRIPTION
|   README                          optional
|
+---chtml
|                                   ?
+---help           
|       AnIndex
|       00Titles                    R help files, may be in zip file
|       caha
|       clin2mim ... etc
|
+---html 
|       00Index.html                html help files, may be in zip file
|       caha.html
|       clin2mim.html ... etc
|
+---latex
|       caha.tex                    latex help files, may be in zip file
|       clin2mim.tex ... etc
|
+---Man
|       caha.rd                     R help files in R documentation format, may be in zip file
|       clin2mim.rd  ... etc
|
+---R
|       mylib                       the actual library file with R code
|
\---R-ex
        fetchAvgDiff.R              code examples, may be in zip file
        firstpass.R ...

Environment

Initialisation sequence: Rprofile.site, .Rprofile, .RData, .First()
  1. $R_PROFILE || $R_HOME/etc/Rprofile.site is the site init file
  2. .Rprofile is sourced if
    • R is invoked from the same dir or
    • it's in your home dir
  3. .RData in the working directory is loaded, if present
  4. .First() is executed, if it was defined in any of these files
Cleanup sequence: .Last()
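A sketch of what a .Rprofile using these hooks might look like. The messages are placeholders; anything could go into the function bodies:

```r
# Hypothetical content of a .Rprofile file

.First <- function() {
  # run once at the end of the initialisation sequence
  cat("Session started in", getwd(), "\n")
}

.Last <- function() {
  # run when the session is closed
  cat("Session ended\n")
}
```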

R and Emacs

To add R to your Emacs, first install R on your machine. On Windows there is a program called Rterm, which provides a command line interface to R.
Then install the Emacs ESS package (if it is not among the default packages), byte-compile it like this: (byte-compile-file "d:/Programme/emacs-21.2/lisp/progmodes/perl-mode.el") and tell Emacs to load it at startup in your .emacs file, like this: (load "d:/Programme/emacs-21.2/ess-5.1.24/lisp/ess-site" t)
Now you only have to let Emacs know where to look for the Rterm executable. This is done by adding the path to the executable to your Windows path variable; on Win2000 you can do this via Properties on the My Computer icon.
You start an R-process with M-x R.
You send a buffer region to R with C-c C-r, a function with C-c C-f and the whole buffer with C-c C-b. (mnemonic: region/function/buffer)