2012-09-03

R

Find out all about R at the R website, which has really exhaustive wonderful documentation. The Language Reference is better than the Manual. This is just for me for starters so I can document what ground I covered and have the help available in another window. Online help is faster via the help function.

Note: in Blogger's dynamic template, unfortunately name anchors do not work, so you cannot use the above list to jump to the section of interest.

Scribbled Notes

This section contains raw scribbled notes that have to be revised.
return(x) - write as a function

matrix^-1 with solve(matrix)
x'A^-1x as x %*% solve(A,x) (x*solve(A,x) is only the elementwise product; sum it to get the same value)
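
A minimal sketch of the two notes above, with a made-up 2x2 matrix A and vector x (both are just illustration values). It compares the explicit inverse with the solve(A, x) form of the quadratic form:

```r
# Made-up symmetric matrix and vector, for illustration only.
A <- matrix(c(2, 1, 1, 3), nrow = 2)
x <- c(1, 2)

A_inv <- solve(A)                    # explicit inverse of A
q1 <- drop(t(x) %*% A_inv %*% x)     # x' A^-1 x via the inverse
q2 <- sum(x * solve(A, x))           # same value without forming the inverse
q3 <- drop(x %*% solve(A, x))        # same again, using %*%
```

All three give 1.4 here. solve(A, x) solves the linear system directly, which is usually preferable to computing the inverse first.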


Info

search() lists the search path: the global environment (".GlobalEnv") followed by the attached environments. Those are usually packages.
The contents of packages in the environments listed by search may then be listed by ls(index) or ls('name'). Just ls() is like ls(1), which refers to ".GlobalEnv". For listing the contents of a package, use ls('package:libname').
dir() instead lists objects in directories on the file system, by default the current directory.
library lists all available packages, or loads one when called with a package name.
help(name) shows the documentation for an exact match, apropos(name) lists all objects whose name contains the word somewhere. A shortcut for help(name) is ?name.
args(name) shows the arguments and default values of a function.
Typing the name of any function without parentheses lists the source code for this function. This is great to find out in detail what it does, and to learn programming in R.
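
A short sketch of these calls in a fresh session. The stats package and the "^read\\." pattern are just examples I picked; any attached package or pattern works the same way:

```r
path_entries <- search()                # ".GlobalEnv" first, then attached packages
stats_objs   <- ls("package:stats")     # contents of one attached package
read_funcs   <- apropos("^read\\.")     # object names matching a regular expression

head(stats_objs)                        # peek at the first few names
args(seq_len)                           # argument list of a function
sd                                      # no parentheses: prints the source of sd
```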

Input/Output

Save data with save(obj, file="filename") and load it back with load("filename"). The data file is binary, and should end in .rda.
When data() is used to load a dataset, R searches for data files in the data subdirectories of the working directory and of the directories of loaded packages.
  • .R and .r files are source()ed as R source code
  • .RData and .rda are loaded as binary files
  • .tab .txt .csv are read with read.table().
Load data frames from the typical tab-separated tables with a leading header row and a leading column of row names with read.table("filename", header=TRUE, row.names=1, sep="\t"). NOTE: by default no row may contain a #, since R interprets it as starting a comment and ignores the rest of the line; pass comment.char="" to turn this off. A ' also breaks the reading, because it is interpreted as a quotation mark; pass quote="" to turn that off.
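
A small round trip as a sketch, using temporary files and made-up table contents. The comment.char and quote arguments of read.table are what let the # and the ' through:

```r
# Write a tiny tab-separated table that contains both '#' and an apostrophe.
txt <- tempfile(fileext = ".txt")
writeLines(c("name\tnote",
             "r1\tissue #12",
             "r2\tFrodo's ring"), txt)

tab <- read.table(txt, header = TRUE, row.names = 1, sep = "\t",
                  comment.char = "", quote = "", stringsAsFactors = FALSE)

# Save the data frame in binary form and load it back.
rda <- tempfile(fileext = ".rda")
save(tab, file = rda)
rm(tab)
load(rda)        # restores the object under its original name, tab
```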

Operators

<-, =       assignment
==, <=      comparison
%o%         outer product
%*%         matrix multiplication
:           sequence generation
*, /, +, -  elementwise multiplication, division, addition and subtraction
|, &        elementwise (vectorized) or, and
||, &&      short-circuiting or, and on single values
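
To make the difference between the elementwise and the matrix operators concrete, a small sketch with two made-up vectors:

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)

elementwise <- a * b          # 4 10 18
inner       <- a %*% b        # 1x1 matrix holding 32
outer_prod  <- a %o% b        # 3x3 matrix, entry [i,j] is a[i] * b[j]
both        <- a > 1 & b < 6  # elementwise and: FALSE TRUE FALSE

# && stops at the first FALSE, so stop() is never evaluated here.
short <- length(a) > 5 && stop("never evaluated")
```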

Datastructures

The most irritating thing for me as a beginner with R is the data structures, which vary quite a bit from other programming languages, seem redundant and sometimes not very, well, structured.
For starters, INDEXES START FROM 1. Not from zero, like any well-behaved index should.
There are vectors, arrays, matrices, factors, lists, and data frames. R knows no scalars. Most of the basic indexing and naming stuff that applies to all these datastructures is covered under Vector.
                linear    rectangular
all same type   vector    matrix
mixed type      list      data frame

Literals and Names

TRUE, FALSE, NA
Names are case sensitive, must start with a letter (or a dot not followed by a digit) and may contain digits, letters, the dot and the underscore. Only very old R versions rejected the underscore.

Vector

Vectors are the simplest kind of object. All elements must be of the same type (logical, integer, real, complex or character). Even vectors can be indexed by name. Note that literal vectors are created by the c() function, not just by parentheses. Missing values are represented by NA.
Creation      c(2,3,4)
              1:10
              seq(-5,5,by=.2)
              rep(x,times=5)
              a>2                        logical vector from a comparison
Names         names(x) = c("Frodo", "Bilbo", "Sam")
              c("Frodo"="Ringbearer", "Bilbo"="Old One", "Sam"="Sidekick")
Indexing      a[2]                       single element
              b[1:3]                     range
              b[3:1]                     range, reverse order
              c[-(2:3)]                  everything except the range
              d[c("Frodo", "Sam")]       named elements
              e[!is.na(e)]               selection by boolean vector
              f[f<17]                    ditto
Useful funcs  sum mean var length sort
Notes on indexing: especially interesting is the possibility to provide a vector of booleans as indexes, as this vector can be generated by a test on the original vector, thus selecting all elements that pass the test.
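
A sketch of these indexing forms on a small named vector. The names come from the notes above, the numbers are made up:

```r
x <- c(Frodo = 50, Bilbo = 111, Sam = 38)

second   <- x[2]                   # Bilbo, by position
by_name  <- x[c("Frodo", "Sam")]   # by name
dropped  <- x[-(2:3)]              # everything except positions 2 and 3
is_small <- x < 100                # logical vector: TRUE FALSE TRUE
small    <- x[is_small]            # keeps Frodo and Sam

y <- c(3, NA, 7)
known <- y[!is.na(y)]              # 3 7
```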

Factors

Factors are vectors that fall into discrete classes. Levels are the different unique values of a factor.
Creation factor(c("Man", "Orc", "Orc", "Elf", "Man"))
Levels levels(x)
Useful funcs tapply(vector, factor, function)
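
As a sketch of how tapply uses a factor to group a vector, here with made-up heights for the factor from the creation example:

```r
race   <- factor(c("Man", "Orc", "Orc", "Elf", "Man"))
height <- c(180, 160, 170, 190, 176)     # made-up values, one per element of race

lev  <- levels(race)                     # "Elf" "Man" "Orc" (sorted)
avgs <- tapply(height, race, mean)       # mean height per level
```

avgs is a named vector with one entry per level: Elf 190, Man 178, Orc 165.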

List

Lists are like vectors, but can contain mixed elements of any kind of object, especially other lists. So you can build up complex data structures from them (hello, Lisp!).
Creation list(elements) as.list(vector)
Indexing
L[2] a sublist, (shown as a list including names)
L[[2]] a single element (shown as vector without the name)
L$a element named a (points to the same as L[[]])
L[["a"]] the same
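
The difference between [ ] and [[ ]] is easiest to see by checking the class of the result. A sketch with a made-up nested list:

```r
L <- list(a = 1:3, b = "ring", c = list(x = TRUE))

sub    <- L[2]         # still a list, of length one, with the name "b"
elem   <- L[[2]]       # the element itself: "ring"
by_dol <- L$a          # 1 2 3
by_str <- L[["a"]]     # the same
nested <- L$c$x        # lists inside lists: TRUE
```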

Array

Arrays are vectors with a dim attribute, so they can have more than one dimension.

Matrix

A matrix is a two dimensional vector.
Creation matrix(data,nrow,ncol)
as.matrix(object)
rbind(vec1, vec2) row-wise
Useful funcs dim
Indexing For indexing matrices there are two ways: one, treating the matrix as one large vector. This method is used if an index of only one dimension is given. Elements are counted running through cols top to bottom, then left to right, compare as.vector() and the indexing under vector. Two, treating the matrix as two-dimensional. This is used if a two dimensional index is given (using a comma):
M[13] 13th element (a vector of length one)
M[[13]] 13th element
M[1:3,4:5] rows 1-3, col 4-5 of matrix
M[-(1:3),] rows 4 to end
M[1,] row 1 all cols
M[,2] all rows, col 2
M full matrix
M[,c("n","m")] cols "n" and "m"


Notes on indexing: Unlike in data frames, indexing with only a single index returns a single element, not a whole column.
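
A sketch of both indexing styles on a made-up 4x5 matrix filled with 1 to 20, so that the value of each cell equals its position in column-major order:

```r
M <- matrix(1:20, nrow = 4, ncol = 5,
            dimnames = list(NULL, c("k", "l", "m", "n", "o")))

thirteenth <- M[13]              # 13: the same cell as M[1, 4]
block      <- M[1:3, 4:5]        # 3x2 sub-matrix
last_row   <- M[-(1:3), ]        # row 4, returned as a plain vector
col2       <- M[, 2]             # 5 6 7 8
named_cols <- M[, c("n", "m")]   # columns picked (and reordered) by name
```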

Data Frame

A data frame looks like a matrix but may have different types in different columns. Each column is a vector.
Creation
Useful funcs
Indexing
DF[1:3,1:2] upper left corner 3 rows x 2 cols of data frame
DF[1] col 1 as one-column data frame (a list)
DF[[1]] col 1 as vector/factor
DF[,1] col 1 as vector/factor
DF[1,] row 1 as one-row data frame
DF['n'] col 'n' as one-column data frame (a list)
DF[['n']] col 'n' as vector/factor
DF[,'n'] col 'n' as vector/factor
DF$n col 'n' as vector/factor
DF['n',] row 'n' as one-row data frame
DF[c("n","m")] cols named "n" and "m"
Notes about indexing: For data frames x[,1] (or x[[1]]) returns the first column as a vector (x$myname returns the same if the column was named myname), which prints as a long list of values, as any vector would. Now, x[1] returns the first column as a one-column data frame (a one-element list), which prints as a nice single column. I imagine this is because data frames are implemented as a list of vectors, with each vector a column. So the nth element is the sublist of the nth column. It just puzzles me how x[7,] then selects the seventh row; presumably the two-index form is handled separately by the data frame's own [ method.
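
Checking the class of each result makes these cases easier to tell apart. A sketch with a made-up data frame (the column names n and m and the row names are arbitrary):

```r
DF <- data.frame(n = c(1.5, 2.5, 3.5),
                 m = c("a", "b", "c"),
                 row.names = c("r1", "r2", "r3"),
                 stringsAsFactors = FALSE)

one_col <- DF[1]          # one-column data frame
as_vec  <- DF[[1]]        # plain numeric vector
same1   <- DF[, "n"]      # the same vector
same2   <- DF$n           # the same vector
one_row <- DF["r2", ]     # one-row data frame, since columns differ in type
corner  <- DF[1:2, 1:2]   # upper left 2x2 corner, still a data frame
```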

Plotting

plot() for general plotting. pch='.' to use dots as characters.
abline(intercept, slope) draws a line into the existing plot.

Syntax

# comments

Lexical (static) scoping

All variables that are parameters of a function or assigned to within
it are local. All others are free variables: R looks them up in the
enclosing environments, up to the global environment.
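
A sketch of what that means in practice. All names here (scale_factor, scaled, shadowed, make_adder) are made up for the illustration:

```r
scale_factor <- 10

# scale_factor is a free variable here: it is found in the enclosing environment.
scaled <- function(x) x * scale_factor

# Assigning inside the function creates a local variable and leaves the outer one alone.
shadowed <- function(x) {
  scale_factor <- 2
  x * scale_factor
}

# A function remembers the environment it was created in.
make_adder <- function(n) function(x) x + n
add2 <- make_adder(2)

r1 <- scaled(3)      # 30
r2 <- shadowed(3)    # 6
r3 <- add2(5)        # 7
```

After both calls scale_factor is still 10.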


Objects


Access (indices count from 1 not from 0)
A[M==2]         # all elems that are == 2

Function definition
a { } block is also an expression, it evaluates to the last statement within

funcname <- function(param,..,defparam=expr) expr
the expression ... may be used for pass-through argument lists

if (expr1) expr2 else expr3
for (var in vector) expr
break,next

switch (
    var,
    key1 = statement,
    key2 = statement)
while (cond) expr
repeat expr # must be broken by break from within

is.null(item) # Method calls
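
The constructs above put together in one sketch. The function describe and its arguments are made up; note how ... passes na.rm through to mean():

```r
describe <- function(x, ..., digits = 2) {
  kind <- if (is.numeric(x)) "number" else "other"   # if/else is an expression
  switch(kind,
         number = round(mean(x, ...), digits),
         other  = NA)
}
avg <- describe(c(1, 2, 4, NA), na.rm = TRUE)         # 2.33

total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next     # skip even numbers
  if (i > 7) break          # leave the loop at 9
  total <- total + i        # adds 1, 3, 5, 7
}

n <- 0
repeat {                    # runs until break
  n <- n + 1
  if (n >= 3) break
}
```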


Useful Functions

Packages update.packages package.contents library/require search
Object creation c vector array matrix data.frame list environment rep seq
Lists/Vectors unlist
Hashes/Environments environment ls get exists
Vectors c vector names
Arrays (Vectors with dim) array aperm dim outer
Matrices (2D-arrays) matrix t crossprod diag cbind rbind solve det eigen svd lsfit dist nrow ncol row col scale cor var cov
Lists list attach detach
Data Frames data.frame names row.names methods as.matrix
Interactive getwd edit
Coding dir mode any all lapply substitute eval table iter length unique as.function as.numeric
Debugging/Optimizing system.time
Regexen grep sub match
Info help apropos/find example search ls/objects methods data library
I/O data source load cat write.table read.table library/require
Math sqrt prod sum cumsum/cumprod density
Visualization heatmap image plot rug boxplot pairs coplot qqplot hist dotchart persp Lowlevel: points lines text axis title legend General Params: par
Stat sd var mean median stem hist qqnorm qqline qqplot ecdf norm (dnorm=density, pnorm=cumulative distribution, qnorm=quantile function, rnorm=simulation)

Libraries

Rcmd INSTALL pkgs # where pkgs is a tar.gz file or dir location
libraries are installed under .Library in the following structure:
mylib                               lib name
|   CONTENTS
|   DESCRIPTION
|   INDEX                           created by Rdindex man > INDEX
|   TITLE                           deprecated, put it in Title: under DESCRIPTION
|   README                          optional
|
+---chtml
|                                   ?
+---help           
|       AnIndex
|       00Titles                    R help files, may be in zip file
|       caha
|       clin2mim ... etc
|
+---html 
|       00Index.html                html help files, may be in zip file
|       caha.html
|       clin2mim.html ... etc
|
+---latex
|       caha.tex                    latex help files, may be in zip file
|       clin2mim.tex ... etc
|
+---Man
|       caha.rd                     R help files in R documentation format, may be in zip file
|       clin2mim.rd  ... etc
|
+---R
|       mylib                       the actual library file with R code
|
\---R-ex
        fetchAvgDiff.R              code examples, may be in zip file
        firstpass.R ...

Environment

Initialisation sequence: Rprofile.site, .Rprofile, .RData, .First()
  1. $R_PROFILE || $R_HOME/etc/Rprofile.site is the site init file
  2. .Rprofile is sourced if
    • R is invoked from the same dir or
    • it's in your home dir
  3. .First() in any of the files executed
Cleanup sequence: .Last()

R and Emacs

To add R to your Emacs, first install R on your machine. On Windows there is a program called Rterm, which provides a command line interface to R.
Then install the Emacs ESS package (if it was not in the default packages), byte compile it like this: (byte-compile-file "d:/Programme/emacs-21.2/lisp/progmodes/perl-mode.el") and tell Emacs to load it at startup in your .emacs file, like this: (load "d:/Programme/emacs-21.2/ess-5.1.24/lisp/ess-site" t)
Now you only have to let Emacs know where to look for the Rterm executable. This is done by adding the path to the executable to your Windows path variable. On Win2000 you can do this via Properties on the My Computer icon.
You start an R-process with M-x R.
You send a buffer region to R with C-c C-r, a function with C-c C-f and the whole buffer with C-c C-b. (memo copy region/function/buffer)