Wednesday, November 11, 2015

R Basics 10 - Avoiding For-Loops

What is wrong with using for-loops? 
Nothing! R's (for-while-repeat) loops are intuitive, and easy to code and maintain. Some tasks are best managed within loops. 

So why discourage the use of for-loops? 
1) Side effects and detritus from inline code. Replacing a loop with a function call means that what happens in the function stays in the function. 
2) Increased speed in some cases (especially with nested loops or poor loop-coding practice). 
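
As a rough illustration of point 1 (a sketch only; the names loop.tmp and fn.tmp are made up for this example), a loop leaves its index and temporaries behind in the workspace, while the same work done inside an anonymous function leaves nothing:

```r
# Loop version: the index 'i' and the temporary 'loop.tmp' survive the loop.
squares.loop <- numeric(3)
for (i in 1:3) {
  loop.tmp <- i^2
  squares.loop[i] <- loop.tmp
}
loop.leftovers <- exists("i") && exists("loop.tmp")  # both are still defined

# Function version: 'fn.tmp' lives only inside the anonymous function.
squares.fn <- sapply(1:3, function(k) {
  fn.tmp <- k^2
  fn.tmp
})
fn.leftovers <- exists("fn.tmp")  # nothing leaked into the workspace
```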

How to make the paradigm shift? 
1) Use R's vectorization features. 
2) See if object indexing and subset assignment can replace the for-loop. 
3) If not, find an "apply" function that slices your object the way you need. 
4) Find (or write) a function to do what you would have done in the body of the for-loop. Anonymous functions can be very useful for this task. 
5) If all else fails, move as much code as possible outside of the loop body. 
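
Point 5 might look like the following sketch (the data and variable names here are invented for illustration, not taken from the play data). It standardises a vector three ways: a wasteful loop, a tidier loop with the invariant work hoisted out, and the vectorised form that makes the loop unnecessary.

```r
set.seed(1)
x <- runif(1000)                 # hypothetical data

# Wasteful loop: mean(x) and sd(x) are recomputed on every pass,
# and the result vector is grown one element at a time.
z.slow <- c()
for (i in seq_along(x)) {
  z.slow <- c(z.slow, (x[i] - mean(x)) / sd(x))
}

# Better loop: invariants computed once, result preallocated.
x.mean <- mean(x)
x.sd <- sd(x)
z.fast <- numeric(length(x))
for (i in seq_along(x)) {
  z.fast[i] <- (x[i] - x.mean) / x.sd
}

# Vectorised form: no loop at all.
z.vec <- (x - mean(x)) / sd(x)
```

All three give the same values; only the amount of repeated work differs.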

Play data (for the examples following) 
require('zoo') 
require('plyr') 
n <- 100 
u <- 1:n 
v <- rnorm(n, 10, 10) + 1:n 
w <- round(runif(n, 0.6, 9.4)) 
df <- data.frame(month=u, x=u, y=v, z=w) 
l <- list(x = u, y = v, z = w, yz = v*w, xyz = u*v*w) 
trivial.add <- function(a, b) { a + b } 

Use R's vectorization features 
tot <- sum(log(u)) 
# replaces the loop: 
# tot <- 0; for(i in seq_along(u)) { tot <- tot + log(u[i]) } 

Clever indexing and subset assignment 
df[df$z == 5, 'y'] <- -1 

The base apply family of functions
# Letter codes as used in plyr function names (e.g. ddply, covered below):
# l stands for list
# a stands for array
# d stands for data.frame
# m is a special input type, which means that we provide multiple arguments in a tabular format for the function
# r is an input type that expects an integer, which specifies the number of times the function is replicated
# _ is a special output type that does not return anything for the function
apply(X, MARGIN, FUN, ...) 
lapply(X, FUN, ...) 
sapply(X, FUN, ...) 
vapply(X, FUN, FUN.VALUE, ...) 
tapply(X, INDEX, FUN = NULL, ...) 
mapply(FUN, ..., MoreArgs = NULL) 
eapply(env, FUN, ...) 
replicate(n, expr, simplify = "array") 
by(data, INDICES, FUN, ...) 
aggregate(x, by, FUN, ...) 
rapply(object, f, ...)
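
Three of the functions listed above get no example later on, so here is a small sketch of each. The inputs are made up for illustration and are not part of the play data.

```r
# vapply: like sapply, but the type and length of each result is declared
# up front, so an unexpected result raises an error instead of passing silently.
lens <- vapply(list(a = 1:3, b = 1:5), length, FUN.VALUE = integer(1))

# mapply: walks several vectors in parallel, here adding them pairwise.
sums <- mapply(function(a, b) a + b, 1:3, 4:6)

# replicate: evaluates an expression n times, handy for small simulations.
set.seed(42)
draws <- replicate(4, mean(runif(10)))
```

Here lens holds the two lengths (3 and 5, named a and b), sums holds 5, 7 and 9, and draws holds four simulated means.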

apply (by row/column on two+ dim object) 
# Object: m, t, df, a (has 2+ dimensions) 
# Returns: v, l, m (depends on input & fn) 
column.mean <- apply(df, 2, mean) 
row.product <- apply(df, 1, prod) 

lapply (on vector or list, return list) 
lapply(l, mean) 
unlist(lapply(u, trivial.add, 5)) 

sapply (a simplified lapply on v or l) 
# Object: v, l; # Returns: usually a vector 
sapply(l, mean) 
sapply(u, function(a) a*a) 
sapply(u, trivial.add, -1) 

sapply and lapply work in a similar way, traversing a set of data such as a list or vector and calling the specified function for each item.

Sometimes we need to traverse our data in a less than linear way. Say we wanted to compare the current observation with the value 5 periods before it. You can probably use rollapply for this (via quantmod), but a quick and dirty way is to run sapply or lapply over a set of index values.

Here we will use sapply, which works on a list or vector of data.

sapply(1:3, function(x) x^2)
#[1] 1 4 9

lapply is very similar, however it will return a list rather than a vector:

lapply(1:3, function(x) x^2)
#[[1]]
#[1] 1
#
#[[2]]
#[1] 4
#
#[[3]]
#[1] 9

Passing simplify=FALSE to sapply will also give you a list:

sapply(1:3, function(x) x^2, simplify=FALSE)
#[[1]]
#[1] 1
#
#[[2]]
#[1] 4
#
#[[3]]
#[1] 9

And you can use unlist with lapply to get a vector.

unlist(lapply(1:3, function(x) x^2))
#[1] 1 4 9
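
The index-passing idea mentioned earlier (comparing each observation with the value 5 periods before it) could be sketched as follows. The series obs is hypothetical, generated only for this example.

```r
set.seed(1)
obs <- cumsum(rnorm(20))          # hypothetical series of 20 observations
lag <- 5

# Pass index values rather than the data itself, so the function
# can look back 'lag' positions from each index.
idx <- (lag + 1):length(obs)
diff5 <- sapply(idx, function(i) obs[i] - obs[i - lag])

# The same result with plain vector indexing, or with base R's diff().
diff5.vec <- obs[idx] - obs[idx - lag]
diff5.base <- diff(obs, lag = lag)
```

The sapply version is the "quick and dirty" route; for a simple lagged difference the vectorised forms do the same job without calling a function per element.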

tapply (group v/l by factor & apply fn) 
count.table <- tapply(v, w, length) 
min.1 <- with(df, tapply(y, z, min)) 
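
To make clear what tapply replaces, here is a small sketch on made-up values (vals and grp are hypothetical names), showing a grouped minimum computed first with a loop and then with tapply:

```r
vals <- c(10, 20, 30, 40, 50, 60)
grp  <- c("a", "b", "a", "b", "a", "c")

# Loop version: visit each group and subset by hand.
grp.min.loop <- c()
for (g in unique(grp)) {
  grp.min.loop[g] <- min(vals[grp == g])
}

# tapply version: split vals by grp and apply min to each piece.
grp.min <- tapply(vals, grp, min)
```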

by (on l or v, returns "by" objects) 
min.2 <- by(df$y, df$z, min) 
min.3 <- by(df[, c('x', 'y')], df$z, min) 
# last one: finds min from two columns 

aggregate 
ag <- aggregate(df, by=list(df$z), mean) 
aggregate(df, by=list(w, 1+(u%%12)), mean) 
# Trap: variables must be in a list 

rollapply - from the zoo package 
# A 5-term, centred, rolling average 
v.maz <- rollapply(v, 5, mean, fill = NA) 
# Sum 3 months data for a quarterly total 
v.qtrly <- rollapply(v, 3, sum, fill=NA, align='right') 
# Note: zoo has rollmean(), rollmax() and rollmedian() functions 

inside a data.frame 
# Use transform() or within() to apply a function to a column in a data.frame 
df <- within(df, y.qtrly <- rollapply(y, 3, sum, fill=NA, align='right')) 
# use with() to simplify column access 

The plyr package 
Plyr is a fantastic family of apply-like functions with a common naming system for the input-to and output-from split-apply-combine procedures. I use ddply() the most. 
ddply(df, .(x), summarise, min=min(y), max=max(y)) 
ddply(df, .(x), transform, span = x - y) 

Other packages worth looking at 
# foreach - a set of apply-like fns 
# snow - parallelised apply-like fns 
# snowfall - a usability wrapper for snow 

Abbreviation
v=vector l=list m=matrix df=data.frame a=array t=table f=factor d=dates

