My DS Coding Blog: R Basics 10 - Avoiding For-Loops

What is wrong with using for-loops?

Nothing! R's loops (for, while, repeat) are intuitive and easy to code and maintain. Some tasks are best managed within loops.

So why discourage the use of for-loops?

1) Side effects and detritus from inline code. Replacing a loop with a function call means that what happens in the function stays in the function.

2) In some cases, increased speed (especially with nested loops and with poor loop-coding practice).
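As a rough illustration of the first point, the sketch below compares what a loop leaves behind in the workspace with what a function call leaves behind. The names square.all, tmp and leaked are made up for this example.

```r
# Inline loop: the counter and the temporary both survive in the workspace
for (i in 1:3) {
  tmp <- i^2
}
leaked <- c(exists("i"), exists("tmp"))   # both TRUE after the loop

# Same work inside a function: its local variables vanish when the call returns
square.all <- function(n) {               # hypothetical helper name
  vapply(seq_len(n), function(k) k^2, numeric(1))
}
squares <- square.all(3)
```

After the loop, i and tmp are still sitting in the calling environment; after the function call, only the result you asked for is.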

How to make the paradigm shift?

1) Use R's vectorization features.

2) See if object indexing and subset assignment can replace the for-loop.

3) If not, find an "apply" function that slices your object the way you need.

4) Find (or write) a function to do what you would have done in the body of the for-loop. Anonymous functions can be very useful for this task.

5) If all else fails, move as much code as possible outside of the loop body.
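For the last step, here is a minimal sketch of what moving code out of the loop body can look like. The data and the names slow, fast, x.mean and x.sd are assumptions made for this example only.

```r
set.seed(1)
x <- runif(1000)

# Poor practice: grows the result and recomputes two constants on every pass
slow <- c()
for (i in seq_along(x)) {
  slow <- c(slow, (x[i] - mean(x)) / sd(x))
}

# Better: compute the invariants once and preallocate the result
x.mean <- mean(x)
x.sd   <- sd(x)
fast   <- numeric(length(x))
for (i in seq_along(x)) {
  fast[i] <- (x[i] - x.mean) / x.sd
}
```

Both loops give the same answer. In this particular case the fully vectorised (x - mean(x)) / sd(x) would be the first thing to try, which is step 1 above.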

Play data (for the examples following)

require('zoo')

require('plyr')

n <- 100

u <- 1:n

v <- rnorm(n, 10, 10) + 1:n

w <- round(runif(n, 0.6, 9.4))

df <- data.frame(month=u, x=u, y=v, z=w)

l <- list(x = u, y = v, z = w, yz = v*w, xyz = u*v*w)

trivial.add <- function(a, b) { a + b }

Use R's vectorization features

tot <- sum(log(u))

# versus the loop equivalent:
tot <- 0
for (i in seq_along(u)) { tot <- tot + log(u[i]) }

Clever indexing and subset assignment

df[df$z == 5, 'y'] <- -1
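To make the comparison concrete, the sketch below puts the loop that this kind of subset assignment replaces next to the indexed version. It uses a small made-up data frame (df.demo) rather than the play data, so the values are easy to check by eye.

```r
df.demo <- data.frame(y = c(10, 20, 30, 40), z = c(5, 2, 5, 7))  # made-up values

# Loop version: test each row and overwrite y where z equals 5
looped <- df.demo
for (i in seq_len(nrow(looped))) {
  if (looped$z[i] == 5) looped$y[i] <- -1
}

# Indexed version: one subset assignment does the same job
indexed <- df.demo
indexed[indexed$z == 5, 'y'] <- -1
```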

The base apply family of functions
# The letters below follow the plyr naming convention (first letter = input type, second letter = output type)
# l stands for list
# a stands for array
# d stands for data.frame
# m is a special input type, which means that we provide multiple arguments in a tabular format for the function
# r is a special input type that expects an integer, which specifies the number of times the function is replicated
# _ is a special output type that does not return anything for the function

apply(X, MARGIN, FUN, ...)

lapply(X, FUN, ...)

sapply(X, FUN, ...)

vapply(X, FUN, FUN.VALUE, ...)

tapply(X, INDEX, FUN = NULL, ...)

mapply(FUN, ..., MoreArgs = NULL)

eapply(env, FUN, ...)

replicate(n, expr, simplify = "array")

by(data, INDICES, FUN, ...)

aggregate(x, by, FUN, ...)

rapply(object, f, ...)
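Three of the functions listed above (vapply, mapply and replicate) do not get their own example later on, so here is a short sketch of each. The list parts and its values are made up for illustration.

```r
parts <- list(a = 1:3, b = c(2.5, 4), c = 10)   # made-up list

# vapply: like sapply, but the type and length of each result are declared up front
means <- vapply(parts, mean, FUN.VALUE = numeric(1))

# mapply: walks several vectors in parallel, taking one element from each per call
sums <- mapply(function(a, b) a + b, 1:3, c(10, 20, 30))

# replicate: evaluates an expression n times and collects the results
draws <- replicate(4, mean(c(1, 2, 3)))
```

vapply stops with an error if any result does not match FUN.VALUE, which makes it a safer choice than sapply inside functions and scripts.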

apply (by row/column on two+ dim object)

# Object: m, t, df, a (has 2+ dimensions)

# Returns: v, l, m (depends on input & fn)

column.mean <- apply(df, 2, mean)

row.product <- apply(df, 1, prod)

lapply (on vector or list, returns list)

lapply(l, mean)

unlist(lapply(u, trivial.add, 5))

sapply (a simplified lapply on v or l)

# Object: v, l; # Returns: usually a vector

sapply(l, mean)

sapply(u, function(a) a*a)

sapply(u, trivial.add, -1)

sapply and lapply work in a similar way: each traverses a set of data, such as a list or vector, and calls the specified function for each item.

Sometimes we need to traverse our data in a less than linear way. Say we wanted to compare the current observation with the value 5 periods before it. You can probably use rollapply for this (from zoo, also available via quantmod), but a quick and dirty way is to run sapply or lapply over a set of index values.

Here we will use sapply, which works on a list or vector of data.

sapply(1:3, function(x) x^2)
#[1] 1 4 9

lapply is very similar, however it will return a list rather than a vector:

lapply(1:3, function(x) x^2)
#[[1]]
#[1] 1
#
#[[2]]
#[1] 4
#
#[[3]]
#[1] 9

Passing simplify=FALSE to sapply will also give you a list:

sapply(1:3, function(x) x^2, simplify=FALSE)
#[[1]]
#[1] 1
#
#[[2]]
#[1] 4
#
#[[3]]
#[1] 9

And you can use unlist with lapply to get a vector.

unlist(lapply(1:3, function(x) x^2))
#[1] 1 4 9
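Coming back to the problem mentioned earlier, comparing each observation with the value 5 periods before it, a minimal sketch of the index-passing trick might look like this. The series obs and the names k, idx and diff.5 are made up for the example.

```r
obs <- c(3, 8, 1, 9, 4, 7, 2, 6, 5, 10)   # made-up series
k   <- 5                                   # how many periods to look back

# Traverse index values rather than the data, so each call can look backwards
idx    <- (k + 1):length(obs)
diff.5 <- sapply(idx, function(i) obs[i] - obs[i - k])

# Vectorised equivalent, for comparison
diff.5.vec <- obs[idx] - obs[idx - k]
```

The first k observations have nothing to compare against, which is why the index starts at k + 1. For a plain difference like this one, base R's diff(obs, lag = 5) gives the same result; the sapply form earns its keep when the comparison is more involved.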

tapply (group v/l by factor & apply fn)

count.table <- tapply(v, w, length)

min.1 <- with(df, tapply(y, z, min))

by (on l or v, returns "by" objects)

min.2 <- by(df$y, df$z, min)

min.3 <- by(df[, c('x', 'y')], df$z, min)

# last one: finds min from two columns

aggregate

ag <- aggregate(df, by=list(df$z), mean)

aggregate(df, by=list(w, 1+(u%%12)), mean)

# Trap: variables must be in a list

rollapply - from the zoo package

# A 5-term, centred, rolling average

v.maz <- rollapply(v, 5, mean, fill = NA)

# Sum 3 months data for a quarterly total

v.qtrly <- rollapply(v, 3, sum, fill=NA, align='right')

# Note: zoo has rollmean(), rollmax() and rollmedian() functions

inside a data.frame

# Use transform() or within() to apply a function to a column in a data.frame

df <- within(df, v.qtrly <- rollapply(y, 3, sum, fill=NA, align='right'))

# Use with() to simplify column access
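The three helpers mentioned here are easy to mix up, so this is a small side-by-side sketch using base R only. The data frame df.demo and the column name ratio are made up for the example.

```r
df.demo <- data.frame(x = 1:4, y = c(2, 4, 6, 8))   # made-up values

# transform(): add or replace columns, referring to existing columns by bare name
df.t <- transform(df.demo, ratio = y / x)

# within(): the same idea, written as ordinary assignments
df.w <- within(df.demo, ratio <- y / x)

# with(): evaluate an expression against the columns and return its value
total <- with(df.demo, sum(x * y))
```

transform() and within() return a modified copy of the data frame, while with() returns whatever the expression evaluates to.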

The plyr package

plyr is a fantastic family of apply-like functions with a common naming system for the input-to and output-from split-apply-combine procedures. I use ddply() the most.

ddply(df, .(x), summarise, min=min(y), max=max(y))

ddply(df, .(x), transform, span = x - y)

Other packages worth looking at

# foreach - a set of apply-like fns

# snow - parallelised apply-like fns

# snowfall - a usability wrapper for snow

Abbreviations

v=vector l=list m=matrix df=data.frame a=array t=table f=factor d=dates

Wednesday, November 11, 2015
