Thursday, November 5, 2015

R Basics 7 - Factors

Factors
- A one-dimensional array of categorical (unordered) or ordinal (ordered) data.
- Indexed from 1 to N. Not fixed length.
- Named factors are possible (but rare).
- The hidden/unexpected coercion of objects to factors (and of factors to their integer codes) is a key source of bugs.
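
A minimal sketch of two of the points above: names carried on a factor, and a factor growing past its original length. The data and the names first/second/third are made up for illustration.

```r
# A named factor: names are carried over from the input vector
nf <- factor(c(first = "M", second = "F", third = "F"))
nf.names <- names(nf)      # "first" "second" "third"

# Index by name, as with a named vector; the result is still a factor
by.name <- nf["second"]

# Length is not fixed: assigning one past the end grows the factor
nf[length(nf) + 1] <- "M"
new.length <- length(nf)   # 4
```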

Why use Factors
- Specifying a non-alphabetical order
- Some statistical functions treat categorical/ordinal data differently from continuous data.
- A lot of ggplot2 code depends on factors.
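
To make the first point concrete, here is a small sketch (the survey responses are invented) showing that character data tabulates alphabetically, while a factor tabulates in the level order we give it.

```r
# Hypothetical survey responses (illustrative data only)
resp <- c("high", "low", "medium", "low", "high", "low")

# Plain character data is tabulated in alphabetical order
alpha.counts <- table(resp)

# A factor with explicit levels is tabulated in the order we chose
resp.f <- factor(resp, levels = c("low", "medium", "high"))
level.counts <- table(resp.f)

names(alpha.counts)   # "high" "low" "medium"
names(level.counts)   # "low" "medium" "high"
```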

Create 
Example 1 - unordered
> sex.v <- c('M', 'F', 'F', 'M', 'F'); sex.v
[1] "M" "F" "F" "M" "F"
> sex.f <- factor(sex.v); sex.f
[1] M F F M F
Levels: F M
> sex.w <- as.character(sex.f); sex.w
[1] "M" "F" "F" "M" "F"

Example 2 - ordered (small, medium, large)
> size.v <- c('S','L', 'M', 'L', 'S', 'M'); size.v
[1] "S" "L" "M" "L" "S" "M"
> size1.f <- factor(size.v, ordered = TRUE); size1.f
[1] S L M L S M
Levels: L < M < S

Example 3 - we set the level order (the factor is still unordered; add ordered = TRUE to make it ordinal)
> size.lvls <- c('S','M','L')
> size2.f <- factor(size.v, levels=size.lvls); size2.f
[1] S L M L S M
Levels: S M L
 
Example 4 - levels and labels
> levels <- c(1,2,3,99)
> labels <- c('love', 'neutral', 'hate', NA); labels
[1] "love"    "neutral" "hate"    NA       
> data.v <- c(1,2,3,99,1,2,1,2,99); data.v
[1]  1  2  3 99  1  2  1  2 99
> data.f <- factor(data.v, levels=levels, labels=labels); data.f
[1] love    neutral hate    <NA>    love    neutral love    neutral <NA>   
Levels: love neutral hate <NA>

Example 5 - using the cut function to group
> i <- 1:50 + rnorm(50,0,5)
> k <- cut(i,5); k
 [1] (-0.859,10.2] (-0.859,10.2] (-0.859,10.2] (-0.859,10.2] (-0.859,10.2] (-0.859,10.2] (-0.859,10.2] (10.2,21.1]  
 [9] (-0.859,10.2] (10.2,21.1]   (-0.859,10.2] (-0.859,10.2] (-0.859,10.2] (10.2,21.1]   (-0.859,10.2] (10.2,21.1]  
[17] (10.2,21.1]   (21.1,32.1]   (21.1,32.1]   (10.2,21.1]   (-0.859,10.2] (10.2,21.1]   (10.2,21.1]   (21.1,32.1]  
[25] (21.1,32.1]   (10.2,21.1]   (32.1,43]     (21.1,32.1]   (21.1,32.1]   (21.1,32.1]   (32.1,43]     (32.1,43]    
[33] (32.1,43]     (21.1,32.1]   (21.1,32.1]   (32.1,43]     (32.1,43]     (43,54.1]     (32.1,43]     (43,54.1]    
[41] (32.1,43]     (32.1,43]     (32.1,43]     (43,54.1]     (43,54.1]     (43,54.1]     (43,54.1]     (43,54.1]    
[49] (43,54.1]     (43,54.1]    
Levels: (-0.859,10.2] (10.2,21.1] (21.1,32.1] (32.1,43] (43,54.1]
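
Example 5 lets cut choose five equal-width bins. As a complementary sketch, cut can also take explicit break points and labels. The ages and the three group names below are invented for illustration.

```r
# Illustrative values, not taken from the examples above
ages <- c(3, 17, 25, 64, 80)

# Intervals are (0,18], (18,65], (65,Inf]; labels name them in order
age.f <- cut(ages,
             breaks = c(0, 18, 65, Inf),
             labels = c("child", "adult", "senior"),
             ordered_result = TRUE)   # return an ordered factor

as.character(age.f)   # "child" "child" "adult" "adult" "senior"
is.ordered(age.f)     # TRUE
```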

Basic information about a factor
> f <- factor(rep(1:4, 6), levels = 4:1)
> dim(f)
NULL
> is.factor(f)
[1] TRUE
> is.atomic(f)
[1] TRUE
> is.vector(f)
[1] FALSE
> is.list(f)
[1] FALSE
> is.recursive(f)
[1] FALSE
> length(f)
[1] 24
> names(f)
NULL
> mode(f)
[1] "numeric"
> class(f)
[1] "factor"
> typeof(f)
[1] "integer"
> is.ordered(f)
[1] FALSE
> unclass(f)
 [1] 4 3 2 1 4 3 2 1 4 3 2 1 4 3 2 1 4 3 2 1 4 3 2 1
attr(,"levels")
[1] "4" "3" "2" "1"
> cat(f)
4 3 2 1 4 3 2 1 4 3 2 1 4 3 2 1 4 3 2 1 4 3 2 1
> print(f)
 [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
Levels: 4 3 2 1
> str(f)
 Factor w/ 4 levels "4","3","2","1": 4 3 2 1 4 3 2 1 4 3 ...
> dput(f)
structure(c(4L, 3L, 2L, 1L, 4L, 3L, 2L, 1L, 4L, 3L, 2L, 1L, 4L, 
3L, 2L, 1L, 4L, 3L, 2L, 1L, 4L, 3L, 2L, 1L), .Label = c("4", 
"3", "2", "1"), class = "factor")
> head(f)
[1] 1 2 3 4 1 2
Levels: 4 3 2 1

Indexing: much like atomic vectors
- [x] selects a factor for the cell/range x (all levels are kept)
- [[x]] selects a length=1 factor from the single cell index x (rarely used)
- The $ operator is invalid with factors
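
A short sketch of the three indexing rules, using a made-up size factor. The variable names are mine.

```r
size.f <- factor(c("S", "L", "M", "L"), levels = c("S", "M", "L"))

# [ ] returns a factor and keeps every level, used or not
sub <- size.f[2:3]
levels(sub)                          # "S" "M" "L"

# drop = TRUE discards the levels that no longer occur
dropped <- size.f[2, drop = TRUE]
levels(dropped)                      # "L"

# [[ ]] returns a length-1 factor
one <- size.f[[1]]

# $ is an error on a factor (it is an atomic vector, not a list)
dollar.try <- tryCatch(size.f$S, error = function(e) "error")
```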

Factor arithmetic & Boolean comparisons
- factors cannot be added, multiplied, etc.
- same-type factors are equality testable
> x <- sex.f[1] == sex.f[2];x
[1] FALSE
- ordered factors can be order compared
> z <- size1.f[1] < size1.f[2]; z
[1] FALSE
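
A sketch of what these rules look like in practice. Note that arithmetic on a factor does not stop with an error: R returns NA and issues a warning, which is suppressed here to keep the output quiet. The data is made up.

```r
sex.f  <- factor(c("M", "F", "F"))
size.o <- factor(c("S", "L", "M"), levels = c("S", "M", "L"), ordered = TRUE)

# Arithmetic on a factor gives NA plus a warning
sum.try <- suppressWarnings(sex.f + 1)                # NA NA NA

# Equality against a level label works
is.female <- sex.f == "F"                             # FALSE TRUE TRUE

# Order comparisons work on ordered factors
below.large <- size.o < "L"                           # TRUE FALSE TRUE
largest <- max(size.o)                                # L

# On an unordered factor, < also gives NA plus a warning
unord.try <- suppressWarnings(sex.f[1] < sex.f[2])    # NA
```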

Managing the enumeration (levels)
> f <- factor(letters[1:3]);f
[1] a b c
Levels: a b c
> levels(f)
[1] "a" "b" "c"
> levels(f)[1]
[1] "a"
> any(levels(f) %in% c('a', 'b', 'c'))
[1] TRUE
# add new levels
> levels(f)[length(levels(f))+1] <-'z'; f
[1] a b c
Levels: a b c z
> levels(f) <- c(levels(f), 'AA');f
[1] a b c
Levels: a b c z AA
# change/rename levels
> levels(f)[1] <- 'xx';f
[1] xx b  c
Levels: xx b c z AA
> levels(f)[levels(f) %in% 'AA'] <- 'BB';f
[1] xx b  c
Levels: xx b c z BB
# reorder levels
> f <- factor(f, levels(f)[c(4,1:3,5)]);f
[1] xx b  c
Levels: z xx b c BB
# delete (drop) unused levels; droplevels(f) does the same
> f <- f[drop=TRUE]
> f
[1] xx b  c
Levels: xx b c

Adding an element to a factor
> f <- factor(letters[1:10]); f
 [1] a b c d e f g h i j
Levels: a b c d e f g h i j
> f[length(f) + 1] <- 'a'; f
 [1] a b c d e f g h i j a
Levels: a b c d e f g h i j
> f <- factor(c(as.character(f), 'zz')); f
 [1] a  b  c  d  e  f  g  h  i  j  a  zz
Levels: a b c d e f g h i j zz

Merging/combining factors
> a <- factor(1:10);a
 [1] 1  2  3  4  5  6  7  8  9  10
Levels: 1 2 3 4 5 6 7 8 9 10
> b <- factor(letters[a]);b
 [1] a b c d e f g h i j
Levels: a b c d e f g h i j
> union <- factor(c(as.character(a), as.character(b))); union
 [1] 1  2  3  4  5  6  7  8  9  10 a  b  c  d  e  f  g  h  i  j
Levels: 1 10 2 3 4 5 6 7 8 9 a b c d e f g h i j
> cross <- interaction(a,b); cross
 [1] 1.a  2.b  3.c  4.d  5.e  6.f  7.g  8.h  9.i  10.j
100 Levels: 1.a 2.a 3.a 4.a 5.a 6.a 7.a 8.a 9.a 10.a 1.b 2.b 3.b 4.b 5.b 6.b 7.b 8.b 9.b 10.b 1.c 2.c 3.c 4.c 5.c ... 10.j
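
Using factors within data frames
(FUN stands for a summary function such as mean or median; f is a factor column and x a numeric column of df)
df$f <- reorder(df$f, df$x, FUN, order=TRUE)
by(df$x, df$f, FUN)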
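
A worked sketch of the reorder idea on a tiny, invented data frame. The column names f and x follow the snippet above, mean is my choice of summary function, and tapply is used for the per-level summary as a close relative of by.

```r
# Illustrative data frame (values are made up)
df <- data.frame(
  f = factor(c("a", "b", "c", "a", "b", "c")),
  x = c(5, 1, 3, 7, 2, 4)
)

# Reorder the levels of f by the mean of x within each level
# (means: a = 6, b = 1.5, c = 3.5, so the new order is b, c, a)
df$f <- reorder(df$f, df$x, FUN = mean)
new.levels <- levels(df$f)               # "b" "c" "a"

# Summarise x within each level of f, in the new level order
group.means <- tapply(df$x, df$f, mean)
```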

Traps
1 Strings loaded from a file are converted to factors (the default before R 4.0.0; pass stringsAsFactors=FALSE to read.table or read.csv)
2 Numbers from a file are factorised. Recover them with as.numeric(levels(f))[as.integer(f)]; plain as.numeric(f) returns the level codes
3 One factor (enumeration) cannot be meaningfully compared with another that has different levels
4 NAs (missing data) in factors and levels can cause problems
5 Adding a row to a data frame can add a new level to a factor column
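
A sketch of traps 1, 2 and 5 on small, invented inputs. stringsAsFactors is set explicitly in every call so the behaviour does not depend on the R version's default.

```r
# Trap 1: strings read from a file may or may not become factors
csv.text <- "id,name\n1,ann\n2,bob"
df.chr <- read.csv(text = csv.text, stringsAsFactors = FALSE)
df.fac <- read.csv(text = csv.text, stringsAsFactors = TRUE)
class(df.chr$name)   # "character"
class(df.fac$name)   # "factor"

# Trap 2: numbers stored as a factor
num.f <- factor(c("10", "20", "10", "5"))
wrong <- as.numeric(num.f)                              # codes: 1 2 1 3
right <- as.numeric(levels(num.f))[as.integer(num.f)]   # 10 20 10 5

# Trap 5: rbind-ing a row with an unseen value extends the levels
d  <- data.frame(g = factor(c("a", "b")))
d2 <- rbind(d, data.frame(g = "zz", stringsAsFactors = FALSE))
levels(d2$g)

# Assigning an unseen value directly gives NA (with a warning) instead
f2 <- factor(c("a", "b"))
suppressWarnings(f2[3] <- "zz")
is.na(f2[3])         # TRUE
```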
