Class06: R Functions

Author

Jervic Aquino (PID:A17756721)

Background

Functions are at the heart of using R. Everything we do involves calling functions, from data input and analysis to results output.

All functions in R have at least three things:

  1. A name: the thing we use to call the function
  2. Input arguments (zero or more) that are comma separated
  3. The body: lines of code between curly brackets { } that do the work of the function
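
As a minimal sketch of these three parts, consider the function below. The name `square` is hypothetical and used only for illustration; `formals()` and `body()` are base R helpers for inspecting a function.

```r
# name: square; input argument: x; body: the code between { }
square <- function(x) {
  x^2
}

formals(square)  # the argument list
body(square)     # the code that does the work
square(4)        # 16
```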

A first function

Let’s write a function that adds some numbers:

add <- function(x) {
  x + 1
}

Let’s try it out:

add(100)
[1] 101

Will this work?

add(c(100, 200, 300))
[1] 101 201 301

Let’s modify the function to be more useful and add more than just 1:

add <- function(x, y=1) {
  x + y
}
add(100, 10)
[1] 110

Will this still work?

add(100)
[1] 101

Yes. Because y has a default value of 1, add(100) still works and returns 101. Without that default (that is, with function(x, y)), the same call would fail because y would be missing.
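
The role of the default value can be checked directly. In this sketch, `add_no_default` is a hypothetical variant written only for comparison, and `tryCatch()` captures the error message so the script keeps running.

```r
# Hypothetical variant: y has no default value
add_no_default <- function(x, y) {
  x + y
}

# Calling with one argument fails because y is missing
result <- tryCatch(add_no_default(100), error = function(e) conditionMessage(e))
result

# With a default for y, the one-argument call works
add <- function(x, y = 1) {
  x + y
}
add(100)
```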

How to pass arguments:

plot(1:10, col="blue", type="b")

log(10, base=10)
[1] 1

N.B. Input arguments can be either required or optional. The latter have a fall-back default that is specified in the function code with an equals sign.
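
A short sketch of how arguments are matched, using `log()` from base R. Arguments can be given by position or by name, and an omitted optional argument falls back to its default.

```r
# Positional matching: the first value is x, the second is base
log(100, 10)

# Named matching: the order no longer matters
log(base = 10, x = 100)

# Omitting the optional argument uses the default, base = exp(1)
log(100)

# args() shows which arguments have defaults
args(log)
```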

# add(100, 200, 300)  # fails: add() accepts at most two arguments
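
The three-argument call is commented out because it stops with an error. One possible way to accept any number of inputs, sketched here under the hypothetical name `add_all`, is the `...` argument combined with `sum()`.

```r
add <- function(x, y = 1) {
  x + y
}

# The third value has no matching argument, so R raises an error
msg <- tryCatch(add(100, 200, 300), error = function(e) conditionMessage(e))
msg

# Hypothetical variant: ... collects every input and sum() adds them up
add_all <- function(...) {
  sum(...)
}
add_all(100, 200, 300)
```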

A second function

All functions in R look like this

name <- function(arg) {
  body
}

The sample() function in R randomly picks items from the input vector. By default it samples without replacement, so no item is picked twice.

sample(1:10, size=4)
[1] 5 6 7 2
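
Because sample() is random, the output changes on every run. If a reproducible result is wanted, one option is to fix the random seed first with `set.seed()`. The seed value 42 below is arbitrary.

```r
set.seed(42)
first <- sample(1:10, size = 4)

set.seed(42)
second <- sample(1:10, size = 4)

# Same seed, same draw
identical(first, second)
```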

Q. Return 12 numbers picked randomly from the input 1:10

sample(1:10, size=12, replace=TRUE)
 [1]  4  6  7  1  9  7  2  6  3  2 10 10
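
The `replace=TRUE` argument is what makes this possible. A sketch of what happens with and without it, using `tryCatch()` to capture the error message:

```r
# Without replacement we cannot draw 12 items from only 10
msg <- tryCatch(sample(1:10, size = 12), error = function(e) conditionMessage(e))
msg

# With replacement, values may repeat, so any size is allowed
draw <- sample(1:10, size = 12, replace = TRUE)
length(draw)
```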

Q. Write the code to generate a random 12 nucleotide long DNA sequence

sample(c("a", "c", "g", "t"), size=12, replace=TRUE)
 [1] "t" "t" "c" "c" "c" "g" "g" "t" "t" "g" "g" "t"

Another way to write the code:

bases <- c("A", "T", "G", "C")
sample(bases, size=12, replace=TRUE)
 [1] "G" "A" "T" "C" "C" "G" "G" "C" "T" "A" "A" "T"

Q. Write a first version function called generate_dna() that generates a user specified length n, random DNA sequence

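As a first step, here is working code outside a function, with n chosen at random:

n <- sample(4:30, size=1)
dna <- sample(bases, size = n, replace = TRUE)
dna
 [1] "G" "G" "T" "A" "T" "T" "T" "T" "A" "T" "G" "T" "A" "G" "T" "A" "G" "G" "C"
[20] "A" "G" "C" "T" "G"

Wrapping the same code in a function lets the user supply n:

generate_dna <- function(n=6) {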
  bases <- c("A", "T", "G", "C") 
  sample(bases, size=n, replace=TRUE)
}
generate_dna(100)
  [1] "A" "A" "G" "G" "T" "G" "T" "T" "C" "T" "T" "A" "G" "C" "C" "T" "A" "T"
 [19] "C" "G" "A" "T" "A" "C" "A" "G" "T" "C" "T" "C" "C" "G" "A" "G" "G" "G"
 [37] "T" "A" "A" "A" "G" "G" "G" "A" "A" "G" "G" "A" "G" "T" "T" "G" "C" "G"
 [55] "A" "A" "A" "A" "A" "C" "A" "A" "A" "T" "C" "T" "C" "C" "G" "T" "C" "G"
 [73] "A" "G" "T" "A" "T" "A" "G" "C" "A" "A" "A" "G" "A" "T" "C" "T" "C" "A"
 [91] "A" "G" "A" "G" "G" "A" "C" "T" "G" "T"

Q. Modify your function to return a FASTA-like sequence, so rather than [1] “G” “C” “A” “A” “T” we want “GCAAT”

generate_dna <- function(n=6) {
  bases <- c("A", "T", "G", "C") 
  sequence <- sample(bases, size=n, replace=TRUE)
  sequence <- paste(sequence, collapse="")
  return(sequence)
}
generate_dna(10)
[1] "GGAATGAAGT"
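
The key step is paste() with the `collapse` argument. A small sketch of how `collapse` differs from `sep`, using a fixed vector so the result is predictable:

```r
x <- c("G", "C", "A", "A", "T")

# collapse joins all elements into a single string
joined <- paste(x, collapse = "")
joined

# sep only separates the arguments of paste(), so x stays five elements long
unjoined <- paste(x, sep = "")
length(unjoined)

nchar(joined)
```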

Q. Give the user the option to return either a FASTA-format sequence or the standard multi-element vector

generate_dna <- function(n=6, fasta=TRUE) {
  bases <- c("A", "T", "G", "C")
  sequence <- sample(bases, size=n, replace=TRUE)

  if(fasta) {
    sequence <- paste(sequence, collapse="")
    cat("Hello...")
  } else {
    cat("is it me you're looking for...")
  }

  return(sequence)
}
generate_dna(10)
Hello...
[1] "TCCATTCATC"
generate_dna(10, fasta = FALSE)
is it me you're looking for...
 [1] "T" "A" "T" "A" "A" "A" "A" "A" "C" "C"
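
The two output formats carry the same information. A sketch of converting between them with `paste()` and `strsplit()`, using a fixed example vector rather than random output:

```r
vec <- c("T", "A", "T", "A")

# Vector to single string
fasta <- paste(vec, collapse = "")

# Single string back to a vector of single characters
back <- strsplit(fasta, split = "")[[1]]

identical(vec, back)
```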

A new cool function

Q. Write a function called generate_protein() that generates a protein sequence of user-specified length in FASTA format

generate_protein <- function(n) {
  aa <- sample(c("A","R","N","D","C","Q","E","G","H",
        "I","L","K","M","F","P","S","T","W","Y","V"), size=n, replace=TRUE)
  protein <- paste(aa, collapse="")
  return(protein)
}
generate_protein(10)
[1] "FIAMGCLFYV"
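
As written, the function trusts whatever n it is given. One possible hardening, sketched under the hypothetical name `generate_protein_checked` with an assumed default of n = 6, validates the input with `stopifnot()` before sampling.

```r
# Hypothetical variant with input checking (not part of the original answer)
generate_protein_checked <- function(n = 6) {
  # n must be a single whole number of at least 1
  stopifnot(is.numeric(n), length(n) == 1, n >= 1, n == round(n))

  # The 20 standard amino acids, one-letter codes
  aa <- c("A", "R", "N", "D", "C", "Q", "E", "G", "H", "I",
          "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

  paste(sample(aa, size = n, replace = TRUE), collapse = "")
}

nchar(generate_protein_checked(10))
```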

Q. Use your new generate_protein() function to generate sequences between 6 and 12 amino acids in length and check whether any of these are unique in nature (i.e. not found in the NR database at NCBI)

generate_protein(6)
[1] "HEQGMM"
generate_protein(7)
[1] "SITVSNT"
generate_protein(8)
[1] "YCEVTDGP"
generate_protein(9)
[1] "LCNQFYICR"
generate_protein(10)
[1] "CWRLAFNYHC"
generate_protein(11)
[1] "QYIVKKTGKVK"
generate_protein(12)
[1] "VIVDKYNASMMS"

Or we could do a for() loop:

for(i in 6:12) {
  cat(">", i, "\n", sep="")
  cat(generate_protein(i), "\n")
}
>6
ARDDHM 
>7
WQYMSYN 
>8
EWFNKAVL 
>9
SDSNMGRAW 
>10
TDCQFDCMLQ 
>11
FSNFRTQIRKF 
>12
MFHEWVWGKYYA
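
An alternative to the loop is `sapply()`, which also stores the sequences for later use instead of only printing them. This is a sketch; the names `seq_lengths`, `seqs`, and `fasta` are introduced here for illustration, and generate_protein() is repeated so the block runs on its own.

```r
generate_protein <- function(n) {
  aa <- sample(c("A","R","N","D","C","Q","E","G","H",
        "I","L","K","M","F","P","S","T","W","Y","V"), size=n, replace=TRUE)
  paste(aa, collapse="")
}

# One sequence per length, kept in a character vector
seq_lengths <- 6:12
seqs <- sapply(seq_lengths, generate_protein)
names(seqs) <- seq_lengths

# Build FASTA records: a ">" header line followed by the sequence
fasta <- paste0(">", names(seqs), "\n", seqs)
cat(fasta, sep = "\n")
```

The printed text can then be pasted into a search form in one go.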