Introduction to Bioconductor

HBIGS Core Course, May 15, 2025

Author

Florian Huber
florian.huber@flohub.de
Website: flohub.de

What is Bioconductor?

Bioconductor: Open Source Software for Bioinformatics

  • Bioconductor is an ecosystem of R packages for the analysis of biological data.
  • Ecosystem of packages that play well together.
  • Offers packages for sequencing analysis, genomics, searching databases and visualizations, plotting, etc.
  • Two releases per year.
  • BioC package subject to peer review and certain guidelines.
  • Not maintained packages are removed.
  • One of if not the reason for biologists to learn R programming.

Installing Bioconductor

  • Bioconductor builds on R so you need to install R first.
  • BioC installation is managed by BioC’s package manager BiocManager:
> if (!require("BiocManager", quietly = TRUE))
+   install.packages("BiocManager")
> 
> # leave empty to install the newest version
> BiocManager::install() # leave empty or specify e.g. version="3.18"
  • To install BioC packages run:
> BiocManager::install(c("sequencing", "DESeq2"))
> 
> # to update packages run BiocManager::install() without arguments: 
> BiocManager::install()
  • After installation packages can be loaded just like in R with the library() command:
> library(DESeq2)

Exploring Bioconductor and Bioconductor Resources

  • The Bioconductor website is a good place to start. There are a lot of example workflows and online learning materials available.
  • Bioconductor makes heavy use of so-called “vignettes”, which are a sort of tutorial to walk you through packages.
  • For example, to get a tutorial on how to work with sequence data in Bioconductor:
> browseVignettes("sequencing")
  • Another good place to get started is to read this paper introducing the Bioconductor framework.

About this Course

  • We will cover a number of topics that are essential to understand how Bioconductor works and that are important in many of the analyses that people use Bioconductor for.
  • However, as this is only a one-day course some topics are not covered at all and we will not have time to dive very deep into the topics covered.
  • Therefore, I have tried to focus on a “learning to learn” approach: Ways of finding help and consulting the documentation are shown. The course itself is a mix of a bit of theory, live coding, and hands-on exercises. This should help you to “hit the ground running” for your future analyses and to know where to continue.

Part I: Working with Biological Sequences

Overview

There are a lot of BioC packages that serve different purposes when working with DNA/RNA/protein sequence data. They cover all your needs from sequencing quality control to searching genomes, motif search, gene expression analysis etc. This figure is an overview for future reference (taken from vignette(sequencing)):

Biostrings

Assume you want to run analyses on DNA, RNA, or amino acid (AA) sequences such as finding motifs, calculating the GC content etc. This is where the Biostrings package comes in. It provides data types for DNA, RNA, and AA sequence representation and operations.

> library(Biostrings)
> 
> my_dna <- DNAString("ATGCTACGTATA")
> my_dna
## 12-letter DNAString object
## seq: ATGCTACGTATA
> 
> my_dna2 <- DNAString("ATGC.YG") # pYrimidines = C and T
> ?DNAString
> IUPAC_CODE_MAP
##      A      C      G      T      M      R      W      S      Y      K      V 
##    "A"    "C"    "G"    "T"   "AC"   "AG"   "AT"   "CG"   "CT"   "GT"  "ACG" 
##      H      D      B      N 
##  "ACT"  "AGT"  "CGT" "ACGT"
> 
> DNAString("AUG")
## Error in .Call2("new_XString_from_CHARACTER", class(x0), string, start, : key 85 (char 'U') not in lookup table
> DNA_ALPHABET
##  [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" "."
> 
> RNAString("AUG")
## 3-letter RNAString object
## seq: AUG
> 
> a_protein <- AAString("FHIG")
> a_protein
## 4-letter AAString object
## seq: FHIG

These “Biostrings” support a number of useful and biologically meaningful operations.

> my_dna
## 12-letter DNAString object
## seq: ATGCTACGTATA
> 
> my_dna[4:6]
## 3-letter DNAString object
## seq: CTA
> reverseComplement(my_dna)
## 12-letter DNAString object
## seq: TATACGTAGCAT
> translate(my_dna)
## 4-letter AAString object
## seq: MLRI
> codons(my_dna)
## Views on a 12-letter DNAString subject
## subject: ATGCTACGTATA
## views:
##       start end width
##   [1]     1   3     3 [ATG]
##   [2]     4   6     3 [CTA]
##   [3]     7   9     3 [CGT]
##   [4]    10  12     3 [ATA]
> 
> # GENETIC_CODE
> # GENETIC_CODE_TABLE
> 
> mt_genetic_code <- getGeneticCode("SGC1")
> translate(my_dna, genetic.code = mt_genetic_code)
## 4-letter AAString object
## seq: MLRM
> rev(my_dna)
## 12-letter DNAString object
## seq: ATATGCATCGTA

A Look Behind the Scenes: Some Technical Background on Bioconductor and How This Helps Us

Functional Object-Oriented Programming

Let’s interrupt our journey into Bioconductor for a technical detour which will improve your understanding of how R and Bioconductor works. This will make it easier for you to understand the documentation and help files, tutorials etc.

After all, at this point Bioconductor may seem like some wonderful but also confusing black box:

  • What are all these “objects”?
  • For example, what is a “DNAString object”? What can I do with it? Why does rev() from base R work with it?
  • How can I find help?

These are questions you may have already had when learning standard (“base”) R but in Bioconductor knowing the answers becomes more important to effectively use and understand the documentation.

When you write code in R you always “do something” with functions. Every operation invokes a function, either explicitly, for example when you call mean(), sd() etc., or behind the scenes, for example when you use mathematical operators or parentheses. Because functions are so central in R, R is also sometimes called a functional programming language.

> v <- c(3, 4, 8, 12)
> mean(v)
## [1] 6.75
> 
> 5 + 7
## [1] 12
> # note: everything that "does something" in R is a function. Even the +. 
> # Use backticks to call it like a function
> `+`(5, 7)
## [1] 12
> `[`(v, 3)
## [1] 8

However, this is only half the story. Functions always act on objects and objects are always of a certain class. But what are classes? Classes can be thought of as a custom data type. For example, you can define a “DNAString class” as you saw before. The user can then use a so-called constructor function or method to generate objects of that class, so-called “instances”. The class definition defines which properties a class has, which data it contains, and which operations are allowed on it. For example, a DNAString object (instance) contains some data (the DNA sequence), has some properties (for example its length and other metadata), and there are some operations that it supports (for example: creating the reverse complement or translating the sequence).

In fact, this is a behaviour not specific to Bioconductor but of the R language in general. Take a vector of strings, for example: it can be thought of as an instance of the “character” class and this implies that certain operations are supported whereas others are not:

> genes <- c("recA", "mrcB")
> class(genes)
## [1] "character"
> 
> nchar(genes)
## [1] 4 4
> sum(genes)
## Error in sum(genes): invalid 'type' (character) of argument

But as you may have guessed from the “DNAString class”, classes can be more complex: for example, if you ever ran a linear regression in R the output was an object of class “lm”.

> x <- rnorm(5)
> y <- rnorm(5)
> my_lm <- lm(y ~ x)
> my_lm
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##      0.1500      -0.9112
> class(my_lm)
## [1] "lm"

It is this class property which will influence what functions do with their arguments: For example, when you use summary(), depending on the argument type (its class), you get a different output:

> x
## [1] -0.049786590  0.472937922  0.218420129  0.008936016  1.055637415
> summary(x)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.049787  0.008936  0.218420  0.341229  0.472938  1.055637
> 
> summary(my_lm)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##        1        2        3        4        5 
## -0.76021  0.62145 -1.75704  1.86229  0.03351 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.1500     0.9292   0.161    0.882
## x            -0.9112     1.7635  -0.517    0.641
## 
## Residual standard error: 1.583 on 3 degrees of freedom
## Multiple R-squared:  0.08172,    Adjusted R-squared:  -0.2244 
## F-statistic: 0.267 on 1 and 3 DF,  p-value: 0.641

This is why functions in R “magically” seem to know what to do with their arguments. Many R functions will first look at the type of their first argument and then call a specialized function behind the scenes to do the actual work. So when you call summary() on the lm object it called the specialized summary.lm() method behind the scenes:

> summary.lm(my_lm)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##        1        2        3        4        5 
## -0.76021  0.62145 -1.75704  1.86229  0.03351 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.1500     0.9292   0.161    0.882
## x            -0.9112     1.7635  -0.517    0.641
## 
## Residual standard error: 1.583 on 3 degrees of freedom
## Multiple R-squared:  0.08172,    Adjusted R-squared:  -0.2244 
## F-statistic: 0.267 on 1 and 3 DF,  p-value: 0.641
> # fails:
> summary.lm(x)
## Error in z$rank: $ operator is invalid for atomic vectors
> 
> # in the case of x it's not summary.numeric but summary.default:
> summary.default(x)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.049787  0.008936  0.218420  0.341229  0.472938  1.055637

We say that summary is a generic function that invokes particular methods depending on the class of its first argument. See also the help file for summary: ?summary. This mechanism is also called type-based dispatch.

This, in a nutshell, is how object oriented programming (OOP) works in R. It is a bit different from how OOP works in languages like Python or Java.

You may be surprised at the number of methods for a given generic function. You can find them by using the methods() function:

> methods("summary")
##  [1] summary,ANY-method                   summary,CompressedIRangesList-method
##  [3] summary,FilterResults-method         summary,FilterRules-method          
##  [5] summary,Hits-method                  summary,IPos-method                 
##  [7] summary,IPosRanges-method            summary,mle-method                  
##  [9] summary,Rle-method                   summary,Seqinfo-method              
## [11] summary,Vector-method                summary.aov                         
## [13] summary.aovlist*                     summary.aspell*                     
## [15] summary.check_packages_in_dir*       summary.connection                  
## [17] summary.data.frame                   summary.Date                        
## [19] summary.default                      summary.ecdf*                       
## [21] summary.factor                       summary.glm                         
## [23] summary.Hits                         summary.infl*                       
## [25] summary.IPos                         summary.IPosRanges                  
## [27] summary.lm                           summary.loess*                      
## [29] summary.manova                       summary.matrix                      
## [31] summary.mlm*                         summary.nls*                        
## [33] summary.packageStatus*               summary.POSIXct                     
## [35] summary.POSIXlt                      summary.ppr*                        
## [37] summary.prcomp*                      summary.princomp*                   
## [39] summary.proc_time                    summary.rlang_error*                
## [41] summary.rlang_message*               summary.rlang_trace*                
## [43] summary.rlang_warning*               summary.rlang:::list_of_conditions* 
## [45] summary.Rle                          summary.Seqinfo                     
## [47] summary.srcfile                      summary.srcref                      
## [49] summary.stepfun                      summary.stl*                        
## [51] summary.table                        summary.tukeysmooth*                
## [53] summary.Vector                       summary.warnings                    
## see '?methods' for accessing help and source code

To find out which methods exist for a given class use the methods() function:

> methods(class="lm")
##  [1] add1           alias          anova          case.names     coerce        
##  [6] confint        cooks.distance deviance       dfbeta         dfbetas       
## [11] drop1          dummy.coef     effects        extractAIC     family        
## [16] formula        hatvalues      influence      initialize     kappa         
## [21] labels         logLik         model.frame    model.matrix   nobs          
## [26] plot           predict        print          proj           qr            
## [31] residuals      rstandard      rstudent       show           simulate      
## [36] slotsFromS3    summary        variable.names vcov          
## see '?methods' for accessing help and source code

The previous paragraphs describe what is known as the S3 class system.

Bioconductor uses the more advanced S4 class system, which is similar to S3 but it offers some additional features. For example, classes can have multiple parent classes and dispatch can be based on the classes of multiple arguments. However, conceptually, from a user’s perspective, it works similar to S3.

As the name “parent class” suggests, classes can be related to each other. Subclasses (also called child classes or derived classes) are often more specialized versions of their parent classes (= superclasses = base classes).

How Does This Help Us?

Now let us find out how this knowledge can help us to navigate Bioconductor. For example, consider our “DNAString” object:

Code
> my_dna <- DNAString("ATGCTACGTATA")
> class(my_dna)
## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"
> 
> # class = is necessary, the default first argument is generic.function
> methods(class = "DNAString")
##   [1] !=                                         
##   [2] [                                          
##   [3] [<-                                        
##   [4] %in%                                       
##   [5] <                                          
##   [6] <=                                         
##   [7] ==                                         
##   [8] >                                          
##   [9] >=                                         
##  [10] aggregate                                  
##  [11] alphabetFrequency                          
##  [12] anyDuplicated                              
##  [13] anyNA                                      
##  [14] append                                     
##  [15] as.character                               
##  [16] as.complex                                 
##  [17] as.data.frame                              
##  [18] as.env                                     
##  [19] as.integer                                 
##  [20] as.list                                    
##  [21] as.logical                                 
##  [22] as.matrix                                  
##  [23] as.numeric                                 
##  [24] as.raw                                     
##  [25] as.vector                                  
##  [26] bindROWS                                   
##  [27] by                                         
##  [28] c                                          
##  [29] chartr                                     
##  [30] codons                                     
##  [31] coerce                                     
##  [32] compact                                    
##  [33] complement                                 
##  [34] countOverlaps                              
##  [35] countPattern                               
##  [36] countPDict                                 
##  [37] countPWM                                   
##  [38] duplicated                                 
##  [39] elementMetadata                            
##  [40] elementMetadata<-                          
##  [41] eval                                       
##  [42] expand                                     
##  [43] expand.grid                                
##  [44] extract_character_from_XString_by_positions
##  [45] extract_character_from_XString_by_ranges   
##  [46] extractAt                                  
##  [47] extractList                                
##  [48] extractROWS                                
##  [49] extractTranscriptSeqs                      
##  [50] FactorToClass                              
##  [51] findOverlaps                               
##  [52] findPalindromes                            
##  [53] hasOnlyBaseLetters                         
##  [54] head                                       
##  [55] intersect                                  
##  [56] is.na                                      
##  [57] isMatchingEndingAt                         
##  [58] isMatchingStartingAt                       
##  [59] lcprefix                                   
##  [60] lcsubstr                                   
##  [61] lcsuffix                                   
##  [62] length                                     
##  [63] lengths                                    
##  [64] letter                                     
##  [65] letterFrequency                            
##  [66] letterFrequencyInSlidingView               
##  [67] make_XString_from_string                   
##  [68] maskMotif                                  
##  [69] masks                                      
##  [70] masks<-                                    
##  [71] match                                      
##  [72] matchLRPatterns                            
##  [73] matchPattern                               
##  [74] matchPDict                                 
##  [75] matchProbePair                             
##  [76] matchPWM                                   
##  [77] mcols                                      
##  [78] mcols<-                                    
##  [79] merge                                      
##  [80] mergeROWS                                  
##  [81] metadata                                   
##  [82] metadata<-                                 
##  [83] mstack                                     
##  [84] nchar                                      
##  [85] neditEndingAt                              
##  [86] neditStartingAt                            
##  [87] oligonucleotideFrequency                   
##  [88] overlapsAny                                
##  [89] palindromeArmLength                        
##  [90] palindromeLeftArm                          
##  [91] palindromeRightArm                         
##  [92] parallel_slot_names                        
##  [93] pcompare                                   
##  [94] pmatchPattern                              
##  [95] rank                                       
##  [96] relist                                     
##  [97] relistToClass                              
##  [98] rename                                     
##  [99] rep                                        
## [100] rep.int                                    
## [101] replaceAt                                  
## [102] replaceLetterAt                            
## [103] replaceROWS                                
## [104] rev                                        
## [105] reverse                                    
## [106] reverseComplement                          
## [107] selfmatch                                  
## [108] seqlevelsInUse                             
## [109] seqtype                                    
## [110] seqtype<-                                  
## [111] setdiff                                    
## [112] setequal                                   
## [113] shiftApply                                 
## [114] show                                       
## [115] showAsCell                                 
## [116] sort                                       
## [117] split                                      
## [118] split<-                                    
## [119] subseq                                     
## [120] subseq<-                                   
## [121] subset                                     
## [122] subsetByOverlaps                           
## [123] substr                                     
## [124] substring                                  
## [125] summary                                    
## [126] table                                      
## [127] tail                                       
## [128] tapply                                     
## [129] toComplex                                  
## [130] toString                                   
## [131] transform                                  
## [132] translate                                  
## [133] trimLRPatterns                             
## [134] twoWayAlphabetFrequency                    
## [135] union                                      
## [136] unique                                     
## [137] uniqueLetters                              
## [138] unmasked                                   
## [139] updateObject                               
## [140] values                                     
## [141] values<-                                   
## [142] vcountPattern                              
## [143] vcountPDict                                
## [144] Views                                      
## [145] vmatchPattern                              
## [146] vmatchPDict                                
## [147] vwhichPDict                                
## [148] which.isMatchingEndingAt                   
## [149] which.isMatchingStartingAt                 
## [150] whichPDict                                 
## [151] window                                     
## [152] window<-                                   
## [153] with                                       
## [154] xtabs                                      
## [155] xtfrm                                      
## [156] xvcopy                                     
## see '?methods' for accessing help and source code

We can see that DNAStrings support a translate method.

> # magic!
> translate(my_dna)
## 4-letter AAString object
## seq: MLRI

With methods(class = "DNAString") we could find out which methods the DNAString class supports (what we “can do with it”). But we can also ask in the other direction: Which other classes support a translate method, i.e. what other things can we translate? Let’s find out:

> methods("translate")
## [1] translate,DNAString-method       translate,DNAStringSet-method   
## [3] translate,MaskedDNAString-method translate,MaskedRNAString-method
## [5] translate,RNAString-method       translate,RNAStringSet-method   
## see '?methods' for accessing help and source code

To find out if an object is of a certain class (or derives from it), use is():

> is(my_dna, "DNAString")
## [1] TRUE
> is(my_dna, "XString")
## [1] TRUE

To get more help about a class, you can add some qualifiers to ? to be more specific about the class or method you are interested in, for example:

> # in this case the following to point to the same help page but this is not 
> # always the case
> ?DNAString
> ?"DNAString-class"
> ?`translate,DNAString-method`

Using the Vignettes

One of the great things about Bioconductor is that most packages have very detailed tutorials called “vignettes”. You can check them in your package inspector in RStudio or like this:

> browseVignettes(package = "Biostrings")

Now that you know about Bioconductor classes and methods you will find those help files more useful.

Exercise

Explore the different classes in the Biostrings package and the operations they support using the R documentation and vignettes only.

The Ranges infrastructure: IRanges and GenomicRanges

Classes to work with DNA, RNA, and protein sequences are a good start but often the sequences you work with are embedded in a common context, for example a chromosome or a set of chromosomes.

There are many things you may be interested in when working with a genome: perhaps you are interested in finding all the introns of a specific gene or you may be looking for certain sequence motifs in the upstream sequence of a protein coding gene and many more.

To support such operations it is necessary to define classes that support the properties and operations required for such work. The “IRanges” and “GRanges” (GenomicRanges) classes provide these features.

IRanges

IRanges is the name for a class that supports generic operations on sequence data. They form the foundation for GenomicRanges, which is Bioconductor’s way of representing genome data. Let’s have a quick look at what IRanges can do:

> library(IRanges)
> 
> ir1 <- IRanges(start = c(1, 5, 15), 
+               width = c(3, 25, 10))
> ir1
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         1         3         3
##   [2]         5        29        25
##   [3]        15        24        10
> 
> ir2 <- IRanges(start = 7, width = 5)
> ir2 
## IRanges object with 1 range and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         7        11         5
> 
> start(ir1)
## [1]  1  5 15
> ir1[2:3]
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         5        29        25
##   [2]        15        24        10

The following function will make it easier to illustrate the different IRanges operations:

> # from the IRanges vignette:
> plotRanges <- function(x, xlim=x, main=deparse(substitute(x)), col="black", sep=0.5, ...) {
+   height <- 1
+   if (is(xlim, "IntegerRanges"))
+     xlim <- c(min(start(xlim)), max(end(xlim)))
+   bins <- disjointBins(IRanges(start(x), end(x) + 1))
+   plot.new()
+   plot.window(xlim, c(0, max(bins)*(height + sep)))
+   ybottom <- bins * (sep + height) - height
+   rect(start(x)-0.5, ybottom, end(x)+0.5, ybottom + height, col=col, ...)
+   title(main)
+   axis(1)
+ }
> 
> plotRanges(ir1)

> plotRanges(ir2)

> plotRanges(reduce(ir1), xlim = c(1, 30))

The following selection of operations give you a taste of what’s possible with IRanges. Can you think of the corresponding biological questions?

> findOverlaps(ir1, ir2)
## Hits object with 1 hit and 0 metadata columns:
##       queryHits subjectHits
##       <integer>   <integer>
##   [1]         2           1
##   -------
##   queryLength: 3 / subjectLength: 1
> 
> plotRanges(disjoin(ir1))

> gaps(ir1)
## IRanges object with 1 range and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         4         4         1
> ir1
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         1         3         3
##   [2]         5        29        25
##   [3]        15        24        10
> flank(ir1, width = 5)
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]        -4         0         5
##   [2]         0         4         5
##   [3]        10        14         5

GenomicRanges (GRanges)

The GRanges class is how functional units (e.g. genes) on a chromosome are represented in Bioconductor. It has biologically meaningful properties such as the strand on which a sequence is located and supports metadata.

> library(GenomicRanges)
> 
> gr <- GRanges(
+     seqnames = Rle(c("chr1", "chr2", "chr1"), c(1, 3, 2)),
+     ranges = IRanges(start=101:106, end = 111:116, names = letters[1:6]), 
+     strand = strand(c("+", "+", "-", "-", "+", "+")), 
+     score = runif(6), 
+     GC = c(0.45, 0.47, 0.7, 0.68, 0.3, 0.33))
> gr
## GRanges object with 6 ranges and 2 metadata columns:
##     seqnames    ranges strand |     score        GC
##        <Rle> <IRanges>  <Rle> | <numeric> <numeric>
##   a     chr1   101-111      + |  0.714799      0.45
##   b     chr2   102-112      + |  0.786448      0.47
##   c     chr2   103-113      - |  0.499891      0.70
##   d     chr2   104-114      - |  0.654050      0.68
##   e     chr1   105-115      + |  0.821489      0.30
##   f     chr1   106-116      + |  0.946777      0.33
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

There are now a couple of accessor functions for the genomic coordinates as well as the metadata columns:

> # GenomicRanges are vector-like
> gr
## GRanges object with 6 ranges and 2 metadata columns:
##     seqnames    ranges strand |     score        GC
##        <Rle> <IRanges>  <Rle> | <numeric> <numeric>
##   a     chr1   101-111      + |  0.714799      0.45
##   b     chr2   102-112      + |  0.786448      0.47
##   c     chr2   103-113      - |  0.499891      0.70
##   d     chr2   104-114      - |  0.654050      0.68
##   e     chr1   105-115      + |  0.821489      0.30
##   f     chr1   106-116      + |  0.946777      0.33
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
> gr[c("a", "c")]
## GRanges object with 2 ranges and 2 metadata columns:
##     seqnames    ranges strand |     score        GC
##        <Rle> <IRanges>  <Rle> | <numeric> <numeric>
##   a     chr1   101-111      + |  0.714799      0.45
##   c     chr2   103-113      - |  0.499891      0.70
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
> strand(gr)
## factor-Rle of length 6 with 3 runs
##   Lengths: 2 2 2
##   Values : + - +
## Levels(3): + - *
> granges(gr)
## GRanges object with 6 ranges and 0 metadata columns:
##     seqnames    ranges strand
##        <Rle> <IRanges>  <Rle>
##   a     chr1   101-111      +
##   b     chr2   102-112      +
##   c     chr2   103-113      -
##   d     chr2   104-114      -
##   e     chr1   105-115      +
##   f     chr1   106-116      +
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
> names(gr)
## [1] "a" "b" "c" "d" "e" "f"
> 
> # to get the metadata
> mcols(gr)
## DataFrame with 6 rows and 2 columns
##       score        GC
##   <numeric> <numeric>
## a  0.714799      0.45
## b  0.786448      0.47
## c  0.499891      0.70
## d  0.654050      0.68
## e  0.821489      0.30
## f  0.946777      0.33
> mcols(gr)[c("a", "c"), ]
## DataFrame with 2 rows and 2 columns
##       score        GC
##   <numeric> <numeric>
## a  0.714799      0.45
## c  0.499891      0.70
> mcols(gr)[1]
## DataFrame with 6 rows and 1 column
##       score
##   <numeric>
## a  0.714799
## b  0.786448
## c  0.499891
## d  0.654050
## e  0.821489
## f  0.946777
> mcols(gr)[[1]]
## [1] 0.7147989 0.7864476 0.4998908 0.6540496 0.8214889 0.9467766
> 
> # to get a subset of the GRanges with only part of the metadata:
> gr[3:5, "score"]
## GRanges object with 3 ranges and 1 metadata column:
##     seqnames    ranges strand |     score
##        <Rle> <IRanges>  <Rle> | <numeric>
##   c     chr2   103-113      - |  0.499891
##   d     chr2   104-114      - |  0.654050
##   e     chr1   105-115      + |  0.821489
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

Several GenomicRanges can be put together into a list, the GRangesList. This makes sense because some sequences are made from disjoint sequences on a chromosome. Think of transcripts, for example, that can consist of several exons:

> gr1 <- GRanges(
+     seqnames = "chr2",
+     ranges = IRanges(103, 106),
+     strand = "+",
+     score = 5L, GC = 0.45)
> gr2 <- GRanges(
+     seqnames = c("chr1", "chr1"),
+     ranges = IRanges(c(107, 113), width = 3),
+     strand = c("+", "-"),
+     score = 3:4, GC = c(0.3, 0.5))
> grl <- GRangesList("txA" = gr1, "txB" = gr2)
> grl
## GRangesList object of length 2:
## $txA
## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |     score        GC
##          <Rle> <IRanges>  <Rle> | <integer> <numeric>
##   [1]     chr2   103-106      + |         5      0.45
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
## 
## $txB
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames    ranges strand |     score        GC
##          <Rle> <IRanges>  <Rle> | <integer> <numeric>
##   [1]     chr1   107-109      + |         3       0.3
##   [2]     chr1   113-115      - |         4       0.5
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

Loading sequence data into R

There are several ways to import sequence data into R using Bioconductor. Depending on your need you will probably access them from either: - local files, perhaps as a result from sequencing e.g. fasta, fastq, BAM, gtf, bed, wig etc. files. - some sort of sequence database, including reference genomes.

You can read more on sequence imports on this page:

> browseVignettes("sequencing")

Accessing Sequence Information from Reference Genomes

Imagine the following scenario: in a screening experiment you identified, say, 100 genes which responded to your treatment. Now you may be interested if certain sequence motifs, e.g. transcription factor binding sites, are present in the promoter regions of these genes. Or perhaps you want to do some comparative genomics and perform a BLAST search. How can you do this in Bioconductor?

To do this, you will need two things:

  • A package containing the genome of the species you are interested in. These usually have the format “BSgenome.SPECIESNAME.UCSC…”.
  • A package giving you information about the transcripts in the genome. These are often called “TxDb.SPECIESNAME.UCSC…”.

In addition, you will probably face the problem that you need to work with gene names and annotations. This topic is covered in more detail at the end of this course (section “Using Annotation Resources”).

> library("BSgenome.Scerevisiae.UCSC.sacCer3")
> 
> sc_genome <- BSgenome.Scerevisiae.UCSC.sacCer3
> sc_genome
## | BSgenome object for Yeast
## | - organism: Saccharomyces cerevisiae
## | - provider: UCSC
## | - genome: sacCer3
## | - release date: April 2011
## | - 17 sequence(s):
## |     chrI    chrII   chrIII  chrIV   chrV    chrVI   chrVII  chrVIII chrIX  
## |     chrX    chrXI   chrXII  chrXIII chrXIV  chrXV   chrXVI  chrM           
## | 
## | Tips: call 'seqnames()' on the object to get all the sequence names, call
## | 'seqinfo()' to get the full sequence info, use the '$' or '[[' operator to
## | access a given sequence, see '?BSgenome' for more information.
> seqnames(sc_genome)
##  [1] "chrI"    "chrII"   "chrIII"  "chrIV"   "chrV"    "chrVI"   "chrVII" 
##  [8] "chrVIII" "chrIX"   "chrX"    "chrXI"   "chrXII"  "chrXIII" "chrXIV" 
## [15] "chrXV"   "chrXVI"  "chrM"
> seqinfo(sc_genome)
## Seqinfo object with 17 sequences (1 circular) from sacCer3 genome:
##   seqnames seqlengths isCircular  genome
##   chrI         230218      FALSE sacCer3
##   chrII        813184      FALSE sacCer3
##   chrIII       316620      FALSE sacCer3
##   chrIV       1531933      FALSE sacCer3
##   chrV         576874      FALSE sacCer3
##   ...             ...        ...     ...
##   chrXIII      924431      FALSE sacCer3
##   chrXIV       784333      FALSE sacCer3
##   chrXV       1091291      FALSE sacCer3
##   chrXVI       948066      FALSE sacCer3
##   chrM          85779       TRUE sacCer3
> ?'BSgenome'
> metadata(sc_genome)
## $organism
## [1] "Saccharomyces cerevisiae"
## 
## $common_name
## [1] "Yeast"
## 
## $provider
## [1] "UCSC"
## 
## $genome
## [1] "sacCer3"
## 
## $release_date
## [1] "April 2011"
## 
## $source_url
## [1] "http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/"
> seqnames(sc_genome)
##  [1] "chrI"    "chrII"   "chrIII"  "chrIV"   "chrV"    "chrVI"   "chrVII" 
##  [8] "chrVIII" "chrIX"   "chrX"    "chrXI"   "chrXII"  "chrXIII" "chrXIV" 
## [15] "chrXV"   "chrXVI"  "chrM"
> seqlengths(sc_genome)
##    chrI   chrII  chrIII   chrIV    chrV   chrVI  chrVII chrVIII   chrIX    chrX 
##  230218  813184  316620 1531933  576874  270161 1090940  562643  439888  745751 
##   chrXI  chrXII chrXIII  chrXIV   chrXV  chrXVI    chrM 
##  666816 1078177  924431  784333 1091291  948066   85779
> mseqnames(sc_genome)
## NULL
> 
> sapply(
+   list(metadata = metadata, seqnames = seqnames, 
+        seqlengths = seqlengths, masknames = masknames, 
+        mseqnames = mseqnames), 
+   function(f) f(sc_genome)
+   )
## $metadata
## $metadata$organism
## [1] "Saccharomyces cerevisiae"
## 
## $metadata$common_name
## [1] "Yeast"
## 
## $metadata$provider
## [1] "UCSC"
## 
## $metadata$genome
## [1] "sacCer3"
## 
## $metadata$release_date
## [1] "April 2011"
## 
## $metadata$source_url
## [1] "http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/"
## 
## 
## $seqnames
##  [1] "chrI"    "chrII"   "chrIII"  "chrIV"   "chrV"    "chrVI"   "chrVII" 
##  [8] "chrVIII" "chrIX"   "chrX"    "chrXI"   "chrXII"  "chrXIII" "chrXIV" 
## [15] "chrXV"   "chrXVI"  "chrM"   
## 
## $seqlengths
##    chrI   chrII  chrIII   chrIV    chrV   chrVI  chrVII chrVIII   chrIX    chrX 
##  230218  813184  316620 1531933  576874  270161 1090940  562643  439888  745751 
##   chrXI  chrXII chrXIII  chrXIV   chrXV  chrXVI    chrM 
##  666816 1078177  924431  784333 1091291  948066   85779 
## 
## $masknames
## NULL
## 
## $mseqnames
## NULL
> chrI <- sc_genome[["chrI"]]
> chrI 
## 230218-letter DNAString object
## seq: CCACACCACACCCACACACCCACACACCACACCACA...GGTGTGTGGGTGTGGTGTGGGTGTGGTGTGTGTGGG
> class(chrI)
## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"
> length(chrI)
## [1] 230218
> chrI[150:250]
## 101-letter DNAString object
## seq: CCCTGTCCCATTCAACCATACCACTCCGAACCACCA...ACCGTTACCCTCCAATTACCCATATCCAACCCACTG

To get information on transcripts in, for example, the yeast genome, we load the “TxDb.Scerevisiae.UCSC.sacCer3.sgdGene” package. It supports a wide range of interesting operations:

> library("TxDb.Scerevisiae.UCSC.sacCer3.sgdGene")
> sc_txdb <- TxDb.Scerevisiae.UCSC.sacCer3.sgdGene
> sc_txdb
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: sacCer3
## # Organism: Saccharomyces cerevisiae
## # Taxonomy ID: 4932
## # UCSC Table: sgdGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Name of canonical transcript in cluster
## # Full dataset: yes
## # miRBase build ID: NA
## # transcript_nrow: 6692
## # exon_nrow: 7034
## # cds_nrow: 7034
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-10-07 18:20:42 +0000 (Wed, 07 Oct 2015)
## # GenomicFeatures version at creation time: 1.21.30
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1
> class(sc_txdb)
## [1] "TxDb"
## attr(,"package")
## [1] "GenomicFeatures"
> 
> # Check the help: ?"TxDb-class"
> methods(class = "TxDb")
##  [1] $                      $<-                    annotatedDataFrameFrom
##  [4] as.list                asBED                  asGFF                 
##  [7] assayData              assayData<-            cds                   
## [10] cdsBy                  cdsByOverlaps          coerce                
## [13] columns                combine                contents              
## [16] dbconn                 dbfile                 dbInfo                
## [19] dbmeta                 dbschema               distance              
## [22] exons                  exonsBy                exonsByOverlaps       
## [25] ExpressionSet          extractUpstreamSeqs    featureNames          
## [28] featureNames<-         fiveUTRsByTranscript   genes                 
## [31] initialize             intronsByTranscript    isActiveSeq           
## [34] isActiveSeq<-          isNA                   keys                  
## [37] keytypes               mapIds                 mapIdsToRanges        
## [40] mappedkeys             mapRangesToIds         mapToTranscripts      
## [43] metadata               microRNAs              nhit                  
## [46] organism               promoters              revmap                
## [49] sample                 sampleNames            sampleNames<-         
## [52] saveDb                 select                 seqinfo               
## [55] seqinfo<-              seqlevels<-            seqlevels0            
## [58] show                   species                storageMode           
## [61] storageMode<-          taxonomyId             terminators           
## [64] threeUTRsByTranscript  transcripts            transcriptsBy         
## [67] transcriptsByOverlaps  tRNAs                  updateObject          
## see '?methods' for accessing help and source code
> 
> seqinfo(sc_txdb)
## Seqinfo object with 17 sequences (1 circular) from sacCer3 genome:
##   seqnames seqlengths isCircular  genome
##   chrI         230218       <NA> sacCer3
##   chrII        813184       <NA> sacCer3
##   chrIII       316620       <NA> sacCer3
##   chrIV       1531933       <NA> sacCer3
##   chrV         576874       <NA> sacCer3
##   ...             ...        ...     ...
##   chrXIII      924431       <NA> sacCer3
##   chrXIV       784333       <NA> sacCer3
##   chrXV       1091291       <NA> sacCer3
##   chrXVI       948066       <NA> sacCer3
##   chrM          85779       TRUE sacCer3
> seqlevels(sc_txdb)
##  [1] "chrI"    "chrII"   "chrIII"  "chrIV"   "chrV"    "chrVI"   "chrVII" 
##  [8] "chrVIII" "chrIX"   "chrX"    "chrXI"   "chrXII"  "chrXIII" "chrXIV" 
## [15] "chrXV"   "chrXVI"  "chrM"
> seqlengths(sc_txdb)
##    chrI   chrII  chrIII   chrIV    chrV   chrVI  chrVII chrVIII   chrIX    chrX 
##  230218  813184  316620 1531933  576874  270161 1090940  562643  439888  745751 
##   chrXI  chrXII chrXIII  chrXIV   chrXV  chrXVI    chrM 
##  666816 1078177  924431  784333 1091291  948066   85779
> 
> # two things are important (see help page ?"TxDb-class")
> # transcripts(), transcriptsBy(), transcriptsByOverlaps()
> # select-methods 
> transcripts(sc_txdb)
## GRanges object with 6692 ranges and 2 metadata columns:
##          seqnames      ranges strand |     tx_id     tx_name
##             <Rle>   <IRanges>  <Rle> | <integer> <character>
##      [1]     chrI     335-649      + |         1     YAL069W
##      [2]     chrI     538-792      + |         2   YAL068W-A
##      [3]     chrI   2480-2707      + |         3   YAL067W-A
##      [4]     chrI 10091-10399      + |         4     YAL066W
##      [5]     chrI 12046-12426      + |         5   YAL064W-B
##      ...      ...         ...    ... .       ...         ...
##   [6688]     chrM 65770-66174      + |      6688       Q0182
##   [6689]     chrM 73758-74513      + |      6689       Q0250
##   [6690]     chrM 74495-75984      + |      6690       Q0255
##   [6691]     chrM 79213-80022      + |      6691       Q0275
##   [6692]     chrM 85554-85709      + |      6692       Q0297
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
> 
> # how to find a gene? 
> # Check the help: ?`transcriptsBy,TxDb-method`
> transcriptsBy(sc_txdb)
## GRangesList object of length 6534:
## $Q0010
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames    ranges strand |     tx_id     tx_name
##          <Rle> <IRanges>  <Rle> | <integer> <character>
##   [1]     chrM 3952-4338      + |      6665       Q0010
##   [2]     chrM 4254-4415      + |      6666       Q0017
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
## 
## $Q0032
## GRanges object with 1 range and 2 metadata columns:
##       seqnames      ranges strand |     tx_id     tx_name
##          <Rle>   <IRanges>  <Rle> | <integer> <character>
##   [1]     chrM 11667-11957      + |      6667       Q0032
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
## 
## $Q0055
## GRanges object with 6 ranges and 2 metadata columns:
##       seqnames      ranges strand |     tx_id     tx_name
##          <Rle>   <IRanges>  <Rle> | <integer> <character>
##   [1]     chrM 13818-16322      + |      6668       Q0050
##   [2]     chrM 13818-18830      + |      6669       Q0055
##   [3]     chrM 13818-19996      + |      6670       Q0060
##   [4]     chrM 13818-21935      + |      6671       Q0065
##   [5]     chrM 13818-23167      + |      6672       Q0070
##   [6]     chrM 13818-26701      + |      6673       Q0045
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
## 
## ...
## <6531 more elements>
> 
> ## TO DO: how does this work exactly? 
> transcriptsBy(sc_txdb, by = "gene")
## GRangesList object of length 6534:
## $Q0010
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames    ranges strand |     tx_id     tx_name
##          <Rle> <IRanges>  <Rle> | <integer> <character>
##   [1]     chrM 3952-4338      + |      6665       Q0010
##   [2]     chrM 4254-4415      + |      6666       Q0017
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
## 
## $Q0032
## GRanges object with 1 range and 2 metadata columns:
##       seqnames      ranges strand |     tx_id     tx_name
##          <Rle>   <IRanges>  <Rle> | <integer> <character>
##   [1]     chrM 11667-11957      + |      6667       Q0032
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
## 
## $Q0055
## GRanges object with 6 ranges and 2 metadata columns:
##       seqnames      ranges strand |     tx_id     tx_name
##          <Rle>   <IRanges>  <Rle> | <integer> <character>
##   [1]     chrM 13818-16322      + |      6668       Q0050
##   [2]     chrM 13818-18830      + |      6669       Q0055
##   [3]     chrM 13818-19996      + |      6670       Q0060
##   [4]     chrM 13818-21935      + |      6671       Q0065
##   [5]     chrM 13818-23167      + |      6672       Q0070
##   [6]     chrM 13818-26701      + |      6673       Q0045
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
## 
## ...
## <6531 more elements>
> transcriptsBy(sc_txdb, by = "gene") |> names() |> head()
## [1] "Q0010" "Q0032" "Q0055" "Q0075" "Q0080" "Q0085"
> transcriptsBy(sc_txdb, by = "gene") |> lengths() |> head()
## Q0010 Q0032 Q0055 Q0075 Q0080 Q0085 
##     2     1     6     1     1     1
> # most of them have just one transcript but not all, e.g. YAL069W: 2 transcripts annotated for that gene
> transcriptsBy(sc_txdb, by = "gene")["YAL069W"]
## GRangesList object of length 1:
## $YAL069W
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames    ranges strand |     tx_id     tx_name
##          <Rle> <IRanges>  <Rle> | <integer> <character>
##   [1]     chrI   335-649      + |         1     YAL069W
##   [2]     chrI   538-792      + |         2   YAL068W-A
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
> 
> transcriptsBy(sc_txdb, by = "exon") |> names() |> head()
## [1] "1" "2" "3" "4" "5" "6"
> transcriptsBy(sc_txdb, by = "cds") |> names() |> head()
## [1] "1" "2" "3" "4" "5" "6"
> 
> exonsBy(sc_txdb, by = "tx")
## GRangesList object of length 6692:
## $`1`
## GRanges object with 1 range and 3 metadata columns:
##       seqnames    ranges strand |   exon_id   exon_name exon_rank
##          <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
##   [1]     chrI   335-649      + |         1        <NA>         1
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
## 
## $`2`
## GRanges object with 1 range and 3 metadata columns:
##       seqnames    ranges strand |   exon_id   exon_name exon_rank
##          <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
##   [1]     chrI   538-792      + |         2        <NA>         1
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
## 
## $`3`
## GRanges object with 1 range and 3 metadata columns:
##       seqnames    ranges strand |   exon_id   exon_name exon_rank
##          <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
##   [1]     chrI 2480-2707      + |         3        <NA>         1
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
## 
## ...
## <6689 more elements>
> exonsBy(sc_txdb, by = "gene")
## GRangesList object of length 6534:
## $Q0010
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames    ranges strand |   exon_id   exon_name
##          <Rle> <IRanges>  <Rle> | <integer> <character>
##   [1]     chrM 3952-4338      + |      6992        <NA>
##   [2]     chrM 4254-4415      + |      6993        <NA>
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
## 
## $Q0032
## GRanges object with 1 range and 2 metadata columns:
##       seqnames      ranges strand |   exon_id   exon_name
##          <Rle>   <IRanges>  <Rle> | <integer> <character>
##   [1]     chrM 11667-11957      + |      6994        <NA>
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
## 
## $Q0055
## GRanges object with 13 ranges and 2 metadata columns:
##        seqnames      ranges strand |   exon_id   exon_name
##           <Rle>   <IRanges>  <Rle> | <integer> <character>
##    [1]     chrM 13818-13986      + |      6995        <NA>
##    [2]     chrM 13818-16322      + |      6996        <NA>
##    [3]     chrM 16435-16470      + |      6997        <NA>
##    [4]     chrM 16435-18830      + |      6998        <NA>
##    [5]     chrM 18954-18991      + |      6999        <NA>
##    ...      ...         ...    ... .       ...         ...
##    [9]     chrM 21995-22246      + |      7003        <NA>
##   [10]     chrM 21995-23167      + |      7004        <NA>
##   [11]     chrM 23612-23746      + |      7005        <NA>
##   [12]     chrM 25318-25342      + |      7008        <NA>
##   [13]     chrM 26229-26701      + |      7009        <NA>
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
## 
## ...
## <6531 more elements>
> 
> ime4 <- transcriptsBy(sc_txdb, by = "gene")[["YGL192W"]]
> ime4
## GRanges object with 1 range and 2 metadata columns:
##       seqnames        ranges strand |     tx_id     tx_name
##          <Rle>     <IRanges>  <Rle> | <integer> <character>
##   [1]   chrVII 142246-144048      + |      2136     YGL192W
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
> getSeq(sc_genome, ime4)
## DNAStringSet object of length 1:
##     width seq
## [1]  1803 ATGATTAACGATAAACTAGTACATTTTCTGATCC...CAAACACTAAATAACCTATATTTTGCTCAGTAA
> 
> intronsByTranscript(sc_txdb)
## GRangesList object of length 6692:
## $`1`
## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
## 
## $`2`
## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
## 
## $`3`
## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
## 
## ...
## <6689 more elements>
> intronsByTranscript(sc_txdb, use.names = TRUE)["YAL067W-A"]
## GRangesList object of length 1:
## $`YAL067W-A`
## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
> intronsByTranscript(sc_txdb, use.names = TRUE)[2035]
## GRangesList object of length 1:
## $YFL039C
## GRanges object with 1 range and 0 metadata columns:
##       seqnames      ranges strand
##          <Rle>   <IRanges>  <Rle>
##   [1]    chrVI 54378-54686      -
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
> intronsByTranscript(sc_txdb, use.names = TRUE)["YFL039C"]
## GRangesList object of length 1:
## $YFL039C
## GRanges object with 1 range and 0 metadata columns:
##       seqnames      ranges strand
##          <Rle>   <IRanges>  <Rle>
##   [1]    chrVI 54378-54686      -
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
> 
> transcriptsBy(sc_txdb)["YFL039C"]
## GRangesList object of length 1:
## $YFL039C
## GRanges object with 1 range and 2 metadata columns:
##       seqnames      ranges strand |     tx_id     tx_name
##          <Rle>   <IRanges>  <Rle> | <integer> <character>
##   [1]    chrVI 53260-54696      - |      2035     YFL039C
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
> 
> # intronsByTranscript(sc_txdb)["YFL039C"]
> 
> fiveUTRsByTranscript(sc_txdb)
## GRangesList object of length 0:
## <0 elements>
> 
> # Transcripts have id numbers
> cdsBy(sc_txdb, "tx")
## GRangesList object of length 6692:
## $`1`
## GRanges object with 1 range and 3 metadata columns:
##       seqnames    ranges strand |    cds_id    cds_name exon_rank
##          <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
##   [1]     chrI   335-649      + |         1        <NA>         1
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
## 
## $`2`
## GRanges object with 1 range and 3 metadata columns:
##       seqnames    ranges strand |    cds_id    cds_name exon_rank
##          <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
##   [1]     chrI   538-792      + |         2        <NA>         1
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
## 
## $`3`
## GRanges object with 1 range and 3 metadata columns:
##       seqnames    ranges strand |    cds_id    cds_name exon_rank
##          <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
##   [1]     chrI 2480-2707      + |         3        <NA>         1
##   -------
##   seqinfo: 17 sequences (1 circular) from sacCer3 genome
## 
## ...
## <6689 more elements>
> names(cdsBy(sc_txdb, "tx")) |> head()
## [1] "1" "2" "3" "4" "5" "6"
> # use use.names = TRUE to get a more useful identifier:
> cdsBy(sc_txdb, "tx", use.names = TRUE) |> names() |> grep(pattern = "YFL039C")
## [1] 2035
> 
> # ?id2name: 
> # "this internal id was generated at the time the TxDb object was created and has 
> # no meaning outside the scope of this object
> id2name(sc_txdb, feature.type = "tx")[2035]
##      2035 
## "YFL039C"
> 
> # some things were apparently not recorded for the yeast genome TxDb
> methods(class = "TxDb")
##  [1] $                      $<-                    annotatedDataFrameFrom
##  [4] as.list                asBED                  asGFF                 
##  [7] assayData              assayData<-            cds                   
## [10] cdsBy                  cdsByOverlaps          coerce                
## [13] columns                combine                contents              
## [16] dbconn                 dbfile                 dbInfo                
## [19] dbmeta                 dbschema               distance              
## [22] exons                  exonsBy                exonsByOverlaps       
## [25] ExpressionSet          extractUpstreamSeqs    featureNames          
## [28] featureNames<-         fiveUTRsByTranscript   genes                 
## [31] initialize             intronsByTranscript    isActiveSeq           
## [34] isActiveSeq<-          isNA                   keys                  
## [37] keytypes               mapIds                 mapIdsToRanges        
## [40] mappedkeys             mapRangesToIds         mapToTranscripts      
## [43] metadata               microRNAs              nhit                  
## [46] organism               promoters              revmap                
## [49] sample                 sampleNames            sampleNames<-         
## [52] saveDb                 select                 seqinfo               
## [55] seqinfo<-              seqlevels<-            seqlevels0            
## [58] show                   species                storageMode           
## [61] storageMode<-          taxonomyId             terminators           
## [64] threeUTRsByTranscript  transcripts            transcriptsBy         
## [67] transcriptsByOverlaps  tRNAs                  updateObject          
## see '?methods' for accessing help and source code

Exercise: Performing DNA Sequence Motif Search with Bioconductor

It is now time to practice what we have learned so far! In the following exercise we will extract the downstream sequence of a gene and perform a motif search on it. Not all functions that you need have been covered yet, you will also need to consult the documentation: this is by design, a lot of the work when analysing data with Bioconductor is spent finding out how to do things - even for experienced programmers and bioinformaticians!

The setting: You have performed a genetic screen in Saccharomyces cerevisiae and identified several genes that are suppressed by antisense transcription. These antisense RNAs (asRNAs) originate in the 3’-intergenic region (3’-IGR) of the genes. To understand how these asRNAs are regulated you decide in a first step to check if the genes contain a TATA-binding protein (TBP) sequence motif (“TATA box”) in antisense direction relative to the coding sequence of the respective genes.

Your task: use Bioconductor packages and functions to find out if there is a TBP-binding motif in antisense direction in the 3’-IGR of the IME4 gene. Limit the search to the first 300 base pairs downstream of the gene. This will require a number of steps:

  1. Mapping annotations:
  • As “IME4” is not a systematic gene name you first need to find out what is the systematic name of this gene. One option is to use the org.Sc.sgd.db package. Can you find out how to use it to map to the systematic name? After loading the package you can type its name in the console: read the output carefully as it gives you a hint on how to use it. Hint: You find trivial names in the “GENENAME” column and the systematic name in the “ENSEMBL” column. We will explore the usage of annotation packages a bit more at the end of the course.
Solution
> "YGL192W"
  1. Extracting the sequence:
  • Load the TxDb.Scerevisiae.UCSC.sacCer3.sgdGene package. For easier typing, assign it to a variable: txdb <- TxDb.Scerevisiae.UCSC.sacCer3.sgdGene. Then check the output of transcriptsBy(txdb). Can you guess how you could subset it to extract the genomic coordinates of the IME4 coding region? Afterwards, save it in a variable ime4.
  • Create a new variable ime4_downstream that contains the first 300 bp downstream of IME4.
  • Load the Biostrings and the BSgenome.Scerevisiae.UCSC.sacCer3 packages. The latter will load the yeast genome into memory, to access it you can save yourself some typing later by running yeast_genome <- BSgenome.Scerevisiae.UCSC.sacCer3.
  • Then retrieve the DNA sequence of ime4_downstream from the yeast genome and convert it to its reverse complement. unlist() the result and store it in the variable ime4_downstream_rc.
  • Tip to get help: Consult the vignettes for the GenomicRanges package (browseVignettes("GenomicRanges")) for this exercise. Specifically, look at the introduction chapter and the HowTo chapter (section 2.13).
  1. Performing the motif search:
  • Extract the position weight matrix for SPT15 from S. cerevisiae from the “jaspar2022” database using the MotifDb package. You will need the function query() with the right values for the andStrings. Consult the help file and also the vignette and package description for help! At the end, you should get a MotifDb object of length 1. To then extract the position weight matrix (pwm) from this object simply extract it with list subsetting: [[1]].
  • Use the matchPWM() function with the DNAString and the PWM objects to find SPT15-binding sequences in ime4_downstream_rc. Try different values for the min.score argument.
  • Bonus task 1: extract the exact sequence(s) that matched.
  • Bonus task 2: visualize the position weight matrix of SPT15 using the seqLogo() function in the seqLogo package.

Note: After loading packages such as TxDb.Scerevisiae.UCSC.sacCer3.sgdGene a variable of the same name is created. You can copy this value to a new variable with a shorter name to save yourself some typing: txdb <- TxDb.Scerevisiae.UCSC.sacCer3.sgdGene.

Part II: Differential Gene Expression Analysis: DESeq2

The analysis of sequencing data is one of the more common types of analyses in molecular biology. In this course, we will cover the basics of differential gene expression analysis based on (bulk) RNA sequencing. At its core analyzing RNA sequencing data is about how to correctly analyze count data, which is relatively common in modern molecular biology techniques (e.g. all sequencing data but also mass spectrometry).

The theory behind RNA-seq analysis is complex and has many facets. Therefore, covering all the theoretical and statistical aspects is out of scope. Nevertheless, we will try to cover the most important concepts behind an RNA-sequencing analysis and how to run it using Bioconductor packages. This is possible because a lot of the details have been abstracted away so you can perform such analyses with just a few commands. We will use the DESeq2 package, one of the standard packages for this type of analyses. It was authored by Michael Love, Simon Anders, and Wolfgang Huber.

If you are interested in a more in-depth understanding of the topic I recommend the following resources:

  • The DESeq2 vignette (vignette(DESeq2)) and the “RNA-seq workflow” it refers to (link).
  • The DESeq2 paper: Love, M.I., Huber, W., Anders, S. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15:550. 10.1186/s13059-014-0550-8
  • The chapters on count data from the book “Modern Statistics for Modern Biology” by Susan Holmes and Wolfang Huber. It is available for free online.

Note: parts of the following case study have been adapted from the following repository of the Software Carpentry (CC BY 4.0 license) as well as the “Modern Statistics for Modern Biology” (Holmes and Huber) book just mentioned (CC BY-NC-SA license).

The Setting: Impact of Influenza Infection on a Mouse Model of Multiple Sclerosis

We will discuss and explore the most important concepts behind RNA-seq data analysis using a dataset from a study where mice prone to autoimmune disease where infected with influenza virus to study if this can trigger the disease or exacerbate disease progression. Our RNA sequencing analysis will assume that quality control and read mapping have already been done. Our starting point is therefore an unnormalized count matrix. The dataset is taken from the following paper:

Blackmore, Stephen, et al. Influenza infection triggers disease in a genetic model of experimental autoimmune encephalomyelitis. PNAS 114.30 (2017):E6107-E6116. https://doi.org/10.1073/pnas.1620415114

For this course, I have already prepared the data you will need in the ./data directory.

First, let us load some packages that we will need in the subsequent analysis:

> library(DESeq2)
> library(SummarizedExperiment)
> library(ggplot2)
> library(tibble)

Importing the Data

The original experimental design in Blackmore et al. 2017 was fairly complex: 8 week old male and female C57BL/6 mice were collected at Day 0 (before influenza infection), Day 4 and Day 8 after influenza infection. From each mouse, cerebellum and spinal cord tissues were taken for RNA-seq. There were originally 4 mice x 2 sexes x 3 time points x 2 tissue types groups (48 combinations), but a few were lost along the way resulting in a total of 45 samples. For this workshop, we are going to simplify the analysis by only using the 22 cerebellum samples. In addition to the counts per gene per sample, we also need information on which sample belongs to which Sex/Time point/Replicate. As for the genes, it is helpful to have extra information from an annotation dataset.

> # import the count matrix
> counts <- read.csv("./data/GSE96870_counts_cerebellum.csv", row.names = 1)
> dim(counts)
## [1] 41786    22
> counts[1:5, 1:5]
##              GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340
## Xkr4               1891       2410       2159       1980       1977
## LOC105243853          0          0          1          4          0
## LOC105242387        204        121        110        120        172
## LOC105242467         12          5          5          5          2
## Rp1                   2          2          0          3          2
> 
> # import the sample annotations
> coldata <- read.csv("./data/GSE96870_coldata_cerebellum.csv")
> dim(coldata)
## [1] 22 11
> head(coldata)
##       sample           title geo_accession     organism     age    sex
## 1 GSM2545336 CNS_RNA-seq_10C    GSM2545336 Mus musculus 8 weeks Female
## 2 GSM2545337 CNS_RNA-seq_11C    GSM2545337 Mus musculus 8 weeks Female
## 3 GSM2545338 CNS_RNA-seq_12C    GSM2545338 Mus musculus 8 weeks Female
## 4 GSM2545339 CNS_RNA-seq_13C    GSM2545339 Mus musculus 8 weeks Female
## 5 GSM2545340 CNS_RNA-seq_14C    GSM2545340 Mus musculus 8 weeks   Male
## 6 GSM2545341 CNS_RNA-seq_17C    GSM2545341 Mus musculus 8 weeks   Male
##     infection  strain time     tissue mouse
## 1  InfluenzaA C57BL/6 Day8 Cerebellum    14
## 2 NonInfected C57BL/6 Day0 Cerebellum     9
## 3 NonInfected C57BL/6 Day0 Cerebellum    10
## 4  InfluenzaA C57BL/6 Day4 Cerebellum    15
## 5  InfluenzaA C57BL/6 Day4 Cerebellum    18
## 6  InfluenzaA C57BL/6 Day8 Cerebellum     6
> 
> # import gene annotations
> rowranges <- read.delim("./data/GSE96870_rowranges.tsv", 
+                         sep = "\t", 
+                         colClasses = c(ENTREZID = "character"),
+                         header = TRUE, 
+                         quote = "", 
+                         row.names = 5)
> dim(rowranges)
## [1] 41786     7
> head(rowranges)
##              seqnames   start     end strand  ENTREZID
## Xkr4                1 3670552 3671742      -    497097
## LOC105243853        1 3357323 3366505      + 105243853
## LOC105242387        1 3658847 3670456      - 105242387
## LOC105242467        1 4233436 4233728      - 105242467
## Rp1                 1 4409170 4409241      -     19888
## Sox17               1 4496291 4497354      -     20671
##                                                                                  product
## Xkr4         X Kell blood group precursor related family member 4, transcript variant X1
## LOC105243853                         uncharacterized LOC105243853, transcript variant X2
## LOC105242387                                                uncharacterized LOC105242387
## LOC105242467    lipoxygenase homology domain-containing protein 1, transcript variant X2
## Rp1                                 retinitis pigmentosa 1 (human), transcript variant 2
## Sox17                        SRY (sex determining region Y)-box 17, transcript variant 5
##              gbkey
## Xkr4          mRNA
## LOC105243853 ncRNA
## LOC105242387 ncRNA
## LOC105242467  mRNA
## Rp1           mRNA
## Sox17         mRNA
> 
> # coerce to GRanges object, needed later
> rowranges <- as(rowranges, "GRanges")

The SummarizedExperiment Class

When looking at the counts, coldata, and rowranges variables one can see that they serve different purposes in terms of recording measurement values or metadata. In the course of an analysis it is often desirable to keep data and metadata in sync. To do so, there is a special S4 class widely used in Bioconductor: the “SummarizedExperiment”.

The SummarizedExperiment consists of the following components:

  • The assays: A matrix with genes in the rows and samples in the columns. “Sample” refers to an experimental sample (typically one sequencing library e.g. treated mouse with a certain genotype at a certain time point). The values in the matrix are the unnormalized transcript count values.
  • The rowData or rowRanges: A table containing additional metadata about the genes in the assays table.
  • The colData: A table containing information about the samples. This is important for building a statistical model about the count data.

The structure of SummarizedExperiment is shown in the following scheme:

Exercise: Creating a SummarizedExperiment Instance

In this exercise we will familiarize ourselves a bit with the SummarizedExperiment class. It forms the starting point for the analysis of RNA sequencing data with DESeq2 and it is also a good example of how Bioconductor provides classes that provide a meaningful abstraction for biological datasets.

  • First, consult the vignette of the SummarizedExperiment package (vignette("SummarizedExperiment")) and the help file ?SummarizedExperiment to get familiar with the SummarizedExperiment class.
  • Using the data from the previous chunk (counts, coldata, rowranges) as input: create an instance of the SummarizedExperiment class and save it in the variable se. Note: you will need to convert counts to a matrix first and then put it into a list (list(as.matrix(counts))).
  • Inspect the output of se and summary(se) in the console.
  • Use the functions assays(), colData(), rowData() and rowRanges() on your se variable to inspect the contents of your SummarizedExperiment instance. What do these functions show you?
  • There are different options to subset a SummarizedExperiment object. Run the following code snippets:
    1. se[, se$time == "Day0"]
    2. se[seqnames(rowRanges(se)) == "1", ] What do they do? Check the output. What happens with the assay data and other metadata after you do the subsetting?
  • Save the SummarizedExperiment instance you created with the saveRDS() function and load it back into R with readRDS().

RNA-seq Analysis with DESeq2: An Overview

The final result of an RNA sequencing analysis is to make inferences about the effect of changing experimental factors on the average gene expression levels for every gene independently. The effect size is measured in log2-fold changes (l2fc for short). DESeq2 uses statistical models to test for the significance of the log2-fold changes and allows you to contrast the different experimental factors. Briefly, what DESeq2 does is the following:

  • It normalizes the read counts with a correction factor estimated from library sizes.
  • It estimates gene-wise dispersion (variability) estimates using a combination of replicate information and statistical modelling.
  • These dispersion estimates make it possible to assign a p-value to the log2-fold changes observed between experimental conditions. The default option is to calculate them such that the false-discovery-rate is 10% but this can be adjusted.
Questions
  • Why are log2-fold changes used rather than ratios?
  • What is the false discovery rate? What is the family-wise error rate?

RNA-seq Analysis with DESeq2 Step by Step

Step 1: Preparing the Dataset

First, we create a SummarizedExperiment instance using the data imported above in order to continue with our RNA-seq analysis:

Code
> se <- SummarizedExperiment(
+   # needed for subsequent conversion to DESeqDataSet
+   assays = list(counts = as.matrix(counts)), 
+   colData = coldata, 
+   rowRanges = rowranges
+ )

In fact, while we can use a SummarizedExperiment instance to run the RNA-seq analysis we will use the specialized DESeqDataSet subclass because it makes some aspects of the analysis easier.

Note the design = ~ time argument in the following code chunk. Briefly, it means that DESeq2 is going to build a model that will calculate the log2-fold changes along with their p-values to compare the expression levels between time points.

> ms_dataset <- DESeqDataSet(se, design = ~ time)
> ms_dataset
## class: DESeqDataSet 
## dim: 41786 22 
## metadata(1): version
## assays(1): counts
## rownames(41786): Xkr4 LOC105243853 ... TrnT TrnP
## rowData names(3): ENTREZID product gbkey
## colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
## colData names(11): sample title ... tissue mouse

Step 2: Normalizing the Read Counts

The first factor to take into account are differences in library sizes. Correcting for these differences is called normalization, we distinguish unnormalized (raw) read counts from normalized read counts.

Note

In the analysis of high-throughput experiments the term “normalization” is used for different things and people often do not specify what they mean. Do not be shy to ask.

Rather than simply dividing the number of reads by the sum, DESeq2 uses something different.

Consider a dataset with 5 genes and two samples as displayed in Figure 8.1. If we estimate for each of the two samples by its sum of counts, then the slope of the blue line represents their ratio. According to this, gene C is down-regulated in sample 2 compared to sample 1, while the other genes are all somewhat up-regulated. If we now instead estimate such that their ratios correspond to the red line, then we will still conclude that gene C is down-regulated, while the other genes are unchanged. The second version is more parsimonious and is often preferred by scientists. The slope of the red line can be obtained by robust regression. This is what the DESeq2 method does.

Size factor estimation. The points correspond to hypothetical genes whose counts in two samples are indicated by their x- and y-coordinates. The lines indicate two different ways of size factor estimation explained in the text. (image taken from: Holmes and Huber: Modern Statistics for Modern Biology)

Size factor estimation. The points correspond to hypothetical genes whose counts in two samples are indicated by their x- and y-coordinates. The lines indicate two different ways of size factor estimation explained in the text. (image taken from: Holmes and Huber: Modern Statistics for Modern Biology)
Exercise

DESeq2 uses the function estimateSizeFactorsForMatrix() to get read count normalization factors, which takes the matrix of raw read counts as an argument. How does this compare to obtaining a normalization factor from the sum of the reads? Try plotting the size factors of both methods against each other.

Solution
> counts_raw <- assay(ms_dataset)
> size_factor <- estimateSizeFactorsForMatrix(counts_raw)
> 
> comparison <- tibble(
+   size_factor = size_factor, 
+   sum = colSums(counts_raw), 
+   size_factor_from_sum = colSums(counts_raw) / mean(colSums(counts_raw))
+ )
> 
> comparison
> 
> ggplot(comparison, aes(x = size_factor, y = size_factor_from_sum)) + 
+   geom_point() + 
+   geom_abline(slope = 1) + 
+   coord_fixed()
> 
> norm_mat <- matrix(size_factor, byrow = TRUE, 
+                    ncol = ncol(counts_raw), nrow = nrow(counts_raw))
> counts_normalized <- counts_raw / norm_mat

Luckily, DESeq2 does a lot of the heavy lifting for us: when running an RNA-seq analysis you can simply run the following command to normalize the counts:

> ms_dataset <- estimateSizeFactors(ms_dataset)

Step 3: Estimating dispersions

Next, DESeq2 estimates the variability for every gene based on the count data. A characteristic of RNA-seq datasets is that in the low expression range, the variance is proportional to the mean. For those genes, it would therefore be possible to model the variability based on a Poisson distribution, which is a distribution where the variance is equal to the mean. However, in the higher expression ranges, the variance is actually larger than the mean. Therefore, a negative binomial distribution (also known as a gamma-Poisson distribution) is applied to the data to account for this extra variability. Intuitively, what happens is that the more highly expressed genes vary more than what the Poisson distribution predicts. This increased variability is due to unaccounted sources of variance during sample preparation and the fact that these genes are subject to more complex regulatory patterns. Apart from taking the information from replicates into account, DESeq2 also uses the information from other genes in the dataset to get a model of how variance depends on the mean expression.

Note

DESeq2 makes the assumption that most genes are not differentially expressed. In fact, many statistical procedures for high-throughput experiments make this assumption. Otherwise, it becomes difficult to estimate what the distribution under the null hypothesis looks like.

Exercise

Using counts_normalized, calculate for every gene the mean and the variance and plot the log of the variance vs. the plot of the mean (hint: you may find the functions rowMeans() and rowVars() useful). Add two lines on top of the plot, one with slope = 1 and another one with slope = 2. What can you see?

Solution
> day0_cols <- grep("Day0", colnames(counts_normalized))
> counts_normalized_day0 <- counts_normalized[, day0_cols]
> 
> var_mean_tab <- tibble(
+   means = rowMeans(counts_normalized), 
+   vars = rowVars(counts_normalized)
+ )
> 
> ggplot(var_mean_tab, aes(x = log(means), y = log(vars))) + 
+   geom_hex() + 
+   coord_fixed(xlim = c(-5, 20), ylim = c(-5, 20)) + 
+   geom_abline(slope = 1:2, color = (c("forestgreen", "red"))) + 
+   NULL

Again, the previous code chunk was more to understand what is going on, a quicker way to achieve this is to simply run

> ms_dataset <- estimateDispersions(ms_dataset)

Step 4: Testing for differential expression and retrieving results

This last step is done by using a version of the t-test called Wald test for the negative binomial distribution.

> ms_dataset <- nbinomWaldTest(ms_dataset)

At the end, the results of the analysis can be retrieved with the results() function.

> res <- results(ms_dataset)
> res
## log2 fold change (MLE): time Day8 vs Day0 
## Wald test p-value: time Day8 vs Day0 
## DataFrame with 41786 rows and 6 columns
##                 baseMean log2FoldChange     lfcSE      stat      pvalue
##                <numeric>      <numeric> <numeric> <numeric>   <numeric>
## Xkr4         1937.755565      -0.239387  0.107807 -2.220508 0.026384308
## LOC105243853    0.962588       0.518529  1.184125  0.437900 0.661458667
## LOC105242387  169.913400       0.455431  0.138380  3.291167 0.000997725
## LOC105242467    3.435073       0.288255  0.572181  0.503783 0.614414050
## Rp1             2.273456      -0.282147  0.668853 -0.421838 0.673143629
## ...                  ...            ...       ...       ...         ...
## TrnS2            1.76325       0.773975  0.697694  1.109333 2.67287e-01
## TrnL2          169.37651       0.499418  0.101147  4.937560 7.91063e-07
## TrnE           335.92910       0.194226  0.211981  0.916244 3.59539e-01
## TrnT           379.46721       0.418641  0.107507  3.894087 9.85692e-05
## TrnP           277.35435       0.111077  0.240277  0.462285 6.43877e-01
##                     padj
##                <numeric>
## Xkr4          0.07721803
## LOC105243853  0.79602142
## LOC105242387  0.00494211
## LOC105242467  0.76160580
## Rp1           0.80425322
## ...                  ...
## TrnS2        4.44850e-01
## TrnL2        9.69265e-06
## TrnE         5.43421e-01
## TrnT         6.82194e-04
## TrnP         7.83182e-01

Putting it all together: the DESeq() function

In fact, all the previous steps can be run at once using the function DESeq(), which is simply a wrapper that applies estimateSizeFactors(), estimateDispersions(), and nbinomWaldTest() one after the other. To demonstrate this, let’s construct our DESeqDataSet again and apply the DESeq() function to it:

> ms_dataset <- DESeqDataSet(se, design = ~ time)
> ms_dataset
## class: DESeqDataSet 
## dim: 41786 22 
## metadata(1): version
## assays(1): counts
## rownames(41786): Xkr4 LOC105243853 ... TrnT TrnP
## rowData names(3): ENTREZID product gbkey
## colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
## colData names(11): sample title ... tissue mouse
> 
> ms_dataset <- DESeq(ms_dataset)
> res2 <- results(ms_dataset)
> res2
## log2 fold change (MLE): time Day8 vs Day0 
## Wald test p-value: time Day8 vs Day0 
## DataFrame with 41786 rows and 6 columns
##                 baseMean log2FoldChange     lfcSE      stat      pvalue
##                <numeric>      <numeric> <numeric> <numeric>   <numeric>
## Xkr4         1937.755565      -0.239387  0.107807 -2.220508 0.026384308
## LOC105243853    0.962588       0.518529  1.184125  0.437900 0.661458667
## LOC105242387  169.913400       0.455431  0.138380  3.291167 0.000997725
## LOC105242467    3.435073       0.288255  0.572181  0.503783 0.614414050
## Rp1             2.273456      -0.282147  0.668853 -0.421838 0.673143629
## ...                  ...            ...       ...       ...         ...
## TrnS2            1.76325       0.773975  0.697694  1.109333 2.67287e-01
## TrnL2          169.37651       0.499418  0.101147  4.937560 7.91063e-07
## TrnE           335.92910       0.194226  0.211981  0.916244 3.59539e-01
## TrnT           379.46721       0.418641  0.107507  3.894087 9.85692e-05
## TrnP           277.35435       0.111077  0.240277  0.462285 6.43877e-01
##                     padj
##                <numeric>
## Xkr4          0.07724757
## LOC105243853  0.79612012
## LOC105242387  0.00494419
## LOC105242467  0.76168873
## Rp1           0.80435586
## ...                  ...
## TrnS2        4.44882e-01
## TrnL2        9.69873e-06
## TrnE         5.43487e-01
## TrnT         6.82622e-04
## TrnP         7.83275e-01

Exploring DESeq2 results

P-value histogram:

> res_df <- as.data.frame(res)
> head(res_df)
##                  baseMean log2FoldChange      lfcSE       stat       pvalue
## Xkr4         1937.7555653     -0.2393865 0.10780710 -2.2205079 0.0263843077
## LOC105243853    0.9625883      0.5185287 1.18412550  0.4379002 0.6614586666
## LOC105242387  169.9134003      0.4554314 0.13837988  3.2911674 0.0009977252
## LOC105242467    3.4350732      0.2882549 0.57218095  0.5037828 0.6144140499
## Rp1             2.2734558     -0.2821471 0.66885267 -0.4218375 0.6731436288
## Sox17         251.0059632     -0.3178237 0.09472532 -3.3552132 0.0007930385
##                     padj
## Xkr4         0.077218031
## LOC105243853 0.796021417
## LOC105242387 0.004942109
## LOC105242467 0.761605800
## Rp1          0.804253221
## Sox17        0.004081573
> 
> ggplot(res_df, aes(x = pvalue)) + 
+   geom_histogram(binwidth = 0.01, boundary = 0)

> 
> ggplot(res_df, aes(x = padj)) + 
+   geom_histogram(binwidth = 0.01, boundary = 0) + 
+   scale_y_continuous(breaks = c(0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000))

MA plot: mean log2-fold change vs. average expression, where the average is across both of the conditions compared.

> plotMA(res)

> 
> res_df$is_signif <- res_df$padj < 0.05
> 
> # using ggplot2:
> ggplot(res_df, aes(x = log10(baseMean), y = log2FoldChange, colour = is_signif)) + 
+   geom_point(shape = 1) + 
+   scale_colour_manual(values = c("grey", "steelblue", "grey")) + 
+   theme_bw()

> 
> # resons for NA values:
> # - low counts (i.e. exclusion by quality control)
> # - perfect separation: zero count in one of the two conditions
> # - outliers (based on Cook's distance)

Two- and multi-factor analysis of the RNA-seq data

Remember our definition of the DESeq2 dataset above:

> DESeqDataSet(se, design = ~ time)

And now closely inspect the output of our results object. Which factor levels are compared in the results table?

> res
## log2 fold change (MLE): time Day8 vs Day0 
## Wald test p-value: time Day8 vs Day0 
## DataFrame with 41786 rows and 6 columns
##                 baseMean log2FoldChange     lfcSE      stat      pvalue
##                <numeric>      <numeric> <numeric> <numeric>   <numeric>
## Xkr4         1937.755565      -0.239387  0.107807 -2.220508 0.026384308
## LOC105243853    0.962588       0.518529  1.184125  0.437900 0.661458667
## LOC105242387  169.913400       0.455431  0.138380  3.291167 0.000997725
## LOC105242467    3.435073       0.288255  0.572181  0.503783 0.614414050
## Rp1             2.273456      -0.282147  0.668853 -0.421838 0.673143629
## ...                  ...            ...       ...       ...         ...
## TrnS2            1.76325       0.773975  0.697694  1.109333 2.67287e-01
## TrnL2          169.37651       0.499418  0.101147  4.937560 7.91063e-07
## TrnE           335.92910       0.194226  0.211981  0.916244 3.59539e-01
## TrnT           379.46721       0.418641  0.107507  3.894087 9.85692e-05
## TrnP           277.35435       0.111077  0.240277  0.462285 6.43877e-01
##                     padj
##                <numeric>
## Xkr4          0.07721803
## LOC105243853  0.79602142
## LOC105242387  0.00494211
## LOC105242467  0.76160580
## Rp1           0.80425322
## ...                  ...
## TrnS2        4.44850e-01
## TrnL2        9.69265e-06
## TrnE         5.43421e-01
## TrnT         6.82194e-04
## TrnP         7.83182e-01

What happened in the background when you ran DESeq() was that DESeq modelled gene expression levels according to the following formula:

y = \beta_0 + x_1\beta_1 + \epsilon.

This can be interpreted as follows: y is the experimental measurement of interest for a particular gene. \beta_0 is the baseline expression level. x_1 is an indicator variable that becomes 1 if the “time” variable was “Day8” or not. \beta_1 is then the average increase in expression from Day0 to Day8. Such a model is fitted for every gene independently. Of course, the expression level cannot be described to 100% by this linear model, which is why an error term, \epsilon is also added.

You can use the contrast argument to contrast other levels, for example:

> results(ms_dataset, contrast = c("time", "Day4", "Day0"))
## log2 fold change (MLE): time Day4 vs Day0 
## Wald test p-value: time Day4 vs Day0 
## DataFrame with 41786 rows and 6 columns
##                 baseMean log2FoldChange     lfcSE       stat    pvalue
##                <numeric>      <numeric> <numeric>  <numeric> <numeric>
## Xkr4         1937.755565      -0.127984  0.104312  -1.226929 0.2198491
## LOC105243853    0.962588       0.703107  1.134552   0.619722 0.5354407
## LOC105242387  169.913400       0.240995  0.134806   1.787717 0.0738217
## LOC105242467    3.435073      -0.193479  0.575847  -0.335990 0.7368785
## Rp1             2.273456       0.223697  0.612173   0.365415 0.7148015
## ...                  ...            ...       ...        ...       ...
## TrnS2            1.76325      0.0454198 0.7271729  0.0624607 0.9501959
## TrnL2          169.37651      0.2254050 0.0993673  2.2684025 0.0233047
## TrnE           335.92910      0.0689015 0.2053937  0.3354605 0.7372778
## TrnT           379.46721      0.1853258 0.1046038  1.7716930 0.0764455
## TrnP           277.35435     -0.0962761 0.2329107 -0.4133607 0.6793424
##                   padj
##              <numeric>
## Xkr4          0.718680
## LOC105243853        NA
## LOC105242387  0.476948
## LOC105242467        NA
## Rp1                 NA
## ...                ...
## TrnS2               NA
## TrnL2         0.286259
## TrnE          0.956626
## TrnT          0.481675
## TrnP          0.941030

Notice that above we did not define “sex” as an experimental factor, so the following will fail:

> results(ms_dataset, contrast = c("sex", "Female", "Male"))
## Error in cleanContrast(object, contrast, expanded = isExpanded, listValues = listValues, : sex is not a factor, see ?results

This way of specifying the experimental design (the design argument of DESeq2) is problematic because it means that all the other experimental covariates that you know about will be “modelled” only by the error term \epsilon. So if for a particular gene male mice have a higher expression level than female ones or if its regulation is sex-specific this difference will be modelled as random variation. In the best case, this reduces the power of your statistical test, in the worst case your results will suffer from confounding and differences may be wrongly attributed to the change of an experimental factor when in fact they are the result of a confounding variable.

To make this more specific, let’s have a look at how the log2-fold change is actually calculated when using design = ~ time:

> ms_dataset_df <- as_tibble(assay(ms_dataset), rownames = "gene")
> ms_dataset_df
## # A tibble: 41,786 × 23
##    gene        GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341
##    <chr>            <int>      <int>      <int>      <int>      <int>      <int>
##  1 Xkr4              1891       2410       2159       1980       1977       1945
##  2 LOC1052438…          0          0          1          4          0          0
##  3 LOC1052423…        204        121        110        120        172        173
##  4 LOC1052424…         12          5          5          5          2          6
##  5 Rp1                  2          2          0          3          2          1
##  6 Sox17              251        239        218        220        261        232
##  7 Gm7357               3          0          1          2          1          2
##  8 LOC1052438…          0          3          5          0          2          0
##  9 LOC1052438…          0          0          0          0          0          0
## 10 Gm7369               0          1          0          1          0          0
## # ℹ 41,776 more rows
## # ℹ 16 more variables: GSM2545342 <int>, GSM2545343 <int>, GSM2545344 <int>,
## #   GSM2545345 <int>, GSM2545346 <int>, GSM2545347 <int>, GSM2545348 <int>,
## #   GSM2545349 <int>, GSM2545350 <int>, GSM2545351 <int>, GSM2545352 <int>,
## #   GSM2545353 <int>, GSM2545354 <int>, GSM2545362 <int>, GSM2545363 <int>,
## #   GSM2545380 <int>
> 
> colData(ms_dataset)
## DataFrame with 22 rows and 13 columns
##                 sample           title geo_accession     organism         age
##            <character>     <character>   <character>  <character> <character>
## GSM2545336  GSM2545336 CNS_RNA-seq_10C    GSM2545336 Mus musculus     8 weeks
## GSM2545337  GSM2545337 CNS_RNA-seq_11C    GSM2545337 Mus musculus     8 weeks
## GSM2545338  GSM2545338 CNS_RNA-seq_12C    GSM2545338 Mus musculus     8 weeks
## GSM2545339  GSM2545339 CNS_RNA-seq_13C    GSM2545339 Mus musculus     8 weeks
## GSM2545340  GSM2545340 CNS_RNA-seq_14C    GSM2545340 Mus musculus     8 weeks
## ...                ...             ...           ...          ...         ...
## GSM2545353  GSM2545353  CNS_RNA-seq_3C    GSM2545353 Mus musculus     8 weeks
## GSM2545354  GSM2545354  CNS_RNA-seq_4C    GSM2545354 Mus musculus     8 weeks
## GSM2545362  GSM2545362  CNS_RNA-seq_5C    GSM2545362 Mus musculus     8 weeks
## GSM2545363  GSM2545363  CNS_RNA-seq_6C    GSM2545363 Mus musculus     8 weeks
## GSM2545380  GSM2545380  CNS_RNA-seq_9C    GSM2545380 Mus musculus     8 weeks
##                    sex   infection      strain     time      tissue     mouse
##            <character> <character> <character> <factor> <character> <integer>
## GSM2545336      Female  InfluenzaA     C57BL/6     Day8  Cerebellum        14
## GSM2545337      Female NonInfected     C57BL/6     Day0  Cerebellum         9
## GSM2545338      Female NonInfected     C57BL/6     Day0  Cerebellum        10
## GSM2545339      Female  InfluenzaA     C57BL/6     Day4  Cerebellum        15
## GSM2545340        Male  InfluenzaA     C57BL/6     Day4  Cerebellum        18
## ...                ...         ...         ...      ...         ...       ...
## GSM2545353      Female NonInfected     C57BL/6     Day0  Cerebellum         4
## GSM2545354        Male NonInfected     C57BL/6     Day0  Cerebellum         2
## GSM2545362      Female  InfluenzaA     C57BL/6     Day4  Cerebellum        20
## GSM2545363        Male  InfluenzaA     C57BL/6     Day4  Cerebellum        12
## GSM2545380      Female  InfluenzaA     C57BL/6     Day8  Cerebellum        19
##            sizeFactor replaceable
##             <numeric>   <logical>
## GSM2545336   1.134369        TRUE
## GSM2545337   0.909434        TRUE
## GSM2545338   0.856613        TRUE
## GSM2545339   0.914962        TRUE
## GSM2545340   0.929113        TRUE
## ...               ...         ...
## GSM2545353    1.10654        TRUE
## GSM2545354    0.97923        TRUE
## GSM2545362    1.08336        TRUE
## GSM2545363    1.00029        TRUE
## GSM2545380    1.13368        TRUE
> day8_cols <- colData(ms_dataset)$sample[colData(ms_dataset)$time == "Day8"]
> day0_cols <- colData(ms_dataset)$sample[colData(ms_dataset)$time == "Day0"]
> 
> my_gene <- "Xkr4"
> 
> # average count day8:
> ratio <- mean(counts_normalized[my_gene, day8_cols]) / mean(counts_normalized[my_gene, day0_cols])
> log2(ratio)
## [1] -0.2393478
> 
> # which happens to be the value you see in the results table:
> res
## log2 fold change (MLE): time Day8 vs Day0 
## Wald test p-value: time Day8 vs Day0 
## DataFrame with 41786 rows and 6 columns
##                 baseMean log2FoldChange     lfcSE      stat      pvalue
##                <numeric>      <numeric> <numeric> <numeric>   <numeric>
## Xkr4         1937.755565      -0.239387  0.107807 -2.220508 0.026384308
## LOC105243853    0.962588       0.518529  1.184125  0.437900 0.661458667
## LOC105242387  169.913400       0.455431  0.138380  3.291167 0.000997725
## LOC105242467    3.435073       0.288255  0.572181  0.503783 0.614414050
## Rp1             2.273456      -0.282147  0.668853 -0.421838 0.673143629
## ...                  ...            ...       ...       ...         ...
## TrnS2            1.76325       0.773975  0.697694  1.109333 2.67287e-01
## TrnL2          169.37651       0.499418  0.101147  4.937560 7.91063e-07
## TrnE           335.92910       0.194226  0.211981  0.916244 3.59539e-01
## TrnT           379.46721       0.418641  0.107507  3.894087 9.85692e-05
## TrnP           277.35435       0.111077  0.240277  0.462285 6.43877e-01
##                     padj
##                <numeric>
## Xkr4          0.07721803
## LOC105243853  0.79602142
## LOC105242387  0.00494211
## LOC105242467  0.76160580
## Rp1           0.80425322
## ...                  ...
## TrnS2        4.44850e-01
## TrnL2        9.69265e-06
## TrnE         5.43421e-01
## TrnT         6.82194e-04
## TrnP         7.83182e-01
Exercise

When you include the “sex” variable in addition to time into the linear model of DESeq2 the resulting model can be formally described as:

y = \beta_0 + x_1\beta_1 + x_2\beta_2 + \epsilon.

\beta_2 then indicates the effect of sex on the resulting gene expression value. Such an analysis is also called a two-factor analysis.

Your task: Redo the DESeq() analysis again but this time include the “sex” variable in addition to time into the linear model of DESeq2.

You will need to

  • Define a new DESeqDataSet object (save it in a variable called ms_dataset_2factor) where you will need to adjust the design argument using the following expression: ms_dataset_2factor = DESeqDataSet(se, design = ???). Check the documentation of DESeqDataSet() to find out how!
  • Rerun DESeq() on `ms_dataset_2factor.
Solution
> # colData(se)$mouse <- factor(colData(se)$mouse)
> ms_dataset_2factor <- DESeqDataSet(se, design = ~ sex + time)
> ms_dataset_2factor <- DESeq(ms_dataset_2factor)
> res_2factor <- results(ms_dataset_2factor)
> res_2factor
Exercise
  • Compare the p-values of the analysis that took sex into account vs. the one that did not.
  • Compare the output of plotMA() of the two analyses.
Solution
> comp_tab <- tibble(
+   pvals_onefactor = res$padj, 
+   pvals_twofactor = res_2factor$padj
+ )
> 
> 
> ggplot(comp_tab, aes(x = pvals_onefactor, y = pvals_twofactor)) +
+     geom_hex(bins = 75) + 
+     #geom_point() +
+     coord_fixed() +
+     xlab("Single factor analysis (condition)") +
+     ylab("Two factor analysis (type + condition)") +
+     geom_abline(col = "orange")
> 
> sum(res$padj < 0.1, na.rm = TRUE)
> sum(res_2factor$padj < 0.1, na.rm = TRUE)
> 
> mean(res$padj < 0.1, na.rm = TRUE)
> mean(res_2factor$padj < 0.1, na.rm = TRUE)
> 
> plotMA(res)
> plotMA(res_2factor)

Part III: Using Annotation Resources

Working with annotations may not be the most exciting part of bioinformatics but a very important one. Instead of spending hours clicking your way through websites you can solve these problems programmatically with Bioconductor! Note also: when you are on a website with annotation resources, check if there is a section on API access if you are interested in programmatic access.

Overview of Bioconductor Annotation Resources

Resource Description Use Cases
Annotation Packages Provides comprehensive annotation data for various organisms, e.g., org.Hs.eg.db for humans. Gene annotation, mapping IDs, retrieving functions, linking genes to pathways.
Biostrings & BSgenome Infrastructure for handling biological sequences and accessing full genome sequences, e.g. BSgenome.Scerevisiae.UCSC.sacCer3. Extracting and analyzing nucleotide/protein sequences, linking to annotations.
GenomicFeatures Create, manipulate, and annotate genomic ranges, including transcript databases from various sources, e.g. TxDb.Hsapiens.UCSC.hg19.knownGene. Creating TxDb objects to find e.g. gene, transcript, or exon information.
AnnotationHub Resource for discovering and using a variety of genomic annotation resources from different sources such as UCSC or Ensembl. Finding up-to-date annotations, integrating data into analyses.
biomaRt Interface to the BioMart system for retrieving genomic annotations from Ensembl and other databases. Obtaining annotations like gene coordinates, names, homologs, variant info.
rtracklayer Facilities for importing and exporting genome annotations in various formats, such as GFF, BED, and WIG. Working with track-based data, integrating external annotations.
GO.db & KEGG.db Access to Gene Ontology (GO) terms and KEGG pathways for annotating genes with functional and pathway information. Functional enrichment analysis, identifying biological themes.

org.db and TxDb packages

This tends to be useful to link annotation IDs to each other, for example to map between trivial and systematic gene names or to find out which Uniprot IDs are linked to a particular gene. One advantage of retrieving annotations programmatically rather than manually is that you have a record of which database or genome assembly version was used.

> library(org.Hs.eg.db)
> 
> org <- org.Hs.eg.db
> org
## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
## | DBSCHEMA: HUMAN_DB
## | ORGANISM: Homo sapiens
## | SPECIES: Human
## | EGSOURCEDATE: 2025-Feb22
## | EGSOURCENAME: Entrez Gene
## | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | CENTRALID: EG
## | TAXID: 9606
## | GOSOURCENAME: Gene Ontology
## | GOSOURCEURL: http://current.geneontology.org/ontology/go-basic.obo
## | GOSOURCEDATE: 2025-02-06
## | GOEGSOURCEDATE: 2025-Feb22
## | GOEGSOURCENAME: Entrez Gene
## | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | KEGGSOURCENAME: KEGG GENOME
## | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
## | KEGGSOURCEDATE: 2011-Mar15
## | GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
## | GPSOURCEURL: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/database
## | GPSOURCEDATE: UTC-Mar1
## | ENSOURCEDATE: 2024-Oct17
## | ENSOURCENAME: Ensembl
## | ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
## | UPSOURCENAME: Uniprot
## | UPSOURCEURL: http://www.UniProt.org/
## | UPSOURCEDATE: Sat Mar  1 19:14:02 2025

Let us assume you want to find out which Uniprot IDs exist for the “EGFR” gene. You can solve many of these questions using the so-called “select interface”:

> gene <- "EGFR"
> 
> columns(org)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
## [16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
## [21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [26] "UNIPROT"
> keytypes(org)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
## [16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
## [21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [26] "UNIPROT"
> 
> # by default it will return the primary keys for the database but if used with the keytype argument, 
> # it will return the keys from that keytype
> 
> head(keys(org))
## [1] "1"  "2"  "3"  "9"  "10" "11"
> 
> head(keys(org, keytype = "GENENAME"))
## [1] "alpha-1-B glycoprotein"             "alpha-2-macroglobulin"             
## [3] "alpha-2-macroglobulin pseudogene 1" "N-acetyltransferase 1"             
## [5] "N-acetyltransferase 2"              "N-acetyltransferase pseudogene"
> head(keys(org, keytype = "ENTREZID"))
## [1] "1"  "2"  "3"  "9"  "10" "11"
> head(keys(org, keytype = "SYMBOL"))
## [1] "A1BG"  "A2M"   "A2MP1" "NAT1"  "NAT2"  "NATP"
> grep(keys(org, keytype = "SYMBOL"), pattern = gene)
## [1]  1553 36154
> 
> keys(org, keytype = "SYMBOL")[grep(keys(org, keytype = "SYMBOL"), pattern = gene)]
## [1] "EGFR"     "EGFR-AS1"
> 
> # explicitly calling select from AnnotationDbi to prevent problems of name masking 
> # (when a function of the same name from another package masks the intended function name)
> AnnotationDbi::select(org, keys = gene, keytype = "SYMBOL", 
+                       columns = c("SYMBOL", "ALIAS", "UNIPROT")) |>
+   head()
##   SYMBOL ALIAS UNIPROT
## 1   EGFR  ERBB  F2YGG7
## 2   EGFR  ERBB  Q504U8
## 3   EGFR  ERBB  A9CB80
## 4   EGFR  ERBB  E7BSV0
## 5   EGFR  ERBB  B7Z2I3
## 6   EGFR  ERBB  C9JYS6
> 
> AnnotationDbi::select(org, keys = gene, keytype = "SYMBOL", 
+                       columns = c("SYMBOL", "ENSEMBL"))
##   SYMBOL         ENSEMBL
## 1   EGFR ENSG00000146648
> 
> AnnotationDbi::select(org, keys = "ENSG00000146648", keytype = "ENSEMBL", 
+                       columns = c("GO")) |>
+   head()
##           ENSEMBL         GO EVIDENCE ONTOLOGY
## 1 ENSG00000146648 GO:0000139      IEA       CC
## 2 ENSG00000146648 GO:0000165      IEA       BP
## 3 ENSG00000146648 GO:0001503      NAS       BP
## 4 ENSG00000146648 GO:0001618      IEA       MF
## 5 ENSG00000146648 GO:0001892      IEA       BP
## 6 ENSG00000146648 GO:0001934      IDA       BP
> 
> search_res <- AnnotationDbi::select(org, keys = gene, keytype = "SYMBOL", 
+                       columns = c("ENTREZID"))
> 
> gene_id <- search_res$ENTREZID
> 
> AnnotationDbi::select(org, keys = gene, keytype = "SYMBOL", 
+                       columns = c("GO", "ONTOLOGY")) |>
+   head()
##   SYMBOL         GO EVIDENCE ONTOLOGY
## 1   EGFR GO:0000139      IEA       CC
## 2   EGFR GO:0000165      IEA       BP
## 3   EGFR GO:0001503      NAS       BP
## 4   EGFR GO:0001618      IEA       MF
## 5   EGFR GO:0001892      IEA       BP
## 6   EGFR GO:0001934      IDA       BP
> 
> 
> # ENTREZID is 1956 (GENEID in subsequent searches)

The retrieved IDs can be useful for subsetting other annotation packages. For example, the ENTREZID for our EGFR was 1956. We can use this to subset the corresponding transcript from TxDb.Hsapiens.UCSC.hg38.knownGene.

> library(TxDb.Hsapiens.UCSC.hg38.knownGene)
> 
> txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
> txdb
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg38
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # UCSC Table: knownGene
## # UCSC Track: GENCODE V47
## # Resource URL: https://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: NA
## # Nb of transcripts: 412034
## # Db created by: txdbmaker package from Bioconductor
## # Creation time: 2025-03-02 02:45:03 +0000 (Sun, 02 Mar 2025)
## # txdbmaker version at creation time: 1.3.1
## # RSQLite version at creation time: 2.3.9
## # DBSCHEMAVERSION: 1.2
> 
> egfr_region <- genes(txdb, filter = list(GENEID = 1956))
> egfr_region
## GRanges object with 1 range and 1 metadata column:
##        seqnames            ranges strand |     gene_id
##           <Rle>         <IRanges>  <Rle> | <character>
##   1956     chr7 55019017-55211628      + |        1956
##   -------
##   seqinfo: 711 sequences (1 circular) from hg38 genome
> exons(txdb, filter = list(GENEID = 1956))
## GRanges object with 53 ranges and 1 metadata column:
##        seqnames            ranges strand |   exon_id
##           <Rle>         <IRanges>  <Rle> | <integer>
##    [1]     chr7 55019017-55019365      + |    340477
##    [2]     chr7 55019021-55019365      + |    340478
##    [3]     chr7 55019032-55019365      + |    340479
##    [4]     chr7 55019034-55019365      + |    340480
##    [5]     chr7 55019088-55019365      + |    340481
##    ...      ...               ...    ... .       ...
##   [49]     chr7 55205256-55206016      + |    340525
##   [50]     chr7 55205256-55208067      + |    340526
##   [51]     chr7 55205256-55211536      + |    340527
##   [52]     chr7 55205256-55211628      + |    340528
##   [53]     chr7 55205347-55205865      + |    340529
##   -------
##   seqinfo: 711 sequences (1 circular) from hg38 genome

Finding Regulatory Regions with AnnotationHub

AnnotationHub allows you to access annotations from several sources. From the description file of the package: “This package provides a client for the Bioconductor AnnotationHub web resource. The AnnotationHub web resource provides a central location where genomic files (e.g., VCF, bed, wig) and other resources from standard locations (e.g., UCSC, Ensembl) can be discovered.”

> library(AnnotationHub)
> 
> ah <- AnnotationHub()
> ah
## AnnotationHub with 70637 records
## # snapshotDate(): 2025-04-08
## # $dataprovider: Ensembl, BroadInstitute, UCSC, ftp://ftp.ncbi.nlm.nih.gov/g...
## # $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Rattus norv...
## # $rdataclass: GRanges, TwoBitFile, BigWigFile, EnsDb, Rle, OrgDb, ChainFile...
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH5012"]]' 
## 
##              title                                   
##   AH5012   | Chromosome Band                         
##   AH5013   | STS Markers                             
##   AH5014   | FISH Clones                             
##   AH5015   | Recomb Rate                             
##   AH5016   | ENCODE Pilot                            
##   ...        ...                                     
##   AH121712 | Data.table for PubMed Author Information
##   AH121713 | Data.table for PMC                      
##   AH121714 | Data.table for MeSH (Descriptor)        
##   AH121715 | Data.table for MeSH (Qualifier)         
##   AH121716 | Data.table for MeSH (SCR)
> 
> # Discover records in a hub using mcols(), query(), subset(), and [.
> mcols(ah)
## DataFrame with 70637 rows and 15 columns
##                           title dataprovider      species taxonomyid
##                     <character>  <character>  <character>  <integer>
## AH5012          Chromosome Band         UCSC Homo sapiens       9606
## AH5013              STS Markers         UCSC Homo sapiens       9606
## AH5014              FISH Clones         UCSC Homo sapiens       9606
## AH5015              Recomb Rate         UCSC Homo sapiens       9606
## AH5016             ENCODE Pilot         UCSC Homo sapiens       9606
## ...                         ...          ...          ...        ...
## AH121712 Data.table for PubMe..         NCBI           NA         NA
## AH121713     Data.table for PMC         NCBI           NA         NA
## AH121714 Data.table for MeSH ..         NCBI           NA         NA
## AH121715 Data.table for MeSH ..         NCBI           NA         NA
## AH121716 Data.table for MeSH ..         NCBI           NA         NA
##               genome            description coordinate_1_based
##          <character>            <character>          <integer>
## AH5012          hg19 GRanges object from ..                  1
## AH5013          hg19 GRanges object from ..                  1
## AH5014          hg19 GRanges object from ..                  1
## AH5015          hg19 GRanges object from ..                  1
## AH5016          hg19 GRanges object from ..                  1
## ...              ...                    ...                ...
## AH121712          NA Correspondence table..                  1
## AH121713          NA Correspondence table..                  1
## AH121714          NA Correspondence table..                  1
## AH121715          NA Correspondence table..                  1
## AH121716          NA Correspondence table..                  1
##                      maintainer rdatadateadded          preparerclass
##                     <character>    <character>            <character>
## AH5012   Marc Carlson <mcarls..     2013-03-26 UCSCFullTrackImportP..
## AH5013   Marc Carlson <mcarls..     2013-03-26 UCSCFullTrackImportP..
## AH5014   Marc Carlson <mcarls..     2013-03-26 UCSCFullTrackImportP..
## AH5015   Marc Carlson <mcarls..     2013-03-26 UCSCFullTrackImportP..
## AH5016   Marc Carlson <mcarls..     2013-03-26 UCSCFullTrackImportP..
## ...                         ...            ...                    ...
## AH121712 Koki Tsuyuzaki <k.t...     2025-04-08            AHPubMedDbs
## AH121713 Koki Tsuyuzaki <k.t...     2025-04-08            AHPubMedDbs
## AH121714 Koki Tsuyuzaki <k.t...     2025-04-08            AHPubMedDbs
## AH121715 Koki Tsuyuzaki <k.t...     2025-04-08            AHPubMedDbs
## AH121716 Koki Tsuyuzaki <k.t...     2025-04-08            AHPubMedDbs
##                                    tags  rdataclass              rdatapath
##                                  <AsIs> <character>            <character>
## AH5012          cytoBand,UCSC,track,...     GRanges goldenpath/hg19/data..
## AH5013            stsMap,UCSC,track,...     GRanges goldenpath/hg19/data..
## AH5014        fishClones,UCSC,track,...     GRanges goldenpath/hg19/data..
## AH5015        recombRate,UCSC,track,...     GRanges goldenpath/hg19/data..
## AH5016     encodeRegions,UCSC,track,...     GRanges goldenpath/hg19/data..
## ...                                 ...         ...                    ...
## AH121712     data.table,NCBI,PubMed,...  data.table AHPubMedDbs/v009/aut..
## AH121713        data.table,NCBI,PMC,...  data.table AHPubMedDbs/v009/pmc..
## AH121714 data.table,Descriptor,MeSH,...  data.table AHPubMedDbs/v009/des..
## AH121715       data.table,MeSH,NCBI,...  data.table AHPubMedDbs/v009/qua..
## AH121716       data.table,MeSH,NCBI,...  data.table AHPubMedDbs/v009/scr..
##                       sourceurl  sourcetype
##                     <character> <character>
## AH5012   rtracklayer://hgdown..  UCSC track
## AH5013   rtracklayer://hgdown..  UCSC track
## AH5014   rtracklayer://hgdown..  UCSC track
## AH5015   rtracklayer://hgdown..  UCSC track
## AH5016   rtracklayer://hgdown..  UCSC track
## ...                         ...         ...
## AH121712 https://github.com/r..         XML
## AH121713 https://github.com/r..         XML
## AH121714 https://github.com/r..         XML
## AH121715 https://github.com/r..         XML
## AH121716 https://github.com/r..         XML
Exercise

Retrieve all enhancer regions for the human genome using the ah AnnotationHub object just created. Use the documentation to get started (?AnnotationHub or vignette("AnnotationHub")). Hint: use “enhancer” and “hg38” as query strings and subsetByOverlaps() to get the overlap of the retrieved enhancer sequences and the egfr_region retrieved in the pre-previous code chunk.

Solution
> regulatory_regions <- query(ah, c("enhancer", "hg38"))
> ah["AH116724"]
## AnnotationHub with 1 record
## # snapshotDate(): 2025-04-08
## # names(): AH116724
## # $dataprovider: NA
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # $rdatadateadded: 2024-04-29
## # $title: TENET_consensus_enhancer_regions
## # $description: A composite GRanges object containing regions of putative en...
## # $taxonomyid: 9606
## # $genome: hg38
## # $sourcetype: Multiple
## # $sourceurl: https://github.com/rhielab/TENET.AnnotationHub/raw/devel/data-...
## # $sourcesize: NA
## # $tags: c("EpigenomeRoadMap", "FANTOM5", "GenomicSequence",
## #   "Homo_sapiens", "TENET") 
## # retrieve record with 'object[["AH116724"]]'
> mcols(ah["AH116724"])
## DataFrame with 1 row and 15 columns
##                           title dataprovider      species taxonomyid
##                     <character>  <character>  <character>  <integer>
## AH116724 TENET_consensus_enha..           NA Homo sapiens       9606
##               genome            description coordinate_1_based
##          <character>            <character>          <integer>
## AH116724        hg38 A composite GRanges ..                  1
##                      maintainer rdatadateadded       preparerclass
##                     <character>    <character>         <character>
## AH116724 Rhie Lab at the Univ..     2024-04-29 TENET.AnnotationHub
##                                                  tags  rdataclass
##                                                <AsIs> <character>
## AH116724 EpigenomeRoadMap,FANTOM5,GenomicSequence,...     GRanges
##                       rdatapath              sourceurl  sourcetype
##                     <character>            <character> <character>
## AH116724 TENET.AnnotationHub/.. https://github.com/r..    Multiple
> ah["AH116724"]$description
## [1] "A composite GRanges object containing regions of putative enhancer elements from a variety of sources, primarily for use in the TENET Bioconductor package. This dataset is composed of regions of strong enhancers as annotated by the Roadmap Epigenomics ChromHMM expanded 18-state model based on 98 reference epigenomes, lifted over to the hg38 genome, as well as regions of human permissive enhancers identified by the FANTOM5 project. For additional information on component datasets, see the manifest file hosted at https://github.com/rhielab/TENET.AnnotationHub/blob/devel/data-raw/TENET_consensus_datasets_manifest.tsv"
> 
> enhancers <- regulatory_regions[["AH116724"]]
> enhancers
## GRanges object with 403602 ranges and 0 metadata columns:
##            seqnames            ranges strand
##               <Rle>         <IRanges>  <Rle>
##        [1]     chr1       10001-10400      *
##        [2]     chr1       14801-15200      *
##        [3]     chr1       16001-16600      *
##        [4]     chr1       20001-20400      *
##        [5]     chr1       79001-79800      *
##        ...      ...               ...    ...
##   [403598]     chrY 56856054-56856253      *
##   [403599]     chrY 56873201-56873229      *
##   [403600]     chrY 56873666-56873879      *
##   [403601]     chrY 56879705-56879876      *
##   [403602]     chrM           1-16398      *
##   -------
##   seqinfo: 25 sequences from an unspecified genome; no seqlengths
> 
> subsetByOverlaps(enhancers, egfr_region)
## GRanges object with 10 ranges and 0 metadata columns:
##        seqnames            ranges strand
##           <Rle>         <IRanges>  <Rle>
##    [1]     chr7 55020308-55035707      *
##    [2]     chr7 55036708-55114507      *
##    [3]     chr7 55115308-55162707      *
##    [4]     chr7 55163708-55164107      *
##    [5]     chr7 55165308-55165707      *
##    [6]     chr7 55166708-55167707      *
##    [7]     chr7 55167766-55167997      *
##    [8]     chr7 55168908-55169507      *
##    [9]     chr7 55169708-55195707      *
##   [10]     chr7 55198308-55210907      *
##   -------
##   seqinfo: 25 sequences from an unspecified genome; no seqlengths
> 
> 
> # Could have done the same thing to find histone modifications:
> 
> histone_mods <- query(ah, c("H3K27ac", "hg38"))
> histone_mods
## AnnotationHub with 3 records
## # snapshotDate(): 2025-04-08
## # $dataprovider: ENCODE, NA
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH116721"]]' 
## 
##              title                                 
##   AH116721 | TENET_10_cancer_panel_enhancer_regions
##   AH116727 | ENCODE_dELS_regions                   
##   AH116728 | ENCODE_pELS_regions
> 
> histone_data <- histone_mods[["AH116727"]]
> histone_data
## GRanges object with 786756 ranges and 0 metadata columns:
##            seqnames            ranges strand
##               <Rle>         <IRanges>  <Rle>
##        [1]     chr1     271227-271468      *
##        [2]     chr1     274330-274481      *
##        [3]     chr1     605331-605668      *
##        [4]     chr1     727122-727350      *
##        [5]     chr1     807737-807916      *
##        ...      ...               ...    ...
##   [786752]     chrY 26319537-26319816      *
##   [786753]     chrY 26319848-26320155      *
##   [786754]     chrY 26363541-26363721      *
##   [786755]     chrY 26670949-26671287      *
##   [786756]     chrY 26671293-26671478      *
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths
> 
> 
> # Find overlaps with EGFR gene region
> histone_in_egfr <- subsetByOverlaps(histone_data, egfr_region)
> 
> 
> # Display histone modification data
> histone_in_egfr
## GRanges object with 201 ranges and 0 metadata columns:
##         seqnames            ranges strand
##            <Rle>         <IRanges>  <Rle>
##     [1]     chr7 55020968-55021132      *
##     [2]     chr7 55021188-55021347      *
##     [3]     chr7 55021444-55021635      *
##     [4]     chr7 55021651-55021842      *
##     [5]     chr7 55021954-55022284      *
##     ...      ...               ...    ...
##   [197]     chr7 55205299-55205450      *
##   [198]     chr7 55206264-55206550      *
##   [199]     chr7 55207491-55207780      *
##   [200]     chr7 55209422-55209744      *
##   [201]     chr7 55210237-55210526      *
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths
> 
> 
> # We can also retrieve more richly annotated resources:
> # 77 refers to the version number of the ENSEMBL release
> query(ah, c("GTF", "77", "Ensembl", "Homo sapiens"))
## AnnotationHub with 1 record
## # snapshotDate(): 2025-04-08
## # names(): AH28812
## # $dataprovider: Ensembl
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # $rdatadateadded: 2015-03-25
## # $title: Homo_sapiens.GRCh38.77.gtf
## # $description: Gene Annotation for Homo sapiens
## # $taxonomyid: 9606
## # $genome: GRCh38
## # $sourcetype: GTF
## # $sourceurl: ftp://ftp.ensembl.org/pub/release-77/gtf/homo_sapiens/Homo_sap...
## # $sourcesize: 44454526
## # $tags: c("GTF", "ensembl", "Gene", "Transcript", "Annotation") 
## # retrieve record with 'object[["AH28812"]]'
> ah[["AH28812"]]
## GRanges object with 2672001 ranges and 21 metadata columns:
##             seqnames      ranges strand |   source       type     score
##                <Rle>   <IRanges>  <Rle> | <factor>   <factor> <numeric>
##         [1]        1 11869-14409      + |   havana gene              NA
##         [2]        1 11869-14409      + |   havana transcript        NA
##         [3]        1 11869-12227      + |   havana exon              NA
##         [4]        1 12613-12721      + |   havana exon              NA
##         [5]        1 13221-14409      + |   havana exon              NA
##         ...      ...         ...    ... .      ...        ...       ...
##   [2671997]       MT 15888-15953      + |  ensembl transcript        NA
##   [2671998]       MT 15888-15953      + |  ensembl exon              NA
##   [2671999]       MT 15956-16023      - |  insdc   gene              NA
##   [2672000]       MT 15956-16023      - |  ensembl transcript        NA
##   [2672001]       MT 15956-16023      - |  ensembl exon              NA
##                 phase         gene_id gene_version   gene_name gene_source
##             <integer>     <character>    <numeric> <character> <character>
##         [1]      <NA> ENSG00000223972            5     DDX11L1      havana
##         [2]      <NA> ENSG00000223972            5     DDX11L1      havana
##         [3]      <NA> ENSG00000223972            5     DDX11L1      havana
##         [4]      <NA> ENSG00000223972            5     DDX11L1      havana
##         [5]      <NA> ENSG00000223972            5     DDX11L1      havana
##         ...       ...             ...          ...         ...         ...
##   [2671997]      <NA> ENSG00000210195            2       MT-TT       insdc
##   [2671998]      <NA> ENSG00000210195            2       MT-TT       insdc
##   [2671999]      <NA> ENSG00000210196            2       MT-TP       insdc
##   [2672000]      <NA> ENSG00000210196            2       MT-TP       insdc
##   [2672001]      <NA> ENSG00000210196            2       MT-TP       insdc
##                       gene_biotype   transcript_id transcript_version
##                        <character>     <character>          <numeric>
##         [1] transcribed_unproces..            <NA>                 NA
##         [2] transcribed_unproces.. ENST00000456328                  2
##         [3] transcribed_unproces.. ENST00000456328                  2
##         [4] transcribed_unproces.. ENST00000456328                  2
##         [5] transcribed_unproces.. ENST00000456328                  2
##         ...                    ...             ...                ...
##   [2671997]                Mt_tRNA ENST00000387460                  2
##   [2671998]                Mt_tRNA ENST00000387460                  2
##   [2671999]                Mt_tRNA            <NA>                 NA
##   [2672000]                Mt_tRNA ENST00000387461                  2
##   [2672001]                Mt_tRNA ENST00000387461                  2
##             transcript_name transcript_source   transcript_biotype exon_number
##                 <character>       <character>          <character>   <numeric>
##         [1]            <NA>              <NA>                 <NA>          NA
##         [2]     DDX11L1-002            havana processed_transcript          NA
##         [3]     DDX11L1-002            havana processed_transcript           1
##         [4]     DDX11L1-002            havana processed_transcript           2
##         [5]     DDX11L1-002            havana processed_transcript           3
##         ...             ...               ...                  ...         ...
##   [2671997]       MT-TT-201           ensembl              Mt_tRNA          NA
##   [2671998]       MT-TT-201           ensembl              Mt_tRNA           1
##   [2671999]            <NA>              <NA>                 <NA>          NA
##   [2672000]       MT-TP-201           ensembl              Mt_tRNA          NA
##   [2672001]       MT-TP-201           ensembl              Mt_tRNA           1
##                     exon_id exon_version         tag     ccds_id  protein_id
##                 <character>    <numeric> <character> <character> <character>
##         [1]            <NA>           NA        <NA>        <NA>        <NA>
##         [2]            <NA>           NA        <NA>        <NA>        <NA>
##         [3] ENSE00002234944            1        <NA>        <NA>        <NA>
##         [4] ENSE00003582793            1        <NA>        <NA>        <NA>
##         [5] ENSE00002312635            1        <NA>        <NA>        <NA>
##         ...             ...          ...         ...         ...         ...
##   [2671997]            <NA>           NA        <NA>        <NA>        <NA>
##   [2671998] ENSE00001544475            2        <NA>        <NA>        <NA>
##   [2671999]            <NA>           NA        <NA>        <NA>        <NA>
##   [2672000]            <NA>           NA        <NA>        <NA>        <NA>
##   [2672001] ENSE00001544473            2        <NA>        <NA>        <NA>
##             protein_version
##                   <numeric>
##         [1]              NA
##         [2]              NA
##         [3]              NA
##         [4]              NA
##         [5]              NA
##         ...             ...
##   [2671997]              NA
##   [2671998]              NA
##   [2671999]              NA
##   [2672000]              NA
##   [2672001]              NA
##   -------
##   seqinfo: 270 sequences (1 circular) from GRCh38 genome

The GO.db package allows retrieving pathway and GO term and gene function information:

> library(GO.db)
> columns(GO.db)
## [1] "DEFINITION" "GOID"       "ONTOLOGY"   "TERM"
> 
> go_terms <- c("GO:0097421", "GO:0097489")
> go_terms
## [1] "GO:0097421" "GO:0097489"
> 
> AnnotationDbi::select(GO.db, keys = go_terms, keytype = "GOID", columns = c("TERM", "ONTOLOGY"))
##         GOID                                        TERM ONTOLOGY
## 1 GO:0097421                          liver regeneration       BP
## 2 GO:0097489 multivesicular body, internal vesicle lumen       CC

Session Info

> sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.4.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Berlin
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] GO.db_3.21.0                               
##  [2] AnnotationHub_3.16.0                       
##  [3] BiocFileCache_2.16.0                       
##  [4] dbplyr_2.5.0                               
##  [5] TxDb.Hsapiens.UCSC.hg38.knownGene_3.21.0   
##  [6] org.Hs.eg.db_3.21.0                        
##  [7] tibble_3.2.1                               
##  [8] ggplot2_3.5.2                              
##  [9] DESeq2_1.48.1                              
## [10] SummarizedExperiment_1.38.1                
## [11] MatrixGenerics_1.20.0                      
## [12] matrixStats_1.5.0                          
## [13] TxDb.Scerevisiae.UCSC.sacCer3.sgdGene_3.2.2
## [14] GenomicFeatures_1.60.0                     
## [15] AnnotationDbi_1.70.0                       
## [16] Biobase_2.68.0                             
## [17] BSgenome.Scerevisiae.UCSC.sacCer3_1.4.0    
## [18] BSgenome_1.76.0                            
## [19] rtracklayer_1.68.0                         
## [20] BiocIO_1.18.0                              
## [21] GenomicRanges_1.60.0                       
## [22] Biostrings_2.76.0                          
## [23] GenomeInfoDb_1.44.0                        
## [24] XVector_0.48.0                             
## [25] IRanges_2.42.0                             
## [26] S4Vectors_0.46.0                           
## [27] BiocGenerics_0.54.0                        
## [28] generics_0.1.4                             
## [29] BiocManager_1.30.25                        
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1         dplyr_1.1.4              farver_2.1.2            
##  [4] blob_1.2.4               filelock_1.0.3           bitops_1.0-9            
##  [7] fastmap_1.2.0            RCurl_1.98-1.17          GenomicAlignments_1.44.0
## [10] XML_3.99-0.18            digest_0.6.37            mime_0.13               
## [13] lifecycle_1.0.4          KEGGREST_1.48.0          RSQLite_2.3.11          
## [16] magrittr_2.0.3           compiler_4.5.0           rlang_1.1.6             
## [19] tools_4.5.0              utf8_1.2.5               yaml_2.3.10             
## [22] knitr_1.50               labeling_0.4.3           S4Arrays_1.8.0          
## [25] htmlwidgets_1.6.4        bit_4.6.0                curl_6.2.2              
## [28] DelayedArray_0.34.1      RColorBrewer_1.1-3       abind_1.4-8             
## [31] BiocParallel_1.42.0      purrr_1.0.4              withr_3.0.2             
## [34] grid_4.5.0               scales_1.4.0             cli_3.6.5               
## [37] rmarkdown_2.29           crayon_1.5.3             rstudioapi_0.17.1       
## [40] httr_1.4.7               rjson_0.2.23             DBI_1.2.3               
## [43] cachem_1.1.0             parallel_4.5.0           restfulr_0.0.15         
## [46] vctrs_0.6.5              Matrix_1.7-3             jsonlite_2.0.0          
## [49] bit64_4.6.0-1            locfit_1.5-9.12          glue_1.8.0              
## [52] codetools_0.2-20         gtable_0.3.6             BiocVersion_3.21.1      
## [55] UCSC.utils_1.4.0         pillar_1.10.2            rappdirs_0.3.3          
## [58] htmltools_0.5.8.1        GenomeInfoDbData_1.2.14  R6_2.6.1                
## [61] evaluate_1.0.3           lattice_0.22-7           png_0.1-8               
## [64] Rsamtools_2.24.0         memoise_2.0.1            Rcpp_1.0.14             
## [67] SparseArray_1.8.0        xfun_0.52                pkgconfig_2.0.3