> if (!require("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
>
> # leave empty to install the newest version
> BiocManager::install() # leave empty or specify e.g. version="3.18"Introduction to Bioconductor
HBIGS Core Course, May 15, 2025
What is Bioconductor?
Bioconductor: Open Source Software for Bioinformatics
- Bioconductor is an ecosystem of R packages for the analysis of biological data.
- Ecosystem of packages that play well together.
- Offers packages for sequencing analysis, genomics, searching databases and visualizations, plotting, etc.
- Two releases per year.
- BioC package subject to peer review and certain guidelines.
- Not maintained packages are removed.
- One of if not the reason for biologists to learn R programming.
Installing Bioconductor
- Bioconductor builds on R so you need to install R first.
- BioC installation is managed by BioC’s package manager
BiocManager:
- To install BioC packages run:
> BiocManager::install(c("sequencing", "DESeq2"))
>
> # to update packages run BiocManager::install() without arguments:
> BiocManager::install()- After installation packages can be loaded just like in R with the
library()command:
> library(DESeq2)Exploring Bioconductor and Bioconductor Resources
- The Bioconductor website is a good place to start. There are a lot of example workflows and online learning materials available.
- Bioconductor makes heavy use of so-called “vignettes”, which are a sort of tutorial to walk you through packages.
- For example, to get a tutorial on how to work with sequence data in Bioconductor:
> browseVignettes("sequencing")- Another good place to get started is to read this paper introducing the Bioconductor framework.
About this Course
- We will cover a number of topics that are essential to understand how Bioconductor works and that are important in many of the analyses that people use Bioconductor for.
- However, as this is only a one-day course some topics are not covered at all and we will not have time to dive very deep into the topics covered.
- Therefore, I have tried to focus on a “learning to learn” approach: Ways of finding help and consulting the documentation are shown. The course itself is a mix of a bit of theory, live coding, and hands-on exercises. This should help you to “hit the ground running” for your future analyses and to know where to continue.
Part I: Working with Biological Sequences
Overview
There are a lot of BioC packages that serve different purposes when working with DNA/RNA/protein sequence data. They cover all your needs from sequencing quality control to searching genomes, motif search, gene expression analysis etc. This figure is an overview for future reference (taken from vignette(sequencing)):
Biostrings
Assume you want to run analyses on DNA, RNA, or amino acid (AA) sequences such as finding motifs, calculating the GC content etc. This is where the Biostrings package comes in. It provides data types for DNA, RNA, and AA sequence representation and operations.
> library(Biostrings)
>
> my_dna <- DNAString("ATGCTACGTATA")
> my_dna
## 12-letter DNAString object
## seq: ATGCTACGTATA
>
> my_dna2 <- DNAString("ATGC.YG") # pYrimidines = C and T
> ?DNAString
> IUPAC_CODE_MAP
## A C G T M R W S Y K V
## "A" "C" "G" "T" "AC" "AG" "AT" "CG" "CT" "GT" "ACG"
## H D B N
## "ACT" "AGT" "CGT" "ACGT"
>
> DNAString("AUG")
## Error in .Call2("new_XString_from_CHARACTER", class(x0), string, start, : key 85 (char 'U') not in lookup table
> DNA_ALPHABET
## [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" "."
>
> RNAString("AUG")
## 3-letter RNAString object
## seq: AUG
>
> a_protein <- AAString("FHIG")
> a_protein
## 4-letter AAString object
## seq: FHIGThese “Biostrings” support a number of useful and biologically meaningful operations.
> my_dna
## 12-letter DNAString object
## seq: ATGCTACGTATA
>
> my_dna[4:6]
## 3-letter DNAString object
## seq: CTA
> reverseComplement(my_dna)
## 12-letter DNAString object
## seq: TATACGTAGCAT
> translate(my_dna)
## 4-letter AAString object
## seq: MLRI
> codons(my_dna)
## Views on a 12-letter DNAString subject
## subject: ATGCTACGTATA
## views:
## start end width
## [1] 1 3 3 [ATG]
## [2] 4 6 3 [CTA]
## [3] 7 9 3 [CGT]
## [4] 10 12 3 [ATA]
>
> # GENETIC_CODE
> # GENETIC_CODE_TABLE
>
> mt_genetic_code <- getGeneticCode("SGC1")
> translate(my_dna, genetic.code = mt_genetic_code)
## 4-letter AAString object
## seq: MLRM
> rev(my_dna)
## 12-letter DNAString object
## seq: ATATGCATCGTAA Look Behind the Scenes: Some Technical Background on Bioconductor and How This Helps Us
Functional Object-Oriented Programming
Let’s interrupt our journey into Bioconductor for a technical detour which will improve your understanding of how R and Bioconductor works. This will make it easier for you to understand the documentation and help files, tutorials etc.
After all, at this point Bioconductor may seem like some wonderful but also confusing black box:
- What are all these “objects”?
- For example, what is a “DNAString object”? What can I do with it? Why does
rev()from base R work with it? - How can I find help?
These are questions you may have already had when learning standard (“base”) R but in Bioconductor knowing the answers becomes more important to effectively use and understand the documentation.
When you write code in R you always “do something” with functions. Every operation invokes a function, either explicitly, for example when you call mean(), sd() etc., or behind the scenes, for example when you use mathematical operators or parentheses. Because functions are so central in R, R is also sometimes called a functional programming language.
> v <- c(3, 4, 8, 12)
> mean(v)
## [1] 6.75
>
> 5 + 7
## [1] 12
> # note: everything that "does something" in R is a function. Even the +.
> # Use backticks to call it like a function
> `+`(5, 7)
## [1] 12
> `[`(v, 3)
## [1] 8However, this is only half the story. Functions always act on objects and objects are always of a certain class. But what are classes? Classes can be thought of as a custom data type. For example, you can define a “DNAString class” as you saw before. The user can then use a so-called constructor function or method to generate objects of that class, so-called “instances”. The class definition defines which properties a class has, which data it contains, and which operations are allowed on it. For example, a DNAString object (instance) contains some data (the DNA sequence), has some properties (for example its length and other metadata), and there are some operations that it supports (for example: creating the reverse complement or translating the sequence).
In fact, this is a behaviour not specific to Bioconductor but of the R language in general. Take a vector of strings, for example: it can be thought of as an instance of the “character” class and this implies that certain operations are supported whereas others are not:
> genes <- c("recA", "mrcB")
> class(genes)
## [1] "character"
>
> nchar(genes)
## [1] 4 4
> sum(genes)
## Error in sum(genes): invalid 'type' (character) of argumentBut as you may have guessed from the “DNAString class”, classes can be more complex: for example, if you ever ran a linear regression in R the output was an object of class “lm”.
> x <- rnorm(5)
> y <- rnorm(5)
> my_lm <- lm(y ~ x)
> my_lm
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 0.1500 -0.9112
> class(my_lm)
## [1] "lm"It is this class property which will influence what functions do with their arguments: For example, when you use summary(), depending on the argument type (its class), you get a different output:
> x
## [1] -0.049786590 0.472937922 0.218420129 0.008936016 1.055637415
> summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.049787 0.008936 0.218420 0.341229 0.472938 1.055637
>
> summary(my_lm)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## 1 2 3 4 5
## -0.76021 0.62145 -1.75704 1.86229 0.03351
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1500 0.9292 0.161 0.882
## x -0.9112 1.7635 -0.517 0.641
##
## Residual standard error: 1.583 on 3 degrees of freedom
## Multiple R-squared: 0.08172, Adjusted R-squared: -0.2244
## F-statistic: 0.267 on 1 and 3 DF, p-value: 0.641This is why functions in R “magically” seem to know what to do with their arguments. Many R functions will first look at the type of their first argument and then call a specialized function behind the scenes to do the actual work. So when you call summary() on the lm object it called the specialized summary.lm() method behind the scenes:
> summary.lm(my_lm)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## 1 2 3 4 5
## -0.76021 0.62145 -1.75704 1.86229 0.03351
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1500 0.9292 0.161 0.882
## x -0.9112 1.7635 -0.517 0.641
##
## Residual standard error: 1.583 on 3 degrees of freedom
## Multiple R-squared: 0.08172, Adjusted R-squared: -0.2244
## F-statistic: 0.267 on 1 and 3 DF, p-value: 0.641
> # fails:
> summary.lm(x)
## Error in z$rank: $ operator is invalid for atomic vectors
>
> # in the case of x it's not summary.numeric but summary.default:
> summary.default(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.049787 0.008936 0.218420 0.341229 0.472938 1.055637We say that summary is a generic function that invokes particular methods depending on the class of its first argument. See also the help file for summary: ?summary. This mechanism is also called type-based dispatch.
This, in a nutshell, is how object oriented programming (OOP) works in R. It is a bit different from how OOP works in languages like Python or Java.
You may be surprised at the number of methods for a given generic function. You can find them by using the methods() function:
> methods("summary")
## [1] summary,ANY-method summary,CompressedIRangesList-method
## [3] summary,FilterResults-method summary,FilterRules-method
## [5] summary,Hits-method summary,IPos-method
## [7] summary,IPosRanges-method summary,mle-method
## [9] summary,Rle-method summary,Seqinfo-method
## [11] summary,Vector-method summary.aov
## [13] summary.aovlist* summary.aspell*
## [15] summary.check_packages_in_dir* summary.connection
## [17] summary.data.frame summary.Date
## [19] summary.default summary.ecdf*
## [21] summary.factor summary.glm
## [23] summary.Hits summary.infl*
## [25] summary.IPos summary.IPosRanges
## [27] summary.lm summary.loess*
## [29] summary.manova summary.matrix
## [31] summary.mlm* summary.nls*
## [33] summary.packageStatus* summary.POSIXct
## [35] summary.POSIXlt summary.ppr*
## [37] summary.prcomp* summary.princomp*
## [39] summary.proc_time summary.rlang_error*
## [41] summary.rlang_message* summary.rlang_trace*
## [43] summary.rlang_warning* summary.rlang:::list_of_conditions*
## [45] summary.Rle summary.Seqinfo
## [47] summary.srcfile summary.srcref
## [49] summary.stepfun summary.stl*
## [51] summary.table summary.tukeysmooth*
## [53] summary.Vector summary.warnings
## see '?methods' for accessing help and source codeTo find out which methods exist for a given class use the methods() function:
> methods(class="lm")
## [1] add1 alias anova case.names coerce
## [6] confint cooks.distance deviance dfbeta dfbetas
## [11] drop1 dummy.coef effects extractAIC family
## [16] formula hatvalues influence initialize kappa
## [21] labels logLik model.frame model.matrix nobs
## [26] plot predict print proj qr
## [31] residuals rstandard rstudent show simulate
## [36] slotsFromS3 summary variable.names vcov
## see '?methods' for accessing help and source codeThe previous paragraphs describe what is known as the S3 class system.
Bioconductor uses the more advanced S4 class system, which is similar to S3 but it offers some additional features. For example, classes can have multiple parent classes and dispatch can be based on the classes of multiple arguments. However, conceptually, from a user’s perspective, it works similar to S3.
As the name “parent class” suggests, classes can be related to each other. Subclasses (also called child classes or derived classes) are often more specialized versions of their parent classes (= superclasses = base classes).
How Does This Help Us?
Now let us find out how this knowledge can help us to navigate Bioconductor. For example, consider our “DNAString” object:
Code
> my_dna <- DNAString("ATGCTACGTATA")
> class(my_dna)
## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"
>
> # class = is necessary, the default first argument is generic.function
> methods(class = "DNAString")
## [1] !=
## [2] [
## [3] [<-
## [4] %in%
## [5] <
## [6] <=
## [7] ==
## [8] >
## [9] >=
## [10] aggregate
## [11] alphabetFrequency
## [12] anyDuplicated
## [13] anyNA
## [14] append
## [15] as.character
## [16] as.complex
## [17] as.data.frame
## [18] as.env
## [19] as.integer
## [20] as.list
## [21] as.logical
## [22] as.matrix
## [23] as.numeric
## [24] as.raw
## [25] as.vector
## [26] bindROWS
## [27] by
## [28] c
## [29] chartr
## [30] codons
## [31] coerce
## [32] compact
## [33] complement
## [34] countOverlaps
## [35] countPattern
## [36] countPDict
## [37] countPWM
## [38] duplicated
## [39] elementMetadata
## [40] elementMetadata<-
## [41] eval
## [42] expand
## [43] expand.grid
## [44] extract_character_from_XString_by_positions
## [45] extract_character_from_XString_by_ranges
## [46] extractAt
## [47] extractList
## [48] extractROWS
## [49] extractTranscriptSeqs
## [50] FactorToClass
## [51] findOverlaps
## [52] findPalindromes
## [53] hasOnlyBaseLetters
## [54] head
## [55] intersect
## [56] is.na
## [57] isMatchingEndingAt
## [58] isMatchingStartingAt
## [59] lcprefix
## [60] lcsubstr
## [61] lcsuffix
## [62] length
## [63] lengths
## [64] letter
## [65] letterFrequency
## [66] letterFrequencyInSlidingView
## [67] make_XString_from_string
## [68] maskMotif
## [69] masks
## [70] masks<-
## [71] match
## [72] matchLRPatterns
## [73] matchPattern
## [74] matchPDict
## [75] matchProbePair
## [76] matchPWM
## [77] mcols
## [78] mcols<-
## [79] merge
## [80] mergeROWS
## [81] metadata
## [82] metadata<-
## [83] mstack
## [84] nchar
## [85] neditEndingAt
## [86] neditStartingAt
## [87] oligonucleotideFrequency
## [88] overlapsAny
## [89] palindromeArmLength
## [90] palindromeLeftArm
## [91] palindromeRightArm
## [92] parallel_slot_names
## [93] pcompare
## [94] pmatchPattern
## [95] rank
## [96] relist
## [97] relistToClass
## [98] rename
## [99] rep
## [100] rep.int
## [101] replaceAt
## [102] replaceLetterAt
## [103] replaceROWS
## [104] rev
## [105] reverse
## [106] reverseComplement
## [107] selfmatch
## [108] seqlevelsInUse
## [109] seqtype
## [110] seqtype<-
## [111] setdiff
## [112] setequal
## [113] shiftApply
## [114] show
## [115] showAsCell
## [116] sort
## [117] split
## [118] split<-
## [119] subseq
## [120] subseq<-
## [121] subset
## [122] subsetByOverlaps
## [123] substr
## [124] substring
## [125] summary
## [126] table
## [127] tail
## [128] tapply
## [129] toComplex
## [130] toString
## [131] transform
## [132] translate
## [133] trimLRPatterns
## [134] twoWayAlphabetFrequency
## [135] union
## [136] unique
## [137] uniqueLetters
## [138] unmasked
## [139] updateObject
## [140] values
## [141] values<-
## [142] vcountPattern
## [143] vcountPDict
## [144] Views
## [145] vmatchPattern
## [146] vmatchPDict
## [147] vwhichPDict
## [148] which.isMatchingEndingAt
## [149] which.isMatchingStartingAt
## [150] whichPDict
## [151] window
## [152] window<-
## [153] with
## [154] xtabs
## [155] xtfrm
## [156] xvcopy
## see '?methods' for accessing help and source codeWe can see that DNAStrings support a translate method.
> # magic!
> translate(my_dna)
## 4-letter AAString object
## seq: MLRIWith methods(class = "DNAString") we could find out which methods the DNAString class supports (what we “can do with it”). But we can also ask in the other direction: Which other classes support a translate method, i.e. what other things can we translate? Let’s find out:
> methods("translate")
## [1] translate,DNAString-method translate,DNAStringSet-method
## [3] translate,MaskedDNAString-method translate,MaskedRNAString-method
## [5] translate,RNAString-method translate,RNAStringSet-method
## see '?methods' for accessing help and source codeTo find out if an object is of a certain class (or derives from it), use is():
> is(my_dna, "DNAString")
## [1] TRUE
> is(my_dna, "XString")
## [1] TRUETo get more help about a class, you can add some qualifiers to ? to be more specific about the class or method you are interested in, for example:
> # in this case the following to point to the same help page but this is not
> # always the case
> ?DNAString
> ?"DNAString-class"
> ?`translate,DNAString-method`Using the Vignettes
One of the great things about Bioconductor is that most packages have very detailed tutorials called “vignettes”. You can check them in your package inspector in RStudio or like this:
> browseVignettes(package = "Biostrings")Now that you know about Bioconductor classes and methods you will find those help files more useful.
The Ranges infrastructure: IRanges and GenomicRanges
Classes to work with DNA, RNA, and protein sequences are a good start but often the sequences you work with are embedded in a common context, for example a chromosome or a set of chromosomes.
There are many things you may be interested in when working with a genome: perhaps you are interested in finding all the introns of a specific gene or you may be looking for certain sequence motifs in the upstream sequence of a protein coding gene and many more.
To support such operations it is necessary to define classes that support the properties and operations required for such work. The “IRanges” and “GRanges” (GenomicRanges) classes provide these features.
IRanges
IRanges is the name for a class that supports generic operations on sequence data. They form the foundation for GenomicRanges, which is Bioconductor’s way of representing genome data. Let’s have a quick look at what IRanges can do:
> library(IRanges)
>
> ir1 <- IRanges(start = c(1, 5, 15),
+ width = c(3, 25, 10))
> ir1
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 1 3 3
## [2] 5 29 25
## [3] 15 24 10
>
> ir2 <- IRanges(start = 7, width = 5)
> ir2
## IRanges object with 1 range and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 7 11 5
>
> start(ir1)
## [1] 1 5 15
> ir1[2:3]
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 5 29 25
## [2] 15 24 10The following function will make it easier to illustrate the different IRanges operations:
> # from the IRanges vignette:
> plotRanges <- function(x, xlim=x, main=deparse(substitute(x)), col="black", sep=0.5, ...) {
+ height <- 1
+ if (is(xlim, "IntegerRanges"))
+ xlim <- c(min(start(xlim)), max(end(xlim)))
+ bins <- disjointBins(IRanges(start(x), end(x) + 1))
+ plot.new()
+ plot.window(xlim, c(0, max(bins)*(height + sep)))
+ ybottom <- bins * (sep + height) - height
+ rect(start(x)-0.5, ybottom, end(x)+0.5, ybottom + height, col=col, ...)
+ title(main)
+ axis(1)
+ }
>
> plotRanges(ir1)> plotRanges(ir2)> plotRanges(reduce(ir1), xlim = c(1, 30))The following selection of operations give you a taste of what’s possible with IRanges. Can you think of the corresponding biological questions?
> findOverlaps(ir1, ir2)
## Hits object with 1 hit and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 2 1
## -------
## queryLength: 3 / subjectLength: 1
>
> plotRanges(disjoin(ir1))> gaps(ir1)
## IRanges object with 1 range and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 4 4 1
> ir1
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 1 3 3
## [2] 5 29 25
## [3] 15 24 10
> flank(ir1, width = 5)
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] -4 0 5
## [2] 0 4 5
## [3] 10 14 5GenomicRanges (GRanges)
The GRanges class is how functional units (e.g. genes) on a chromosome are represented in Bioconductor. It has biologically meaningful properties such as the strand on which a sequence is located and supports metadata.
> library(GenomicRanges)
>
> gr <- GRanges(
+ seqnames = Rle(c("chr1", "chr2", "chr1"), c(1, 3, 2)),
+ ranges = IRanges(start=101:106, end = 111:116, names = letters[1:6]),
+ strand = strand(c("+", "+", "-", "-", "+", "+")),
+ score = runif(6),
+ GC = c(0.45, 0.47, 0.7, 0.68, 0.3, 0.33))
> gr
## GRanges object with 6 ranges and 2 metadata columns:
## seqnames ranges strand | score GC
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## a chr1 101-111 + | 0.714799 0.45
## b chr2 102-112 + | 0.786448 0.47
## c chr2 103-113 - | 0.499891 0.70
## d chr2 104-114 - | 0.654050 0.68
## e chr1 105-115 + | 0.821489 0.30
## f chr1 106-116 + | 0.946777 0.33
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengthsThere are now a couple of accessor functions for the genomic coordinates as well as the metadata columns:
> # GenomicRanges are vector-like
> gr
## GRanges object with 6 ranges and 2 metadata columns:
## seqnames ranges strand | score GC
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## a chr1 101-111 + | 0.714799 0.45
## b chr2 102-112 + | 0.786448 0.47
## c chr2 103-113 - | 0.499891 0.70
## d chr2 104-114 - | 0.654050 0.68
## e chr1 105-115 + | 0.821489 0.30
## f chr1 106-116 + | 0.946777 0.33
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
> gr[c("a", "c")]
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | score GC
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## a chr1 101-111 + | 0.714799 0.45
## c chr2 103-113 - | 0.499891 0.70
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
> strand(gr)
## factor-Rle of length 6 with 3 runs
## Lengths: 2 2 2
## Values : + - +
## Levels(3): + - *
> granges(gr)
## GRanges object with 6 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## a chr1 101-111 +
## b chr2 102-112 +
## c chr2 103-113 -
## d chr2 104-114 -
## e chr1 105-115 +
## f chr1 106-116 +
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
> names(gr)
## [1] "a" "b" "c" "d" "e" "f"
>
> # to get the metadata
> mcols(gr)
## DataFrame with 6 rows and 2 columns
## score GC
## <numeric> <numeric>
## a 0.714799 0.45
## b 0.786448 0.47
## c 0.499891 0.70
## d 0.654050 0.68
## e 0.821489 0.30
## f 0.946777 0.33
> mcols(gr)[c("a", "c"), ]
## DataFrame with 2 rows and 2 columns
## score GC
## <numeric> <numeric>
## a 0.714799 0.45
## c 0.499891 0.70
> mcols(gr)[1]
## DataFrame with 6 rows and 1 column
## score
## <numeric>
## a 0.714799
## b 0.786448
## c 0.499891
## d 0.654050
## e 0.821489
## f 0.946777
> mcols(gr)[[1]]
## [1] 0.7147989 0.7864476 0.4998908 0.6540496 0.8214889 0.9467766
>
> # to get a subset of the GRanges with only part of the metadata:
> gr[3:5, "score"]
## GRanges object with 3 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## c chr2 103-113 - | 0.499891
## d chr2 104-114 - | 0.654050
## e chr1 105-115 + | 0.821489
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengthsSeveral GenomicRanges can be put together into a list, the GRangesList. This makes sense because some sequences are made from disjoint sequences on a chromosome. Think of transcripts, for example, that can consist of several exons:
> gr1 <- GRanges(
+ seqnames = "chr2",
+ ranges = IRanges(103, 106),
+ strand = "+",
+ score = 5L, GC = 0.45)
> gr2 <- GRanges(
+ seqnames = c("chr1", "chr1"),
+ ranges = IRanges(c(107, 113), width = 3),
+ strand = c("+", "-"),
+ score = 3:4, GC = c(0.3, 0.5))
> grl <- GRangesList("txA" = gr1, "txB" = gr2)
> grl
## GRangesList object of length 2:
## $txA
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | score GC
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## [1] chr2 103-106 + | 5 0.45
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
##
## $txB
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | score GC
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## [1] chr1 107-109 + | 3 0.3
## [2] chr1 113-115 - | 4 0.5
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengthsLoading sequence data into R
There are several ways to import sequence data into R using Bioconductor. Depending on your need you will probably access them from either: - local files, perhaps as a result from sequencing e.g. fasta, fastq, BAM, gtf, bed, wig etc. files. - some sort of sequence database, including reference genomes.
You can read more on sequence imports on this page:
> browseVignettes("sequencing")Accessing Sequence Information from Reference Genomes
Imagine the following scenario: in a screening experiment you identified, say, 100 genes which responded to your treatment. Now you may be interested if certain sequence motifs, e.g. transcription factor binding sites, are present in the promoter regions of these genes. Or perhaps you want to do some comparative genomics and perform a BLAST search. How can you do this in Bioconductor?
To do this, you will need two things:
- A package containing the genome of the species you are interested in. These usually have the format “BSgenome.SPECIESNAME.UCSC…”.
- A package giving you information about the transcripts in the genome. These are often called “TxDb.SPECIESNAME.UCSC…”.
In addition, you will probably face the problem that you need to work with gene names and annotations. This topic is covered in more detail at the end of this course (section “Using Annotation Resources”).
> library("BSgenome.Scerevisiae.UCSC.sacCer3")
>
> sc_genome <- BSgenome.Scerevisiae.UCSC.sacCer3
> sc_genome
## | BSgenome object for Yeast
## | - organism: Saccharomyces cerevisiae
## | - provider: UCSC
## | - genome: sacCer3
## | - release date: April 2011
## | - 17 sequence(s):
## | chrI chrII chrIII chrIV chrV chrVI chrVII chrVIII chrIX
## | chrX chrXI chrXII chrXIII chrXIV chrXV chrXVI chrM
## |
## | Tips: call 'seqnames()' on the object to get all the sequence names, call
## | 'seqinfo()' to get the full sequence info, use the '$' or '[[' operator to
## | access a given sequence, see '?BSgenome' for more information.
> seqnames(sc_genome)
## [1] "chrI" "chrII" "chrIII" "chrIV" "chrV" "chrVI" "chrVII"
## [8] "chrVIII" "chrIX" "chrX" "chrXI" "chrXII" "chrXIII" "chrXIV"
## [15] "chrXV" "chrXVI" "chrM"
> seqinfo(sc_genome)
## Seqinfo object with 17 sequences (1 circular) from sacCer3 genome:
## seqnames seqlengths isCircular genome
## chrI 230218 FALSE sacCer3
## chrII 813184 FALSE sacCer3
## chrIII 316620 FALSE sacCer3
## chrIV 1531933 FALSE sacCer3
## chrV 576874 FALSE sacCer3
## ... ... ... ...
## chrXIII 924431 FALSE sacCer3
## chrXIV 784333 FALSE sacCer3
## chrXV 1091291 FALSE sacCer3
## chrXVI 948066 FALSE sacCer3
## chrM 85779 TRUE sacCer3
> ?'BSgenome'
> metadata(sc_genome)
## $organism
## [1] "Saccharomyces cerevisiae"
##
## $common_name
## [1] "Yeast"
##
## $provider
## [1] "UCSC"
##
## $genome
## [1] "sacCer3"
##
## $release_date
## [1] "April 2011"
##
## $source_url
## [1] "http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/"
> seqnames(sc_genome)
## [1] "chrI" "chrII" "chrIII" "chrIV" "chrV" "chrVI" "chrVII"
## [8] "chrVIII" "chrIX" "chrX" "chrXI" "chrXII" "chrXIII" "chrXIV"
## [15] "chrXV" "chrXVI" "chrM"
> seqlengths(sc_genome)
## chrI chrII chrIII chrIV chrV chrVI chrVII chrVIII chrIX chrX
## 230218 813184 316620 1531933 576874 270161 1090940 562643 439888 745751
## chrXI chrXII chrXIII chrXIV chrXV chrXVI chrM
## 666816 1078177 924431 784333 1091291 948066 85779
> mseqnames(sc_genome)
## NULL
>
> sapply(
+ list(metadata = metadata, seqnames = seqnames,
+ seqlengths = seqlengths, masknames = masknames,
+ mseqnames = mseqnames),
+ function(f) f(sc_genome)
+ )
## $metadata
## $metadata$organism
## [1] "Saccharomyces cerevisiae"
##
## $metadata$common_name
## [1] "Yeast"
##
## $metadata$provider
## [1] "UCSC"
##
## $metadata$genome
## [1] "sacCer3"
##
## $metadata$release_date
## [1] "April 2011"
##
## $metadata$source_url
## [1] "http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/"
##
##
## $seqnames
## [1] "chrI" "chrII" "chrIII" "chrIV" "chrV" "chrVI" "chrVII"
## [8] "chrVIII" "chrIX" "chrX" "chrXI" "chrXII" "chrXIII" "chrXIV"
## [15] "chrXV" "chrXVI" "chrM"
##
## $seqlengths
## chrI chrII chrIII chrIV chrV chrVI chrVII chrVIII chrIX chrX
## 230218 813184 316620 1531933 576874 270161 1090940 562643 439888 745751
## chrXI chrXII chrXIII chrXIV chrXV chrXVI chrM
## 666816 1078177 924431 784333 1091291 948066 85779
##
## $masknames
## NULL
##
## $mseqnames
## NULL
> chrI <- sc_genome[["chrI"]]
> chrI
## 230218-letter DNAString object
## seq: CCACACCACACCCACACACCCACACACCACACCACA...GGTGTGTGGGTGTGGTGTGGGTGTGGTGTGTGTGGG
> class(chrI)
## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"
> length(chrI)
## [1] 230218
> chrI[150:250]
## 101-letter DNAString object
## seq: CCCTGTCCCATTCAACCATACCACTCCGAACCACCA...ACCGTTACCCTCCAATTACCCATATCCAACCCACTGTo get information on transcripts in, for example, the yeast genome, we load the “TxDb.Scerevisiae.UCSC.sacCer3.sgdGene” package. It supports a wide range of interesting operations:
> library("TxDb.Scerevisiae.UCSC.sacCer3.sgdGene")
> sc_txdb <- TxDb.Scerevisiae.UCSC.sacCer3.sgdGene
> sc_txdb
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: sacCer3
## # Organism: Saccharomyces cerevisiae
## # Taxonomy ID: 4932
## # UCSC Table: sgdGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Name of canonical transcript in cluster
## # Full dataset: yes
## # miRBase build ID: NA
## # transcript_nrow: 6692
## # exon_nrow: 7034
## # cds_nrow: 7034
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-10-07 18:20:42 +0000 (Wed, 07 Oct 2015)
## # GenomicFeatures version at creation time: 1.21.30
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1
> class(sc_txdb)
## [1] "TxDb"
## attr(,"package")
## [1] "GenomicFeatures"
>
> # Check the help: ?"TxDb-class"
> methods(class = "TxDb")
## [1] $ $<- annotatedDataFrameFrom
## [4] as.list asBED asGFF
## [7] assayData assayData<- cds
## [10] cdsBy cdsByOverlaps coerce
## [13] columns combine contents
## [16] dbconn dbfile dbInfo
## [19] dbmeta dbschema distance
## [22] exons exonsBy exonsByOverlaps
## [25] ExpressionSet extractUpstreamSeqs featureNames
## [28] featureNames<- fiveUTRsByTranscript genes
## [31] initialize intronsByTranscript isActiveSeq
## [34] isActiveSeq<- isNA keys
## [37] keytypes mapIds mapIdsToRanges
## [40] mappedkeys mapRangesToIds mapToTranscripts
## [43] metadata microRNAs nhit
## [46] organism promoters revmap
## [49] sample sampleNames sampleNames<-
## [52] saveDb select seqinfo
## [55] seqinfo<- seqlevels<- seqlevels0
## [58] show species storageMode
## [61] storageMode<- taxonomyId terminators
## [64] threeUTRsByTranscript transcripts transcriptsBy
## [67] transcriptsByOverlaps tRNAs updateObject
## see '?methods' for accessing help and source code
>
> seqinfo(sc_txdb)
## Seqinfo object with 17 sequences (1 circular) from sacCer3 genome:
## seqnames seqlengths isCircular genome
## chrI 230218 <NA> sacCer3
## chrII 813184 <NA> sacCer3
## chrIII 316620 <NA> sacCer3
## chrIV 1531933 <NA> sacCer3
## chrV 576874 <NA> sacCer3
## ... ... ... ...
## chrXIII 924431 <NA> sacCer3
## chrXIV 784333 <NA> sacCer3
## chrXV 1091291 <NA> sacCer3
## chrXVI 948066 <NA> sacCer3
## chrM 85779 TRUE sacCer3
> seqlevels(sc_txdb)
## [1] "chrI" "chrII" "chrIII" "chrIV" "chrV" "chrVI" "chrVII"
## [8] "chrVIII" "chrIX" "chrX" "chrXI" "chrXII" "chrXIII" "chrXIV"
## [15] "chrXV" "chrXVI" "chrM"
> seqlengths(sc_txdb)
## chrI chrII chrIII chrIV chrV chrVI chrVII chrVIII chrIX chrX
## 230218 813184 316620 1531933 576874 270161 1090940 562643 439888 745751
## chrXI chrXII chrXIII chrXIV chrXV chrXVI chrM
## 666816 1078177 924431 784333 1091291 948066 85779
>
> # two things are important (see help page ?"TxDb-class")
> # transcripts(), transcriptsBy(), transcriptsByOverlaps()
> # select-methods
> transcripts(sc_txdb)
## GRanges object with 6692 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chrI 335-649 + | 1 YAL069W
## [2] chrI 538-792 + | 2 YAL068W-A
## [3] chrI 2480-2707 + | 3 YAL067W-A
## [4] chrI 10091-10399 + | 4 YAL066W
## [5] chrI 12046-12426 + | 5 YAL064W-B
## ... ... ... ... . ... ...
## [6688] chrM 65770-66174 + | 6688 Q0182
## [6689] chrM 73758-74513 + | 6689 Q0250
## [6690] chrM 74495-75984 + | 6690 Q0255
## [6691] chrM 79213-80022 + | 6691 Q0275
## [6692] chrM 85554-85709 + | 6692 Q0297
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
>
> # how to find a gene?
> # Check the help: ?`transcriptsBy,TxDb-method`
> transcriptsBy(sc_txdb)
## GRangesList object of length 6534:
## $Q0010
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chrM 3952-4338 + | 6665 Q0010
## [2] chrM 4254-4415 + | 6666 Q0017
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
##
## $Q0032
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chrM 11667-11957 + | 6667 Q0032
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
##
## $Q0055
## GRanges object with 6 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chrM 13818-16322 + | 6668 Q0050
## [2] chrM 13818-18830 + | 6669 Q0055
## [3] chrM 13818-19996 + | 6670 Q0060
## [4] chrM 13818-21935 + | 6671 Q0065
## [5] chrM 13818-23167 + | 6672 Q0070
## [6] chrM 13818-26701 + | 6673 Q0045
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
##
## ...
## <6531 more elements>
>
> ## TO DO: how does this work exactly?
> transcriptsBy(sc_txdb, by = "gene")
## GRangesList object of length 6534:
## $Q0010
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chrM 3952-4338 + | 6665 Q0010
## [2] chrM 4254-4415 + | 6666 Q0017
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
##
## $Q0032
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chrM 11667-11957 + | 6667 Q0032
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
##
## $Q0055
## GRanges object with 6 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chrM 13818-16322 + | 6668 Q0050
## [2] chrM 13818-18830 + | 6669 Q0055
## [3] chrM 13818-19996 + | 6670 Q0060
## [4] chrM 13818-21935 + | 6671 Q0065
## [5] chrM 13818-23167 + | 6672 Q0070
## [6] chrM 13818-26701 + | 6673 Q0045
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
##
## ...
## <6531 more elements>
> transcriptsBy(sc_txdb, by = "gene") |> names() |> head()
## [1] "Q0010" "Q0032" "Q0055" "Q0075" "Q0080" "Q0085"
> transcriptsBy(sc_txdb, by = "gene") |> lengths() |> head()
## Q0010 Q0032 Q0055 Q0075 Q0080 Q0085
## 2 1 6 1 1 1
> # most of them have just one transcript but not all, e.g. YAL069W: 2 transcripts annotated for that gene
> transcriptsBy(sc_txdb, by = "gene")["YAL069W"]
## GRangesList object of length 1:
## $YAL069W
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chrI 335-649 + | 1 YAL069W
## [2] chrI 538-792 + | 2 YAL068W-A
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
>
> transcriptsBy(sc_txdb, by = "exon") |> names() |> head()
## [1] "1" "2" "3" "4" "5" "6"
> transcriptsBy(sc_txdb, by = "cds") |> names() |> head()
## [1] "1" "2" "3" "4" "5" "6"
>
> exonsBy(sc_txdb, by = "tx")
## GRangesList object of length 6692:
## $`1`
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] chrI 335-649 + | 1 <NA> 1
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
##
## $`2`
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] chrI 538-792 + | 2 <NA> 1
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
##
## $`3`
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] chrI 2480-2707 + | 3 <NA> 1
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
##
## ...
## <6689 more elements>
> exonsBy(sc_txdb, by = "gene")
## GRangesList object of length 6534:
## $Q0010
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chrM 3952-4338 + | 6992 <NA>
## [2] chrM 4254-4415 + | 6993 <NA>
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
##
## $Q0032
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chrM 11667-11957 + | 6994 <NA>
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
##
## $Q0055
## GRanges object with 13 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chrM 13818-13986 + | 6995 <NA>
## [2] chrM 13818-16322 + | 6996 <NA>
## [3] chrM 16435-16470 + | 6997 <NA>
## [4] chrM 16435-18830 + | 6998 <NA>
## [5] chrM 18954-18991 + | 6999 <NA>
## ... ... ... ... . ... ...
## [9] chrM 21995-22246 + | 7003 <NA>
## [10] chrM 21995-23167 + | 7004 <NA>
## [11] chrM 23612-23746 + | 7005 <NA>
## [12] chrM 25318-25342 + | 7008 <NA>
## [13] chrM 26229-26701 + | 7009 <NA>
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
##
## ...
## <6531 more elements>
>
> ime4 <- transcriptsBy(sc_txdb, by = "gene")[["YGL192W"]]
> ime4
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chrVII 142246-144048 + | 2136 YGL192W
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
> getSeq(sc_genome, ime4)
## DNAStringSet object of length 1:
## width seq
## [1] 1803 ATGATTAACGATAAACTAGTACATTTTCTGATCC...CAAACACTAAATAACCTATATTTTGCTCAGTAA
>
> intronsByTranscript(sc_txdb)
## GRangesList object of length 6692:
## $`1`
## GRanges object with 0 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
##
## $`2`
## GRanges object with 0 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
##
## $`3`
## GRanges object with 0 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
##
## ...
## <6689 more elements>
> intronsByTranscript(sc_txdb, use.names = TRUE)["YAL067W-A"]
## GRangesList object of length 1:
## $`YAL067W-A`
## GRanges object with 0 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
> intronsByTranscript(sc_txdb, use.names = TRUE)[2035]
## GRangesList object of length 1:
## $YFL039C
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chrVI 54378-54686 -
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
> intronsByTranscript(sc_txdb, use.names = TRUE)["YFL039C"]
## GRangesList object of length 1:
## $YFL039C
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chrVI 54378-54686 -
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
>
> transcriptsBy(sc_txdb)["YFL039C"]
## GRangesList object of length 1:
## $YFL039C
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chrVI 53260-54696 - | 2035 YFL039C
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
>
> # intronsByTranscript(sc_txdb)["YFL039C"]
>
> fiveUTRsByTranscript(sc_txdb)
## GRangesList object of length 0:
## <0 elements>
>
> # Transcripts have id numbers
> cdsBy(sc_txdb, "tx")
## GRangesList object of length 6692:
## $`1`
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | cds_id cds_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] chrI 335-649 + | 1 <NA> 1
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
##
## $`2`
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | cds_id cds_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] chrI 538-792 + | 2 <NA> 1
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
##
## $`3`
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | cds_id cds_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] chrI 2480-2707 + | 3 <NA> 1
## -------
## seqinfo: 17 sequences (1 circular) from sacCer3 genome
##
## ...
## <6689 more elements>
> names(cdsBy(sc_txdb, "tx")) |> head()
## [1] "1" "2" "3" "4" "5" "6"
> # use use.names = TRUE to get a more useful identifier:
> cdsBy(sc_txdb, "tx", use.names = TRUE) |> names() |> grep(pattern = "YFL039C")
## [1] 2035
>
> # ?id2name:
> # "this internal id was generated at the time the TxDb object was created and has
> # no meaning outside the scope of this object
> id2name(sc_txdb, feature.type = "tx")[2035]
## 2035
## "YFL039C"
>
> # some things were apparently not recorded for the yeast genome TxDb
> methods(class = "TxDb")
## [1] $ $<- annotatedDataFrameFrom
## [4] as.list asBED asGFF
## [7] assayData assayData<- cds
## [10] cdsBy cdsByOverlaps coerce
## [13] columns combine contents
## [16] dbconn dbfile dbInfo
## [19] dbmeta dbschema distance
## [22] exons exonsBy exonsByOverlaps
## [25] ExpressionSet extractUpstreamSeqs featureNames
## [28] featureNames<- fiveUTRsByTranscript genes
## [31] initialize intronsByTranscript isActiveSeq
## [34] isActiveSeq<- isNA keys
## [37] keytypes mapIds mapIdsToRanges
## [40] mappedkeys mapRangesToIds mapToTranscripts
## [43] metadata microRNAs nhit
## [46] organism promoters revmap
## [49] sample sampleNames sampleNames<-
## [52] saveDb select seqinfo
## [55] seqinfo<- seqlevels<- seqlevels0
## [58] show species storageMode
## [61] storageMode<- taxonomyId terminators
## [64] threeUTRsByTranscript transcripts transcriptsBy
## [67] transcriptsByOverlaps tRNAs updateObject
## see '?methods' for accessing help and source codeExercise: Performing DNA Sequence Motif Search with Bioconductor
It is now time to practice what we have learned so far! In the following exercise we will extract the downstream sequence of a gene and perform a motif search on it. Not all functions that you need have been covered yet, you will also need to consult the documentation: this is by design, a lot of the work when analysing data with Bioconductor is spent finding out how to do things - even for experienced programmers and bioinformaticians!
The setting: You have performed a genetic screen in Saccharomyces cerevisiae and identified several genes that are suppressed by antisense transcription. These antisense RNAs (asRNAs) originate in the 3’-intergenic region (3’-IGR) of the genes. To understand how these asRNAs are regulated you decide in a first step to check if the genes contain a TATA-binding protein (TBP) sequence motif (“TATA box”) in antisense direction relative to the coding sequence of the respective genes.
Your task: use Bioconductor packages and functions to find out if there is a TBP-binding motif in antisense direction in the 3’-IGR of the IME4 gene. Limit the search to the first 300 base pairs downstream of the gene. This will require a number of steps:
- Mapping annotations:
- As “IME4” is not a systematic gene name you first need to find out what is the systematic name of this gene. One option is to use the
org.Sc.sgd.dbpackage. Can you find out how to use it to map to the systematic name? After loading the package you can type its name in the console: read the output carefully as it gives you a hint on how to use it. Hint: You find trivial names in the “GENENAME” column and the systematic name in the “ENSEMBL” column. We will explore the usage of annotation packages a bit more at the end of the course.
Solution
> "YGL192W"- Extracting the sequence:
- Load the
TxDb.Scerevisiae.UCSC.sacCer3.sgdGenepackage. For easier typing, assign it to a variable:txdb <- TxDb.Scerevisiae.UCSC.sacCer3.sgdGene. Then check the output oftranscriptsBy(txdb). Can you guess how you could subset it to extract the genomic coordinates of the IME4 coding region? Afterwards, save it in a variableime4. - Create a new variable
ime4_downstreamthat contains the first 300 bp downstream of IME4. - Load the
Biostringsand theBSgenome.Scerevisiae.UCSC.sacCer3packages. The latter will load the yeast genome into memory, to access it you can save yourself some typing later by runningyeast_genome <- BSgenome.Scerevisiae.UCSC.sacCer3. - Then retrieve the DNA sequence of
ime4_downstreamfrom the yeast genome and convert it to its reverse complement.unlist()the result and store it in the variableime4_downstream_rc. - Tip to get help: Consult the vignettes for the GenomicRanges package (
browseVignettes("GenomicRanges")) for this exercise. Specifically, look at the introduction chapter and the HowTo chapter (section 2.13).
- Performing the motif search:
- Extract the position weight matrix for SPT15 from S. cerevisiae from the “jaspar2022” database using the
MotifDbpackage. You will need the functionquery()with the right values for theandStrings. Consult the help file and also the vignette and package description for help! At the end, you should get a MotifDb object of length 1. To then extract the position weight matrix (pwm) from this object simply extract it with list subsetting:[[1]]. - Use the
matchPWM()function with the DNAString and the PWM objects to find SPT15-binding sequences inime4_downstream_rc. Try different values for themin.scoreargument. - Bonus task 1: extract the exact sequence(s) that matched.
- Bonus task 2: visualize the position weight matrix of SPT15 using the
seqLogo()function in theseqLogopackage.
Note: After loading packages such as TxDb.Scerevisiae.UCSC.sacCer3.sgdGene a variable of the same name is created. You can copy this value to a new variable with a shorter name to save yourself some typing: txdb <- TxDb.Scerevisiae.UCSC.sacCer3.sgdGene.
Part II: Differential Gene Expression Analysis: DESeq2
The analysis of sequencing data is one of the more common types of analyses in molecular biology. In this course, we will cover the basics of differential gene expression analysis based on (bulk) RNA sequencing. At its core analyzing RNA sequencing data is about how to correctly analyze count data, which is relatively common in modern molecular biology techniques (e.g. all sequencing data but also mass spectrometry).
The theory behind RNA-seq analysis is complex and has many facets. Therefore, covering all the theoretical and statistical aspects is out of scope. Nevertheless, we will try to cover the most important concepts behind an RNA-sequencing analysis and how to run it using Bioconductor packages. This is possible because a lot of the details have been abstracted away so you can perform such analyses with just a few commands. We will use the DESeq2 package, one of the standard packages for this type of analyses. It was authored by Michael Love, Simon Anders, and Wolfgang Huber.
If you are interested in a more in-depth understanding of the topic I recommend the following resources:
- The DESeq2 vignette (
vignette(DESeq2)) and the “RNA-seq workflow” it refers to (link). - The DESeq2 paper: Love, M.I., Huber, W., Anders, S. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15:550. 10.1186/s13059-014-0550-8
- The chapters on count data from the book “Modern Statistics for Modern Biology” by Susan Holmes and Wolfang Huber. It is available for free online.
Note: parts of the following case study have been adapted from the following repository of the Software Carpentry (CC BY 4.0 license) as well as the “Modern Statistics for Modern Biology” (Holmes and Huber) book just mentioned (CC BY-NC-SA license).
The Setting: Impact of Influenza Infection on a Mouse Model of Multiple Sclerosis
We will discuss and explore the most important concepts behind RNA-seq data analysis using a dataset from a study where mice prone to autoimmune disease where infected with influenza virus to study if this can trigger the disease or exacerbate disease progression. Our RNA sequencing analysis will assume that quality control and read mapping have already been done. Our starting point is therefore an unnormalized count matrix. The dataset is taken from the following paper:
Blackmore, Stephen, et al. Influenza infection triggers disease in a genetic model of experimental autoimmune encephalomyelitis. PNAS 114.30 (2017):E6107-E6116. https://doi.org/10.1073/pnas.1620415114
For this course, I have already prepared the data you will need in the ./data directory.
First, let us load some packages that we will need in the subsequent analysis:
> library(DESeq2)
> library(SummarizedExperiment)
> library(ggplot2)
> library(tibble)Importing the Data
The original experimental design in Blackmore et al. 2017 was fairly complex: 8 week old male and female C57BL/6 mice were collected at Day 0 (before influenza infection), Day 4 and Day 8 after influenza infection. From each mouse, cerebellum and spinal cord tissues were taken for RNA-seq. There were originally 4 mice x 2 sexes x 3 time points x 2 tissue types groups (48 combinations), but a few were lost along the way resulting in a total of 45 samples. For this workshop, we are going to simplify the analysis by only using the 22 cerebellum samples. In addition to the counts per gene per sample, we also need information on which sample belongs to which Sex/Time point/Replicate. As for the genes, it is helpful to have extra information from an annotation dataset.
> # import the count matrix
> counts <- read.csv("./data/GSE96870_counts_cerebellum.csv", row.names = 1)
> dim(counts)
## [1] 41786 22
> counts[1:5, 1:5]
## GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340
## Xkr4 1891 2410 2159 1980 1977
## LOC105243853 0 0 1 4 0
## LOC105242387 204 121 110 120 172
## LOC105242467 12 5 5 5 2
## Rp1 2 2 0 3 2
>
> # import the sample annotations
> coldata <- read.csv("./data/GSE96870_coldata_cerebellum.csv")
> dim(coldata)
## [1] 22 11
> head(coldata)
## sample title geo_accession organism age sex
## 1 GSM2545336 CNS_RNA-seq_10C GSM2545336 Mus musculus 8 weeks Female
## 2 GSM2545337 CNS_RNA-seq_11C GSM2545337 Mus musculus 8 weeks Female
## 3 GSM2545338 CNS_RNA-seq_12C GSM2545338 Mus musculus 8 weeks Female
## 4 GSM2545339 CNS_RNA-seq_13C GSM2545339 Mus musculus 8 weeks Female
## 5 GSM2545340 CNS_RNA-seq_14C GSM2545340 Mus musculus 8 weeks Male
## 6 GSM2545341 CNS_RNA-seq_17C GSM2545341 Mus musculus 8 weeks Male
## infection strain time tissue mouse
## 1 InfluenzaA C57BL/6 Day8 Cerebellum 14
## 2 NonInfected C57BL/6 Day0 Cerebellum 9
## 3 NonInfected C57BL/6 Day0 Cerebellum 10
## 4 InfluenzaA C57BL/6 Day4 Cerebellum 15
## 5 InfluenzaA C57BL/6 Day4 Cerebellum 18
## 6 InfluenzaA C57BL/6 Day8 Cerebellum 6
>
> # import gene annotations
> rowranges <- read.delim("./data/GSE96870_rowranges.tsv",
+ sep = "\t",
+ colClasses = c(ENTREZID = "character"),
+ header = TRUE,
+ quote = "",
+ row.names = 5)
> dim(rowranges)
## [1] 41786 7
> head(rowranges)
## seqnames start end strand ENTREZID
## Xkr4 1 3670552 3671742 - 497097
## LOC105243853 1 3357323 3366505 + 105243853
## LOC105242387 1 3658847 3670456 - 105242387
## LOC105242467 1 4233436 4233728 - 105242467
## Rp1 1 4409170 4409241 - 19888
## Sox17 1 4496291 4497354 - 20671
## product
## Xkr4 X Kell blood group precursor related family member 4, transcript variant X1
## LOC105243853 uncharacterized LOC105243853, transcript variant X2
## LOC105242387 uncharacterized LOC105242387
## LOC105242467 lipoxygenase homology domain-containing protein 1, transcript variant X2
## Rp1 retinitis pigmentosa 1 (human), transcript variant 2
## Sox17 SRY (sex determining region Y)-box 17, transcript variant 5
## gbkey
## Xkr4 mRNA
## LOC105243853 ncRNA
## LOC105242387 ncRNA
## LOC105242467 mRNA
## Rp1 mRNA
## Sox17 mRNA
>
> # coerce to GRanges object, needed later
> rowranges <- as(rowranges, "GRanges")The SummarizedExperiment Class
When looking at the counts, coldata, and rowranges variables one can see that they serve different purposes in terms of recording measurement values or metadata. In the course of an analysis it is often desirable to keep data and metadata in sync. To do so, there is a special S4 class widely used in Bioconductor: the “SummarizedExperiment”.
The SummarizedExperiment consists of the following components:
- The assays: A matrix with genes in the rows and samples in the columns. “Sample” refers to an experimental sample (typically one sequencing library e.g. treated mouse with a certain genotype at a certain time point). The values in the matrix are the unnormalized transcript count values.
- The rowData or rowRanges: A table containing additional metadata about the genes in the assays table.
- The colData: A table containing information about the samples. This is important for building a statistical model about the count data.
The structure of SummarizedExperiment is shown in the following scheme:
Exercise: Creating a SummarizedExperiment Instance
In this exercise we will familiarize ourselves a bit with the SummarizedExperiment class. It forms the starting point for the analysis of RNA sequencing data with DESeq2 and it is also a good example of how Bioconductor provides classes that provide a meaningful abstraction for biological datasets.
- First, consult the vignette of the SummarizedExperiment package (
vignette("SummarizedExperiment")) and the help file?SummarizedExperimentto get familiar with the SummarizedExperiment class. - Using the data from the previous chunk (
counts,coldata,rowranges) as input: create an instance of the SummarizedExperiment class and save it in the variablese. Note: you will need to convert counts to a matrix first and then put it into a list (list(as.matrix(counts))). - Inspect the output of
seandsummary(se)in the console. - Use the functions
assays(),colData(),rowData()androwRanges()on yoursevariable to inspect the contents of your SummarizedExperiment instance. What do these functions show you?
- There are different options to subset a SummarizedExperiment object. Run the following code snippets:
se[, se$time == "Day0"]se[seqnames(rowRanges(se)) == "1", ]What do they do? Check the output. What happens with the assay data and other metadata after you do the subsetting?
- Save the SummarizedExperiment instance you created with the
saveRDS()function and load it back into R withreadRDS().
RNA-seq Analysis with DESeq2: An Overview
The final result of an RNA sequencing analysis is to make inferences about the effect of changing experimental factors on the average gene expression levels for every gene independently. The effect size is measured in log2-fold changes (l2fc for short). DESeq2 uses statistical models to test for the significance of the log2-fold changes and allows you to contrast the different experimental factors. Briefly, what DESeq2 does is the following:
- It normalizes the read counts with a correction factor estimated from library sizes.
- It estimates gene-wise dispersion (variability) estimates using a combination of replicate information and statistical modelling.
- These dispersion estimates make it possible to assign a p-value to the log2-fold changes observed between experimental conditions. The default option is to calculate them such that the false-discovery-rate is 10% but this can be adjusted.
RNA-seq Analysis with DESeq2 Step by Step
Step 1: Preparing the Dataset
First, we create a SummarizedExperiment instance using the data imported above in order to continue with our RNA-seq analysis:
Code
> se <- SummarizedExperiment(
+ # needed for subsequent conversion to DESeqDataSet
+ assays = list(counts = as.matrix(counts)),
+ colData = coldata,
+ rowRanges = rowranges
+ )In fact, while we can use a SummarizedExperiment instance to run the RNA-seq analysis we will use the specialized DESeqDataSet subclass because it makes some aspects of the analysis easier.
Note the design = ~ time argument in the following code chunk. Briefly, it means that DESeq2 is going to build a model that will calculate the log2-fold changes along with their p-values to compare the expression levels between time points.
> ms_dataset <- DESeqDataSet(se, design = ~ time)
> ms_dataset
## class: DESeqDataSet
## dim: 41786 22
## metadata(1): version
## assays(1): counts
## rownames(41786): Xkr4 LOC105243853 ... TrnT TrnP
## rowData names(3): ENTREZID product gbkey
## colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
## colData names(11): sample title ... tissue mouseStep 2: Normalizing the Read Counts
The first factor to take into account are differences in library sizes. Correcting for these differences is called normalization, we distinguish unnormalized (raw) read counts from normalized read counts.
In the analysis of high-throughput experiments the term “normalization” is used for different things and people often do not specify what they mean. Do not be shy to ask.
Rather than simply dividing the number of reads by the sum, DESeq2 uses something different.
Consider a dataset with 5 genes and two samples as displayed in Figure 8.1. If we estimate for each of the two samples by its sum of counts, then the slope of the blue line represents their ratio. According to this, gene C is down-regulated in sample 2 compared to sample 1, while the other genes are all somewhat up-regulated. If we now instead estimate such that their ratios correspond to the red line, then we will still conclude that gene C is down-regulated, while the other genes are unchanged. The second version is more parsimonious and is often preferred by scientists. The slope of the red line can be obtained by robust regression. This is what the DESeq2 method does.
Solution
> counts_raw <- assay(ms_dataset)
> size_factor <- estimateSizeFactorsForMatrix(counts_raw)
>
> comparison <- tibble(
+ size_factor = size_factor,
+ sum = colSums(counts_raw),
+ size_factor_from_sum = colSums(counts_raw) / mean(colSums(counts_raw))
+ )
>
> comparison
>
> ggplot(comparison, aes(x = size_factor, y = size_factor_from_sum)) +
+ geom_point() +
+ geom_abline(slope = 1) +
+ coord_fixed()
>
> norm_mat <- matrix(size_factor, byrow = TRUE,
+ ncol = ncol(counts_raw), nrow = nrow(counts_raw))
> counts_normalized <- counts_raw / norm_matLuckily, DESeq2 does a lot of the heavy lifting for us: when running an RNA-seq analysis you can simply run the following command to normalize the counts:
> ms_dataset <- estimateSizeFactors(ms_dataset)Step 3: Estimating dispersions
Next, DESeq2 estimates the variability for every gene based on the count data. A characteristic of RNA-seq datasets is that in the low expression range, the variance is proportional to the mean. For those genes, it would therefore be possible to model the variability based on a Poisson distribution, which is a distribution where the variance is equal to the mean. However, in the higher expression ranges, the variance is actually larger than the mean. Therefore, a negative binomial distribution (also known as a gamma-Poisson distribution) is applied to the data to account for this extra variability. Intuitively, what happens is that the more highly expressed genes vary more than what the Poisson distribution predicts. This increased variability is due to unaccounted sources of variance during sample preparation and the fact that these genes are subject to more complex regulatory patterns. Apart from taking the information from replicates into account, DESeq2 also uses the information from other genes in the dataset to get a model of how variance depends on the mean expression.
Solution
> day0_cols <- grep("Day0", colnames(counts_normalized))
> counts_normalized_day0 <- counts_normalized[, day0_cols]
>
> var_mean_tab <- tibble(
+ means = rowMeans(counts_normalized),
+ vars = rowVars(counts_normalized)
+ )
>
> ggplot(var_mean_tab, aes(x = log(means), y = log(vars))) +
+ geom_hex() +
+ coord_fixed(xlim = c(-5, 20), ylim = c(-5, 20)) +
+ geom_abline(slope = 1:2, color = (c("forestgreen", "red"))) +
+ NULLAgain, the previous code chunk was more to understand what is going on, a quicker way to achieve this is to simply run
> ms_dataset <- estimateDispersions(ms_dataset)Step 4: Testing for differential expression and retrieving results
This last step is done by using a version of the t-test called Wald test for the negative binomial distribution.
> ms_dataset <- nbinomWaldTest(ms_dataset)At the end, the results of the analysis can be retrieved with the results() function.
> res <- results(ms_dataset)
> res
## log2 fold change (MLE): time Day8 vs Day0
## Wald test p-value: time Day8 vs Day0
## DataFrame with 41786 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## Xkr4 1937.755565 -0.239387 0.107807 -2.220508 0.026384308
## LOC105243853 0.962588 0.518529 1.184125 0.437900 0.661458667
## LOC105242387 169.913400 0.455431 0.138380 3.291167 0.000997725
## LOC105242467 3.435073 0.288255 0.572181 0.503783 0.614414050
## Rp1 2.273456 -0.282147 0.668853 -0.421838 0.673143629
## ... ... ... ... ... ...
## TrnS2 1.76325 0.773975 0.697694 1.109333 2.67287e-01
## TrnL2 169.37651 0.499418 0.101147 4.937560 7.91063e-07
## TrnE 335.92910 0.194226 0.211981 0.916244 3.59539e-01
## TrnT 379.46721 0.418641 0.107507 3.894087 9.85692e-05
## TrnP 277.35435 0.111077 0.240277 0.462285 6.43877e-01
## padj
## <numeric>
## Xkr4 0.07721803
## LOC105243853 0.79602142
## LOC105242387 0.00494211
## LOC105242467 0.76160580
## Rp1 0.80425322
## ... ...
## TrnS2 4.44850e-01
## TrnL2 9.69265e-06
## TrnE 5.43421e-01
## TrnT 6.82194e-04
## TrnP 7.83182e-01Putting it all together: the DESeq() function
In fact, all the previous steps can be run at once using the function DESeq(), which is simply a wrapper that applies estimateSizeFactors(), estimateDispersions(), and nbinomWaldTest() one after the other. To demonstrate this, let’s construct our DESeqDataSet again and apply the DESeq() function to it:
> ms_dataset <- DESeqDataSet(se, design = ~ time)
> ms_dataset
## class: DESeqDataSet
## dim: 41786 22
## metadata(1): version
## assays(1): counts
## rownames(41786): Xkr4 LOC105243853 ... TrnT TrnP
## rowData names(3): ENTREZID product gbkey
## colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
## colData names(11): sample title ... tissue mouse
>
> ms_dataset <- DESeq(ms_dataset)
> res2 <- results(ms_dataset)
> res2
## log2 fold change (MLE): time Day8 vs Day0
## Wald test p-value: time Day8 vs Day0
## DataFrame with 41786 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## Xkr4 1937.755565 -0.239387 0.107807 -2.220508 0.026384308
## LOC105243853 0.962588 0.518529 1.184125 0.437900 0.661458667
## LOC105242387 169.913400 0.455431 0.138380 3.291167 0.000997725
## LOC105242467 3.435073 0.288255 0.572181 0.503783 0.614414050
## Rp1 2.273456 -0.282147 0.668853 -0.421838 0.673143629
## ... ... ... ... ... ...
## TrnS2 1.76325 0.773975 0.697694 1.109333 2.67287e-01
## TrnL2 169.37651 0.499418 0.101147 4.937560 7.91063e-07
## TrnE 335.92910 0.194226 0.211981 0.916244 3.59539e-01
## TrnT 379.46721 0.418641 0.107507 3.894087 9.85692e-05
## TrnP 277.35435 0.111077 0.240277 0.462285 6.43877e-01
## padj
## <numeric>
## Xkr4 0.07724757
## LOC105243853 0.79612012
## LOC105242387 0.00494419
## LOC105242467 0.76168873
## Rp1 0.80435586
## ... ...
## TrnS2 4.44882e-01
## TrnL2 9.69873e-06
## TrnE 5.43487e-01
## TrnT 6.82622e-04
## TrnP 7.83275e-01Exploring DESeq2 results
P-value histogram:
> res_df <- as.data.frame(res)
> head(res_df)
## baseMean log2FoldChange lfcSE stat pvalue
## Xkr4 1937.7555653 -0.2393865 0.10780710 -2.2205079 0.0263843077
## LOC105243853 0.9625883 0.5185287 1.18412550 0.4379002 0.6614586666
## LOC105242387 169.9134003 0.4554314 0.13837988 3.2911674 0.0009977252
## LOC105242467 3.4350732 0.2882549 0.57218095 0.5037828 0.6144140499
## Rp1 2.2734558 -0.2821471 0.66885267 -0.4218375 0.6731436288
## Sox17 251.0059632 -0.3178237 0.09472532 -3.3552132 0.0007930385
## padj
## Xkr4 0.077218031
## LOC105243853 0.796021417
## LOC105242387 0.004942109
## LOC105242467 0.761605800
## Rp1 0.804253221
## Sox17 0.004081573
>
> ggplot(res_df, aes(x = pvalue)) +
+ geom_histogram(binwidth = 0.01, boundary = 0)>
> ggplot(res_df, aes(x = padj)) +
+ geom_histogram(binwidth = 0.01, boundary = 0) +
+ scale_y_continuous(breaks = c(0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000))MA plot: mean log2-fold change vs. average expression, where the average is across both of the conditions compared.
> plotMA(res)>
> res_df$is_signif <- res_df$padj < 0.05
>
> # using ggplot2:
> ggplot(res_df, aes(x = log10(baseMean), y = log2FoldChange, colour = is_signif)) +
+ geom_point(shape = 1) +
+ scale_colour_manual(values = c("grey", "steelblue", "grey")) +
+ theme_bw()>
> # resons for NA values:
> # - low counts (i.e. exclusion by quality control)
> # - perfect separation: zero count in one of the two conditions
> # - outliers (based on Cook's distance)Two- and multi-factor analysis of the RNA-seq data
Remember our definition of the DESeq2 dataset above:
> DESeqDataSet(se, design = ~ time)And now closely inspect the output of our results object. Which factor levels are compared in the results table?
> res
## log2 fold change (MLE): time Day8 vs Day0
## Wald test p-value: time Day8 vs Day0
## DataFrame with 41786 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## Xkr4 1937.755565 -0.239387 0.107807 -2.220508 0.026384308
## LOC105243853 0.962588 0.518529 1.184125 0.437900 0.661458667
## LOC105242387 169.913400 0.455431 0.138380 3.291167 0.000997725
## LOC105242467 3.435073 0.288255 0.572181 0.503783 0.614414050
## Rp1 2.273456 -0.282147 0.668853 -0.421838 0.673143629
## ... ... ... ... ... ...
## TrnS2 1.76325 0.773975 0.697694 1.109333 2.67287e-01
## TrnL2 169.37651 0.499418 0.101147 4.937560 7.91063e-07
## TrnE 335.92910 0.194226 0.211981 0.916244 3.59539e-01
## TrnT 379.46721 0.418641 0.107507 3.894087 9.85692e-05
## TrnP 277.35435 0.111077 0.240277 0.462285 6.43877e-01
## padj
## <numeric>
## Xkr4 0.07721803
## LOC105243853 0.79602142
## LOC105242387 0.00494211
## LOC105242467 0.76160580
## Rp1 0.80425322
## ... ...
## TrnS2 4.44850e-01
## TrnL2 9.69265e-06
## TrnE 5.43421e-01
## TrnT 6.82194e-04
## TrnP 7.83182e-01What happened in the background when you ran DESeq() was that DESeq modelled gene expression levels according to the following formula:
y = \beta_0 + x_1\beta_1 + \epsilon.
This can be interpreted as follows: y is the experimental measurement of interest for a particular gene. \beta_0 is the baseline expression level. x_1 is an indicator variable that becomes 1 if the “time” variable was “Day8” or not. \beta_1 is then the average increase in expression from Day0 to Day8. Such a model is fitted for every gene independently. Of course, the expression level cannot be described to 100% by this linear model, which is why an error term, \epsilon is also added.
You can use the contrast argument to contrast other levels, for example:
> results(ms_dataset, contrast = c("time", "Day4", "Day0"))
## log2 fold change (MLE): time Day4 vs Day0
## Wald test p-value: time Day4 vs Day0
## DataFrame with 41786 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## Xkr4 1937.755565 -0.127984 0.104312 -1.226929 0.2198491
## LOC105243853 0.962588 0.703107 1.134552 0.619722 0.5354407
## LOC105242387 169.913400 0.240995 0.134806 1.787717 0.0738217
## LOC105242467 3.435073 -0.193479 0.575847 -0.335990 0.7368785
## Rp1 2.273456 0.223697 0.612173 0.365415 0.7148015
## ... ... ... ... ... ...
## TrnS2 1.76325 0.0454198 0.7271729 0.0624607 0.9501959
## TrnL2 169.37651 0.2254050 0.0993673 2.2684025 0.0233047
## TrnE 335.92910 0.0689015 0.2053937 0.3354605 0.7372778
## TrnT 379.46721 0.1853258 0.1046038 1.7716930 0.0764455
## TrnP 277.35435 -0.0962761 0.2329107 -0.4133607 0.6793424
## padj
## <numeric>
## Xkr4 0.718680
## LOC105243853 NA
## LOC105242387 0.476948
## LOC105242467 NA
## Rp1 NA
## ... ...
## TrnS2 NA
## TrnL2 0.286259
## TrnE 0.956626
## TrnT 0.481675
## TrnP 0.941030Notice that above we did not define “sex” as an experimental factor, so the following will fail:
> results(ms_dataset, contrast = c("sex", "Female", "Male"))
## Error in cleanContrast(object, contrast, expanded = isExpanded, listValues = listValues, : sex is not a factor, see ?resultsThis way of specifying the experimental design (the design argument of DESeq2) is problematic because it means that all the other experimental covariates that you know about will be “modelled” only by the error term \epsilon. So if for a particular gene male mice have a higher expression level than female ones or if its regulation is sex-specific this difference will be modelled as random variation. In the best case, this reduces the power of your statistical test, in the worst case your results will suffer from confounding and differences may be wrongly attributed to the change of an experimental factor when in fact they are the result of a confounding variable.
To make this more specific, let’s have a look at how the log2-fold change is actually calculated when using design = ~ time:
> ms_dataset_df <- as_tibble(assay(ms_dataset), rownames = "gene")
> ms_dataset_df
## # A tibble: 41,786 × 23
## gene GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341
## <chr> <int> <int> <int> <int> <int> <int>
## 1 Xkr4 1891 2410 2159 1980 1977 1945
## 2 LOC1052438… 0 0 1 4 0 0
## 3 LOC1052423… 204 121 110 120 172 173
## 4 LOC1052424… 12 5 5 5 2 6
## 5 Rp1 2 2 0 3 2 1
## 6 Sox17 251 239 218 220 261 232
## 7 Gm7357 3 0 1 2 1 2
## 8 LOC1052438… 0 3 5 0 2 0
## 9 LOC1052438… 0 0 0 0 0 0
## 10 Gm7369 0 1 0 1 0 0
## # ℹ 41,776 more rows
## # ℹ 16 more variables: GSM2545342 <int>, GSM2545343 <int>, GSM2545344 <int>,
## # GSM2545345 <int>, GSM2545346 <int>, GSM2545347 <int>, GSM2545348 <int>,
## # GSM2545349 <int>, GSM2545350 <int>, GSM2545351 <int>, GSM2545352 <int>,
## # GSM2545353 <int>, GSM2545354 <int>, GSM2545362 <int>, GSM2545363 <int>,
## # GSM2545380 <int>
>
> colData(ms_dataset)
## DataFrame with 22 rows and 13 columns
## sample title geo_accession organism age
## <character> <character> <character> <character> <character>
## GSM2545336 GSM2545336 CNS_RNA-seq_10C GSM2545336 Mus musculus 8 weeks
## GSM2545337 GSM2545337 CNS_RNA-seq_11C GSM2545337 Mus musculus 8 weeks
## GSM2545338 GSM2545338 CNS_RNA-seq_12C GSM2545338 Mus musculus 8 weeks
## GSM2545339 GSM2545339 CNS_RNA-seq_13C GSM2545339 Mus musculus 8 weeks
## GSM2545340 GSM2545340 CNS_RNA-seq_14C GSM2545340 Mus musculus 8 weeks
## ... ... ... ... ... ...
## GSM2545353 GSM2545353 CNS_RNA-seq_3C GSM2545353 Mus musculus 8 weeks
## GSM2545354 GSM2545354 CNS_RNA-seq_4C GSM2545354 Mus musculus 8 weeks
## GSM2545362 GSM2545362 CNS_RNA-seq_5C GSM2545362 Mus musculus 8 weeks
## GSM2545363 GSM2545363 CNS_RNA-seq_6C GSM2545363 Mus musculus 8 weeks
## GSM2545380 GSM2545380 CNS_RNA-seq_9C GSM2545380 Mus musculus 8 weeks
## sex infection strain time tissue mouse
## <character> <character> <character> <factor> <character> <integer>
## GSM2545336 Female InfluenzaA C57BL/6 Day8 Cerebellum 14
## GSM2545337 Female NonInfected C57BL/6 Day0 Cerebellum 9
## GSM2545338 Female NonInfected C57BL/6 Day0 Cerebellum 10
## GSM2545339 Female InfluenzaA C57BL/6 Day4 Cerebellum 15
## GSM2545340 Male InfluenzaA C57BL/6 Day4 Cerebellum 18
## ... ... ... ... ... ... ...
## GSM2545353 Female NonInfected C57BL/6 Day0 Cerebellum 4
## GSM2545354 Male NonInfected C57BL/6 Day0 Cerebellum 2
## GSM2545362 Female InfluenzaA C57BL/6 Day4 Cerebellum 20
## GSM2545363 Male InfluenzaA C57BL/6 Day4 Cerebellum 12
## GSM2545380 Female InfluenzaA C57BL/6 Day8 Cerebellum 19
## sizeFactor replaceable
## <numeric> <logical>
## GSM2545336 1.134369 TRUE
## GSM2545337 0.909434 TRUE
## GSM2545338 0.856613 TRUE
## GSM2545339 0.914962 TRUE
## GSM2545340 0.929113 TRUE
## ... ... ...
## GSM2545353 1.10654 TRUE
## GSM2545354 0.97923 TRUE
## GSM2545362 1.08336 TRUE
## GSM2545363 1.00029 TRUE
## GSM2545380 1.13368 TRUE
> day8_cols <- colData(ms_dataset)$sample[colData(ms_dataset)$time == "Day8"]
> day0_cols <- colData(ms_dataset)$sample[colData(ms_dataset)$time == "Day0"]
>
> my_gene <- "Xkr4"
>
> # average count day8:
> ratio <- mean(counts_normalized[my_gene, day8_cols]) / mean(counts_normalized[my_gene, day0_cols])
> log2(ratio)
## [1] -0.2393478
>
> # which happens to be the value you see in the results table:
> res
## log2 fold change (MLE): time Day8 vs Day0
## Wald test p-value: time Day8 vs Day0
## DataFrame with 41786 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## Xkr4 1937.755565 -0.239387 0.107807 -2.220508 0.026384308
## LOC105243853 0.962588 0.518529 1.184125 0.437900 0.661458667
## LOC105242387 169.913400 0.455431 0.138380 3.291167 0.000997725
## LOC105242467 3.435073 0.288255 0.572181 0.503783 0.614414050
## Rp1 2.273456 -0.282147 0.668853 -0.421838 0.673143629
## ... ... ... ... ... ...
## TrnS2 1.76325 0.773975 0.697694 1.109333 2.67287e-01
## TrnL2 169.37651 0.499418 0.101147 4.937560 7.91063e-07
## TrnE 335.92910 0.194226 0.211981 0.916244 3.59539e-01
## TrnT 379.46721 0.418641 0.107507 3.894087 9.85692e-05
## TrnP 277.35435 0.111077 0.240277 0.462285 6.43877e-01
## padj
## <numeric>
## Xkr4 0.07721803
## LOC105243853 0.79602142
## LOC105242387 0.00494211
## LOC105242467 0.76160580
## Rp1 0.80425322
## ... ...
## TrnS2 4.44850e-01
## TrnL2 9.69265e-06
## TrnE 5.43421e-01
## TrnT 6.82194e-04
## TrnP 7.83182e-01Solution
> # colData(se)$mouse <- factor(colData(se)$mouse)
> ms_dataset_2factor <- DESeqDataSet(se, design = ~ sex + time)
> ms_dataset_2factor <- DESeq(ms_dataset_2factor)
> res_2factor <- results(ms_dataset_2factor)
> res_2factorSolution
> comp_tab <- tibble(
+ pvals_onefactor = res$padj,
+ pvals_twofactor = res_2factor$padj
+ )
>
>
> ggplot(comp_tab, aes(x = pvals_onefactor, y = pvals_twofactor)) +
+ geom_hex(bins = 75) +
+ #geom_point() +
+ coord_fixed() +
+ xlab("Single factor analysis (condition)") +
+ ylab("Two factor analysis (type + condition)") +
+ geom_abline(col = "orange")
>
> sum(res$padj < 0.1, na.rm = TRUE)
> sum(res_2factor$padj < 0.1, na.rm = TRUE)
>
> mean(res$padj < 0.1, na.rm = TRUE)
> mean(res_2factor$padj < 0.1, na.rm = TRUE)
>
> plotMA(res)
> plotMA(res_2factor)Part III: Using Annotation Resources
Working with annotations may not be the most exciting part of bioinformatics but a very important one. Instead of spending hours clicking your way through websites you can solve these problems programmatically with Bioconductor! Note also: when you are on a website with annotation resources, check if there is a section on API access if you are interested in programmatic access.
Overview of Bioconductor Annotation Resources
| Resource | Description | Use Cases |
|---|---|---|
| Annotation Packages | Provides comprehensive annotation data for various organisms, e.g., org.Hs.eg.db for humans. |
Gene annotation, mapping IDs, retrieving functions, linking genes to pathways. |
| Biostrings & BSgenome | Infrastructure for handling biological sequences and accessing full genome sequences, e.g. BSgenome.Scerevisiae.UCSC.sacCer3. |
Extracting and analyzing nucleotide/protein sequences, linking to annotations. |
| GenomicFeatures | Create, manipulate, and annotate genomic ranges, including transcript databases from various sources, e.g. TxDb.Hsapiens.UCSC.hg19.knownGene. |
Creating TxDb objects to find e.g. gene, transcript, or exon information. |
| AnnotationHub | Resource for discovering and using a variety of genomic annotation resources from different sources such as UCSC or Ensembl. | Finding up-to-date annotations, integrating data into analyses. |
| biomaRt | Interface to the BioMart system for retrieving genomic annotations from Ensembl and other databases. | Obtaining annotations like gene coordinates, names, homologs, variant info. |
| rtracklayer | Facilities for importing and exporting genome annotations in various formats, such as GFF, BED, and WIG. | Working with track-based data, integrating external annotations. |
| GO.db & KEGG.db | Access to Gene Ontology (GO) terms and KEGG pathways for annotating genes with functional and pathway information. | Functional enrichment analysis, identifying biological themes. |
org.db and TxDb packages
This tends to be useful to link annotation IDs to each other, for example to map between trivial and systematic gene names or to find out which Uniprot IDs are linked to a particular gene. One advantage of retrieving annotations programmatically rather than manually is that you have a record of which database or genome assembly version was used.
> library(org.Hs.eg.db)
>
> org <- org.Hs.eg.db
> org
## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
## | DBSCHEMA: HUMAN_DB
## | ORGANISM: Homo sapiens
## | SPECIES: Human
## | EGSOURCEDATE: 2025-Feb22
## | EGSOURCENAME: Entrez Gene
## | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | CENTRALID: EG
## | TAXID: 9606
## | GOSOURCENAME: Gene Ontology
## | GOSOURCEURL: http://current.geneontology.org/ontology/go-basic.obo
## | GOSOURCEDATE: 2025-02-06
## | GOEGSOURCEDATE: 2025-Feb22
## | GOEGSOURCENAME: Entrez Gene
## | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | KEGGSOURCENAME: KEGG GENOME
## | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
## | KEGGSOURCEDATE: 2011-Mar15
## | GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
## | GPSOURCEURL: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/database
## | GPSOURCEDATE: UTC-Mar1
## | ENSOURCEDATE: 2024-Oct17
## | ENSOURCENAME: Ensembl
## | ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
## | UPSOURCENAME: Uniprot
## | UPSOURCEURL: http://www.UniProt.org/
## | UPSOURCEDATE: Sat Mar 1 19:14:02 2025Let us assume you want to find out which Uniprot IDs exist for the “EGFR” gene. You can solve many of these questions using the so-called “select interface”:
> gene <- "EGFR"
>
> columns(org)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
## [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [26] "UNIPROT"
> keytypes(org)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
## [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [26] "UNIPROT"
>
> # by default it will return the primary keys for the database but if used with the keytype argument,
> # it will return the keys from that keytype
>
> head(keys(org))
## [1] "1" "2" "3" "9" "10" "11"
>
> head(keys(org, keytype = "GENENAME"))
## [1] "alpha-1-B glycoprotein" "alpha-2-macroglobulin"
## [3] "alpha-2-macroglobulin pseudogene 1" "N-acetyltransferase 1"
## [5] "N-acetyltransferase 2" "N-acetyltransferase pseudogene"
> head(keys(org, keytype = "ENTREZID"))
## [1] "1" "2" "3" "9" "10" "11"
> head(keys(org, keytype = "SYMBOL"))
## [1] "A1BG" "A2M" "A2MP1" "NAT1" "NAT2" "NATP"
> grep(keys(org, keytype = "SYMBOL"), pattern = gene)
## [1] 1553 36154
>
> keys(org, keytype = "SYMBOL")[grep(keys(org, keytype = "SYMBOL"), pattern = gene)]
## [1] "EGFR" "EGFR-AS1"
>
> # explicitly calling select from AnnotationDbi to prevent problems of name masking
> # (when a function of the same name from another package masks the intended function name)
> AnnotationDbi::select(org, keys = gene, keytype = "SYMBOL",
+ columns = c("SYMBOL", "ALIAS", "UNIPROT")) |>
+ head()
## SYMBOL ALIAS UNIPROT
## 1 EGFR ERBB F2YGG7
## 2 EGFR ERBB Q504U8
## 3 EGFR ERBB A9CB80
## 4 EGFR ERBB E7BSV0
## 5 EGFR ERBB B7Z2I3
## 6 EGFR ERBB C9JYS6
>
> AnnotationDbi::select(org, keys = gene, keytype = "SYMBOL",
+ columns = c("SYMBOL", "ENSEMBL"))
## SYMBOL ENSEMBL
## 1 EGFR ENSG00000146648
>
> AnnotationDbi::select(org, keys = "ENSG00000146648", keytype = "ENSEMBL",
+ columns = c("GO")) |>
+ head()
## ENSEMBL GO EVIDENCE ONTOLOGY
## 1 ENSG00000146648 GO:0000139 IEA CC
## 2 ENSG00000146648 GO:0000165 IEA BP
## 3 ENSG00000146648 GO:0001503 NAS BP
## 4 ENSG00000146648 GO:0001618 IEA MF
## 5 ENSG00000146648 GO:0001892 IEA BP
## 6 ENSG00000146648 GO:0001934 IDA BP
>
> search_res <- AnnotationDbi::select(org, keys = gene, keytype = "SYMBOL",
+ columns = c("ENTREZID"))
>
> gene_id <- search_res$ENTREZID
>
> AnnotationDbi::select(org, keys = gene, keytype = "SYMBOL",
+ columns = c("GO", "ONTOLOGY")) |>
+ head()
## SYMBOL GO EVIDENCE ONTOLOGY
## 1 EGFR GO:0000139 IEA CC
## 2 EGFR GO:0000165 IEA BP
## 3 EGFR GO:0001503 NAS BP
## 4 EGFR GO:0001618 IEA MF
## 5 EGFR GO:0001892 IEA BP
## 6 EGFR GO:0001934 IDA BP
>
>
> # ENTREZID is 1956 (GENEID in subsequent searches)The retrieved IDs can be useful for subsetting other annotation packages. For example, the ENTREZID for our EGFR was 1956. We can use this to subset the corresponding transcript from TxDb.Hsapiens.UCSC.hg38.knownGene.
> library(TxDb.Hsapiens.UCSC.hg38.knownGene)
>
> txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
> txdb
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg38
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # UCSC Table: knownGene
## # UCSC Track: GENCODE V47
## # Resource URL: https://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: NA
## # Nb of transcripts: 412034
## # Db created by: txdbmaker package from Bioconductor
## # Creation time: 2025-03-02 02:45:03 +0000 (Sun, 02 Mar 2025)
## # txdbmaker version at creation time: 1.3.1
## # RSQLite version at creation time: 2.3.9
## # DBSCHEMAVERSION: 1.2
>
> egfr_region <- genes(txdb, filter = list(GENEID = 1956))
> egfr_region
## GRanges object with 1 range and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## 1956 chr7 55019017-55211628 + | 1956
## -------
## seqinfo: 711 sequences (1 circular) from hg38 genome
> exons(txdb, filter = list(GENEID = 1956))
## GRanges object with 53 ranges and 1 metadata column:
## seqnames ranges strand | exon_id
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr7 55019017-55019365 + | 340477
## [2] chr7 55019021-55019365 + | 340478
## [3] chr7 55019032-55019365 + | 340479
## [4] chr7 55019034-55019365 + | 340480
## [5] chr7 55019088-55019365 + | 340481
## ... ... ... ... . ...
## [49] chr7 55205256-55206016 + | 340525
## [50] chr7 55205256-55208067 + | 340526
## [51] chr7 55205256-55211536 + | 340527
## [52] chr7 55205256-55211628 + | 340528
## [53] chr7 55205347-55205865 + | 340529
## -------
## seqinfo: 711 sequences (1 circular) from hg38 genomeFinding Regulatory Regions with AnnotationHub
AnnotationHub allows you to access annotations from several sources. From the description file of the package: “This package provides a client for the Bioconductor AnnotationHub web resource. The AnnotationHub web resource provides a central location where genomic files (e.g., VCF, bed, wig) and other resources from standard locations (e.g., UCSC, Ensembl) can be discovered.”
> library(AnnotationHub)
>
> ah <- AnnotationHub()
> ah
## AnnotationHub with 70637 records
## # snapshotDate(): 2025-04-08
## # $dataprovider: Ensembl, BroadInstitute, UCSC, ftp://ftp.ncbi.nlm.nih.gov/g...
## # $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Rattus norv...
## # $rdataclass: GRanges, TwoBitFile, BigWigFile, EnsDb, Rle, OrgDb, ChainFile...
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## # rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH5012"]]'
##
## title
## AH5012 | Chromosome Band
## AH5013 | STS Markers
## AH5014 | FISH Clones
## AH5015 | Recomb Rate
## AH5016 | ENCODE Pilot
## ... ...
## AH121712 | Data.table for PubMed Author Information
## AH121713 | Data.table for PMC
## AH121714 | Data.table for MeSH (Descriptor)
## AH121715 | Data.table for MeSH (Qualifier)
## AH121716 | Data.table for MeSH (SCR)
>
> # Discover records in a hub using mcols(), query(), subset(), and [.
> mcols(ah)
## DataFrame with 70637 rows and 15 columns
## title dataprovider species taxonomyid
## <character> <character> <character> <integer>
## AH5012 Chromosome Band UCSC Homo sapiens 9606
## AH5013 STS Markers UCSC Homo sapiens 9606
## AH5014 FISH Clones UCSC Homo sapiens 9606
## AH5015 Recomb Rate UCSC Homo sapiens 9606
## AH5016 ENCODE Pilot UCSC Homo sapiens 9606
## ... ... ... ... ...
## AH121712 Data.table for PubMe.. NCBI NA NA
## AH121713 Data.table for PMC NCBI NA NA
## AH121714 Data.table for MeSH .. NCBI NA NA
## AH121715 Data.table for MeSH .. NCBI NA NA
## AH121716 Data.table for MeSH .. NCBI NA NA
## genome description coordinate_1_based
## <character> <character> <integer>
## AH5012 hg19 GRanges object from .. 1
## AH5013 hg19 GRanges object from .. 1
## AH5014 hg19 GRanges object from .. 1
## AH5015 hg19 GRanges object from .. 1
## AH5016 hg19 GRanges object from .. 1
## ... ... ... ...
## AH121712 NA Correspondence table.. 1
## AH121713 NA Correspondence table.. 1
## AH121714 NA Correspondence table.. 1
## AH121715 NA Correspondence table.. 1
## AH121716 NA Correspondence table.. 1
## maintainer rdatadateadded preparerclass
## <character> <character> <character>
## AH5012 Marc Carlson <mcarls.. 2013-03-26 UCSCFullTrackImportP..
## AH5013 Marc Carlson <mcarls.. 2013-03-26 UCSCFullTrackImportP..
## AH5014 Marc Carlson <mcarls.. 2013-03-26 UCSCFullTrackImportP..
## AH5015 Marc Carlson <mcarls.. 2013-03-26 UCSCFullTrackImportP..
## AH5016 Marc Carlson <mcarls.. 2013-03-26 UCSCFullTrackImportP..
## ... ... ... ...
## AH121712 Koki Tsuyuzaki <k.t... 2025-04-08 AHPubMedDbs
## AH121713 Koki Tsuyuzaki <k.t... 2025-04-08 AHPubMedDbs
## AH121714 Koki Tsuyuzaki <k.t... 2025-04-08 AHPubMedDbs
## AH121715 Koki Tsuyuzaki <k.t... 2025-04-08 AHPubMedDbs
## AH121716 Koki Tsuyuzaki <k.t... 2025-04-08 AHPubMedDbs
## tags rdataclass rdatapath
## <AsIs> <character> <character>
## AH5012 cytoBand,UCSC,track,... GRanges goldenpath/hg19/data..
## AH5013 stsMap,UCSC,track,... GRanges goldenpath/hg19/data..
## AH5014 fishClones,UCSC,track,... GRanges goldenpath/hg19/data..
## AH5015 recombRate,UCSC,track,... GRanges goldenpath/hg19/data..
## AH5016 encodeRegions,UCSC,track,... GRanges goldenpath/hg19/data..
## ... ... ... ...
## AH121712 data.table,NCBI,PubMed,... data.table AHPubMedDbs/v009/aut..
## AH121713 data.table,NCBI,PMC,... data.table AHPubMedDbs/v009/pmc..
## AH121714 data.table,Descriptor,MeSH,... data.table AHPubMedDbs/v009/des..
## AH121715 data.table,MeSH,NCBI,... data.table AHPubMedDbs/v009/qua..
## AH121716 data.table,MeSH,NCBI,... data.table AHPubMedDbs/v009/scr..
## sourceurl sourcetype
## <character> <character>
## AH5012 rtracklayer://hgdown.. UCSC track
## AH5013 rtracklayer://hgdown.. UCSC track
## AH5014 rtracklayer://hgdown.. UCSC track
## AH5015 rtracklayer://hgdown.. UCSC track
## AH5016 rtracklayer://hgdown.. UCSC track
## ... ... ...
## AH121712 https://github.com/r.. XML
## AH121713 https://github.com/r.. XML
## AH121714 https://github.com/r.. XML
## AH121715 https://github.com/r.. XML
## AH121716 https://github.com/r.. XMLSolution
> regulatory_regions <- query(ah, c("enhancer", "hg38"))
> ah["AH116724"]
## AnnotationHub with 1 record
## # snapshotDate(): 2025-04-08
## # names(): AH116724
## # $dataprovider: NA
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # $rdatadateadded: 2024-04-29
## # $title: TENET_consensus_enhancer_regions
## # $description: A composite GRanges object containing regions of putative en...
## # $taxonomyid: 9606
## # $genome: hg38
## # $sourcetype: Multiple
## # $sourceurl: https://github.com/rhielab/TENET.AnnotationHub/raw/devel/data-...
## # $sourcesize: NA
## # $tags: c("EpigenomeRoadMap", "FANTOM5", "GenomicSequence",
## # "Homo_sapiens", "TENET")
## # retrieve record with 'object[["AH116724"]]'
> mcols(ah["AH116724"])
## DataFrame with 1 row and 15 columns
## title dataprovider species taxonomyid
## <character> <character> <character> <integer>
## AH116724 TENET_consensus_enha.. NA Homo sapiens 9606
## genome description coordinate_1_based
## <character> <character> <integer>
## AH116724 hg38 A composite GRanges .. 1
## maintainer rdatadateadded preparerclass
## <character> <character> <character>
## AH116724 Rhie Lab at the Univ.. 2024-04-29 TENET.AnnotationHub
## tags rdataclass
## <AsIs> <character>
## AH116724 EpigenomeRoadMap,FANTOM5,GenomicSequence,... GRanges
## rdatapath sourceurl sourcetype
## <character> <character> <character>
## AH116724 TENET.AnnotationHub/.. https://github.com/r.. Multiple
> ah["AH116724"]$description
## [1] "A composite GRanges object containing regions of putative enhancer elements from a variety of sources, primarily for use in the TENET Bioconductor package. This dataset is composed of regions of strong enhancers as annotated by the Roadmap Epigenomics ChromHMM expanded 18-state model based on 98 reference epigenomes, lifted over to the hg38 genome, as well as regions of human permissive enhancers identified by the FANTOM5 project. For additional information on component datasets, see the manifest file hosted at https://github.com/rhielab/TENET.AnnotationHub/blob/devel/data-raw/TENET_consensus_datasets_manifest.tsv"
>
> enhancers <- regulatory_regions[["AH116724"]]
> enhancers
## GRanges object with 403602 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 10001-10400 *
## [2] chr1 14801-15200 *
## [3] chr1 16001-16600 *
## [4] chr1 20001-20400 *
## [5] chr1 79001-79800 *
## ... ... ... ...
## [403598] chrY 56856054-56856253 *
## [403599] chrY 56873201-56873229 *
## [403600] chrY 56873666-56873879 *
## [403601] chrY 56879705-56879876 *
## [403602] chrM 1-16398 *
## -------
## seqinfo: 25 sequences from an unspecified genome; no seqlengths
>
> subsetByOverlaps(enhancers, egfr_region)
## GRanges object with 10 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr7 55020308-55035707 *
## [2] chr7 55036708-55114507 *
## [3] chr7 55115308-55162707 *
## [4] chr7 55163708-55164107 *
## [5] chr7 55165308-55165707 *
## [6] chr7 55166708-55167707 *
## [7] chr7 55167766-55167997 *
## [8] chr7 55168908-55169507 *
## [9] chr7 55169708-55195707 *
## [10] chr7 55198308-55210907 *
## -------
## seqinfo: 25 sequences from an unspecified genome; no seqlengths
>
>
> # Could have done the same thing to find histone modifications:
>
> histone_mods <- query(ah, c("H3K27ac", "hg38"))
> histone_mods
## AnnotationHub with 3 records
## # snapshotDate(): 2025-04-08
## # $dataprovider: ENCODE, NA
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## # rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH116721"]]'
##
## title
## AH116721 | TENET_10_cancer_panel_enhancer_regions
## AH116727 | ENCODE_dELS_regions
## AH116728 | ENCODE_pELS_regions
>
> histone_data <- histone_mods[["AH116727"]]
> histone_data
## GRanges object with 786756 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 271227-271468 *
## [2] chr1 274330-274481 *
## [3] chr1 605331-605668 *
## [4] chr1 727122-727350 *
## [5] chr1 807737-807916 *
## ... ... ... ...
## [786752] chrY 26319537-26319816 *
## [786753] chrY 26319848-26320155 *
## [786754] chrY 26363541-26363721 *
## [786755] chrY 26670949-26671287 *
## [786756] chrY 26671293-26671478 *
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
>
>
> # Find overlaps with EGFR gene region
> histone_in_egfr <- subsetByOverlaps(histone_data, egfr_region)
>
>
> # Display histone modification data
> histone_in_egfr
## GRanges object with 201 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr7 55020968-55021132 *
## [2] chr7 55021188-55021347 *
## [3] chr7 55021444-55021635 *
## [4] chr7 55021651-55021842 *
## [5] chr7 55021954-55022284 *
## ... ... ... ...
## [197] chr7 55205299-55205450 *
## [198] chr7 55206264-55206550 *
## [199] chr7 55207491-55207780 *
## [200] chr7 55209422-55209744 *
## [201] chr7 55210237-55210526 *
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
>
>
> # We can also retrieve more richly annotated resources:
> # 77 refers to the version number of the ENSEMBL release
> query(ah, c("GTF", "77", "Ensembl", "Homo sapiens"))
## AnnotationHub with 1 record
## # snapshotDate(): 2025-04-08
## # names(): AH28812
## # $dataprovider: Ensembl
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # $rdatadateadded: 2015-03-25
## # $title: Homo_sapiens.GRCh38.77.gtf
## # $description: Gene Annotation for Homo sapiens
## # $taxonomyid: 9606
## # $genome: GRCh38
## # $sourcetype: GTF
## # $sourceurl: ftp://ftp.ensembl.org/pub/release-77/gtf/homo_sapiens/Homo_sap...
## # $sourcesize: 44454526
## # $tags: c("GTF", "ensembl", "Gene", "Transcript", "Annotation")
## # retrieve record with 'object[["AH28812"]]'
> ah[["AH28812"]]
## GRanges object with 2672001 ranges and 21 metadata columns:
## seqnames ranges strand | source type score
## <Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
## [1] 1 11869-14409 + | havana gene NA
## [2] 1 11869-14409 + | havana transcript NA
## [3] 1 11869-12227 + | havana exon NA
## [4] 1 12613-12721 + | havana exon NA
## [5] 1 13221-14409 + | havana exon NA
## ... ... ... ... . ... ... ...
## [2671997] MT 15888-15953 + | ensembl transcript NA
## [2671998] MT 15888-15953 + | ensembl exon NA
## [2671999] MT 15956-16023 - | insdc gene NA
## [2672000] MT 15956-16023 - | ensembl transcript NA
## [2672001] MT 15956-16023 - | ensembl exon NA
## phase gene_id gene_version gene_name gene_source
## <integer> <character> <numeric> <character> <character>
## [1] <NA> ENSG00000223972 5 DDX11L1 havana
## [2] <NA> ENSG00000223972 5 DDX11L1 havana
## [3] <NA> ENSG00000223972 5 DDX11L1 havana
## [4] <NA> ENSG00000223972 5 DDX11L1 havana
## [5] <NA> ENSG00000223972 5 DDX11L1 havana
## ... ... ... ... ... ...
## [2671997] <NA> ENSG00000210195 2 MT-TT insdc
## [2671998] <NA> ENSG00000210195 2 MT-TT insdc
## [2671999] <NA> ENSG00000210196 2 MT-TP insdc
## [2672000] <NA> ENSG00000210196 2 MT-TP insdc
## [2672001] <NA> ENSG00000210196 2 MT-TP insdc
## gene_biotype transcript_id transcript_version
## <character> <character> <numeric>
## [1] transcribed_unproces.. <NA> NA
## [2] transcribed_unproces.. ENST00000456328 2
## [3] transcribed_unproces.. ENST00000456328 2
## [4] transcribed_unproces.. ENST00000456328 2
## [5] transcribed_unproces.. ENST00000456328 2
## ... ... ... ...
## [2671997] Mt_tRNA ENST00000387460 2
## [2671998] Mt_tRNA ENST00000387460 2
## [2671999] Mt_tRNA <NA> NA
## [2672000] Mt_tRNA ENST00000387461 2
## [2672001] Mt_tRNA ENST00000387461 2
## transcript_name transcript_source transcript_biotype exon_number
## <character> <character> <character> <numeric>
## [1] <NA> <NA> <NA> NA
## [2] DDX11L1-002 havana processed_transcript NA
## [3] DDX11L1-002 havana processed_transcript 1
## [4] DDX11L1-002 havana processed_transcript 2
## [5] DDX11L1-002 havana processed_transcript 3
## ... ... ... ... ...
## [2671997] MT-TT-201 ensembl Mt_tRNA NA
## [2671998] MT-TT-201 ensembl Mt_tRNA 1
## [2671999] <NA> <NA> <NA> NA
## [2672000] MT-TP-201 ensembl Mt_tRNA NA
## [2672001] MT-TP-201 ensembl Mt_tRNA 1
## exon_id exon_version tag ccds_id protein_id
## <character> <numeric> <character> <character> <character>
## [1] <NA> NA <NA> <NA> <NA>
## [2] <NA> NA <NA> <NA> <NA>
## [3] ENSE00002234944 1 <NA> <NA> <NA>
## [4] ENSE00003582793 1 <NA> <NA> <NA>
## [5] ENSE00002312635 1 <NA> <NA> <NA>
## ... ... ... ... ... ...
## [2671997] <NA> NA <NA> <NA> <NA>
## [2671998] ENSE00001544475 2 <NA> <NA> <NA>
## [2671999] <NA> NA <NA> <NA> <NA>
## [2672000] <NA> NA <NA> <NA> <NA>
## [2672001] ENSE00001544473 2 <NA> <NA> <NA>
## protein_version
## <numeric>
## [1] NA
## [2] NA
## [3] NA
## [4] NA
## [5] NA
## ... ...
## [2671997] NA
## [2671998] NA
## [2671999] NA
## [2672000] NA
## [2672001] NA
## -------
## seqinfo: 270 sequences (1 circular) from GRCh38 genomeThe GO.db package allows retrieving pathway and GO term and gene function information:
> library(GO.db)
> columns(GO.db)
## [1] "DEFINITION" "GOID" "ONTOLOGY" "TERM"
>
> go_terms <- c("GO:0097421", "GO:0097489")
> go_terms
## [1] "GO:0097421" "GO:0097489"
>
> AnnotationDbi::select(GO.db, keys = go_terms, keytype = "GOID", columns = c("TERM", "ONTOLOGY"))
## GOID TERM ONTOLOGY
## 1 GO:0097421 liver regeneration BP
## 2 GO:0097489 multivesicular body, internal vesicle lumen CCSession Info
> sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.4.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/Berlin
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] GO.db_3.21.0
## [2] AnnotationHub_3.16.0
## [3] BiocFileCache_2.16.0
## [4] dbplyr_2.5.0
## [5] TxDb.Hsapiens.UCSC.hg38.knownGene_3.21.0
## [6] org.Hs.eg.db_3.21.0
## [7] tibble_3.2.1
## [8] ggplot2_3.5.2
## [9] DESeq2_1.48.1
## [10] SummarizedExperiment_1.38.1
## [11] MatrixGenerics_1.20.0
## [12] matrixStats_1.5.0
## [13] TxDb.Scerevisiae.UCSC.sacCer3.sgdGene_3.2.2
## [14] GenomicFeatures_1.60.0
## [15] AnnotationDbi_1.70.0
## [16] Biobase_2.68.0
## [17] BSgenome.Scerevisiae.UCSC.sacCer3_1.4.0
## [18] BSgenome_1.76.0
## [19] rtracklayer_1.68.0
## [20] BiocIO_1.18.0
## [21] GenomicRanges_1.60.0
## [22] Biostrings_2.76.0
## [23] GenomeInfoDb_1.44.0
## [24] XVector_0.48.0
## [25] IRanges_2.42.0
## [26] S4Vectors_0.46.0
## [27] BiocGenerics_0.54.0
## [28] generics_0.1.4
## [29] BiocManager_1.30.25
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 dplyr_1.1.4 farver_2.1.2
## [4] blob_1.2.4 filelock_1.0.3 bitops_1.0-9
## [7] fastmap_1.2.0 RCurl_1.98-1.17 GenomicAlignments_1.44.0
## [10] XML_3.99-0.18 digest_0.6.37 mime_0.13
## [13] lifecycle_1.0.4 KEGGREST_1.48.0 RSQLite_2.3.11
## [16] magrittr_2.0.3 compiler_4.5.0 rlang_1.1.6
## [19] tools_4.5.0 utf8_1.2.5 yaml_2.3.10
## [22] knitr_1.50 labeling_0.4.3 S4Arrays_1.8.0
## [25] htmlwidgets_1.6.4 bit_4.6.0 curl_6.2.2
## [28] DelayedArray_0.34.1 RColorBrewer_1.1-3 abind_1.4-8
## [31] BiocParallel_1.42.0 purrr_1.0.4 withr_3.0.2
## [34] grid_4.5.0 scales_1.4.0 cli_3.6.5
## [37] rmarkdown_2.29 crayon_1.5.3 rstudioapi_0.17.1
## [40] httr_1.4.7 rjson_0.2.23 DBI_1.2.3
## [43] cachem_1.1.0 parallel_4.5.0 restfulr_0.0.15
## [46] vctrs_0.6.5 Matrix_1.7-3 jsonlite_2.0.0
## [49] bit64_4.6.0-1 locfit_1.5-9.12 glue_1.8.0
## [52] codetools_0.2-20 gtable_0.3.6 BiocVersion_3.21.1
## [55] UCSC.utils_1.4.0 pillar_1.10.2 rappdirs_0.3.3
## [58] htmltools_0.5.8.1 GenomeInfoDbData_1.2.14 R6_2.6.1
## [61] evaluate_1.0.3 lattice_0.22-7 png_0.1-8
## [64] Rsamtools_2.24.0 memoise_2.0.1 Rcpp_1.0.14
## [67] SparseArray_1.8.0 xfun_0.52 pkgconfig_2.0.3