DGEobj: An S3 Object to Capture and Annotate DGE Workflows

Introduction

The DGEobj package implements an S3 class data object, the DGEobj, that conceptually represents an extension of the capabilities of the RangedSummarizedExperiment (RSE) developed by Martin Morgan et al.. Both the DGEobj and the RSE object capture the initial data for a differential gene expression analysis, namely a counts matrix along with associated gene and sample annotation. Additionally, the DGEobj extends this concept to support capture of various downstream data objects, the intermediate steps in an analysis, and thus capture the entire workflow of an analysis project.

The availability of a collection of results in a specific structured data object has multiple advantages. Sharing DGE results with other colleagues is simplified because the entire analysis is encapsulated and documented within the DGEobj. A recipient of data in this format can examine details of the analysis based on the annotation built into the DGE object. Downstream users of a dataset can retrieve the original data or any intermediate data items the original analyst captured. Importantly, assembling a collection of DGE projects in a structured format enables facile automation of integrative multi-project meta-analyses.

The RSE object can capture as many assays as desired. An “assay” is defined as any matrix with n genes (rows) and m samples (columns). A limitation of the RSE object however is that the RSE only allows 1 instance of row data (typically gene annotation) and 1 instance of column data (sample annotation with one row for every column of data in the assay slot). This limits the RSE in terms of its ability to hold downstream data objects because many of those objects meet the definition of row data also (e.g. DGElist, Fit objects, topTable output). Other types of data (e.g. design matrices, sample QC) meet the definition of column data. Thus, the DGEobj was modeled after the RSE object, but extended to accommodate multiple row and column data types. The DGEobj is thus uniquely suited to capturing the entire workflow of a DGE analysis.

While the DGEobj was designed to support a RNA-Seq workflow, extensibility features enable the definition of other data types. Thus the DGEobj data structure may be easily configured for many proteomic or metabolomic applications, or an data domain where the starting data is a 2D matrix of assays and samples.

DGEobj Overview

DGEobj Nomenclature: Items, Types and Base Types

Multiple instances of a base type are accommodated by defining data “types” and “items”. Each data type is assigned a base type (e.g. geneData, GRanges, and Fit objects are all “types” of “baseType” = rowData).

Importantly, multiple instances of each type are allowed (exceptions described in Unique Items section) as long as each instance of a type is named uniquely. Each instance of a “type” is described as an “item” and each item must have a user-defined item name. The item name must be unique within a DGEobj.

The DGEobj package implements an S3 class data object, the DGEobj, that conceptually represents an extension of the capabilities of the RangedSummarizedExperiment (RSE) originally developed by Martin Morgan et al. Both the DGEobj and the RSE object capture the initial data for a differential gene expression analysis, namely a counts matrix along with associated gene and sample annotation. Additionally, the DGEobj extends this concept to support capture of various downstream data objects, the intermediate steps in an analysis, and thus capture the entire workflow of an analysis project.

The availability of a collection of results in a specific structured data object has multiple advantages. Sharing DGE results with other colleagues is simplified because the entire analysis is encapsulated and documented within the DGEobj. A recipient of data in this format can examine details of the analysis based on the annotation built into the DGE object. Downstream users of a dataset can drill back to the original data or tap any intermediate data items the original analyst captured. Importantly, assembling a collection of DGE projects in a structured format enables facile automation of integrative multi-project metanalyses.

The RSE object, which motivated building the DGEobj, can capture as many assays as desired. An “assay” is defined as any matrix with n genes (rows) and m samples (columns). A limitation of the RSE object however is that the RSE only allows 1 instance of row data (typically gene annotation) and 1 instance of column data (sample annotation with one row for every column of data in the assay slot). This limits the RSE in terms of its ability to hold downstream data objects because many of those objects meet the definition of row data also (e.g. DGElist, Fit objects, topTable output). Other types of data (e.g. design matrices, sample QC) meet the definition of column data. Thus, the DGEobj was modeled after the RSE object, but extended to accommodate multiple row and column data types. The DGEobj is thus uniquely suited to capturing the entire workflow of a DGE analysis.

DGEobj Structure

The DGEobj supports four distinct data types, that we refer to as “base types”:

  • assay data: dataframes or matrices of data with n rows (genes or transcripts) and m columns (samples)
  • rowData: a dataframe with n rows typically containing information about each gene with as many columns as needed (gene ID, gene symbols, chromosome information, etc). Other types of rowData include the edgeR DGElist object, design matrices, and Fit objects
  • colData: a dataframe with m rows, that is, one row for each sample column in the assay slot
  • metaData: anything that doesn’t align with row or col data

Fundamentally, the base type defines how an item should be subsetted (see the section on subsetting below).

The design goal of the DGEobj is that it should capture the workflow and analysis results of a single dataset. As such, certain items that constitute the “raw” or starting point data are assigned a “unique” attribute that limits such “types” to one instance per DGEobj.

Three items: counts, design (sample information), and geneData (or isoformData, or exonData) are defined as unique. If complete chromosome location data (Chr, Start, End, Strand) are supplied in the geneData item, then a GRanges item is also created upon initializing a DGEobj.

Importantly, multiple instances of each type are allowed (exceptions described in Unique Items section) as long as each instance of a type is named uniquely. Each instance of a “type” is described as an “item” and each item must have a user-defined item name. The item name must be unique within a DGEobj.

“Levels” are predefined for DGEobj objects to accommodate RNA-Seq and Affymetrix RNA expression applications (allowedLevels = c(“gene”, “isoform”, “exon”, “affy”)). A DGEobj may contain only one of these levels. Thus a user would need to create separate DGEobjs for gene and isoform level data in an RNA transcription analysis mode.

See the “Customizing the DGEobj” section below to define new levels.

Gene level data will be used within this vignette.

Several levels are predefined for DGEobj objects to accommodate RNA and proteomic applications (allowedLevels = c(“gene”, “isoform”, “exon”, “proteingroup”, “peptide”, “ptm”, “protein”)). A DGEobj may contain only one of these levels. Thus a user would need to create separate DGEobjs for gene and isoform level data in an RNA transcription analysis mode. Gene level data will be used Within this vignette.

An analysis can become multi-threaded. For example, multiple models can be fit to one dataset each with its own set of contrasts. Two features of the DGEobj serve to manage multi-threaded analyses. First, only one instance of the starting data types are allowed. Thus all analysis thread map back to the starting data. Secondly, parent/child relationships are captured. Each data item carries a parent attribute that holds the item name of the parent data item. In this way, for example, a topTable item can be linked to the design matrix and fit that produced it.

Preserving the Original Data

In the course of a workflow an analyst will often need to subset the DGEobj based on low signal or other characteristics. However, a downstream analyst may need to start their analysis with the original unsubsetted data. For this reason, when the DGEobj is initialized, a copy of the starting data is also stored in a metadata slot. Metadata slots are carried along without subsetting so the original data may always be retrieved from these metadata items. The item names of the original data in the metadata slot have a “_orig” suffix.

Building a DGE data object

A DGEobj is initialized from a set of three data frames containing the primary assay matrix (typically a counts matrix for RNA-Seq applications), a set of row annotation (gene data) and an experiment design table with information about the samples (columns) of the assay matrix. For the example below, a GEO dataset will be downloaded and encapsulated as a DGEobj.

Demonstration Dataset: Rat Liver Slice Compound Treatments

The following dataset was selected to demonstrate building and working with the DGEobj data structure.

Huang X, Cai H, Ammar R, Zhang Y et al. Molecular characterization of a precision-cut rat liver slice model for the evaluation of antifibrotic compounds. Am J Physiol Gastrointest Liver Physiol 2019 Jan 1;316(1):G15-G24. PMID: 30406699

Briefly, livers were removed from rats 4 weeks after bile duct ligation or sham operation. Rat liver slices were incubated in vitro with potential anti-fibrotic compounds. At the end of the incubation whole transcriptome RNA-Seq analysis was performed.

Files containing counts, sample annotations, and QC data associated with this project will be downloaded from the NCBI GEO resource.

The GEO data includes Ensembl gene IDs. Additional gene information such as chromosome positions, type of transcript, etc, will be downloaded from Ensembl using the biomaRt package.

Initializing a DGEobj from dataframes

Three properly formatted data frames are required to initialize a new DGEobj:

  1. Counts matrix: For RNA-Seq, this is typically a genes x samples matrix or dataframe of numbers. Rownames must be present and contain a unique identifier (e.g. Ensembl geneid). Colnames should be a unique sample identifier.

  2. Row Data: Additional annotations for the entities represented by each row. For the RNA-Seq case, this dataframe holds associated gene annotation. The rownames in the RowData must match the rownames in the counts matrix.

  3. Col Data: Accordingly the ColData dataframe contains information about the samples (columns) of the counts matrix. We often refer to this as the “design” table because it typically contains the experimental details of the treatment of each sample. The rownames of ColData must match the colnames of the Counts Matrix.

In addition to these three items, the user must specify the “level” of the data (allowedLevels = c(“gene”, “isoform”, “exon”, “affy”)).

See the “Customizing the DGEobj” section below to define new levels.

Retrieve GEO data for GSE120804

This GEO project includes a counts table, design table, and QC metrics in the supplemental section. The data is downloaded to temp files and imported into data frames to prepare for building a DGEobj.

# Get the raw counts, sample annotation ("design") and QC data from GEO
# Source: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE120804
getLocation <- "http://ftp.ncbi.nlm.nih.gov/geo/series/GSE120nnn/GSE120804/suppl"
countsFile  <- "GSE120804_counts.txt.gz"
designFile  <- "GSE120804_geo_sample_annotation_edit.csv.gz"
qcFile      <- "GSE120804_qc_metrics.txt.gz"

temp <- tempfile()
if (download.file(glue("{getLocation}/{countsFile}"), destfile = temp, mode = 'wb') > 0) {
    stop("Counts Download Failed")
}
counts <- read.delim(temp, stringsAsFactors = FALSE, row.names = 1)

temp <- tempfile()
if (download.file(glue("{getLocation}/{designFile}"), destfile = temp, mode = 'wb')) {
    stop("Design Download Failed")
}
design <- read.csv(temp, stringsAsFactors = FALSE)

temp <- tempfile()
if (download.file(glue("{getLocation}/{qcFile}"), destfile = temp, mode = 'wb')) {
    stop("Alignment QC Download Failed")
}
alignmentQC <- read.delim(temp, stringsAsFactors = FALSE)

Prepare the Design Table

Often the design table will need some manipulation or clean up to properly construct the columns required to support a planned analysis and also to meet the design constraints required for the DGEobj. It is recommended to restructure the design table as needed at the onset of the analysis.

In this example, the design file lacks a proper linking column to the counts data. The colnames from the counts data should be the rownames for the design table. Fortunately, the colnames from counts are present as a substring within the fastq filename column (design$raw.file). The samplenames will be parsed from the fastq filenames to create a proper linking column between the counts and design data.

rownames(design) <- str_sub(design$raw.file, start = 1, end = 21)

Additional manipulations to the design data include: * renaming a column * create a design$DiseaseStatus column to indicate whether a liver is derived from a BDL or sham animal * create an animal number column to capture which liver slice samples were derived from the same animal

#correct the desired case/spelling of one column
design <- design %>%
    rename(ReplicateGroup = Replicate.group)

# Create a DiseaseStatus column by parsing ReplicateGroup
design$DiseaseStatus <- rep("Sham", nrow(design))
idx <- str_detect(design$ReplicateGroup, "BDL")
design$DiseaseStatus[idx] <- "BDL"

# Create an animal# column.  The animal number is encoded in the sample.name column
design$AnimalNum <- str_match(design$Sample.name, "r[0-9]{1,3}")

Retrieve Gene Annotation from Ensembl

The GEO data for this project is supplied Ensembl gene IDs but no other gene information. Additional gene information will be collected by querying Ensembl directly.

# Now get the gene annotation from Ensembl/biomaRt
ens.ds      <- "rnorvegicus_gene_ensembl"
ens.mart    <- useMart(biomart = "ensembl", dataset = ens.ds, host = "https://may2021.archive.ensembl.org")
ens.columns <- c("ensembl_gene_id", "rgd_symbol", "chromosome_name", "start_position",
                 "end_position", "strand", "gene_biotype", "description")
ens.data    <- getBM(attributes = ens.columns,
                     values     = rownames(counts),
                     mart       = ens.mart,
                     useCache   = F) %>%
               distinct(ensembl_gene_id, .keep_all = T)

# Filter the list to the genes used in the test dataset and properly format gene
# information for GenomicRanges use
gene.data <- left_join(data.frame(ensembl_gene_id = rownames(counts), stringsAsFactors = F),
                       ens.data,
                       by = "ensembl_gene_id") %>%
    dplyr::rename(start = start_position, end = end_position) %>%
    mutate(strand = case_when(strand == -1 ~ "-",
                              strand == 1  ~ "+",
                              TRUE         ~ "*"))
rownames(gene.data) <- gene.data$ensembl_gene_id
# Get transcript level data and keep max for each gene, or alternatively, use the cds length

ens.ds          <- "rnorvegicus_gene_ensembl"
ens.mart        <- useMart(biomart = "ensembl", dataset = ens.ds, host = "https://may2021.archive.ensembl.org")
ens.columns     <- c("ensembl_gene_id", "ensembl_transcript_id", "transcript_length")
transcript.data <- getBM(attributes = ens.columns,
                         values     = rownames(counts),
                         mart       = ens.mart,
                         useCache   = F) %>%
                   arrange(desc(transcript_length)) %>%
                   distinct(ensembl_gene_id, .keep_all = T)

#Add a transcript_length column to gene.data
gene.data <- left_join(gene.data,
                       select(transcript.data, ensembl_gene_id, ExonLength = transcript_length))
rownames(gene.data) <- gene.data$ensembl_gene_id
gene.data$ensembl_gene_id <- NULL

# enforce same order as counts
gene.data <- gene.data[rownames(counts),]

Validate Dataframe Relationships

As mentioned earlier, rownames and column names provide links between the three starting data frames.

  • The rownames in the gene annotation must match the rownames in the counts matrix.
  • The rownames of the design table must match the colnames of the counts matrix.

These statements perform a reality check on these two constraints.

all(rownames(counts) == rownames(gene.data))
## [1] TRUE
all(colnames(counts) == rownames(design))
## [1] TRUE

Instantiate the DGEobj

With the above constraints met, the three data frames can be used to instantiate a DGEobj. In addition the “level” of the data must be specified in the “level” argument (Allowed levels include “gene”, “isoform”, “exon”, “affy”). Finally, two attributes are added via the customAttr argument to annotate the genome and gene model set used for the alignments used to produce the count data.

dgeObj <- DGEobj::initDGEobj(primaryAssayData  = counts,
                             rowData = gene.data,
                             colData = design,
                             level = "gene",
                             customAttr = list(Genome    = "Rat.B6.0",
                                               GeneModel = "Ensembl.R89"))

Although it is possible to add many or all project attributes with the customAttr argument, this can also be tedious when many more attributes are desired. A more convenient way to add project attributes to the DGEobj is provided using the annotateDGEobj function. This function reads key/value pairs from a text file. Thus a text file template can be provided to less technical collaborators to capture important experiment metadata.

The annotation file for this project looks like this:

# Some basic metadata about the project

level=gene source=https://github.com/cb4ds/DGEobj/blob/master/vignettes/DGEobj_Overview.Rmd ID=BDL_Rat_LiverSlice_03Dec2017 BMS_PID=P-20170808-0001 Title=Rat Liver Slices from Bile Duct Ligation animals Organism=Rat GeneModel=Ensembl.R89 PlatformType=RNA-Seq Description=Rat livers slices from sham or BDL +/- efficacious treatments incubated in vitro Keywords=Liver slices; Bile Duct Ligation Disease=Liver Fibrosis Tissue=Liver GEO=GSE120804

# additional descriptive attributes

Technology=Illumina-HiSeq LibraryPrep=TruSeq Stranded Total RNA AlignmentReference=Rat.B6.0 ReadType=PE Pipeline=RNA-Seq_BMS_v2.4.pscript AlignmentAlgorithm=OSA ScriptID=RNA-Seq_BMS_v2.4.pscript

# Institutional attributes

BusinessUnit=Discovery FunctionalArea=Fibrosis Vendor=BMS TBio_Owner=Ron Ammar TA_Owner=John Huang

# the annotation information is stored in a temporary .txt file
dgeObj <- annotateDGEobj(dgeObj, annotationFile)

The DGE object now contains the starting data and metadata and is analysis-ready.

Examining a DGE object

Inventory function

To view a listing of the items in the DGEobj:

kable(inventory(dgeObj))
ItemName ItemType BaseType Parent Class Row Col DateCreated
counts_orig counts_orig meta matrix 32883 48 2022-05-12 10:58:09
counts counts assay counts_orig matrix 32883 48 2022-05-12 10:58:09
design_orig design_orig meta data.frame 48 10 2022-05-12 10:58:09
design design col design_orig data.frame 48 10 2022-05-12 10:58:09
geneData_orig geneData_orig meta data.frame 32883 8 2022-05-12 10:58:09
geneData geneData row geneData_orig data.frame 32883 8 2022-05-12 10:58:09
granges_orig granges_orig meta geneData_orig GRanges 32883 NA 2022-05-12 10:58:10
granges granges row geneData GRanges 32883 NA 2022-05-12 10:58:10

Examine DGEobj Metadata

Use the showMeta function to list the meta data associated with a project.

kable(showMeta(dgeObj))
Attribute Value
class DGEobj
DGEobjDef.version 1.2
level gene
Genome Rat.B6.0
GeneModel Ensembl.R89
NA NA
source Omicsoft
ID BDL_Rat_LiverSlice_03Dec2017
Title Rat Liver Slices from Bile Duct Ligation animals
Organism Rat
PlatformType RNA-Seq
Description Rat livers slices from sham or BDL +/- efficacious treatments incubated in vitro
Keywords Liver slices; Bile Duct Ligation
Disease Liver Fibrosis
Tissue Liver
GEO GSE120804
Technology Illumina-HiSeq
LibraryPrep TruSeq Stranded Total RNA
AlignmentReference Rat.B6.0
ReadType PE
Pipeline RNA-Seq_BMS_v2.4.pscript
AlignmentAlgorithm OSA
ScriptID RNA-Seq_BMS_v2.4.pscript
BusinessUnit Discovery
FunctionalArea Fibrosis
Vendor BMS
TBio_Owner Ron Ammar
TA_Owner John Huang

DGEobj Length and Dimensions

The length of a DGEobj refers to the number of data items in the DGEobj.

length(dgeObj)
## [1] 8

The dimensions reported for a DGEobj are the dimensions of the assays contained in the DGEobj. That is, for the RNA-Seq application, the row dimension is the number of genes contained in the object and the column dimension is the number of samples contained in the object.

dim(dgeObj)
## [1] 32883    48
dim(dgeObj$counts)
## [1] 32883    48

Rownames and Colnames

The rownames and colnames of the DGEobj are defined by the primary “assay” matrix in a DGEobj, typically the counts matrix for RNA-Seq data.

Note, there is no facility provided to re-assign rownames or colnames of a DGEobj. The user should make any adjustments to row/colnames before instantiating the DGEobj.

Subsetting a DGEobj

A DGEobj may be square bracket subsetted similar the same way data frames and matrices may be subsetted. The subsetting function uses the base type to define how each data item is handled during subsetting. During subsetting, metadata items are carried along unchanged.

#subset to the first 100 genes
First100genes <- dgeObj[1:100,]
dim(First100genes)
## [1] 100  48
#subset to the first 10 samples
First10samples <- dgeObj[,1:10]
dim(First10samples)
## [1] 32883    10
#susbet genes and samples
AnotherSubset <- dgeObj[1:100, 1:10]
dim(AnotherSubset)
## [1] 100  10

Boolean vectors may also be used to subset dimensions of a DGEobj.

# select a subset of samples
idx <- dgeObj$design$DiseaseStatus == "BDL"
BDLonly <- dgeObj[,idx]
dim(BDLonly)
## [1] 32883    39

Inventory of a DGEobj

The inventory function prints a table of the data items present in a DGEobj. The output includes the item name, type, baseType, parent, class, and date created. If the verbose = TRUE argument is used, a funArgs column is also included.

kable(inventory(dgeObj))
ItemName ItemType BaseType Parent Class Row Col DateCreated
counts_orig counts_orig meta matrix 32883 48 2022-05-12 10:58:09
counts counts assay counts_orig matrix 32883 48 2022-05-12 10:58:09
design_orig design_orig meta data.frame 48 10 2022-05-12 10:58:09
design design col design_orig data.frame 48 10 2022-05-12 10:58:09
geneData_orig geneData_orig meta data.frame 32883 8 2022-05-12 10:58:09
geneData geneData row geneData_orig data.frame 32883 8 2022-05-12 10:58:09
granges_orig granges_orig meta geneData_orig GRanges 32883 NA 2022-05-12 10:58:10
granges granges row geneData GRanges 32883 NA 2022-05-12 10:58:10

If just the item names of data stored in the DGEobj are needed, use the base R names function.

names(dgeObj)
## [1] "counts_orig"   "counts"        "design_orig"   "design"        "geneData_orig" "geneData"      "granges_orig"  "granges"

Adding Items to a DGEobj

Alignment QC Example

Often other sample-related information is available. In this project, for example, the alignment statistics for the RNA-Seq were deposited in GEO and downloaded with the other data. This example adds the alignment statistics to the DGEobj. The alignment QC data as downloaded needs to be transposed to place samples in rows. Once transposed, this data meets the criteria for the colData type “alignQC” and is added to the DGEobj. As a new data item not derived directly from the existing items, the alignment QC lacks parent or funArgs values in this context.

rownames(alignmentQC) <- alignmentQC$Metric

alignmentQC <- alignmentQC %>%
    select(-Metric) %>%
    t() %>%
    as.data.frame()

dgeObj <- addItem(dgeObj,
                  item = alignmentQC,
                  itemName = "AlignmentQC",
                  itemType = "alignQC")

A Workflow Example

As an analysis progresses, the data analyst should capture intermediate data objects deemed important to properly documenting the workflow. In the example here, the counts will be normalized and the resulting DGElist data item added to the DGEobj.

# Perform TMM normalization on the raw counts
dgelist <- dgeObj$counts %>%
  DGEList() %>%
  calcNormFactors(method = "TMM")

# add the resulting edgeR DGEList item to the DGEobj
newdgeObj <- addItem(dgeObj,
                     item     = dgelist,
                     itemName = "normTMM",
                     itemType = "DGEList",
                     parent   = "counts",
                     funArgs  = "calcNormFactors; TMM"
)

kable(inventory(newdgeObj))
ItemName ItemType BaseType Parent Class Row Col DateCreated
counts_orig counts_orig meta matrix 32883 48 2022-05-12 10:58:09
counts counts assay counts_orig matrix 32883 48 2022-05-12 10:58:09
design_orig design_orig meta data.frame 48 10 2022-05-12 10:58:09
design design col design_orig data.frame 48 10 2022-05-12 10:58:09
geneData_orig geneData_orig meta data.frame 32883 8 2022-05-12 10:58:09
geneData geneData row geneData_orig data.frame 32883 8 2022-05-12 10:58:09
granges_orig granges_orig meta geneData_orig GRanges 32883 NA 2022-05-12 10:58:10
granges granges row geneData GRanges 32883 NA 2022-05-12 10:58:10
AlignmentQC alignQC col data.frame 48 172 2022-05-12 10:58:10
normTMM DGEList assay counts DGEList 32883 48 2022-05-12 10:58:11

item is the actual data object to add.

itemName is the user-defined name for that object.

itemType is the predefined type.

parent is the name of the item that this object is derived from.

funArgs is a text field intended to documents the processing parameters of a given step in the analysis.

The parent argument is analyst-defined and is particularly important for a multi-threaded analysis. For example, if more than one fit is applied to the data, the parent argument maintains the thread and unambiguously identifies the child objects of each branch in the workflow. The value assigned to parent should be the item name of the parent data object.

By default addItem will refuse to add an itemname that already exists. There is an overwrite argument to the addItem function although its use is discouraged. Use the overwrite = TRUE argument if it is necessary to make a post-hoc change to an item that already exists in the DGEobj.

Automating funArgs Entries

If a whole analysis step, together with the addItem call, is contained within a function, it becomes easy to automate capture of function arguments.

Within a function call, match.call returns the calling function name and arguments. Modifying the normalization example above:

runNorm <- function(x, method){
  # Perform TMM normalization on the raw counts
  dgelist <- x$counts %>%
    DGEList() %>%
    calcNormFactors(method = method)

  # add the resulting edgeR DGEList item to the DGEobj
  x <- addItem(x,
               item     = dgelist,
               itemName = "normTMM",
               itemType = "DGEList",
               parent   = "counts",
               funArgs  = match.call()
  )
}

newdgeObj <- runNorm(dgeObj, method = "TMM")

#use verbose=TRUE to see the funArgs column
inv <- inventory(newdgeObj, verbose = TRUE)[ ,c(1:3,9)]
kable(inv)
ItemName ItemType BaseType FunArgs
counts_orig counts_orig meta ::(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); DGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); initDGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”))
counts counts assay ::(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); DGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); initDGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”))
design_orig design_orig meta ::(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); DGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); initDGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”))
design design col ::(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); DGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); initDGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”))
geneData_orig geneData_orig meta ::(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); DGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); initDGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”))
geneData geneData row ::(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); DGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); initDGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”))
granges_orig granges_orig meta ::(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); DGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); initDGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”))
granges granges row ::(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); DGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); initDGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”))
AlignmentQC alignQC col addItem(dgeObj, alignmentQC, AlignmentQC, alignQC)
normTMM DGEList assay runNorm(dgeObj, TMM)

Batch addition of multiple items

The addItem function is typically used to capture each item during the course of an analysis. Another use case is constructing a DGEobj in a post-hoc fashion. In this scenario, it may be more efficient to collect and add multiple items. The addItems function supports this scenario.

In this example, add a normalized dgelist and a design matrix to the DGEobj:

# Perform TMM normalization on the raw counts
dgelist <- dgeObj$counts %>%
  DGEList() %>%
  calcNormFactors(method = "TMM")

# Formula must be composed of column names from the design table.
formula <- '~ 0 + ReplicateGroup'

# User-defined name for the designMatrix
designMatrixName <- "ReplicateGroupDesign"

# build the designMatrix
design <- getItem(dgeObj, "design")
designMatrix <- model.matrix (as.formula(formula), design)
#capture the formula as an attribute of the designMatrix
attr(designMatrix, "formula") <- formula

# Add both the dgelist and designMatrix to the DGEobj

NewDgeObj <- addItems(dgeObj,
                     itemList  = list(dgelist=dgelist, ReplicateGroupDesign=designMatrix),
                     itemTypes = list("DGEList", "designMatrix"),
                     parents   = list("counts", "design"))

itemList is a named list of the data objects to add. The names become the itemNames of the items in the DGEobj.

itemTypes is a list of the types for each item in itemList. see showTypes(DGEobj).

parents is a list of the parent item names associated with each item in itemList.

itemAttr is an optional named list of attributes that will be added to every item on the itemList. Note that the DGEobj has attributes and each item within the DGEobj can have its own attributes. Here the attributes are being added to the individual items.

Pre-defined Item Types

A number of pre-defined item types have been implemented to support the limma/voom DGE workflow for RNA-Seq and Affymetrix data types.

To see a complete list of pre-defined item types use the showTypes function.

kable(arrange(showTypes(dgeObj), BaseType, Type))
Type BaseType
AffyRMA AffyRMA assay
assay assay assay
counts counts assay
DGEList DGEList assay
effectiveLength effectiveLength assay
Elist Elist assay
intensities intensities assay
isoformFrac isoformFrac assay
alignQC alignQC col
col col col
design design col
designMatrix designMatrix col
AffyRMA_orig AffyRMA_orig meta
contrast_fit_treat contrast_fit_treat meta
contrastMatrix contrastMatrix meta
corFit corFit meta
counts_orig counts_orig meta
design_orig design_orig meta
effectiveLength_orig effectiveLength_orig meta
exonData_orig exonData_orig meta
geneData_orig geneData_orig meta
geneList geneList meta
granges_orig granges_orig meta
imputationMatrix imputationMatrix meta
intensities_orig intensities_orig meta
isoformData_orig isoformData_orig meta
meta meta meta
pathway pathway meta
proteinData_orig proteinData_orig meta
svobj svobj meta
topTreat topTreat meta
URL URL meta
affyData affyData row
contrast_fit contrast_fit row
exonData exonData row
fit fit row
geneData geneData row
granges granges row
isoformData isoformData row
proteinData proteinData row
row row row
topTable topTable row

Adding a New Data Type

The capability to define new data types is core to the extensibility of the DGEobj data structure.

A set of data types has already been defined based on data types encountered in a typical limma/Voom workflow. The newTypes function is used to define data types to support a novel workflow.

# Add a new colData datatype called "sampleQC"
newDgeObj <- newType(dgeObj,
                    itemType = "sampleQC",
                    baseType = "col",
                    uniqueItem = FALSE)

After executing this call, the newDgeObj now contains the newly defined type definition and is ready to accept data items of that type.

Accessing data in a DGEobj

Typically an analyst will need to access items stored in the DGEobj as the analysis progresses. Moreover, one of the main benefits of a well defined data object is that it facilitates data sharing. As such, there are several ways to extract one or more components from the DGEobj.

Retrieve a single item

The getItem function allows the user to retrieve any item in a DGEobj by referencing its item name. Use the inventory function (described above) to check the contents of the DGEobj including the list of item names that are available.

counts <- getItem(dgeObj, "counts")

#peek at the first 4 columns
head(counts[ ,1:4])
##                    T_20170823MAN1_C05P01 T_20170823MAN1_H05P01 T_20170823MAN1_C04P01 T_20170823MAN1_D03P01
## ENSRNOG00000046319                     2                     0                 0.000                0.0000
## ENSRNOG00000047964                     0                     0                 0.000                0.0000
## ENSRNOG00000050370                     0                     0                 0.000                1.0000
## ENSRNOG00000032365                     0                     0                 0.000                0.0000
## ENSRNOG00000040300                     1                     6                 5.783                6.8135
## ENSRNOG00000058808                     0                     0                 0.000                0.0000

Retrieve multiple items

There are several ways to retrieve multiple items as a list or data objects.

The user can supply a list of item names to retrieve specific items. The items requested are returned in a list.

MyItems <- getItems(dgeObj, list("counts", "geneData"))

If all the items the user wants to retrieve are of the same type, the getType function may be used to retrieve them as a named list. For example, an analyst may want to retrieve a set of contrast results stored as topTable output.

# The below example object is a post-workflow object that has multiple contrasts
FullWorkflowDGEobj <- readRDS(system.file("exampleObj.RDS", package = "DGEobj"))

MyContrasts <- getType(FullWorkflowDGEobj, "topTable")
names(MyContrasts)
## [1] "BDL_vs_Sham"    "EXT1024_vs_BDL" "Nint_vs_BDL"    "Sora_vs_BDL"

Similarly, all items of the same baseType may be retrieved as a named list with the getBaseType function.

MyColData <- getBaseType(FullWorkflowDGEobj, "col")
names(MyColData)
## [1] "design"               "ReplicateGroupDesign"

Finally, the DGEobj can be recast as a named list.

MyList <- as.list(dgeObj)
class(MyList)
## [1] "list"
names(MyList)
## [1] "counts_orig"   "counts"        "design_orig"   "design"        "geneData_orig" "geneData"      "granges_orig"  "granges"       "AlignmentQC"

Ancillary Functions

Function baseType

Returns the baseType for a given Type:

baseType(dgeObj, "DGEList")
## [1] "assay"

Querying Metadata Stored in Attributes

Certain metadata about a project can be stored in the attributes of the DGEobj. Similarly, each item deposited in a DGEobj can have it’s own set of attributes. Both of these types of attributes can be easily accessed with the base R attr function.

The showMeta function will return a named list of all attributes of a DGEobj.

To retrieve a specific attribute:

keywords <- attr(dgeObj, "Keywords")

To retrieve an attribute from an item, use dollar sign syntax to reference the item:

formula <- attr(FullWorkflowDGEobj$ReplicateGroupDesign, "formula")

Function rmItem

For common workflow usage, the DGEobj should be considered read-only. Items are added during the workflow and should not be modified further. Thus, under the normal intended usage, no items should be removed. Although, its use is discouraged, for completeness, a rmItem function exists.

To delete an item from a DGEobj:

MyDgeObj <- rmItem(dgeObj, "granges")

Function setAttributes

To add or update a single attribute, use the base Rattr function:

attr(dgeObj, "Tissue")  <- "Liver"

The setAttributes function allows attaching a named list of attributes. Unlike the base R attributes function, setAttributes will add or update the named attributes without deleting any existing attributes.

newDgeObj <- setAttributes(dgeObj, list(date=date(), analyst="JRT"))

Function showMeta

This function returns any project annotation attached as attributes to the DGEobj using function annotateDGEobj. It returns the attributes as a two column dataframe of Attribute/Value pairs. It only returns attributes with character values.

kable(showMeta(dgeObj))
Attribute Value
class DGEobj
DGEobjDef.version 1.2
level gene
Genome Rat.B6.0
GeneModel Ensembl.R89
NA NA
source Omicsoft
ID BDL_Rat_LiverSlice_03Dec2017
Title Rat Liver Slices from Bile Duct Ligation animals
Organism Rat
PlatformType RNA-Seq
Description Rat livers slices from sham or BDL +/- efficacious treatments incubated in vitro
Keywords Liver slices; Bile Duct Ligation
Disease Liver Fibrosis
Tissue Liver
GEO GSE120804
Technology Illumina-HiSeq
LibraryPrep TruSeq Stranded Total RNA
AlignmentReference Rat.B6.0
ReadType PE
Pipeline RNA-Seq_BMS_v2.4.pscript
AlignmentAlgorithm OSA
ScriptID RNA-Seq_BMS_v2.4.pscript
BusinessUnit Discovery
FunctionalArea Fibrosis
Vendor BMS
TBio_Owner Ron Ammar
TA_Owner John Huang

Technical Information

Fundamentally, a DGEobj is a named list of data items. R attributes are used to store information about the whole project as well as individual data items. Attributes are used to store project metadata supplied by the analyst. Attributes are also used to store data types and relationships in order to properly document the workflow.

Project Meta Data

User supplied meta data in the form of key/value pairs are attached to the DGEobj as attributes where the attribute name is the key and the attribute value is a character string. These metaData attributes can be loaded from a text file using function annotateDGEobj or can be manipulated individually with the base R attr function. The showMeta function displays these user supplied attributes.

Workflow attributes

Several attributes are used to capture information about each data item (“type”, “basetype”, “dateCreated” and “funArgs.”), as well as relationships between them (“parent”). The inventory function provides a convenient table view of these attributes. Operationally, the value of each of these attributes is a named list where the names are the item names and the value is the value for that item. For example, to query just the basetype attribute of the “design” itemName:

attr(dgeObj, "basetype")[["design"]]
## [1] "col"

Customizing the DGEobj Definition for New Data Types

A class DGEobjDef has been created to define the data types acceptable to a DGEobj and also allow for the definition of new datatypes as needed. A default DGEobjDef includes data types for gene expression (Affymetrix and RNA-Seq) applications. The user may define additional levels and data types when initializing a DGEobjDef object (initDGEobjDef function).

To return the default DGEobjDef:

defaultDGEobjDef <- initDGEobjDef()

The initDGEobj function uses a default DGEobj definition tailored for the limma/voom workflow. If new data types are needed to support a different workflow, the user should define the new data types during a call to initDGEobjDef, and before calling the DGEobj constructor (initDGEobj). The initDGEobjDef function returns a DGEobjDef class object that can then be passed explicitly to initDGEobj.

There are four optional arguments to initDGEobjDef:

  • levels: A character string or vector of new level names.
  • primaryAssayNames: A character string or vector of itemNames for the primary assay matrix for each level. Length must match length(levels).
  • types: A named character vector of new data types. Names represent the type name and values define the basetype for this type.
  • uniqueTypes: Adding a type name to this character vector restricts that data type to one instance per DGEobj.

The example below defines new types to accommodate metabolomics data.

# simulate some data
nr <- nrow(dgeObj)
nc <- ncol(dgeObj)

myMolecules <- dgeObj$geneData
design      <- dgeObj$design

myMSquant   <- rnorm(nr * nc) %>%
  exp() %>%
  matrix(nrow = nr, ncol = nc)

rownames(myMSquant) <- rownames(dgeObj)
colnames(myMSquant) <- colnames(dgeObj)

myDGEobjDef <- initDGEobjDef(levels            = "metabolite",
                             primaryAssayNames = "intensity",
                             types = c(normInt = "assay"))

MetabolomicObj <- initDGEobj(primaryAssayData = myMSquant,
                             rowData          = myMolecules,
                             colData          = design,
                             level            = "metabolite",
                             DGEobjDef        = myDGEobjDef)
kable(inventory(MetabolomicObj))
ItemName ItemType BaseType Parent Class Row Col DateCreated
intensity_orig intensity_orig meta matrix 32883 48 2022-05-12 10:58:14
intensity intensity assay intensity_orig matrix 32883 48 2022-05-12 10:58:14
design_orig design_orig meta data.frame 48 10 2022-05-12 10:58:14
design design col design_orig data.frame 48 10 2022-05-12 10:58:14
metaboliteData_orig metaboliteData_orig meta data.frame 32883 8 2022-05-12 10:58:14
metaboliteData metaboliteData row metaboliteData_orig data.frame 32883 8 2022-05-12 10:58:14

There are simple dimension-based criteria to determine the basetype of a new datatype.

  • If both row and column dimensions match the primary assay data, the new object is an assay.
  • If the row dimension of the new data matches the row dimension of the primary assay data, the new data type is a “row” basetype.
  • If the row dimension of the new data matches the column dimension of the primary assay data, the new data type is “col” basetype.
  • If the new data type does not match row or col dimensions, it is a “meta” basetype.

The DGEobj definitions created during initialization or added post-hoc by the newType function, are stored the “objDef” attribute of the DGEobj.

str(attr(dgeObj, "objDef"))
## List of 5
##  $ basetype         : Named chr [1:4] "row" "col" "assay" "meta"
##   ..- attr(*, "names")= chr [1:4] "row" "col" "assay" "meta"
##  $ type             : Named chr [1:42] "row" "col" "assay" "meta" ...
##   ..- attr(*, "names")= chr [1:42] "row" "col" "assay" "meta" ...
##  $ uniqueType       : chr [1:20] "counts" "counts_orig" "design" "design_orig" ...
##  $ allowedLevels    : chr [1:5] "gene" "exon" "isoform" "protein" ...
##  $ primaryAssayNames: Named chr [1:5] "counts" "counts" "intensities" "intensities" ...
##   ..- attr(*, "names")= chr [1:5] "gene" "exon" "isoform" "protein" ...
##  - attr(*, "class")= chr "DGEobjDef"