Read in Salmon quantification of allelic counts from a diploid transcriptome. Assumes that diploid transcripts are marked with the following suffix: an underscore and a consistent symbol for each of the two alleles, e.g. ENST123_M and ENST123_P, or ENST123_alt and ENST123_ref, etc. importAllelicCounts requires the tximeta package. See Details below for further information.

importAllelicCounts(
  coldata,
  a1,
  a2,
  format = c("wide", "assays"),
  tx2gene = NULL,
  ...
)

Arguments

coldata

a data.frame as used in tximeta

a1

the symbol for the effect allele

a2

the symbol for the non-effect allele

format

either "wide" or "assays", for whether to combine the allelic counts as columns (wide) or put the allelic count information in different assay slots (assays). For wide output, the data for the non-effect allele (a2) comes first, then the effect allele (a1), e.g. [a2 | a1]. The reference level of the factor variable se$allele will be "a2" (so by default comparisons will be: a1 vs a2). For assays output, all of the original matrices are renamed with a prefix, either a1- or a2-.

tx2gene

optional, a data.frame with the first column indicating transcripts and the second column indicating genes (or any other transcript grouping). Alternatively, this can be a GRanges object with required columns tx_id and group_id (see makeTx2Tss). For more information on this argument, see Details.

...

any arguments to pass to tximeta
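
As a rough illustration of how the arguments fit together, the sketch below builds a sample table and calls the function at the transcript level. The file paths, sample names, and the alt/ref suffixes are hypothetical; the only firm assumption taken from tximeta is that coldata carries the columns files and names. The call is guarded so that it is skipped when fishpond is not installed or the hypothetical files do not exist.

```r
# Hypothetical sample table: tximeta expects the columns `files` and `names`.
coldata <- data.frame(
  files = file.path("quants", c("s1", "s2"), "quant.sf"),  # made-up paths
  names = c("s1", "s2"),
  stringsAsFactors = FALSE
)

# Only attempt the import when fishpond is available and the files exist.
if (requireNamespace("fishpond", quietly = TRUE) &&
    all(file.exists(coldata$files))) {
  se <- fishpond::importAllelicCounts(
    coldata,
    a1 = "alt",       # effect allele suffix (assumed naming)
    a2 = "ref",       # non-effect allele suffix (assumed naming)
    format = "wide"   # tx2gene left as NULL: transcript-level output
  )
  print(levels(se$allele))
}
```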

Value

a SummarizedExperiment, with allele counts (and other data) combined into a wide matrix [a2 | a1], or as assays (a1, then a2). The original strings associated with a1 and a2 are stored in the metadata of the object, in the alleles list element. Note the reference level of se$allele will be "a2", such that comparisons by default will be a1 vs a2 (effect vs non-effect).
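
The wide layout and the reference level can be pictured with a small base R sketch. It only mimics the arrangement described above using toy matrices; the counts, transcript names, sample names, and the column naming scheme are all made up for illustration and are not taken from the function's actual output.

```r
# Toy per-allele count matrices (2 transcripts x 2 samples); values are invented.
samples <- c("s1", "s2")
txps <- c("ENST1", "ENST2")
a1_counts <- matrix(c(10, 0, 12, 3), nrow = 2, dimnames = list(txps, samples))
a2_counts <- matrix(c(8, 5, 9, 4), nrow = 2, dimnames = list(txps, samples))

# Wide layout: non-effect allele (a2) columns first, then effect allele (a1).
wide <- cbind(a2_counts, a1_counts)

# Allele factor with "a2" as the reference level, so that default
# comparisons read as a1 vs a2.
allele <- factor(rep(c("a2", "a1"), each = length(samples)),
                 levels = c("a2", "a1"))

# Hypothetical column naming, used here only to keep the columns distinct.
colnames(wide) <- paste(colnames(wide), allele, sep = "-")

print(wide)
print(levels(allele))
```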

Details

Requirements - There must be exactly two alleles for each transcript, and the --keep-duplicates option should be used in Salmon indexing to avoid removal of transcripts with identical sequence. The output object has half the number of transcripts, with the two alleles either stored in a "wide" object, or as re-named "assays". Note carefully that the symbol provided to a1 is used as the effect allele, and a2 is used as the non-effect allele (see the format argument description and Value description above).

tx2gene - The two columns should include the a1 and a2 suffixes for the transcripts and genes/groups; if it is detected that the first transcript does not have these suffixes, they will be added internally. For example, if _alt or _ref, or _M or _P (as indicated by the a1 and a2 arguments) are not present in the table, the table rows will be duplicated with those suffixes added on behalf of the user. If tx2gene is not provided, the output object will be transcript-level. Do not attempt to set the txOut argument, as it will conflict with internal calls to downstream functions. If the a1/a2 suffixes are not at the end of the transcript name in the quantification files, e.g. ENST123_M|<metadata>, then ignoreAfterBar=TRUE can be used to match regardless of the string following | in the quantification files.
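
The row duplication described above can be sketched in base R. This is not the function's internal code: the helper add_suffix, the toy table, and the order in which the duplicated rows are stacked are assumptions made for illustration. The last lines mimic the effect of ignoring everything after a | in a transcript name.

```r
# Hypothetical haploid tx2gene table, without allele suffixes.
tx2gene <- data.frame(
  tx = c("ENST1", "ENST2", "ENST3"),
  gene = c("ENSG1", "ENSG1", "ENSG2"),
  stringsAsFactors = FALSE
)
a1 <- "alt"
a2 <- "ref"

# Hypothetical helper: append "_<allele>" to both columns.
add_suffix <- function(tab, allele) {
  data.frame(
    tx = paste0(tab$tx, "_", allele),
    gene = paste0(tab$gene, "_", allele),
    stringsAsFactors = FALSE
  )
}

# Check only the first transcript for an a1/a2 suffix, as the text describes.
has_suffix <- grepl(paste0("_(", a1, "|", a2, ")$"), tx2gene$tx[1])

# Duplicate the rows with each suffix when none is present
# (stacking order is an assumption).
tx2gene_diploid <- if (has_suffix) {
  tx2gene
} else {
  rbind(add_suffix(tx2gene, a1), add_suffix(tx2gene, a2))
}
print(tx2gene_diploid)

# Effect of ignoring the text after "|" in a quantification file name.
quant_name <- "ENST123_M|<metadata>"
trimmed_name <- sub("\\|.*$", "", quant_name)
print(trimmed_name)
```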

skipMeta=TRUE is used, as it is assumed the diploid transcriptome does not match any reference transcript collection. This may change in future iterations of the function, depending on developments in upstream annotations and software.

If tx2gene is a GRanges object, the rowRanges of the output will be the reduced ranges of the grouped input ranges, with tx_id collapsed into a CharacterList, and TSS positions saved as an IntegerList, if these are not equal among the transcripts of a group. Other metadata columns are not manipulated; only the metadata for the first range of each group is returned.

References

Euphy Wu, Noor P. Singh, Kwangbom Choi, Mohsen Zakeri, Matthew Vincent, Gary A. Churchill, Cheryl L. Ackert-Bicknell, Rob Patro, Michael I. Love. "Detecting isoform-level allelic imbalance accounting for inferential uncertainty" bioRxiv (2022) https://doi.org/10.1101/2022.08.12.503785