Scruff

Process Single Cell RNA-Seq reads using scruff
Zhe Wang
2017-11-26
scruff is a toolkit for processing single cell RNA-seq fastq reads. It does demultiplexing, alignment,
Unique Molecular Identifier (UMI) correction, and transcript counting in an automated fashion and outputs
expression table. This vignette provides a brief introduction to the scruff package by walking through the
demultiplexing, alignment, and UMI-counting of a built-in example dataset.
Quick Start
# Run scruff on example dataset

# NOTE: Requires Rsubread index and TxDb objects for reference genome.
# For generation of these files, please refer to Stepwise Tutorial.
library(scruff)
expr <- scruff(exampleannot,
examplebc,
"GRCh38_MT",
"GRCh38_MT.gtf")
Stepwise Tutorial
Load Example Dataset
The scruff package contains a example single cell RNA-seq dataset examplefastq which is stored as a
list of ShortReadQ objects. Each object has 20,000 sequenced reads. They can be accessed via variable
examplefastq and exported to current working directory as fastq.gz files. scruff only accepts fastq/fastq.gz
files for demultiplexing so these objects have to be stored in fastq.gz format first.
library(scruff)
# write fastg.gz files to working directory

for (i in seq_len(length(examplefastq))) {
writeFastq(examplefastq[[i]], names(examplefastq)[i], mode = "w")
}
Demultiplex and Assign Cell Specific Reads
Now the fastq files are saved in current working directory and are ready to be demultiplexed. scruff
package provides built-in predetermined sample annotating table exampleannot and barcodes examplebc
for demultiplexing the example dataset. In the example fastq files, the barcode sequence of each read starts
at base 6 and ends at base 11. The UMI sequence starts at base 1 and ends at base 5. They can be set
via bc.start, bc.stop, and umi.start, umi.stop parameters. By default, reads with any nucleotide in the
barcode and UMI sequences with sequencing quality lower than 10 (Phred score) will be excluded. The
1
following command demultiplexes the example fastq reads and trims reads longer than 50 nucleotides. The
command returns a summary data.table containing the sample barcodes, filenames, number of reads, sample
name, file paths, etc. By default, the cell specific demultiplexed fastq.gz files are stored in ../Demultiplex
folder.
demultiplex.res <- demultiplex(

exampleannot,
examplebc,
bc.start = 6,
bc.stop = 11,
bc.edit = 1,
umi.start = 1,
umi.stop = 5,
keep = 50,
min.qual = 10,
yield.reads = 1e+06,
out.dir = "../Demultiplex",
cores = 2,
overwrite = TRUE,
verbose = TRUE
)
Alignment
scruff has a wrapper function align.rsubread which is a wrapper function to align in Rsubread package.
It aligns the reads to reference sequence index and outputs sequence alignment map files. For demonstration
purpose, the built-in mitochondrial DNA sequence from GRCh38 reference assembly GRCh38_MT will be used
to map the reads. First, a Rsubread index for the reference sequence needs to be generated.
ref <- "Homo_sapiens.GRCh38.dna.chromosome.MT.fa"

index <- "GRCh38_MT"
# write reference sequence to "./Homo_sapiens.GRCh38.dna.chromosome.MT.fa".

write.table(GRCh38_MT,
ref,
quote = FALSE,
sep = "\t",
row.names = FALSE,
col.names=FALSE)
# NOTE: Rsubread package does not support Windows environment.

library(Rsubread)
# Create index files for GRCh38_MT. For details, please refer to Rsubread user manual.
buildindex(
basename = index,
reference = ref,
indexSplit = FALSE,
memory = 16000
)
The following command maps the fastq files to GRCh38 mitochondrial reference sequence GRCh38_MT and
returns the file paths of the generated sequence alignment map files. By default, the files are stored in BAM
format in ../Demultiplex folder.
2
# get the paths to demultiplexed fastq.gz files
fastq.paths <- demultiplex.res[!(is.na(cell_num)), fastq_path]
# map the reads

align.res <- align.rsubread(
fastq.paths,
index,
format = "BAM",
out.dir = "../Alignment",
cores = 4,
overwrite = TRUE,
verbose = TRUE
)
UMI correction and Generation of Expression Table
Example GTF file GRCh38_MT_gtf will be used for feature counting. Currently, scruff applies the union
counting mode of the HTSeq Python package. The following command generates the UMI corrected expression
table for the example dataset.
# first write example GTF file to working directory

gtf.file <- "GRCh38_MT.gtf"
write.table(GRCh38_MT_gtf,
gtf.file,
quote = FALSE,
sep = "\t",
row.names = FALSE,
col.names = FALSE)
expr = count.umi(
align.res$Samples,
gtf.file,
format = "BAM",
out.dir = "../Count",
cores = 2,
verbose = TRUE
)
Collect gene annotation information from Biomart
Gene annotation information is needed before visualization of the data quality. In this tutorial, Ensembl
Biomart database is used for the query and collection of gene annotation information. The following codes
retrieve human gene names, biotypes, and chromosome names from latest version of biomart database.
# get gene names, biotypes, and chromosome names from biomart

host = "www.ensembl.org"
biomart = "ENSEMBL_MART_ENSEMBL"
dataset = "hsapiens_gene_ensembl"
GRCh = NULL
# get gene IDs
3
# exclude ERCC spike-ins and last two rows in expression table
geneID <- expr[!(gene.id %in% c("reads_mapped_to_genome",
"reads_mapped_to_genes")) &
!grepl("ERCC", expr[,gene.id]), gene.id]
# Get feature annotation data

ensembl <- biomaRt::useEnsembl(biomart = biomart,
dataset = dataset,
host = host,
GRCh = GRCh)
biomart.result <- data.table::data.table(biomaRt::getBM(

attributes = c("ensembl_gene_id", "external_gene_name",
"gene_biotype", "chromosome_name"),
filters = "ensembl_gene_id",
values = geneID,
mart = ensembl))
# remove rows containing duplicate gene IDs

biomart.result <- biomart.result[!base::duplicated(biomart.result,
by = "ensembl_gene_id"), ]
# Reorder/insert rows so that rows of Biomart query result match

# rows of the expression matrix
biomart.result <- data.table::data.table(biomart.result[
match(expr[!(gene.id %in% c("reads_mapped_to_genome",
"reads_mapped_to_genes")) &
!grepl("ERCC", expr[, gene.id]), gene.id],
biomart.result$ensembl_gene_id),])
Visualization of QC metrics
In order to plots data quality, a QC metrics table needs to be generated. collectqc function collects and
summarizes data quality information from results from previous steps and returns a QC metrics table for
plotting purpose.
qc <- collectqc(demultiplex.res, align.res, expr, biomart.result)
QC metrics can be visualized by running qcplot function.
qcplot(qc)
## [[1]]
4
Reads per cell
01
400
200
0
0 25 50 75 100
Reads
02
750
500
250
0
0 25 50 75 100
Cells in descending order
##
## [[2]]
5
Number of total reads
3
Log10reads
0
01 02
Number of aligned reads

2.5
Log10reads
2.0
1.5
1.0
0.5
0.0
01 02
sample
Number of reads aligned to an gene
2.5
Log10reads
2.0
1.5
1.0
0.5
0.0
01 02
Fraction of aligned reads to total reads

1.00
0.75
0.50
0.25
0.00
01 02
sample
6
Fraction of reads aligned to an gene out of total number of aligned reads
1.00
0.75
0.50
0.25
0.00
01 02
Fraction of reads aligned to an gene out of tatal number of reads

1.00
0.75
0.50
0.25
0.00
01 02
sample
Total number of transcripts after UMI correction
2.5
Log10transcripts
2.0
1.5
1.0
0.5
0.0
01 02
Number of mitochondrial transcripts after UMI correction

2.5
Log10transcripts
2.0
1.5
1.0
0.5
0.0
01 02
sample
7
Fraction of mitochondrial transcripts after UMI correction
1.00
0.75
0.50
0.25
0.00
01 02
Number of genes
1.2
Log10Genes
0.9
0.6
0.3
0.0
01 02
sample
Fraction of protein coding genes
1.00
0.75
0.50
0.25
0.00
01 02
Fraction of protein coding transcripts after UMI correction

1.00
0.75
0.50
0.25
0.00
01 02
sample
8
Log10(Genes x 1000000 / total reads)
Genes detected divided by total number of reads sequenced per million
0
01 02
sample

Scruff

Cargado por

Información del documento

Derechos de autor

Formatos disponibles

Compartir este documento

Compartir o incrustar documentos

Opciones para compartir

¿Le pareció útil este documento?

¿Este contenido es inapropiado?

Copyright:

Formatos disponibles

Scruff

Cargado por

Copyright:

Formatos disponibles

Process Single Cell RNA-Seq reads using scruff

# Run scruff on example dataset

Load Example Dataset

# write fastg.gz files to working directory

Demultiplex and Assign Cell Specific Reads

demultiplex.res <- demultiplex(

ref <- "Homo_sapiens.GRCh38.dna.chromosome.MT.fa"

# write reference sequence to "./Homo_sapiens.GRCh38.dna.chromosome.MT.fa".

# NOTE: Rsubread package does not support Windows environment.

# map the reads

UMI correction and Generation of Expression Table

# first write example GTF file to working directory

Collect gene annotation information from Biomart

# get gene names, biotypes, and chromosome names from biomart

# get gene IDs

# Get feature annotation data

biomart.result <- data.table::data.table(biomaRt::getBM(

# remove rows containing duplicate gene IDs

# Reorder/insert rows so that rows of Biomart query result match

qc <- collectqc(demultiplex.res, align.res, expr, biomart.result)

QC metrics can be visualized by running qcplot function.

Cells in descending order

Number of aligned reads

Fraction of aligned reads to total reads

Fraction of reads aligned to an gene out of tatal number of reads

Number of mitochondrial transcripts after UMI correction

Fraction of protein coding transcripts after UMI correction

Genes detected divided by total number of reads sequenced per million

También podría gustarte