Bash wrapper modules for the remote analysis of RNA-Seq data, with persistency features.

TCP-Lab/x.FASTQ

x.FASTQ

x.FASTQ is a suite of Bash wrappers for original and third-party software, designed to make RNA-Seq data analysis more automated and accessible to wet-lab biologists without a specific bioinformatics background.

Modules

x.FASTQ provides several modules to cover the entire RNA-Seq data analysis workflow, from raw read retrieval to count matrix generation. Each module is started with a different CLI-executable bash command:

| Module Name | Performed Task |
|-------------|----------------|
| getFASTQ    | downloads NGS raw data in FASTQ format from the ENA database |
| trimFASTQ   | performs adapter and quality trimming by running BBDuk |
| anqFASTQ    | aligns reads and quantifies transcript abundance by running STAR and RSEM |
| qcFASTQ     | runs quality-control tools, such as FastQC and MultiQC |
| tabFASTQ    | merges counts from multiple samples into a single expression table |
| metaharvest | fetches metadata from GEO and/or ENA databases |
| x.FASTQ     | performs common tasks of general utility (disk usage monitor, dependency report...) |

Usage

Assuming that you have identified a study of interest from GEO (e.g., GSE138309), have already created a project folder somewhere (mkdir '<anyPath>'/GSE138309), and have moved into it (cd '<anyPath>'/GSE138309), here are some possible sample workflows.

Minimal Step-by-Step Workflow

A minimal workflow retrieves the FASTQs, aligns and quantifies them, and generates the gene-level count matrix:

# Download FASTQs, align, quantify, and assemble a gene-level count matrix
getfastq -u GSE138309 > ./GSE138309_wgets.sh
getfastq GSE138309_wgets.sh
anqfastq .
tabfastq .
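
Before launching the commands above, it can be useful to confirm that they are reachable from the current shell. The following is a minimal sketch and not part of x.FASTQ: `missing_commands` is a hypothetical helper that relies only on the standard `command -v` builtin, and the command names are the ones used in the minimal workflow.

```shell
#!/bin/bash
# Hypothetical helper (not part of x.FASTQ): print every given command
# name that cannot be resolved from the current PATH.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        # 'command -v' succeeds only if the shell can resolve the name
        command -v "$cmd" > /dev/null 2>&1 || echo "$cmd"
    done
}

# Commands used by the minimal workflow
missing=$(missing_commands getfastq anqfastq tabfastq)
if [ -n "$missing" ]; then
    # Left unquoted on purpose, so that the names are printed on one line
    echo "Not found on PATH:" $missing >&2
fi
```

An empty result means that all the listed commands were found.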

Complete Step-by-Step Workflow

A more complete workflow might include the download of metadata, a read trimming step, multiple quality control steps, and the inclusion of gene annotations and experimental design information in the count matrix.

# Download 12 (PE) FASTQs in parallel and fetch GEO-ENA cross-referenced metadata
getfastq --urls GSE138309 > ./GSE138309_wgets.sh
getfastq --multi GSE138309_wgets.sh
metaharvest --geo --ena GSE138309 > GSE138309_meta.csv

# Trim and QC
qcfastq --out=FastQC_raw .
trimfastq .
qcfastq --out=FastQC_trim .

# Align, quantify, and QC
anqfastq .
qcfastq --tool=QualiMap .
qcfastq --tool=MultiQC .

# Clean up
rm *.fastq.gz

# Assemble an isoform-level count matrix with annotation and experimental design
groups=(Ctrl Ctrl Ctrl Treat Treat Treat)
tabfastq --isoforms --names=human --design="${groups[*]}" --metric=expected_count .

# Explore samples through PCA
qcfastq --tool=PCA .
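
The `--design="${groups[*]}"` argument above relies on a Bash detail that is easy to overlook: with the `[*]` subscript inside double quotes, the whole array expands to a single word, whose elements are joined by the first character of `IFS` (a space by default). The sketch below only illustrates this shell behavior; `count_args` is a hypothetical helper, and how `tabfastq` parses the resulting string is not covered here.

```shell
#!/bin/bash
groups=(Ctrl Ctrl Ctrl Treat Treat Treat)

# "${groups[*]}" joins all the elements into ONE space-separated word
design="${groups[*]}"
echo "$design"

# Hypothetical helper: print the number of arguments it receives
count_args() { echo "$#"; }

count_args "${groups[*]}"   # one single argument
count_args "${groups[@]}"   # one argument per array element
```

Using `"${groups[@]}"` in the `--design=` option would instead split the labels into separate arguments, with only the first one attached to the option.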

Complete Workflow in Batch Mode

Due to the typical hardware requirements for read alignment and subsequent transcript abundance quantification, x.FASTQ has been designed to be installed on one (or a few) remote Linux servers and accessed by multiple client users via SSH. Accordingly, each x.FASTQ module runs by default in the background and persistently (i.e., ignoring the HUP hangup signal), so that the user is not forced to keep the connection active for the entire duration of the analysis, but only for job scheduling. In this way, each x.FASTQ module can be run independently as a single analysis step.
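
The persistence mechanism described above can be sketched in plain Bash. This is only an illustration of the general technique (ignoring `SIGHUP` and detaching from the terminal), not the actual x.FASTQ implementation: `run_persistent` is a hypothetical helper and the `sh -c` command is a placeholder for a long-running analysis.

```shell
#!/bin/bash
# Hypothetical helper (not x.FASTQ code): start a command so that it
# keeps running after the SSH session that launched it is closed.
run_persistent() {
    local log="$1"
    shift
    # nohup makes the command ignore SIGHUP, '&' sends it to the background,
    # and the redirections detach stdin, stdout, and stderr from the terminal
    nohup "$@" > "$log" 2>&1 < /dev/null &
    echo "$!"   # PID of the background job
}

# Placeholder command standing in for a long-running analysis step
log=$(mktemp)
pid=$(run_persistent "$log" sh -c 'sleep 1; echo "analysis finished"')
echo "Started PID ${pid}; output is collected in ${log}"
```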

Alternatively, multiple modules can be chained together in a single pipeline to automate the entire analysis workflow by using the -w | --workflow option, which runs each module in the foreground. Here is the batched version of the previous workflow:

#!/bin/bash
## Prototypical x.FASTQ pipeline

# Download 12 (PE) FASTQs in parallel and fetch GEO-ENA cross-referenced metadata
getfastq --urls GSE138309 > ./GSE138309_wgets.sh
getfastq -w --multi GSE138309_wgets.sh
metaharvest --geo --ena GSE138309 > GSE138309_meta.csv

# Check FASTQ fileset completeness before going on
if ! getfastq --progress-complete; then
   echo "FASTQ file possibly missing! Aborting the pipeline..."
   exit 1
fi

# Trim and QC
qcfastq -w --out=FastQC_raw .
trimfastq -w .
qcfastq -w --out=FastQC_trim .

# Align, quantify, and QC
anqfastq -w .
qcfastq -w --tool=QualiMap .
qcfastq -w --tool=MultiQC .

# Clean up
rm *.fastq.gz

# Assemble an isoform-level count matrix with annotation and experimental design
groups=(Ctrl Ctrl Ctrl Treat Treat Treat)
tabfastq -w --isoforms --names=human --design="${groups[*]}" --metric=expected_count .

# Explore samples through PCA
qcfastq -w --tool=PCA .
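
The script above checks explicitly only the download step. As a possible extension, every step could be wrapped so that the pipeline stops at the first failure. The sketch below assumes that each module run with `-w` returns a non-zero exit status when it fails, which is not stated in this README; `run_step` is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: run one pipeline step, log it with a timestamp,
# and propagate its exit status to the caller.
run_step() {
    local status
    echo "[$(date '+%F %T')] Running: $*"
    "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "Step failed (exit status ${status}): $*" >&2
    fi
    return "$status"
}

# Possible use inside the pipeline script (commands taken from it):
#   run_step trimfastq -w . || exit 1
#   run_step anqfastq -w .  || exit 1
```

Each `-w` step of the prototypical pipeline could be wrapped in the same way.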

Save the pipeline as a single script file (e.g., pipeline.xfastq) and run the entire workflow in the background with nohup:

nohup bash pipeline.xfastq &
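
Since the pipeline keeps running after logout, it helps to be able to check on it from a later session. The following is a minimal sketch under stated assumptions: the file names `pipeline.log` and `pipeline.pid` are hypothetical, and `job_status` is an illustrative helper based on `kill -0`, not an x.FASTQ feature.

```shell
#!/bin/bash
# Hypothetical helper: report whether the job whose PID is stored in
# the given file is still running.
job_status() {
    local pidfile="$1"
    local pid
    pid=$(cat "$pidfile" 2> /dev/null) || { echo "unknown"; return 2; }
    # 'kill -0' sends no signal: it only tests whether the process exists
    if kill -0 "$pid" 2> /dev/null; then
        echo "running"
    else
        echo "finished"
        return 1
    fi
}

# Launch as shown above, keeping the log and the PID under explicit names:
#   nohup bash pipeline.xfastq > pipeline.log 2>&1 &
#   echo $! > pipeline.pid
# Later, possibly from a new SSH session:
#   job_status pipeline.pid
```

Because PIDs can be reused by the operating system, a "running" answer obtained long after the launch should be cross-checked against the log file.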

Complete Workflow with Moliere

Alternatively, a similar workflow can be performed in a single command using Moliere, a "precast" Python script that runs, in order, getfastq, qcfastq, trimfastq, qcfastq (again), anqfastq, and tabfastq, covering the whole analysis process with sensible defaults.

nohup moliere analyse GSE138309 &

Documentation

Each module (including Moliere) has its own -h | --help option, which provides detailed information on possible arguments and command syntax.

The full x.FASTQ documentation, including the installation procedure on the server machine, can be found in the docs folder.

A PDF version is also available as a preprint from Preprints.org with DOI: 10.20944/preprints202507.0213
