x.FASTQ is a suite of Bash wrappers for original and third-party software designed to make RNA-Seq data analysis more automated, but also accessible to wet-lab biologists without a specific bioinformatics background.
x.FASTQ provides several modules covering the entire RNA-Seq data analysis workflow, from raw read retrieval to count matrix generation. Each module is launched through its own CLI-executable Bash command:
| Module Name | Performed Task |
| --- | --- |
| getFASTQ | downloads NGS raw data in FASTQ format from the ENA database |
| trimFASTQ | performs adapter and quality trimming by running BBDuk |
| anqFASTQ | aligns reads and quantifies transcript abundance by running STAR and RSEM |
| qcFASTQ | runs quality-control tools, such as FastQC and MultiQC |
| tabFASTQ | merges counts from multiple samples into a single expression table |
| metaharvest | fetches metadata from GEO and/or ENA databases |
| x.FASTQ | performs common tasks of general utility (disk usage monitor, dependency report, ...) |
Assuming that you have identified a study of interest from GEO (e.g., GSE138309), have already created a project folder somewhere (`mkdir '<anyPath>'/GSE138309`), and have moved into it (`cd '<anyPath>'/GSE138309`), here are some possible sample workflows.
As an example of a minimal workflow, consider the following set of commands to retrieve the FASTQ files, align and quantify them, and generate the gene-level count matrix.
```bash
# Download FASTQs, align, quantify, and assemble a gene-level count matrix
getfastq -u GSE138309 > ./GSE138309_wgets.sh
getfastq GSE138309_wgets.sh
anqfastq .
tabfastq .
```
A more complete workflow might include the download of metadata, a read trimming step, multiple quality control steps, and the inclusion of gene annotations and experimental design information in the count matrix.
```bash
# Download 12 (PE) FASTQs in parallel and fetch GEO-ENA cross-referenced metadata
getfastq --urls GSE138309 > ./GSE138309_wgets.sh
getfastq --multi GSE138309_wgets.sh
metaharvest --geo --ena GSE138309 > GSE138309_meta.csv

# Trim and QC
qcfastq --out=FastQC_raw .
trimfastq .
qcfastq --out=FastQC_trim .

# Align, quantify, and QC
anqfastq .
qcfastq --tool=QualiMap .
qcfastq --tool=MultiQC .

# Clean up
rm *.fastq.gz

# Assemble an isoform-level count matrix with annotation and experimental design
groups=(Ctrl Ctrl Ctrl Treat Treat Treat)
tabfastq --isoforms --names=human --design="${groups[*]}" --metric=expected_count .

# Explore samples through PCA
qcfastq --tool=PCA .
```
Due to the typical hardware requirements for read alignment and subsequent transcript abundance quantification, x.FASTQ has been designed to be installed on one (or a few) remote Linux servers and accessed by multiple client users via SSH.
Accordingly, each x.FASTQ module runs by default in the background and persistently (i.e., ignoring the `HUP` hangup signal), so that the user is not forced to keep the connection active for the entire duration of the analysis, but only for job scheduling.
In this way, each x.FASTQ module can be run independently as a single analysis step.
Alternatively, multiple modules can be chained together in a single pipeline to automate the entire analysis workflow, using the `-w | --workflow` option for foreground execution.
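As an illustration of the two execution modes, the same alignment-and-quantification step could be launched either detached (the default) or in the foreground with `-w`:

```bash
# Default mode: the job is detached and keeps running even if the SSH session is closed
anqfastq .

# Workflow mode: the job runs in the foreground, so the next command waits for it to finish
anqfastq -w .
```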
Here is the batched version of the previous workflow:
```bash
#!/bin/bash

## Prototypical x.FASTQ pipeline

# Download 12 (PE) FASTQs in parallel and fetch GEO-ENA cross-referenced metadata
getfastq --urls GSE138309 > ./GSE138309_wgets.sh
getfastq -w --multi GSE138309_wgets.sh
metaharvest --geo --ena GSE138309 > GSE138309_meta.csv

# Check FASTQ fileset completeness before going on
if ! getfastq --progress-complete; then
    echo "FASTQ file possibly missing! Aborting the pipeline..."
    exit 1
fi

# Trim and QC
qcfastq -w --out=FastQC_raw .
trimfastq -w .
qcfastq -w --out=FastQC_trim .

# Align, quantify, and QC
anqfastq -w .
qcfastq -w --tool=QualiMap .
qcfastq -w --tool=MultiQC .

# Clean up
rm *.fastq.gz

# Assemble an isoform-level count matrix with annotation and experimental design
groups=(Ctrl Ctrl Ctrl Treat Treat Treat)
tabfastq -w --isoforms --names=human --design="${groups[*]}" --metric=expected_count .

# Explore samples through PCA
qcfastq -w --tool=PCA .
```
Just save this pipeline as a single script file (e.g., `pipeline.xfastq`) and run the entire workflow in the background with `nohup`:

```bash
nohup bash pipeline.xfastq &
```
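Since `nohup` sends the script's standard output to a `nohup.out` file by default (unless it is redirected elsewhere), the progress of the detached pipeline can then be followed with standard tools, for example:

```bash
# Follow the pipeline log as it is written (Ctrl-C stops following; the job keeps running)
tail -f nohup.out
```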
Alternatively, a similar workflow can be performed in a single command using Moliere, a "precasted" Python script that runs, in order, `getfastq`, `qcfastq`, `trimfastq`, `qcfastq` (again), `anqfastq`, and `tabfastq`, covering the whole analysis process with sensible defaults.
```bash
nohup moliere analyse GSE138309 &
```
Each module (including Moliere) has its own `-h | --help` option, which provides detailed information on possible arguments and command syntax.
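For example, the usage of the getFASTQ module can be printed as shown below; the same applies to every other module.

```bash
# Print the usage message and option list of the getFASTQ module
getfastq --help
```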
The full x.FASTQ documentation, including the installation procedure on the server machine, can be found in the `docs` folder. A PDF version is also available as a preprint on Preprints.org (DOI: 10.20944/preprints202507.0213).