# 从零开始：基于Linux的RNA-Seq数据分析全流程实战

零点119官方团队2026-01-022026-01-02

从零开始：基于Linux的RNA-Seq数据分析全流程实战

引言

生物信息学分析已成为现代生命科学研究不可或缺的工具。随着高通量测序技术的普及，RNA-Seq（转录组测序）成为研究基因表达、发现新转录本和可变剪接事件的主流方法。本教程将带领你从原始测序数据开始，完成一个完整的RNA-Seq分析流程，涵盖质量控制、比对、定量和差异表达分析等关键步骤。

环境准备

1. 系统要求与软件安装

首先确保你有一个Linux环境（Ubuntu 20.04或CentOS 7+）。我们将使用conda进行软件管理：

# 安装miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

# 创建生物信息学环境
conda create -n rna-seq python=3.8
conda activate rna-seq

# 安装必要工具
conda install -c bioconda fastqc multiqc trimmomatic hisat2 samtools subread
conda install -c bioconda bioconductor-deseq2 r-base=4.1

2. 数据准备

假设我们有两个处理组（Control和Treatment），每组三个生物学重复：

# 创建项目目录结构
mkdir -p rna_seq_project/{raw_data,cleaned_data,alignment,counts,results,qc_reports}
cd rna_seq_project

# 下载示例数据（这里使用模拟数据路径）
# 实际项目中，你的数据可能来自SRA或其他来源
# 假设原始数据文件命名格式：sample_R1.fastq.gz, sample_R2.fastq.gz

完整分析流程

步骤1：原始数据质量评估

使用FastQC评估原始测序数据质量：

# 批量运行FastQC
for file in raw_data/*.fastq.gz; do
    fastqc $file -o qc_reports/raw/
done

# 使用MultiQC汇总所有QC报告
multiqc qc_reports/raw/ -o qc_reports/raw_multiqc/

步骤2：数据预处理与质量控制

使用Trimmomatic去除低质量序列和接头：

# 创建适配器文件（Illumina通用接头）
echo ">PrefixPE/1
TACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PrefixPE/2
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT" > adapters.fa

# 批量处理双端测序数据
for sample in Control1 Control2 Control3 Treatment1 Treatment2 Treatment3; do
    trimmomatic PE -threads 4 \
    raw_data/${sample}_R1.fastq.gz \
    raw_data/${sample}_R2.fastq.gz \
    cleaned_data/${sample}_R1_paired.fastq.gz \
    cleaned_data/${sample}_R1_unpaired.fastq.gz \
    cleaned_data/${sample}_R2_paired.fastq.gz \
    cleaned_data/${sample}_R2_unpaired.fastq.gz \
    ILLUMINACLIP:adapters.fa:2:30:10 \
    LEADING:3 \
    TRAILING:3 \
    SLIDINGWINDOW:4:15 \
    MINLEN:36
    
    echo "已完成样本 ${sample} 的质控处理"
done

步骤3：参考基因组比对

使用HISAT2进行序列比对：

# 下载参考基因组和注释文件（以人类GRCh38为例）
wget ftp://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
wget ftp://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/Homo_sapiens.GRCh38.104.gtf.gz

# 解压文件
gunzip *.gz

# 建立HISAT2索引
hisat2-build Homo_sapiens.GRCh38.dna.primary_assembly.fa grch38_index

# 批量比对
for sample in Control1 Control2 Control3 Treatment1 Treatment2 Treatment3; do
    hisat2 -p 8 \
    -x grch38_index \
    -1 cleaned_data/${sample}_R1_paired.fastq.gz \
    -2 cleaned_data/${sample}_R2_paired.fastq.gz \
    -S alignment/${sample}.sam \
    --summary-file alignment/${sample}_alignment_summary.txt
    
    echo "已完成样本 ${sample} 的比对"
done

步骤4：SAM文件处理与排序

将SAM文件转换为BAM格式并排序：

for sample in Control1 Control2 Control3 Treatment1 Treatment2 Treatment3; do
    # 转换SAM到BAM
    samtools view -@ 4 -bS alignment/${sample}.sam > alignment/${sample}.bam
    
    # 按坐标排序
    samtools sort -@ 4 alignment/${sample}.bam -o alignment/${sample}_sorted.bam
    
    # 建立索引
    samtools index alignment/${sample}_sorted.bam
    
    # 删除中间文件
    rm alignment/${sample}.sam alignment/${sample}.bam
    
    echo "已完成样本 ${sample} 的BAM文件处理"
done

步骤5：基因表达定量

使用featureCounts进行基因计数：

# 准备样本列表
echo "alignment/Control1_sorted.bam
alignment/Control2_sorted.bam
alignment/Control3_sorted.bam
alignment/Treatment1_sorted.bam
alignment/Treatment2_sorted.bam
alignment/Treatment3_sorted.bam" > bam_files.txt

# 运行featureCounts
featureCounts -T 8 \
-p \
-t exon \
-g gene_id \
-a Homo_sapiens.GRCh38.104.gtf \
-o counts/gene_counts.txt \
--extraAttributes gene_name,gene_biotype \
-s 0 \
$(cat bam_files.txt)

# 提取计数矩阵
grep -v '^#' counts/gene_counts.txt | cut -f1,7-12 > counts/count_matrix.txt

步骤6：差异表达分析

使用R和DESeq2进行差异表达分析：

# 创建R脚本 differential_expression.R
cat > scripts/differential_expression.R << 'EOF'
# 加载必要的库
library(DESeq2)
library(ggplot2)
library(pheatmap)
library(EnhancedVolcano)

# 设置工作目录
setwd(".")

# 读取计数矩阵
count_data <- read.table("counts/count_matrix.txt", 
                         header = TRUE, 
                         row.names = 1,
                         sep = "\t")

# 准备样本信息
colnames(count_data) <- c("Control1", "Control2", "Control3",
                          "Treatment1", "Treatment2", "Treatment3")

sample_info <- data.frame(
  row.names = colnames(count_data),
  condition = factor(c(rep("Control", 3), rep("Treatment", 3))),
  batch = factor(rep(c("1", "2", "3"), 2))
)

# 创建DESeq2数据集
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_info,
                              design = ~ condition)

# 过滤低表达基因
keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep,]

# 运行DESeq2分析
dds <- DESeq(dds)

# 获取结果
res <- results(dds, contrast = c("condition", "Treatment", "Control"))
res_ordered <- res[order(res$padj),]

# 保存结果
write.csv(as.data.frame(res_ordered), 
          file = "results/differential_expression_results.csv")

# 提取显著差异表达基因（padj < 0.05, |log2FC| > 1）
significant_genes <- subset(res_ordered, padj < 0.05 & abs(log2FoldChange) > 1)
write.csv(as.data.frame(significant_genes),
          file = "results/significant_genes.csv")

# 数据标准化用于可视化
vsd <- vst(dds, blind = FALSE)

# 1. PCA图
pdf("results/PCA_plot.pdf")
plotPCA(vsd, intgroup = "condition") + 
  theme_minimal() +
  ggtitle("PCA Plot - Control vs Treatment")
dev.off()

# 2. 热图（前50个差异基因）
top_genes <- head(order(res$padj), 50)
mat <- assay(vsd)[top_genes, ]
mat <- mat - rowMeans(mat)

pdf("results/heatmap_top50_genes.pdf", width = 8, height = 10)
pheatmap(mat,
         annotation_col = as.data.frame(colData(vsd)["condition"]),
         show_rownames = FALSE,
         main = "Top 50 Differentially Expressed Genes")
dev.off()

# 3. 火山图
pdf("results/volcano_plot.pdf", width = 10, height = 8)
EnhancedVolcano(res_ordered,
                lab = rownames(res_ordered),
                x = 'log2FoldChange',
                y = 'padj',
                title = 'Control vs Treatment',
                pCutoff = 0.05,
                FCcutoff = 1,
                pointSize = 2.0,
                labSize = 4.0,
                colAlpha = 0.6,
                legendPosition = 'right',
                drawConnectors = TRUE,
                widthConnectors = 0.5)
dev.off()

# 4. MA图
pdf("results/MA_plot.pdf")
plotMA(res, main = "MA Plot", ylim = c(-5, 5))
dev.off()

# 输出分析摘要
cat("分析完成！\n")
cat("总基因数：", nrow(res), "\n")
cat("显著差异基因数（padj < 0.05, |log2FC| > 1）：", nrow(significant_genes), "\n")
cat("结果文件已保存到 results/ 目录\n")
EOF

# 运行R脚本
Rscript scripts/differential_expression.R

步骤7：功能富集分析（可选）

# 安装clusterProfiler（在R环境中）
R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('clusterProfiler'); BiocManager::install('org.Hs.eg.db')"

# 创建富集分析脚本
cat > scripts/enrichment_analysis.R << 'EOF'
library(clusterProfiler)
library(org.Hs.eg.db)
library(ggplot2)

# 读取显著差异基因
sig_genes <- read.csv("results/significant_genes.csv", row.names = 1)

# 提取基因ID（假设是Ensembl ID）
gene_ids <- rownames(sig_genes)

# 转换为ENTREZ ID
entrez_ids <- mapIds(org.Hs.eg.db,
                     keys = gene_ids,
                     column = "ENTREZID",
                     keytype = "ENSEMBL",
                     multiVals = "first")

# 去除NA值
entrez_ids <- na.omit(entrez_ids)

# GO富集分析
go_enrich <- enrichGO(gene = entrez_ids,
                      OrgDb = org.Hs.eg.db,
                      keyType = "ENTREZID",
                      ont = "BP",
                      pAdjustMethod = "BH",
                      qvalueCutoff = 0.05,
                      readable = TRUE)

# KEGG通路富集分析
kegg_enrich <- enrichKEGG(gene = entrez_ids,
                          organism = 'hsa',
                          pAdjustMethod = "BH",
                          qvalueCutoff = 0.05)

# 保存结果
write.csv(as.data.frame(go_enrich), "results/GO_enrichment.csv")
write.csv(as.data.frame(kegg_enrich), "results/KEGG_enrichment.csv")

# 可视化
pdf("results/GO_dotplot.pdf", width = 12, height = 8)
dotplot(go_enrich, showCategory = 20, title = "GO Enrichment Analysis")
dev.off()

pdf("results/KEGG_barplot.pdf", width = 10, height = 8)
barplot(kegg_enrich, showCategory = 15, title = "KEGG Pathway Enrichment")
dev.off()
EOF

# 运行富集分析
Rscript scripts/enrichment_analysis.R

结果解读与最佳实践

1. 质量控制指标

每个样本的测序深度应至少达到20M reads
Q30碱基比例应大于85%
GC含量应在物种正常范围内

2. 比对统计

通常期望比对率在70-90%之间
多比对率应控制在合理范围内（<10%）

3. 差异表达结果验证

检查PCA图中样本是否按处理组聚类
验证housekeeping基因的表达是否稳定
考虑使用qPCR验证关键差异基因

4. 性能优化建议

# 使用并行处理加速分析
# 安装GNU Parallel
conda install -c conda-forge parallel

# 并行运行FastQC示例
ls raw_data/*.fastq.gz | parallel -j 4 "fastqc {} -o qc_reports/raw/"

# 使用多线程处理大文件
samtools sort -@ 8 input.bam -o output.bam

🌟 故障排除

常见问题及解决方案：

内存不足错误

1 2	# 增加Java堆大小（针对Trimmomatic等Java工具） export JAVA_OPTS="-Xmx16g -Xms4g"

磁盘空间不足

1 2	# 使用管道减少中间文件 hisat2 ... \| samtools view -bS - \| samtools sort -o sorted.bam

基因计数为0
- 检查GTF文件版本是否与参考基因组匹配
- 确认featureCounts参数设置正确

💡 总结

本教程详细介绍了RNA-Seq数据分析的完整流程，从原始数据到差异表达基因的发现。通过这个流程，你可以：

评估测序数据质量并进行适当的质量控制
将reads比对到参考基因组
定量基因表达水平
识别差异表达基因
进行功能富集分析理解生物学意义

实际分析中，你可能需要根据具体实验设计调整分析参数。建议始终保存完整的分析脚本和参数记录，以确保分析的可重复性。随着技术的不断发展，保持学习新的工具和方法也是生物信息学分析的重要部分。

记住，生物信息学分析不仅是技术操作，更需要生物学洞察力。始终将统计结果与生物学背景相结合，才能得出有意义的科学结论。

[up主专用，视频内嵌代码贴在这]