仓库源文站点原文


title: 与 Deseq2 斗智斗勇 tags:


今天要做一个差异表达分析,理所当然的用上 Deseq2 包。

<!--more-->

此次的分析方式与之前有差别,需要先进行 fpkm 标准化后再进行差异表达分析,于是读入 counts 矩阵:

rm(list = ls())

counts <- read.csv(
    'rna-seq.counts',
    header = TRUE,
    sep = '\t',
    # row.names = "Geneid",
    comment.char = '#',
    check.names = FALSE
)

然后使用一个 for 循环进行 fpkm 计算:

for (clm in colnames(counts)[6:ncol(counts)]) {
    col_fpkm <- paste0(clm, "_FPKM")     # 新列的名称,加上"_FPKM"后缀
    total <- sum(counts[, clm])          # 计算每个样本的总读数
    counts[col_fpkm] <- (counts[, clm] * 10^6) / (counts[, "Length"] / 1000)  # 使用相应样本的长度值计算 FPKM 并添加 FPKM 列
}

计算之后,删去矩阵里没用的数据和原来的 counts 矩阵:

counts = counts[,-c(2:19)]

看了一下数据,有很多的 0,为了防止意外,全部给它加一个不怎么影响结果的数值 1,然后再写入文件:

numeric_mask <- sapply(counts, is.numeric)
counts[numeric_mask] <- lapply(counts[numeric_mask], function(x) ifelse(x < 1, x + 1, x))

write.table(counts, file = 'fpkm_output', sep = '\t', row.names = FALSE)

拿到 fpkm 矩阵后开始进行差异表达分析,还是使用之前的代码:

# 数据示例:
# "Geneid"  "Normal-1-1_FPKM"   "Normal-1-2_FPKM"   "Normal-1-3_FPKM"   "HS-1-1_FPKM"   "HS-1-2_FPKM"   "HS-1-3_FPKM"   "Normal-2-1_FPKM"   "Normal-2-2_FPKM"   "Normal-2-3_FPKM"   "HS-2-1_FPKM"   "HS-2-2_FPKM"   "HS-2-3_FPKM"
# "Os01g0100100"    254855195.911414    242248722.316865    273253833.049404    587052810.902896    594548551.959114    539011925.042589    186030664.39523 145826235.093697    125383304.940375    103918228.279387    110391822.827939    100851788.756388
# "Os01g0100200"    1   887311.446317657    887311.446317657    1774622.89263531    3549245.78527063    5323868.67790595    1   887311.446317657    1   1   1   1
# "Os01g0100300"    1   1   1   1   1   1   1   1   1   1   1   1
# "Os01g0100400"    48588616.3813049    32855159.648311 51365108.7459509    136973623.322536    130032392.410921    124479407.681629    75890791.3003239    51827857.473392 59694585.8398889    9254974.54881999    16658954.187876 15270708.005553
# "Os01g0100466"    1   1   1   1   1   1   1   1   1   1   1   1
# "Os01g0100500"    610261026.10261 571107110.711071    669216921.692169    919441944.19442 886138613.861386    851485148.514851    698019801.980198    548154815.481548    536903690.369037    458595859.585959    527452745.274527    504500450.045004
# "Os01g0100600"    188775510.204082    179591836.734694    202040816.326531    398469387.755102    369387755.102041    339795918.367347    218367346.938776    152551020.408163    142346938.77551 1.75e+08    186224489.795918    137755102.040816
# 使用 fpkm 标准化之后的数据

# setwd('./pub_share/')
rm(list = ls())  
Sys.setenv(LANGUAGE = "en")

library(DESeq2)

fpkm = read.csv(
    'fpkm_output', 
    header = T,  
    sep = '\t', 
    row.names = "Geneid", 
    comment.char = '#', 
    check.name = F
)

samples = data.frame(
    sampleID = c("Normal-1-1_FPKM", "Normal-1-2_FPKM", "Normal-1-3_FPKM", "HS-1-1_FPKM", "HS-1-2_FPKM", "HS-1-3_FPKM", "Normal-2-1_FPKM", "Normal-2-2_FPKM", "Normal-2-3_FPKM", "HS-2-1_FPKM", "HS-2-2_FPKM", "HS-2-3_FPKM"), 
    sample = c("sample1", "sample1", "sample1", "sample2", "sample2", "sample2", "sample3", "sample3", "sample3", "sample4", "sample4", "sample4")
)

rownames(samples) = samples$sampleID

samples$sample = factor(samples$sample, levels = c('sample1', 'sample2', 'sample3', 'sample4'))

dds = DESeqDataSetFromMatrix(countData = fpkm, colData = samples, design = ~sample)

dds_count <- DESeq(dds, fitType = 'mean', minReplicatesForReplace = 7, parallel = FALSE)

sampl1_vs_sample2 <- results(dds_count, contrast = c('sample', 'sample1', 'sample2'))

sampl3_vs_sample4 <- results(dds_count, contrast = c('sample', 'sample3', 'sample4'))

result1 <- data.frame(sampl1_vs_sample2, stringsAsFactors = FALSE, check.names = FALSE)
result2 <- data.frame(sampl3_vs_sample4, stringsAsFactors = FALSE, check.names = FALSE)

write.table(result1, 'sampl1_vs_sample2.DESeq2.txt', col.names = NA, sep = '\t', quote = FALSE)
write.table(result2, 'sampl3_vs_sample3.DESeq4.txt', col.names = NA, sep = '\t', quote = FALSE)

报错说是 Deseq2 处理的数据得是整数,于是加上:

fpkm <- round(fpkm)

将数值四舍五入为整数。

这里突然报错说有 NA 空值converting counts to integer mode Error in validObject(.Object) : invalid class “DESeqDataSet” object: NA values are not allowed in the count matrix In addition: Warning message: In mde(x) : NAs introduced by coercion to integer range,那就奇怪了,我不是全部加了 1 吗?于是这里判断空值,给空值加 1,再加一个判断,如果加 1 后还是空值那就直接删掉该行:

numeric_mask <- sapply(fpkm, is.numeric)
fpkm[numeric_mask] <- lapply(fpkm[numeric_mask], function(x) ifelse(is.numeric(x) & x < 1, x + 100, x))

missing_values <- sum(is.na(fpkm))
if (missing_values > 0) {
  fpkm <- na.omit(fpkm)
}

还是会报同样的错误 ... 于是 Google 了一下,发现原来是某些比较大的数值仍然会保留小数点,例如:

as.character(x)
[1] "8903.00000000001" "946.999999999999"
[3] "9113.99999999999" "9451.00000000001"
[5] "9156.99999999999" "875.999999999999"

直接给我无语住了,于是根据网上的资料加上:

fpkm = apply(fpkm, 2, as.integer)

成功跑出差异表达分析,查看分析结果的时候突然发现,原本第一列的基因 ID 居然全部变成了数字??

排查了半天才发现,哦,原来fpkm = apply(fpkm, 2, as.integer)会直接把 ID 也给转化数字 ... 于是把这一句改成:

fpkm[-1, ] <- apply(fpkm[-1, ], 2, as.integer)

问题就解决了~真的费劲~