Last updated: 2021-08-20
Knit directory: SCENIC_pipeline/
The command set.seed(20210818) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
The results in this page were generated with repository version 0110fe0. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rproj.user/
Unstaged changes:
Modified: analysis/_site.yml
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/Step2_motif_discovery.Rmd) and HTML (docs/Step2_motif_discovery.html) files.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 728c832 | lily123920 | 2021-08-18 | diyici |
html | 728c832 | lily123920 | 2021-08-18 | diyici |
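Background: after the preceding analysis, every transcription factor (TF) has a set of strongly correlated candidate target genes, and many gene regulatory network analyses stop here. But does a strong correlation derived from co-expression necessarily mean that the TF regulates the gene directly? The SCENIC authors questioned this.
Goal: prune the co-expression modules with a motif-based strategy so that they become biologically meaningful regulatory units (regulons).
Method: this step relies on the algorithm in the RcisTarget package. The computation can be divided roughly into two parts: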
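1. Select motifs.
This analysis relies on a gene-motif ranking database whose rows are genes, whose columns are motifs, and whose values are ranks. This is the cisTarget database downloaded earlier; it contains a genome-wide, cross-species ranking for each motif.
(1) Motif enrichment analysis
For a gene set, select the motifs that are significantly over-represented around the TSS of the genes in the set.
Algorithm: recovery-based method;
Metric: NES, the normalized enrichment score.
(2) TF annotation of motifs
Annotations are split into high and low confidence:
(a) TFs annotated directly in the database or inferred by orthology are high-confidence results;
(b) TFs annotated through motif sequence similarity are low-confidence results.
Filter criteria: (1) the motif is annotated to the corresponding TF; (2) the enrichment score (NES) is > 3.0. Motifs that meet both conditions are kept.
Tip: (1) Keep a picture of the data flow in mind: matrix (rows = genes, columns = motifs, values = ranks) → matrix (rows = regulons, columns = motifs, values = NES) → matrix (rows = regulons, columns = motifs, values = NES plus TF annotation) → keep the motifs whose values meet the NES and TF criteria. (2) The unit of processing is each candidate regulon, i.e. each co-expression module.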
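2. Predict candidate target genes.
Strategy: score the genes in the co-expression module with the retained motifs and identify the genes with significantly high scores (interpreted as the motif lying close to the TSS of these genes).
Filtering: remove the genes in the co-expression module that do not score highly for the motif; the remaining gene set is the regulon.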
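The motif-selection step above can be sketched with a toy example in base R. This is only an illustration under stated assumptions, not RcisTarget's implementation: the rankings are random, the helper `recovery_auc`, the rank cutoff of 50 and the annotation vector `has_tf` are all hypothetical, and the NES is taken as the AUC of one motif standardized against the AUCs of all motifs for the same gene set.

```r
# Toy recovery-based enrichment (illustrative only, not RcisTarget's code).
set.seed(1)
n_genes  <- 1000          # size of the toy "genome"
gene_set <- 1:20          # row indices of the genes in one co-expression module
rank_cut <- 50            # only the top of each ranking contributes to the AUC

# Rankings matrix: rows = genes, columns = motifs, values = ranks (1 = best).
rankings <- sapply(1:200, function(i) sample(n_genes))
colnames(rankings) <- paste0("motif", 1:200)
# Make motif1 enriched: the module genes get ranks 1..20 for this motif.
rankings[, "motif1"] <- c(1:20, sample(21:n_genes))

# Area under the recovery curve up to rank_cut, scaled to [0, 1].
recovery_auc <- function(ranks_of_set, rank_cut) {
  hits <- ranks_of_set[ranks_of_set <= rank_cut]
  sum(rank_cut - hits + 1) / (rank_cut * length(ranks_of_set))
}
auc <- apply(rankings[gene_set, ], 2, recovery_auc, rank_cut = rank_cut)

# NES: how many standard deviations a motif's AUC lies above the mean AUC.
nes <- (auc - mean(auc)) / sd(auc)

# Hypothetical TF annotation: TRUE if the motif is annotated to the module's TF.
has_tf <- setNames(rep(c(TRUE, FALSE), 100), colnames(rankings))

# Keep motifs that pass both criteria: annotated to the TF and NES > 3.0.
kept <- names(nes)[nes > 3.0 & has_tf]
print(kept)
```

In this toy setting the planted motif stands far above the random background, which is the behaviour the NES > 3.0 threshold is meant to capture.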
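## Infer the transcriptional regulatory network (regulons)
library(SCENIC)
# scenicOptions is the settings object created in the earlier steps
runSCENIC_2_createRegulons(scenicOptions)
# The call above also accepts, e.g., coexMethods = c("w001", "w005", "top50", "top5perTarget", "top10perTarget", "top50perTarget")
# By default the co-expression modules from all 6 methods are processed; selecting fewer methods reduces the computation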
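Note: this step runs motif enrichment and TF annotation for every co-expression module against all of its candidate motifs, so it is computationally very heavy, and the more cells there are, the heavier it gets. It is the first time-consuming stage of the pipeline: about 3-4 h for 1000 cells.
Extension: optional arguments of the function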
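runSCENIC_2_createRegulons(
  scenicOptions,
  minGenes = 20,               # minimum number of genes a gene set needs to be analysed
  coexMethods = NULL,          # NULL: use the modules from all co-expression methods
  minJakkardInd = 0.8,         # merge overlapping modules based on the Jaccard index (reduces computation)
  signifGenesMethod = "aprox",
  onlyPositiveCorr = TRUE,     # only include positively correlated targets
  onlyBestGsPerMotif = TRUE
)
### These arguments reduce computation and run time by further filtering the gene lists and reducing the number of gene sets.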
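To make the minJakkardInd argument more concrete, here is a minimal base R sketch of the Jaccard index between gene sets. The helper `jaccard`, the toy modules and the merge rule (union of two modules when the index is at least 0.8) are assumptions for illustration; the exact merging logic inside SCENIC is not reproduced here.

```r
# Jaccard index: size of the intersection divided by size of the union.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

mod_a <- paste0("gene", 1:10)
mod_b <- paste0("gene", 2:10)   # shares 9 genes with mod_a, union of 10
mod_c <- paste0("gene", 8:20)   # shares 3 genes with mod_a, union of 20

j_ab <- jaccard(mod_a, mod_b)   # 0.9
j_ac <- jaccard(mod_a, mod_c)   # 0.15

# Assumed merge rule for this sketch: merge when the index reaches the threshold.
min_jaccard <- 0.8
merged <- if (j_ab >= min_jaccard) union(mod_a, mod_b) else NULL
print(c(j_ab = j_ab, j_ac = j_ac))
print(merged)
```

With a threshold of 0.8 only near-identical modules are merged, so the saving comes from not running motif enrichment twice on almost the same gene list.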
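Output files are written to the ./int and ./output directories.
Focus on three files in the ./output directory, which store the results of the corresponding analysis steps: motif enrichment and annotation, and regulon definition.
motifenrichment_preview.html and MotifEnrichment.tsv show the same information; the former is a browsable HTML version and the latter is a plain-text file. Both store the annotation of the motifs that are significantly enriched in each co-expression module.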
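geneSet: name of the gene set
motif: ID of the motif
NES: normalized enrichment score of the motif in the gene set
AUC: area under the curve, used to calculate the NES
TFinDB: marks whether the highlighted TFs belong to the high-confidence annotation (**) or the low-confidence annotation (*)
TF_highConf: TFs annotated to the motif according to motifAnnot_highConfCat
TF_lowConf: TFs annotated to the motif according to motifAnnot_lowConfCat
enrichedGenes: genes that rank highly for the given motif
nEnrGenes: number of these highly ranked genes
rankAtMax: ranking at the maximum enrichment, used to determine the number of enriched genes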
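As a rough sketch of how such a table can be queried, the example below builds a small data frame with a subset of the columns listed above and keeps the motifs with NES > 3.0 and at least one TF annotation. All values are made up, and the TF columns are simplified to bare TF names; in a real run the table would be read from the .tsv file instead.

```r
# Toy stand-in for the motif enrichment table (values are invented).
motif_tbl <- data.frame(
  geneSet     = c("TF1_top50", "TF1_top50", "TF2_w001", "TF2_w001"),
  motif       = c("motifA", "motifB", "motifC", "motifD"),
  NES         = c(5.2, 3.4, 2.1, 4.0),
  TF_highConf = c("TF1", "", "TF2", ""),
  TF_lowConf  = c("", "TF1", "", "TF9"),
  stringsAsFactors = FALSE
)

# Keep motifs that are annotated to some TF and pass the NES threshold.
annotated <- motif_tbl$TF_highConf != "" | motif_tbl$TF_lowConf != ""
hits <- motif_tbl[annotated & motif_tbl$NES > 3.0, ]

# Label the confidence level and sort by decreasing NES.
hits$confidence <- ifelse(hits$TF_highConf != "", "high", "low")
hits <- hits[order(-hits$NES), ]
print(hits[, c("geneSet", "motif", "NES", "confidence")])
```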
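RegulonTargetsInfo.tsv integrates the information above, but organizes it as a regulatory network.
TF: name of the transcription factor
gene: name of the TF's target gene
nMotif: number of motifs in the database that support the target gene
bestMotif: name of the most significantly enriched motif
NES: normalized enrichment score; the higher the score, the more significant
highConfAnnot: whether the annotation is high-confidence
Genie3Weight: weight that GENIE3 assigned to the link between the TF and the target gene
Note: this is the most important table. Once you have selected the regulons of interest, you will need it to look up the specific TF and target information.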
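Regulon names come in two forms:
(1) TF + number of target genes: the regulatory network formed by the TF and its high-confidence targets (i.e. highConfAnnot = TRUE)
(2) TF + extended + number of target genes: the regulatory network formed by the TF and all of its targets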
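The two kinds of regulon can be derived from a table shaped like RegulonTargetsInfo.tsv, as in the base R sketch below. The toy rows are invented, and the exact name format used here ("TF (Ng)" and "TF_extended (Ng)") is an assumption about how the names are written, so check it against your own output.

```r
# Toy stand-in for the regulon targets table (values are invented).
targets <- data.frame(
  TF            = c("TF1", "TF1", "TF1", "TF2", "TF2"),
  gene          = c("g1", "g2", "g3", "g2", "g4"),
  highConfAnnot = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# (1) High-confidence regulons: only the targets with highConfAnnot = TRUE.
high <- targets[targets$highConfAnnot, ]
regulons_high <- split(high$gene, high$TF)

# (2) Extended regulons: all targets of each TF.
regulons_ext <- split(targets$gene, targets$TF)
names(regulons_ext) <- paste0(names(regulons_ext), "_extended")

# Append the number of target genes to each name (assumed format).
regulons <- c(regulons_high, regulons_ext)
names(regulons) <- paste0(names(regulons), " (", lengths(regulons), "g)")
print(names(regulons))

# Look up the targets of one regulon of interest.
print(regulons[["TF1_extended (3g)"]])
```

Here TF2 has no high-confidence targets, so it only appears as an extended regulon.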
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936
[2] LC_CTYPE=Chinese (Simplified)_China.936
[3] LC_MONETARY=Chinese (Simplified)_China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_China.936
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] workflowr_1.6.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 whisker_0.4 knitr_1.33 magrittr_2.0.1
[5] R6_2.5.0 rlang_0.4.11 fansi_0.5.0 stringr_1.4.0
[9] tools_4.0.2 xfun_0.24 utf8_1.2.1 git2r_0.28.0
[13] jquerylib_0.1.4 htmltools_0.5.1.1 ellipsis_0.3.2 rprojroot_2.0.2
[17] yaml_2.2.1 digest_0.6.27 tibble_3.1.2 lifecycle_1.0.0
[21] crayon_1.4.1 later_1.2.0 sass_0.4.0 vctrs_0.3.8
[25] promises_1.2.0.1 fs_1.5.0 glue_1.4.2 evaluate_0.14
[29] rmarkdown_2.9 stringi_1.5.3 bslib_0.2.5.1 compiler_4.0.2
[33] pillar_1.6.1 jsonlite_1.7.2 httpuv_1.6.1 pkgconfig_2.0.3