Unix, R and python tools for genomics and data science
If you are from fields outside of biology, places to get you started:
A Bioinformatician's UNIX Toolbox from Heng Li
command line bootcamp teaches you Unix commands step by step
Unix in your browser. Maybe useful for teaching bash?
sshx A secure web-based, collaborative terminal https://sshx.io/ useful for teaching
Use the Unofficial Bash Strict Mode (Unless You Looove Debugging)
Defensive BASH Programming very good read for bash programming.
process substitution: Using Named Pipes and Process Substitution in Bioinformatics; Handy Bash feature: Process Substitution
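A minimal sketch of process substitution (bash-only; the file names x.txt and y.txt are made up for illustration):

```shell
#!/bin/bash
# Two files with the same lines in different order
printf 'b\na\n' > x.txt
printf 'a\nb\n' > y.txt

# diff expects two files; <(...) feeds it the output of each sort
# as if it were a file, with no temporary files to clean up
diff <(sort x.txt) <(sort y.txt) && echo "identical after sorting"
```

This is handy in bioinformatics whenever a tool insists on file arguments but you want to stream, e.g. feeding a decompressed or sorted view of a file without writing it to disk first.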
Commonly used commands for the PBS scheduler: Monitoring and Managing Your Job
test your unix skills at cmd challenge
People say awk is not part of bioinformatics :) Still, it is very useful for parsing plain-text files. Steve's Awk Academy
intro-bioinformatics: Website and slides for intro to bioinformatics class at Fred Hutch
tmate: instant terminal sharing
tmux is a terminal multiplexer similar to screen, but with more features.
tmux cheatsheet
tmux config
tmux install without root
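A minimal ~/.tmux.conf sketch (these are common community choices, not required settings; the option names are standard tmux options):

```
# ~/.tmux.conf
# remap the prefix from C-b to C-a (screen-style)
unbind C-b
set -g prefix C-a

# number windows from 1 instead of 0
set -g base-index 1

# enable mouse support (tmux >= 2.1)
set -g mouse on

# bigger scrollback buffer
set -g history-limit 10000
```

Reload a running session with `tmux source-file ~/.tmux.conf`.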
Theory and quick reference
There are three standard file descriptors: stdin, stdout and stderr (std = standard).
Basically you can:
redirect stdout to a file
redirect stderr to a file
redirect stdout to stderr
redirect stderr to stdout
redirect stderr and stdout to a file
redirect stderr and stdout to stdout
redirect stderr and stdout to stderr
1 'represents' stdout and 2 represents stderr. A little note for seeing these things: with the less command you can view both stdout (which will remain in the buffer) and stderr, which will be printed on the screen but erased as you try to 'browse' the buffer.
This will cause the stdout of a program to be written to a file.
ls -l > ls-l.txt
Here, a file called 'ls-l.txt' will be created, containing what you would see on the screen if you typed 'ls -l' and executed it.
This will cause the stderr output of a program to be written to a file.
grep da * 2> grep-errors.txt
Here, a file called 'grep-errors.txt' will be created, containing the stderr portion of the output of the 'grep da *' command.
This will cause the stdout of a program to be written to the same file descriptor as stderr.
grep da * 1>&2
Here, the stdout portion of the command is sent to stderr; you may notice that in different ways.
This will cause the stderr output of a program to be written to the same file descriptor as stdout.
grep * 2>&1
Here, the stderr portion of the command is sent to stdout; if you pipe to less, you'll see that lines that normally 'disappear' (as they are written to stderr) are now kept (because they're on stdout).
This will send every output of a program to a file. This is sometimes suitable for cron entries, if you want a command to run in absolute silence.
rm -f $(find / -name core) &> /dev/null
This (thinking of the cron entry) will delete every file called 'core' in any directory. Notice that you should be pretty sure of what a command is doing if you are going to wipe its output.
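The redirections above can be combined; a small bash sketch (the msg function and file names are made up for illustration):

```shell
#!/bin/bash
# a function that writes one line to each stream
msg() { echo "to stdout"; echo "to stderr" >&2; }

msg > out.txt 2> err.txt   # capture the two streams separately
msg > both.txt 2>&1        # stderr joins stdout; both land in one file
msg &> /dev/null           # discard everything (bash shorthand)

cat out.txt                # prints: to stdout
cat err.txt                # prints: to stderr
```

Note that order matters in `> both.txt 2>&1`: writing `2>&1 > both.txt` instead duplicates stderr onto the terminal before stdout is moved, so the errors never reach the file.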
chmod 754 myfile
This means the user has read, write and execute permissions; members of the same group have read and execute permission but not write; everyone else has only read permission.
4 stands for "read", 2 stands for "write", 1 stands for "execute", and 0 stands for "no permission." So 7 is the combination of permissions 4+2+1 (read, write, and execute), 5 is 4+0+1 (read, no write, and execute), and 4 is 4+0+0 (read, no write, and no execute).
If the numbers are hard to remember, one can use letters instead: u, g, and o stand for "user", "group", and "other"; r, w, and x stand for "read", "write", and "execute", respectively.
chmod u+x myfile
chmod g+r myfile
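A quick sanity check of the numeric and symbolic forms (stat -c is GNU coreutils; on macOS use stat -f '%Lp' instead):

```shell
#!/bin/bash
touch myfile
chmod 754 myfile
stat -c '%a' myfile          # prints: 754

# the symbolic form can set the exact same bits
chmod u=rwx,g=rx,o=r myfile
stat -c '%a' myfile          # still prints: 754
```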
Samtools, BWA and many others. It is really important to name your files correctly! See a ppt by Jenny Bryan.
Three principles for (file) names:
Put something numeric first
Use the ISO 8601 standard for dates (YYYY-MM-DD)
Left pad other numbers with zeros
If you have to rename the files...
Good naming of your files can help you extract metadata from the file name.
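A small bash sketch of the left-padding principle (the sample*.csv names are made up for illustration):

```shell
#!/bin/bash
# unpadded numbers sort badly: sample10.csv sorts before sample2.csv
for i in 1 2 10; do touch "sample${i}.csv"; done

# left-pad the numeric part to two digits so lexicographic
# order matches numeric order
for f in sample*.csv; do
  n=${f#sample}; n=${n%.csv}
  new=$(printf 'sample%02d.csv' "$n")
  [ "$f" = "$new" ] || mv "$f" "$new"
done

ls sample*.csv               # sample01.csv sample02.csv sample10.csv
```

After padding, default shell globbing, `ls`, and R's `dir()` all return the files in the order you expect.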
> dir("examples/dataset_1/")
[1] "2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A01.csv"
[2] "2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A02.csv"
[3] "2014-02-26_BRAFWTNEG_FFPEDNA-CRC-1-41_D08.csv"
[4] "2014-03-05_BRAFWTNEG_FFPEDNA-CRC-REPEAT_H03.csv"
[5] "2016-04-01_BRAFWTNEG_FFPEDNA-CRC-1-41_E12.csv"
> library("dirdf")
> dirdf("examples/dataset_1/", template="date_assay_experiment_well.ext")
date assay experiment well ext pathname
1 2013-06-26 BRAFWTNEG Plasmid-Cellline-100 A01 csv 2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A01.csv
2 2013-06-26 BRAFWTNEG Plasmid-Cellline-100 A02 csv 2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A02.csv
3 2014-02-26 BRAFWTNEG FFPEDNA-CRC-1-41 D08 csv 2014-02-26_BRAFWTNEG_FFPEDNA-CRC-1-41_D08.csv
4 2014-03-05 BRAFWTNEG FFPEDNA-CRC-REPEAT H03 csv 2014-03-05_BRAFWTNEG_FFPEDNA-CRC-REPEAT_H03.csv
Using these tools will greatly improve your working efficiency and get rid of most of your for loops: brename and csvtk.
A blog post by Mark Ziemann: Share and backup data sets with Dat http://genomespot.blogspot.com/2018/03/share-and-backup-data-sets-with-dat.html
# Install a new version of R (let's say 3.5.0 in this example)
# Create a new directory for the version of R
fs::dir_create("~/Library/R/3.5/library")
# Re-start R so the .libPaths are updated
# Lookup what packages were in your old package library
pkgs <- fs::path_file(fs::dir_ls("~/Library/R/3.4/library"))
# Filter these packages as needed
# Install the packages in the new version
install.packages(pkgs)
pmap is your friend :)
wrapr's let() function; see the blog post wrapr: for sweet R code
If you already know the mapping in advance (like the above example) you should use the .data pronoun from rlang to make it explicit that you are referring to the drv in the layer data and not some other variable named drv (which may or may not exist elsewhere). To avoid a similar note from the CMD check about .data, use #' @importFrom rlang .data in any roxygen code block (typically this should be in the package documentation as generated by usethis::use_package_doc()).
- If you know the mapping or facet specification is col in advance, use aes(.data$col) or vars(.data$col).
- If col is a variable that contains the column name as a character vector, use aes(.data[[col]]) or vars(.data[[col]]).
- If you would like the behaviour of col to look and feel like it would within aes() and vars(), use aes({{ col }}) or vars({{ col }}).
d3heatmap for interactive heatmaps.
ggforce has geom_sina for the same purpose.
ComplexHeatmap.
Not as flexible as ComplexHeatmap, but can be handy when the function you want has been implemented.
geom_parallel_sets()
data.table, feather.
dtplyr and tidyfast are teaming up (well, at least in this blog post).
devtools::spell_check(), goodpractice::gp() and pkgdown::build_site().
There are many online web-based tools for visualization of (cancer) genomic data. I put my collection here. I use R for visualization; see a nice post using Python by Radhouane Aniba: Genomic Data Visualization in Python
paper: Outlier Preservation by Dimensionality Reduction Techniques
"MDS best choice for preserving outliers, PCA for variance, & T-SNE for clusters"
Rtsne R package for T-SNE
rtsne An R package for t-SNE (t-Distributed Stochastic Neighbor Embedding)
a bug in rtsne: https://gist.github.com/mikelove/74bbf5c41010ae1dc94281cface90d32
t-SNE-Heatmaps Beta version of 1D t-SNE heatmaps to visualize expression patterns of hundreds of genes simultaneously in scRNA-seq.
PHATE dimensionality reduction method paper: http://biorxiv.org/content/early/2017/03/24/120378
Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data. Run from R: https://gist.github.com/crazyhottommy/caa5a4a4b07ee7f08f7d0649780832ef
umapr UMAP dimensionality reduction in R
Understanding UMAP very nice one to read!
Survival analysis of TCGA patients integrating gene expression (RNASeq) data
Tutorial: Machine Learning For Cancer Classification. It has four parts.
Automation wins in the long run.
STEP 6 is usually missing!
The pic was downloaded from http://biobungalow.weebly.com/bio-bungalow-blog/everybody-knows-the-scientific-method
I am using snakemake and so far am very happy with it!
A Realistic Guide to Making Data Available Alongside Code to Improve Reproducibility
A must read: Parallel sequencing lives, or what makes large sequencing projects successful
A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker https://github.com/aaronpeikert/reproducible-research
Reproducibility starts at home A series of blog posts by Jon Zelner.
Practical Computational Reproducibility in the Life Sciences from Cell Systems.
Analysis validation has been neglected in the Age of Reproducibility
The Life & Times of a Reproducible Clinical Project https://jenthompson.me/slides/rmedicine2018/rmedicine2018#1
Automate testing of your R package using Travis CI, Codecov, and testthat by Jean Fan.
docker intro by Cyverse and singularity by upendra devisetty. I met him in UC Davis during 2018 ANGUS :)
rocker/binder Adds binder abilities on top of the rocker/tidyverse images.
Embedding containerized workflows inside data science notebooks enhances reproducibility
workflowr: organized + reproducible + shareable data science in R
Singularity Singularity enables users to have full control of their environment. Singularity containers can be used to package entire scientific workflows, software and libraries, and even data. This means that you don't have to ask your cluster admin to install anything for you; you can put it in a Singularity container and run it.
continuous analysis Reproducibility of computational workflows is automated using continuous analysis
The hard road to reproducibility, a commentary in Science Magazine.
Five selfish reasons to work reproducibly Genome Biology paper.
biomake GNU-Make-like utility for managing builds and complex workflows.
drake An R-focused pipeline toolkit for reproducibility and high-performance computing. Snakemake in R.
biostar post:Job Manager to parallelize otherwise consecutive bash scripts
Deepnote - Better UI for Jupyter and enables collaboration & working online without installing anything.
CoCalc Collaborative Calculation in the Cloud
nteract notebook (https://nteract.io/)
A video by Dr. Keith A. Baggerly from MD Anderson, The Importance of Reproducible Research in High-Throughput Biology. Very interesting, and Keith is really a fun guy!
paper: Ten Simple Rules for Reproducible Computational Research
Best Practice Data Life Cycle Approaches for the Life Sciences
Good Enough Practices in Scientific Computing We present a set of computing tools and techniques that every researcher can and should adopt. These recommendations synthesize inspiration from our own work, from the experiences of the thousands of people who have taken part in Software Carpentry and Data Carpentry workshops over the past six years, and from a variety of other guides. Unlike some other guides, our recommendations are aimed specifically at people who are new to research computing. Well worth reading!
A Quick Guide to Organizing Computational Biology Projects A must read for computational biologists!
Avoid setwd() in your R script; here_here (the here package) comes to the rescue.
Have you ever had problems reusing one of your own published figures due to the journal's copyright? Here is the solution, from @LorenaABarba:
As an early adopter of the Figshare repository, I came up with a strategy that serves both our open-science and our reproducibility goals, and also helps with this problem: for the main results in any new paper, we would share the data, plotting script and figure under a CC-BY license, by first uploading them to Figshare.