Getting Started
To get started, you will need to become familiar with the command line (aka the shell or bash) and either R or Python. I have mainly used R, so my recommendations are biased towards R. See [Sanbomics](https://www.youtube.com/@sanbomics/featured) for Python advice. Below I outline the typical steps when analyzing omics data and highlight which tools you can use for each. Again, these are my recommendations, and there are other ways and tools to accomplish these steps.
Basic sequencing/bioinformatics analytics pipeline
| Step | Language | Tool |
|---|---|---|
| Download data/FASTQs | shell | SRA Toolkit fastq-dump (if published) or download from the sequencing core. For data quality purposes, it is really important that you DO NOT change the filenames. |
| Align FASTQs to reference genome | shell | bowtie for DNase, bowtie2 for any other chromatin-based assays (e.g. ChIP-seq), STAR for RNA-seq |
| Call peaks (for chromatin-based assays) | shell / R | MACS2 for ChIP-seq, SEACR for CUT&RUN |
| Extract RNA-seq counts or chromatin-based assay reads under peaks | shell / R | HTSeq or salmon (command-line tools) for RNA-seq, Rsubread::featureCounts() (R) for chromatin-based assays |
| Perform QC & filtering | shell / R | |
| Differential analysis | R | DESeq2 for many types of genomic data, DiffBind for chromatin-based assays |
| Map peaks to nearest genes | R | ChIPseeker |
| Enrichment testing for differential genes or genes near differential peaks | R / web-based tools | Most enrichment tests are a version of Fisher's exact test: they ask whether genes annotated to a specific pathway or GO term make up a larger share of the significant genes than of the background genes. clusterProfiler for GO enrichment, KEGG pathway enrichment, Reactome pathway enrichment. |
| TF enrichment analysis | shell | HOMER or MEME |
| Gene correlation analysis | R | WGCNA |
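
To make the enrichment-testing row above more concrete, here is a minimal Python sketch of a one-sided Fisher's exact test, computed as a hypergeometric tail probability. The function name `enrichment_p_value` and the toy gene counts are made up for illustration; this is not how any particular package implements it. In a real analysis a tool such as clusterProfiler also handles the annotation lookup and multiple-testing correction.

```python
from math import comb


def enrichment_p_value(n_background, n_in_term, n_significant, n_overlap):
    """One-sided Fisher's exact test for over-representation.

    Returns P(X >= n_overlap), where X is the number of term-annotated genes
    found when drawing n_significant genes at random (without replacement)
    from n_background genes, of which n_in_term carry the annotation.
    """
    max_overlap = min(n_in_term, n_significant)
    # Number of ways to pick any n_significant genes from the background.
    total = comb(n_background, n_significant)
    # Number of ways to pick them so that at least n_overlap are annotated.
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_significant - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 20 background genes, 5 annotated to a pathway,
# 6 significant genes, 4 of which are annotated to the pathway.
p = enrichment_p_value(n_background=20, n_in_term=5, n_significant=6, n_overlap=4)
print(f"enrichment p-value: {p:.4g}")
```

A small p-value here means the pathway is over-represented among the significant genes relative to the background. The choice of background (all genes vs. only genes that were actually tested) changes the result, so it is worth stating explicitly.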
After all your analyses are done, you pore over your genes, make connections, and interpret the data.
Resources
Bioinformatics Tutorials/Courses
This Data Carpentry Genomics Curriculum course is taught to the incoming UPGG students every year. IMO, these modules are the most important:
Applied Computational Genomics Course at UU: Spring 2022. Taught by a bioinformatics legend. I find his raw genomic data processing content very insightful. This includes:
- Feb 10, 2022: FASTQ format and tools (slides)
- Feb 15, 2022: Sequence mapping and alignment (slides), youtube
- Feb 17, 2022: Sequence alignment and SAM/BAM format samtools, and IGV (slides), youtube
- Feb 22, 2022: Samtools and IGV (slides), youtube
- March 1, 2022: Uncertainty in RNA-seq data (slides), youtube
HarvardX Biomedical Data Science Open Online Training. This is a really good front to back course on bioinformatics. I would focus on:
- Statistics and R
- Statistical Inference and Modeling for High-throughput Experiments (Week 1-3)
- High-Dimensional Data Analysis
- Introduction to Bioconductor: Annotation and Analysis of Genomes and Genomic Assays (Week 1 and Week 4)
- Case Studies in Functional Genomics
Bioinformatics YouTube Channels
- Bioinformagician
- Chatomics
- StatQuest
  - Hypothesis Testing
  - p-values: what they are and how to interpret them
  - False Discovery Rate
  - Statistical Power
  - Design Matrices For Linear Models
  - Gentle Introduction to RNA-seq
  - RPKM, FPKM, and TPM
  - DESeq2 - Library Normalization
  - DESeq2 - Independent Filtering
  - UMAP Dimension Reduction
  - Principal Component Analysis
  - Hierarchical Clustering
  - High throughput sequencing playlist
- Prof. Jean Fan - She does live coding sessions, which I find very helpful for seeing realistic coding, though she mainly focuses on spatial transcriptomics.
- Duke’s computational biology reading group playlist on the Duke Center for Computational Thinking channel. The whole playlist is live coding, and they have an R package with code and data to walk through each tutorial.
Bioinformatics Books
- Bioinformatics Data Skills (Official Website: here, Online PDF: here) - good all around
- Modern Statistics for Modern Biology - great for stats concepts
- Computational Biology - good all around for concepts and code