What is Machine Learning

Table of contents

  1. Key learning outcomes
  2. ML/LLM: The Basics
  3. Basic Biostats
  4. Linux Primer
  5. ML/LLM Training - Image Analysis

Key learning outcomes

In this module we will focus on the basics: how to use Linux, relevant biostatistical theory (data, statistical approaches, etc.), and what machine learning is (and, importantly, is not) conceptually.

ML/LLM: The Basics

To be added.

Questions?

  • To be added

Conceptual Exercise: PCA

Principal component analysis (PCA) is a useful technique that allows researchers to visualize high-dimensional data in a lower-dimensional space by rotating the axes so that the leading dimensions (or components) capture as much of the total variance as possible. In statistical genetics this involves "rotating" million-dimensional data - something that is very hard to visualize! For this reason, we begin with a simpler exercise. For each of the following three two-dimensional shapes, spend some time identifying the principal components, or sketch the line along which variance is maximized. Check your answers below:
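The axis rotation described above can be made concrete in two dimensions, where the covariance matrix is 2x2 and the direction of the first principal component has a closed-form angle. A minimal sketch in Python (the point cloud below is made up for illustration, stretched along the y = x direction so PC1 should land near 45 degrees):

```python
import math

# Toy 2-D point cloud stretched along the y = x direction (illustrative data)
points = [(-3.0, -2.9), (-1.5, -1.4), (-0.5, -0.6),
          (0.4, 0.5), (1.6, 1.4), (3.0, 3.0)]

n = len(points)
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n

# Entries of the 2x2 sample covariance matrix
sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)

# For a 2x2 symmetric matrix, the eigenvector of the larger eigenvalue
# (i.e. the first principal component) lies at this angle
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)

# The larger eigenvalue is the variance along PC1
lam1 = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)

print(f"PC1 angle: {math.degrees(theta):.1f} degrees")
print(f"fraction of variance along PC1: {lam1 / (sxx + syy):.1%}")
```

For this elongated cloud almost all of the variance falls along a single line, which is exactly the situation PCA exploits: rotating the axes onto that line summarizes the data with one coordinate instead of two.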

Screenshot

❓QUESTIONS:

What line represents the principal component for the first shape?
What line represents the principal component for the second shape?
What line represents the principal component for the third shape?

Below you can view the shapes in principal component space.

Screenshot

Now that we understand how PCA works in two dimensions, we will consider a higher-dimensional example. In the three-dimensional space below, see if you can visualize the plane that maximizes the variance across two dimensions:

Screenshot

Did you get it right? If so, realize that this is equivalent to what we do in genetics - we rotate the data through millions of dimensions to find the plane that maximizes the variance in two dimensions:

Screenshot

To run PCA on real data, enter the exercise3 directory and type the following commands, which first LD-prune the variants and then run PCA on the 1000 Genomes data:

./code/plink --bfile data/chr1-22 --indep-pairwise 250 25 0.1 --maf 0.1 --threads 30 --out chr1-22.ldpruned_all_1kgv2
./code/plink --bfile data/chr1-22 --extract chr1-22.ldpruned_all_1kgv2.prune.in  --pca --threads 30
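Alongside plink.eigenvec, the --pca step also writes the eigenvalues to plink.eigenval, one per computed component. Dividing each eigenvalue by the sum of all computed eigenvalues gives the fraction of that retained variance each component explains, which is useful for deciding how many PCs are worth plotting. A minimal sketch, with hypothetical eigenvalues standing in for the real file:

```python
# Hypothetical eigenvalues in the style of a plink.eigenval file (one per PC);
# the real values come from the --pca run above
eigenvals = [58.2, 31.4, 5.1, 2.2, 1.1, 0.7, 0.5, 0.4, 0.2, 0.2]

total = sum(eigenvals)
for i, ev in enumerate(eigenvals, start=1):
    print(f"PC{i}: {ev / total:.1%} of the retained variance")
```

Note that these fractions are relative to the components plink computed (20 by default), not to the full variance of the million-dimensional genotype matrix.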

This will generate the principal components that maximize the variance in the data. To plot the result, run the following commands from within an R terminal:

R-Code: Generate a PCA Plot

require('RColorBrewer')
options(scipen = 100, digits = 3)

# Read the eigenvectors written by plink: columns 1-2 are FID/IID,
# columns 3 onwards are the principal components
eigenvec <- read.table('plink.eigenvec', header = FALSE, skip = 0, sep = ' ')
rownames(eigenvec) <- eigenvec[, 2]
eigenvec <- eigenvec[, 3:ncol(eigenvec)]
colnames(eigenvec) <- paste('Principal Component ', c(1:20), sep = '')

# Read the sample metadata and align its rows to the eigenvector matrix
PED <- read.table("data/all_phase3.king.psam", header = TRUE, skip = 0, sep = '\t')
PED <- PED[which(PED$IID %in% rownames(eigenvec)), ]
PED <- PED[match(rownames(eigenvec), PED$IID), ]

# Order populations by super-population: AFR, AMR, EAS, EUR, SAS
PED$Population <- factor(PED$Population, levels=c("ACB","ASW","ESN","GWD","LWK","MSL","YRI","CLM","MXL","PEL","PUR","CDX","CHB","CHS","JPT","KHV","CEU","FIN","GBR","IBS","TSI","BEB","GIH","ITU","PJL","STU"))

# One colour per population, grouped by super-population
# (AFR = yellow, AMR = forestgreen, EAS = grey, EUR = royalblue, SAS = black)
col <- colorRampPalette(c("yellow","yellow","yellow","yellow","yellow","yellow","yellow","forestgreen","forestgreen","forestgreen","forestgreen","grey","grey","grey","grey","grey",
"royalblue","royalblue","royalblue","royalblue","royalblue","black","black","black","black","black"))(length(unique(PED$Population)))[factor(PED$Population)]

# Two panels: PC1 vs PC2 (A) and PC1 vs PC3 (B)
project.pca <- eigenvec
par(mar = c(5,5,5,5), cex = 2.0, cex.main = 7, cex.axis = 2.75, cex.lab = 2.75, mfrow = c(1,2))

plot(project.pca[,1], project.pca[,2],
     type = 'n',
     main = 'A',
     adj = 0.5,
     xlab = 'First component',
     ylab = 'Second component',
     font = 2,
     font.lab = 2)
points(project.pca[,1], project.pca[,2], col = col, pch = 20, cex = 2.25)
legend('bottomright',
       bty = 'n',
       cex = 3.0,
       title = '',
       c('AFR', 'AMR', 'EAS',
         'EUR', 'SAS'),
       fill = c('yellow', 'forestgreen', 'grey', 'royalblue', 'black'))

plot(project.pca[,1], project.pca[,3],
     type="n",
     main="B",
     adj=0.5,
     xlab="First component",
     ylab="Third component",
     font=2,
     font.lab=2)
points(project.pca[,1], project.pca[,3], col=col, pch=20, cex=2.25)

❓QUESTIONS:

What is distinct about the PC projections of the AMR group relative to other populations?
Why does this occur? What does it tell us about ancestry of this group?

Back to table of contents

Basic Biostats

To be added

Questions?

Linux Primer

To be added

Questions?

ML/LLM Training - Image Analysis

To be added

Questions?