coxphSGD

Stochastic Gradient Descent log-likelihood estimation in Cox proportional hazards model


title: "coxphSGD"
output: github_document

library(knitr)
options(width = 200)
opts_chunk$set(
  comment = "",
  fig.width = 12,
  message = FALSE,
  warning = FALSE,
  tidy.opts = list(
    keep.blank.line = TRUE,
    width.cutoff = 150
    ),
  eval = TRUE
)

Overview

Applications

Know the survival::coxph() function that calculates the estimates of the Cox Proportional Hazards model? It uses the gradient descent order II method (also known as the Newton-Raphson method) to optimize the partial log-likelihood function of the Cox PH model.

The coxphSGD::coxphSGD() function is an equivalent that uses the stochastic gradient descent order I method in the optimization process, which can be beneficial in situations such as when

  • the data arrive in a streaming structure, in batches
  • the data volume is too great (computation and RAM constraints) to use all of it at once, so one needs to process the data block by block

The stochastic gradient descent order I method can also start from already calculated estimates and update them when new data appear, so one does not have to refit the whole model on all historical data.
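For instance, a hypothetical warm-start sketch (old_batches and new_batches are illustrative placeholders for lists of data.frame batches; the arguments are described in the Usage section below): fit once on the historical batches, then resume from those estimates when new batches arrive.

library(survival)
library(coxphSGD)
fit_old <- coxphSGD(Surv(time, status) ~ x.1 + x.2, data = old_batches,
                    beta.zero = c(0, 0))
# resume the optimization from the previously estimated coefficients
fit_new <- coxphSGD(Surv(time, status) ~ x.1 + x.2, data = new_batches,
                    beta.zero = tail(fit_old$coefficients, 1)[[1]])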

Installation

install.packages('coxphSGD') # once it is on CRAN
devtools::install_github('MarcinKosinski/coxphSGD') # development version

Usage

library(coxphSGD)
help(package = 'coxphSGD') # manual reference

Assumptions

The Cox Proportional Hazards model assumes (mainly) that the explanatory variables are constant over time and that the hazards (one of the survival measures) of the considered groups are proportional over time.

This model calculates estimates that can be interpreted as the proportional change of the hazard when moving from one level of a considered explanatory variable to another (with all other variables fixed) - explained in detail in the Mathematical formulas section.
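For example, for a binary explanatory variable $x_k$, moving from $x_k = 0$ to $x_k = 1$ (all other variables fixed) multiplies the hazard by $e^{\beta_k}$ (a direct consequence of the proportional hazards formula shown in the Mathematical formulas section):

$$\dfrac{\lambda(t \mid x_k = 1)}{\lambda(t \mid x_k = 0)} = e^{\beta_k}$$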

Simple example

Simulated data

Let's simulate artificial/fake data that follow the Cox Proportional Hazards model assumptions. Such data can be simulated with survival times drawn from the Weibull distribution (the times during which the patient/item is under observation until the event or censoring) and exponentially distributed censoring times.

The approach was taken from the StackOverflow question "How to create a toy survival (time to event) data with right censoring" and is based on "Generating survival times to simulate Cox proportional hazards models".
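For intuition, here is a minimal sketch of that simulation scheme (the packaged implementation is coxphSGD::dataCox(), used below; simulate_cox_weibull and its internals are illustrative). Survival times are drawn by inverting the Weibull survival function $S(t \mid x) = e^{-\lambda t^{\rho} e^{x'\beta}}$ at a uniform draw, and censoring times are exponential.

simulate_cox_weibull <- function(n, lambda, rho, x, beta, cens.rate) {
  u <- runif(n)
  # invert S(t | x) at a uniform u to draw the true event times
  true_time <- (-log(u) / (lambda * exp(drop(x %*% beta))))^(1 / rho)
  cens_time <- rexp(n, rate = cens.rate)       # exponentially distributed censoring
  data.frame(id     = 1:n,
             time   = pmin(true_time, cens_time),         # observed time
             status = as.numeric(true_time <= cens_time), # 1 = event, 0 = censored
             x)
}

In the example below we use the packaged dataCox() instead.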

library(survival)
set.seed(456)
x <- matrix(sample(0:1, size = 20000, replace = TRUE), ncol = 2)
head(x)
dCox <- dataCox(10^4, lambda = 3, rho = 2, x,
                beta = c(2,2), cens.rate = 5)
head(dCox)

Fit the model

One can fit the model with the coxphSGD() function. Below is an explanation of its parameters:

  • formula - the formula object, passed the same way as to the coxph function
  • data - a list of data.frames (all containing the same columns) corresponding to the batches of data
  • epsilon - the convergence parameter: optimization stops when the Euclidean distance between the estimates of subsequent steps is smaller than epsilon
  • learn.rates - a function used to determine the step length in subsequent steps of the optimization process
  • beta.zero - a vector containing the start point of the optimization process
  • max.iter - the maximal number of iterations of the algorithm; when the number of iterations exceeds the number of batches, the data are used again (one full pass over all batches is called an epoch)

batch_id <- sample(1:90, size = 10^4, replace = TRUE) # assign each observation to one of 90 batches
dCox_split <- split(dCox, batch_id)
results <-
  coxphSGD(formula     = Surv(time, status) ~ x.1+x.2,
           data        = dCox_split,
           epsilon     = 1e-5,
           learn.rates = function(x){1/(100*sqrt(x))},
           beta.zero   = c(0,0),
           max.iter    = 10*90)

Track estimates during each iteration on the contour lines plot

One can extract the estimates of the Cox PH model at each iteration with the following code.

coeff_by_iteration <-
  as.data.frame(
    do.call(
      rbind,
      results$coefficients
      )
    )
head(coeff_by_iteration)

Then, having the estimates from each iteration of the optimization process, one can plot the optimization path over the contour lines of the partial log-likelihood calculated on the full data, to see how the stochastic gradient descent order I method arrived at the final Cox PH model estimates.

Here the black triangle marks the starting point beta.zero (set as iteration 0), the red square marks the target point c(2, 2) (the true coefficients used in the simulation, which the SGD I path approaches), and the blue crossed dot marks the estimates calculated by GD II (the Newton-Raphson algorithm).

library(reshape2)
coxph_loglik <- function(beta, formula, data) {
  # partial log-likelihood at a fixed beta (iter.max = 0 disables optimization)
  coxph(formula, init = beta, control = list('iter.max' = 0), data = data)$loglik[2]
}
calculate_outer_cox_3 <- function(dCox){
  ## partial log-likelihood of the full data on a 25 x 25 grid of (beta1, beta2)
  beta_grid <- seq(0, 4, length = 25)
  outer_res <- outer(beta_grid, beta_grid,
                     Vectorize( function(beta1, beta2){
                       coxph_loglik(beta = c(beta1, beta2),
                                    Surv(time, status) ~ x.1 + x.2 - 1, dCox)
                     } )
  )
  outer_res_melted <- melt(outer_res)
  ## melt() returns grid indices in Var1/Var2 - map them back to the beta values
  outer_res_melted$Var1 <- beta_grid[outer_res_melted$Var1]
  outer_res_melted$Var2 <- beta_grid[outer_res_melted$Var2]
  return(outer_res_melted)
}
calculate_outer_cox_3(dCox) -> outerCox
save(outerCox, file = 'dev/outerCox.rda')
beta.zero <- c(0,0)
solution <- c(2,2)
library(ggplot2)
ggplot() +
  stat_contour(aes(x = outerCox$Var1,
                   y = outerCox$Var2,
                   z = outerCox$value),
               bins = 40, alpha = 0.25) +
  geom_path(aes(coeff_by_iteration[['x.1']],
                coeff_by_iteration[['x.2']]),
            size = 1) +
  theme_bw(base_size = 20) +
  theme(panel.border = element_blank(),
        legend.key = element_blank(),
        legend.position = "top") +
  scale_colour_brewer(palette="Dark2",
                      name = 'Algorithm \n & Steps') +
  geom_point(aes(x = beta.zero[1], y = beta.zero[2]),
             col = "black",
             size = 4, shape = 17) +
  geom_point(aes(x = solution[1], y = solution[2]),
             col = "red", size = 4, shape = 15) +
  geom_point(aes(x = summary(coxph(Surv(time, status) ~ x.1+x.2, data = dCox))$coeff[1,1],
                 y = summary(coxph(Surv(time, status) ~ x.1+x.2, data = dCox))$coeff[2,1]),
             col = "blue", size = 4, shape = 13) +
  xlab("X1") +
  ylab("X2") -> p
save(p, file = 'dev/p.rda')
# recreate the objects referenced by the saved plot, then load and print it
beta.zero <- c(0,0)
solution <- c(2,2)
library(ggplot2)
load('dev/p.rda')
load('dev/outerCox.rda')
p

Used code

The code to reproduce the graph can be found in this gist.

Genomic data analysis

The Cancer Genome Atlas and RTCGA

The Cancer Genome Atlas data included in the RTCGA family of R packages

The Cancer Genome Atlas (TCGA) is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing - http://cancergenome.nih.gov/.

I have converted selected datasets from this study into a few separate packages that are hosted on Bioconductor. These R packages make selected datasets easier to access and manage. Datasets in RTCGA data packages are large and cover complex relations between clinical outcomes and genetic background.

Fit the complex model

Data overview

As a real-life genomic application of the algorithm, we will use the clinical outcomes (survival times and death status) of patients suffering from breast cancer, together with information about the genes that were mutated in the cancer tissue collected from each patient.

The data were processed during my master's thesis; they are stored in an archivist repository and can be loaded with the following commands

library(archivist)
testCox <- archivist::aread('MarcinKosinski/MasterThesis/3eebc99bd231b16a3ea4dbeec9ab5edb')
trainCox <- archivist::aread('MarcinKosinski/MasterThesis/1a06bef4a60a237bb65ca3e2f3f23515')
class(trainCox)
length(trainCox)
head(
  trainCox[[1]][ # the train data is a list of data frames corresponding to batches
    c(7, 10),    # pick a few rows
    c(210,302,356,898,911,1092:1093) # pick a few columns, and the survival outcomes
    ]
  )

Actual fit

formulaSGD <- archivist::aread('MarcinKosinski/MasterThesis/064277e1c2a1fbea36d7d0ac518b9c8d')
# Surv(times, patient.vital_status) ~ all possible genes
# how many genes are there in the whole formula?
length(
  strsplit(
    as.character(formulaSGD)[3],
    split = "+",
    fixed = T
  )[[1]]
)

One can evaluate the code to fit the model, or download the already calculated model

coxphSGD(
  formula = formulaSGD,
  data = trainCox,
  learn.rates = function(x){1/x},
  max.iter = 490 # 98 batches * 5 passes, i.e. 5 full epochs
) -> model
model <- archivist::aread('MarcinKosinski/MasterThesis/446ac4dcb7d65bf39057bb341b296f1a')

Coefficients/insights

Below are the 20 coefficients of the model with the largest absolute values, together with the names of the gene mutations they correspond to.

# take the estimates from the last iteration and pick the 20 largest in absolute value
last_coef <- tail(model$coefficients, 1)[[1]]
top20 <- names(tail(sort(abs(last_coef)), 20))
real_top20 <- last_coef[names(last_coef) %in% top20]
knitr::kable(
  data.frame(
    beta =
      round(    sort(real_top20, decreasing = TRUE) , 3),
    exp_beta =
      round(exp(sort(real_top20, decreasing = TRUE)), 3)
    )
  )

Estimated survival curves

One can see from the previous table that a mutated BRAF gene has a negative impact on the hazard (the hazard is multiplied by 0.385), which means that people with that mutation tend to live longer. At the same time, the EGFR gene mutation has a positive impact on the hazard (the hazard is multiplied by 2.657), so people with that mutation have a higher death rate and shorter survival times.

Let's see how the results of the Cox Proportional Hazards model, with coefficients estimated by the stochastic gradient descent order I method, correspond to the trends in the data. One visual way of checking the output is to create Kaplan-Meier estimates of the survival curves for the groups that correspond to those gene mutations.

library(survminer)
fit_braf <- survfit(
  Surv(times, patient.vital_status) ~ BRAF,
  data = do.call(rbind, trainCox)
)
ggsurvplot(
  fit_braf,
  palette = c("#E7B800", "#2E9FDF"),
  surv.median.line = "hv",      # add the median survival pointer
  legend.labs =                 # change legend labels
    c("No mutation in BRAF", "Mutation in BRAF"),
  risk.table = TRUE,            # show risk table
  pval = TRUE,                  # show p-value of log-rank test
  conf.int = TRUE,              # show confidence intervals for
                                # point estimates of survival curves
  xlim = c(0, 2500),            # narrow the X axis without affecting
                                # the survival estimates
  xlab = "Time in days",        # customize X axis label
  break.time.by = 500,          # break X axis into intervals of 500
  ggtheme = theme_light(),      # customize plot and risk table with a theme
  risk.table.y.text.col = TRUE, # colour risk table text annotations
  risk.table.y.text = FALSE     # show bars instead of names in text annotations
                                # in the legend of the risk table
)

fit_egfr <- survfit(
  Surv(times, patient.vital_status) ~ EGFR,
  data = do.call(rbind, trainCox)
)
ggsurvplot(
  fit_egfr,
  palette = c("#E7B800", "#2E9FDF"),
  surv.median.line = "hv",      # add the median survival pointer
  legend.labs =                 # change legend labels
    c("No mutation in EGFR", "Mutation in EGFR"),
  risk.table = TRUE,            # show risk table
  pval = TRUE,                  # show p-value of log-rank test
  conf.int = TRUE,              # show confidence intervals for
                                # point estimates of survival curves
  xlim = c(0, 2500),            # narrow the X axis without affecting
                                # the survival estimates
  xlab = "Time in days",        # customize X axis label
  break.time.by = 500,          # break X axis into intervals of 500
  ggtheme = theme_light(),      # customize plot and risk table with a theme
  risk.table.y.text.col = TRUE, # colour risk table text annotations
  risk.table.y.text = FALSE     # show bars instead of names in text annotations
                                # in the legend of the risk table
)

The code used for the whole analysis can be found in this gist.

Mathematical formulas

Cox model specification

The Cox model assumes that the hazard function for the $i$th observation $X_i$ is of the form $$\lambda_i(t) = \lambda_0(t)e^{X_i(t)'\beta},$$ where $\lambda_0$ is an unspecified nonnegative baseline hazard function and $\beta$ is a vector of coefficients.

The Cox model, for variables constant over time, is called the proportional hazards model, due to the fact that the ratio of the hazards of two observations $X_i$ and $X_j$ is constant over time: $$\dfrac{\lambda_i(t)}{\lambda_j(t)} = \dfrac{\lambda_0(t)e^{X_i\beta}}{\lambda_0(t)e^{X_j\beta}} = \dfrac{e^{X_i\beta}}{e^{X_j\beta}} = e^{(X_i-X_j)\beta}.$$

Cox model assumptions

  • Coefficients of the model $\beta_k, k = 1,\cdots,p$ are constant over time.
  • Functional form of independent variable effect - the form of the model $\lambda_i(t) = \lambda_0(t)e^{X_i(t)'\beta}$.
  • Observations are independent.
  • The censoring of events is non-informative.
  • The censoring is independent (of the mechanism of events).

Cox model (partial) likelihood function

The contribution of the $i$th observed event to the partial likelihood, where $\mathscr{R}(t_i)$ denotes the risk set at time $t_i$ (the observations still under observation just before $t_i$), is

$$\dfrac{e^{X_i'\beta}}{\sum\limits_{l\in \mathscr{R}(t_i)}^{}e^{X_l'\beta}}$$

The full likelihood factorizes into the partial likelihood and a remainder term

$$L(\beta,\varphi) = {}_{p}L(\beta)\cdot L^{*}(\beta,\varphi)$$

The partial likelihood

$${}_{p}L(\beta) = \prod\limits_{i=1}^{K}\dfrac{e^{X_i'\beta}}{\sum\limits_{l=1}^{n}Y_l(t_i)e^{X_l'\beta}}$$

where $Y_l(t_i)$ indicates whether observation $l$ is still at risk at time $t_i$.
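For completeness - this is a standard result rather than something spelled out in the package documentation - the gradient (score) of the partial log-likelihood ${}_{p}\ell(\beta) = \log {}_{p}L(\beta)$, which both optimization methods below rely on, is

$$\nabla {}_{p}\ell(\beta) = \sum\limits_{i=1}^{K}\left(X_i - \dfrac{\sum\limits_{l=1}^{n}Y_l(t_i)X_le^{X_l'\beta}}{\sum\limits_{l=1}^{n}Y_l(t_i)e^{X_l'\beta}}\right)$$

In the stochastic variant only the summands that correspond to the sampled events (the current batch) are used at each step.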

The Newton-Raphson method of optimization

Newton-Raphson - the standard method in the coxph function - minimizes the function $Q(w)$ (here minus the partial log-likelihood). A minimal R sketch follows the steps below.

Start with an initial solution, e.g. $w_{0} = 0$. Then, for $k = 1, 2, \dots$, until convergence:

  • Calculate the gradient at the point $w_{k-1}$, $\nabla_{Q}(w_{k-1})$, and the inverse of the Hessian, $(D_{Q}^{2}(w_{k-1}))^{-1}$.
  • Take a step in the direction of the negative gradient, with the step length determined by the Hessian:
    $$w_{k} = w_{k-1} - (D_{Q}^{2}(w_{k-1}))^{-1}\nabla_{Q}(w_{k-1})$$
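A minimal R sketch of this scheme, assuming grad_Q() and hess_Q() compute the gradient and the Hessian of $Q$ at a point (both names are illustrative placeholders; this is not the coxph internal code):

newton_raphson <- function(grad_Q, hess_Q, w0, max.iter = 25, epsilon = 1e-8) {
  w <- w0
  for (k in seq_len(max.iter)) {
    # step = (D^2 Q(w))^{-1} grad Q(w), computed with solve() for stability
    step <- solve(hess_Q(w), grad_Q(w))
    w <- w - step
    if (sqrt(sum(step^2)) < epsilon) break # Euclidean convergence check
  }
  w
}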

The Stochastic Gradient Descent (order I) method of optimization

Stochastic Gradient Descent - the standard method in the coxphSGD function - minimizes the function $Q(w)$. A minimal R sketch follows the steps below.

Start with an initial solution, e.g. $w_{0} = 0$. Then, for $k = 1, 2, \dots$, until convergence:

  • Sample $i \in \{1,\dots,n\}$ (one point or more).
  • Calculate the gradient of $Q_{i}$ (for the corresponding observations) at the point $w_{k-1}$: $\nabla_{Q_{i}}(w_{k-1})$.
  • Take a step in the direction of the negative gradient:
    $$w_{k} = w_{k-1} - \alpha_{k}\nabla_{Q_{i}}(w_{k-1})$$

where the step length $\alpha_k$ is supplied by the user (cf. the learn.rates argument).
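And a minimal R sketch of the scheme above, mirroring the coxphSGD arguments (grad_Q_i(), which computes the gradient on a single batch, is an illustrative placeholder; here the batches are cycled in order rather than sampled):

sgd <- function(grad_Q_i, batches, w0, learn.rates, max.iter, epsilon = 1e-5) {
  w <- w0
  for (k in seq_len(max.iter)) {
    i <- (k - 1) %% length(batches) + 1     # cycle through the batches (one full pass = an epoch)
    delta <- learn.rates(k) * grad_Q_i(w, batches[[i]])
    w <- w - delta
    if (sqrt(sum(delta^2)) < epsilon) break # Euclidean stopping rule (cf. epsilon)
  }
  w
}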

Remarks for the analysis

  • The model can be used for streaming data or data stored in blocks.
  • We did not check the assumptions of the Cox PH model during the analysis.
  • We did not provide any residual diagnostics after the analysis.
  • There is no clear way to choose the learning-rate schedule.
  • No solution (yet) for stratified models.
  • No solution (yet) for models where patients (could) have more than one event (of the same or various types).

useR 2017