The NB-COMO project: R code

In this paper, we estimated the association between 90 pairs of mental disorders. In this document, the R code for one specific pair is provided: the association between Mood Disorders (as a prior-disorder) and Neurotic Disorders (as a later-disorder). Follow the steps 1-7 below in order to replicate the analyses.

1. Libraries and data

Load necessary libraries

library(lubridate)
library(survival)
library(knitr)
library(multcomp)
library(gdata)
library(data.table)
library(mstate)
library(plyr)

Load necessary data

load("../basics.RData")

There are two necessary datasets for the estimations: data_disease and land.

dataset data_disease

data_disease <- stam_diag[, c("pnr", "fdato", "fdatoD", "Byear", "kqn", "stat", "statd", "statdD", "statdA", "D00", "D10", "D20", "D30", "D41", "D51", "D61", "D70", "D81", "D91")]
nrow(data_disease)

## [1] 7387479

The Danish Civil Registration System was established on April 1, 1968, where all persons alive and living in Denmark were registered. After this date, all persons born or immigrating to Denmark were included. Dataset data_disease contains information on the subsample of this group who were born in Denmark between January 1, 1900 and December 31, 2015 (N=7387479). Information includes a unique personal identification number used to link information from several registers (pnr), birthdate (fdato), sex (kqn: “M” for males and “K” for females), legal status (stat: 1 for active citizens, 70 for disappeared, 80 for emigrated and 90 for dead) and date of disappearance, emigration or death (statd). In addition, information on the 10 mental disorders used in this publication were included from the Danish Psychiatric Central Research Register: Organic Disorders (D00), Substance Use disorders (D10), Schizophrenia and related disorders (D20), Mood Disorders (D30), Neurotic Disorders (D41), Eating Disorders (D51), Personality Disorders (D61), Intelectual disabilities (D70), Developmental Disorders (D81) and Behavioral Disorders (D91).

According to Danish regulations, it is not allowed to share personal data, so an invented case is used here to show the data structure. This invented person will be used throughout the whole code example. She was born on May 31, 1985 and emigrated from Denmark on March 13, 2015. She received a diagnosis of Substance Use disorders for the first time on October 15, 2005; Mood Disorders on December 1, 2010; Neurotic Disorders on November 10, 2014; and Eating Disorders on April 13, 2011.

data_disease: This table includes invented data (not real person)
pnr	fdato	fdatoD	Byear	kqn	stat	statd	statdD	statdA	D00	D10	D20	D30	D41	D51	D61	D70	D81	D91
invented_person	1985-05-31	1985.411	1985	K	80	2015-03-13	2015.195	29.784	NA	2005-10-15	NA	2010-12-01	2014-11-10	2011-04-13	NA	NA	NA	NA

dataset land

Dataset land contains information on all persons in data_disease regarding periods of living in Denmark. Each period of Danish residence is defined with a starting date (tflytd) and ending date (fflytd)

Invented data for the same invented person above is used to show the data structure. This person was living in Denmark since birth (May 31, 1985) until emigration (March 13, 2015) except for a period that she moved abroad (from August 1, 1997 to January 5, 1998).

land_disease: This table includes invented data (not real person)
pnr	tflytd	fflytd
invented_person	1985-05-31	1997-08-01
invented_person	1998-01-05	2015-03-13

2. Population, follow-up time, prior- and later-disorders.

With these two datasets, we can set the study population, the specific follow-up time for each person, and information on the later-disorder (Neurotic Disorders) and all prior-disorders (all disorders except Neurotic Disorders).

Set outcome of interest (later-disorder)

The outcome of interest for this code example is Neurotic Disorders (variable D41). The possible minimum age of onset for this disorder is 5 years.

disease <- "D41"
minage <- 5

Left truncation

For every person, follow-up starts on age at January 1, 2000 or at 5 years, whichever occurs later.

data_disease$entryA <- pmax(2000-data_disease[, "fdatoD"], minage)
data_disease$entryD <- as.numeric(data_disease$fdatoD) + as.numeric(data_disease$entryA)
data_disease$entry <- as.Date(date_decimal(data_disease$entryD))

Right censoring

For every person, follow-up ends on December 31, 2016, death, emigration or disappearance, whichever occurs first. Follow-up for persons developing Neurotic Disorders will end on the date of onset of the disorder, but this information will be included later.

data_disease$exit <- pmin(as.Date("2016-12-31"), data_disease[, "statd"], na.rm = TRUE)
data_disease$exitD <- decimal_date(data_disease$exit)
data_disease$exitA <- data_disease$exitD - data_disease$fdatoD

Eliminate those with “negative” follow-up time

Some persons died or emigrated before the follow-up started on January 1, 2000 (or age 5 years). These persons need to be excluded from the sample. An easy way to identify them is because their follow-up time finishes before it starts, so we keep in the sample only those whose follow-up starts before it finishes.

data_disease <- data_disease[data_disease$entryA < data_disease$exitA, ]

Include only those living in Denmark at beginning of follow-up

Some persons were living outside Denmark on the date of start of follow-up (variable entry). We need to keep only those residing in the country on that date. We use information from the dataset land to obtain this information.

data_disease_DK <- merge(data_disease, land, all.x = TRUE, all.y = FALSE)
data_disease_DK <- data_disease_DK[(!is.na(data_disease_DK$tflytd)) &
                                     (data_disease_DK$tflytd <= data_disease_DK$entry) &
                                     (data_disease_DK$fflytd > data_disease_DK$entry) 
                                   , ]
nrow(data_disease_DK)

## [1] 5699944

The sample now consists on 5699944 persons who were living in Denmark on the date of start of follow-up (January 1, 2000 or the date they turned 5 years, if it happened afterwards).

This table includes invented data (not real person)
pnr	fdato	fdatoD	Byear	kqn	stat	statd	statdD	statdA	D00	D10	D20	D30	D41	D51	D61	D70	D81	D91	entryA	entryD	entry	exit	exitD	exitA	tflytd	fflytd
invented_person	1985-05-31	1985.411	1985	K	80	2015-03-13	2015.195	29.784	NA	2005-10-15	NA	2010-12-01	2014-11-10	2011-04-13	NA	NA	NA	NA	14.589	2000	2000-01-01	2015-03-13	2015.195	29.78352	1998-01-05	2015-03-13

Information on onset of later-disorder (Neurotic Disorders)

Information on the later-disorder is included at this stage. Variable onsetD contains the date of onset of the specific disorder, and variable onsetA contains the age at onset.

data_disease_DK[!is.na(data_disease_DK[, disease]), "onsetD"] <- decimal_date((data_disease_DK[!is.na(data_disease_DK[, disease]), disease] + 1))
data_disease_DK$onsetA <- data_disease_DK$onsetD - data_disease_DK$fdatoD

Follow-up now needs to finish on date of onset, for those who develop this disorder.

data_disease_DK$exit <- pmin(data_disease_DK$exit, (data_disease_DK[, disease] + 1), na.rm = TRUE)
data_disease_DK$exitD <- decimal_date(data_disease_DK$exit)
data_disease_DK$exitA <- data_disease_DK$exitD - data_disease_DK$fdatoD

Note that for someone developing Neurotic Disorders before the start of follow-up, now exit will occur before entry. These are the prevalent cases and will be identified later.

A variable fail can be created indicating the type of outcome at end of follow-up (0 = censored; 1 = onset of neurotic disorders;70 = disappearance; 80 = emigration; 90 = death).

data_disease_DK$fail <- (90 * (!is.na(data_disease_DK$statdA) & (data_disease_DK$statdA == data_disease_DK$exitA) & (data_disease_DK$stat == 90))) + 
  (80 * (!is.na(data_disease_DK$statdA) & (data_disease_DK$statdA == data_disease_DK$exitA) & (data_disease_DK$stat == 80))) +
  (70 * (!is.na(data_disease_DK$statdA) & (data_disease_DK$statdA == data_disease_DK$exitA) & (data_disease_DK$stat == 70))) +
  (01 * (!is.na(data_disease_DK$onsetA) & (data_disease_DK$onsetA == data_disease_DK$exitA))) 
table(data_disease_DK$fail)

## 
##       0       1      70      80      90      91 
## 4497071  294575    1069   69417  837798      14

Note that some persons can have two outcomes on the same day (e.g. 91 = death and onset of Neurotic Disorders).

This table includes invented data (not real person)
pnr	fdato	fdatoD	Byear	kqn	stat	statd	statdD	statdA	D00	D10	D20	D30	D41	D51	D61	D70	D81	D91	entryA	entryD	entry	exit	exitD	exitA	tflytd	fflytd	onsetD	onsetA	fail
invented_person	1985-05-31	1985.411	1985	K	80	2015-03-13	2015.195	29.784	NA	2005-10-15	NA	2010-12-01	2014-11-10	2011-04-13	NA	NA	NA	NA	14.589	2000	2000-01-01	2014-11-11	2014.86	29.44927	1998-01-05	2015-03-13	2014.86	29.44927	1

Information on onset of prior-disorders

The later-disorder of interest is Neurotic Disorders (variable D41). The other nine disorders (variables D00, D10, D20, D30, D51, D61, D70, D81, D91) will be treated as prior-disorders. For example, variable D30D contains the date of onset of Mood Disorders, and variable D30A contains the age of onset.

comorbidities <- c("D00", "D10", "D20", "D30", "D51", "D61", "D70", "D81", "D91")
comorbiditiesD <- paste(comorbidities,"D",sep="")
comorbiditiesA <- paste(comorbidities,"A",sep="")

for(i in 1:length(comorbidities)) {
  data_disease_DK[!is.na(data_disease_DK[, comorbidities[i]]), comorbiditiesD[i]] <- decimal_date((data_disease_DK[!is.na(data_disease_DK[, comorbidities[i]]), comorbidities[i]] + 1))
  data_disease_DK[, comorbiditiesA[i]] <- as.numeric(data_disease_DK[, comorbiditiesD[i]]) - as.numeric(data_disease_DK[, "fdatoD"])
}

Finally, it is important to identify the persons who developed Neurotic Disorders before the start of follow-up (prevalent cases) and consider only those at risk of developing Neurotic Disorders.

prevalent <- data_disease_DK[data_disease_DK$entryA >= data_disease_DK$exitA, 
                             c("pnr", "fdatoD", "entry", "entryD", "entryA", "exitD", "exitA", "fail", "statdA", "kqn", "Byear", comorbiditiesA)]
at_risk <- data_disease_DK[data_disease_DK$entryA < data_disease_DK$exitA, 
                           c("pnr", "fdatoD", "entry", "entryD", "entryA", "exitD", "exitA", "fail", "statdA", "kqn", "Byear", comorbiditiesA)]

nrow(prevalent); nrow(at_risk)

## [1] 89348

## [1] 5610596

From the total of 5699944 persons who were living in Denmark on the date of start of follow-up, 89348 had a previous diagnosis of Neurotic Disorders, and were excluded from the analysis as prevalent cases. The remaining 5610596 did not have a previous diagnosis of Neurotic Disorders and were at risk of becoming an incident case during the follow-up, and thus included in the analysis (this information is included in Table 1 in the publication).

This table includes invented data (not real person)
pnr	fdatoD	entry	entryD	entryA	exitD	exitA	fail	statdA	kqn	Byear	D00A	D10A	D20A	D30A	D51A	D61A	D70A	D81A	D91A
invented_person	1985.411	2000-01-01	2000	14.589	2014.86	29.44927	1	29.784	K	1985	NA	20.37804	NA	25.50681	25.87119	NA	NA	NA	NA

3. Break ties

For some persons, the prior-disorder and later-disorder occurred on the same day (ties). In a time-to-event setting, individuals are followed until the outcome (later-disorder), and these persons would not be considered to be exposed to the prior-disorder. This would underestimate all associations, given that concurrant pairs of disorders could never occur. To fix it, we broke some of these ties, as explained below.

exposures <- comorbidities
exposures

## [1] "D00" "D10" "D20" "D30" "D51" "D61" "D70" "D81" "D91"

This is the list of variables containing the nine possible prior-disorders.

temp <- at_risk
temp$outcome <- (temp$fail %in% c(1,71,81,91))*1
table(temp$outcome)

## 
##       0       1 
## 5405355  205241

In total, there are 205241 who develop the later-disorder during the follow-up (this information is included in Table 1 of the publication). These are the persons that we need to observe to identify ties (a prior-disorder with onset on the same day). We will do it for each of the nine prior-disorders:

for(i in 1:length(exposures)) {
  # First we identify those with both the prior- and later-disorder
  both_dx <- temp[!is.na(temp[, paste(exposures[i],"A",sep="")]) & temp$outcome==1, c("pnr", paste(exposures[i],"A",sep=""), "exitA")]  
  both_dx$diff <- both_dx[, "exitA"] - both_dx[, paste(exposures[i],"A",sep="")]
  
  # Only if the two disorders happen within 5 years
  both_dx <- both_dx[abs(both_dx$diff)<5, ]
  
  # How many the prior-disorder occurs before or after the later-disorder
  before <- dim(both_dx[both_dx$diff>0, ])[1]
  after <- dim(both_dx[both_dx$diff<0, ])[1]
  
  # prop_before is the proportion of persons who develop first the prior-disorder among those who 
  # develop the two disorders within 5 years (but not on the same day)
  prop_before <- before/(before+after)
  if(before==0) {
    prop_before <- 0
  }
  
  # This proportion is used as a probability to break ties. Each tie will be broken with probability
  # prop_before by moving the prior-disorder to one day earlier, following a random uniform distribution.
  temp$random <- runif(n=dim(temp)[1])
  temp[temp$random <= prop_before & !is.na(temp[, paste(exposures[i],"A",sep="")]) & temp$outcome==1 & temp[, paste(exposures[i],"A",sep="")]==temp$exitA, paste(exposures[i],"A",sep="")] <-
    temp[temp$random <= prop_before & !is.na(temp[, paste(exposures[i],"A",sep="")]) & temp$outcome==1 & temp[, paste(exposures[i],"A",sep="")]==temp$exitA, paste(exposures[i],"A",sep="")] - (1/365.25)          
  
  temp <- temp[, names(temp) != "random"]
}

Data for the invented person do not change compared to previous step, because there are no ties.

This table includes invented data (not real person)
pnr	fdatoD	entry	entryD	entryA	exitD	exitA	fail	statdA	kqn	Byear	D00A	D10A	D20A	D30A	D51A	D61A	D70A	D81A	D91A
invented_person	1985.411	2000-01-01	2000	14.589	2014.86	29.44927	1	29.784	K	1985	NA	20.37804	NA	25.50681	25.87119	NA	NA	NA	NA

4. Time-varying variables

Prepare data for time-to-event analysis

We create new variables id, tstart, tstop and endpoint (equivalent to pnr, entryA, exitA, outcome), but that will change later.

temp <- tmerge(temp, temp, id = pnr, tstart = entryA, tstop = exitA)
temp <- tmerge(temp, temp, id = pnr, endpoint = event(exitA, outcome))

This table includes invented data (not real person)
pnr	fdatoD	entry	entryD	entryA	exitD	exitA	fail	statdA	kqn	Byear	D00A	D10A	D20A	D30A	D51A	D61A	D70A	D81A	D91A	outcome	id	tstart	tstop	endpoint
invented_person	1985.411	2000-01-01	2000	14.589	2014.86	29.44927	1	29.784	K	1985	NA	20.37804	NA	25.50681	25.87119	NA	NA	NA	NA	1	invented_person	14.589	29.44927	1

Split follow-up time

Each person with a diagnosis of Mood Disorders (D30) needs to be “unexposed” until the onset of the disorder, and “exposed” afterwards. In addition, we want to split the follow-up time further depending on time since exposure (6 months, 1, 2, 5, 10, 15, >15 years). For each person, we need to specify in which dates the follow-up time is going to be splitted.

Identify at what ages the follow-up time should be splitted

# First on date of onset of Mood Disorders
db_exposures0 <- temp[!is.na(temp[, "D30A"]), c("pnr", "exitA", "D30A")]
db_exposures0 <- db_exposures0[db_exposures0[, "exitA"] >= db_exposures0[, "D30A"],]

# And then 6 months, 1 year, 2 years, etc. later.
db_exposures6m <- db_exposures0
db_exposures6m[, "D30A"] <- db_exposures0[, "D30A"] + (1/2)
db_exposures6m <- db_exposures6m[db_exposures6m[, "exitA"] >= db_exposures6m[, "D30A"],]
db_exposures1 <- db_exposures0
db_exposures1[, "D30A"] <- db_exposures0[, "D30A"] + 1
db_exposures1 <- db_exposures1[db_exposures1[, "exitA"] >= db_exposures1[, "D30A"],]
db_exposures2 <- db_exposures0
db_exposures2[, "D30A"] <- db_exposures0[, "D30A"] + 2
db_exposures2 <- db_exposures2[db_exposures2[, "exitA"] >= db_exposures2[, "D30A"],]
db_exposures5 <- db_exposures0
db_exposures5[, "D30A"] <- db_exposures0[, "D30A"] + 5
db_exposures5 <- db_exposures5[db_exposures5[, "exitA"] >= db_exposures5[, "D30A"],]
db_exposures10 <- db_exposures0
db_exposures10[, "D30A"] <- db_exposures0[, "D30A"] + 10
db_exposures10 <- db_exposures10[db_exposures10[, "exitA"] >= db_exposures10[, "D30A"],]
db_exposures15 <- db_exposures0
db_exposures15[, "D30A"] <- db_exposures0[, "D30A"] + 15
db_exposures15 <- db_exposures15[db_exposures15[, "exitA"] >= db_exposures15[, "D30A"],]

# Finally we combine all these times
db_exposures <- rbind(db_exposures0, db_exposures6m, db_exposures1, db_exposures2, db_exposures5, db_exposures10, db_exposures15)
db_exposures <- db_exposures[!duplicated(db_exposures), ]
names(db_exposures)[names(db_exposures)=="D30A"]<-"cut_points"
db_exposures <- db_exposures[, c("pnr", "cut_points")]

This table includes invented data (not real person)
pnr	cut_points
invented_person	25.50681
invented_person	26.00681
invented_person	26.50681
invented_person	27.50681

The invented person developed Mood Disorders at age 25.5, that is when the first splitting takes place. Then it will be splitted again 6 months after and 1 and 2 years after. No more times will be needed, given that this specific person is followed until age 29.8.

Split the follow-up at those ages

We create two new variables: expo_D30 that is a binary variable saying whether the person has had a diagnosis of Mood Disorders, and expo_time_after_D30 that counts also the time after the onset.

temp <- tmerge(temp, db_exposures, id = pnr, expo_time_after_D30=cumtdc(cut_points))
temp$expo_D30 <- (temp$expo_time_after_D30 >= 1)

This table includes invented data (not real person)
pnr	fdatoD	entry	entryD	entryA	exitD	exitA	fail	statdA	kqn	Byear	D00A	D10A	D20A	D30A	D51A	D61A	D70A	D81A	D91A	outcome	id	tstart	tstop	endpoint	expo_time_after_D30	expo_D30
invented_person	1985.411	2000-01-01	2000	14.589	2014.86	29.44927	1	29.784	K	1985	NA	20.37804	NA	25.50681	25.87119	NA	NA	NA	NA	1	invented_person	14.58900	25.50681	0	0	FALSE
invented_person	1985.411	2000-01-01	2000	14.589	2014.86	29.44927	1	29.784	K	1985	NA	20.37804	NA	25.50681	25.87119	NA	NA	NA	NA	1	invented_person	25.50681	26.00681	0	1	TRUE
invented_person	1985.411	2000-01-01	2000	14.589	2014.86	29.44927	1	29.784	K	1985	NA	20.37804	NA	25.50681	25.87119	NA	NA	NA	NA	1	invented_person	26.00681	26.50681	0	2	TRUE
invented_person	1985.411	2000-01-01	2000	14.589	2014.86	29.44927	1	29.784	K	1985	NA	20.37804	NA	25.50681	25.87119	NA	NA	NA	NA	1	invented_person	26.50681	27.50681	0	3	TRUE
invented_person	1985.411	2000-01-01	2000	14.589	2014.86	29.44927	1	29.784	K	1985	NA	20.37804	NA	25.50681	25.87119	NA	NA	NA	NA	1	invented_person	27.50681	29.44927	1	4	TRUE

Note that the invented person now has 5 different observations, where tstop of each observation finishes on tstart of the following one, and the two variables regarding exposure to Mood Disorders change in each observation.

Split follow-up time for comorbidities

There is one prior-disorder (Mood Disorders) and one later-disorder (Neurotic Disorders). The remaining eight disorders are used to adjust the models for mental health comorbidity. It is therefore necessary to replicate the previous step with the other 8 disorders.

adjustment <- exposures[exposures!="D30"]
adjustment

## [1] "D00" "D10" "D20" "D51" "D61" "D70" "D81" "D91"

for(i in 1:length(adjustment)) {
  
  db_exposures0 <- temp[!is.na(temp[, paste(adjustment[i],"A",sep="")]), c("pnr", "exitA", "D30A", paste(adjustment[i],"A",sep=""))]
  
  ## We do not want to adjust for mediators (only for confounders), so we ignore mental disorders with onset after the prior-disorder
  db_exposures0 <- db_exposures0[is.na(db_exposures0[, "D30A"]) |
                                   (!is.na(db_exposures0[, "D30A"]) & db_exposures0[, paste(adjustment[i],"A",sep="")] < db_exposures0[, "D30A"]),]
  db_exposures0 <- db_exposures0[db_exposures0[, "exitA"] >= db_exposures0[, paste(adjustment[i],"A",sep="")],]
  db_exposures0 <- db_exposures0[!duplicated(db_exposures0), ]
  
  names(db_exposures0)[names(db_exposures0)==paste(adjustment[i],"A",sep="")]<-"cut_points"
  db_exposures0 <- db_exposures0[, c("pnr", "cut_points")]
  temp <- tmerge(temp, db_exposures0, id = pnr, exposure=tdc(cut_points))

  colnames(temp)[colnames(temp) == "exposure"] <- paste("adjust_D30_by_prev_", adjustment[i], sep="")
}

This table includes invented data (not real person)
pnr	fdatoD	entry	entryD	entryA	exitD	exitA	fail	statdA	kqn	Byear	D00A	D10A	D20A	D30A	D51A	D61A	D70A	D81A	D91A	outcome	id	tstart	tstop	endpoint	expo_time_after_D30	expo_D30	adjust_D30_by_prev_D10
invented_person	1985.411	2000-01-01	2000	14.589	2014.86	29.44927	1	29.784	K	1985	NA	20.37804	NA	25.50681	25.87119	NA	NA	NA	NA	1	invented_person	14.58900	20.37804	0	0	FALSE	0
invented_person	1985.411	2000-01-01	2000	14.589	2014.86	29.44927	1	29.784	K	1985	NA	20.37804	NA	25.50681	25.87119	NA	NA	NA	NA	1	invented_person	20.37804	25.50681	0	0	FALSE	1
invented_person	1985.411	2000-01-01	2000	14.589	2014.86	29.44927	1	29.784	K	1985	NA	20.37804	NA	25.50681	25.87119	NA	NA	NA	NA	1	invented_person	25.50681	26.00681	0	1	TRUE	1
invented_person	1985.411	2000-01-01	2000	14.589	2014.86	29.44927	1	29.784	K	1985	NA	20.37804	NA	25.50681	25.87119	NA	NA	NA	NA	1	invented_person	26.00681	26.50681	0	2	TRUE	1
invented_person	1985.411	2000-01-01	2000	14.589	2014.86	29.44927	1	29.784	K	1985	NA	20.37804	NA	25.50681	25.87119	NA	NA	NA	NA	1	invented_person	26.50681	27.50681	0	3	TRUE	1
invented_person	1985.411	2000-01-01	2000	14.589	2014.86	29.44927	1	29.784	K	1985	NA	20.37804	NA	25.50681	25.87119	NA	NA	NA	NA	1	invented_person	27.50681	29.44927	1	4	TRUE	1

Note that Substance Use disorders (D10) occurred before the prior-disorder, and it is considered as a time-varying variable. However, onset of Eating Disorders (D51) occurred after the onset of the prior-disorder, and it is not considered when adjusting for mental health comorbidity.

Save the data

data_survival <- temp
save(prevalent, at_risk, data_survival, data_invented, file="SHARED_disease_D41.RData")

5. Descriptive analysis

Load the data

Load the datasets prepared in steps 1-4.

rm(list = ls())
load(file="SHARED_disease_D41.RData")

Descriptive statistics of Neurotic Disorders as later-disorder (Table 1 of the publication)

Prevalent cases before start of follow-up - overall and by sex

nrow(prevalent); table(prevalent$kqn)

## [1] 89348

## 
##     K     M 
## 57910 31438

Persons at risk of developing Neurotic Disorders during follow-up - overall and by sex

nrow(at_risk); table(at_risk$kqn)

## [1] 5610596

## 
##       K       M 
## 2807442 2803154

New cases of Neurotic Disorders during follow-up - overall and by sex

nrow(data_survival[data_survival$endpoint==1, ]); table(data_survival[data_survival$endpoint==1, "kqn"])

## [1] 205241

## 
##      K      M 
## 120756  84485

Association between Mood Disorders (prior-disorder) and Neurotic Disorders (later-disorder)

1. Count persons, person-years and number of cases for exposed and unexposed to Mood Disorders

# Some persons have more than one observation (due to splitting time), and they need to be counted only once
number <- data_survival[, c("pnr", "kqn", "expo_D30")]
number <- number[!duplicated(number), ]
number$number <- 1
number_sex <- aggregate(number ~ expo_D30 + kqn, data = number, FUN=sum, na.rm=TRUE)

data_survival$time <- data_survival$tstop - data_survival$tstart
py_sex <- aggregate(cbind(time, endpoint) ~ expo_D30 + kqn, data = data_survival, FUN=sum, na.rm=TRUE)

number <- merge(number_sex, py_sex)
number2 <- rbind(number, 
                 cbind(aggregate(cbind(number, time, endpoint) ~ expo_D30, data = number, FUN=sum, na.rm=TRUE), kqn="all")
) 

# According to Danish regulations, we cannot report cells with less than 3 cases
number2[number2$number<3, "number"] <- NA
number2[number2$endpoint<3, "endpoint"] <- NA

kable(number2)

expo_D30	kqn	number	time	endpoint
FALSE	K	2773129	37728220.1	98555
FALSE	M	2784330	38039730.0	71364
TRUE	K	102155	757457.0	22201
TRUE	M	62826	445976.4	13121
FALSE	all	5557459	75767950.1	169919
TRUE	all	164981	1203433.4	35322

Note that a person could be counted twice (one among exposed and one among unexposed) if exposure to Mood Disorders occurs during the follow-up time.

2. Count persons, person-years and number of cases for exposed and unexposed to Mood Disorders (considering time since exposure)

number <- data_survival[, c("pnr", "kqn", "expo_time_after_D30")]
number <- number[!duplicated(number), ]
number$number <- 1
number_sex <- aggregate(number ~ expo_time_after_D30 + kqn, data = number, FUN=sum, na.rm=TRUE)

data_survival$time <- data_survival$tstop - data_survival$tstart
py_sex <- aggregate(cbind(time, endpoint) ~ expo_time_after_D30 + kqn, data = data_survival, FUN=sum, na.rm=TRUE)

number <- merge(number_sex, py_sex)
number2 <- rbind(number, 
                 cbind(aggregate(cbind(number, time, endpoint) ~ expo_time_after_D30, data = number, FUN=sum, na.rm=TRUE), kqn="all")
) 
# According to Danish regulations, we cannot report cells with less than 3 cases
number2[number2$number<3, "number"] <- NA
number2[number2$endpoint<3, "endpoint"] <- NA

kable(number2)

expo_time_after_D30	kqn	number	time	endpoint
0	K	2773129	37728220.07	98555
0	M	2784330	38039730.02	71364
1	K	69228	29427.65	9559
1	M	44894	18950.67	6164
2	K	58431	27836.62	1500
2	M	37537	17880.54	802
3	K	57103	52463.44	1987
3	M	36528	33614.52	1092
4	K	56917	136880.70	3256
4	M	35956	86405.72	1932
5	K	45298	164371.80	2758
5	M	27864	100406.19	1530
6	K	28634	98242.20	1267
6	M	17236	59286.81	679
7	K	29946	248234.57	1874
7	M	16475	129431.98	922
0	all	5557459	75767950.10	169919
1	all	114122	48378.32	15723
2	all	95968	45717.16	2302
3	all	93631	86077.97	3079
4	all	92873	223286.42	5188
5	all	73162	264777.99	4288
6	all	45870	157529.01	1946
7	all	46421	377666.55	2796

6. Hazard ratios

Load the data

Load the datasets prepared in steps 1-4.

rm(list = ls())
load(file="SHARED_disease_D41.RData")

1. Hazard ratios for persons (males and females combined)

Model A (adjusted for sex and calendar time)

model_A <- coxph(Surv(tstart, tstop, endpoint) ~ expo_D30 + kqn + entryD, data = data_survival)
print(paste0("HR = ", format(exp(summary(model_A)$coefficients[1, 1]), digits=1, nsmall=1),
      " [95% CI: ", format(summary(model_A)$conf.int[1, 3], digits=1, nsmall=1), "-", 
      format(summary(model_A)$conf.int[1, 4], digits=1, nsmall=1), "]"))

## [1] "HR = 15.7 [95% CI: 15.5-15.9]"

Model B (further adjusted for mental health comorbidity)

# Model B adjusts also for number of disorders, and it needs to be counted.
data_survival$num_others <- rowSums(data_survival[, c("adjust_D30_by_prev_D00", "adjust_D30_by_prev_D10", "adjust_D30_by_prev_D20", "adjust_D30_by_prev_D51", "adjust_D30_by_prev_D61", "adjust_D30_by_prev_D70","adjust_D30_by_prev_D81", "adjust_D30_by_prev_D91")])
data_survival$num_others[data_survival$num_others>4] <- 4
data_survival$num_others[data_survival$num_others==1] <- 0

model_B <- coxph(Surv(tstart, tstop, endpoint) ~ expo_D30 + kqn + entryD
                 + adjust_D30_by_prev_D00 + adjust_D30_by_prev_D10 + adjust_D30_by_prev_D20
                 + adjust_D30_by_prev_D51 + adjust_D30_by_prev_D61 + adjust_D30_by_prev_D70
                 + adjust_D30_by_prev_D81 + adjust_D30_by_prev_D91 + as.factor(num_others), data = data_survival)
print(paste0("HR = ", format(exp(summary(model_B)$coefficients[1, 1]), digits=1, nsmall=1),
             " [95% CI: ", format(summary(model_B)$conf.int[1, 3], digits=1, nsmall=1), "-", 
             format(summary(model_B)$conf.int[1, 4], digits=1, nsmall=1), "]"))

## [1] "HR = 10.5 [95% CI: 10.3-10.6]"

Lagged HR - Model A

times <- c("0-6 months", "6-12 months", "1-2 years", "2-5 years", "5-10 years", "10-15 years", "15+ years")

model_A_lagged <- coxph(Surv(tstart, tstop, endpoint) ~ as.factor(expo_time_after_D30) + kqn + entryD, data = data_survival)

for(i in 1:length(times)) {
  print(paste0(times[i], ": HR = ", format(exp(summary(model_A_lagged)$coefficients[i, 1]), digits=1, nsmall=1), " [95% CI: ", format(summary(model_A_lagged)$conf.int[i, 3], digits=1, nsmall=1), "-", format(summary(model_A_lagged)$conf.int[i, 4], digits=1, nsmall=1), "]"))
}

## [1] "0-6 months: HR = 117.9 [95% CI: 115.9-119.8]"
## [1] "6-12 months: HR = 18.5 [95% CI: 17.7-19.2]"
## [1] "1-2 years: HR = 13.5 [95% CI: 13.0-14.0]"
## [1] "2-5 years: HR = 9.4 [95% CI: 9.2- 9.7]"
## [1] "5-10 years: HR = 7.6 [95% CI: 7.4-7.9]"
## [1] "10-15 years: HR = 7.1 [95% CI: 6.8-7.4]"
## [1] "15+ years: HR = 6.4 [95% CI: 6.2-6.7]"

Lagged HR - Model B

model_B_lagged <- coxph(Surv(tstart, tstop, endpoint) ~ as.factor(expo_time_after_D30) + kqn + entryD
                 + adjust_D30_by_prev_D00 + adjust_D30_by_prev_D10 + adjust_D30_by_prev_D20
                 + adjust_D30_by_prev_D51 + adjust_D30_by_prev_D61 + adjust_D30_by_prev_D70
                 + adjust_D30_by_prev_D81 + adjust_D30_by_prev_D91 + as.factor(num_others), data = data_survival)

for(i in 1:length(times)) {
  print(paste0(times[i], ": HR = ", format(exp(summary(model_B_lagged)$coefficients[i, 1]), digits=1, nsmall=1), " [95% CI: ", format(summary(model_B_lagged)$conf.int[i, 3], digits=1, nsmall=1), "-", format(summary(model_B_lagged)$conf.int[i, 4], digits=1, nsmall=1), "]"))
}

## [1] "0-6 months: HR = 77.9 [95% CI: 76.5-79.2]"
## [1] "6-12 months: HR = 12.2 [95% CI: 11.7-12.7]"
## [1] "1-2 years: HR = 8.9 [95% CI: 8.6-9.2]"
## [1] "2-5 years: HR = 6.2 [95% CI: 6.0-6.4]"
## [1] "5-10 years: HR = 5.0 [95% CI: 4.9-5.2]"
## [1] "10-15 years: HR = 4.6 [95% CI: 4.4-4.8]"
## [1] "15+ years: HR = 4.3 [95% CI: 4.1-4.4]"

2. Sex-specific hazard ratios

Model A (adjusted for sex and calendar time)

model_A <- coxph(Surv(tstart, tstop, endpoint) ~ expo_D30*kqn + entryD, data = data_survival)
matrix_contrasts <-t(
    matrix(c(
      1,0,0,0,
      1,0,0,1
    ), ncol=2))
model_A_sex <- glht(model_A, linfct=matrix_contrasts)
HR <- data.frame(format(exp(confint(model_A_sex)$confint)[1:2, ], digits=1, nsmall=1))
HR[1, "sex"] <- "Females"
HR[2, "sex"] <- "Males"
HR

##   Estimate  lwr  upr     sex
## 1     13.8 13.6 14.1 Females
## 2     19.9 19.4 20.3   Males

Model B (further adjusted for mental health comorbidity)

model_B <- coxph(Surv(tstart, tstop, endpoint) ~ expo_D30*kqn + entryD
                 + adjust_D30_by_prev_D00 + adjust_D30_by_prev_D10 + adjust_D30_by_prev_D20
                 + adjust_D30_by_prev_D51 + adjust_D30_by_prev_D61 + adjust_D30_by_prev_D70
                 + adjust_D30_by_prev_D81 + adjust_D30_by_prev_D91 + as.factor(num_others), data = data_survival)

matrix_contrasts <-t(
  matrix(c(
    1,rep(0, 13), 0,
    1,rep(0, 13), 1
  ), ncol=2))

model_B_sex <- glht(model_B, linfct=matrix_contrasts)
HR <- data.frame(format(exp(confint(model_B_sex)$confint)[1:2, ], digits=1, nsmall=1))
HR[1, "sex"] <- "Females"
HR[2, "sex"] <- "Males"
HR

##   Estimate  lwr  upr     sex
## 1      9.7  9.5  9.9 Females
## 2     12.0 11.7 12.3   Males

Lagged HR - Model A

model_A_lagged <- coxph(Surv(tstart, tstop, endpoint) ~ as.factor(expo_time_after_D30)*kqn + entryD, data = data_survival)
matrix_contrasts <-t(
  matrix(c(
    1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,
    1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,
    0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
    0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,
    0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,
    0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,
    0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,
    0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1    
  ), ncol=14))

model_A_lagged_sex <- glht(model_A_lagged, linfct=matrix_contrasts)
HR <- data.frame(format(exp(confint(model_A_lagged_sex)$confint)[1:14, ], digits=1, nsmall=1))
HR[1:7, "sex"] <- "Females"
HR[8:14, "sex"] <- "Males"
HR$time_after_exposure <- c(times, times)
HR

##    Estimate   lwr   upr     sex time_after_exposure
## 1     100.6  97.5 103.8 Females          0-6 months
## 2      16.9  15.7  18.2 Females         6-12 months
## 3      12.3  11.5  13.1 Females           1-2 years
## 4       8.4   7.9   8.8 Females           2-5 years
## 5       6.9   6.5   7.3 Females          5-10 years
## 6       6.5   6.0   7.1 Females         10-15 years
## 7       5.9   5.5   6.4 Females           15+ years
## 8     156.4 150.4 162.5   Males          0-6 months
## 9      21.7  19.6  24.1   Males         6-12 months
## 10     16.0  14.7  17.5   Males           1-2 years
## 11     11.7  10.9  12.5   Males           2-5 years
## 12      9.2   8.5   9.9   Males          5-10 years
## 13      8.4   7.5   9.3   Males         10-15 years
## 14      7.6   6.9   8.3   Males           15+ years

Lagged HR - Model B

model_B_lagged <- coxph(Surv(tstart, tstop, endpoint) ~ as.factor(expo_time_after_D30)*kqn + entryD
                 + adjust_D30_by_prev_D00 + adjust_D30_by_prev_D10 + adjust_D30_by_prev_D20
                 + adjust_D30_by_prev_D51 + adjust_D30_by_prev_D61 + adjust_D30_by_prev_D70
                 + adjust_D30_by_prev_D81 + adjust_D30_by_prev_D91 + as.factor(num_others), data = data_survival)
matrix_contrasts <-t(
  matrix(c(
    1,0,0,0,0,0,0,0,rep(0, 11),0,0,0,0,0,0,0,0,
    0,1,0,0,0,0,0,0,rep(0, 11),0,0,0,0,0,0,0,0,
    0,0,1,0,0,0,0,0,rep(0, 11),0,0,0,0,0,0,0,0,
    0,0,0,1,0,0,0,0,rep(0, 11),0,0,0,0,0,0,0,0,
    0,0,0,0,1,0,0,0,rep(0, 11),0,0,0,0,0,0,0,0,
    0,0,0,0,0,1,0,0,rep(0, 11),0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,1,0,rep(0, 11),0,0,0,0,0,0,0,0,
    1,0,0,0,0,0,0,0,rep(0, 11),0,1,0,0,0,0,0,0,
    0,1,0,0,0,0,0,0,rep(0, 11),0,0,1,0,0,0,0,0,
    0,0,1,0,0,0,0,0,rep(0, 11),0,0,0,1,0,0,0,0,
    0,0,0,1,0,0,0,0,rep(0, 11),0,0,0,0,1,0,0,0,
    0,0,0,0,1,0,0,0,rep(0, 11),0,0,0,0,0,1,0,0,
    0,0,0,0,0,1,0,0,rep(0, 11),0,0,0,0,0,0,1,0,
    0,0,0,0,0,0,1,0,rep(0, 11),0,0,0,0,0,0,0,1    
  ), ncol=14))

model_B_lagged_sex <- glht(model_B_lagged, linfct=matrix_contrasts)
HR <- data.frame(format(exp(confint(model_B_lagged_sex)$confint)[1:14, ], digits=1, nsmall=1))
HR[1:7, "sex"] <- "Females"
HR[8:14, "sex"] <- "Males"
HR$time_after_exposure <- c(times, times)
HR

##    Estimate  lwr  upr     sex time_after_exposure
## 1      70.4 68.1 72.6 Females          0-6 months
## 2      11.8 10.9 12.7 Females         6-12 months
## 3       8.6  8.0  9.2 Females           1-2 years
## 4       5.8  5.5  6.2 Females           2-5 years
## 5       4.8  4.6  5.1 Females          5-10 years
## 6       4.4  4.1  4.8 Females         10-15 years
## 7       4.1  3.8  4.4 Females           15+ years
## 8      92.7 89.1 96.4   Males          0-6 months
## 9      12.8 11.6 14.2   Males         6-12 months
## 10      9.5  8.7 10.4   Males           1-2 years
## 11      6.9  6.5  7.4   Males           2-5 years
## 12      5.5  5.1  5.9   Males          5-10 years
## 13      4.9  4.4  5.5   Males         10-15 years
## 14      4.6  4.2  5.1   Males           15+ years

7. Absolute risks

Load the data

Load the datasets prepared in steps 1-4.

rm(list = ls())
load(file="SHARED_disease_D41.RData")

Prepare data for a multi-state model

In the Cox proportional hazards models (used to estimate hazard ratios), those who died or emigrated/disappeared were censored at date of death or emigration. The absolute risks can be biased if we do not take into consideration these competing risks events. For this reason, we give a new outcome to those events: 2 for disappearance or emigration, and 3 for death.

data_survival[data_survival$statdA==data_survival$tstop & data_survival$fail %in% c(70, 80), "endpoint"] <- 2
data_survival[data_survival$statdA==data_survival$tstop & data_survival$fail==90, "endpoint"] <- 3

For multi-state models, it is necessary to define a matrix of transitions. In this example, every person starts being free of the later-disorder (Neurotic Disorders), and they are labelled “healthy”. During follow-up, they can change the status from “healthy” to “diseased”, “emigrated”, or “dead”. No other transitions (e.g. from “emigrated” to “diseased”) are allowed.

tmat <- transMat(x=list(c(2,3,4), c(), c(), c()), names=c("healthy", "disease", "emi", "death"))
tmat

##          to
## from      healthy disease emi death
##   healthy      NA       1   2     3
##   disease      NA      NA  NA    NA
##   emi          NA      NA  NA    NA
##   death        NA      NA  NA    NA

There are 4 states, but only 3 transitions. Each observation in the dataset has to be repeated 3 times. In each repetition, one different possible outcome is considered, and the other two treated as censoring.

CRcoxdata <- combine(data_survival, data_survival, data_survival, names = c(1, 2, 3))

A new outcome d can now be created, where each type of outcome is considering censoring in the other two repetitions.

CRcoxdata$d <- ((CRcoxdata$endpoint == 1) * (CRcoxdata$source == 1)) +
  ((CRcoxdata$endpoint == 2) * (CRcoxdata$source == 2)) +
  ((CRcoxdata$endpoint == 3) * (CRcoxdata$source == 3))

CRcoxdata$h <- as.numeric(CRcoxdata$source)

The invented data used in steps 1-4 can be used to illustrate the data structure. Note that the 6 observations for the invented person have been repeated 3 times.

This table includes invented data (not real person)
pnr	D30A	tstart	tstop	endpoint	source	d	h
invented_person	25.50681	14.58900	20.37804	0	1	0	1
invented_person	25.50681	20.37804	25.50681	0	1	0	1
invented_person	25.50681	25.50681	26.00681	0	1	0	1
invented_person	25.50681	26.00681	26.50681	0	1	0	1
invented_person	25.50681	26.50681	27.50681	0	1	0	1
invented_person	25.50681	27.50681	29.44927	1	1	1	1
invented_person	25.50681	14.58900	20.37804	0	2	0	2
invented_person	25.50681	20.37804	25.50681	0	2	0	2
invented_person	25.50681	25.50681	26.00681	0	2	0	2
invented_person	25.50681	26.00681	26.50681	0	2	0	2
invented_person	25.50681	26.50681	27.50681	0	2	0	2
invented_person	25.50681	27.50681	29.44927	1	2	0	2
invented_person	25.50681	14.58900	20.37804	0	3	0	3
invented_person	25.50681	20.37804	25.50681	0	3	0	3
invented_person	25.50681	25.50681	26.00681	0	3	0	3
invented_person	25.50681	26.00681	26.50681	0	3	0	3
invented_person	25.50681	26.50681	27.50681	0	3	0	3
invented_person	25.50681	27.50681	29.44927	1	3	0	3

Focus only on persons with Mood Disorders (prior-disorder)

In these analyses, we aim to estimate the proportion of individuals with a diagnosis of a prior-disorder (Mood Disorders) who later develop a later-disorder (Neurotic Disorders). For this reason, we can keep only those with a diagnosis of Mood Disorders, and start the follow-up time at onset of the disorder.

CRcoxdata_expo <- CRcoxdata[!is.na(CRcoxdata[, "D30A"]),]
CRcoxdata_expo[, "expo_Tstart"] <- CRcoxdata_expo[, "tstart"] - CRcoxdata_expo[, "D30A"]
CRcoxdata_expo[, "expo_Tstop"] <- CRcoxdata_expo[, "tstop"] - CRcoxdata_expo[, "D30A"]
CRcoxdata_expo <- CRcoxdata_expo[CRcoxdata_expo$expo_Tstart >= 0,]

This table includes invented data (not real person)
pnr	D30A	tstart	tstop	expo_Tstart	expo_Tstop	endpoint	source	d	h
invented_person	25.50681	25.50681	26.00681	0.0	0.500000	0	1	0	1
invented_person	25.50681	26.00681	26.50681	0.5	1.000000	0	1	0	1
invented_person	25.50681	26.50681	27.50681	1.0	2.000000	0	1	0	1
invented_person	25.50681	27.50681	29.44927	2.0	3.942466	1	1	1	1
invented_person	25.50681	25.50681	26.00681	0.0	0.500000	0	2	0	2
invented_person	25.50681	26.00681	26.50681	0.5	1.000000	0	2	0	2
invented_person	25.50681	26.50681	27.50681	1.0	2.000000	0	2	0	2
invented_person	25.50681	27.50681	29.44927	2.0	3.942466	1	2	0	2
invented_person	25.50681	25.50681	26.00681	0.0	0.500000	0	3	0	3
invented_person	25.50681	26.00681	26.50681	0.5	1.000000	0	3	0	3
invented_person	25.50681	26.50681	27.50681	1.0	2.000000	0	3	0	3
invented_person	25.50681	27.50681	29.44927	2.0	3.942466	1	3	0	3

The new variables expo_Tstart and expo_Tstop focus on time since onset of Mood Disorders, compared to the old variables tstart and tstop that focus on time since birth (age).

Each replication (h=1, h=2, and h=3) is linked to one specific transition (1–>2, 1–>3, 1–>4). Two new variables from and to define these transitions.

CRcoxdata_expo$from <- 1
CRcoxdata_expo$to <- CRcoxdata_expo$h + 1

In order to investigate specific absolute risks depending on age at diagnosis of Mood Disorders, different age groups have to be identified.

CRcoxdata_expo$age_group <- cut(CRcoxdata_expo[, "D30A"], breaks=c(0, 20, 40, 60, 80, 200), include.lowest=TRUE, right = FALSE)
CRcoxdata_expo$age_group <- droplevels(CRcoxdata_expo$age_group)

Finally, the dataset was splitted many times (because of mental health comorbidity, time since exposure, etc.). This level of detail is not necessary for this analysis, so the dataset can be minimized.

CRcoxdata_expo_min <- data.table(CRcoxdata_expo[, c("pnr", "kqn", "D30A", "age_group", "expo_Tstart", "expo_Tstop", "from", "to", "h", "d")])
CRcoxdata_expo_min <- CRcoxdata_expo_min[, list(tstart=min(expo_Tstart), tstop=max(expo_Tstop), d=max(d)), by=c("pnr", "kqn", "D30A", "age_group", "from", "to", "h")]
CRcoxdata_expo_min <- data.frame(CRcoxdata_expo_min)

This table includes invented data (not real person)
pnr	D30A	age_group	kqn	tstop	from	to	d
invented_person	25.50681	[20,40)	K	3.942466	1	2	1
invented_person	25.50681	[20,40)	K	3.942466	1	3	0
invented_person	25.50681	[20,40)	K	3.942466	1	4	0

The invented person is followed from onset of Mood Disorders (at age 25.5 years) during 3.94 years, moment in which this person is diagnosed with Neurotic Disorders (and censored for emigration or death).

Count number of persons

In this step, the number of persons with Mood Disorders (exposed) is calculated, together with how many develop subsequent Neurotic Disorders (d).

count_data <- CRcoxdata_expo_min[CRcoxdata_expo_min$h==1, c("pnr", "kqn", "age_group", "tstop", "d")]
count_data <- count_data[!duplicated(count_data), ]
count_data$number <- 1
numbers <- aggregate(cbind(number, d) ~ kqn + age_group, data = count_data, FUN=sum, na.rm=TRUE)

# According to Danish regulations, we cannot report cells with less than 3 cases
numbers_print <- numbers
numbers_print[numbers_print$number<3, "number"] <- NA
numbers_print[numbers_print$d<3, "d"] <- NA

kable(numbers_print)

kqn	age_group	number	d
K	[0,20)	11474	4472
M	[0,20)	5398	1649
K	[20,40)	36526	9990
M	[20,40)	23077	6003
K	[40,60)	28614	5470
M	[40,60)	21018	4172
K	[60,80)	17207	1911
M	[60,80)	10121	1140
K	[80,200]	8334	358
M	[80,200]	3212	157

Estimate cumulative incidence proportion (CIP) for each sex and age group

for(age_group in c("all", levels(CRcoxdata_expo_min$age_group))) { 
  # First we select the observations for the age group of interest
  if(age_group=="all") { 
    multi <- CRcoxdata_expo_min
  } else {
    multi <- CRcoxdata_expo_min[CRcoxdata_expo_min$age_group==age_group, ]
  }  
  
  for(sex in c("all", "M", "K")) {
    # Second we select the observations for the sex of interest
    if(sex=="all") {
      multi2 <- multi
    } else {
      multi2 <- multi[(multi[, "kqn"]==sex), ]
    }
    
    # The CIP is estimated only if there is at least one person in the specific sex- and age-group
    if(dim(multi2)[1]>0) {
      multi2$trans <- multi2$h
      coxres <- coxph(Surv(tstart, tstop, d) ~ strata(trans), data=multi2, method="breslow")
      who <- multi2[c(1:3), ]
      who$strata<-1:3
      msf0 <- msfit(coxres, who, trans=tmat)
      pt0 <- probtrans(msf0, predt=0)
      CIP_aux <- pt0[[1]]
      CIP_aux$sex <- sex
      CIP_aux$age_group <- age_group
      
      if (age_group=="all" & sex=="all") {
        CIP <- CIP_aux
      } else {
        CIP <- rbind(CIP, CIP_aux)
      }
    }
  }    
}

head(CIP)

##          time   pstate1     pstate2 pstate3      pstate4          se1
## 1 0.000000000 1.0000000 0.000000000       0 0.000000e+00 0.0000000000
## 2 0.002732240 0.9983728 0.001573522       0 5.364279e-05 0.0001204172
## 3 0.002732240 0.9980241 0.001913317       0 6.258478e-05 0.0001326891
## 4 0.002737851 0.9916574 0.008280012       0 6.258478e-05 0.0002712736
## 5 0.002737851 0.9655200 0.034417440       0 6.258478e-05 0.0005397441
## 6 0.002737851 0.9612815 0.038655942       0 6.258478e-05 0.0005712619
##            se2 se3          se4 sex age_group
## 1 0.0000000000   0 0.000000e+00 all       all
## 2 0.0001184157   0 2.186394e-05 all       all
## 3 0.0001305720   0 2.362047e-05 all       all
## 4 0.0002702575   0 2.362047e-05 all       all
## 5 0.0005392612   0 2.362047e-05 all       all
## 6 0.0005708098   0 2.362047e-05 all       all

Keep only necessary information

There is information on too many data points and for all possible transitions. In this step, we will keep only one estimate per month (aprox 0.08 years) and only transition 1 (going from “Healthy” to “Diseased”).

CIP$time_since_dx <- 0.08 * ceiling(as.numeric(CIP$time)/0.08)
CIP <- ddply(CIP, .(sex, age_group, time_since_dx), function(x) tail(x, 1))
CIP$cip <- 100*CIP$pstate2
CIP[, "cip_low"] <- pmax(0, 100*exp(log(CIP[, "pstate2"]) - qnorm(0.975) * (CIP[, "se2"] / CIP[, "pstate2"])))
CIP[, "cip_high"] <- pmin(100, 100*exp(log(CIP[, "pstate2"]) + qnorm(0.975) * (CIP[, "se2"] / CIP[, "pstate2"])))
CIP <- CIP[, c("sex", "age_group", "time", "cip", "cip_low", "cip_high")]
head(CIP)

##   sex age_group       time      cip  cip_low cip_high
## 1 all    [0,20) 0.00000000  0.00000      NaN      NaN
## 2 all    [0,20) 0.07945205 14.09683 13.55720 14.65794
## 3 all    [0,20) 0.15890411 16.30146 15.72631 16.89764
## 4 all    [0,20) 0.23835616 17.64508 17.05019 18.26074
## 5 all    [0,20) 0.31780822 18.62727 18.01874 19.25635
## 6 all    [0,20) 0.40000000 19.34755 18.72941 19.98608

We are interested on the first 15 years after the prior-disorder. We will delete the observations after 15 years, and extend the last observation until exact 15 years after the prior-disorder.

CIP <- CIP[CIP$time<=15, ]
CIP_last <- ddply(CIP, .(sex, age_group), function(x) x[nrow(x), ])
CIP_last$time <- 15
CIP <- rbind(CIP, CIP_last)

Merge the number of cases and the CIP estimates

numbers <- numbers[!duplicated(numbers), ]
numbers2 <- aggregate(cbind(number, d)~kqn, data=numbers, FUN=sum)
numbers2$age_group <- "all"
numbers3 <- aggregate(cbind(number, d)~age_group, data=numbers, FUN=sum)
numbers3$kqn <- "all"
numbers4 <- aggregate(cbind(number, d)~1, data=numbers, FUN=sum)
numbers4$age_group <- "all"
numbers4$kqn <- "all"
numbers <- rbind(numbers, numbers2, numbers3, numbers4)
colnames(numbers)[colnames(numbers)=="kqn"] <- "sex"

CIP <- merge(CIP, numbers, all.x=T, all.y=F)
colnames(CIP)[colnames(CIP)=="d"]<-"cases_outcome"
colnames(CIP)[colnames(CIP)=="number"]<-"number_exposed"

# According to Danish regulations, we cannot report cells with less than 3 cases
CIP[CIP$cases_outcome<3, "cip"] <- NA
CIP[CIP$cases_outcome<3, "cip_low"] <- NA
CIP[CIP$cases_outcome<3, "cip_high"] <- NA
CIP[CIP$cases_outcome<3, "cases_outcome"] <- NA
CIP[CIP$number_exposed<3, "number_exposed"] <- NA

head(CIP)

##   sex age_group       time      cip  cip_low cip_high number_exposed
## 1 all    [0,20) 0.00000000  0.00000      NaN      NaN          16872
## 2 all    [0,20) 0.07945205 14.09683 13.55720 14.65794          16872
## 3 all    [0,20) 0.15890411 16.30146 15.72631 16.89764          16872
## 4 all    [0,20) 0.23835616 17.64508 17.05019 18.26074          16872
## 5 all    [0,20) 0.31780822 18.62727 18.01874 19.25635          16872
## 6 all    [0,20) 0.40000000 19.34755 18.72941 19.98608          16872
##   cases_outcome
## 1          6121
## 2          6121
## 3          6121
## 4          6121
## 5          6121
## 6          6121