`survfit.coxphms` Computing state occupancy curves on new data without repeating for every stratum #279

steliosbl · 2024-09-13T16:42:51Z

Good morning,
I would like to ask about survfit with multi-state Cox PH models. Specifically, I am asking about the behaviour when strata variables are included in newdata.

In the documentation for survfit.coxph you indicate:

> If newdata does contain strata variables, then the result will contain one curve per row of newdata, based on the indicated stratum of the original model.

However, when running it myself, I notice that the resulting survfit object has multiple curves per sample, one for each stratum value (leaving aside the separate curves per state). This happens even though I included the "ground truth" stratum value for each sample in newdata.

In an internal code comment (coxsurv3.Rnw:206) you mention:

> We do completely separate computations for each stratum: the time scale starts over, nrisk, etc. Each has a separate call to the multihaz function.

My questions, therefore, are:

1. Is it expected that, in the multi-state case, the strata variables in newdata are ignored and survfit computes a curve for every stratum regardless? Or am I making a mistake in my call to survfit?
2. If the former, is there a way to avoid this redundant computation? I know ahead of time which stratum each of my newdata samples belongs to. If it is somehow possible to avoid computing curves for all strata, this would save a great deal of computation. In my case, I am stratifying by sex, meaning that the call to survfit currently produces double the number of curves that I actually need.

Thank you very much for your assistance, and for your especially quick response to my last question.


therneau · 2024-09-13T17:55:35Z

I am off to a conference so can't do a long reply. The `[.survfit` and `dim.survfit` functions allow a user to view a multi-state survival curve as a 3-way array of curves: (strata, data, state). What you are asking makes sense, but it has a long comet's tail of implications in the code, some rather complicated. (Take a look at `[.survfit` sometime: it is a maze of if/then/else; I have a hard time parsing it, and I wrote it.)
One question in response: can you give more detail on the case where this will "save a great deal of time"?
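
For readers following along, the array view mentioned above can be inspected with a small sketch. Everything in it is an assumption for illustration and not taken from this thread: the `mgus2` data set shipped with the survival package, the covariate `age`, and the two-row `nd` data frame. It fits a multi-state Cox model stratified by sex, calls `survfit` with `newdata` that names each row's stratum, and prints the dimensions of the result.

```r
library(survival)

# Illustrative multi-state outcome built from mgus2 (an assumption, not the poster's data):
# each subject ends in "pcm", "death", or is censored.
etime <- with(mgus2, ifelse(pstat == 0, futime, ptime))
event <- with(mgus2, ifelse(pstat == 0, 2 * death, 1))
event <- factor(event, 0:2, labels = c("censor", "pcm", "death"))

# Multi-state Cox model stratified by sex
fit <- coxph(Surv(etime, event) ~ age + strata(sex), data = mgus2, id = id)

# Hypothetical new subjects; each row states the stratum it belongs to
nd <- data.frame(
  age = c(60, 70),
  sex = factor(c("F", "M"), levels = levels(mgus2$sex))
)

sf <- survfit(fit, newdata = nd)

# Inspect how many curves were produced along each dimension
print(dim(sf))
```

If the printed dimensions show a strata entry larger than one, the result holds a curve for every stratum and every row of `nd`, which is the behaviour described in the original question.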

steliosbl · 2024-09-13T22:36:00Z

Thanks for your response.

I am dealing with relatively large datasets of ~10 million records, each undergoing 3-4 state transitions during the observation period. The newdata validation set then contains ~2 million new individuals that I want to put through survfit(). The coxph() fitting usually takes a few hours on this dataset, and the subsequent survfit() call takes several days. Hence, I am saving time and computation wherever I can. Any advice you could provide would be greatly appreciated. My most effective optimisation so far has been to coarsen the time scale and thus reduce the number of distinct observed event times in the data. However, I am wary of the enormous number of ties this creates.

As for the current question, the Cox model is stratified by sex (binary), which is obviously a fixed value for each individual that is known ahead of time. Producing curves for both sex values for each individual intuitively doubles the amount of computation that we need to do. Hence, I am imagining that it would save a lot of time and processing if we were able to compute only the curves for the stratum we know the individuals to belong to.

I admit I am ignorant of the inner workings of survfit: I imagine that my thinking (half the number of curves -> half the time needed to compute them) is quite simplistic. I am also curious to know whether there's a good reason we need both curves for every individual for some other computation.

Thank you for your assistance, and enjoy the conference.
