survfit.coxphms Computing state occupancy curves on new data without repeating for every stratum #279

Open
steliosbl opened this issue Sep 13, 2024 · 2 comments

@steliosbl

steliosbl commented Sep 13, 2024

Good morning,
I would like to ask about survfit with multi-state Cox PH models. Specifically, I am asking about its behaviour when the model has strata variables and these are included in newdata.

In the documentation for survfit.coxph you indicate:

If newdata does contain strata variables, then the result will contain one curve per row of newdata, based on the indicated stratum of the original model.

However, when running it myself, I notice that the resulting survfit object has multiple curves per sample, one per stratum value (leaving aside the separate curves per state). This is despite the fact that I included the "ground truth" stratum value for each sample in newdata.

In an internal code comment (coxsurv3.Rnw:206) you mention:

We do completely separate computations for each stratum: the time scale starts over, nrisk, etc. Each has a separate call to the multihaz function.

My questions, therefore, are:

  • Is it expected that, in the multi-state case, the strata variables in newdata are ignored and survfit computes a curve for every stratum regardless? Or am I making a mistake in my call to survfit?
  • If the former, is there a way to avoid this redundant computation? I know ahead of time which stratum my newdata samples belong to. If it is somehow possible to avoid computing curves for all strata, this would save a great deal of computation. In my case, I am stratifying by sex, meaning that the call to survfit currently produces double the number of curves that I actually need.

Thank you very much for your assistance, and for your especially quick response to my last question.

@therneau
Owner

I am off to a conference so can't do a long reply. The [.survfit and dim.survfit functions allow a user to view a multi-state survival curve as a 3-way array of curves: (strata, data, state). What you are asking makes sense, but it has a long comet's tail of implications in the code, some rather complicated. (Take a look at [.survfit sometime: it is a maze of if/then/else; I have a hard time parsing, and I wrote it.)
One question in response: can you give more detail on the case where this will "save a great deal of time"?
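
For readers following along, the 3-way array view described above can be inspected roughly as follows. This is only a sketch: it uses the mgus2 data that ships with the survival package rather than the data from this issue, and the model, the newdata values, and the object names (d, fit, nd, sf) are assumptions made up for illustration. The subscript is guarded because the exact dimensions returned may depend on the installed version.

```r
library(survival)

# Build a simple multi-state endpoint from mgus2 (illustrative data only)
d <- mgus2
d$etime <- with(d, ifelse(pstat == 0, futime, ptime))
d$event <- with(d, ifelse(pstat == 0, 2 * death, 1))
d$event <- factor(d$event, 0:2, labels = c("censor", "pcm", "death"))

# Multi-state Cox model stratified by sex
fit <- coxph(Surv(etime, event) ~ age + strata(sex), data = d, id = id)

# Hypothetical new subjects, each with a known stratum value
nd <- data.frame(age = c(60, 75),
                 sex = factor(c("F", "M"), levels = levels(d$sex)))

sf <- survfit(fit, newdata = nd)

# dim() reports how the curves are laid out (strata, data, states)
print(dim(sf))

# If all three dimensions are present, pick one stratum and one newdata row,
# keeping every state
if (length(dim(sf)) == 3) {
  one <- sf[1, 1, ]
  print(dim(one))
}
```

Subscripting this way only selects curves after they have been computed, so it tidies the output but does not reduce the work done inside survfit.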

@steliosbl
Author

steliosbl commented Sep 13, 2024

Thanks for your response.

I am dealing with relatively large datasets of ~10 million records, each undergoing 3-4 state transitions during the observation period. The newdata validation set then contains ~2 million new individuals that I want to put through survfit(). The coxph() fitting usually takes a few hours on this dataset, and the subsequent survfit() call takes several days. Hence, I am saving time and computation wherever I can. Any advice you could provide would be invaluable and greatly appreciated. My most effective optimisation so far has been to coarsen the time scale and thus reduce the number of distinct observed event times in the data. However, I am cautious of the enormous number of ties this creates.
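
To make the time-coarsening trade-off concrete, here is a minimal base R sketch. The times are simulated, and the 30-day bin width is an arbitrary assumption; neither comes from the actual dataset. It shows how rounding times up to a grid shrinks the number of distinct times while producing large groups of ties.

```r
set.seed(42)

# Simulated event times in days (hypothetical, for illustration only)
time_days <- rexp(1000, rate = 1 / 365)

# Round each time up to the next multiple of 30 days;
# ceiling() keeps every coarsened time strictly positive
time_coarse <- ceiling(time_days / 30) * 30

n_fine   <- length(unique(time_days))    # distinct times before coarsening
n_coarse <- length(unique(time_coarse))  # distinct times after coarsening
max_ties <- max(table(time_coarse))      # size of the largest tied group

print(c(fine = n_fine, coarse = n_coarse, largest_tie = max_ties))
```

Comparing the largest tied group against the number of subjects at risk is one quick way to judge whether a given bin width is too aggressive.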

As for the current question, the Cox model is stratified by sex (binary), which is obviously a fixed value for each individual that is known ahead of time. Producing curves for both sex values for each individual intuitively doubles the amount of computation that we need to do. Hence, I am imagining that it would save a lot of time and processing if we were able to compute only the curves for the stratum we know the individuals to belong to.
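
The bookkeeping behind this can be sketched with a plain base R array standing in for the (strata, data, state) layout. This is not a survfit object, and the sizes, labels, and names (curve_id, known_sex, wanted) are hypothetical. It only counts curves; it says nothing about how the running time of survfit actually scales.

```r
# Hypothetical sizes, chosen only for illustration
n_strata <- 2
n_rows   <- 5
n_states <- 3

# One placeholder id per curve, arranged as (strata, data, states)
curve_id <- array(
  seq_len(n_strata * n_rows * n_states),
  dim = c(n_strata, n_rows, n_states),
  dimnames = list(strata = c("F", "M"),
                  data   = paste0("row", seq_len(n_rows)),
                  states = paste0("s", seq_len(n_states)))
)

# Known stratum of each newdata row
known_sex <- c("F", "M", "M", "F", "M")

# Matrix index: for every (row, state) pair, use that row's own stratum
idx <- cbind(
  rep(match(known_sex, dimnames(curve_id)$strata), times = n_states),
  rep(seq_len(n_rows), times = n_states),
  rep(seq_len(n_states), each = n_rows)
)
wanted <- matrix(curve_id[idx], nrow = n_rows)

length(curve_id)  # 30 curves across all strata
length(wanted)    # 15 curves for the matching stratum only
```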

I admit I am ignorant of the inner workings of survfit: I imagine that my thinking (half the number of curves -> half the time needed to compute them) is quite simplistic. I am also curious to know whether there's a good reason we need both curves for every individual for some other computation.

Thank you for your assistance, and enjoy the conference.
