A Handy Tool for Estimating Patient Populations

sahoang
Jan 18, 2024

What is the size of the patient population?

This question is of great interest across various contexts within the pharmaceutical industry. For a therapeutic program, the answer, known as disease prevalence, has implications for societal impact, market analysis, investment risk assessment, and clinical study design. (Note: "Prevalence" can refer to either frequency in a population or the total number of cases in a population. The definitions are similar, but we use the latter in this post.) Unfortunately, prevalence estimates are not always easy to come by. Directly measuring prevalence requires longitudinal monitoring and data collection, which is challenging and resource-intensive. It is usually much easier to find estimates of incidence — the rate at which new cases are observed in a population. Incidence estimates are more common because they can be calculated directly from hospital records, patient registries, and insurance claims databases.

So, how can we estimate prevalence from incidence? What additional information do we need? For diseases diagnosed early in life, incidence is often expressed as "1 in X live births." So, if we knew the disease incidence and the overall birth rate in a population, we would also know the rate of new cases appearing in that population (i.e., birth rate times incidence). If we also knew the rate of cases being removed from the population through either treatment or death, we would have enough information to estimate the prevalence. Here is a simple Shiny app I wrote to do just that.
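To make the "birth rate times incidence" step concrete, here is a minimal Python sketch. It is not code from the app; the function name is hypothetical, and the figure of roughly 3.6 million US live births per year is an illustrative assumption that should be replaced with a current value.

```python
def annual_new_cases(births_per_year: float, births_per_case: float) -> float:
    """New cases per year for an incidence of 1 in `births_per_case` live births."""
    return births_per_year / births_per_case


# Illustrative assumption: roughly 3.6 million US live births per year.
US_BIRTHS_PER_YEAR = 3_600_000

for births_per_case in (150_000, 50_000):
    cases = annual_new_cases(US_BIRTHS_PER_YEAR, births_per_case)
    print(f"1 in {births_per_case:,} births -> {cases:.0f} new cases per year")
```

Under that assumed birth figure, an incidence of 1 in 150,000 births corresponds to 24 new cases per year, and 1 in 50,000 births to 72.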

It generates figures that look like this:

This plot shows prevalence estimates (color) as a function of incidence (y-axis) and patient life expectancy (x-axis). This example shows the possible prevalence values given an incidence of 1:150,000 to 1:50,000 births and a median life expectancy of 20 to 40 years. The plot tells us that the total number of patients is anywhere from ~1000 to ~4500, assuming the population and birth rate of the US. This visualization has a couple of strengths. First, it shows the full set of prevalence estimates over a range of possible incidence and life expectancy values. This is valuable because, for a given disease, incidence and life expectancy are often not precisely known, but a plausible range for each can be cited. Second, it shows the contour lines along which the prevalence estimate is constant. In practice, there is often a minimum prevalence target for a therapeutic program, corresponding to a contour line on the plot. In such a case, the job of an analyst is to determine the likelihood that the disease in question is uphill of that contour line.
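The grid behind a plot like this can be sketched in a few lines of Python. This is an illustration rather than the app's code: the function names, the 3.6 million births per year, and the target of 2,000 patients are all hypothetical, and the prevalence formula is the steady-state expression derived at the end of this post. The exact numbers depend on the birth figure used, so they need not match the plot.

```python
import math


def steady_state_prevalence(new_cases_per_year, median_survival_years):
    """Steady-state prevalence for a given case inflow and median survival."""
    gamma = 1.0 - math.exp(-math.log(2) / median_survival_years)
    return new_cases_per_year / gamma


def prevalence_grid(births_per_year, births_per_case_values, median_survival_values):
    """Rows follow incidence (1 in X births); columns follow median life expectancy."""
    return [
        [steady_state_prevalence(births_per_year / x, s) for s in median_survival_values]
        for x in births_per_case_values
    ]


def fraction_above(grid, target):
    """Share of grid cells whose prevalence is at or above `target`."""
    cells = [p for row in grid for p in row]
    return sum(p >= target for p in cells) / len(cells)


BIRTHS_PER_YEAR = 3_600_000  # illustrative assumption for the US
incidences = [150_000, 125_000, 100_000, 75_000, 50_000]  # 1 in X births
survivals = [20, 25, 30, 35, 40]  # median life expectancy in years

grid = prevalence_grid(BIRTHS_PER_YEAR, incidences, survivals)
print(f"lowest estimate:  {min(map(min, grid)):.0f}")
print(f"highest estimate: {max(map(max, grid)):.0f}")
print(f"share of grid at or above 2,000 patients: {fraction_above(grid, 2_000):.2f}")
```

The share of grid cells above the target is only a crude stand-in for the likelihood of being uphill of a contour line, because it weights every grid point equally. A real analysis would weight the cells by how plausible each incidence and life expectancy value is.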

A few notes, caveats, and disclaimers:

The estimates depend on the total population size and overall birth rate. The app has pre-set values for the US, EU, and Japan, but users can also specify their own values.
The frequency at which cases are eliminated from the population is estimated using the median patient life expectancy. This is an oversimplification if one is considering diseases that have effective treatments, i.e., when cases are eliminated from the population through treatment.
Incidence is specified per live birth (1 in X births) and converted to new cases per year using the birth rate. This simplification works well for diseases that are diagnosed early in life. It is not suitable for diseases that, for example, have a varying or unpredictable onset.
The prevalence estimates assume a steady state, in which new cases and removals balance and prevalence is no longer changing.
There are many other approaches to estimating prevalence that may be more or less appropriate, depending on the context. This is a simple method that is reasonable and interpretable when its assumptions are met.
The source code is available on GitHub.

So what's under the hood?

For those interested, the rationale and details of the calculations are below.

We want an expression that gives us prevalence in terms of incidence. We know that the change in prevalence over time is the incidence minus the rate at which cases disappear from the population:

`(dP)/(dt)=I-gamma P quad quad quad quad (1)`

Here, `gamma` is the proportion of the patient population lost per unit time. Solving this differential equation with the initial condition `P(0) = 0`, we get

`P(t) = (I(1-e^(-gamma t)))/(gamma) quad quad quad quad (2)`
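As a sanity check on this step, the sketch below integrates equation (1) numerically from `P(0) = 0` with a simple forward-Euler scheme and compares the result with the closed-form solution for that initial condition, `I(1 - e^(-gamma t))/gamma`, and with the limit `I/gamma`. The function names and the values of `I` and `gamma` are hypothetical and chosen only for illustration.

```python
import math


def simulate_prevalence(new_cases_per_year, gamma, years, steps_per_year=100):
    """Forward-Euler integration of dP/dt = I - gamma * P, starting from P(0) = 0."""
    dt = 1.0 / steps_per_year
    prevalence = 0.0
    for _ in range(years * steps_per_year):
        prevalence += (new_cases_per_year - gamma * prevalence) * dt
    return prevalence


def closed_form_prevalence(new_cases_per_year, gamma, years):
    """Solution of dP/dt = I - gamma * P with P(0) = 0."""
    return new_cases_per_year * (1.0 - math.exp(-gamma * years)) / gamma


I, gamma = 50.0, 0.03  # hypothetical: 50 new cases per year, 3% of patients lost per year

for years in (10, 50, 200):
    numeric = simulate_prevalence(I, gamma, years)
    exact = closed_form_prevalence(I, gamma, years)
    print(f"t = {years:>3} years: simulated {numeric:8.1f}, closed form {exact:8.1f}")

print(f"limit I / gamma: {I / gamma:.1f}")
```

With these numbers, the simulated and closed-form values should agree closely, and both should approach `I/gamma` as the number of years grows.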

A common assumption in survival analysis is that survival times are exponentially distributed. Under this assumption, the proportion of patients that have not survived at time `t` is given by the cumulative distribution function of the exponential distribution:

`F(t, lambda) = 1 - e^(-lambda t) quad quad quad quad (3)`

And since `gamma` is the proportion lost after one unit of time, setting `t = 1` in equation (3) gives

`gamma = 1 - e^(-lambda) quad quad quad quad (4)`

The rate parameter `lambda` can be expressed in terms of median survival time, `S_m`:

`lambda = (ln(2))/S_m quad quad quad quad (5)`

Combining equations 4 and 5, we get:

`gamma = 1 - e^(-ln(2)/S_m) quad quad quad quad (6)`
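Equations (5) and (6) translate directly into code. The following Python sketch uses hypothetical function names and assumes that median survival is given in years, so that `gamma` is the fraction of patients lost per year.

```python
import math


def rate_from_median(median_survival_years):
    """Exponential rate parameter lambda from median survival, equation (5)."""
    return math.log(2) / median_survival_years


def annual_loss_fraction(median_survival_years):
    """Fraction of patients lost per year (gamma), equation (6)."""
    return 1.0 - math.exp(-rate_from_median(median_survival_years))


for s_m in (1, 20, 40):
    print(f"median survival {s_m:>2} years -> gamma = {annual_loss_fraction(s_m):.4f}")
```

A useful check is the case of a one-year median survival: half of the patients are lost within a year, so `gamma` comes out as 0.5.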

As `t -> oo` (i.e., under the steady-state assumption), the exponential term in equation (2) goes to zero, leaving `P_(ss) = I/gamma`. Substituting (6) for `gamma`, we get the steady-state prevalence in terms of incidence and median life expectancy:

`P_(ss) = I/(1-e^(-ln(2)/S_m)) quad quad quad quad (7)`

This is the expression that the app uses to calculate the prevalence at each point, `(I, S_m)`.
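For a single point `(I, S_m)`, the calculation can be sketched as below. This is a Python illustration, not the app's source: the function name is hypothetical, `I` is taken to be new cases per year (birth rate times incidence), and the example inputs are made up.

```python
import math


def steady_state_prevalence(new_cases_per_year, median_survival_years):
    """Steady-state prevalence: I / (1 - exp(-ln(2) / S_m))."""
    if median_survival_years <= 0:
        raise ValueError("median survival must be positive")
    gamma = 1.0 - math.exp(-math.log(2) / median_survival_years)
    return new_cases_per_year / gamma


# Hypothetical inputs: 36 new cases per year (for example, 1 in 100,000 births
# at an assumed 3.6 million births per year) and a median life expectancy of 30 years.
estimate = steady_state_prevalence(36, 30)
print(f"estimated steady-state prevalence: {estimate:.0f} patients")
```

One quick way to check the formula is a median survival of one year: half of the patients are lost each year, so the steady-state prevalence is twice the annual number of new cases.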

STEELYARD SCIENCE

Biomedical Data Science
