# An Estimate of the U.S. Effective Reproduction Number for COVID-19

Updated May 7, 2020 for the latest data and corrected an arithmetic error, which leads to revised conclusions.

## Summary

The effective reproduction number is the average number of people that an infected person will go on to infect with a contagious disease. Through measures such as social distancing, we can reduce that number below a critical threshold of 1; if it stays there, the disease will hopefully die out without a cure or vaccine. This analysis finds that as of May 6, the effective reproduction number is above 1, at 1.247 (95% confidence interval = [1.235, 1.258]). It trended strongly downward from mid-March to mid-April; since mid-April, however, it has leveled off and has been oscillating between about 0.86 and 1.36, at an average of 1.1 for the last 30 days. These results update the mid-April analysis and correct an error; the corrected calculation shows the effective reproduction number spending more time above 1 than previously shown, although it is still close to 1.

Although the disease is spreading fairly slowly as compared to its initial introduction in the U.S., social distancing and so forth as they have been implemented will not be sufficient to let the disease die out on its own. I am very skeptical of re-opening the economy, and will continue to be without far more robust contact tracing and testing systems in place, or a vaccine or cure. I also think the oscillations in the curves show that people are not observing social distancing as strictly on the weekends. Finally, I discuss the effect of undercounting on the calculation, and how it might be addressed.

The data for this analysis is available from the N.Y. Times. The code is here.

## Introduction

In controlling the spread of an infectious disease, it is important to know how contagious it is. One simple way of quantifying contagiousness is the average number of people each infected person will directly infect. This is called the reproduction number, denoted $\displaystyle R$. For example, if the average infected person will directly infect 5 other people, then we say $R = 5$.

One of the reasons that $\displaystyle R$ is such a useful measure is that it gives a simple way of predicting what a disease will do. If $\displaystyle R > 1$, i.e., if each infected person directly infects more than one other person on average, then the disease will spread at an increasing rate throughout the population (assuming there are enough susceptible people). But if $R < 1$, the disease will tend to die out on its own, even in the absence of a vaccine or cure, because it is just not infecting people fast enough.

The reproduction number is not constant over the course of an epidemic. It depends on anything that could affect how fast the disease spreads: the mechanism of transmission, how many people the average person interacts with, sanitation, etc. Controlling the disease means taking measures that get $\displaystyle R$ as low as is feasible – which means knowing whether what we’re doing now is working to get the reproduction number down.

### Basic Reproduction Number

A disease’s basic reproduction number, denoted $\displaystyle R_0$, is the number of direct infections the average person will produce early on in the epidemic. At this time, people do not know about the disease’s appearance or haven’t had time to react. $\displaystyle R_0$ represents the disease’s contagiousness when it is spreading more or less unchecked.

One estimate of COVID-19’s basic reproduction number from the early phase of the first outbreak in China put $\displaystyle R_0$ in the range of 2.24 – 3.58.[1] For comparison, ordinary influenza has an $\displaystyle R_0$ of 0.9 to 2.1.[2] An extremely contagious disease like chickenpox, the scourge of elementary school students everywhere, has an $\displaystyle R_0$ of 10-12.[3] There is a vaccine now, but until recently, with $\displaystyle R_0$ that high, contracting it as a child was accepted as inevitable. The common cold has an $\displaystyle R_0$ of 2-3.[4] This means COVID-19 spreads about as easily as the cold – but COVID-19 is orders of magnitude more deadly (not to mention that it spreads from people who do not have symptoms).

### Effective Reproduction Number

So we can see why reducing the effective reproduction number, i.e., $\displaystyle R$ at any time after the beginning of the epidemic, through measures to mitigate transmission is so important.[5] COVID-19 spreads through droplets expelled from the respiratory system that hang in the air near the infected person or get deposited on surfaces. Social distancing and isolation reduce how often people inhale infected droplets and reduce the rate at which they touch infected surfaces. Frequent hand-washing removes the infected droplets from your hands so you don’t deposit the virus in your mouth or nose. And so forth. The more extreme these measures, the lower the effective reproduction number will drop. If we can hold the effective reproduction number under 1 for long enough, we can eliminate the disease entirely, without a vaccine or cure, or at least keep it at a manageable level. We will denote the effective reproduction number as $\displaystyle R_t$, that is, $\displaystyle R$ at a given time $\displaystyle t$. $\displaystyle R_0$, then, is just $\displaystyle R_t$ at time $\displaystyle t = 0$.

If we have a way of estimating $\displaystyle R_t$, and observing how it evolves over time, then we have a way of knowing how effective our mitigation measures are (which way is $\displaystyle R_t$ trending, and how fast?), and how close we are to possibly containing and/or eliminating the disease (how close are we to $\displaystyle R_t=1$?).

## The Renewal Equation

If we have data on the incidence of the disease, the rate at which new cases appear over time, and an estimate of its generation interval, the average amount of time it takes an infected person to directly infect one other person (the time to “generate” another case), we can estimate $\displaystyle R_t$  with the following renewal equation[6][7]:

$\displaystyle R_t = \frac {I(t)} { \int_0^t I(t - \tau) w(\tau) d\tau}$

$I(t)$ is the incidence. We treat generation interval $\tau$ as probabilistic with a certain distribution $w(\tau)$.[8] We can use a gamma distribution to estimate the probability of a waiting time[9] (here, the waiting time from infection in one person to the next person) so that

$w(\tau) = {\displaystyle {\frac {1}{\Gamma (k)\theta ^{k}}}\tau^{k-1}e^{-{\frac {\tau}{\theta }}}}$

with

${\displaystyle \Gamma (k)=\int _{0}^{\infty }x^{k-1}e^{-x}\,dx\ \qquad k>0\ .}$

The COVID-19 mean generation interval $\mu$ has been estimated at 4.41 days, with a standard deviation of 3.17.[10] To select the parameters $k$ and $\theta$, we observe that for a gamma distribution, the mean is equal to $k \theta$, and the standard deviation equal to $\sqrt{k \theta^2}$. Letting $k \theta = 4.41$ and $\sqrt{k \theta^2} = 3.17$, we have $k = 1.94$ and $\theta = 2.28$. [Note: An earlier version of this stated incorrect values of k and theta due to an arithmetic error. The plot of the distribution below is also corrected.]
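This parameter calculation can be checked in a few lines. Here is a minimal sketch (the variable names are mine), using the method-of-moments relations just stated:

```python
# Method of moments for the gamma distribution:
# mean = k * theta, standard deviation = sqrt(k) * theta.
mean_gi = 4.41  # mean generation interval, days (Shim et al. 2020)
sd_gi = 3.17    # standard deviation of the generation interval, days

theta = sd_gi**2 / mean_gi  # scale = variance / mean
k = mean_gi / theta         # shape = mean / scale

print(round(k, 2), round(theta, 2))  # 1.94 2.28
```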

Now let us see what the renewal equation is saying. It is just the number of new cases at time $t$, divided by a weighted sum representing the number of cases that directly produced them.[11] These cases appeared one generation interval $\tau$ days ago, and the weights are just the probabilities that the generation interval actually is $\tau$.

### Discretization

The above form of the renewal equation and the generation interval probability distribution are both continuous, as is the incidence function $I(t)$. But in practice, $I(t)$ will not be reported continuously but discretely, at regular (or irregular) intervals.[12] In the United States, as in most other countries, sources publish daily data on reported cases. The data we will use is the daily data from the New York Times. So, we need the discrete form of the renewal equation[13]:

$\displaystyle R_{t_i} = \frac {I_i} { \sum_{j=0}^{i} I_{i-j} w_j}$

where $i$ is the age of the epidemic in days and $j$ is the generation interval in days. We also need $w_j$, the discretized form of the gamma distribution[14]:
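The discrete renewal equation translates directly into code. Here is a sketch (function name, and the toy weights in the usage example, are mine; the real weights come from the discretized gamma distribution below):

```python
import numpy as np

def renewal_rt(incidence, w):
    """Estimate R_t on each day via the discrete renewal equation.

    incidence: array of daily new cases I_i
    w: array of generation-interval probabilities w_j (j = 0, 1, ...)
    """
    incidence = np.asarray(incidence, dtype=float)
    rt = np.full(len(incidence), np.nan)
    for i in range(len(incidence)):
        js = np.arange(min(i + 1, len(w)))         # j = 0 .. min(i, len(w)-1)
        denom = np.sum(incidence[i - js] * w[js])  # weighted sum of source cases
        if denom > 0:
            rt[i] = incidence[i] / denom
    return rt

# Sanity check: with weights summing to 1 and flat incidence, R_t settles at 1.
w = np.array([0.5, 0.3, 0.2])  # toy generation-interval probabilities
print(renewal_rt([100] * 10, w)[-1])  # approximately 1
```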

$\displaystyle w(j)= \frac{1}{ \Gamma (k)} \left( \int _{{j}/{\theta}}^{\infty }x^{k-1}e^{-x}\,dx\ - \int _{{(j+1)}/{\theta}}^{\infty }x^{k-1}e^{-x}\,dx\ \right)$

The two terms on the right side of the above equation are the upper incomplete gamma function evaluated at $j / \theta$ and $(j+1) / \theta$. SciPy provides a built-in function, gammaincc(), that computes the regularized upper incomplete gamma function, i.e., with the division by $\Gamma(k)$ already included.
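A sketch of computing the $w_j$ this way (the function and variable names are mine):

```python
import numpy as np
from scipy.special import gammaincc  # regularized upper incomplete gamma

def generation_weights(k, theta, max_days):
    """w_j = P(j <= tau < j+1) for a gamma-distributed generation interval."""
    j = np.arange(max_days + 1)
    # gammaincc already divides by Gamma(k), matching the 1/Gamma(k)
    # prefactor in the formula above.
    return gammaincc(k, j / theta) - gammaincc(k, (j + 1) / theta)

w = generation_weights(1.94, 2.28, max_days=30)
```

The weights are non-negative and sum to (nearly) 1 once `max_days` covers the tail of the distribution.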

Our probability function with $k = 1.94$ and $\theta = 2.28$ looks like this:

### Estimation Method

To estimate $R_{t_i}$, we follow the method used in Shim et al. 2020 and here. First, if $C_i$ is the cumulative number of all reported cases on day $i$, then the daily incidence is $I_i = C_i - C_{i-1}$. From this we can get an estimate for $R_{t_i}$ using the discrete renewal equation. We can also get confidence intervals by assuming that the incidence has a Poisson error structure, then performing the same calculation over 10,000 simulated incidence datasets, where the simulated incidence on each day is a Poisson-distributed variable with mean $I_i$. Then we take the 2.5th and 97.5th percentile values for $R_{t_i}$ on each day as our 95% confidence interval.
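A minimal sketch of this simulation for a single day (the toy incidence series and weights, and all names, are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def rt_with_ci(incidence, w, n_sims=10_000):
    """Point estimate and 95% CI for R_t on the most recent day."""
    incidence = np.asarray(incidence, dtype=float)
    i = len(incidence) - 1
    js = np.arange(min(i + 1, len(w)))

    def rt(inc):
        # Discrete renewal equation evaluated on the last day.
        return inc[i] / np.sum(inc[i - js] * w[js])

    point = rt(incidence)
    # Poisson error structure: resample each day's count around I_i.
    sims = rng.poisson(incidence, size=(n_sims, len(incidence)))
    rts = np.array([rt(s) for s in sims])
    lo, hi = np.percentile(rts, [2.5, 97.5])
    return point, (lo, hi)

w = np.array([0.5, 0.3, 0.2])  # toy generation-interval weights
point, (lo, hi) = rt_with_ci([100, 110, 121, 133], w)
```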

## Results

As of May 6, 2020, $R_{t_i}$ = 1.247, with 95% confidence interval = [1.235, 1.258].

Initially we have some very high values for $R_{t_i}$, with huge confidence intervals. This is to be expected because, looking at the discrete renewal equation, it is clearly very sensitive to changes in $I_i$ when $I_i$ is very small. For several weeks there were very few reported cases in the United States, which is why we see a wildly varying $R_{t_i}$ early on, but a much more stable curve with tighter confidence intervals later.

First, let’s look at an 11-day moving average of $R_{t_i}$. I chose 11 days because the mean generation interval plus two standard deviations is $4.41 + 2 \times 3.17 = 10.75$ days, rounded up to 11 (since we are using discrete intervals). The probability of a generation interval of 11 days or more is on the order of 1%, so incidence data from more than this far back from day $i$ will make a negligible contribution to the calculation. Therefore I take the 11-day average as a reasonable measure of the trend.
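For reference, a trailing 11-day average over a daily series can be computed with a cumulative sum (a sketch; the names are mine):

```python
import numpy as np

def trailing_average(x, window=11):
    """Trailing moving average; the first window-1 entries are NaN."""
    x = np.asarray(x, dtype=float)
    out = np.full(len(x), np.nan)
    c = np.cumsum(np.insert(x, 0, 0.0))  # c[i] = sum of x[:i]
    out[window - 1:] = (c[window:] - c[:-window]) / window
    return out
```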

The 11-day average declines very slowly, then enters a steeper decline in mid-March (shortly after a national emergency was declared and social-distancing measures began). My guess is that the slight decline before mid-March is due to a small but growing subset of people following the news, anticipating the pandemic, and taking protective measures early on. If that’s true, then this makes the authorities, who failed to prevent what a significant number of average people could easily see coming, look very foolish.

The flat portion at the beginning of the curve suggests that the basic reproduction number is a bit less than 2.5. This is in good agreement with the estimate cited above of 2.24 (95% CI: 1.96, 2.55).

We also see that, since mid-April, the decline has pretty much leveled off, and on average we have been hovering just above 1. The disease is currently spreading, but spreading slowly. Social distancing stopped the worst part of the crisis. However, it has not gotten us to a point where the disease will die out on its own without further measures.

Here’s what has been happening since the national emergency was declared on Mar. 13:

The beginning of the plot shows a spike occurring on March 19, followed by another fairly consistent decline until April 13. This spike is also apparent in the graph of the 11-day average. I am not sure why it appears, but I will note that, consistent with a mean generation interval of 4.41 days, it comes 4 days after the Centers for Disease Control and Prevention first publicly warned against gatherings of more than 50 people on March 15. That is probably the point when many people, and state and local authorities, started taking the need for social distancing much more seriously, because it became clear that most public places could not remain open. After this spike, we accordingly find a steep, then more shallow, decline in $R_t$ as authorities stopped introducing further distancing measures.

Zooming in on the last 30 days, we see that the reproduction number has been oscillating between 0.86 and 1.36, at an average of 1.1.

Again, the social distancing got the reproduction number pretty close to 1. But the disease is still present and waiting to spread as soon as any of these measures are relaxed. Without other measures in place, like robust systems of contact tracing and accurate testing, to properly identify and isolate those people with the virus, it will spread unchecked and quickly lead back to a crisis. Personally, I don’t see much evidence that this is where it needs to be at a national level, and I expect the numbers are going to get significantly worse again in 2-3 weeks. I welcome any information that would prove me wrong on this point.

### Bounces?

Note that in the above graphs there are regular bounces every 6-8 days. That is, the curve seems to repeatedly reach a local minimum, climb to a peak about 3-4 days later, then fall back to a local minimum after another 3-4 days.

But once distancing measures are in place, they tend not to be lifted or relaxed, so we might expect the curve to be smoother (although maybe not monotonic). The bounces could of course just be noise, but in that case we should not expect to see such narrow confidence bands, because the estimation method already simulates Poisson-structured noise. So unless the uncertainty in the incidence is far larger than a Poisson-distributed error (which, by the last dates of the data set, could easily amount to a difference of several hundred cases in either direction), noise should not be the source of the spikiness.

The most plausible explanation for these regular bounces, which have been sustained consistently since at least mid-March, is that people are not observing social distancing as strictly on the weekends. If that’s true, it demonstrates how readily the rate of spread can climb with even a slight relaxation of mitigation measures.

### Undercounting of Cases

Another question we might ask is, since we know that there are probably a huge number of unreported cases, how does undercounting affect this estimation? It could, if testing capacity has changed significantly in the period just before the estimate; otherwise, probably not. Why?

Suppose we somehow magically knew how many cases had gone unreported on each day since the first reported case. We could then define a function $U$ whose value $U_i$ is the factor we need to multiply $I_i$ by in order to get the true incidence. Then we could adjust our renewal equation to the following to get a true $R_{t_i}$:

$\displaystyle R^{\mathrm{true}}_{t_i} = \frac {I_i U_i} { \sum_{j=0}^{i} I_{i-j} U_{i-j} w_j}$

Now, if $U_i \approx U_{i-j}$, then $R^{\mathrm{true}}_{t_i} \approx R_{t_i}$. In other words, if the undercounting factor has not changed significantly within the period where $w_j$ is non-negligible, that is, in the last 11 days or so, then the true estimate will be about the same as the original one.
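The cancellation is easy to check numerically (toy numbers and names of my own):

```python
import numpy as np

w = np.array([0.5, 0.3, 0.2])  # toy generation-interval weights
inc = np.array([100.0, 110.0, 121.0, 133.0])

def rt_last(incidence):
    # Discrete renewal equation on the most recent day.
    js = np.arange(len(w))
    return incidence[-1] / np.sum(incidence[-1 - js] * w)

# A constant undercounting factor multiplies numerator and denominator
# alike, so it cancels out of the estimate.
print(rt_last(inc))        # same value...
print(rt_last(5.0 * inc))  # ...after a constant 5x undercount correction
```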

I would guess this is a reasonable assumption: at first there is very little testing capacity, but as cases and deaths mount, health authorities ramp up their testing as fast as is feasible. However, they cannot do this forever and probably reach a plateau in testing capacity at some point. After this point, we will have $U_i \approx U_{i-j}$ for $j \leq 11$. Certainly testing capacity has changed since the first reported U.S. case. But, as we said earlier, those first estimates were not very reliable in the first place. It’s the more recent estimates, well after social distancing and so forth were introduced, that we really care about, because we want to know what effect our mitigation measures are having.

If we did have some estimate of $U$, we could go back and recalculate $R^{\mathrm{true}}_{t_i}$. If the undercounting factor was very large in the past, then that would also reduce the true estimate’s sensitivity to the small incidence numbers that were giving us those unreliable estimates at the beginning of the epidemic.

For example, we might estimate $U$ by fitting an exponential curve based on observed testing policies. We might choose a curve of the form $U(t) = 1 + Ae^{-Bt}$, with $A, B > 0$: it starts off large, decreases rapidly as testing ramps up, then levels off and approaches 1 (i.e., no undercounting) asymptotically, as we assumed. (Otherwise, though, I chose this form arbitrarily for illustration purposes.)

To estimate $A$ and $B$, observe that initially in the outbreak, when there are very few reported cases and only one or two deaths, the initial reports severely undercount cases because if there has been one death, there are probably many more undetected cases. I have heard various numbers suggesting the undercounting factor could be anywhere from the tens to the thousands early on.

We might reasonably guess that on day 1, the initial reports undercount by a factor of, say, 100. Then suppose that 30% of cases are asymptomatic, 56% are mildly symptomatic, 10% are severe requiring hospitalization, and 4% are critical, requiring ICU or ventilator treatment, possibly resulting in death (as is suggested here). Let’s say by day 28, all people requiring hospitalization, 14%, will be tested, but no one else. Then $U_{28} = 1/0.14 \approx 7.14$. By day 56, testing is available for all hospitalized cases and, say, half of all mild cases, which is 14% + 28% = 42%. Then $U_{56} = 1/0.42 \approx 2.38$. With these three points, we could fit our exponential curve and apply it to estimate $R^{\mathrm{true}}_{t_i}$, so that our undercounting curve looks something like this fake curve I generated using Wolfram Alpha:

I didn’t bother to do this myself because I chose those numbers to estimate undercounting arbitrarily. I am just suggesting a way in which one might deal with undercounting.
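If one did want to carry out the fit, SciPy’s least-squares `curve_fit` would do it. The anchor points below are the arbitrary illustrative ones from above (a day-1 undercount of 100; testing covering the 14% hospitalized by day 28; and, since 14% + 28% = 42%, $U_{56} = 1/0.42$ by day 56):

```python
import numpy as np
from scipy.optimize import curve_fit

def undercount(t, A, B):
    # U(t) = 1 + A * exp(-B t): large early on, approaching 1 (no
    # undercounting) asymptotically as testing capacity plateaus.
    return 1.0 + A * np.exp(-B * t)

# Hypothetical anchor points for the illustration.
t_pts = np.array([1.0, 28.0, 56.0])
u_pts = np.array([100.0, 1 / 0.14, 1 / 0.42])

(A, B), _ = curve_fit(undercount, t_pts, u_pts, p0=(100.0, 0.1))
```

The fitted $U(t)$ could then be used to multiply each $I_i$ before re-running the renewal-equation estimate.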

Someone asked me about high false-negative rates. The effect is the same: the calculation will not change if the false-negative rate is roughly constant. A false-negative rate of $p$ is equivalent to a test with a 0% false-negative rate that simply fails to test a proportion $p$ of people.

# Sources

[1] Zhao, Shi, et al. Preliminary estimation of the basic reproduction number of novel coronavirus (2019-nCoV) in China, from 2019 to 2020: A data-driven analysis in the early phase of the outbreak. International Journal of Infectious Diseases 92, 214–217 (March 2020). https://doi.org/10.1016/j.ijid.2020.01.050.

[2] Coburn, B.J.; Wagner, B.G.; Blower, S. (2009). “Modeling influenza epidemics and pandemics: insights into the future of swine flu (H1N1)”. BMC Medicine 7, Article 30. doi:10.1186/1741-7015-7-30.

[3] Ireland’s Health Services. Health Care Worker Information (PDF).

[4] Freeman, Colin. “Magic formula that will determine whether Ebola is beaten”. The Telegraph.

[5] Shim, Eunha, et al. Transmission potential and severity of COVID-19 in South Korea. International Journal of Infectious Diseases 93, 339–344 (2020). https://linkinghub.elsevier.com/retrieve/pii/S1201971220301508.

[7] Shim, et al. (2020).

[8] Shim, et al. (2020); Nishiura & Chowell, 2009.

[10] Shim et al., 2020.

[11] Nishiura & Chowell, 2009.

[12] Nishiura & Chowell, 2009.

[13] Nishiura & Chowell, 2009.

[14] Chakraborty, S., & Chakravarty, D. Discrete Gamma Distributions: Properties and Parameter Estimations. Communications in Statistics – Theory and Methods, 41(18), 3301–3324 (2012). doi:10.1080/03610926.2011.563014