# Double Observer Surveys: Post 1

Full Independence and Equal Sightability

R
Distance Analysis
Author

Trent McDonald

Published

October 12, 2023

Distance sampling surveys often involve two observers. With two observer, analysts can estimate $$g(0) < 1$$. In this post, I demonstrate the simplest double-observer method, fully independent observers and equal distance functions.

Double observer line-transect surveys involve two observers traversing transects in close succession. Picture one observer walking right behind another, or one observer in the front seat and another in the back seat of an airplane. Both observers independently record target observations and a single distance from the observer’s location to the target is recorded. Assuming targets do not move between detection by each observer, one of the primary assumptions of standard (single-observer) distance sampling, the “g(0) = 1” assumption, can be relaxed. That is, double observer surveys do not assume perfect detection on the transect because detection on the transect can be estimated from survey data.

This post illustrates analysis of double-observer surveys in their simplest form. The simplest form of double-observer analysis makes the following two assumptions:

1. Observers are fully independent. Detection by observer does not influence detection by the other observer, regardless of distance. At all distances, probability of detection by both observers is the sum of detection probabilities of the individual observers.
2. Distance functions are identical. The decline in detection probabilities is identical among observers.

I show analysis of this simple case using R package mrds, and then I show equivalent computations in base R. Showing both will make the computations clear, and it certainly helped my understanding. Among observers, I assume observer 2 is somehow distracted, for example with data collection or by having to fly the aircraft, and is not expected to observe everything. Observer 1 is the primary observer and is expected to be the better observer. Observer 2’s overall detection probability is expected to be less than that of Observer 1.

Note

I use the ‘package’::‘function’ R syntax for function calls when the function involved resides in a non-base package. This syntax makes package membership explicit and is more helpful than attaching package in my opinion because attaching packages obscures which function is contained in which package.

Nevertheless, to re-run the code in this post, you will need the following R packages:

• mrds, knitr, and dplyr

# Example Data

I will analyze the golf tee data that is included in the mrds package because it is a convienient example of double-observer data. The following code copies the data set from mrds into the current working directory.

data("book.tee.data", package = "mrds")

The golf tee data comes packaged as a list with 4 elements, each pre-formatted for input to mrds. The primary data frame containing detections is in the book.tee.dataframe element.

knitr::kable(book.tee.data$book.tee.dataframe[1:6,]) object observer detected distance size sex exposure 1 1 1 1 2.68 2 1 1 21 1 2 0 2.68 2 1 1 2 2 1 1 3.33 2 1 0 22 2 2 0 3.33 2 1 0 3 3 1 1 0.34 1 0 0 23 3 2 0 0.34 1 0 0 The book.tee.dataframe dataframe contains two rows per detected target (object) and columns indicating which observer detected the object (observer and detected). The common detection distance is in column distance. Column size is the number of golf tees in the detected group. The remaining columns (sex and exposure) are covariates. sex is 0 for green golf tees, 1 for yellow tees. exposure reports a categorical variable based on height of the tee above ground. exposure equals 0 for ‘low’ exposure or short tees, and 1 for ‘high’ exposure for tall tees. I do not use either covariate in this analysis, but will in subsequent posts. # Abundance in Covered Area ### Using mrds The following code uses mrds to estimate abundance in the surveyed area assuming observers are fully-independent and distance functions are identical among observers. An analysis on of this data on Distancesample.org truncates distances at 4 [m]; but, there are no distances greater than 4 [m] in the data set, so truncation is not necessary. The meta.data = list(width = T) line is not necessary. The key mrds inputs are method='trial.fi' which specifies observer 1 as the primary observer and observer 2 as the secondary. This model views observer 2 as setting up detection “trials” for observer 1. This is the call to ddf of the mrds package. mrds.fi.mr <- mrds::ddf(method='trial.fi' , mrmodel=~glm(~ 1) , data=book.tee.data$book.tee.dataframe
, meta.data=list(width=4)
)
mrds.fi.mr.sum <-summary(mrds.fi.mr)
mrds.fi.mr.sum
>
> Summary for trial.fi object
> Number of observations               :  162
> Number seen by primary               :  124
> Number seen by secondary (trials)    :  142
> Number seen by both (detected trials):  104
> AIC                                  :  510.766
>
>
> Conditional detection function parameters:
>             estimate       se
> (Intercept) 1.006805 0.189555
>
>                         Estimate          SE         CV
> Average p              0.7323944  0.03799899 0.05188324
> Average primary p(0)   0.7323944  0.03715144 0.05072600
> N in covered region  169.3076923 11.64565815 0.06878399

Estimated abundance of golf tee groups in the surveyed area appears on the last line of output, i.e.,

mrds_ests <- c("N in covered region" = mrds.fi.mr.sum$Nhat , "SE" = mrds.fi.mr.sum$Nhat.se
)
mrds_ests
> N in covered region                  SE
>           169.30769            11.64566

### The Mark-Recapture Estimator

If observers are fully independent, and probability of detection declines with distance at the same rate for both observers, we only need the mr (mark-recapture) portion of mrds (mark-recapture distance-sampling) method. In this case, double-observer distance data can be analyzed as if it came from a two-session closed population mark-recapture data.

The first capture session of two-session closed population mark-recapture studies entails capturing individuals, marking them, and releasing them. Later, these studies conduct a second capture session wherein they record the number of marked individuals captured, as well as the total number of captured individuals in the second sample. Assuming no change in the population (no births, deaths, emigration, or immigration) between capture sessions, we estimate population size as the ratio of the number of animals captured in the second session and the probability of capturing an animal, which we estimate as the number of marked individuals in the second sample divided by the total number of marked individuals in the population.

A double-observer study that assumes equal distance functions and full independence views the secondary observer as the first “occasion” while the primary observer is the second “occasion”. The secondary observer “marks” targets during the first “occasion” while the primary observer “captures” a random sample of marked individuals during the second “occasion”. From one point of view, the secondary observer generates detection trials for the primary observer, hence the name trial.fi in mrds, which stands for <trial method>.<fully independent>.

It is interesting to note that reversing observers, i.e., swapping the primary observer, makes no difference to this analysis. Primary and secondary observers will matter when we consider covariates in the distance function (see a subsequent post on that topic).

To estimate abundance in the surveyed area, we exploit the parallel between the distance sampling situation and mark-recapture analyses, and apply a well known mark-recapture estimator. I will apply the Chapman mark-recapture estimator which is a derivative of the famous Lincoln-Peterson estimator. I start here with the Lincoln-Peterson estimator because its form illuminates the logic of the estimator and motivates the Chapman estimator.

The Lincoln-Peterson estimator equates the ratio of marked individuals in the population just after the first occasion, and the same ratio in the sample taken during occasion two, i.e., $\frac{n_1}{N} = \frac{m_2}{n_2},$ where $$n_1$$ is the number of targets seen by the secondary observer, $$n_2$$ is the number of targets seen by the primary observer, and $$m_2$$ is the number of “marked” targets seen by the primary observer, or the number of targets seen by both the primary and secondary observer. Lincoln-Peterson estimates $$N$$ by solving the above equation for $$N$$, i.e., $\hat{N} = \frac{n_1 n_2}{m_2}.$

Re-expressing the estimator as, $\hat{N} = \frac{n_2}{m_2 / n_1}$ shows that it is the ratio of the number of targets seen by the primary observer ($$n_2$$) and an estimate of the probability of detecting a target in the population ($$m_2 / n_1$$). Horvitz-Thompson type estimators have this general form. Horvitz-Thompson type estimators pervade distance sampling.

It is well known that the Lincoln-Peterson estimator is biased for true population size. To reduce bias, analysts usually apply the so-called Chapman estimator in practice. The Chapman estimator is less biased than the Lincoln-Peterson and its variance estimator is better behaved.

The bias correction of the Chapman estimator simply adds 1 to every count of the Lincoln-Peterson estimator, and subtracts 1 from the result, i.e., $\hat{N} = \frac{(n_1+1)(n_2+1)}{(m_2+1)} - 1$ The Chapman estimator is close to the Lincoln-Peterson estimator when $$m_2$$ is a substantial fraction of both $$n_1$$ and $$n_2$$; but, the two diverge as $$m_2$$ decreases.

The estimated standard error of the Chapman estimator is, $\hat{se}(N) = \frac{(n_1+1)(n_2+1)(n_1-m_2)(n_2-m_2)}{(m_2+1)^2 (m_2 + 2)}.$

### Using Base R

In this section, I demonstrate base R calculations required to repeat the mrds analysis. The following code computes $$n_1$$, $$n_2$$, and $$m_2$$. Counts $$n_1$$ and $$m_2$$ must be calculated from a data frame of observer 1’s detections of observer 2’s targets. In the following code, data frame df_1given2 contains only those targets detected by observer 2 and column detected informs us whether observer 1 detected each of the targets detected by observer 2. Count $$n_2$$ requires the original data set.

# Construct the 'trial' data frame.
# Observer 1's observations of observer 2's targets.
objs_seenBy_2 <- book.tee.data$book.tee.dataframe |> dplyr::filter(observer == 2 & detected == 1) |> dplyr::select(object) df_1given2 <- book.tee.data$book.tee.dataframe |>
dplyr::filter(observer == 1) |>
dplyr::right_join(objs_seenBy_2, by = "object")

# Counts for the Chapman estimator
n1 <- nrow(df_1given2)
n2 <- book.tee.data$book.tee.dataframe |> dplyr::filter(observer == 1) |> dplyr::summarise(n = sum(detected)) |> dplyr::pull(n) m2 <- sum(df_1given2$detected)

c(n1 = n1
, n2 = n2
, m2 = m2)
>  n1  n2  m2
> 142 124 104

These counts match output from mrds (i.e., number seen by primary, secondary, and both). The Lincoln-Peterson estimate of target abundance in the surveyed strip matches the estimate from mrds exactly, i.e.,

# Lincoln-Peterson estimator ----
N_lp <- n1 * n2 / m2
c("N in covered region (Lincoln-Peterson)" = N_lp)
> N in covered region (Lincoln-Peterson)
>                               169.3077

The Chapman estimator is slightly different from the Lincoln-Peterson. The Chapman estimator and it’s variance are,

# Chapman estimator ----
N_ch <- ((n1+1) * (n2+1) / (m2+1)) - 1
se_N_ch <- ( ((n1+1)*(n2+1)*(n1-m2)*(n2-m2)) / ((m2+1)^2 * (m2 + 2)))
ch_ests <- c("N in covered region" = N_ch
, "SE" = se_N_ch)
ch_ests
> N in covered region                  SE
>           169.23810            11.62452

Finally, confidence intervals constructed from either estimator assume normality and take the standard $$\pm 1.96$$ form. For comparison, Table 1 contains a summary of both estimator’s numerical values.

fi.ests <- dplyr::bind_rows(mrds_ests, ch_ests) |>
dplyr::mutate(estimator = c("mrds", "Chapman")) |>
dplyr::select(estimator, dplyr::everything()) |>
dplyr::mutate(lo_95ci = N in covered region - 1.96*SE
, hi_95ci = N in covered region + 1.96*SE)
knitr::kable(fi.ests)
Table 1: Estimated abundance, standard error, and 95% confidence intervals of targets (golf tees) in the surveyed area computed by mrds and base R.
estimator N in covered region SE lo_95ci hi_95ci
mrds 169.3077 11.64566 146.4822 192.1332
Chapman 169.2381 11.62452 146.4540 192.0222

# Abundance in Study Area

### Using mrds

Abundance in the larger study area is estimated by assuming density if equal everywhere, and simply applying density in the surveyed area to the entire study area. The reader will see that expansion to study area using mrds is convenient because the golf tee data are pre-formatted for mrds; but, I have often been unclear whether I have manipulated data into the format required by mrds. So, I will attempt to explain the format of the additional information required to estimate study area abundance.

In mrds, expansion to the study area is accomplished by a call to the dht function. dht requires three data frames, in addition to the object returned by ddf, which required the detection data frame.

1. Region Size: A data frame containing one line per estimation sub-region in the study area. Each line should contain the region’s identifier in a column named Region.Label and the region’s size in a column named Area. Region area should be reported in square units of the distance measurements and transect lengths. If distances were collected in meters, area of each region should be in square meters.
2. Transect Lengths: A data frame containing one line per transect. mrds documentation calls this data frame the ‘samples’ data set because in other studies the rows are points, and not transects. For simplicity, I will call ‘samples’ transects. Each line of this data frame should contain the transect’s identifier in a column named Sample.Label, the study region identifier in which the transect resides in a column named Region.Label, and the length of each transect in a column named Effort. Transect lengths must be reported in the same measurement units as distances, and in the square root of area units.
3. Observations: A data frame containing one line per detection. mrds documentation calls this the ‘obs’ data set. Each line should contain the detection target’s identifier in a column named object, the transect identifier on which the target was detected in a column named Sample.Label, and the study region identifier in which the transect resides in a column named Region.Label.
Note

mrds is inflexible in its column names. The columns of each data frame must be named exactly as stated above.

Measurement Units

mrds does not check compatibility of measurement units. In real studies, analysts must be carefully convert and input the correct distance units into all mrds routines.

The golf tee region data frame appears in Table 2. The golf tee transect data frame appears in Table 3. The first six rows of the golf tee observation data frame appear in Table 4.

Table 2: The golf tee region data frame containing region identifier and area.
Region.Label Area
1 1040
2 640
Table 3: The golf tee transect data frame containing region identifier, transect identifier, and length.
Sample.Label Region.Label Effort
1 1 10
2 1 30
3 1 30
4 1 27
5 1 21
6 1 12
7 2 23
8 2 23
9 2 15
10 2 12
11 2 7
Table 4: Six lines of the golf tee observation data frame containing target identifier, region identifier, and transect identifier.
object Region.Label Sample.Label
1 1 1 1
2 2 1 1
3 3 1 1
12 4 1 2
13 25 1 2
14 26 1 2

In the golf tee example, observers walked 11 transects of varying length that totaled 210 [m]. I and others treat the sampled area as covering the entire study area. In most real studies this will not be true. Here, maximum strip width is 4 [m] and study area size is (210 [m])(2(4 [m])) = 1680 [m^2]. The total study area was subdivided into two arbitrary regions of size 1040 [m^2] and 640 [m^2] each. I will ignore these two sub-regions and report abundance over the entire area. The only information necessary to estimate abundance of golf tees from abundance of golf tee groups (see above) is group size.

Given correctly named columns, correct rows, and correct units, exansion to study area is a straightforward call to mrds::dht.

mrds_saEsts <- mrds::dht(
model = mrds.fi.mr
, region.table = book.tee.data$book.tee.region , sample.table = book.tee.data$book.tee.samples
, obs.table = book.tee.data$book.tee.obs) mrds_saEsts > Abundance and density estimates from distance sampling > Variance : R2, N/L > > Summary statistics > > Region Area CoveredArea Effort n k ER se.ER cv.ER > 1 1 1040 1040 130 72 6 0.5538462 0.02926903 0.05284685 > 2 2 640 640 80 52 5 0.6500000 0.08292740 0.12758061 > 3 Total 1680 1680 210 124 11 0.5904762 0.03641856 0.06167659 > > Summary for clusters > > Abundance: > Region Estimate se cv lcl ucl df > 1 1 98.30769 7.201276 0.07325242 84.29420 114.6509 17.842176 > 2 2 71.00000 9.747951 0.13729508 50.31898 100.1809 5.360291 > 3 Total 169.30769 13.520391 0.07985692 143.18017 200.2030 17.840640 > > Density: > Region Estimate se cv lcl ucl df > 1 1 0.09452663 0.006924304 0.07325242 0.08105211 0.1102412 17.842176 > 2 2 0.11093750 0.015231173 0.13729508 0.07862341 0.1565326 5.360291 > 3 Total 0.10077839 0.008047852 0.07985692 0.08522629 0.1191684 17.840640 > > Summary for individuals > > Abundance: > Region Estimate se cv lcl ucl df > 1 1 312.6731 26.07225 0.08338501 260.9894 374.5916 12.426140 > 2 2 207.5385 37.99597 0.18307918 128.9206 334.0989 4.691818 > 3 Total 520.2115 49.57222 0.09529243 423.2323 639.4125 12.457770 > > Density: > Region Estimate se cv lcl ucl df > 1 1 0.3006472 0.02506947 0.08338501 0.2509514 0.3601843 12.426140 > 2 2 0.3242788 0.05936870 0.18307918 0.2014384 0.5220295 4.691818 > 3 Total 0.3096497 0.02950727 0.09529243 0.2519240 0.3806027 12.457770 > > Expected cluster size > Region Expected.S se.Expected.S cv.Expected.S > 1 1 3.180556 0.2114629 0.06648615 > 2 2 2.923077 0.1750319 0.05987935 > 3 Total 3.072581 0.1391365 0.04528327 mrds::dht estimates the abundance of golf tee clusters to be 169.31 groups (“Total” line, “Abundance” table in “Summary for clusters” section of output). Again, normally this number will be different from ‘N in covered area’ because study area size will be greater than the surveyed area. mrds::dht estimates the abundance of all golf tees in the 1680 [m^2] study area to be 520.21 (“Total” line, “Abundance” table in “Summary for individuals” section of output), which differs from ‘N in the covered area’ by average group size. ### Using Base R Computation of tee abundance in the study area is straightforward. I compute average group size over targets detected by observer 1. I estimate tee abundance in the study area by multiplying abundance in the covered area by average group size. avgGroupSize <- book.tee.data$book.tee.dataframe |>
dplyr::filter(observer == 1 & detected == 1) |>
dplyr::summarise(mean = mean(size)
, se = sd(size) / sqrt(dplyr::n()))
avgGroupSize
>       mean        se
> 1 3.072581 0.1537082
ch_saEst <- data.frame(t(ch_ests)) |>
dplyr::mutate(est = N.in.covered.region * avgGroupSize$mean ) |> dplyr::pull(est) ch_saEst >  519.9977 This abundance estimate, 520, differs from abundance estimated by mrds, 520.21, only because I used the Chapman estimate of abundance in the covered area. Estimation of the variance is involves an approximation for the variance of the product of two random variables. The approximation I use here for the variance of a product is one of the difference between the mrds analysis and the base R calculations. The following code computes the standard error of abundance based on the Chapman estimator and this approximation. ch_saEst_se <- data.frame(t(ch_ests)) |> dplyr::bind_cols(avgGroupSize) |> dplyr::mutate(est = sqrt(N.in.covered.region^2 * se^2 + mean^2 * SE^2 + se^2*SE^2) ) |> dplyr::pull(est) ch_saEst_se >  44.22225 final <- data.frame(mrds_saEsts$individuals\$N) |>
dplyr::filter(Label == "Total") |>
dplyr::select(Estimate, se, lcl, ucl) |>
dplyr::bind_rows(data.frame(Estimate = ch_saEst
, se = ch_saEst_se
, lcl = ch_saEst - 1.95*ch_saEst_se
, ucl = ch_saEst + 1.95*ch_saEst_se)) |>
dplyr::mutate(Estimator = c("mrds", "Chapman")) |>
dplyr::select(Estimator, dplyr::everything())

knitr::kable(final)
Estimator Estimate se lcl ucl
mrds 520.2115 49.57222 423.2323 639.4125
Chapman 519.9977 44.22225 433.7643 606.2311

I have shown analysis of a simple distance sampling situation wherein observers are fully independent, their distance functions are identical, and one observer is primary while the other is secondary. In this case, the estimate of abundance reduces to mark-recapture estimates. I have shown calculations using package mrds and base R. The differences between mrds and the base R computations I show are:
1. MR estimator: In base R, I used the Chapman mark-recapture estimator of abundance instead of the Lincoln-Peterson estimator used by mrds. The Chapman estimator is known to be less biased than the Lincoln-Peterson; but, mrds uses Lincoln-Peterson because it fits directly into the package’s general Horvitz-Thompson estimation approach.
2. Variance of MR estimator: I used the closed the closed form variance estimator for the Chapman abundance estimate. The mrds estimate of standard error of abundance in the surveyed area was very close to the closed form estimate, but assumabley is based on the Horvitz-Thompson variance estimator. That said, the standard error estimated by mrds::ddf and mrds::dht differ, and this is a difference I do not yet understand. Standard error reported by mrds::ddf for N in the covered area was 11.6457 while standard error for the same quantity reported by mrds::dht was 13.5204. It is interesting to note that this is not an effect of the two sub-regions. When I eliminate sub-regions, mrds::dht still increase its estimate of standard error relative to mrds::ddf. In a subsequent post, I will dive into the H-T variance estimator and hopefully figure this out.
3. Variance of Study Area Abundance: I used the variance of a product formula to compute standard error of the study area’s total abundance. I assume mrds::dht uses one of the H-T variance estimators to arrive at a slightly higher value for standard error.
4. Confidence Interval on Abundance: I used the standard normal assumption and computed 95% confidence intervals using the $$\pm 1.96(se)$$ method. mrds::dht apparently uses a variance estimator that yields an asymmetric confidence interval.