Double Observer Surveys: Post 1

Full Independence and Equal Sightability

R
Distance Analysis
Author

Trent McDonald

Published

October 12, 2023

Double-observer line-transect surveys estimate the abundance of targets (e.g., targets could be birds, bears, bats, bees, etc.) in an area. Double-observer line-transect surveys involve two observers simultaneously searching for targets as they traverse defined paths within the area. Picture one observer walking right behind another, or one observer in the front seat of an aircraft, with another in the back. Both observers independently record target observations. A single detection distance, from observers to the target, is recorded when targets are spotted.

Why the second observer? Wouldn’t a single observer suffice? A single observer would suffice, if we make an assumption, so there is an analysis cost. A standard assumption of single-observer line-transect surveys is that all targets on the path are seen. This assumption is necessary to estimate abundance and is known as the “g(0) = 1” assumption. In other word, the detection function at 0, “g(0)”, is 100%.

The entire point of including two simultaneous observers searching the same areas for the same targets is that it allows analysts to compare the rates of targets seen and not seen by each observer. These rates in turn estimate the probability of detecting targets on the path. In other words, double-observer surveys relax the “g(0) = 1” assumption and do not assume perfect detection on the path. “g(0)” can be estimated from the collected data.

In this post, I demonstrate estimation of the simplest double-observer method using R. The simplest double-observer analysis makes the following two assumptions:

  1. Observers are fully independent. Detection by one observer does not influence detection by the other, regardless of distance. At all distances, probability of detection at least one observer the sum of individual observer detection probabilities minus the probability of detection by both observers.
  2. Distance functions are identical. Detection at all positive distances is the same between observers. Put another way, the decline in detection as distance increases is identical among observers.

I show analysis of this simple case using R package mrds, and then I show equivalent computations in base R. Showing both will make the computations clear. Among observers, I assume observer 2 is somehow distracted, for example with data collection or by having to fly the aircraft, and is not expected to observe everything. Observer 1 is the primary observer and is expected to be the better observer. Observer 2’s overall detection probability is expected to be less than that of Observer 1.

Note

I use the ‘package’::‘function’ R syntax for function calls when the function resides in a non-base package. This syntax makes package membership explicit and, in my opinion, is more helpful than attaching package because attaching packages obscures package memebership of function.

Nevertheless, to re-run the code in this post, you will need the following R packages:

  • mrds, knitr, and dplyr

Example Data

I analyze the golf tee data included in the mrds package, primarily because it is convenient. The following code copies the data set from mrds into the current working directory.

data("book.tee.data", package = "mrds")

The golf tee data comes packaged as a 4 element list, each pre-formatted for input to mrds. The primary data frame containing detections is in the book.tee.dataframe element.

knitr::kable(book.tee.data$book.tee.dataframe[1:6,])
object observer detected distance size sex exposure
1 1 1 1 2.68 2 1 1
21 1 2 0 2.68 2 1 1
2 2 1 1 3.33 2 1 0
22 2 2 0 3.33 2 1 0
3 3 1 1 0.34 1 0 0
23 3 2 0 0.34 1 0 0

The book.tee.dataframe data frame contains two rows per detected target (object) and columns indicating which observer detected the object (observer and detected). The common detection distance is in column distance. Column size is the number of golf tees in the detected group. The remaining columns (sex and exposure) are covariates. sex is 0 for green golf tees, 1 for yellow tees. exposure reports a categorical variable based on height of the tee above ground. exposure equals 0 for ‘low’, short, tees, and 1 for ‘high’, tall, tees. I do not use covariates in this analysis, but will in subsequent posts.

Abundance in Covered Area

Using mrds

The following code uses mrds::ddf to estimate abundance in the surveyed area under assumptions of this case, i.e., fully-independent observers and identical distance functions. An analysis on of this data on Distancesample.org truncates distances at 4 [m]; but, there are no distances greater than 4 [m] in the data set, so truncation is not necessary and the meta.data = list(width = T) line is not necessary (but I’ll leave it because truncation is usually necessary). The key mrds parameter is method='trial.fi' which specifies observer 1 as the primary observer and observer 2 as the secondary. This model views observer 2 as setting up detection “trials” for observer 1. The other parameters specify the model and the data set.

This is the call to ddf of the mrds package.

mrds.fi.mr <- mrds::ddf(method='trial.fi'
                      , mrmodel=~glm(~ 1)
                      , data=book.tee.data$book.tee.dataframe
                      , meta.data=list(width=4)
                      )
mrds.fi.mr.sum <-summary(mrds.fi.mr)
mrds.fi.mr.sum
> 
> Summary for trial.fi object 
> Number of observations               :  162 
> Number seen by primary               :  124 
> Number seen by secondary (trials)    :  142 
> Number seen by both (detected trials):  104 
> AIC                                  :  510.766 
> 
> 
> Conditional detection function parameters:
>             estimate       se
> (Intercept) 1.006805 0.189555
> 
>                         Estimate          SE         CV
> Average p              0.7323944  0.03799899 0.05188324
> Average primary p(0)   0.7323944  0.03715144 0.05072600
> N in covered region  169.3076923 11.64565815 0.06878399

Estimated abundance of golf tee groups in the surveyed area appears on the last line of output in the above box, i.e.,

mrds_ests <- c("N in covered region" = mrds.fi.mr.sum$Nhat
              , "SE" = mrds.fi.mr.sum$Nhat.se 
              )
mrds_ests
> N in covered region                  SE 
>           169.30769            11.64566

The Mark-Recapture Estimator

If observers are fully independent, and probability of detection declines with distance at the same rate for both observers, we only need the mr (mark-recapture) portion of mrds (mark-recapture distance-sampling) method. In this case, double-observer distance data can be analyzed as if it came from a two-session closed population mark-recapture data.

The first capture session of two-session closed population mark-recapture studies entails capturing individuals, marking them, and releasing them. Later, these studies conduct a second capture session wherein they record the number of marked individuals captured, as well as the total number of captured individuals in the second sample. Assuming no change in the population (no births, deaths, emigration, or immigration) between capture sessions, we estimate population size as the ratio of the number of animals captured in the second session and the probability of capturing an animal, which we estimate as the number of marked individuals in the second sample divided by the total number of marked individuals in the population.

A double-observer study that assumes equal distance functions and full independence views the secondary observer as the first “occasion” while the primary observer is the second “occasion”. The secondary observer “marks” targets during the first “occasion” while the primary observer “captures” a random sample of marked individuals during the second “occasion”. From one point of view, the secondary observer generates detection trials for the primary observer, hence the name trial.fi in mrds, which stands for <trial method>.<fully independent>.

It is interesting to note that reversing observers, i.e., swapping the primary observer, makes no difference to this analysis. Primary and secondary observers will matter when we consider covariates in the distance function (see a subsequent post on that topic).

To estimate abundance in the surveyed area, we exploit the parallel between the distance sampling situation and mark-recapture analyses, and apply a well known mark-recapture estimator. I will apply the Chapman mark-recapture estimator which is a derivative of the famous Lincoln-Peterson estimator. I start here with the Lincoln-Peterson estimator because its form illuminates the logic of the estimator and motivates the Chapman estimator.

The Lincoln-Peterson estimator equates the ratio of marked individuals in the population just after the first occasion, and the same ratio in the sample taken during occasion two, i.e., \[ \frac{n_1}{N} = \frac{m_2}{n_2}, \] where \(n_1\) is the number of targets seen by the secondary observer, \(n_2\) is the number of targets seen by the primary observer, and \(m_2\) is the number of “marked” targets seen by the primary observer, or the number of targets seen by both the primary and secondary observer. Lincoln-Peterson estimates \(N\) by solving the above equation for \(N\), i.e., \[ \hat{N} = \frac{n_1 n_2}{m_2}. \]

Re-expressing the estimator as, \[ \hat{N} = \frac{n_2}{m_2 / n_1} \] shows that it is the ratio of the number of targets seen by the primary observer (\(n_2\)) and an estimate of the probability of detecting a target in the population (\(m_2 / n_1\)). Horvitz-Thompson type estimators have this general form. Horvitz-Thompson type estimators pervade distance sampling.

It is well known that the Lincoln-Peterson estimator is biased for true population size. To reduce bias, analysts usually apply the so-called Chapman estimator in practice. The Chapman estimator is less biased than the Lincoln-Peterson and its variance estimator is better behaved.

The bias correction of the Chapman estimator simply adds 1 to every count of the Lincoln-Peterson estimator, and subtracts 1 from the result, i.e., \[ \hat{N} = \frac{(n_1+1)(n_2+1)}{(m_2+1)} - 1 \] The Chapman estimator is close to the Lincoln-Peterson estimator when \(m_2\) is a substantial fraction of both \(n_1\) and \(n_2\); but, the two diverge as \(m_2\) decreases.

The estimated standard error of the Chapman estimator is, \[ \hat{se}(N) = \frac{(n_1+1)(n_2+1)(n_1-m_2)(n_2-m_2)}{(m_2+1)^2 (m_2 + 2)}. \]

Using Base R

In this section, I demonstrate base R calculations required to repeat the mrds analysis. The following code computes \(n_1\), \(n_2\), and \(m_2\). Counts \(n_1\) and \(m_2\) must be calculated from a data frame of observer 1’s detections of observer 2’s targets. In the following code, data frame df_1given2 contains only those targets detected by observer 2 and column detected informs us whether observer 1 detected each of the targets detected by observer 2. Count \(n_2\) requires the original data set.

# Construct the 'trial' data frame. 
# Observer 1's observations of observer 2's targets.
objs_seenBy_2 <- book.tee.data$book.tee.dataframe |>
  dplyr::filter(observer == 2 & detected == 1) |>
  dplyr::select(object)

df_1given2 <- book.tee.data$book.tee.dataframe |>
  dplyr::filter(observer == 1) |>
  dplyr::right_join(objs_seenBy_2, by = "object")

# Counts for the Chapman estimator
n1 <- nrow(df_1given2)
n2 <- book.tee.data$book.tee.dataframe |>
  dplyr::filter(observer == 1) |>
  dplyr::summarise(n = sum(detected)) |>
  dplyr::pull(n)
m2 <- sum(df_1given2$detected)

c(n1 = n1
  , n2 = n2
  , m2 = m2)
>  n1  n2  m2 
> 142 124 104

These counts match output from mrds (i.e., number seen by primary, secondary, and both). The Lincoln-Peterson estimate of target abundance in the surveyed strip matches the estimate from mrds exactly, i.e.,

# Lincoln-Peterson estimator ----
N_lp <- n1 * n2 / m2
c("N in covered region (Lincoln-Peterson)" = N_lp)
> N in covered region (Lincoln-Peterson) 
>                               169.3077

The Chapman estimator is slightly different from the Lincoln-Peterson. The Chapman estimator and it’s variance are,

# Chapman estimator ----
N_ch <- ((n1+1) * (n2+1) / (m2+1)) - 1
se_N_ch <- ( ((n1+1)*(n2+1)*(n1-m2)*(n2-m2)) / ((m2+1)^2 * (m2 + 2)))
ch_ests <- c("N in covered region" = N_ch
  , "SE" = se_N_ch)
ch_ests
> N in covered region                  SE 
>           169.23810            11.62452

Finally, confidence intervals constructed from either estimator assume normality and take the standard \(\pm 1.96\) form. For comparison, Table 1 contains a summary of both estimator’s numerical values.

fi.ests <- dplyr::bind_rows(mrds_ests, ch_ests) |>
  dplyr::mutate(estimator = c("mrds", "Chapman")) |>
  dplyr::select(estimator, dplyr::everything()) |>
  dplyr::mutate(lo_95ci = `N in covered region` - 1.96*SE
              , hi_95ci = `N in covered region` + 1.96*SE)
knitr::kable(fi.ests)
Table 1: Estimated abundance, standard error, and 95% confidence intervals of targets (golf tees) in the surveyed area computed by mrds and base R.
estimator N in covered region SE lo_95ci hi_95ci
mrds 169.3077 11.64566 146.4822 192.1332
Chapman 169.2381 11.62452 146.4540 192.0222

Abundance in Study Area

Using mrds

Abundance in the larger study area is estimated by assuming density is equal everywhere, and simply applying density in the surveyed area to the entire study area. The reader will see that expansion to study area using mrds is convenient because the golf tee data are pre-formatted for mrds; but, I have often been unclear whether I have manipulated data into the format required by mrds. So, I will attempt to explain the format of the additional information required to estimate study area abundance.

In mrds, expansion to the study area is accomplished by a call to the dht function. dht requires three data frames, in addition to the object returned by ddf that includes the detection data frame.

  1. Region Size: A data frame containing one line per sub-region in the study area for which estimates are desired. Each line should contain the region’s identifier in a column named Region.Label and the region’s size in a column named Area. The region’s area should be in square units of the distance measurements and transect lengths. If distances were collected in meters, area of each region should be in square meters. If distances are in feet, area should be square feet, and so on.
  2. Transect Lengths: A data frame containing one line per transect. mrds documentation calls this data frame the ‘samples’ data set because in other studies the rows are points, and not transects. For simplicity, I will call ‘samples’ transects. Each line of this data frame should contain the transect’s identifier in a column named Sample.Label, the study region identifier containing the transect in a column named Region.Label, and the length of each transect in a column named Effort. Transect lengths must be reported in the same measurement units as distances, and in the square root of area units.
  3. Observations: A data frame containing one line per detection. mrds documentation calls this the ‘obs’ data set. Each line should contain the detection target’s identifier in a column named object, the transect identifier on which the target was detected in a column named Sample.Label, and the study region identifier in which the transect resides in a column named Region.Label.
Note

mrds is inflexible in its column names. The columns of each data frame must be named exactly as stated above.

Measurement Units

mrds does not check compatibility of measurement units. In real studies, analysts must carefully convert, check, and input the correct distance units into all mrds routines.

The golf tee region data frame appears in Table 2. The golf tee transect data frame appears in Table 3. The first six rows of the golf tee observation data frame appear in Table 4.

Table 2: The golf tee region data frame containing region identifier and area.
Region.Label Area
1 1040
2 640
Table 3: The golf tee transect data frame containing region identifier, transect identifier, and length.
Sample.Label Region.Label Effort
1 1 10
2 1 30
3 1 30
4 1 27
5 1 21
6 1 12
7 2 23
8 2 23
9 2 15
10 2 12
11 2 7
Table 4: Six lines of the golf tee observation data frame containing target identifier, region identifier, and transect identifier.
object Region.Label Sample.Label
1 1 1 1
2 2 1 1
3 3 1 1
12 4 1 2
13 25 1 2
14 26 1 2

In the golf tee example, observers walked 11 transects of varying length that totaled 210 [m]. I and others treat the sampled area as covering the entire study area. In most real studies this will not be true. Here, maximum strip width is 4 [m] and study area size is (210 [m])(2(4 [m])) = 1680 [m^2]. The total study area was subdivided into two arbitrary regions of size 1040 [m^2] and 640 [m^2] each. I will ignore these two sub-regions and report abundance over the entire area. The only information necessary to estimate abundance of golf tees from abundance of golf tee groups (see above) is group size.

Given correctly named columns, correct rows, and correct units, exansion to study area is a straightforward call to mrds::dht.

mrds_saEsts <- mrds::dht(
      model = mrds.fi.mr
    , region.table = book.tee.data$book.tee.region
    , sample.table = book.tee.data$book.tee.samples
    , obs.table = book.tee.data$book.tee.obs)
mrds_saEsts
> Abundance and density estimates from distance sampling
> Variance       : R2, N/L 
> 
> Summary statistics
> 
>   Region Area CoveredArea Effort   n  k        ER      se.ER      cv.ER
> 1      1 1040        1040    130  72  6 0.5538462 0.02926903 0.05284685
> 2      2  640         640     80  52  5 0.6500000 0.08292740 0.12758061
> 3  Total 1680        1680    210 124 11 0.5904762 0.03641856 0.06167659
> 
> Summary for clusters
> 
> Abundance:
>   Region  Estimate        se         cv       lcl      ucl        df
> 1      1  98.30769  7.201276 0.07325242  84.29420 114.6509 17.842176
> 2      2  71.00000  9.747951 0.13729508  50.31898 100.1809  5.360291
> 3  Total 169.30769 13.520391 0.07985692 143.18017 200.2030 17.840640
> 
> Density:
>   Region   Estimate          se         cv        lcl       ucl        df
> 1      1 0.09452663 0.006924304 0.07325242 0.08105211 0.1102412 17.842176
> 2      2 0.11093750 0.015231173 0.13729508 0.07862341 0.1565326  5.360291
> 3  Total 0.10077839 0.008047852 0.07985692 0.08522629 0.1191684 17.840640
> 
> Summary for individuals
> 
> Abundance:
>   Region Estimate       se         cv      lcl      ucl        df
> 1      1 312.6731 26.07225 0.08338501 260.9894 374.5916 12.426140
> 2      2 207.5385 37.99597 0.18307918 128.9206 334.0989  4.691818
> 3  Total 520.2115 49.57222 0.09529243 423.2323 639.4125 12.457770
> 
> Density:
>   Region  Estimate         se         cv       lcl       ucl        df
> 1      1 0.3006472 0.02506947 0.08338501 0.2509514 0.3601843 12.426140
> 2      2 0.3242788 0.05936870 0.18307918 0.2014384 0.5220295  4.691818
> 3  Total 0.3096497 0.02950727 0.09529243 0.2519240 0.3806027 12.457770
> 
> Expected cluster size
>   Region Expected.S se.Expected.S cv.Expected.S
> 1      1   3.180556     0.2114629    0.06648615
> 2      2   2.923077     0.1750319    0.05987935
> 3  Total   3.072581     0.1391365    0.04528327

mrds::dht estimates the abundance of golf tee clusters to be 169.31 groups (“Total” line, “Abundance” table in “Summary for clusters” section of output). Again, normally this number will be different from ‘N in covered area’ because study area size will be greater than the surveyed area. mrds::dht estimates the abundance of all golf tees in the 1680 [m^2] study area to be 520.21 (“Total” line, “Abundance” table in “Summary for individuals” section of output), which differs from ‘N in the covered area’ by average group size.

Using Base R

Computation of tee abundance in the study area is straightforward. I compute average group size over targets detected by observer 1. I estimate tee abundance in the study area by multiplying abundance in the covered area by average group size.

avgGroupSize <- book.tee.data$book.tee.dataframe |>
  dplyr::filter(observer == 1 & detected == 1) |>
  dplyr::summarise(mean = mean(size)
                 , se = sd(size) / sqrt(dplyr::n())) 
avgGroupSize
>       mean        se
> 1 3.072581 0.1537082
ch_saEst <- data.frame(t(ch_ests)) |> 
  dplyr::mutate(est = N.in.covered.region * avgGroupSize$mean ) |>
  dplyr::pull(est)
ch_saEst
> [1] 519.9977

This abundance estimate, 520, differs from abundance estimated by mrds, 520.21, only because I used the Chapman estimate of abundance in the covered area.

Estimation of the variance is involves an approximation for the variance of the product of two random variables. The approximation I use here for the variance of a product is one of the difference between the mrds analysis and the base R calculations. The following code computes the standard error of abundance based on the Chapman estimator and this approximation.

ch_saEst_se <- data.frame(t(ch_ests)) |> 
  dplyr::bind_cols(avgGroupSize) |>
  dplyr::mutate(est = sqrt(N.in.covered.region^2 * se^2 + mean^2 * SE^2 + se^2*SE^2) ) |>
  dplyr::pull(est)
ch_saEst_se
> [1] 44.22225

final <- data.frame(mrds_saEsts$individuals$N) |>
  dplyr::filter(Label == "Total") |>
  dplyr::select(Estimate, se, lcl, ucl) |>
  dplyr::bind_rows(data.frame(Estimate = ch_saEst
                            , se = ch_saEst_se
                            , lcl = ch_saEst - 1.95*ch_saEst_se
                            , ucl = ch_saEst + 1.95*ch_saEst_se)) |>
  dplyr::mutate(Estimator = c("mrds", "Chapman")) |>
  dplyr::select(Estimator, dplyr::everything())

knitr::kable(final)
Estimator Estimate se lcl ucl
mrds 520.2115 49.57222 423.2323 639.4125
Chapman 519.9977 44.22225 433.7643 606.2311

Final Comments

I have shown analysis of a simple distance sampling situation wherein observers are fully independent, their distance functions are identical, and one observer is primary while the other is secondary. In this case, the estimate of abundance reduces to mark-recapture estimates. I have shown calculations using package mrds and base R. The differences between mrds and the base R computations I show are:

  1. MR estimator: In base R, I used the Chapman mark-recapture estimator of abundance instead of the Lincoln-Peterson estimator used by mrds. The Chapman estimator is known to be less biased than the Lincoln-Peterson; but, mrds uses Lincoln-Peterson because it fits directly into the package’s general Horvitz-Thompson estimation approach.
  2. Variance of MR estimator: I used the closed the closed form variance estimator for the Chapman abundance estimate. The mrds estimate of standard error of abundance in the surveyed area was very close to the closed form estimate, but assumabley is based on the Horvitz-Thompson variance estimator. That said, the standard error estimated by mrds::ddf and mrds::dht differ, and this is a difference I do not yet understand. Standard error reported by mrds::ddf for N in the covered area was 11.6457 while standard error for the same quantity reported by mrds::dht was 13.5204. It is interesting to note that this is not an effect of the two sub-regions. When I eliminate sub-regions, mrds::dht still increase its estimate of standard error relative to mrds::ddf. In a subsequent post, I will dive into the H-T variance estimator and hopefully figure this out.
  3. Variance of Study Area Abundance: I used the variance of a product formula to compute standard error of the study area’s total abundance. I assume mrds::dht uses one of the H-T variance estimators to arrive at a slightly higher value for standard error.
  4. Confidence Interval on Abundance: I used the standard normal assumption and computed 95% confidence intervals using the \(\pm 1.96(se)\) method. mrds::dht apparently uses a variance estimator that yields an asymmetric confidence interval.

In conclusion, the simple mark-recapture point estimates of abundance from double-observer data are clear. Some mystery remains, for me, regarding the standard error estimates and confidence intervals on total study area abundance. I will figure this mystery out and post about it later.