`data("book.tee.data", package = "mrds")`

# Double Observer Surveys: Post 1

Full Independence and Equal Sightability

Distance sampling surveys often involve two observers. With two observer, analysts can estimate \(g(0) < 1\). In this post, I demonstrate the simplest double-observer method, fully independent observers and equal distance functions.

Double observer line-transect surveys involve two observers traversing transects in close succession. Picture one observer walking right behind another, or one observer in the front seat and another in the back seat of an airplane. Both observers independently record target observations and a single distance from the observer’s location to the target is recorded. Assuming targets do not move between detection by each observer, one of the primary assumptions of standard (single-observer) distance sampling, the “g(0) = 1” assumption, can be relaxed. That is, double observer surveys do not assume perfect detection on the transect because detection on the transect can be estimated from survey data.

This post illustrates analysis of double-observer surveys in their simplest form. The simplest form of double-observer analysis makes the following two assumptions:

. Detection by observer does not influence detection by the other observer, regardless of distance. At all distances, probability of detection by both observers is the sum of detection probabilities of the individual observers.*Observers are fully independent*. The decline in detection probabilities is identical among observers.*Distance functions are identical*

I show analysis of this simple case using R package `mrds`

, and then I show equivalent computations in base R. Showing both will make the computations clear, and it certainly helped my understanding. Among observers, I assume observer 2 is somehow distracted, for example with data collection or by having to fly the aircraft, and is not expected to observe everything. Observer 1 is the primary observer and is expected to be the better observer. Observer 2’s overall detection probability is expected to be less than that of Observer 1.

I use the ‘*package*’::‘*function*’ R syntax for function calls when the function involved resides in a non-base package. This syntax makes package membership explicit and is more helpful than attaching package in my opinion because attaching packages obscures which function is contained in which package.

Nevertheless, **to re-run the code in this post, you will need the following R packages**:

`mrds`

,`knitr`

, and`dplyr`

# Example Data

I will analyze the golf tee data that is included in the `mrds`

package because it is a convienient example of double-observer data. The following code copies the data set from `mrds`

into the current working directory.

The golf tee data comes packaged as a list with 4 elements, each pre-formatted for input to `mrds`

. The primary data frame containing detections is in the `book.tee.dataframe`

element.

`::kable(book.tee.data$book.tee.dataframe[1:6,]) knitr`

object | observer | detected | distance | size | sex | exposure | |
---|---|---|---|---|---|---|---|

1 | 1 | 1 | 1 | 2.68 | 2 | 1 | 1 |

21 | 1 | 2 | 0 | 2.68 | 2 | 1 | 1 |

2 | 2 | 1 | 1 | 3.33 | 2 | 1 | 0 |

22 | 2 | 2 | 0 | 3.33 | 2 | 1 | 0 |

3 | 3 | 1 | 1 | 0.34 | 1 | 0 | 0 |

23 | 3 | 2 | 0 | 0.34 | 1 | 0 | 0 |

The `book.tee.dataframe`

dataframe contains two rows per detected target (`object`

) and columns indicating which observer detected the object (`observer`

and `detected`

). The common detection distance is in column `distance`

. Column `size`

is the number of golf tees in the detected group. The remaining columns (`sex`

and `exposure`

) are covariates. `sex`

is 0 for green golf tees, 1 for yellow tees. `exposure`

reports a categorical variable based on height of the tee above ground. `exposure`

equals 0 for ‘low’ exposure or short tees, and 1 for ‘high’ exposure for tall tees. I do not use either covariate in this analysis, but will in subsequent posts.

# Abundance in Covered Area

### Using `mrds`

The following code uses `mrds`

to estimate abundance in the surveyed area assuming observers are fully-independent and distance functions are identical among observers. An analysis on of this data on Distancesample.org truncates distances at 4 [m]; but, there are no distances greater than 4 [m] in the data set, so truncation is not necessary. The `meta.data = list(width = T)`

line is not necessary. The key `mrds`

inputs are `method='trial.fi'`

which specifies observer 1 as the primary observer and observer 2 as the secondary. This model views observer 2 as setting up detection “trials” for observer 1.

This is the call to `ddf`

of the `mrds`

package.

```
<- mrds::ddf(method='trial.fi'
mrds.fi.mr mrmodel=~glm(~ 1)
, data=book.tee.data$book.tee.dataframe
, meta.data=list(width=4)
,
)<-summary(mrds.fi.mr)
mrds.fi.mr.sum
mrds.fi.mr.sum>
> Summary for trial.fi object
> Number of observations : 162
> Number seen by primary : 124
> Number seen by secondary (trials) : 142
> Number seen by both (detected trials): 104
> AIC : 510.766
>
>
> Conditional detection function parameters:
> estimate se
> (Intercept) 1.006805 0.189555
>
> Estimate SE CV
> Average p 0.7323944 0.03799899 0.05188324
> Average primary p(0) 0.7323944 0.03715144 0.05072600
> N in covered region 169.3076923 11.64565815 0.06878399
```

Estimated abundance of golf tee groups in the surveyed area appears on the last line of output, i.e.,

```
<- c("N in covered region" = mrds.fi.mr.sum$Nhat
mrds_ests "SE" = mrds.fi.mr.sum$Nhat.se
,
)
mrds_ests> N in covered region SE
> 169.30769 11.64566
```

### The Mark-Recapture Estimator

If observers are fully independent, and probability of detection declines with distance at the same rate for both observers, we only need the `mr`

(mark-recapture) portion of `mrds`

(mark-recapture distance-sampling) method. In this case, double-observer distance data can be analyzed as if it came from a two-session closed population mark-recapture data.

The first capture session of two-session closed population mark-recapture studies entails capturing individuals, marking them, and releasing them. Later, these studies conduct a second capture session wherein they record the number of marked individuals captured, as well as the total number of captured individuals in the second sample. Assuming no change in the population (no births, deaths, emigration, or immigration) between capture sessions, we estimate population size as the ratio of the number of animals captured in the second session and the probability of capturing an animal, which we estimate as the number of marked individuals in the second sample divided by the total number of marked individuals in the population.

A double-observer study that assumes equal distance functions and full independence views the secondary observer as the first “occasion” while the primary observer is the second “occasion”. The secondary observer “marks” targets during the first “occasion” while the primary observer “captures” a random sample of marked individuals during the second “occasion”. From one point of view, the secondary observer generates detection trials for the primary observer, hence the name `trial.fi`

in `mrds`

, which stands for `<trial method>.<fully independent>`

.

It is interesting to note that reversing observers, i.e., swapping the primary observer, makes no difference to this analysis. Primary and secondary observers will matter when we consider covariates in the distance function (see a subsequent post on that topic).

To estimate abundance in the surveyed area, we exploit the parallel between the distance sampling situation and mark-recapture analyses, and apply a well known mark-recapture estimator. I will apply the Chapman mark-recapture estimator which is a derivative of the famous Lincoln-Peterson estimator. I start here with the Lincoln-Peterson estimator because its form illuminates the logic of the estimator and motivates the Chapman estimator.

The Lincoln-Peterson estimator equates the ratio of marked individuals in the population just after the first occasion, and the same ratio in the sample taken during occasion two, i.e., \[ \frac{n_1}{N} = \frac{m_2}{n_2}, \] where \(n_1\) is the number of targets seen by the secondary observer, \(n_2\) is the number of targets seen by the primary observer, and \(m_2\) is the number of “marked” targets seen by the primary observer, or the number of targets seen by both the primary and secondary observer. Lincoln-Peterson estimates \(N\) by solving the above equation for \(N\), i.e., \[ \hat{N} = \frac{n_1 n_2}{m_2}. \]

Re-expressing the estimator as, \[ \hat{N} = \frac{n_2}{m_2 / n_1} \] shows that it is the ratio of the number of targets seen by the primary observer (\(n_2\)) and an estimate of the probability of detecting a target in the population (\(m_2 / n_1\)). Horvitz-Thompson type estimators have this general form. Horvitz-Thompson type estimators pervade distance sampling.

It is well known that the Lincoln-Peterson estimator is biased for true population size. To reduce bias, analysts usually apply the so-called Chapman estimator in practice. The Chapman estimator is less biased than the Lincoln-Peterson and its variance estimator is better behaved.

The bias correction of the Chapman estimator simply adds 1 to every count of the Lincoln-Peterson estimator, and subtracts 1 from the result, i.e., \[ \hat{N} = \frac{(n_1+1)(n_2+1)}{(m_2+1)} - 1 \] The Chapman estimator is close to the Lincoln-Peterson estimator when \(m_2\) is a substantial fraction of both \(n_1\) and \(n_2\); but, the two diverge as \(m_2\) decreases.

The estimated standard error of the Chapman estimator is, \[ \hat{se}(N) = \frac{(n_1+1)(n_2+1)(n_1-m_2)(n_2-m_2)}{(m_2+1)^2 (m_2 + 2)}. \]

### Using Base R

In this section, I demonstrate base R calculations required to repeat the `mrds`

analysis. The following code computes \(n_1\), \(n_2\), and \(m_2\). Counts \(n_1\) and \(m_2\) must be calculated from a data frame of observer 1’s detections of observer 2’s targets. In the following code, data frame `df_1given2`

contains only those targets detected by observer 2 and column `detected`

informs us whether observer 1 detected each of the targets detected by observer 2. Count \(n_2\) requires the original data set.

```
# Construct the 'trial' data frame.
# Observer 1's observations of observer 2's targets.
<- book.tee.data$book.tee.dataframe |>
objs_seenBy_2 ::filter(observer == 2 & detected == 1) |>
dplyr::select(object)
dplyr
<- book.tee.data$book.tee.dataframe |>
df_1given2 ::filter(observer == 1) |>
dplyr::right_join(objs_seenBy_2, by = "object")
dplyr
# Counts for the Chapman estimator
<- nrow(df_1given2)
n1 <- book.tee.data$book.tee.dataframe |>
n2 ::filter(observer == 1) |>
dplyr::summarise(n = sum(detected)) |>
dplyr::pull(n)
dplyr<- sum(df_1given2$detected)
m2
c(n1 = n1
n2 = n2
, m2 = m2)
, > n1 n2 m2
> 142 124 104
```

These counts match output from `mrds`

(i.e., number seen by primary, secondary, and both). The Lincoln-Peterson estimate of target abundance in the surveyed strip matches the estimate from `mrds`

exactly, i.e.,

```
# Lincoln-Peterson estimator ----
<- n1 * n2 / m2
N_lp c("N in covered region (Lincoln-Peterson)" = N_lp)
> N in covered region (Lincoln-Peterson)
> 169.3077
```

The Chapman estimator is slightly different from the Lincoln-Peterson. The Chapman estimator and it’s variance are,

```
# Chapman estimator ----
<- ((n1+1) * (n2+1) / (m2+1)) - 1
N_ch <- ( ((n1+1)*(n2+1)*(n1-m2)*(n2-m2)) / ((m2+1)^2 * (m2 + 2)))
se_N_ch <- c("N in covered region" = N_ch
ch_ests "SE" = se_N_ch)
,
ch_ests> N in covered region SE
> 169.23810 11.62452
```

Finally, confidence intervals constructed from either estimator assume normality and take the standard \(\pm 1.96\) form. For comparison, Table 1 contains a summary of both estimator’s numerical values.

```
<- dplyr::bind_rows(mrds_ests, ch_ests) |>
fi.ests ::mutate(estimator = c("mrds", "Chapman")) |>
dplyr::select(estimator, dplyr::everything()) |>
dplyr::mutate(lo_95ci = `N in covered region` - 1.96*SE
dplyrhi_95ci = `N in covered region` + 1.96*SE)
, ::kable(fi.ests) knitr
```

estimator | N in covered region | SE | lo_95ci | hi_95ci |
---|---|---|---|---|

mrds | 169.3077 | 11.64566 | 146.4822 | 192.1332 |

Chapman | 169.2381 | 11.62452 | 146.4540 | 192.0222 |

# Abundance in Study Area

### Using `mrds`

Abundance in the larger study area is estimated by assuming density if equal everywhere, and simply applying density in the surveyed area to the entire study area. The reader will see that expansion to study area using `mrds`

is convenient because the golf tee data are pre-formatted for `mrds`

; but, I have often been unclear whether I have manipulated data into the format required by `mrds`

. So, I will attempt to explain the format of the additional information required to estimate study area abundance.

In `mrds`

, expansion to the study area is accomplished by a call to the `dht`

function. `dht`

requires three data frames, in addition to the object returned by `ddf`

, which required the detection data frame.

: A data frame containing one line per estimation sub-region in the study area. Each line should contain the region’s identifier in a column named*Region Size*`Region.Label`

and the region’s size in a column named`Area`

. Region area should be reported in square units of the distance measurements and transect lengths. If distances were collected in meters, area of each region should be in square meters.: A data frame containing one line per transect.*Transect Lengths*`mrds`

documentation calls this data frame the ‘samples’ data set because in other studies the rows are points, and not transects. For simplicity, I will call ‘samples’ transects. Each line of this data frame should contain the transect’s identifier in a column named`Sample.Label`

, the study region identifier in which the transect resides in a column named`Region.Label`

, and the length of each transect in a column named`Effort`

. Transect lengths must be reported in the same measurement units as distances, and in the square root of area units.: A data frame containing one line per detection.*Observations*`mrds`

documentation calls this the ‘obs’ data set. Each line should contain the detection target’s identifier in a column named`object`

, the transect identifier on which the target was detected in a column named`Sample.Label`

, and the study region identifier in which the transect resides in a column named`Region.Label`

.

`mrds`

is inflexible in its column names. The columns of each data frame must be named exactly as stated above.

`mrds`

does not check compatibility of measurement units. In real studies, analysts must be carefully convert and input the correct distance units into all `mrds`

routines.

The golf tee ** region** data frame appears in Table 2. The golf tee

**data frame appears in Table 3. The first six rows of the golf tee**

*transect***data frame appear in Table 4.**

*observation*Region.Label | Area |
---|---|

1 | 1040 |

2 | 640 |

Sample.Label | Region.Label | Effort |
---|---|---|

1 | 1 | 10 |

2 | 1 | 30 |

3 | 1 | 30 |

4 | 1 | 27 |

5 | 1 | 21 |

6 | 1 | 12 |

7 | 2 | 23 |

8 | 2 | 23 |

9 | 2 | 15 |

10 | 2 | 12 |

11 | 2 | 7 |

object | Region.Label | Sample.Label | |
---|---|---|---|

1 | 1 | 1 | 1 |

2 | 2 | 1 | 1 |

3 | 3 | 1 | 1 |

12 | 4 | 1 | 2 |

13 | 25 | 1 | 2 |

14 | 26 | 1 | 2 |

In the golf tee example, observers walked 11 transects of varying length that totaled 210 [m]. I and others treat the sampled area as covering the entire study area. In most real studies this will not be true. Here, maximum strip width is 4 [m] and study area size is (210 [m])(2(4 [m])) = 1680 [m^2]. The total study area was subdivided into two arbitrary regions of size 1040 [m^2] and 640 [m^2] each. I will ignore these two sub-regions and report abundance over the entire area. The only information necessary to estimate abundance of golf tees from abundance of golf tee groups (see above) is group size.

Given correctly named columns, correct rows, and correct units, exansion to study area is a straightforward call to `mrds::dht`

.

```
<- mrds::dht(
mrds_saEsts model = mrds.fi.mr
region.table = book.tee.data$book.tee.region
, sample.table = book.tee.data$book.tee.samples
, obs.table = book.tee.data$book.tee.obs)
,
mrds_saEsts> Abundance and density estimates from distance sampling
> Variance : R2, N/L
>
> Summary statistics
>
> Region Area CoveredArea Effort n k ER se.ER cv.ER
> 1 1 1040 1040 130 72 6 0.5538462 0.02926903 0.05284685
> 2 2 640 640 80 52 5 0.6500000 0.08292740 0.12758061
> 3 Total 1680 1680 210 124 11 0.5904762 0.03641856 0.06167659
>
> Summary for clusters
>
> Abundance:
> Region Estimate se cv lcl ucl df
> 1 1 98.30769 7.201276 0.07325242 84.29420 114.6509 17.842176
> 2 2 71.00000 9.747951 0.13729508 50.31898 100.1809 5.360291
> 3 Total 169.30769 13.520391 0.07985692 143.18017 200.2030 17.840640
>
> Density:
> Region Estimate se cv lcl ucl df
> 1 1 0.09452663 0.006924304 0.07325242 0.08105211 0.1102412 17.842176
> 2 2 0.11093750 0.015231173 0.13729508 0.07862341 0.1565326 5.360291
> 3 Total 0.10077839 0.008047852 0.07985692 0.08522629 0.1191684 17.840640
>
> Summary for individuals
>
> Abundance:
> Region Estimate se cv lcl ucl df
> 1 1 312.6731 26.07225 0.08338501 260.9894 374.5916 12.426140
> 2 2 207.5385 37.99597 0.18307918 128.9206 334.0989 4.691818
> 3 Total 520.2115 49.57222 0.09529243 423.2323 639.4125 12.457770
>
> Density:
> Region Estimate se cv lcl ucl df
> 1 1 0.3006472 0.02506947 0.08338501 0.2509514 0.3601843 12.426140
> 2 2 0.3242788 0.05936870 0.18307918 0.2014384 0.5220295 4.691818
> 3 Total 0.3096497 0.02950727 0.09529243 0.2519240 0.3806027 12.457770
>
> Expected cluster size
> Region Expected.S se.Expected.S cv.Expected.S
> 1 1 3.180556 0.2114629 0.06648615
> 2 2 2.923077 0.1750319 0.05987935
> 3 Total 3.072581 0.1391365 0.04528327
```

`mrds::dht`

estimates the abundance of golf tee clusters to be 169.31 groups (“Total” line, “Abundance” table in “Summary for clusters” section of output). Again, normally this number will be different from ‘N in covered area’ because study area size will be greater than the surveyed area. `mrds::dht`

estimates the abundance of all golf tees in the 1680 [m^2] study area to be 520.21 (“Total” line, “Abundance” table in “Summary for individuals” section of output), which differs from ‘N in the covered area’ by average group size.

### Using Base R

Computation of tee abundance in the study area is straightforward. I compute average group size over targets detected by observer 1. I estimate tee abundance in the study area by multiplying abundance in the covered area by average group size.

```
<- book.tee.data$book.tee.dataframe |>
avgGroupSize ::filter(observer == 1 & detected == 1) |>
dplyr::summarise(mean = mean(size)
dplyrse = sd(size) / sqrt(dplyr::n()))
,
avgGroupSize> mean se
> 1 3.072581 0.1537082
<- data.frame(t(ch_ests)) |>
ch_saEst ::mutate(est = N.in.covered.region * avgGroupSize$mean ) |>
dplyr::pull(est)
dplyr
ch_saEst> [1] 519.9977
```

This abundance estimate, 520, differs from abundance estimated by `mrds`

, 520.21, only because I used the Chapman estimate of abundance in the covered area.

Estimation of the variance is involves an approximation for the variance of the product of two random variables. The approximation I use here for the variance of a product is one of the difference between the `mrds`

analysis and the base R calculations. The following code computes the standard error of abundance based on the Chapman estimator and this approximation.

```
<- data.frame(t(ch_ests)) |>
ch_saEst_se ::bind_cols(avgGroupSize) |>
dplyr::mutate(est = sqrt(N.in.covered.region^2 * se^2 + mean^2 * SE^2 + se^2*SE^2) ) |>
dplyr::pull(est)
dplyr
ch_saEst_se> [1] 44.22225
<- data.frame(mrds_saEsts$individuals$N) |>
final ::filter(Label == "Total") |>
dplyr::select(Estimate, se, lcl, ucl) |>
dplyr::bind_rows(data.frame(Estimate = ch_saEst
dplyrse = ch_saEst_se
, lcl = ch_saEst - 1.95*ch_saEst_se
, ucl = ch_saEst + 1.95*ch_saEst_se)) |>
, ::mutate(Estimator = c("mrds", "Chapman")) |>
dplyr::select(Estimator, dplyr::everything())
dplyr
::kable(final) knitr
```

Estimator | Estimate | se | lcl | ucl |
---|---|---|---|---|

mrds | 520.2115 | 49.57222 | 423.2323 | 639.4125 |

Chapman | 519.9977 | 44.22225 | 433.7643 | 606.2311 |

# Final Comments

I have shown analysis of a simple distance sampling situation wherein observers are fully independent, their distance functions are identical, and one observer is primary while the other is secondary. In this case, the estimate of abundance reduces to mark-recapture estimates. I have shown calculations using package `mrds`

and base R. The differences between `mrds`

and the base R computations I show are:

**MR estimator**: In base R, I used the Chapman mark-recapture estimator of abundance instead of the Lincoln-Peterson estimator used by`mrds`

. The Chapman estimator is known to be less biased than the Lincoln-Peterson; but,`mrds`

uses Lincoln-Peterson because it fits directly into the package’s general Horvitz-Thompson estimation approach.**Variance of MR estimator**: I used the closed the closed form variance estimator for the Chapman abundance estimate. The`mrds`

estimate of standard error of abundance in the surveyed area was very close to the closed form estimate, but assumabley is based on the Horvitz-Thompson variance estimator. That said, the standard error estimated by`mrds::ddf`

and`mrds::dht`

differ, and this is a difference I do not yet understand. Standard error reported by`mrds::ddf`

for N in the covered area was 11.6457 while standard error for the same quantity reported by`mrds::dht`

was 13.5204. It is interesting to note that this is not an effect of the two sub-regions. When I eliminate sub-regions,`mrds::dht`

still increase its estimate of standard error relative to`mrds::ddf`

. In a subsequent post, I will dive into the H-T variance estimator and hopefully figure this out.**Variance of Study Area Abundance**: I used the variance of a product formula to compute standard error of the study area’s total abundance. I assume`mrds::dht`

uses one of the H-T variance estimators to arrive at a slightly higher value for standard error.**Confidence Interval on Abundance**: I used the standard normal assumption and computed 95% confidence intervals using the \(\pm 1.96(se)\) method.`mrds::dht`

apparently uses a variance estimator that yields an asymmetric confidence interval.

In conclusion, the simple mark-recapture point estimates of abundance from double-observer data are clear. Some mystery remains, for me, regarding the standard error estimates and confidence intervals on total study area abundance. I will figure this mystery out and post about it later.