Evaluation
callcut.evaluation provides tools for evaluating call detection performance:
decoding predictions into intervals, matching intervals, and computing metrics.
Note: For running inference on full recordings, use the `predict()` method on the model directly.
Decoding
Decoders convert frame-level probabilities (values in [0, 1] for each time frame)
into a list of discrete call intervals with onset and offset times.
HysteresisDecoder uses hysteresis thresholding with
separate enter/exit thresholds to avoid rapid on/off switching when probabilities hover
near a single threshold. It also merges nearby intervals and filters out short
detections. Custom decoding strategies can be implemented by subclassing
BaseDecoder.
- `BaseDecoder`: Abstract base class for probability-to-interval decoders.
- `HysteresisDecoder`: Decode probabilities using hysteresis thresholding.
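To make the behavior concrete, here is a minimal sketch of hysteresis decoding in general, not callcut's actual implementation; the function name, the `enter_thresh`/`exit_thresh`/`min_duration`/`merge_gap` parameters, and the plain-tuple interval representation are all illustrative assumptions rather than the library's API.

```python
import numpy as np

def hysteresis_decode(probs, frame_rate, enter_thresh=0.6, exit_thresh=0.4,
                      min_duration=0.05, merge_gap=0.02):
    """Sketch of hysteresis decoding: frame probabilities -> (onset, offset) in seconds."""
    raw = []
    inside, start = False, 0
    for i, p in enumerate(np.asarray(probs)):
        if not inside and p >= enter_thresh:   # open an interval only above the high threshold
            inside, start = True, i
        elif inside and p < exit_thresh:       # close it only below the low threshold
            raw.append((start, i))
            inside = False
    if inside:                                 # interval running to the end of the clip
        raw.append((start, len(probs)))

    # Convert frame indices to seconds.
    raw = [(s / frame_rate, e / frame_rate) for s, e in raw]

    # Merge intervals separated by gaps shorter than merge_gap seconds.
    merged = []
    for onset, offset in raw:
        if merged and onset - merged[-1][1] <= merge_gap:
            merged[-1] = (merged[-1][0], offset)
        else:
            merged.append((onset, offset))

    # Drop detections shorter than min_duration seconds.
    return [(s, e) for s, e in merged if e - s >= min_duration]
```

Because an interval opens only above `enter_thresh` and closes only below `exit_thresh`, probabilities oscillating in the dead band between the two thresholds neither start nor end a detection.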
Interval matching
To compute event-level metrics and boundary accuracy, predicted intervals must be matched to ground truth intervals. The matching strategy determines which predictions correspond to which ground truth events.
IoUMatcher uses greedy Intersection-over-Union (IoU)
matching: it pairs predictions with ground truth based on their temporal overlap,
prioritizing high-overlap matches. Custom matching strategies can be implemented by
subclassing BaseIntervalMatcher.
- `BaseIntervalMatcher`: Abstract base class for interval matching strategies.
- `IoUMatcher`: Match intervals using greedy IoU (Intersection over Union).
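The matching idea can be sketched in a few lines. This illustrates greedy IoU matching in general, not `IoUMatcher`'s actual signature; the `min_iou` cutoff and the (pred_index, gt_index) pair representation are assumptions.

```python
def iou(a, b):
    """Temporal IoU of two (onset, offset) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def greedy_iou_match(predictions, ground_truth, min_iou=0.5):
    """Greedily pair predictions with ground truth by descending IoU.

    Returns (pred_index, gt_index) pairs; each interval is used at most
    once, and candidate pairs below min_iou are discarded.
    """
    candidates = sorted(
        ((iou(p, g), pi, gi)
         for pi, p in enumerate(predictions)
         for gi, g in enumerate(ground_truth)),
        reverse=True,
    )
    matched_pred, matched_gt, pairs = set(), set(), []
    for score, pi, gi in candidates:
        if score < min_iou:          # remaining candidates overlap even less
            break
        if pi not in matched_pred and gi not in matched_gt:
            matched_pred.add(pi)
            matched_gt.add(gi)
            pairs.append((pi, gi))
    return pairs
```

Greedy matching is not globally optimal (Hungarian assignment would be), but it is simple, fast, and behaves predictably when overlaps are unambiguous.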
Frame-level metrics
Frame-level metrics evaluate detection at the individual frame granularity. Each frame is classified as either containing a call (positive) or not (negative), and compared against ground truth labels. This produces a standard confusion matrix with true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).
Frame-level metrics are useful for quick sanity checks during training, but can be misleading for event detection. A model might achieve high frame-level F1 by correctly predicting the middle of calls while missing boundaries, or by predicting many short false alarms that happen to overlap with calls.
- Compute frame-level precision, recall, and F1 score.
- Frame-level detection metrics.
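As a sketch of what this computation involves (the function name is illustrative, not callcut's API), frame-level scores reduce to counting over boolean per-frame masks:

```python
import numpy as np

def frame_level_scores(pred_frames, true_frames):
    """Precision, recall, and F1 from per-frame boolean call/no-call labels."""
    pred = np.asarray(pred_frames, dtype=bool)
    true = np.asarray(true_frames, dtype=bool)
    tp = int(np.sum(pred & true))    # call frames correctly detected
    fp = int(np.sum(pred & ~true))   # non-call frames flagged as calls
    fn = int(np.sum(~pred & true))   # call frames missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```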
Event-level metrics
Event-level metrics evaluate detection at the call/event granularity. Each ground truth call is either matched to a prediction (true positive) or missed (false negative), and each prediction is either matched (true positive) or a false alarm (false positive).
Event-level metrics better reflect real-world performance: they answer questions like “How many calls did we detect?” and “How many false alarms did we produce?” rather than “What fraction of frames were correct?”.
- Compute event-level precision, recall, and F1 score.
- Event-level detection metrics.
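Given a matching such as the greedy IoU sketch above, the event-level counts follow directly. The function below is again an illustration, not the module's actual interface:

```python
def event_level_scores(n_predictions, n_ground_truth, matches):
    """Precision, recall, and F1 over events, given matched (pred, gt) index pairs."""
    tp = len(matches)                                          # matched pairs are true positives
    precision = tp / n_predictions if n_predictions else 0.0   # fraction of predictions that are real calls
    recall = tp / n_ground_truth if n_ground_truth else 0.0    # fraction of real calls that were found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```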
Boundary metrics
Boundary accuracy measures how precisely the predicted call boundaries (onset and offset times) align with the ground truth boundaries. This is computed only for matched events (true positives).
Errors are computed as `predicted - ground_truth`, so:

- Positive onset error: the prediction started too late.
- Negative onset error: the prediction started too early.
- Positive offset error: the prediction ended too late.
- Negative offset error: the prediction ended too early.
Boundary accuracy complements event-level metrics: a model might detect all calls (perfect recall) but consistently start predictions 100ms late, which boundary accuracy would reveal.
- Compute onset/offset timing errors for matched events.
- Boundary (onset/offset) accuracy statistics for matched events.
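A sketch of the computation, assuming intervals are (onset, offset) tuples in seconds and `matches` holds (pred_index, gt_index) pairs as in the matching sketch above; the statistic names are illustrative:

```python
import statistics

def boundary_error_stats(predictions, ground_truth, matches):
    """Signed timing errors (predicted - ground_truth) for matched events, in seconds."""
    if not matches:                   # no true positives -> nothing to measure
        return None
    onset_errors = [predictions[pi][0] - ground_truth[gi][0] for pi, gi in matches]
    offset_errors = [predictions[pi][1] - ground_truth[gi][1] for pi, gi in matches]
    return {
        "onset_mean": statistics.mean(onset_errors),    # > 0: onsets predicted too late
        "offset_mean": statistics.mean(offset_errors),  # > 0: offsets predicted too late
        "onset_mae": statistics.mean(map(abs, onset_errors)),
        "offset_mae": statistics.mean(map(abs, offset_errors)),
    }
```

A mean onset error near zero combined with a large mean absolute error indicates jitter in both directions rather than a systematic bias like the consistently late starts described above.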
Types
Core data types used throughout the evaluation module.
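Purely as an illustration of the kind of structure involved, an interval type might look like the following; this is a hypothetical sketch, and callcut's real types may differ in names and fields.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    """Hypothetical call interval; illustrative only, not callcut's actual class."""
    onset: float   # start time in seconds
    offset: float  # end time in seconds

    @property
    def duration(self) -> float:
        return self.offset - self.onset
```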