Evaluation

callcut.evaluation provides tools for evaluating call detection performance: decoding predictions into intervals, matching intervals, and computing metrics.

Note

For running inference on full recordings, use the predict() method on the model directly.

Decoding

Decoders convert frame-level probabilities (values in [0, 1] for each time frame) into a list of discrete call intervals with onset and offset times.

HysteresisDecoder uses hysteresis thresholding with separate enter/exit thresholds to avoid rapid on/off switching when probabilities hover near a single threshold. It also merges nearby intervals and filters out short detections. Custom decoding strategies can be implemented by subclassing BaseDecoder.

BaseDecoder()

Abstract base class for probability-to-interval decoders.

HysteresisDecoder([enter_threshold, ...])

Decode probabilities using hysteresis thresholding.
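
The hysteresis behaviour described above can be sketched in a few lines of plain Python. This is an illustration of the technique only, not the HysteresisDecoder implementation: the function name, the parameter names (enter, exit_, merge_gap, min_len), their default values, and the choice to merge before filtering are all assumptions made for this example. Intervals are returned as frame indices with an exclusive offset; multiplying by the frame hop would give times.

```python
def hysteresis_decode(probs, enter=0.6, exit_=0.4, merge_gap=0, min_len=0):
    """Hypothetical helper: return (onset_frame, offset_frame) pairs.

    offset_frame is exclusive. All names and defaults are illustrative.
    """
    raw = []
    start = None
    for i, p in enumerate(probs):
        if start is None and p >= enter:
            # Open an interval only once the higher "enter" threshold is reached.
            start = i
        elif start is not None and p < exit_:
            # Close it only when the probability drops below the lower "exit" threshold.
            raw.append((start, i))
            start = None
    if start is not None:
        # A call still active at the end of the sequence is closed at the last frame.
        raw.append((start, len(probs)))

    # Merge intervals separated by at most merge_gap frames (assumed order: merge, then filter).
    merged = []
    for onset, offset in raw:
        if merged and onset - merged[-1][1] <= merge_gap:
            merged[-1] = (merged[-1][0], offset)
        else:
            merged.append((onset, offset))

    # Drop detections shorter than min_len frames.
    return [(a, b) for a, b in merged if b - a >= min_len]


probs = [0.1, 0.7, 0.5, 0.45, 0.3, 0.1, 0.65, 0.2]
print(hysteresis_decode(probs))                # [(1, 4), (6, 7)]
print(hysteresis_decode(probs, min_len=2))     # [(1, 4)]
print(hysteresis_decode(probs, merge_gap=2))   # [(1, 7)]
```

Frames 2 and 3 (0.5 and 0.45) stay inside the first call because they never fall below the exit threshold, whereas a single threshold of 0.6 would have ended the call after one frame.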

Interval matching

To compute event-level metrics and boundary accuracy, predicted intervals must be matched to ground truth intervals. The matching strategy determines which predictions correspond to which ground truth events.

IoUMatcher uses greedy Intersection-over-Union (IoU) matching: it pairs predictions with ground truth based on their temporal overlap, prioritizing high-overlap matches. Custom matching strategies can be implemented by subclassing BaseIntervalMatcher.

BaseIntervalMatcher()

Abstract base class for interval matching strategies.

IoUMatcher([iou_threshold])

Match intervals using greedy IoU (Intersection over Union).
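
As a rough sketch of greedy IoU matching, the example below scores every ground truth/prediction pair, sorts the pairs by IoU, and accepts them from the highest score down while ensuring each interval is used at most once. It does not use IoUMatcher; the function names, the tuple representation of intervals, the 0.5 default, and the use of >= for the threshold comparison are assumptions for illustration.

```python
def iou(a, b):
    """Temporal IoU of two (onset, offset) intervals."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def greedy_iou_match(ground_truth, predicted, iou_threshold=0.5):
    """Hypothetical helper: return (gt_index, pred_index, iou) triples."""
    candidates = []
    for gi, gt in enumerate(ground_truth):
        for pi, pred in enumerate(predicted):
            score = iou(gt, pred)
            if score >= iou_threshold:
                candidates.append((score, gi, pi))

    # Highest-overlap pairs are considered first.
    candidates.sort(key=lambda c: -c[0])

    used_gt, used_pred, matches = set(), set(), []
    for score, gi, pi in candidates:
        if gi in used_gt or pi in used_pred:
            continue  # each interval takes part in at most one match
        used_gt.add(gi)
        used_pred.add(pi)
        matches.append((gi, pi, score))
    return matches


ground_truth = [(0.0, 1.0), (2.0, 3.0)]
predicted = [(0.0, 0.5), (0.25, 1.0), (5.0, 6.0)]
print(greedy_iou_match(ground_truth, predicted))  # [(0, 1, 0.75)]
```

Both of the first two predictions overlap the first ground truth call, but only the one with the higher IoU is matched; the other prediction and the second ground truth call remain unmatched.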

Frame-level metrics

Frame-level metrics evaluate detection at the individual frame granularity. Each frame is classified as either containing a call (positive) or not (negative), and compared against ground truth labels. This produces a standard confusion matrix with true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).

Frame-level metrics are useful for quick sanity checks during training, but can be misleading for event detection. A model might achieve high frame-level F1 by correctly predicting the middle of calls while missing boundaries, or by predicting many short false alarms that happen to overlap with calls.

compute_frame_metrics(probabilities, labels, *)

Compute frame-level precision, recall, and F1 score.

FrameMetrics(n_frames, tp, fp, fn, tn, ...)

Frame-level detection metrics.
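
The confusion-matrix counting described above can be sketched as follows. This is a stand-alone illustration rather than a call to compute_frame_metrics: the function name, the threshold parameter and its 0.5 default, the dictionary return type, and the convention of returning 0.0 for undefined ratios are assumptions.

```python
def frame_confusion(probabilities, labels, threshold=0.5):
    """Hypothetical helper: frame-level TP/FP/FN/TN and derived scores."""
    tp = fp = fn = tn = 0
    for p, y in zip(probabilities, labels):
        predicted_call = p >= threshold
        if predicted_call and y:
            tp += 1
        elif predicted_call and not y:
            fp += 1
        elif not predicted_call and y:
            fn += 1
        else:
            tn += 1

    # Undefined ratios (zero denominators) are reported as 0.0 in this sketch.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


probabilities = [0.9, 0.8, 0.2, 0.6, 0.1, 0.4]
labels = [1, 1, 1, 0, 0, 0]
metrics = frame_confusion(probabilities, labels)
print(metrics["tp"], metrics["fp"], metrics["fn"], metrics["tn"])  # 2 1 1 2
```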

Event-level metrics

Event-level metrics evaluate detection at the call/event granularity. Each ground truth call is either matched to a prediction (true positive) or missed (false negative), and each prediction is either matched (true positive) or a false alarm (false positive).

Event-level metrics better reflect real-world performance: they answer questions like "How many calls did we detect?" and "How many false alarms did we produce?" rather than "What fraction of frames were correct?".

compute_event_metrics(ground_truth, ...)

Compute event-level precision, recall, and F1 score.

EventMetrics(n_ground_truth, n_predicted, ...)

Event-level detection metrics.
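
Once intervals have been matched, the event-level counts follow directly from the number of matches. The sketch below shows that arithmetic under the definitions given above; it is not compute_event_metrics, and the function name, argument names, and dictionary return type are assumptions for illustration.

```python
def event_scores(n_ground_truth, n_predicted, n_matches):
    """Hypothetical helper: event-level precision, recall, and F1."""
    tp = n_matches                 # matched ground truth/prediction pairs
    fp = n_predicted - n_matches   # predictions with no ground truth partner
    fn = n_ground_truth - n_matches  # ground truth calls that were missed

    # Undefined ratios (zero denominators) are reported as 0.0 in this sketch.
    precision = tp / n_predicted if n_predicted else 0.0
    recall = tp / n_ground_truth if n_ground_truth else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Two annotated calls, three predictions, one successful match.
scores = event_scores(n_ground_truth=2, n_predicted=3, n_matches=1)
print(scores["tp"], scores["fp"], scores["fn"])  # 1 2 1
```

Unlike the frame-level case, there is no true-negative count here: "no call" is not a countable event.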

Boundary metrics

Boundary accuracy measures how precisely the predicted call boundaries (onset and offset times) align with the ground truth boundaries. This is computed only for matched events (true positives).

Errors are computed as predicted - ground_truth, so:

  • Positive onset error: The prediction started too late

  • Negative onset error: The prediction started too early

  • Positive offset error: The prediction ended too late

  • Negative offset error: The prediction ended too early

Boundary accuracy complements event-level metrics: a model might detect all calls (perfect recall) but consistently start predictions 100 ms late, which boundary accuracy would reveal.

compute_boundary_accuracy(ground_truth, ...)

Compute onset/offset timing errors for matched events.

BoundaryAccuracy(n_matches, onset_errors_ms, ...)

Boundary (onset/offset) accuracy statistics for matched events.
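
The sign convention above (predicted - ground_truth) can be illustrated with a short sketch. It does not call compute_boundary_accuracy; the function name, the assumption that interval times are in seconds, and the representation of matches as (gt_index, pred_index) pairs are all choices made for this example.

```python
from statistics import mean


def boundary_errors_ms(ground_truth, predicted, matches):
    """Hypothetical helper: signed onset/offset errors in milliseconds.

    Intervals are (onset, offset) tuples, assumed to be in seconds.
    matches is a list of (gt_index, pred_index) pairs.
    """
    onset_errors = [(predicted[pi][0] - ground_truth[gi][0]) * 1000.0
                    for gi, pi in matches]
    offset_errors = [(predicted[pi][1] - ground_truth[gi][1]) * 1000.0
                     for gi, pi in matches]
    return onset_errors, offset_errors


ground_truth = [(1.0, 2.0), (4.0, 5.0)]
predicted = [(1.125, 2.0), (3.75, 5.5)]
matches = [(0, 0), (1, 1)]

onset_errors, offset_errors = boundary_errors_ms(ground_truth, predicted, matches)
print(onset_errors)    # [125.0, -250.0]: first onset late, second onset early
print(offset_errors)   # [0.0, 500.0]: second offset late
print(mean(onset_errors))                    # -62.5
print(mean(abs(e) for e in onset_errors))    # 187.5
```

The signed mean shows systematic bias (early versus late), while the mean absolute error shows overall precision; errors of opposite sign can cancel in the former but not in the latter.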

Types

Core data types used throughout the evaluation module.

Interval(onset, offset)

A time interval representing a detected or annotated call.

Match(gt_index, pred_index, iou)

A match between a ground truth and predicted interval.