Evaluation ---------- .. currentmodule:: callcut.evaluation ``callcut.evaluation`` provides tools for evaluating call detection performance: decoding predictions into intervals, matching intervals, and computing metrics. .. note:: For running inference on full recordings, use the :meth:`~callcut.nn.BaseDetector.predict` method on the model directly. Decoding ~~~~~~~~ Decoders convert frame-level probabilities (values in ``[0, 1]`` for each time frame) into a list of discrete call intervals with onset and offset times. :class:`~callcut.evaluation.HysteresisDecoder` uses hysteresis thresholding with separate enter/exit thresholds to avoid rapid on/off switching when probabilities hover near a single threshold. It also merges nearby intervals and filters out short detections. Custom decoding strategies can be implemented by subclassing :class:`~callcut.evaluation.BaseDecoder`. .. autosummary:: :toctree: ../generated/api BaseDecoder HysteresisDecoder Interval matching ~~~~~~~~~~~~~~~~~ To compute event-level metrics and boundary accuracy, predicted intervals must be matched to ground truth intervals. The matching strategy determines which predictions correspond to which ground truth events. :class:`~callcut.evaluation.IoUMatcher` uses greedy Intersection-over-Union (IoU) matching: it pairs predictions with ground truth based on their temporal overlap, prioritizing high-overlap matches. Custom matching strategies can be implemented by subclassing :class:`~callcut.evaluation.BaseIntervalMatcher`. .. autosummary:: :toctree: ../generated/api BaseIntervalMatcher IoUMatcher Frame-level metrics ~~~~~~~~~~~~~~~~~~~ Frame-level metrics evaluate detection at the individual frame granularity. Each frame is classified as either containing a call (positive) or not (negative), and compared against ground truth labels. This produces a standard confusion matrix with true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). Frame-level metrics are useful for quick sanity checks during training, but can be misleading for event detection. A model might achieve high frame-level F1 by correctly predicting the middle of calls while missing boundaries, or by predicting many short false alarms that happen to overlap with calls. .. autosummary:: :toctree: ../generated/api compute_frame_metrics FrameMetrics Event-level metrics ~~~~~~~~~~~~~~~~~~~ Event-level metrics evaluate detection at the call/event granularity. Each ground truth call is either matched to a prediction (true positive) or missed (false negative), and each prediction is either matched (true positive) or a false alarm (false positive). Event-level metrics better reflect real-world performance: they answer questions like "How many calls did we detect?" and "How many false alarms did we produce?" rather than "What fraction of frames were correct?". .. autosummary:: :toctree: ../generated/api compute_event_metrics EventMetrics Boundary metrics ~~~~~~~~~~~~~~~~ Boundary accuracy measures how precisely the predicted call boundaries (onset and offset times) align with the ground truth boundaries. This is computed only for matched events (true positives). Errors are computed as ``predicted - ground_truth``, so: - **Positive onset error**: The prediction started too late - **Negative onset error**: The prediction started too early - **Positive offset error**: The prediction ended too late - **Negative offset error**: The prediction ended too early Boundary accuracy complements event-level metrics: a model might detect all calls (perfect recall) but consistently start predictions 100ms late, which boundary accuracy would reveal. .. autosummary:: :toctree: ../generated/api compute_boundary_accuracy BoundaryAccuracy Types ~~~~~ Core data types used throughout the evaluation module. .. autosummary:: :toctree: ../generated/api Interval Match