callcut.nn.BaseDetector

class callcut.nn.BaseDetector(n_bands, window_frames)

Abstract base class for call detection models.

Subclasses must implement:

  • forward: Process input features and return logits.

  • receptive_field: Property returning the receptive field in frames.

  • _save_config: Return additional constructor kwargs for serialization.

Models accept input of shape (batch, n_bands, time) and return logits of shape (batch, time).

Parameters:
n_bands : int

Number of input frequency bands.

window_frames : int

Number of frames per input window. This determines the temporal context the model sees during training and inference. The corresponding duration in seconds depends on the feature extractor’s hop size: window_duration_s = window_frames * hop_ms / 1000.
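As a quick sanity check of the formula above (the 512-frame window and 10 ms hop here are hypothetical values, not library defaults):

```python
# Hypothetical values: a 512-frame window with a 10 ms feature hop.
window_frames = 512
hop_ms = 10

# window_duration_s = window_frames * hop_ms / 1000
window_duration_s = window_frames * hop_ms / 1000
print(window_duration_s)  # 5.12 seconds of audio per window
```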

Attributes

n_bands

Number of input frequency bands.

receptive_field

Receptive field in frames.

window_frames

Number of frames per input window.

Methods

forward(x)

Forward pass.

predict(features, *[, hop_frames])

Run sliding window inference on a full recording.

Notes

The receptive field is the number of input frames that influence a single output prediction. For a CNN built from stride-1 convolutions, it is typically the sum of (kernel_size - 1) across all convolutional layers. This determines how much temporal context the model uses when making predictions.
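Under that convention, the computation reduces to a one-line sum. The kernel sizes below are hypothetical, chosen only to illustrate the arithmetic:

```python
# Receptive field of a stack of stride-1 convolutions, following the
# convention described above: the sum of (kernel_size - 1) over all layers.
kernel_sizes = [5, 3, 3]  # hypothetical layer kernel sizes
receptive_field = sum(k - 1 for k in kernel_sizes)
print(receptive_field)  # 4 + 2 + 2 = 8 frames
```

A subclass with these layers would return this value from its receptive_field property.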

Examples

Create a custom model by subclassing BaseDetector:

>>> from torch import Tensor, nn
>>> from callcut.nn import BaseDetector
>>>
>>> class MyModel(BaseDetector):
...     def __init__(self, n_bands: int, window_frames: int):
...         super().__init__(n_bands, window_frames)
...         self._conv = nn.Conv1d(n_bands, 1, kernel_size=5, padding=2)
...
...     @property
...     def receptive_field(self) -> int:
...         return 4  # kernel_size - 1
...
...     def forward(self, x: Tensor) -> Tensor:
...         return self._conv(x).squeeze(1)
...
...     def _save_config(self) -> dict:
...         return {}  # no additional constructor args

abstractmethod forward(x)

Forward pass.

Parameters:
x : Tensor

Input features of shape (batch, n_bands, time).

Returns:
logits : Tensor

Output logits of shape (batch, time).

predict(features, *, hop_frames=None)

Run sliding window inference on a full recording.

The model is applied to overlapping windows across the recording. Where windows overlap, predictions are averaged to produce smoother, more robust per-frame probability estimates.

Parameters:
features : Tensor

Input features of shape (n_bands, n_frames). Should be on the same device as the model.

hop_frames : int | None

Hop between consecutive windows in frames. Smaller values produce more overlap and smoother predictions but increase computation time. If None, defaults to window_frames // 4 (75% overlap).

Returns:
probabilities : Tensor

Per-frame call probabilities of shape (n_frames,). Values are in [0, 1], where higher values indicate higher confidence that a call is present.

Notes

The inference process:

  1. Slide a window of size window_frames across the recording with step hop_frames.

  2. For each window, run the model to get logits, then apply sigmoid to get probabilities.

  3. Accumulate predictions for each frame. Frames covered by multiple windows receive multiple predictions.

  4. Average the accumulated predictions to get final per-frame probabilities.

For frames near the end of the recording that don't fill a full window, the input features are padded by repeating edge values.
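The accumulate-and-average scheme above can be sketched in a few lines. This is a simplified illustration, not the library's implementation: it truncates the final window rather than edge-padding it, and `window_probs_fn` stands in for running the model plus sigmoid on one window:

```python
import numpy as np

def sliding_window_average(window_probs_fn, n_frames, window_frames, hop_frames):
    """Sketch: accumulate per-frame predictions from overlapping windows,
    then divide by how many windows covered each frame."""
    probs_sum = np.zeros(n_frames)
    counts = np.zeros(n_frames)
    for start in range(0, n_frames, hop_frames):
        end = min(start + window_frames, n_frames)
        # Hypothetical stand-in for: sigmoid(model(features[:, start:end]))
        probs_sum[start:end] += window_probs_fn(start, end)
        counts[start:end] += 1
        if end == n_frames:
            break
    return probs_sum / counts

# A model that always predicts 0.7 yields 0.7 everywhere after averaging:
probs = sliding_window_average(lambda s, e: np.full(e - s, 0.7), 100, 32, 8)
print(probs.shape, float(probs[0]))  # (100,) 0.7
```

Note that hop_frames = window_frames // 4 means each interior frame is covered by four windows, which is the averaging that smooths the final probabilities.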

Examples

>>> from callcut.pipeline import load_pipeline
>>> from callcut.io import load_audio
>>>
>>> model, extractor, decoder = load_pipeline("pipeline.pt", device="cpu")
>>>
>>> waveform, sr = load_audio("recording.wav", sample_rate=32000)
>>> features, times = extractor(waveform)
>>>
>>> probs = model.predict(features)
>>> probs.shape
torch.Size([1234])

property n_bands

Number of input frequency bands.

Type:

int

abstract property receptive_field

Receptive field in frames.

The number of input frames that influence a single output prediction. Used to determine padding requirements during inference.

Type:

int

property window_frames

Number of frames per input window.

The corresponding duration in seconds depends on the feature extractor’s hop size: window_duration_s = window_frames * hop_ms / 1000.

Type:

int