callcut.nn.BaseDetector

class callcut.nn.BaseDetector(n_bands, window_frames)

Abstract base class for call detection models.

Subclasses must implement:

  • forward: Process input features and return logits.

  • receptive_field: Property returning the receptive field in frames.

  • _save_config: Return additional constructor kwargs for serialization.

Models accept input of shape (batch, n_bands, time) and return logits of shape (batch, time).

Parameters:
n_bands : int

Number of input frequency bands.

window_frames : int

Number of frames per input window. This determines the temporal context the model sees during training and inference. The corresponding duration in seconds depends on the feature extractor’s hop size: window_duration_s = window_frames * hop_ms / 1000.
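As a quick sanity check of the formula above (the 512-frame window and 10 ms hop here are hypothetical values, not library defaults):

```python
# Hypothetical values: a 512-frame window with a 10 ms feature hop.
window_frames = 512
hop_ms = 10

# window_duration_s = window_frames * hop_ms / 1000
window_duration_s = window_frames * hop_ms / 1000
print(window_duration_s)  # 5.12 seconds of audio per window
```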

Attributes

n_bands

Number of input frequency bands.

receptive_field

Receptive field in frames.

window_frames

Number of frames per input window.

Methods

forward(x)

Forward pass.

predict(features, *[, hop_frames])

Run sliding window inference on a full recording.

Notes

The receptive field is the number of input frames that influence a single output prediction. For a CNN built from stride-1 convolutions, it is typically the sum of (kernel_size - 1) across all convolutional layers. This determines how much temporal context the model uses when making predictions.
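Under that convention, the computation reduces to a one-line sum. The kernel sizes below are hypothetical, chosen only to illustrate the arithmetic:

```python
# Receptive field of a stack of stride-1 convolutions, following the
# convention described above: the sum of (kernel_size - 1) over all layers.
kernel_sizes = [5, 3, 3]  # hypothetical layer kernel sizes
receptive_field = sum(k - 1 for k in kernel_sizes)
print(receptive_field)  # 4 + 2 + 2 = 8 frames
```

A subclass with these layers would return this value from its receptive_field property.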

Examples

Create a custom model by subclassing BaseDetector:

>>> from torch import Tensor, nn
>>> from callcut.nn import BaseDetector
>>>
>>> class MyModel(BaseDetector):
...     def __init__(self, n_bands: int, window_frames: int):
...         super().__init__(n_bands, window_frames)
...         self._conv = nn.Conv1d(n_bands, 1, kernel_size=5, padding=2)
...
...     @property
...     def receptive_field(self) -> int:
...         return 4  # kernel_size - 1
...
...     def forward(self, x: Tensor) -> Tensor:
...         return self._conv(x).squeeze(1)
...
...     def _save_config(self) -> dict:
...         return {}  # no additional constructor args

abstractmethod forward(x)

Forward pass.

Parameters:
x : Tensor

Input features of shape (batch, n_bands, time).

Returns:
logits : Tensor

Output logits of shape (batch, time).

predict(features, *, hop_frames=None)

Run sliding window inference on a full recording.

The model is applied to overlapping windows across the recording. Where windows overlap, predictions are averaged to produce smoother, more robust per-frame probability estimates.

Parameters:
features : Tensor

Input features of shape (n_bands, n_frames). Should be on the same device as the model.

hop_frames : int | None

Hop between consecutive windows in frames. Smaller values produce more overlap and smoother predictions but increase computation time. If None, defaults to window_frames // 4 (75% overlap).

Returns:
probabilities : Tensor

Per-frame call probabilities of shape (n_frames,). Values are in [0, 1], where higher values indicate higher confidence that a call is present.

Notes

The inference process:

  1. Slide a window of size window_frames across the recording with step hop_frames.

  2. For each window, run the model to get logits, then apply sigmoid to get probabilities.

  3. Accumulate predictions for each frame. Frames covered by multiple windows receive multiple predictions.

  4. Average the accumulated predictions to get final per-frame probabilities.

For frames near the end of the recording that don't fill a full window, the input features are padded by repeating edge values.
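The accumulate-and-average scheme above can be sketched in a few lines. This is a simplified illustration, not the library's implementation: it truncates the final window rather than edge-padding it, and `window_probs_fn` stands in for running the model plus sigmoid on one window:

```python
import numpy as np

def sliding_window_average(window_probs_fn, n_frames, window_frames, hop_frames):
    """Sketch: accumulate per-frame predictions from overlapping windows,
    then divide by how many windows covered each frame."""
    probs_sum = np.zeros(n_frames)
    counts = np.zeros(n_frames)
    for start in range(0, n_frames, hop_frames):
        end = min(start + window_frames, n_frames)
        # Hypothetical stand-in for: sigmoid(model(features[:, start:end]))
        probs_sum[start:end] += window_probs_fn(start, end)
        counts[start:end] += 1
        if end == n_frames:
            break
    return probs_sum / counts

# A model that always predicts 0.7 yields 0.7 everywhere after averaging:
probs = sliding_window_average(lambda s, e: np.full(e - s, 0.7), 100, 32, 8)
print(probs.shape, float(probs[0]))  # (100,) 0.7
```

Note that hop_frames = window_frames // 4 means each interior frame is covered by four windows, which is the averaging that smooths the final probabilities.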

Examples

>>> from callcut.pipeline import load_pipeline
>>> from callcut.io import load_audio
>>>
>>> model, extractor, decoder = load_pipeline("pipeline.pt", device="cpu")
>>>
>>> waveform, sr = load_audio("recording.wav", sample_rate=32000)
>>> features, times = extractor(waveform)
>>>
>>> probs = model.predict(features)
>>> probs.shape
torch.Size([1234])

property n_bands

Number of input frequency bands.

Type:

int

abstract property receptive_field

Receptive field in frames.

The number of input frames that influence a single output prediction. Used to determine padding requirements during inference.

Type:

int

property window_frames

Number of frames per input window.

The corresponding duration in seconds depends on the feature extractor’s hop size: window_duration_s = window_frames * hop_ms / 1000.

Type:

int