Generic Datasets

terratorch.datasets.generic_pixel_wise_dataset

Module containing generic dataset classes

GenericNonGeoPixelwiseRegressionDataset

Bases: GenericPixelWiseDataset

GenericNonGeoPixelwiseRegressionDataset

__init__(data_root, label_data_root=None, image_grep='*', label_grep='*', split=None, ignore_split_file_extensions=True, allow_substring_split_file=True, rgb_indices=None, dataset_bands=None, output_bands=None, constant_scale=1, transform=None, no_data_replace=None, no_label_replace=None, expand_temporal_dimension=False, reduce_zero_label=False)

Constructor

Parameters:

Name Type Description Default
data_root Path

Path to data root directory

required
label_data_root Path

Path to data root directory with labels. If not specified, will use the same as for images.

None
image_grep str

Glob pattern appended to data_root to find input images. Defaults to "*".

'*'
label_grep str

Glob pattern appended to label_data_root (or to data_root when label_data_root is not given) to find ground truth masks. Defaults to "*".

'*'
split Path

Path to a file listing the files to be used for this split. The file should contain newline-separated prefixes of the desired file names. Files are searched using glob with the form Path(data_root).glob(prefix + [image or label grep]).

None
ignore_split_file_extensions bool

Whether to disregard extensions when using the split file to determine which files to include in the dataset. E.g. necessary for Eurosat, since the split files specify ".jpg" but the files are actually ".tif". Defaults to True.

True
allow_substring_split_file bool

Whether the split files contain substrings that must be present in file names to be included (as in mmsegmentation), or exact matches (e.g. eurosat). Defaults to True.

True
rgb_indices list[str]

Indices of RGB channels. If None, [0, 1, 2] is used.

None
dataset_bands list[HLSBands | int] | None

Bands present in the dataset.

None
output_bands list[HLSBands | int] | None

Bands that should be output by the dataset.

None
constant_scale float

Factor to multiply image values by. Defaults to 1.

1
transform Compose | None

Albumentations transform to be applied. Should end with ToTensorV2(). If used through the generic_data_module, should not include normalization. Not supported for multi-temporal data. Defaults to None, which simply applies ToTensorV2().

None
no_data_replace float | None

Replace NaN values in input images with this value. If None, no replacement is done. Defaults to None.

None
no_label_replace int | None

Replace NaN values in the label with this value. If None, no replacement is done. Defaults to None.

None
expand_temporal_dimension bool

Go from shape (time*channels, h, w) to (channels, time, h, w). Defaults to False.

False
reduce_zero_label bool

Subtract 1 from all labels. Useful when labels start from 1 instead of the expected 0. Defaults to False.

False
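
The split lookup described for the split parameter can be sketched with the standard library alone. This is a minimal illustration of the documented form Path(data_root).glob(prefix + grep), not the library's implementation; the helper name files_for_split and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def files_for_split(data_root: Path, split_file: Path, grep: str = "*") -> list[Path]:
    """Collect files under data_root that match a prefix listed in split_file."""
    # One prefix per line; blank lines are ignored.
    prefixes = [line.strip() for line in split_file.read_text().splitlines() if line.strip()]
    matched: set[Path] = set()
    for prefix in prefixes:
        # Same form as in the parameter description: Path(data_root).glob(prefix + grep)
        matched.update(data_root.glob(prefix + grep))
    return sorted(matched)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical file names, created empty just to exercise the glob.
    for name in ["tile_001_img.tif", "tile_001_mask.tif", "tile_002_img.tif", "tile_003_img.tif"]:
        (root / name).touch()
    split = root / "train.txt"
    split.write_text("tile_001\ntile_003\n")

    train_images = [p.name for p in files_for_split(root, split, grep="*_img.tif")]
    print(train_images)
```

Here image_grep narrows each prefix match to image files, so tile_001_mask.tif is left out even though it shares a prefix listed in the split file.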

plot(sample, suptitle=None)

Plot a sample from the dataset.

Parameters:

Name Type Description Default
sample dict[str, Tensor]

a sample returned by __getitem__

required
suptitle str | None

optional string to use as a suptitle

None

Returns:

Type Description
Figure

a matplotlib Figure with the rendered sample

Added in version 0.2.

GenericNonGeoSegmentationDataset

Bases: GenericPixelWiseDataset

GenericNonGeoSegmentationDataset

__init__(data_root, num_classes, label_data_root=None, image_grep='*', label_grep='*', split=None, ignore_split_file_extensions=True, allow_substring_split_file=True, rgb_indices=None, dataset_bands=None, output_bands=None, class_names=None, constant_scale=1, transform=None, no_data_replace=None, no_label_replace=None, expand_temporal_dimension=False, reduce_zero_label=False)

Constructor

Parameters:

Name Type Description Default
data_root Path

Path to data root directory

required
num_classes int

Number of classes in the dataset

required
label_data_root Path

Path to data root directory with labels. If not specified, will use the same as for images.

None
image_grep str

Glob pattern appended to data_root to find input images. Defaults to "*".

'*'
label_grep str

Glob pattern appended to label_data_root (or to data_root when label_data_root is not given) to find ground truth masks. Defaults to "*".

'*'
split Path

Path to a file listing the files to be used for this split. The file should contain newline-separated prefixes of the desired file names. Files are searched using glob with the form Path(data_root).glob(prefix + [image or label grep]).

None
ignore_split_file_extensions bool

Whether to disregard extensions when using the split file to determine which files to include in the dataset. E.g. necessary for Eurosat, since the split files specify ".jpg" but the files are actually ".tif". Defaults to True.

True
allow_substring_split_file bool

Whether the split files contain substrings that must be present in file names to be included (as in mmsegmentation), or exact matches (e.g. eurosat). Defaults to True.

True
rgb_indices list[str]

Indices of RGB channels. If None, [0, 1, 2] is used.

None
dataset_bands list[HLSBands | int] | None

Bands present in the dataset.

None
output_bands list[HLSBands | int] | None

Bands that should be output by the dataset.

None
class_names list[str]

Class names. Defaults to None.

None
constant_scale float

Factor to multiply image values by. Defaults to 1.

1
transform Compose | None

Albumentations transform to be applied. Should end with ToTensorV2(). If used through the generic_data_module, should not include normalization. Not supported for multi-temporal data. Defaults to None, which simply applies ToTensorV2().

None
no_data_replace float | None

Replace NaN values in input images with this value. If None, no replacement is done. Defaults to None.

None
no_label_replace int | None

Replace NaN values in the label with this value. If None, no replacement is done. Defaults to None.

None
expand_temporal_dimension bool

Go from shape (time*channels, h, w) to (channels, time, h, w). Defaults to False.

False
reduce_zero_label bool

Subtract 1 from all labels. Useful when labels start from 1 instead of the expected 0. Defaults to False.

False
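
The value-level parameters (constant_scale, no_data_replace, no_label_replace, reduce_zero_label) can be illustrated with a small NumPy sketch. The helper prepare_sample is hypothetical, and the order of operations (replace NaNs first, then scale the image and shift the labels) is an assumption of this sketch rather than a statement about the library's internals.

```python
import numpy as np


def prepare_sample(image, mask, constant_scale=1.0, no_data_replace=None,
                   no_label_replace=None, reduce_zero_label=False):
    """Apply NaN replacement, scaling and label shifting to one image/mask pair."""
    image = np.asarray(image, dtype=np.float32)
    mask = np.asarray(mask, dtype=np.float32)

    # Replace NaNs only when a replacement value is given (None means "leave as is").
    if no_data_replace is not None:
        image = np.where(np.isnan(image), no_data_replace, image)
    if no_label_replace is not None:
        mask = np.where(np.isnan(mask), no_label_replace, mask)

    # Multiply image values by the constant factor.
    image = (image * constant_scale).astype(np.float32)

    # Shift labels that start at 1 so that they start at 0.
    if reduce_zero_label:
        mask = mask - 1

    # Cast labels to integers only if no NaN is left in the mask.
    if not np.isnan(mask).any():
        mask = mask.astype(np.int64)
    return image, mask


image, mask = prepare_sample(
    image=[[0.0, np.nan], [2000.0, 4000.0]],
    mask=[[1, 2], [np.nan, 3]],
    constant_scale=0.0001,
    no_data_replace=0,
    no_label_replace=0,
    reduce_zero_label=True,
)
print(image)
print(mask)
```

Under this ordering, a missing label that was replaced by 0 ends up as -1 after the shift, so the interaction between no_label_replace and reduce_zero_label is worth checking against the ignore index expected by the loss.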

plot(sample, suptitle=None)

Plot a sample from the dataset.

Parameters:

Name Type Description Default
sample dict[str, Tensor]

a sample returned by __getitem__

required
suptitle str | None

optional string to use as a suptitle

None

Returns:

Type Description
Figure

a matplotlib Figure with the rendered sample

Added in version 0.2.

GenericPixelWiseDataset

Bases: NonGeoDataset, ABC

This is a generic dataset class to be used for instantiating datasets from arguments. Ideally, one would create a dataset class specific to a dataset.

__init__(data_root, label_data_root=None, image_grep='*', label_grep='*', split=None, ignore_split_file_extensions=True, allow_substring_split_file=True, rgb_indices=None, dataset_bands=None, output_bands=None, constant_scale=1, transform=None, no_data_replace=None, no_label_replace=None, expand_temporal_dimension=False, reduce_zero_label=False)

Constructor

Parameters:

Name Type Description Default
data_root Path

Path to data root directory

required
label_data_root Path

Path to data root directory with labels. If not specified, will use the same as for images.

None
image_grep str

Glob pattern appended to data_root to find input images. Defaults to "*".

'*'
label_grep str

Glob pattern appended to label_data_root (or to data_root when label_data_root is not given) to find ground truth masks. Defaults to "*".

'*'
split Path

Path to a file listing the files to be used for this split. The file should contain newline-separated prefixes of the desired file names. Files are searched using glob with the form Path(data_root).glob(prefix + [image or label grep]).

None
ignore_split_file_extensions bool

Whether to disregard extensions when using the split file to determine which files to include in the dataset. E.g. necessary for Eurosat, since the split files specify ".jpg" but the files are actually ".tif". Defaults to True.

True
allow_substring_split_file bool

Whether the split files contain substrings that must be present in file names to be included (as in mmsegmentation), or exact matches (e.g. eurosat). Defaults to True.

True
rgb_indices list[str]

Indices of RGB channels. If None, [0, 1, 2] is used.

None
dataset_bands list[HLSBands | int | tuple[int, int] | str] | None

Bands present in the dataset. This parameter names input channels (bands) using HLSBands, ints, int ranges, or strings, so that they can then be referred to by output_bands. Defaults to None.

None
output_bands list[HLSBands | int | tuple[int, int] | str] | None

Bands that should be output by the dataset as named by dataset_bands.

None
constant_scale float

Factor to multiply image values by. Defaults to 1.

1
transform Compose | None

Albumentations transform to be applied. Should end with ToTensorV2(). If used through the generic_data_module, should not include normalization. Not supported for multi-temporal data. Defaults to None, which simply applies ToTensorV2().

None
no_data_replace float | None

Replace NaN values in input images with this value. If None, no replacement is done. Defaults to None.

None
no_label_replace int | None

Replace NaN values in the label with this value. If None, no replacement is done. Defaults to None.

None
expand_temporal_dimension bool

Go from shape (time*channels, h, w) to (channels, time, h, w). Defaults to False.

False
reduce_zero_label bool

Subtract 1 from all labels. Useful when labels start from 1 instead of the expected 0. Defaults to False.

False
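
The relationship between dataset_bands and output_bands can be sketched as an index lookup: dataset_bands names every channel in the file, and output_bands picks a subset by name. The helpers below are hypothetical, plain strings stand in for HLSBands members, and treating a tuple (start, end) as an inclusive integer range is an assumption of this sketch.

```python
def expand_bands(bands):
    """Expand (start, end) tuples into individual integer band identifiers."""
    expanded = []
    for band in bands:
        if isinstance(band, tuple):
            start, end = band
            # Assumption: the range includes both endpoints.
            expanded.extend(range(start, end + 1))
        else:
            expanded.append(band)
    return expanded


def output_band_indices(dataset_bands, output_bands):
    """Return the channel index in the dataset for each requested output band."""
    dataset_bands = expand_bands(dataset_bands)
    output_bands = expand_bands(output_bands)
    missing = [band for band in output_bands if band not in dataset_bands]
    if missing:
        raise ValueError(f"Output bands not present in dataset_bands: {missing}")
    return [dataset_bands.index(band) for band in output_bands]


# Seven channels in the file: four named bands followed by channels 4, 5 and 6.
dataset_bands = ["BLUE", "GREEN", "RED", "NIR", (4, 6)]
# Request RGB in display order plus channels 5 and 6.
output_bands = ["RED", "GREEN", "BLUE", (5, 6)]

indices = output_band_indices(dataset_bands, output_bands)
print(indices)
```

The resulting indices could then be used to slice the channel axis of a loaded image, which also reorders the channels to match output_bands.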

terratorch.datasets.generic_scalar_label_dataset

Module containing generic dataset classes

GenericNonGeoClassificationDataset

Bases: GenericScalarLabelDataset

GenericNonGeoClassificationDataset

__init__(data_root, num_classes, split=None, ignore_split_file_extensions=True, allow_substring_split_file=True, rgb_indices=None, dataset_bands=None, output_bands=None, class_names=None, constant_scale=1, transform=None, no_data_replace=0, expand_temporal_dimension=False)

A generic Non-Geo dataset for classification.

Parameters:

Name Type Description Default
data_root Path

Path to data root directory

required
num_classes int

Number of classes in the dataset

required
split Path

Path to a file listing the files to be used for this split. The file should contain newline-separated prefixes of the desired file names. Files are searched using glob with the form Path(data_root).glob(prefix + [image or label grep]).

None
ignore_split_file_extensions bool

Whether to disregard extensions when using the split file to determine which files to include in the dataset. E.g. necessary for Eurosat, since the split files specify ".jpg" but the files are actually ".tif". Defaults to True.

True
allow_substring_split_file bool

Whether the split files contain substrings that must be present in file names to be included (as in mmsegmentation), or exact matches (e.g. eurosat). Defaults to True.

True
rgb_indices list[str]

Indices of RGB channels. If None, [0, 1, 2] is used.

None
dataset_bands list[HLSBands | int] | None

Bands present in the dataset.

None
output_bands list[HLSBands | int] | None

Bands that should be output by the dataset.

None
class_names list[str]

Class names. Defaults to None.

None
constant_scale float

Factor to multiply image values by. Defaults to 1.

1
transform Compose | None

Albumentations transform to be applied. Should end with ToTensorV2(). If used through the generic_data_module, should not include normalization. Not supported for multi-temporal data. Defaults to None, which simply applies ToTensorV2().

None
no_data_replace float

Replace nan values in input images with this value. Defaults to 0.

0
expand_temporal_dimension bool

Go from shape (time*channels, h, w) to (channels, time, h, w). Defaults to False.

False
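
The two split-file switches, ignore_split_file_extensions and allow_substring_split_file, can be pictured as two independent choices in a membership test. The helper in_split and the file names below are hypothetical; this is a sketch of the documented behaviour, not the library's code.

```python
from pathlib import Path


def in_split(file_name, split_entries, ignore_extensions=True, allow_substring=True):
    """Decide whether file_name belongs to the split described by split_entries."""
    # Optionally drop the extension from both the file name and the split entries.
    name = Path(file_name).stem if ignore_extensions else file_name
    entries = [Path(entry).stem if ignore_extensions else entry for entry in split_entries]
    if allow_substring:
        # Substring mode: an entry only has to appear somewhere in the name.
        return any(entry in name for entry in entries)
    # Exact mode: the name must equal one of the entries.
    return name in entries


split_entries = ["Forest_1.jpg", "River_2.jpg"]
files = ["Forest_1.tif", "Forest_10.tif", "River_2.tif"]

exact = [f for f in files if in_split(f, split_entries, allow_substring=False)]
substring = [f for f in files if in_split(f, split_entries, allow_substring=True)]
strict = [f for f in files
          if in_split(f, split_entries, ignore_extensions=False, allow_substring=False)]

print(exact)
print(substring)
print(strict)
```

In substring mode Forest_10.tif is picked up by the entry Forest_1, which is why exact matching may be preferable when file names share prefixes. With extensions taken into account, nothing matches because the split entries and the files use different extensions.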

GenericScalarLabelDataset

Bases: NonGeoDataset, ImageFolder, ABC

This is a generic dataset class to be used for instantiating datasets from arguments. Ideally, one would create a dataset class specific to a dataset.
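
Since the class lists ImageFolder among its bases, data_root is presumably expected to follow the torchvision ImageFolder convention: one sub-directory per class, with class indices assigned in sorted order of the directory names. The standard-library sketch below mimics that convention on a temporary directory; scan_class_folders and the folder names are hypothetical.

```python
import tempfile
from pathlib import Path


def scan_class_folders(data_root: Path):
    """List classes (sub-directories) and (file name, class index) pairs."""
    # Class names are the sub-directory names, sorted alphabetically.
    classes = sorted(d.name for d in data_root.iterdir() if d.is_dir())
    class_to_idx = {name: idx for idx, name in enumerate(classes)}
    samples = [
        (path.name, class_to_idx[cls])
        for cls in classes
        for path in sorted((data_root / cls).iterdir())
        if path.is_file()
    ]
    return classes, samples


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout: data_root/<class name>/<image file>
    for cls, names in {"River": ["c.tif"], "Forest": ["a.tif", "b.tif"]}.items():
        (root / cls).mkdir()
        for name in names:
            (root / cls / name).touch()

    classes, samples = scan_class_folders(root)
    print(classes)
    print(samples)
```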

__init__(data_root, split=None, ignore_split_file_extensions=True, allow_substring_split_file=True, rgb_indices=None, dataset_bands=None, output_bands=None, constant_scale=1, transform=None, no_data_replace=0, expand_temporal_dimension=False)

Constructor

Parameters:

Name Type Description Default
data_root Path

Path to data root directory

required
split Path

Path to a file listing the files to be used for this split. The file should contain newline-separated prefixes of the desired file names. Files are searched using glob with the form Path(data_root).glob(prefix + [image or label grep]).

None
ignore_split_file_extensions bool

Whether to disregard extensions when using the split file to determine which files to include in the dataset. E.g. necessary for Eurosat, since the split files specify ".jpg" but the files are actually ".tif". Defaults to True.

True
allow_substring_split_file bool

Whether the split files contain substrings that must be present in file names to be included (as in mmsegmentation), or exact matches (e.g. eurosat). Defaults to True.

True
rgb_indices list[str]

Indices of RGB channels. If None, [0, 1, 2] is used.

None
dataset_bands list[HLSBands | int | tuple[int, int] | str] | None

Bands present in the dataset. This parameter gives identifiers to input channels (bands) so that they can then be referred to by output_bands. Can use the HLSBands enum, ints, int ranges, or strings. Defaults to None.

None
output_bands list[HLSBands | int | tuple[int, int] | str] | None

Bands that should be output by the dataset as named by dataset_bands.

None
constant_scale float

Factor to multiply image values by. Defaults to 1.

1
transform Compose | None

Albumentations transform to be applied. Should end with ToTensorV2(). If used through the generic_data_module, should not include normalization. Not supported for multi-temporal data. Defaults to None, which simply applies ToTensorV2().

None
no_data_replace float

Replace nan values in input images with this value. Defaults to 0.

0
expand_temporal_dimension bool

Go from shape (time*channels, h, w) to (channels, time, h, w). Defaults to False.

False
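
The reshape behind expand_temporal_dimension can be sketched with NumPy. The description gives only the input and output shapes, so the ordering of the stacked axis is an assumption here: the sketch treats it as channel-major (all time steps of channel 0 first, then channel 1, and so on). If the data is stored time-major instead, the array would need to be reshaped to (time, channels, h, w) and the first two axes swapped. The helper expand_temporal is hypothetical.

```python
import numpy as np


def expand_temporal(image: np.ndarray, num_channels: int) -> np.ndarray:
    """Reshape (time*channels, h, w) into (channels, time, h, w)."""
    stacked, height, width = image.shape
    if stacked % num_channels != 0:
        raise ValueError("First axis is not a multiple of the number of channels")
    num_steps = stacked // num_channels
    # Assumption: stacked index = channel * num_steps + time_step (channel-major).
    return image.reshape(num_channels, num_steps, height, width)


# 3 channels x 2 time steps, stacked along the first axis, on a 2x2 image.
flat = np.arange(6 * 2 * 2).reshape(6, 2, 2)
cube = expand_temporal(flat, num_channels=3)
print(cube.shape)
```

With this ordering, cube[1, 0] holds the same pixels as flat[2], that is, the first time step of the second channel.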

Last update: March 23, 2025