pytorchvideo.data.encoded_video_dataset¶

Dataset loaders and supporting classes for encoded video datasets (e.g. Kinetics, HMDB51, UCF101).

class pytorchvideo.data.encoded_video_dataset.EncodedVideoDataset(*args, **kwds)[source]¶

EncodedVideoDataset handles the storage, loading, decoding and clip sampling for a video dataset. It assumes each video is stored as an encoded video (e.g. mp4, avi).

__init__(labeled_video_paths, clip_sampler, video_sampler=<class 'torch.utils.data.sampler.RandomSampler'>, transform=None, decode_audio=True, decoder='pyav')[source]¶

Parameters

labeled_video_paths (List[Tuple[str, Optional[dict]]]) – List containing video file paths and associated labels.
clip_sampler (ClipSampler) – Defines how clips should be sampled from each video. See the clip sampling documentation for more information.
video_sampler (Type[torch.utils.data.Sampler]) – Sampler for the internal video container. This defines the order videos are decoded and, if necessary, the distributed split.
transform (Callable) –
This callable is evaluated on the clip output before the clip is returned. It can be used for user defined preprocessing and augmentations to the clips. The clip output is a dictionary with the following format:

{
    'video': <video_tensor>,
    'label': <index_label>,
    'video_index': <video_index>,
    'clip_index': <clip_index>,
    'aug_index': <aug_index>,  # augmentation index, as augmentations might generate multiple views for one clip
}

If transform is None, the raw clip output in the above format is returned unmodified.
decoder (str) – Defines which decoder is used to decode a video.
decode_audio (bool) –

Return type

None
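As a rough illustration of the transform contract described above, the sketch below applies a callable to a clip dictionary. A flat list of floats stands in for the video tensor so the snippet runs without PyTorch, and the names `scale_video` and `apply_transform` are hypothetical helpers, not part of PyTorchVideo.

```python
from typing import Any, Callable, Dict, Optional

Clip = Dict[str, Any]


def scale_video(clip: Clip) -> Clip:
    """Hypothetical transform: rescale pixel values from [0, 255] to [0, 1]."""
    out = dict(clip)  # shallow copy so the caller's dict is not mutated
    out["video"] = [value / 255.0 for value in clip["video"]]
    return out


def apply_transform(clip: Clip, transform: Optional[Callable[[Clip], Clip]] = None) -> Clip:
    """Mirror the documented behaviour: with no transform, return the clip unmodified."""
    return clip if transform is None else transform(clip)


# A stand-in clip; a flat list of floats replaces the real video tensor.
raw_clip = {
    "video": [0.0, 127.5, 255.0],
    "label": 3,
    "video_index": 0,
    "clip_index": 1,
    "aug_index": 0,
}

transformed = apply_transform(raw_clip, scale_video)
untouched = apply_transform(raw_clip)
```

The transform receives and returns the whole dictionary, so it can change the video data while leaving the bookkeeping keys intact.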

__next__()[source]¶

Retrieves the next clip based on the clip sampling strategy and video sampler.

Returns

A video clip with the following format if transform is None:

{
    'video': <video_tensor>,
    'label': <index_label>,
    'video_index': <video_index>,
    'clip_index': <clip_index>,
    'aug_index': <aug_index>,  # augmentation index, as augmentations might generate multiple views for one clip
}

Otherwise, the transform defines the clip output.

Return type

dict
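To make the interplay between the video order and clip sampling more concrete, here is a minimal sketch of an iterator that visits videos in a given order and emits consecutive fixed-length clips. It is not the library's implementation: the function `iterate_clips`, the back-to-back sampling rule, and the constant `aug_index` of 0 are all assumptions chosen for illustration.

```python
from typing import Dict, Iterator, List


def iterate_clips(
    durations: List[float], video_order: List[int], clip_len: float
) -> Iterator[Dict[str, int]]:
    """Yield one record per clip, visiting videos in the given order.

    Assumption: clips are taken back to back with a fixed length, which is
    only one of many possible clip sampling strategies.
    """
    for video_index in video_order:
        clip_index = 0
        start = 0.0
        # Keep sampling until the next clip would run past the end of the video.
        while start + clip_len <= durations[video_index]:
            yield {"video_index": video_index, "clip_index": clip_index, "aug_index": 0}
            clip_index += 1
            start += clip_len


# Two hypothetical videos of 5 s and 2 s, visited in the order [1, 0].
clips = list(iterate_clips([5.0, 2.0], video_order=[1, 0], clip_len=2.0))
```

Each call to `next` on such an iterator either continues with the current video or moves on to the next one in the sampler's order.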

pytorchvideo.data.encoded_video_dataset.labeled_encoded_video_dataset(data_path, clip_sampler, video_sampler=<class 'torch.utils.data.sampler.RandomSampler'>, transform=None, video_path_prefix='', decode_audio=True, decoder='pyav')[source]¶

A helper function to create an EncodedVideoDataset object for the UCF101 and Kinetics datasets.

Parameters

data_path (pathlib.Path) – Path to the data. The path type defines how the data should be read:
- For a file path, the file is read and each line is parsed into a
  video path and label.
- For a directory, the directory structure defines the classes
  (i.e. each subdirectory is a class).
See the LabeledVideoPaths class documentation for specific formatting details and examples.
clip_sampler (ClipSampler) – Defines how clips should be sampled from each video. See the clip sampling documentation for more information.
video_sampler (Type[torch.utils.data.Sampler]) – Sampler for the internal video container. This defines the order videos are decoded and, if necessary, the distributed split.
transform (Callable) –
This callable is evaluated on the clip output before the clip is returned. It can be used for user defined preprocessing and augmentations to the clips. The clip output is a dictionary with the following format:

{
    'video': <video_tensor>,
    'label': <index_label>,
    'video_index': <video_index>,
    'clip_index': <clip_index>,
    'aug_index': <aug_index>,  # augmentation index, as augmentations might generate multiple views for one clip
}

If transform is None, the raw clip output in the above format is returned unmodified.
video_path_prefix (str) – Path to root directory with the videos that are loaded in EncodedVideoDataset. All the video paths before loading are prefixed with this path.
decoder (str) – Defines which decoder is used to decode a video.
decode_audio (bool) –

Return type

EncodedVideoDataset
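The two ways of interpreting data_path, together with video_path_prefix, can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the LabeledVideoPaths implementation: the function `read_labeled_paths` is hypothetical, the list file is assumed to hold one "<video path> <integer label>" pair per line, and directory entries are returned relative to data_path for simplicity.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def read_labeled_paths(data_path: Path, prefix: str = "") -> List[Tuple[str, int]]:
    """Return (video_path, label) pairs from a list file or a class directory tree."""
    if data_path.is_file():
        pairs = []
        for line in data_path.read_text().splitlines():
            if not line.strip():
                continue  # skip blank lines
            # Assumed format: "<video path> <integer label>", split on the last whitespace.
            video_path, label = line.rsplit(None, 1)
            pairs.append((video_path, int(label)))
    else:
        # Each subdirectory is a class; labels follow the sorted class names.
        class_dirs = sorted(p for p in data_path.iterdir() if p.is_dir())
        pairs = [
            (video.relative_to(data_path).as_posix(), label)
            for label, class_dir in enumerate(class_dirs)
            for video in sorted(class_dir.iterdir())
        ]
    # Prefix every path, as video_path_prefix is described as doing.
    return [((Path(prefix) / p).as_posix(), label) for p, label in pairs]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    list_file = root / "train.csv"
    list_file.write_text("a/clip1.mp4 0\nb/clip2.mp4 1\n")
    from_file = read_labeled_paths(list_file, prefix="/videos")

    # Build a tiny class directory tree with empty placeholder files.
    for name in ("archery/x.mp4", "bowling/y.mp4"):
        (root / "classes" / name).parent.mkdir(parents=True, exist_ok=True)
        (root / "classes" / name).touch()
    from_dir = read_labeled_paths(root / "classes")
```

Keeping the prefix separate from the list file lets the same file be reused when the videos live under a different root directory.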

pytorchvideo.data.encoded_video_pyav¶

class pytorchvideo.data.encoded_video_pyav.EncodedVideoPyAV(file, video_name=None, decode_audio=True)[source]¶

EncodedVideoPyAV is an abstraction for accessing clips from an encoded video using PyAV as the decoding backend. It supports selective decoding when header information is available.

__init__(file, video_name=None, decode_audio=True)[source]¶

Parameters

file (BinaryIO) – a binary file-like object (e.g. io.BytesIO) that contains the encoded video.
video_name (Optional[str]) –
decode_audio (bool) –

Return type

None

property name¶: Returns: name: the name of the stored video if set.

property duration¶: Returns: duration: the video’s duration/end-time in seconds.

get_clip(start_sec, end_sec)[source]¶

Retrieves frames from the encoded video at the specified start and end times in seconds (the video always starts at 0 seconds).

Parameters

start_sec (float) – the clip start time in seconds
end_sec (float) – the clip end time in seconds

Returns

clip_data – A dictionary mapping the keys “video” and “audio” to tensors.

“video”: A tensor of the clip’s RGB frames with shape (channel, time, height, width). The frames are of type torch.float32 and in the range [0, 255].

“audio”: A tensor of the clip’s audio samples with shape (samples). The samples are of type torch.float32 and in the range [0, 255].

Returns None if no video or audio is found within the time range.

Return type

Dict[str, Optional[torch.Tensor]]
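The time-based selection behind get_clip can be pictured as filtering frames by timestamp. The sketch below is only a conceptual model: the function `frames_in_range` is hypothetical, the half-open interval is an assumption (the real backend may treat the end boundary differently), and plain lists replace decoded frames.

```python
from typing import List


def frames_in_range(timestamps: List[float], start_sec: float, end_sec: float) -> List[int]:
    """Return indices of frames whose timestamp lies in [start_sec, end_sec).

    Assumption: the interval is half-open; the real backends may treat the
    end boundary differently.
    """
    return [i for i, t in enumerate(timestamps) if start_sec <= t < end_sec]


# A hypothetical 2-second video at 4 frames per second: 0.0, 0.25, ..., 1.75.
timestamps = [i / 4 for i in range(8)]

middle = frames_in_range(timestamps, 0.5, 1.0)
outside = frames_in_range(timestamps, 5.0, 6.0)

# An empty selection corresponds to the documented "None" result.
clip_video = middle or None
missing = outside or None
```

Callers should therefore check for None before using an entry, since a requested range may fall entirely outside the video.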

close()[source]¶: Closes the internal video container.

pytorchvideo.data.encoded_video_torchvision¶

class pytorchvideo.data.encoded_video_torchvision.EncodedVideoTorchVision(file, video_name=None, decode_audio=True)[source]¶

Accessing clips from an encoded video using Torchvision video reading API (torch.ops.video_reader.read_video_from_memory) as the decoding backend.

property name¶: Returns: name: the name of the stored video if set.

property duration¶: Returns: duration: the video’s duration/end-time in seconds.

get_clip(start_sec, end_sec)[source]¶

Retrieves frames from the encoded video at the specified start and end times in seconds (the video always starts at 0 seconds).

Parameters

start_sec (float) – the clip start time in seconds
end_sec (float) – the clip end time in seconds

Returns

clip_data – A dictionary mapping the keys “video” and “audio” to tensors.

“video”: A tensor of the clip’s RGB frames with shape (channel, time, height, width). The frames are of type torch.float32 and in the range [0, 255].

“audio”: A tensor of the clip’s audio samples with shape (samples). The samples are of type torch.float32 and in the range [0, 255].

Returns None if no video or audio is found within the time range.

Return type

Dict[str, Optional[torch.Tensor]]

pytorchvideo.data.encoded_video¶

pytorchvideo.data.encoded_video.select_video_class(decoder)[source]¶

Selects the class for accessing clips based on the provided decoder string.

Parameters: decoder (str) – Defines which decoder is used to decode a video.
Return type: pytorchvideo.data.video.Video
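Selecting a class from a decoder string is typically a small lookup. The following sketch shows one way to do it with a dictionary; the stand-in classes, the `select_backend` function, and the "torchvision" key are assumptions for illustration (only "pyav" appears as a default in the signatures here), so this should not be read as the library's actual code.

```python
from typing import Dict, Type


class PyAVBackend:
    """Hypothetical stand-in for a PyAV-based video class."""


class TorchVisionBackend:
    """Hypothetical stand-in for a TorchVision-based video class."""


# Assumed decoder strings; "pyav" is the documented default.
_BACKENDS: Dict[str, Type] = {
    "pyav": PyAVBackend,
    "torchvision": TorchVisionBackend,
}


def select_backend(decoder: str) -> Type:
    """Look up the backend class for a decoder string, failing loudly on unknown names."""
    try:
        return _BACKENDS[decoder]
    except KeyError:
        raise ValueError(
            f"Unknown decoder {decoder!r}; expected one of {sorted(_BACKENDS)}"
        ) from None


chosen = select_backend("pyav")
```

Returning the class rather than an instance lets the caller decide how and when to construct the video object.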

class pytorchvideo.data.encoded_video.EncodedVideo(file, video_name=None, decode_audio=True, decoder='pyav')[source]¶

EncodedVideo is an abstraction for accessing clips from an encoded video. It supports selective decoding when header information is available.

classmethod from_path(file_path, decode_audio=True, decoder='pyav')[source]¶

Fetches the given video path using PathManager (allowing remote URIs to be fetched) and constructs the EncodedVideo object.

Parameters

file_path (str) – a PathManager file-path.
decode_audio (bool) –
decoder (str) –
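The pattern of an alternate constructor that opens a path and passes a binary file-like object to __init__ can be sketched as below. The class `InMemoryVideo` is a hypothetical stand-in that decodes nothing, the built-in `open` replaces PathManager, and deriving the video name from the file name is an assumption.

```python
import io
import tempfile
from pathlib import Path
from typing import BinaryIO, Optional


class InMemoryVideo:
    """Hypothetical stand-in that only records what it was given; it decodes nothing."""

    def __init__(self, file: BinaryIO, video_name: Optional[str] = None) -> None:
        self._file = file
        self.name = video_name

    @classmethod
    def from_path(cls, file_path: str) -> "InMemoryVideo":
        # Read the encoded bytes and hand the constructor a binary file-like object.
        with open(file_path, "rb") as fh:
            buffer = io.BytesIO(fh.read())
        return cls(buffer, video_name=Path(file_path).name)

    def close(self) -> None:
        """Release the underlying buffer."""
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.mp4"
    path.write_bytes(b"not a real video")  # placeholder bytes, not valid mp4 data
    video = InMemoryVideo.from_path(str(path))
    video_name = video.name
    payload = video._file.read()
    video.close()
```

Because the bytes are copied into memory, the object stays usable after the source file is closed or removed.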

__init__(file, video_name=None, decode_audio=True, decoder='pyav')[source]¶

Parameters

file (BinaryIO) – a binary file-like object (e.g. io.BytesIO) that contains the encoded video.
decoder (str) – Defines which decoder is used to decode a video.
video_name (Optional[str]) –
decode_audio (bool) –

Return type

None

property name¶: Returns: name: the name of the stored video if set.

property duration¶: Returns: duration: the video’s duration/end-time in seconds.

get_clip(start_sec, end_sec)[source]¶

Retrieves frames from the encoded video at the specified start and end times in seconds (the video always starts at 0 seconds).

Parameters

start_sec (float) – the clip start time in seconds
end_sec (float) – the clip end time in seconds

Returns

clip_data – A dictionary mapping the keys “video” and “audio” to tensors.

“video”: A tensor of the clip’s RGB frames with shape (channel, time, height, width). The frames are of type torch.float32 and in the range [0, 255].

“audio”: A tensor of the clip’s audio samples with shape (samples). The samples are of type torch.float32 and in the range [0, 255].

Returns None if no video or audio is found within the time range.

Return type

Dict[str, Optional[torch.Tensor]]
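Since the return type allows None for either entry, downstream code usually guards each key before use. The sketch below works on shape tuples instead of tensors so that it runs without PyTorch; `num_frames` is a hypothetical helper and the example shapes are made up.

```python
from typing import Dict, Optional, Tuple

Shape = Tuple[int, ...]


def num_frames(video_shape: Optional[Shape]) -> int:
    """Return the time dimension of a (channel, time, height, width) shape, or 0 if absent."""
    if video_shape is None:
        return 0
    _channel, time, _height, _width = video_shape
    return time


# Stand-in for a clip whose video was decoded but whose audio is missing.
clip_shapes: Dict[str, Optional[Shape]] = {"video": (3, 8, 224, 224), "audio": None}

frames = num_frames(clip_shapes["video"])
has_audio = clip_shapes["audio"] is not None
```

Note that time is the second dimension in the documented layout, after the channel dimension.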

close()[source]¶: Closes the internal video container.