pytorchvideo.data.encoded_video_dataset

Dataset loaders and supporting classes for encoded video datasets (e.g. Kinetics, HMDB51, UCF101).

class pytorchvideo.data.encoded_video_dataset.EncodedVideoDataset(*args, **kwds)[source]

EncodedVideoDataset handles the storage, loading, decoding and clip sampling for a video dataset. It assumes each video is stored as an encoded video (e.g. mp4, avi).

__init__(labeled_video_paths, clip_sampler, video_sampler=<class 'torch.utils.data.sampler.RandomSampler'>, transform=None, decode_audio=True, decoder='pyav')[source]
Parameters
  • labeled_video_paths (List[Tuple[str, Optional[dict]]]) – List containing video file paths and associated labels.

  • clip_sampler (ClipSampler) – Defines how clips should be sampled from each video. See the clip sampling documentation for more information.

  • video_sampler (Type[torch.utils.data.Sampler]) – Sampler for the internal video container. This defines the order videos are decoded and, if necessary, the distributed split.

  • transform (Callable) –

    This callable is evaluated on the clip output before the clip is returned. It can be used for user defined preprocessing and augmentations to the clips. The clip output is a dictionary with the following format:

    {
        'video': <video_tensor>,
        'label': <index_label>,
        'video_index': <video_index>,
        'clip_index': <clip_index>,
        'aug_index': <aug_index>  (augmentation index, as augmentations might generate multiple views for one clip)
    }

    If transform is None, the raw clip output in the above format is returned unmodified.

  • decoder (str) – Defines which decoder is used to decode a video.

  • decode_audio (bool) –

Return type

None
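
As a rough illustration of the transform parameter, the sketch below shows a callable that takes a clip dictionary in the format above and returns a modified dictionary. The helper names (keep_video_and_label, compose) are hypothetical and not part of the library, and the 'video' entry is a placeholder string where a real clip would hold a tensor.

```python
from typing import Any, Callable, Dict

Clip = Dict[str, Any]


def keep_video_and_label(clip: Clip) -> Clip:
    """Drop the bookkeeping keys so only the input and target remain."""
    return {"video": clip["video"], "label": clip["label"]}


def compose(*transforms: Callable[[Clip], Clip]) -> Callable[[Clip], Clip]:
    """Chain several dict-to-dict transforms into a single callable."""

    def apply(clip: Clip) -> Clip:
        for transform in transforms:
            clip = transform(clip)
        return clip

    return apply


# Placeholder clip: a real clip would hold a tensor under "video".
raw_clip = {
    "video": "<video_tensor>",
    "label": 3,
    "video_index": 0,
    "clip_index": 2,
    "aug_index": 0,
}

transform = compose(keep_video_and_label)
processed = transform(raw_clip)
print(sorted(processed))
```

A callable of this shape could be passed as transform; because it builds a new dictionary, the raw clip is left untouched.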

__next__()[source]

Retrieves the next clip based on the clip sampling strategy and video sampler.

Returns

A video clip with the following format if transform is None:

{
    'video': <video_tensor>,
    'label': <index_label>,
    'video_index': <video_index>,
    'clip_index': <clip_index>,
    'aug_index': <aug_index>  (augmentation index, as augmentations might generate multiple views for one clip)
}

Otherwise, the transform defines the clip output.

Return type

dict

pytorchvideo.data.encoded_video_dataset.labeled_encoded_video_dataset(data_path, clip_sampler, video_sampler=<class 'torch.utils.data.sampler.RandomSampler'>, transform=None, video_path_prefix='', decode_audio=True, decoder='pyav')[source]

A helper function to create an EncodedVideoDataset object for the UCF101 and Kinetics datasets.

Parameters
  • data_path (pathlib.Path) – Path to the data. The path type defines how the data should be read:

    • For a file path, the file is read and each line is parsed into a video path and label.

    • For a directory, the directory structure defines the classes (i.e. each subdirectory is a class).

    See the LabeledVideoPaths class documentation for specific formatting details and examples.

  • clip_sampler (ClipSampler) – Defines how clips should be sampled from each video. See the clip sampling documentation for more information.

  • video_sampler (Type[torch.utils.data.Sampler]) – Sampler for the internal video container. This defines the order videos are decoded and, if necessary, the distributed split.

  • transform (Callable) –

    This callable is evaluated on the clip output before the clip is returned. It can be used for user defined preprocessing and augmentations to the clips. The clip output is a dictionary with the following format:

    {
        'video': <video_tensor>,
        'label': <index_label>,
        'video_index': <video_index>,
        'clip_index': <clip_index>,
        'aug_index': <aug_index>  (augmentation index, as augmentations might generate multiple views for one clip)
    }

    If transform is None, the raw clip output in the above format is returned unmodified.

  • video_path_prefix (str) – Path to root directory with the videos that are loaded in EncodedVideoDataset. All the video paths before loading are prefixed with this path.

  • decoder (str) – Defines which decoder is used to decode a video.

  • decode_audio (bool) –

Return type

EncodedVideoDataset
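
The two ways of reading data_path can be sketched with the standard library alone. This is not the LabeledVideoPaths implementation: the function name read_labeled_paths is hypothetical, the line format "<video_path> <integer_label>" is an assumption, and so is the choice to number classes in sorted directory order.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def read_labeled_paths(data_path: Path, prefix: str = "") -> List[Tuple[str, dict]]:
    """Build (video_path, {"label": int}) pairs from a file or a directory."""
    data_path = Path(data_path)
    pairs: List[Tuple[str, dict]] = []
    if data_path.is_file():
        # Assumed line format: "<video_path> <integer_label>".
        for line in data_path.read_text().splitlines():
            if not line.strip():
                continue
            video_path, label = line.rsplit(None, 1)
            pairs.append((video_path, {"label": int(label)}))
    elif data_path.is_dir():
        # Each subdirectory is a class; classes are numbered in sorted order.
        class_dirs = sorted(p for p in data_path.iterdir() if p.is_dir())
        for label, class_dir in enumerate(class_dirs):
            for video in sorted(class_dir.iterdir()):
                pairs.append((str(video), {"label": label}))
    else:
        raise FileNotFoundError(data_path)
    # Mirror video_path_prefix: prepend the prefix to every path.
    return [(str(Path(prefix) / path), info) for path, info in pairs]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # File case: one "<path> <label>" entry per line.
    listing = root / "train.txt"
    listing.write_text("archery/a.mp4 0\nbowling/b.mp4 1\n")
    from_file = read_labeled_paths(listing, prefix="/data/videos")

    # Directory case: one subdirectory per class.
    for name in ("bowling", "archery"):
        (root / "videos" / name).mkdir(parents=True)
        (root / "videos" / name / "clip.mp4").touch()
    from_dir = read_labeled_paths(root / "videos")
    dir_labels = [(Path(p).parent.name, info["label"]) for p, info in from_dir]

print(from_file)
print(dir_labels)
```

Each resulting pair has the List[Tuple[str, Optional[dict]]] shape that EncodedVideoDataset accepts as labeled_video_paths.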

pytorchvideo.data.encoded_video_pyav

class pytorchvideo.data.encoded_video_pyav.EncodedVideoPyAV(file, video_name=None, decode_audio=True)[source]

EncodedVideoPyAV is an abstraction for accessing clips from an encoded video using PyAV as the decoding backend. It supports selective decoding when header information is available.

__init__(file, video_name=None, decode_audio=True)[source]
Parameters
  • file (BinaryIO) – a binary file-like object (e.g. io.BytesIO) that contains the encoded video.

  • video_name (Optional[str]) –

  • decode_audio (bool) –

Return type

None

property name

Returns: name: the name of the stored video if set.

property duration

Returns: duration: the video’s duration/end-time in seconds.

get_clip(start_sec, end_sec)[source]

Retrieves frames from the encoded video at the specified start and end times in seconds (the video always starts at 0 seconds).

Parameters
  • start_sec (float) – the clip start time in seconds

  • end_sec (float) – the clip end time in seconds

Returns

clip_data – A dictionary mapping the entries at “video” and “audio” to tensors.

“video”: A tensor of the clip’s RGB frames with shape (channel, time, height, width). The frames are of type torch.float32 and in the range [0, 255].

“audio”: A tensor of the clip’s audio samples with shape (samples). The samples are of type torch.float32 and in the range [0, 255].

Returns None if no video or audio is found within the time range.

Return type

Dict[str, Optional[torch.Tensor]]
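
Since get_clip takes a start and end time in seconds and the video always starts at 0, one simple way to cover a video is to tile it with back-to-back windows of fixed length. The sketch below only computes the (start_sec, end_sec) pairs; the function name uniform_clip_windows is hypothetical, and dropping the trailing partial window is a design choice of this example rather than documented library behavior.

```python
from typing import Iterator, Tuple


def uniform_clip_windows(
    duration: float, clip_duration: float
) -> Iterator[Tuple[float, float]]:
    """Yield back-to-back (start_sec, end_sec) windows that fit in the video."""
    if clip_duration <= 0:
        raise ValueError("clip_duration must be positive")
    index = 0
    # Multiply rather than accumulate to avoid floating-point drift.
    while (index + 1) * clip_duration <= duration:
        yield index * clip_duration, (index + 1) * clip_duration
        index += 1


windows = list(uniform_clip_windows(duration=7.0, clip_duration=2.0))
print(windows)
```

Each pair could then be passed as get_clip(start_sec, end_sec), with duration taken from the video's duration property.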

close()[source]

Closes the internal video container.

pytorchvideo.data.encoded_video_torchvision

class pytorchvideo.data.encoded_video_torchvision.EncodedVideoTorchVision(file, video_name=None, decode_audio=True)[source]

Accessing clips from an encoded video using Torchvision video reading API (torch.ops.video_reader.read_video_from_memory) as the decoding backend.

property name

Returns: name: the name of the stored video if set.

property duration

Returns: duration: the video’s duration/end-time in seconds.

get_clip(start_sec, end_sec)[source]

Retrieves frames from the encoded video at the specified start and end times in seconds (the video always starts at 0 seconds).

Parameters
  • start_sec (float) – the clip start time in seconds

  • end_sec (float) – the clip end time in seconds

Returns

clip_data – A dictionary mapping the entries at “video” and “audio” to tensors.

“video”: A tensor of the clip’s RGB frames with shape (channel, time, height, width). The frames are of type torch.float32 and in the range [0, 255].

“audio”: A tensor of the clip’s audio samples with shape (samples). The samples are of type torch.float32 and in the range [0, 255].

Returns None if no video or audio is found within the time range.

Return type

Dict[str, Optional[torch.Tensor]]
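
Because the return type allows None entries, callers may want to check the result before using it. The following is a minimal defensive sketch, assuming that either the whole result or an individual entry can be None; the helper name unpack_clip is hypothetical, and placeholder strings stand in for tensors.

```python
from typing import Any, Dict, Optional, Tuple


def unpack_clip(
    clip_data: Optional[Dict[str, Optional[Any]]]
) -> Tuple[Any, Optional[Any]]:
    """Return (video, audio), raising if the clip has no video frames."""
    if clip_data is None or clip_data.get("video") is None:
        raise ValueError("no video frames were decoded for the requested range")
    # Audio stays optional: a clip without audio is still usable.
    return clip_data["video"], clip_data.get("audio")


# Placeholder values stand in for the tensors a real call would return.
video, audio = unpack_clip({"video": "<frames>", "audio": None})

try:
    unpack_clip({"video": None, "audio": None})
    empty_clip_rejected = False
except ValueError:
    empty_clip_rejected = True

print(video, audio, empty_clip_rejected)
```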

pytorchvideo.data.encoded_video

pytorchvideo.data.encoded_video.select_video_class(decoder)[source]

Selects the class for accessing clips based on the provided decoder string.

Parameters

decoder (str) – Defines which decoder is used to decode a video.

Return type

pytorchvideo.data.video.Video
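
Selecting a class from a decoder string is commonly done with a lookup table. The sketch below illustrates that general pattern and is not the library's implementation: the backend classes are empty stand-ins, select_backend is a hypothetical name, and the "torchvision" key is an assumption (only 'pyav' appears as a default value in the signatures documented here).

```python
from typing import Dict, Type


class PyAVBackend:
    """Hypothetical stand-in for a PyAV-backed video class."""


class TorchVisionBackend:
    """Hypothetical stand-in for a Torchvision-backed video class."""


# Assumed decoder strings; only "pyav" is confirmed by the documented defaults.
_BACKENDS: Dict[str, Type] = {
    "pyav": PyAVBackend,
    "torchvision": TorchVisionBackend,
}


def select_backend(decoder: str) -> Type:
    """Map a decoder string to its class, failing loudly on unknown names."""
    try:
        return _BACKENDS[decoder]
    except KeyError:
        raise ValueError(
            f"unknown decoder {decoder!r}; expected one of {sorted(_BACKENDS)}"
        ) from None


chosen = select_backend("pyav")
print(chosen.__name__)
```

Raising on an unknown string, rather than falling back to a default, surfaces configuration typos early.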

class pytorchvideo.data.encoded_video.EncodedVideo(file, video_name=None, decode_audio=True, decoder='pyav')[source]

EncodedVideo is an abstraction for accessing clips from an encoded video. It supports selective decoding when header information is available.

classmethod from_path(file_path, decode_audio=True, decoder='pyav')[source]

Fetches the given video path using PathManager (allowing remote URIs to be fetched) and constructs the EncodedVideo object.

Parameters
  • file_path (str) – a PathManager file-path.

  • decode_audio (bool) –

  • decoder (str) –

__init__(file, video_name=None, decode_audio=True, decoder='pyav')[source]
Parameters
  • file (BinaryIO) – a binary file-like object (e.g. io.BytesIO) that contains the encoded video.

  • decoder (str) – Defines which decoder is used to decode a video.

  • video_name (Optional[str]) –

  • decode_audio (bool) –

Return type

None

property name

Returns: name: the name of the stored video if set.

property duration

Returns: duration: the video’s duration/end-time in seconds.

get_clip(start_sec, end_sec)[source]

Retrieves frames from the encoded video at the specified start and end times in seconds (the video always starts at 0 seconds).

Parameters
  • start_sec (float) – the clip start time in seconds

  • end_sec (float) – the clip end time in seconds

Returns

clip_data – A dictionary mapping the entries at “video” and “audio” to tensors.

“video”: A tensor of the clip’s RGB frames with shape (channel, time, height, width). The frames are of type torch.float32 and in the range [0, 255].

“audio”: A tensor of the clip’s audio samples with shape (samples). The samples are of type torch.float32 and in the range [0, 255].

Returns None if no video or audio is found within the time range.

Return type

Dict[str, Optional[torch.Tensor]]

close()[source]

Closes the internal video container.
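
Any object that exposes close() can be wrapped in contextlib.closing from the standard library so that the container is released even if decoding raises. The sketch below demonstrates the pattern with FakeVideo, a hypothetical stand-in that mimics only the get_clip and close methods documented above.

```python
from contextlib import closing


class FakeVideo:
    """Hypothetical stand-in exposing get_clip() and close()."""

    def __init__(self) -> None:
        self.closed = False

    def get_clip(self, start_sec: float, end_sec: float) -> dict:
        if self.closed:
            raise RuntimeError("video is closed")
        # Placeholder string where a real clip would hold a tensor.
        return {"video": f"frames[{start_sec}, {end_sec})", "audio": None}

    def close(self) -> None:
        self.closed = True


video = FakeVideo()
# closing() calls video.close() when the block exits, even on error.
with closing(video) as v:
    clip = v.get_clip(0.0, 2.0)

print(clip["video"], video.closed)
```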