Preserving hardware memory during cuvid decoding, exporting/importing via dlpack. by caffeinism · Pull Request #2155 · PyAV-Org/PyAV

caffeinism · 2026-02-04T16:35:56Z

Hello, I'm a user with limited knowledge of libav, DLPack, and Cython. However, I think this feature is necessary, so I drafted it with the help of an LLM.

Motivation

If an application decodes video, performs GPU operations, and then re-encodes it, PyAV currently incurs a significant number of memory copies: GPU (cuvid) -> CPU (PyAV) -> GPU (Torch, etc.) -> CPU (PyAV) -> GPU (nvenc). If frames decoded by cuvid could be exported through DLPack while staying on the GPU, they would not need to be moved to CPU memory at all.

All existing tests pass, but with such extensive modifications it is difficult for a beginner like me to catch every detail. Since most changes add features rather than modify existing ones, I hope this PR serves as a good starting point.

Usage example

import av
from av.codec.hwaccel import HWAccel
import torch

hwaccel = HWAccel(
    device_type="cuda",
    device=0,
    allow_software_fallback=False,
    output_format="hw", # preserve hw memory
)

# decode using cuvid
with av.open(from_video_filename, "r", hwaccel=hwaccel) as c:
    frame = next(c.decode(video=0))
    y = torch.from_dlpack(frame.planes[0]) # device(type='cuda', index=0), torch.uint8, torch.Size([H, W])
    uv = torch.from_dlpack(frame.planes[1]) # device(type='cuda', index=0), torch.uint8, torch.Size([H/2, W/2])

f = av.VideoFrame.from_dlpack(((y*0.5).to(torch.uint8), uv)) # some operation

with av.open(to_video_filename, "w") as c:
    s = c.add_stream("h264_nvenc", rate=24) # encode using nvenc
    for it in s.encode(f):
        c.mux(it)
    for it in s.encode(None):
        c.mux(it)
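
The plane shapes in the comments above are easier to follow with the memory layout spelled out. The following is a minimal sketch, assuming the decoder outputs NV12 (the PR text does not state the pixel format) and ignoring any row padding the hardware may add. The helper names are hypothetical and not part of PyAV; how PyAV actually shapes the UV tensor is not verified here.

```python
def nv12_plane_shapes(width, height):
    """Return logical (Y, UV) shapes for an NV12 frame without row padding.

    NV12 stores a full-resolution luma plane followed by one chroma plane
    in which U and V samples are interleaved at half resolution.
    """
    if width % 2 or height % 2:
        raise ValueError("NV12 needs even width and height")
    y_shape = (height, width)                 # one byte per pixel
    uv_shape = (height // 2, width // 2, 2)   # (U, V) pair per 2x2 block
    return y_shape, uv_shape


def nv12_total_bytes(width, height):
    """Total payload size in bytes for 8-bit NV12, again without padding."""
    y_shape, uv_shape = nv12_plane_shapes(width, height)
    y_bytes = y_shape[0] * y_shape[1]
    uv_bytes = uv_shape[0] * uv_shape[1] * uv_shape[2]
    return y_bytes + uv_bytes
```

For a 1920x1080 frame this gives a 1080x1920 luma plane and 540x960 chroma pairs, which is 1.5 bytes per pixel overall.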

caffeinism · 2026-02-04T18:47:06Z

@WyattBlue If I add tests, is it acceptable for them to run only on a CUDA machine? I don't think they will work in the GitHub workflow.

WyattBlue · 2026-02-04T18:48:14Z

You need to test the interface. For example, hw_format does not have a .pyi interface, and writing a test would catch that fact.
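
One way to read this advice is to split the tests in two: checks on the Python-facing interface that run anywhere, and checks that need real hardware and are skipped otherwise. The sketch below illustrates that split with the standard unittest module. FakeHWAccel, cuda_available, and the attribute list are hypothetical stand-ins, not PyAV's actual API or test suite.

```python
import io
import unittest


class FakeHWAccel:
    """Hypothetical stand-in for the accelerator class under test."""

    def __init__(self, device_type, hw_format=None):
        self.device_type = device_type
        self.hw_format = hw_format


def cuda_available():
    # Placeholder probe (assumption): a real suite would ask the installed
    # build or driver. Returning False mimics a CI runner without a GPU.
    return False


class InterfaceTest(unittest.TestCase):
    def test_public_attributes_exist(self):
        # Runs everywhere: only touches the Python-level interface.
        accel = FakeHWAccel("cuda", hw_format="cuda")
        for name in ("device_type", "hw_format"):
            self.assertTrue(hasattr(accel, name), name)

    @unittest.skipUnless(cuda_available(), "needs a CUDA device")
    def test_decode_keeps_frame_on_device(self):
        # Hardware-only path; skipped on machines without CUDA.
        self.fail("would decode a real file here")


def run_suite():
    """Run the tests quietly and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(InterfaceTest)
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    return runner.run(suite)
```

On a machine without CUDA, the interface test still runs and the hardware test is reported as skipped rather than failed.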

WyattBlue · 2026-02-04T18:50:52Z

av/hwcontext.pxd should be merged with include/avutil. *.pxd files should otherwise not be free radicals, i.e., they should have a corresponding real .py file.

caffeinism · 2026-02-05T02:34:53Z

You need to test the interface. For example, hw_format does not have a .pyi interface, and writing a test would catch that fact.

Could you please explain it in a bit more detail?

av/hwcontext.pxd should be merged with include/avutil. *.pxd files should otherwise not be free radicals, i.e., they should have a corresponding real .py file.

In this case, how should dlpack.pxd be handled? Should this also be moved to the include directory?

caffeinism · 2026-02-05T08:54:10Z

@WyattBlue Could you take a look at the last commit section? I modified the buffer creation logic for frames generated by VideoFrame(), VideoFrame.from_ndarray(), and VideoFrame.reformat to support dlpack.

WyattBlue · 2026-02-05T16:09:03Z

  _cuda_device_ctx_cache = {}
  _cuda_frames_ctx_cache = {}

These grow indefinitely with no cleanup mechanism. For long-running applications that use many different frame sizes, this could lead to unbounded memory growth. The global caches in frame.py and the registry in _hwdevice_registry.py are also not thread-safe. This is worth documenting or addressing.
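
For illustration, one common way to address both concerns is a size-bounded cache guarded by a lock. This is a generic sketch, not what the PR did (the commit list shows the global caches were later removed instead). BoundedCache and its parameters are hypothetical names.

```python
import threading
from collections import OrderedDict


class BoundedCache:
    """Least-recently-used cache with a fixed capacity and a lock."""

    def __init__(self, max_entries=8):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._entries = OrderedDict()
        self._lock = threading.Lock()

    def get_or_create(self, key, factory):
        """Return the cached value for key, building it with factory() once."""
        with self._lock:
            if key in self._entries:
                # Mark as most recently used.
                self._entries.move_to_end(key)
                return self._entries[key]
            value = factory()
            self._entries[key] = value
            if len(self._entries) > self._max_entries:
                # Evict the least recently used entry.
                self._entries.popitem(last=False)
            return value

    def __len__(self):
        with self._lock:
            return len(self._entries)
```

The factory runs while the lock is held, which keeps the sketch simple but serializes creation; a real implementation holding GPU contexts would also need to release evicted resources explicitly.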

Redundant None check (av/video/frame.py:596-597)

  if primary_ctx is None:
      primary_ctx = True

The function signature already has primary_ctx: bool = True, so this check is unreachable.

Duplicate validation (av/video/frame.py:498-500)

  if dev0 != dev1:
      raise ValueError("plane tensors must be on the same CUDA device")
  if dev_type0 == kDLCUDA:
      if dev0 != dev1:  # Redundant - already checked above
          raise ValueError("plane tensors must be on the same CUDA device")

The second check at lines 499-500 is redundant since dev0 != dev1 is already checked at lines 496-497.
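
A minimal sketch of performing this validation once, assuming each plane's device is available as a (device_type, device_id) pair. The function name is hypothetical, and comparing the whole pair is a slightly stricter check than the snippet quoted above. The constant values follow the DLPack DLDeviceType enum.

```python
# DLPack device type codes (from the DLDeviceType enum).
K_DL_CPU = 1
K_DL_CUDA = 2


def common_device(plane_devices):
    """Return the single (device_type, device_id) shared by all planes.

    plane_devices is a non-empty sequence of (device_type, device_id) pairs.
    Raises ValueError as soon as one plane lives somewhere else.
    """
    if not plane_devices:
        raise ValueError("at least one plane is required")
    first = plane_devices[0]
    for index, device in enumerate(plane_devices[1:], start=1):
        if device != first:
            raise ValueError(
                f"plane {index} is on {device}, expected {first}"
            )
    return first
```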

I'm not sure why we're importing PyObject; imports that aren't used should be removed.

caffeinism · 2026-02-06T01:49:26Z

  _cuda_device_ctx_cache = {}
  _cuda_frames_ctx_cache = {}
These grow indefinitely with no cleanup mechanism. For long-running applications that use many different frame sizes, this could lead to unbounded memory growth. The global caches in frame.py and the registry in _hwdevice_registry.py are also not thread-safe. This is worth documenting or addressing.

This part is complex and my implementation skills are limited, so I reluctantly left the global cache in place. I'll give it more thought.

Redundant None check (av/video/frame.py:596-597)

  if primary_ctx is None:
      primary_ctx = True

The function signature already has primary_ctx: bool = True, so this check is unreachable.

I went back and forth on whether the default should be bool | None = None or bool = True. I will delete that check.

Duplicate validation (av/video/frame.py:498-500)

  if dev0 != dev1:
      raise ValueError("plane tensors must be on the same CUDA device")
  if dev_type0 == kDLCUDA:
      if dev0 != dev1:  # Redundant - already checked above
          raise ValueError("plane tensors must be on the same CUDA device")

The second check at lines 499-500 is redundant since dev0 != dev1 is already checked at lines 496-497.

I didn't notice it during my review.

I'm not sure why we're importing PyObject; imports that aren't used should be removed.

I realized that a raw PyObject* is not reference-counted automatically, so I replaced it with object, but I forgot to remove the import. I will check this part as well.
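
The difference between a raw pointer and an owning reference can be illustrated in plain Python. This is only an analogy, assuming CPython's reference counting: id() stands in for a raw address that owns nothing, while a list slot stands in for an attribute typed as object. The names below are hypothetical.

```python
import weakref


class Payload:
    """Placeholder for an object that something else must keep alive."""


def raw_address_does_not_keep_alive():
    obj = Payload()
    probe = weakref.ref(obj)
    address = id(obj)  # just a number; it holds no reference to obj
    del obj
    # On CPython the object is freed as soon as its last reference goes away.
    return probe() is None and isinstance(address, int)


def owning_reference_keeps_alive():
    obj = Payload()
    probe = weakref.ref(obj)
    holder = [obj]  # an owning reference, like an `object` attribute
    del obj
    still_alive = probe() is not None
    holder.clear()  # dropping the owner releases the object
    return still_alive and probe() is None
```

The same reasoning applies to storing a context object on a frame so that it is not collected while the frame still uses it.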

- Add _device_id field to VideoFrame, set it in from_dlpack and HW decode path
- Store CudaContext on frame (_cuda_ctx) to prevent premature GC
- Fix plane.py to use frame._device_id

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…tream

- Pass is_hw_owned when cloning HWAccel so decoded frames stay on device
- Raise NotImplementedError when __dlpack__ is called with a CUDA stream since PyAV cannot perform stream synchronization
- Make stream keyword-only to match DLPack spec

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

caffeinism · 2026-02-08T05:24:31Z

Thanks!

If anyone arrives here via the issue: torch's from_dlpack appears to pass a non-None stream to __dlpack__ by default, which PyAV rejects. Calling __dlpack__() yourself and passing the result to torch avoids this, for example torch.from_dlpack(frame.planes[0].__dlpack__()).
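
The interaction described here can be sketched without a GPU. FakePlane below is a hypothetical producer that mirrors the behavior listed in the commit notes (keyword-only stream, NotImplementedError when a stream is given); the two consumer functions are stand-ins for a library that always passes a stream and for the manual workaround. A string replaces the real DLPack capsule.

```python
class FakePlane:
    """Hypothetical producer following the rules from the commit notes."""

    def __dlpack__(self, *, stream=None):
        if stream is not None:
            # The producer cannot synchronize with the consumer's stream.
            raise NotImplementedError("stream synchronization is not supported")
        return "capsule-placeholder"  # a real producer returns a PyCapsule


def consumer_with_stream(obj):
    # Stand-in for a consumer that always passes its current stream.
    return obj.__dlpack__(stream=1)


def consumer_without_stream(obj):
    # The workaround: call __dlpack__() directly with no stream argument.
    return obj.__dlpack__()
```

Because no synchronization happens in this mode, the caller is responsible for making sure the producer's work on the buffer has finished before reading it.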

caffeinism added 3 commits February 5, 2026 01:39

Impl __dlpack__, keep cuda memory

fda4962

Impl VideoFrame.from_dlpack

aaa90db

Impl minimal support device_id

56dd2dc

caffeinism force-pushed the dlpack branch from 5e0f429 to 56dd2dc Compare February 4, 2026 16:39

ruff / isort

9426057

caffeinism force-pushed the dlpack branch from 3ef7b26 to 9426057 Compare February 4, 2026 16:44

WyattBlue added the "needs tests" label (This PR needs a test) Feb 4, 2026

Merge av/hwcontext.pxd into include/libavutil/avutil.pxd

b22e7a5

caffeinism added 4 commits February 5, 2026 11:38

Move av/dlpack.pxd to include/dlpack.pxd

f713f95

Add tests/test_dlpack.py

d87422a

Fix interfaces

0af7dcf

Create VideoFrame using av_frame_get_buffer instead of av_image_alloc

a4a03ae

caffeinism force-pushed the dlpack branch from c346b18 to a4a03ae Compare February 5, 2026 08:51

WyattBlue removed the "needs tests" label (This PR needs a test) Feb 5, 2026

Merge DLDevice to DLTensor

a146710

caffeinism force-pushed the dlpack branch from 3a20278 to b424fa3 Compare February 6, 2026 02:32

caffeinism added 4 commits February 6, 2026 13:05

Remove redundant/unused lines

4b10414

Remove global cuda ctx cache

e498f36

Remove global _hwdevice_registry

4e9c7d6

Set immutable properties

cdd32eb

caffeinism force-pushed the dlpack branch from b424fa3 to cdd32eb Compare February 6, 2026 06:45

Fix missing property in from_dlpack

38616d4

WyattBlue force-pushed the dlpack branch from 8c1f009 to 2c580dd Compare February 8, 2026 02:45

WyattBlue force-pushed the dlpack branch 7 times, most recently from a6e6d6f to eb5c39f Compare February 8, 2026 03:46

output_format -> hw.is_hw_owned

8d36039

WyattBlue force-pushed the dlpack branch from eb5c39f to 8d36039 Compare February 8, 2026 03:56

WyattBlue force-pushed the dlpack branch from cce1832 to 2f9dfeb Compare February 8, 2026 04:30

WyattBlue and others added 2 commits February 7, 2026 23:34

VideoFrame: Don't need np_buffer

c8c9076

WyattBlue force-pushed the dlpack branch from b9d99f0 to c8c9076 Compare February 8, 2026 04:51

WyattBlue merged commit 085b4d2 into PyAV-Org:main Feb 8, 2026
6 checks passed

caffeinism deleted the dlpack branch February 8, 2026 05:18

nagadomi mentioned this pull request Feb 25, 2026

Improve hardware encoder decoder performance nagadomi/nunif#356

Closed

bellegee2-create mentioned this pull request Apr 25, 2026

Feature request: hardware video decode (nvdec/CUDA) for MKV input nagadomi/nunif#672

Closed

WyattBlue merged 19 commits into PyAV-Org:main from caffeinism:dlpack