Preserving hardware memory during cuvid decoding, exporting/importing via dlpack.#2155

Merged: WyattBlue merged 19 commits into PyAV-Org:main from caffeinism:dlpack on Feb 8, 2026

@caffeinism caffeinism commented Feb 4, 2026

Contributor

#2148

Hello! I'm a user with limited knowledge of libav, dlpack, and Cython. However, since I see this as a necessary feature, I drafted it with the help of an LLM.

Motivation

If an application decodes video, performs GPU operations, and then re-encodes the result, PyAV currently incurs a significant number of memory copies: GPU (cuvid) -> CPU (PyAV) -> GPU (Torch, etc.) -> CPU (PyAV) -> GPU (nvenc). If frames decoded by cuvid could instead be exported via dlpack while staying on the GPU, they would never need to move to CPU memory.

All existing tests pass, but with such extensive modifications it is difficult for a beginner like me to catch every detail. Since most of the changes add features rather than modify existing ones, I hope this PR serves as a good starting point.

Usage example

import av
from av.codec.hwaccel import HWAccel
import torch

hwaccel = HWAccel(
    device_type="cuda",
    device=0,
    allow_software_fallback=False,
    output_format="hw", # preserve hw memory
)

# decode using cuvid
with av.open(from_video_filename, "r", hwaccel=hwaccel) as c:
    frame = next(c.decode(video=0))
    y = torch.from_dlpack(frame.planes[0]) # device(type='cuda', index=0), torch.uint8, torch.Size([H, W])
    uv = torch.from_dlpack(frame.planes[1]) # device(type='cuda', index=0), torch.uint8, torch.Size([H/2, W/2])

f = av.VideoFrame.from_dlpack(((y*0.5).to(torch.uint8), uv)) # some operation

with av.open(to_video_filename, "w") as c:
    s = c.add_stream("h264_nvenc", rate=24) # encode using nvenc
    for it in s.encode(f):
        c.mux(it)
    for it in s.encode(None):
        c.mux(it)
@WyattBlue added the "needs tests" label Feb 4, 2026
@caffeinism

Contributor Author

@WyattBlue If I add tests, is it acceptable if they only run on a CUDA machine? I don't think they will run in the GitHub workflow.

@WyattBlue

WyattBlue commented Feb 4, 2026

Member

You need to test the interface. For example, hw_format does not have a .pyi interface, and writing a test would catch that fact.

@WyattBlue

Member

av/hwcontext.pxd should be merged with include/avutil. *.pxd files should otherwise not be free radicals, i.e., they should have a corresponding real .py file.

@caffeinism

Contributor Author

You need to test the interface. For example, hw_format does not have a .pyi interface, and writing a test would catch that fact.

Could you please explain it in a bit more detail?

av/hwcontext.pxd should be merged with include/avutil. *.pxd files should otherwise not be free radicals, i.e., they should have a corresponding real .py file.

In this case, how should dlpack.pxd be handled? Should this also be moved to the include directory?

@caffeinism

Contributor Author

@WyattBlue Could you take a look at the last commit section? I modified the buffer creation logic for frames generated by VideoFrame(), VideoFrame.from_ndarray(), and VideoFrame.reformat to support dlpack.

@WyattBlue removed the "needs tests" label Feb 5, 2026
@WyattBlue

Member
  _cuda_device_ctx_cache = {}
  _cuda_frames_ctx_cache = {}

These grow indefinitely with no cleanup mechanism. For long-running applications using many different frame sizes, this could lead to memory growth. The global caches in frame.py and the registry in _hwdevice_registry.py are also not thread-safe. This is worth documenting or addressing.
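
One possible way to address both points is a small size-bounded cache guarded by a lock. The sketch below is only an illustration: `BoundedCache`, its `maxsize`, and the `(device_id, width, height, format)` key layout are assumptions, not names from this PR, and the cached values are placeholders rather than real CUDA contexts.

```python
import threading
from collections import OrderedDict


class BoundedCache:
    """Small thread-safe LRU cache (hypothetical helper, not part of PyAV)."""

    def __init__(self, maxsize=8):
        self._maxsize = maxsize
        self._items = OrderedDict()
        self._lock = threading.Lock()

    def get_or_create(self, key, factory):
        # Hold the lock for the whole lookup so two threads never build
        # the same entry at the same time.
        with self._lock:
            if key in self._items:
                self._items.move_to_end(key)  # mark as most recently used
                return self._items[key]
            value = factory()
            self._items[key] = value
            if len(self._items) > self._maxsize:
                self._items.popitem(last=False)  # evict least recently used
            return value

    def __len__(self):
        with self._lock:
            return len(self._items)


# Assumed key layout: (device_id, width, height, format). Values are placeholders.
frames_ctx_cache = BoundedCache(maxsize=2)
for size in [(1920, 1080), (1280, 720), (640, 360)]:
    frames_ctx_cache.get_or_create((0, *size, "nv12"), lambda: object())
```

With `maxsize=2`, the third insertion evicts the oldest entry, so the cache never holds more than two items. In real code an evicted context would still need to stay alive for as long as any frame references it.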

Redundant None check (av/video/frame.py:596-597)
  if primary_ctx is None:
      primary_ctx = True
The function signature already has primary_ctx: bool = True, so this check is unreachable.

Duplicate validation (av/video/frame.py:498-500)
  if dev0 != dev1:
      raise ValueError("plane tensors must be on the same CUDA device")
  if dev_type0 == kDLCUDA:
      if dev0 != dev1:  # Redundant - already checked above
          raise ValueError("plane tensors must be on the same CUDA device")
The second check at lines 499-500 is redundant since dev0 != dev1 is already checked at lines 496-497.
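
A minimal sketch of the validation with the duplicate removed might look like the following. The function name `check_plane_devices`, the tuple arguments, and the device-type comparison are assumptions added for illustration; only the device-id check and its error message come from the snippet above. The value 2 is kDLCUDA in the DLPack `DLDeviceType` enum.

```python
K_DL_CUDA = 2  # kDLCUDA in the DLPack DLDeviceType enum


def check_plane_devices(device0, device1):
    """Validate two (device_type, device_id) pairs, as returned by __dlpack_device__().

    Hypothetical helper: the device-type comparison is an assumption,
    the device-id comparison mirrors the check quoted in the review.
    """
    type0, id0 = device0
    type1, id1 = device1
    if type0 != type1:
        raise ValueError("plane tensors must have the same device type")
    # A single comparison is enough; repeating it inside a kDLCUDA branch adds nothing.
    if id0 != id1:
        raise ValueError("plane tensors must be on the same CUDA device")


check_plane_devices((K_DL_CUDA, 0), (K_DL_CUDA, 0))  # passes silently
```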

I'm not sure why we're importing PyObject; imports that aren't used should be removed.

@caffeinism

Contributor Author
  _cuda_device_ctx_cache = {}
  _cuda_frames_ctx_cache = {}

These grow indefinitely with no cleanup mechanism. For long-running applications using many different frame sizes, this could lead to memory growth. The global caches in frame.py and the registry in _hwdevice_registry.py are not thread-safe. Worth documenting or addressing

This part is complex due to my limited implementation skills, so I left the global cache in place reluctantly. I'll give it more thought.

Redundant None check (av/video/frame.py:596-597) if primary_ctx is None: primary_ctx = True The function signature already has primary_ctx: bool = True, so this check is unreachable.

There was some deliberation about whether to set the default value to bool | None = None or bool = True. I will delete that check.
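
For reference, the two signature options differ as sketched below. The function names are hypothetical and only illustrate where the None check belongs; they are not taken from the PR.

```python
from typing import Optional


def resolve_optional(primary_ctx: Optional[bool] = None) -> bool:
    # Option A: None means "caller did not choose", so the check is required.
    if primary_ctx is None:
        primary_ctx = True
    return primary_ctx


def resolve_bool(primary_ctx: bool = True) -> bool:
    # Option B: the default already supplies True, so no None check is needed.
    return primary_ctx
```

Both functions return True when called without arguments; only option A needs the extra branch.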

Duplicate validation (av/video/frame.py:498-500) if dev0 != dev1: raise ValueError("plane tensors must be on the same CUDA device") if dev_type0 == kDLCUDA: if dev0 != dev1: # Redundant - already checked above raise ValueError("plane tensors must be on the same CUDA device") The second check at line 499-500 is redundant since dev0 != dev1 is already checked at line 496-497.

I didn't notice it during my review.

I'm not sure why we're importing PyObject; imports that aren't used should be removed.

I realized that a raw PyObject* is not reference-counted automatically by Cython, so I replaced it with object, but I forgot to remove the import. I will check this part as well.

@WyattBlue force-pushed the dlpack branch 7 times, most recently from a6e6d6f to eb5c39f February 8, 2026 03:46
- Add _device_id field to VideoFrame, set it in from_dlpack and HW decode path
- Store CudaContext on frame (_cuda_ctx) to prevent premature GC
- Fix plane.py to use frame._device_id

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WyattBlue and others added 2 commits February 7, 2026 23:34
…tream

- Pass is_hw_owned when cloning HWAccel so decoded frames stay on device
- Raise NotImplementedError when __dlpack__ is called with a CUDA stream
  since PyAV cannot perform stream synchronization
- Make stream keyword-only to match DLPack spec

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
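
The stream handling described in this commit can be sketched roughly as follows. `PlaneExporter` is a hypothetical stand-in, not PyAV's actual plane class, and treating every non-None stream as unsupported is an assumption based on the commit message.

```python
class PlaneExporter:
    """Hypothetical wrapper that forwards DLPack export to an underlying tensor."""

    def __init__(self, tensor):
        # Any object that itself implements __dlpack__ and __dlpack_device__.
        self._tensor = tensor

    def __dlpack__(self, *, stream=None):
        # stream is keyword-only, as in the DLPack Python specification.
        if stream is not None:
            # No stream synchronization is available, so refuse instead of
            # silently ignoring the request.
            raise NotImplementedError("stream synchronization is not supported")
        return self._tensor.__dlpack__()

    def __dlpack_device__(self):
        return self._tensor.__dlpack_device__()
```

Because `stream` is keyword-only, a positional call such as `exporter.__dlpack__(1)` fails with TypeError, while `exporter.__dlpack__(stream=1)` raises NotImplementedError.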
@WyattBlue merged commit 085b4d2 into PyAV-Org:main Feb 8, 2026
6 checks passed
@caffeinism deleted the dlpack branch February 8, 2026 05:18
@caffeinism

Contributor Author

Thanks!

If anyone arrives here via the issue: torch's from_dlpack seems to pass a non-None stream by default, which hits the NotImplementedError described above. As a workaround, you can call __dlpack__() yourself and pass the result, for example torch.from_dlpack(frame.planes[0].__dlpack__()).
