There are a few fully multimodal / omni-style large models, but if the more general goal is “I want my OSS local chat/RAG setup to call text-to-image (T2I) or text-to-video (T2V) models”, I would usually run the image/video models as separate local server processes and connect them to the chat side as a pipeline. The costs of running, debugging, and swapping out components are usually lower that way. Existing frameworks already cover a lot of this:
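To make the separate-process idea concrete before getting to those frameworks, here is a minimal sketch of the wiring on the chat side. Everything specific in it is an assumption for illustration: the `/image` command prefix, the `T2I_URL` endpoint, the `{"prompt": ...}` JSON shape, and the helper names are all hypothetical, not the API of any particular image server.

```python
import json
import urllib.request

# Hypothetical address of an image server running as its own local process.
# The port, route, and JSON fields are assumptions, not a real product's API.
T2I_URL = "http://127.0.0.1:8001/generate"


def wants_image(user_message: str) -> bool:
    """Naive router: treat messages that start with '/image' as T2I requests."""
    return user_message.strip().lower().startswith("/image")


def call_t2i(prompt: str, url: str = T2I_URL, timeout: float = 120.0) -> dict:
    """POST the prompt as JSON to the image server and return its JSON reply."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def handle_message(user_message: str, chat_fn, t2i_fn=call_t2i) -> dict:
    """Send image requests to the T2I process and everything else to the chat model."""
    if wants_image(user_message):
        # Drop the '/image' prefix and pass the rest along as the prompt.
        prompt = user_message.strip()[len("/image"):].strip()
        return {"type": "image", "result": t2i_fn(prompt)}
    return {"type": "text", "result": chat_fn(user_message)}
```

Here `chat_fn` stands in for whatever function calls the local LLM, and `t2i_fn` can be replaced with a stub, so the routing can be tested without either server running. Because the image model sits behind a plain HTTP call, replacing it with a different backend should only mean changing `call_t2i`.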
##…