VAE HERE: https://civitai.com/models/146075
"SD-XL Inpainting 0.1 is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input, with the extra capability of inpainting the pictures by using a mask.
The SD-XL Inpainting 0.1 was initialized with the stable-diffusion-xl-base-1.0 weights. The model was trained for 40k steps at resolution 1024x1024, with 5% dropping of the text-conditioning to improve classifier-free guidance sampling. For inpainting, the UNet has 5 additional input channels (4 for the encoded masked-image and 1 for the mask itself) whose weights were zero-initialized after restoring the non-inpainting checkpoint. During training, we generate synthetic masks and, in 25% of cases, mask everything."
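A rough illustration of what the quoted description means, in plain Python with no ML libraries. This is only a sketch under assumptions, not the actual training code: the function names are made up for this example, the input layer is simplified to a per-pixel linear map (the real UNet uses a convolution), the channel order (latent, mask, masked-image latent) is assumed, and an empty prompt is assumed to stand in for "dropped" text-conditioning.

```python
import random

LATENT_CHANNELS = 4  # input channels of the base (non-inpainting) UNet
EXTRA_CHANNELS = 5   # 1 mask channel + 4 encoded masked-image channels


def expand_input_weights(base_weights):
    """Append zero-initialized weights for the 5 extra input channels.

    base_weights: one row per output channel, each row holding 4 input weights.
    """
    return [row + [0.0] * EXTRA_CHANNELS for row in base_weights]


def build_unet_input(noisy_latent, mask, masked_image_latent):
    """Concatenate one pixel's channels: 4 + 1 + 4 = 9 values (order assumed)."""
    return noisy_latent + mask + masked_image_latent


def apply_layer(weights, pixel):
    """Per-pixel linear map, a simplified stand-in for the UNet input layer."""
    return [sum(w * x for w, x in zip(row, pixel)) for row in weights]


def sample_training_conditions(rng, caption):
    """Draw one training example's conditioning using the quoted percentages."""
    text = "" if rng.random() < 0.05 else caption  # 5% text-conditioning dropout
    mask_everything = rng.random() < 0.25          # 25% of cases: mask the whole image
    return text, mask_everything


if __name__ == "__main__":
    base = [[0.5, -1.0, 2.0, 0.25], [1.0, 0.0, 0.0, -1.0]]
    expanded = expand_input_weights(base)
    latent = [1.0, 2.0, 3.0, 4.0]
    pixel = build_unet_input(latent, [1.0], [9.0, 8.0, 7.0, 6.0])
    # Because the new weights start at zero, both outputs match at initialization.
    print(apply_layer(base, latent), apply_layer(expanded, pixel))
    print(sample_training_conditions(random.Random(0), "a photo of a cat"))
```

The point of the zero-initialization is that the inpainting model starts out behaving exactly like the base checkpoint and only gradually learns to use the mask and masked-image channels during the 40k training steps.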
Guide for ComfyUI: https://mybyways.com/blog/using-the-sdxl-inpainting-01-model
source: https://huggingface.co/diffusers/stable-diffusion-xl-1.0-inpainting-0.1