Learning controllable visual representations: advancing spatial and 3D primitive guidance for image synthesis