Towards Multimodal Scene Graph Generation Approaches to Video Understanding