The term Multimodal Large Language Model (MLLM) refers to a type of generative AI that processes and generates text together with other modalities such as images, video, or audio. Unlike purely text-based Large Language Models (LLMs), MLLMs can access and combine several data types at once, allowing them to deliver more comprehensive and precise answers to questions and prompts.
In building operations, MLLMs are particularly valuable because they can, for example, combine sensor data with visual information such as a plan of a heating, ventilation, and air conditioning (HVAC) system. The MLLM would not only analyze the image but also incorporate other available sources, such as operating manuals or real-time operational data from the building, to generate informed responses. Building owners and technical operations teams can thus retrieve information from multiple sources in natural language, test hypotheses, and receive immediate support in their daily decisions, significantly enhancing operational efficiency.
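The fusion described above can be sketched as assembling one multimodal query from the different sources. The snippet below is a minimal illustration, not a specific vendor API: the function name `build_hvac_query` and the payload schema are assumptions chosen for clarity, though the base64-encoded image part mirrors what many MLLM APIs expect.

```python
# Hypothetical sketch: bundling an HVAC plan image, real-time sensor
# readings, and a manual excerpt into a single multimodal MLLM request.
# The schema and function name are illustrative assumptions.
import base64
import json

def build_hvac_query(question: str, plan_image: bytes,
                     sensor_readings: dict, manual_excerpt: str) -> dict:
    """Combine text and image context into one multimodal request payload."""
    return {
        "question": question,
        "context": [
            # Structured real-time data from the building management system
            {"type": "text",
             "text": "Sensor readings: " + json.dumps(sensor_readings)},
            # Relevant passage retrieved from the operating manual
            {"type": "text", "text": "Manual excerpt: " + manual_excerpt},
            # The HVAC plan drawing, base64-encoded for transport
            {"type": "image",
             "data": base64.b64encode(plan_image).decode("ascii")},
        ],
    }

payload = build_hvac_query(
    question="Why is the supply air temperature in zone 3 above setpoint?",
    plan_image=b"\x89PNG...",  # placeholder bytes; a real plan image in practice
    sensor_readings={"zone3_supply_temp_C": 24.5, "setpoint_C": 21.0},
    manual_excerpt="If supply temperature exceeds setpoint, check valve V-12.",
)
print(len(payload["context"]))  # three context parts: sensors, manual, image
```

In a real deployment, this payload would be sent to an MLLM endpoint, which grounds its natural-language answer in all three context parts rather than the image alone.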