The Future of Multimodal AI Applications

Multimodal Lego Blocks
Stefania Druga, Infobip Shift Miami | May 6, 2025

Beyond Text: AI That Perceives the World

Imagine AI that doesn't just process text, but perceives the world alongside us – seeing our experiments, hearing our questions, sensing the environment.

The Limitation

Current text-centric AI often misses the richness of real-world context and lacks direct perception.

The Need

Why AI that sees, hears, senses? To build more intuitive, grounded, and truly helpful systems.

Vision: Real-Time Multimodal AI

AI systems that seamlessly integrate and synthesize information from diverse, real-time data streams:

  • Live Webcams & Video Feeds
  • Microphone Audio
  • Connected Sensors (Temperature, Location, etc.)

Understanding context, anticipating needs, and responding dynamically through multi-sensory feedback loops.

Sensors Context Analysis Diagram

Why Multimodal AI Matters

Beyond Text

Traditional AI systems rely primarily on text, limiting their ability to understand and interact with the rich, multimodal world humans naturally navigate.

Real-Time Interaction

The most impactful AI systems don't just analyze; they respond dynamically to changing inputs from multiple streams with minimal latency.

Grounding

By connecting language to sensory inputs (vision, audio, sensors), multimodal AI anchors abstract concepts in real-world perception, leading to deeper understanding and reduced ambiguity.

The Multimodal AI Blueprint

Case Studies

Let's see how this blueprint applies in practice with examples from our research...

Gemini Smart Home

Conversational control of smart devices directly within the Gemini chat experience.

ChemBuddy

Making abstract chemistry tangible through real-time sensing and interaction.

MathMind

Visually identifying and addressing mathematical misconceptions on the fly.

Cognimates Copilot

Supporting creative coding beyond text with multimodal assistance.

Case Study: Gemini & Google Home Integration

Bringing natural language smart home control directly into the Gemini AI chat experience.

Core Idea:

  • Control lights, climate, media, etc. via Gemini prompts.
  • Example: "Set the dining room for a romantic date night."
  • Reduces friction by keeping control within the AI chat context.

Project Goal: Make smart home interaction more intuitive and conversational.

Gemini Google Home Extension UI

Gemini Home Extension: Details & Early Access

Capabilities & Limitations:

  • Controls: Lighting, climate, window coverings, TVs, speakers, etc.
  • Requires Home App for: Security devices (cameras, locks), Routines.
  • Activation: May need '@Google Home' in prompts initially.
  • Context: Part of broader industry trend (cf. Alexa, Siri AI upgrades).

Sign Up for Public Preview:

Get early access to this feature (Android, English only initially) via the Google Home Public Preview program.

Join Google Home Public Preview

Requires signing into Gemini with the same account as Google Home.

Case Study 1: ChemBuddy

AI-Powered Chemistry Lab Assistant

Making abstract chemistry tangible through real-time sensing and interaction.

Core Features:

  • Real-world pH sensing via Jacdac
  • AI analyzes sensor data & user actions
  • Adaptive guidance based on experiment state

Goal: Bridge concrete actions with abstract concepts.

ChemBuddy Overview

ChemBuddy: Architecture

ChemBuddy Architecture Diagram

ChemBuddy: Implementation & Insight

Implementation Highlight: Real-time Loop (Core Logic)

Using WebSockets, the client sends sensor/image data; the server fuses it and calls the AI for immediate feedback.

ChemBuddy: WebSocket Multimodal Fusion
// Client: Send multimodal update via WebSocket
async function sendUpdate(text, imageBlob, sensorData) {
  let imageBase64 = null;
  if (imageBlob) {
    imageBase64 = await blobToBase64(imageBlob); // Convert Blob to Base64
  }
  const messagePayload = {
    text: text,
    sensorData: sensorData,
    image: imageBase64 // Embed Base64 image data (or null)
  };
  // Send a single JSON message containing all data
  socket.send(JSON.stringify({ type: 'chem_update_request', payload: messagePayload }));
}

// Server: Handle WebSocket message
socket.on('message', async (message) => {
  try {
    // Parse the incoming JSON message
    const received = JSON.parse(message.toString());

    if (received.type === 'chem_update_request' && received.payload) {
      const { text, sensorData, image } = received.payload; // Extract data
      let imageBuffer = null;
      if (image) {
        imageBuffer = Buffer.from(image, 'base64'); // Decode Base64 image if present
      }
      // --> Call Multimodal AI (Gemini) with combined data
      // Note: Pass imageBuffer (binary) or image (base64) based on API needs
      const aiResponse = await callGeminiApi({ text, sensorData, imageBuffer });
      // Send AI response back to client
      socket.send(JSON.stringify({ type: 'ai_response', payload: { text: aiResponse } }));
    }
  } catch (e) {
    // Handle potential errors (e.g., JSON parsing)
    console.error("Failed to process message:", e);
  }
});
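For completeness, a minimal sketch of the two helpers the snippet above assumes: blobToBase64 on the client and callGeminiApi on the server. The server sketch uses the @google/generative-ai Node SDK; the model name and prompt wording are illustrative placeholders rather than the actual ChemBuddy implementation.

// Client helper (browser): convert a Blob (e.g., a webcam frame) to a Base64 string
function blobToBase64(blob) {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    // reader.result looks like "data:image/jpeg;base64,AAAA..."; keep only the payload
    reader.onloadend = () => resolve(reader.result.split(',')[1]);
    reader.onerror = reject;
    reader.readAsDataURL(blob);
  });
}

// Server helper (Node): fuse text, sensor readings, and an optional image into one Gemini call
const { GoogleGenerativeAI } = require('@google/generative-ai');
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);

async function callGeminiApi({ text, sensorData, imageBuffer }) {
  const model = genAI.getGenerativeModel({ model: 'gemini-1.5-flash' }); // model name is illustrative
  const parts = [
    `Student said: ${text}\nCurrent sensor readings: ${JSON.stringify(sensorData)}\n` +
    'Give brief, safety-conscious guidance for this chemistry experiment.'
  ];
  if (imageBuffer) {
    // Gemini expects inline image data as Base64
    parts.push({ inlineData: { mimeType: 'image/jpeg', data: imageBuffer.toString('base64') } });
  }
  const result = await model.generateContent(parts);
  return result.response.text();
}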
Key Insight/Finding:

Integrating real-time sensor data significantly improved the AI's ability to provide relevant, safety-conscious, and conceptually accurate guidance during experiments. It grounds the conversation in the physical reality of the lab.

Case Study 2: MathMind

Visually Identifying Math Misconceptions

Visually identifying and addressing mathematical misconceptions on the fly.

Core Features:

  • Real-time vision analysis of handwritten work
  • Classification of errors against misconception taxonomy
  • Targeted, multimodal feedback (visual hints, explanations)

Goal: Provide timely, personalized scaffolding for math learning.

MathMind Misconception Example

MathMind: Architecture

MathMind Architecture Diagram

MathMind: Vision API Call for Misconception Detection

The client posts a snapshot of the student's handwritten work to the backend, which forwards it, together with the misconception taxonomy, to a vision-language model and returns the analysis.
// Client: Call backend API to analyze image
async function analyzeMathWork(imageBase64, promptText, taxonomy) {
  try {
    const response = await fetch('/api/analyze', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ imageBase64, prompt: promptText, taxonomy })
    });
    if (!response.ok) throw new Error(`HTTP error! status: ${response.status}`);
    const analysis = await response.json();
    displayFeedback(analysis); // Show results in UI
  } catch (error) {
    console.error("Error analyzing math work:", error);
    // Display error to user
  }
}

// Server: Process request and call AI (e.g., Express.js)
app.post('/api/analyze', async (req, res) => {
  try {
    const { imageBase64, prompt, taxonomy } = req.body;
    if (!imageBase64 || !prompt) {
      return res.status(400).json({ error: 'Missing imageBase64 or prompt.' });
    }

    const model = getGeminiVisionModel();
    // Construct the full prompt including taxonomy if needed by the model API
    const fullPrompt = `${prompt} (Consider taxonomy: ${JSON.stringify(taxonomy)})`;

    const aiResponse = await model.generateContent([fullPrompt, { inlineData: { mimeType: 'image/jpeg', data: imageBase64 } }]);
    // Assuming aiResponse structure allows direct sending or needs transformation
    res.json({ analysis: aiResponse.response.candidates[0].content.parts[0].text });
  } catch (error) {
    console.error("Error processing AI analysis:", error);
    res.status(500).json({ error: 'Failed to analyze image.' });
  }
});
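The snippet leaves the shape of the taxonomy object and the prompt wording open; below is a hypothetical example of what the client might pass to analyzeMathWork. The misconception ids and JSON output contract are illustrative, not the actual MathMind taxonomy.

// Hypothetical misconception taxonomy sent to /api/analyze
const fractionTaxonomy = [
  { id: 'ADD_DENOMINATORS', description: 'Adds denominators when adding fractions (1/2 + 1/3 = 2/5)' },
  { id: 'NO_COMMON_DENOMINATOR', description: 'Adds numerators without finding a common denominator' },
  { id: 'WHOLE_NUMBER_BIAS', description: 'Assumes the fraction with the larger denominator is larger' }
];

// Hypothetical prompt asking the vision model for a structured verdict
const promptText =
  'Analyze the handwritten math work in the image. If the answer is wrong, pick the single best ' +
  'matching misconception id from the taxonomy and return JSON: ' +
  '{ "misconceptionId": string | null, "explanation": string, "hint": string }.';

// imageBase64 would come from a canvas snapshot or camera capture
analyzeMathWork(imageBase64, promptText, fractionTaxonomy);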
Key Insight/Finding:

Direct visual analysis of student work allows for highly specific and timely misconception detection, enabling the generation of truly personalized feedback and practice.

MathMind: Evaluation Results

MathMind Evaluation Example showing student work analysis

Case Study 3: Cognimates Copilot

Cognimates Copilot Overview

Cognimates Copilot: Implementation Details

Implementation Highlight: AI Call & Response Parsing

User input triggers an AI call; the response (text, code, or image data) is parsed and integrated back into the coding environment (like Scratch).

Cognimates: Multimodal Response Handling (Conceptual)
// Core logic when user asks for help or an asset
async function handleUserInput(prompt, context) {
  // --> 1. Call AI Model (e.g., Gemini) with prompt and context
  const aiResponse = await callAIModelAPI(prompt, context);

  // 2. Parse response to determine type (text, code, image)
  const parsed = parseAIResponse(aiResponse);

  // 3. Integrate back into UI
  if (parsed.type === 'image_asset') {
    // --> Optional: Call background removal API/library
    const finalImage = await removeBackground(parsed.imageBase64);
    displayImageAsset(finalImage); // Add to Scratch assets
  } else if (parsed.type === 'code_suggestion') {
    displayCodeSuggestion(parsed.codeBlocks); // Show blocks in UI
  } else {
    displayExplanation(parsed.text); // Show text in chat
  }
}
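parseAIResponse above is a placeholder; one possible sketch, assuming the model is prompted to reply with a small JSON envelope that tags the response type (the field names are assumptions):

// Hypothetical parser: assumes the model replies with
// { "type": "text" | "code_suggestion" | "image_asset", ... }
function parseAIResponse(aiResponse) {
  try {
    const parsed = JSON.parse(aiResponse);
    if (parsed.type === 'image_asset' && parsed.imageBase64) {
      return { type: 'image_asset', imageBase64: parsed.imageBase64 };
    }
    if (parsed.type === 'code_suggestion' && Array.isArray(parsed.codeBlocks)) {
      return { type: 'code_suggestion', codeBlocks: parsed.codeBlocks };
    }
    return { type: 'text', text: parsed.text || aiResponse };
  } catch {
    // Model did not return JSON; treat the whole response as an explanation
    return { type: 'text', text: aiResponse };
  }
}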
Key Insight/Finding:

Multimodal copilots can significantly lower barriers to creative expression in coding by offering contextual help and asset generation directly within the workflow, moving beyond simple text-based assistance.

Ethical Touchpoint: Important to consider user agency and avoid over-reliance, ensuring the AI assists rather than dictates the creative process.

Copilot Evaluation

Copilot Evaluation Benchmark

Goal: Evaluate copilot performance in real-world scenarios.

The Future Trajectory: Beyond Today's Examples

The principles behind these examples point towards a broader future for real-time multimodal AI:

Personalized Assistance

AI understanding context (location, activity, sensors) for proactive help.

Accessibility Tools

Real-time translation between visual, audio, and haptic information.

Robotics

Machines perceiving, understanding, and interacting naturally and safely.

Creative Tools

AI partners collaborating via sketches, gestures, voice, code.

Enhanced Learning

Truly adaptive education responding to diverse styles and real-world needs.

Human-AI Collaboration

Richer, intuitive partnerships via shared perception.

Evaluating Multimodal Apps: Benchmarks & Strategies

Assessing the capabilities and reliability of multimodal applications requires diverse evaluation methods:

Key Benchmarks

  • MMMU, MathVista, MMStar (General)
  • DocVQA, TextVQA (Documents)
  • Video-MME, CinePile (Video)
  • Domain-Specific (e.g., Healthcare, Robotics)

Evaluating capabilities across diverse tasks and modalities.

Testing Strategies

  • Component & End-to-End Testing
  • Real-world Scenario Simulation
  • Robustness Testing (Noise, Adversarial)
  • Monitoring & Feedback Loops in Deployment
  • Measuring Latency, Throughput, Resource Use

Ensuring reliability beyond standard metrics.

Key challenges include designing metrics that reflect real user experience and testing complex, multi-turn interactions.
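As one concrete way to measure latency, a minimal probe that replays the same multimodal request against a deployed endpoint and reports percentiles; the /api/analyze route and payload fields are assumptions carried over from the MathMind example.

// Minimal latency probe: send the same multimodal request N times and report percentiles
async function measureLatency(url, payload, runs = 50) {
  const timings = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    const res = await fetch(url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload)
    });
    await res.json(); // stop the clock only after the full response arrives
    timings.push(performance.now() - start);
  }
  timings.sort((a, b) => a - b);
  const pct = p => timings[Math.min(timings.length - 1, Math.floor(p * timings.length))];
  console.log(`p50=${pct(0.5).toFixed(0)}ms p95=${pct(0.95).toFixed(0)}ms max=${timings[timings.length - 1].toFixed(0)}ms`);
}

// Usage: measureLatency('/api/analyze', { imageBase64, prompt: 'Check this work.', taxonomy: [] });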

Evaluating Creative Coding Copilots

Assessing AI assistance in open-ended creative tasks requires different approaches than traditional benchmarks.

Challenges & Approaches:
  • Defining "success" in creative tasks is subjective.
  • Need to evaluate the *process* as much as the *product*.
  • Measuring impact on user learning, exploration, and overcoming blocks.
  • Developing benchmarks that simulate real coding scenarios (e.g., completing a partial project, debugging, generating specific assets).
Example: Cognimates Copilot Evaluation

Using project-based scenarios to evaluate the copilot's ability to provide relevant code suggestions, explain concepts, and generate useful visual assets within the Scratch environment.

Cognimates Copilot Evaluation Benchmark Example
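One way to encode such a project-based scenario is as a small, replayable test case; the fields below are a hypothetical sketch, not the actual Cognimates benchmark format.

// Hypothetical scenario definition for a copilot evaluation harness
const scenario = {
  id: 'scratch-platformer-01',
  startingProject: 'projects/platformer_partial.sb3', // partially built Scratch project
  userTurns: [
    'Why does my sprite fall through the platform?',
    'Make me a pixel-art coin sprite with a transparent background.'
  ],
  expectations: {
    codeSuggestion: { mustMentionBlocks: ['if <touching [platform]>', 'change y by'] },
    imageAsset: { requiresTransparentBackground: true },
    explanation: { maxReadingLevel: 'grade 5' }
  }
};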

AI Assistants Evaluation: A User-Centered Framework

https://github.com/kaushal0494/UnifyingAITutorEvaluation

The Unifying AI Tutor Evaluation framework proposes a taxonomy to assess the pedagogical abilities of LLM-based tutors across key dimensions.

This structured approach uses a detailed JSON format to capture assistant responses and annotate them across dimensions like mistake identification, guidance quality, coherence, and tone.

Diagram illustrating the AI Tutor Evaluation Framework components
Evaluation Structure (Simplified JSON)
{
  "conversation_id": "...",
  "conversation_history": "...",
  "Ground_Truth_Solution": "...",
  "anno_llm_responses": {
    "Model_Name": {
      "response": "Tutor response...",
      "annotation": {
        "Mistake_Identification": "Yes/No/...",
        "Mistake_Location": "Yes/No/...",
        "Revealing_of_the_Answer": "Yes/No/...",
        "Providing_Guidance": "Yes/No/...",
        "Actionability": "Yes/No/...",
        "Coherence": "Yes/No/...",
        "Tutor_Tone": "Encouraging/Neutral/...",
        "Humanlikeness": "Yes/No/..."
      }
    }
  }
}
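Given a list of conversations in this format, per-dimension results can be tallied with a few lines of code; a minimal sketch using the field names from the JSON above (categorical dimensions such as Tutor_Tone simply won't register as "Yes"):

// Count how often each model is annotated "Yes" on each dimension
function aggregateAnnotations(conversations) {
  const tally = {}; // { modelName: { dimension: { yes, total } } }
  for (const convo of conversations) {
    for (const [modelName, entry] of Object.entries(convo.anno_llm_responses)) {
      tally[modelName] = tally[modelName] || {};
      for (const [dimension, label] of Object.entries(entry.annotation)) {
        const cell = (tally[modelName][dimension] = tally[modelName][dimension] || { yes: 0, total: 0 });
        cell.total += 1;
        if (String(label).startsWith('Yes')) cell.yes += 1;
      }
    }
  }
  return tally;
}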

Advanced Data Fusion Techniques

Integrating diverse data streams (vision, audio, sensor, text) effectively is key, moving beyond simple concatenation to techniques such as:

  • Deep Learning Fusion: End-to-end models with modality-specific branches.
  • Attention Mechanisms: Cross-modal attention to weigh feature importance dynamically.
  • Transformer Variants: Models like ViLBERT, CoCa for powerful cross-modal interaction.
  • Graph Neural Networks (GNNs): Modeling relationships in structured data/robotics.
  • Recurrent Architectures: LSTMs/GRUs adapted for multimodal time-series.

Goal: Create unified representations capturing richer context than single modalities alone.

Data Fusion Approaches Diagram
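To make cross-modal attention concrete, a toy numeric sketch in plain JavaScript: audio feature vectors act as queries over visual feature vectors (keys and values), so each audio step receives a vision-weighted summary. Real systems learn these projections inside a trained network; this only illustrates the mechanics.

// Toy scaled dot-product attention: audio queries attend over vision keys/values
const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);
const softmax = xs => {
  const m = Math.max(...xs);
  const exps = xs.map(x => Math.exp(x - m));
  const sum = exps.reduce((s, x) => s + x, 0);
  return exps.map(x => x / sum);
};

function crossModalAttention(audioQueries, visionKeys, visionValues) {
  const scale = Math.sqrt(visionKeys[0].length);
  return audioQueries.map(q => {
    const weights = softmax(visionKeys.map(k => dot(q, k) / scale));
    // Weighted sum of vision values -> one fused feature vector per audio step
    return visionValues[0].map((_, d) =>
      weights.reduce((s, w, i) => s + w * visionValues[i][d], 0));
  });
}

// Tiny example: 2 audio steps attending over 3 vision regions (4-dim features)
const fused = crossModalAttention(
  [[0.1, 0.3, 0.0, 0.5], [0.9, 0.1, 0.2, 0.0]],
  [[0.2, 0.1, 0.0, 0.4], [0.8, 0.0, 0.3, 0.1], [0.1, 0.9, 0.2, 0.2]],
  [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]);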

Tackling Real-Time Challenges: Latency & Synchronization

Real-time multimodal systems require overcoming critical engineering hurdles:

Low Latency

  • Efficient Models (SmolVLM, SSMs)
  • Optimized Inference (TensorRT-LLM, MLX)
  • Quantization (4-bit, FP8)
  • Edge Computing & Asynchronous Processing
  • API Optimization (Batching, Caching)

Goal: Minimize delay for interactive experiences (e.g., <100ms).

Data Synchronization

Crucial for coherent understanding (e.g., aligning streams within a 32-45 ms window)

  • Algorithms: Time/Feature/Model-based Alignment
  • Adaptive Temporal Mapping (Handles Jitter)
  • Cross-Modal Correlation Detection
  • Unified Data Platforms (Minimize Integration Sync Issues)

Goal: Ensure temporal alignment across streams despite network variability.
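A minimal sketch of time-based alignment: pair each video frame with the nearest sensor reading whose timestamp falls within the tolerance window discussed above, and drop frames with no coherent match. The frame/reading object shapes are assumptions.

// Pair video frames with the nearest-in-time sensor reading within a tolerance window
// Both arrays are assumed sorted by timestamp t (milliseconds)
function alignStreams(frames, sensorReadings, toleranceMs = 40) {
  const aligned = [];
  let j = 0;
  for (const frame of frames) {
    // Advance to the sensor reading closest in time to this frame
    while (j + 1 < sensorReadings.length &&
           Math.abs(sensorReadings[j + 1].t - frame.t) <= Math.abs(sensorReadings[j].t - frame.t)) {
      j++;
    }
    if (Math.abs(sensorReadings[j].t - frame.t) <= toleranceMs) {
      aligned.push({ t: frame.t, frame, sensor: sensorReadings[j] });
    } // otherwise drop the frame: no temporally coherent pair
  }
  return aligned;
}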

Graph showing latency spikes and synchronization
Cloud vs Edge

Architectures for Parallel Processing

Handling multiple data streams efficiently often requires parallel processing strategies:

  • Modality-Specific Pipelines: Dedicated processing paths for vision, audio, sensors before fusion.
  • Asynchronous Task Handling: Utilizing background tasks/queues for non-critical processing (e.g., logging, detailed analysis).
  • Hardware Acceleration: Leveraging GPUs, TPUs, or specialized AI chips for computationally intensive tasks.
  • Distributed Systems / Edge Computing: Processing data closer to the source to reduce central load and latency.
  • Optimized Scheduling: Efficiently managing compute resources across parallel tasks.

Goal: Maximize throughput and responsiveness by handling concurrent data streams effectively.

Parallel Processing Diagram
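A sketch of modality-specific pipelines running concurrently, with non-critical work pushed to a background queue; analyzeFrame, transcribeAudio, readSensors, callAIModelAPI, and saveToAnalyticsStore are assumed stand-ins for real per-modality processors and sinks.

// Run per-modality pipelines in parallel, fuse on the critical path, defer the rest
const backgroundQueue = [];

async function processTick({ frame, audioChunk }) {
  // Modality-specific pipelines execute concurrently (assumed helpers)
  const [visionResult, audioResult, sensorResult] = await Promise.all([
    analyzeFrame(frame),
    transcribeAudio(audioChunk),
    readSensors()
  ]);

  // Fusion and the user-facing response stay on the critical path
  const fused = { visionResult, audioResult, sensorResult, t: Date.now() };
  const reply = await callAIModelAPI('Describe the current state.', fused);

  // Logging / detailed analysis is queued and flushed off the critical path
  backgroundQueue.push({ fused, reply });
  return reply;
}

// Drain the background queue periodically, outside the interactive loop
setInterval(() => {
  while (backgroundQueue.length) saveToAnalyticsStore(backgroundQueue.shift());
}, 5000);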

User Research & Testing in Multimodal AI

Understanding user interaction and experience is critical, especially in Human-Robot Interaction (HRI) and copilots:

  • Observational Studies: Analyzing how users interact naturally with the system.
  • Task-Based Evaluations: Measuring success rates, efficiency, and errors on specific tasks.
  • Qualitative Feedback: Interviews, surveys to capture user perception, satisfaction, and pain points.
  • Analyzing Non-Verbal Cues: Using AI to understand user state (engagement, confusion) during interaction.
  • Iterative Design: Incorporating feedback into development cycles.

Goal: Build systems that are intuitive, effective, and meet user needs in real-world contexts.

User Research and Testing Cycle Diagram

Future Application: Accelerating Material Science Discovery

AI analyzing material structures

Multimodal AI can analyze experimental data, simulations, and literature to predict properties of novel materials.

Future Application: AI-Assisted CAD & Engineering Design

AI assisting with CAD software

AI agents can understand design sketches, suggest optimizations, and automate routine CAD tasks based on multimodal input.

Future Application: Enhancing Cultural Heritage Preservation

AI helping analyze cultural artifacts

Analyzing artifacts, translating ancient texts, and creating immersive virtual reconstructions using multimodal data.

Future Application: Documenting & Revitalizing Languages

AI assisting with language documentation

Using audio, video, and text to document endangered languages, create learning tools, and facilitate translation.

Synthetic Data for Evaluation & Fine-tuning

Adapting models for specific tasks or domains often involves specialized training techniques:

  • Fine-tuning: Adapting pre-trained models (e.g., VLMs) on domain-specific datasets.
  • Synthetic Data Generation: Creating artificial data (images, sensor readings, text) to augment limited real-world data, especially for rare events or specific scenarios.
  • Few-Shot / Zero-Shot Learning: Enabling models to perform tasks with minimal or no specific training examples.
  • Custom Sensor Models: Training smaller models on specific sensor inputs (e.g., IMU for activity recognition) for efficiency and specialized tasks.
  • Parameter-Efficient Fine-Tuning (PEFT): Methods like LoRA to adapt large models with fewer trainable parameters.

Goal: Improve performance on target tasks, handle data scarcity, and enable deployment on resource-constrained devices.

Cognimates Model Training Diagram
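As a small example of synthetic sensor data, a sketch that generates noisy pH time series for a titration-like experiment, including the rare "overshoot" failure case that is hard to capture in real lab sessions; every constant here is illustrative.

// Generate a synthetic pH time series for an acid-base titration (values are illustrative)
function syntheticTitrationCurve({ steps = 200, overshoot = false, noise = 0.05 } = {}) {
  const readings = [];
  for (let i = 0; i < steps; i++) {
    const x = i / steps;                                // fraction of titrant added
    let ph = 3 + 8 / (1 + Math.exp(-25 * (x - 0.5)));   // sigmoidal jump near the equivalence point
    if (overshoot && x > 0.55) ph += 1.5;               // rare failure mode: too much base added
    ph += (Math.random() - 0.5) * 2 * noise;            // simulated sensor noise
    readings.push({ t: i * 500, ph: +ph.toFixed(2) });  // one reading every 500 ms
  }
  return readings;
}

// Augment a small real dataset with rare overshoot examples
const syntheticBatch = Array.from({ length: 20 }, () => syntheticTitrationCurve({ overshoot: true }));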

References

  • arXiv. (2024). Multimodality of AI for Education: Towards Artificial General Intelligence.
  • arXiv. (2024). Multimodal Alignment and Fusion: A Survey.
  • PRWeb. (2025). Cartesia Raises $27 Million to Build the Next Generation of Real-Time AI Models.
  • GM Insights. (2025). Multimodal AI Market Size & Share, Statistics Report 2025-2034.
  • LiveKit Blog. (2024). An open source stack for real-time multimodal AI.
  • MIT Technology Review. (2024). Multimodal: AI's new frontier.
  • EliteAI Tools. Mobius Labs: Efficient Multimodal AI for Enterprise Applications.
  • PMC. (2024). Multimodal Fusion Artificial Intelligence Model to Predict Risk for MACE and Myocarditis...
  • Nature. (2025). On opportunities and challenges of large multimodal foundation models in education.
  • NVIDIA. Riva: Speech and Translation AI.
  • arXiv. (2025). SmolVLM: Redefining small and efficient multimodal models.
  • ScienceDirect. (2025). Taking the next step with generative artificial intelligence: The transformative role of multimodal large language models in science education.
  • U.S. Department of Education. (2024). AI Report.
  • World Economic Forum. (2024). The future of learning: AI is revolutionizing education 4.0.
  • Zilliz. (2024). Top 10 Multimodal AI Models of 2024.
  • Druga, S., et al. (Relevant publications for ChemBuddy, MathMind, Cognimates - *Add Specific Citations*)
  • ResearchGate. (2024). Advances in Computer AI-assisted Multimodal Data Fusion Techniques.
  • ResearchGate. (2025). Real-Time Multimodal Signal Processing for HRI in RoboCup...
  • ResearchGate. Designing the User Interface for Multimodal Speech and Pen-Based Gesture Applications...
  • arXiv. (2025). USER-VLM 360°: Personalized Vision Language Models...
  • arXiv. (2025). Retrieval Augmented Generation and Understanding in Vision...
  • MongoDB Blog. (2024). GraphRAG with MongoDB Atlas...
  • NVIDIA Blog. (2024). Build Real-Time Multimodal XR Apps with NVIDIA AI Blueprint...
  • GitHub. AIRLab-POLIMI/ROAMFREE.
  • 221e. MPE™ IMU & Sensor Fusion Software Solutions.
  • PMC. (2024). Development of an artificial intelligence-based multimodal diagnostic system for early detection of biliary atresia.
  • arXiv. (2025). MuDoC: An Interactive Multimodal Document-grounded Conversational AI System.
  • Kaushal, V., et al. (2024). Unifying AI Tutor Evaluation... GitHub.
Thank You