Multimodal Input

Multimodal input lets a message carry images as well as text, so the model can see what the player sees — a screenshot, a photo, a piece of in-game art — and answer about it.

Image input works only with vision-capable models, that is, models that accept image input.

Message content

A message's content can be a list of parts — text and image — instead of a plain string. An image part accepts a URL, a base64 string, or a data URL, with an optional MIME type:

{
  "role": "user",
  "content": [
    { "type": "text", "text": "What's happening in this screenshot?" },
    { "type": "image", "image": "https://example.com/screen.png", "mimeType": "image/png" }
  ]
}

You can include multiple image parts in one message.
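
As a rough illustration, a message with several image parts could be assembled as in the sketch below. Only the part fields (type, text, image, mimeType) come from the example above. The helper names textPart, imagePart, toDataUrl, and buildUserMessage are hypothetical and not part of any SDK, the URL is a placeholder, and the base64 string is a stub rather than a real image. The sketch also assumes that mimeType can be left out when the data URL already states the type.

```javascript
// Hypothetical helpers for building multimodal message content.
// Only the part shapes (type/text/image/mimeType) come from the docs above.

// Build a text part.
function textPart(text) {
  return { type: "text", text };
}

// Build an image part from a URL, a base64 string, or a data URL.
// mimeType is optional, so it is only attached when provided.
function imagePart(image, mimeType) {
  const part = { type: "image", image };
  if (mimeType !== undefined) {
    part.mimeType = mimeType;
  }
  return part;
}

// Wrap a raw base64 string in a data URL.
function toDataUrl(base64, mimeType) {
  return `data:${mimeType};base64,${base64}`;
}

// Build a user message: one text part followed by any number of image parts.
function buildUserMessage(question, images) {
  return {
    role: "user",
    content: [
      textPart(question),
      ...images.map((img) => imagePart(img.image, img.mimeType)),
    ],
  };
}

// Two images in one message: a plain URL and a data URL.
// "iVBORw0KGgo=" is a placeholder stub, not a complete image.
const message = buildUserMessage("What changed between these two screenshots?", [
  { image: "https://example.com/before.png", mimeType: "image/png" },
  { image: toDataUrl("iVBORw0KGgo=", "image/png") },
]);

console.log(JSON.stringify(message, null, 2));
```

The resulting object has the same shape as the JSON example above and would be passed as one entry in the message list of whichever text generation call your platform uses.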

When to use it

Describe a screenshot — let an NPC or helper react to what's on screen.
Read an image — extract text or details from a picture the player provides.
Visual Q&A — answer questions grounded in an image.

For the exact call in your language, see the JavaScript, Unity, or Unreal text generation guide.
