Multimodal Input
Send images alongside text in a message
Multimodal input lets a message carry images as well as text, so the model can see what the player sees — a screenshot, a photo, a piece of in-game art — and answer about it.
Image input is available only on vision-capable models (those that accept image input).
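One way to respect this limit in client code is to degrade to text-only content when the chosen model cannot accept images. The sketch below is only an illustration: the `ModelInfo` type, its `supportsVision` flag, and the `contentForModel` helper are hypothetical names, and how you actually learn whether a model is vision-capable depends on your setup. The part shapes follow the message format described under "Message content".

```typescript
// Part shapes follow the message format described in this page; type names are illustrative.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image"; image: string; mimeType?: string };
type ContentPart = TextPart | ImagePart;

// Hypothetical capability record; not part of any documented API.
type ModelInfo = { id: string; supportsVision: boolean };

// For vision-capable models, return the parts unchanged.
// Otherwise keep only the text parts and join them into a plain string.
function contentForModel(model: ModelInfo, content: ContentPart[]): string | ContentPart[] {
  if (model.supportsVision) return content;
  return content
    .filter((part): part is TextPart => part.type === "text")
    .map((part) => part.text)
    .join("\n");
}

const parts: ContentPart[] = [
  { type: "text", text: "What's happening in this screenshot?" },
  { type: "image", image: "https://example.com/screen.png", mimeType: "image/png" },
];

// Hypothetical model ids, used only for this example.
const textOnly = contentForModel({ id: "text-model", supportsVision: false }, parts);
const withImages = contentForModel({ id: "vision-model", supportsVision: true }, parts);
```

Dropping images silently may not suit every game; raising an error or telling the player that the image was ignored are equally reasonable choices.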
Message content
A message's content can be a list of parts — text and image — instead of a plain string. An image part accepts a URL, a base64 string, or a data URL, with an optional MIME type:
{
  "role": "user",
  "content": [
    { "type": "text", "text": "What's happening in this screenshot?" },
    { "type": "image", "image": "https://example.com/screen.png", "mimeType": "image/png" }
  ]
}
You can include multiple image parts in one message.
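As a minimal sketch of assembling such a message in TypeScript, the snippet below builds a content list with one text part and two image parts, one given as a URL and one as a data URL. The field names (`type`, `text`, `image`, `mimeType`) follow the JSON above; the type names and the helpers `toDataUrl` and `userMessage` are hypothetical and not part of any SDK.

```typescript
// Shapes mirror the JSON example above; the type names are illustrative.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image"; image: string; mimeType?: string };
type ContentPart = TextPart | ImagePart;
type Message = { role: string; content: string | ContentPart[] };

// Hypothetical helper: wrap an already base64-encoded image in a data URL.
function toDataUrl(base64: string, mimeType: string): string {
  return `data:${mimeType};base64,${base64}`;
}

// Hypothetical helper: one text part followed by one part per image.
function userMessage(text: string, images: ImagePart[]): Message {
  return { role: "user", content: [{ type: "text", text }, ...images] };
}

// "iVBORw==" is the base64 encoding of the first four bytes of a PNG file,
// used here only as a stand-in for real image data.
const message = userMessage("What changed between these two screenshots?", [
  { type: "image", image: "https://example.com/screen.png", mimeType: "image/png" },
  { type: "image", image: toDataUrl("iVBORw==", "image/png") },
]);
```

The data URL already carries its MIME type, so the second image part leaves `mimeType` unset here; whether that is sufficient for a given model is an assumption to verify against the guide for your platform.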
When to use it
- Describe a screenshot — let an NPC or helper react to what's on screen.
- Read an image — extract text or details from a picture the player provides.
- Visual Q&A — answer questions grounded in an image.
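The three use cases above differ mainly in the text that accompanies the image. The following sketch shows one way to express that in TypeScript. The `VisionTask` type, the `promptFor` and `visionMessage` helpers, and the prompt wording are all illustrative assumptions, not part of any SDK; only the message and part shapes come from this page.

```typescript
// Part and message shapes follow the format shown on this page; type names are illustrative.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image"; image: string; mimeType?: string };
type Message = { role: string; content: Array<TextPart | ImagePart> };

// Hypothetical task descriptor covering the three use cases.
type VisionTask =
  | { kind: "describe" }
  | { kind: "read" }
  | { kind: "ask"; question: string };

// Illustrative prompt wording; adjust it for your game.
function promptFor(task: VisionTask): string {
  switch (task.kind) {
    case "describe":
      return "Describe what is happening in this screenshot.";
    case "read":
      return "Transcribe any text visible in this image.";
    case "ask":
      return task.question;
  }
}

// Pair the task's prompt with a single image (URL, base64 string, or data URL).
function visionMessage(task: VisionTask, image: string): Message {
  return {
    role: "user",
    content: [
      { type: "text", text: promptFor(task) },
      { type: "image", image },
    ],
  };
}

const request = visionMessage(
  { kind: "ask", question: "Which door is open?" },
  "https://example.com/screen.png",
);
```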
For the exact call in your language, see the JavaScript, Unity, or Unreal text generation guide.