PlayKit.ai
Multimodal Input

Send images alongside text in a message

Multimodal input lets a message carry images as well as text, so the model can see what the player sees — a screenshot, a photo, a piece of in-game art — and answer about it.

Image parts work only with vision-capable models, that is, models that accept image input.

Message content

A message's content can be a list of text and image parts instead of a plain string. In an image part, the image field accepts a URL, a base64 string, or a data URL, and the mimeType field is optional:

{
  "role": "user",
  "content": [
    { "type": "text", "text": "What's happening in this screenshot?" },
    { "type": "image", "image": "https://example.com/screen.png", "mimeType": "image/png" }
  ]
}
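
The three accepted image forms can be sketched as follows. This is a minimal illustration in plain Python rather than any PlayKit SDK call: the helper names (image_part, to_base64, to_data_url) are hypothetical, and it assumes that "base64 string" means the raw encoded bytes with no prefix and that a data URL has the usual data:<mime>;base64,<payload> shape. Only the type, image, and mimeType fields come from the message format shown above.

```python
import base64


def image_part(image, mime_type=None):
    """Build an image part. `image` is a URL, a base64 string, or a data URL."""
    part = {"type": "image", "image": image}
    if mime_type is not None:
        # mimeType is optional, so it is only added when provided.
        part["mimeType"] = mime_type
    return part


def to_base64(data):
    """Encode raw image bytes as a base64 string (assumed: no prefix)."""
    return base64.b64encode(data).decode("ascii")


def to_data_url(data, mime_type):
    """Wrap raw image bytes in a data URL that carries its own MIME type."""
    return f"data:{mime_type};base64,{to_base64(data)}"


# Stand-in bytes for illustration: the 8-byte PNG file signature.
# In a game these would be the bytes of a captured screenshot.
png_bytes = b"\x89PNG\r\n\x1a\n"

by_url = image_part("https://example.com/screen.png", "image/png")
by_base64 = image_part(to_base64(png_bytes), "image/png")
by_data_url = image_part(to_data_url(png_bytes, "image/png"))
```

A data URL already names its MIME type, which is why the last call omits mimeType; whether the service needs mimeType alongside a bare base64 string is not stated here, so passing it in that case is a cautious assumption.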

You can include multiple image parts in one message.
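
A message with several images might be assembled like this. It is a sketch under assumptions, not the SDK's API: user_message is a hypothetical helper, the before.png and after.png URLs are placeholders, and putting the text part first is a choice made for the example rather than a documented requirement.

```python
import json


def user_message(text, images):
    """Build a user message with one text part followed by image parts.

    `images` is a list of (image, mime_type) pairs; mime_type may be None.
    """
    content = [{"type": "text", "text": text}]
    for image, mime_type in images:
        part = {"type": "image", "image": image}
        if mime_type is not None:
            part["mimeType"] = mime_type
        content.append(part)
    return {"role": "user", "content": content}


message = user_message(
    "What changed between these two screenshots?",
    [
        ("https://example.com/before.png", "image/png"),  # placeholder URL
        ("https://example.com/after.png", "image/png"),   # placeholder URL
    ],
)

# Serialize to the JSON shape shown in the message format above.
payload = json.dumps(message, indent=2)
print(payload)
```

The resulting dictionary serializes to the same JSON structure as the single-image example, with one extra image entry in the content list.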

When to use it

  • Describe a screenshot — let an NPC or helper react to what's on screen.
  • Read an image — extract text or details from a picture the player provides.
  • Visual Q&A — answer questions grounded in an image.

For the exact call in your language, see the JavaScript, Unity, or Unreal text generation guide.