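Multimodal Input

Some models accept images, audio, and video alongside text. ModelMax follows the OpenAI multipart content format: pass an array of content parts instead of a plain text string.

See the supported models page for the input modalities each model accepts.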

Image Input

Send images in an image_url content part, either as a base64 data URL or as a hosted URL.

curl -X POST https://api.modelmax.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $MODELMAX_API_KEY" \
  -d '{
    "model": "gemini-3-flash-preview",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "What do you see in this image?" },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/photo.jpg"
            }
          }
        ]
      }
    ]
  }'

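import os

from openai import OpenAI

# Assumption: the OpenAI Python SDK, pointed at the ModelMax endpoint from the curl example.
client = OpenAI(
    base_url="https://api.modelmax.io/v1",
    api_key=os.environ["MODELMAX_API_KEY"],
)

response = client.chat.completions.create(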
    model="gemini-3-flash-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What do you see in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)

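import OpenAI from "openai";

// Assumption: the OpenAI Node SDK, pointed at the ModelMax endpoint from the curl example.
const client = new OpenAI({
  baseURL: "https://api.modelmax.io/v1",
  apiKey: process.env.MODELMAX_API_KEY,
});

const response = await client.chat.completions.create({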
  model: "gemini-3-flash-preview",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What do you see in this image?" },
        {
          type: "image_url",
          image_url: { url: "https://example.com/photo.jpg" },
        },
      ],
    },
  ],
});

Base64 Images

For local images, encode the file as a base64 data URL:

import base64

with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gemini-3-flash-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                },
            ],
        }
    ],
)
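The example above hardcodes image/jpeg in the data URL. As a minimal sketch, the encoding step could be wrapped in a helper that guesses the MIME type from the file extension. The names image_to_data_url and image_part are hypothetical and not part of any SDK, and the JPEG fallback for unknown extensions is an assumption.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Return a base64 data URL for a local image file."""
    # Guess the MIME type from the extension; fall back to JPEG (an assumption).
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        mime = "image/jpeg"
    b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{b64}"


def image_part(path: str) -> dict:
    """Build an image_url content part from a local file (hypothetical helper)."""
    return {"type": "image_url", "image_url": {"url": image_to_data_url(path)}}
```

The returned dictionary can be placed directly in the content array next to a text part.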

Multiple Images

A single message can contain up to 3 images:

response = client.chat.completions.create(
    model="gemini-3-flash-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two photos."},
                {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
                {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
            ],
        }
    ],
)
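Since the limit is per message, it may be convenient to check it before sending the request. The following is a sketch only: build_image_message and MAX_IMAGES_PER_MESSAGE are hypothetical names, and the constant simply mirrors the limit of 3 stated in this section.

```python
MAX_IMAGES_PER_MESSAGE = 3  # mirrors the per-message limit stated in this section


def build_image_message(prompt: str, image_urls: list[str]) -> dict:
    """Build one user message with a text part followed by image_url parts."""
    if len(image_urls) > MAX_IMAGES_PER_MESSAGE:
        raise ValueError(
            f"got {len(image_urls)} images; at most {MAX_IMAGES_PER_MESSAGE} per message"
        )
    content = [{"type": "text", "text": prompt}]
    # Each URL may be a hosted URL or a base64 data URL.
    content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    return {"role": "user", "content": content}
```

The result can be passed as a single element of the messages list.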

Audio Input

Send audio as a base64-encoded input_audio content part.

curl -X POST https://api.modelmax.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $MODELMAX_API_KEY" \
  -d '{
    "model": "gemini-3-flash-preview",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Transcribe this audio." },
          {
            "type": "input_audio",
            "input_audio": {
              "data": "<base64-encoded-audio>",
              "format": "webm"
            }
          }
        ]
      }
    ]
  }'

Supported audio formats depend on the provider. Common formats: webm, mp3, wav, ogg.
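The curl example uses a placeholder for the encoded audio. As a rough sketch, a local file could be turned into an input_audio part as follows. The helper name audio_part is hypothetical, and it assumes the file extension matches the format name the provider expects (for example clip.webm maps to "webm").

```python
import base64
from pathlib import Path


def audio_part(path: str) -> dict:
    """Build an input_audio content part from a local audio file."""
    p = Path(path)
    # Assumption: the extension (without the dot) is the format name.
    fmt = p.suffix.lstrip(".").lower()
    if not fmt:
        raise ValueError(f"cannot infer audio format from {path!r}")
    data = base64.b64encode(p.read_bytes()).decode("ascii")
    return {"type": "input_audio", "input_audio": {"data": data, "format": fmt}}
```

The part goes into the content array in the same position as the input_audio object in the curl request.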

Vision-Capable Models

Not all models support image input. Models with vision capability:

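Model	Image	Audio	Video
`gemini-3.1-pro-preview`	Supported	Supported	Supported
`gemini-3-pro-preview`	Supported	Supported	Supported
`gemini-3-flash-preview`	Supported	Supported	Supported
`gemini-3.1-flash-lite-preview`	Supported	Supported	Supported
`qwen3-vl-235b-a22b`	Supported	Not supported	Not supported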

Text-only models (such as deepseek-v3.2) return an error when they receive image or audio content.
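To avoid that error, a client could check the content parts against the table before sending a request. This is a sketch under stated assumptions: CAPABILITIES is copied by hand from the table in this section and may go stale, unsupported_modalities is a hypothetical helper, and any model missing from the table is treated as text-only.

```python
# Copied from the table in this section; not fetched from the API.
CAPABILITIES = {
    "gemini-3.1-pro-preview": {"image", "audio", "video"},
    "gemini-3-pro-preview": {"image", "audio", "video"},
    "gemini-3-flash-preview": {"image", "audio", "video"},
    "gemini-3.1-flash-lite-preview": {"image", "audio", "video"},
    "qwen3-vl-235b-a22b": {"image"},
}

# Content part types shown on this page, mapped to table columns.
PART_MODALITY = {"image_url": "image", "input_audio": "audio"}


def unsupported_modalities(model: str, content: list[dict]) -> set[str]:
    """Return the modalities in `content` that `model` does not accept."""
    # Assumption: models absent from the table are text-only.
    supported = CAPABILITIES.get(model, set())
    needed = {
        PART_MODALITY[part["type"]]
        for part in content
        if part.get("type") in PART_MODALITY
    }
    return needed - supported
```

An empty result means every part in the message maps to a modality the table lists for that model.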