Multimodal input

Some models accept images, audio, and video in addition to text. ModelMax follows OpenAI's multipart content format: instead of a plain string, pass an array of content parts as the message content.
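To show the difference between the two shapes, here is a minimal sketch of a hypothetical helper, `to_content_parts`, that wraps a plain string in a single text part so it can be combined with image or audio parts. It is not part of any ModelMax or OpenAI SDK. It only builds the dictionaries that the requests on this page send.

```python
from typing import Any, Union

ContentPart = dict[str, Any]


def to_content_parts(content: Union[str, list[ContentPart]]) -> list[ContentPart]:
    """Return `content` as a list of OpenAI-style content parts.

    A plain string becomes one {"type": "text"} part. An existing list of
    parts is returned as a shallow copy so the caller's list is not mutated.
    """
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return list(content)


# A user message whose content mixes a text part and an image part.
message = {
    "role": "user",
    "content": to_content_parts("What do you see in this image?")
    + [{"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}],
}
```

A message with plain string content is still text only. Adding an image or audio part requires the array form.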

See the supported models page for the input modalities each model accepts.

Image input

Send an image as an image_url content part, using either a hosted URL or a Base64 data URL.

curl -X POST https://api.modelmax.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $MODELMAX_API_KEY" \
  -d '{
    "model": "gemini-3-flash-preview",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "What do you see in this image?" },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/photo.jpg"
            }
          }
        ]
      }
    ]
  }'

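import os

from openai import OpenAI

# Assumes the official OpenAI Python SDK pointed at the ModelMax base URL.
client = OpenAI(
    base_url="https://api.modelmax.io/v1",
    api_key=os.environ["MODELMAX_API_KEY"],
)

response = client.chat.completions.create(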
    model="gemini-3-flash-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What do you see in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)

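import OpenAI from "openai";

// Assumes the official openai npm package pointed at the ModelMax base URL.
const client = new OpenAI({
  baseURL: "https://api.modelmax.io/v1",
  apiKey: process.env.MODELMAX_API_KEY,
});

const response = await client.chat.completions.create({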
  model: "gemini-3-flash-preview",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What do you see in this image?" },
        {
          type: "image_url",
          image_url: { url: "https://example.com/photo.jpg" },
        },
      ],
    },
  ],
});
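The SDK calls above send the same JSON body as the curl request. For environments without an SDK, here is a minimal standard-library sketch that builds that request with `urllib.request`. `build_image_request` is a hypothetical helper name. The sketch only constructs the request, so you can inspect it before sending it.

```python
import json
import os
import urllib.request

API_URL = "https://api.modelmax.io/v1/chat/completions"


def build_image_request(image_url: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build, but do not send, the same POST request as the curl example."""
    body = {
        "model": "gemini-3-flash-preview",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_image_request(
    "https://example.com/photo.jpg",
    "What do you see in this image?",
    os.environ.get("MODELMAX_API_KEY", ""),
)
# To send it: with urllib.request.urlopen(req) as resp: result = json.load(resp)
```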

Base64 images

To send a local image, encode it as a Base64 data URL.

import base64

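# Read the local file and Base64-encode its bytes into an ASCII string.
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("ascii")

# `client` is an OpenAI SDK client whose base_url is https://api.modelmax.io/v1.
response = client.chat.completions.create(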
    model="gemini-3-flash-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                },
            ],
        }
    ],
)
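The example above hard-codes image/jpeg in the data URL. Here is a minimal sketch that infers the MIME type from the file extension with the standard `mimetypes` module. The helper names `data_url_from_bytes` and `image_data_url` are hypothetical. Rejecting non-image types on the client is a conservative choice, not a documented ModelMax requirement.

```python
import base64
import mimetypes
from pathlib import Path


def data_url_from_bytes(data: bytes, mime: str) -> str:
    """Encode raw bytes as a data URL with the given MIME type."""
    b64 = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{b64}"


def image_data_url(path: str) -> str:
    """Encode a local image file as a data URL for an image_url content part.

    The MIME type is guessed from the file extension. A ValueError is raised
    before the file is read if the extension does not map to an image type.
    """
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type for {path!r}")
    return data_url_from_bytes(Path(path).read_bytes(), mime)
```

The result drops directly into a content part: {"type": "image_url", "image_url": {"url": image_data_url("photo.png")}}.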

Multiple images

A single message can include up to 3 images.

response = client.chat.completions.create(
    model="gemini-3-flash-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two photos."},
                # Each "..." stands in for a full Base64 payload.
                {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
                {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
            ],
        }
    ],
)
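Here is a minimal sketch that builds a multi-image message and enforces the 3-image limit on the client, so an oversized request fails before it is sent. `multi_image_message` and `MAX_IMAGES_PER_MESSAGE` are hypothetical names. The constant mirrors the limit stated on this page and may need updating if the service changes.

```python
from typing import Any

MAX_IMAGES_PER_MESSAGE = 3  # limit stated in this documentation; may change


def multi_image_message(prompt: str, image_urls: list[str]) -> dict[str, Any]:
    """Build one user message: a text part followed by image_url parts."""
    if not 1 <= len(image_urls) <= MAX_IMAGES_PER_MESSAGE:
        raise ValueError(
            f"expected 1 to {MAX_IMAGES_PER_MESSAGE} images, got {len(image_urls)}"
        )
    content: list[dict[str, Any]] = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    return {"role": "user", "content": content}
```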

Audio input

Send audio as a Base64-encoded content part of type input_audio.

curl -X POST https://api.modelmax.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $MODELMAX_API_KEY" \
  -d '{
    "model": "gemini-3-flash-preview",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Transcribe this audio." },
          {
            "type": "input_audio",
            "input_audio": {
              "data": "<base64-encoded-audio>",
              "format": "webm"
            }
          }
        ]
      }
    ]
  }'

Supported audio formats depend on the provider. Common formats include webm, mp3, wav, and ogg.
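Here is a minimal sketch that builds an input_audio part from raw bytes or a local file and uses the file extension as the format value. The helper names are hypothetical. Limiting formats to the common list above is a conservative assumption. A given provider may accept more or fewer formats, so adjust `COMMON_AUDIO_FORMATS` to match the model you call.

```python
import base64
from pathlib import Path
from typing import Any

# The common formats listed in this documentation. Actual support is
# provider-dependent; this set is a conservative client-side assumption.
COMMON_AUDIO_FORMATS = {"webm", "mp3", "wav", "ogg"}


def audio_part_from_bytes(data: bytes, fmt: str) -> dict[str, Any]:
    """Return an input_audio content part with Base64-encoded data."""
    fmt = fmt.lower()
    if fmt not in COMMON_AUDIO_FORMATS:
        raise ValueError(f"unexpected audio format {fmt!r}")
    return {
        "type": "input_audio",
        "input_audio": {"data": base64.b64encode(data).decode("ascii"), "format": fmt},
    }


def audio_part_from_file(path: str) -> dict[str, Any]:
    """Read a local audio file and use its extension (e.g. .wav) as the format."""
    p = Path(path)
    return audio_part_from_bytes(p.read_bytes(), p.suffix.lstrip("."))
```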

Vision-capable models

Not every model accepts image input. The following table lists the vision-capable models and the audio and video input each one supports.

Model	Image	Audio	Video
`gemini-3.1-pro-preview`	Yes	Yes	Yes
`gemini-3-pro-preview`	Yes	Yes	Yes
`gemini-3-flash-preview`	Yes	Yes	Yes
`gemini-3.1-flash-lite-preview`	Yes	Yes	Yes
`qwen3-vl-235b-a22b`	Yes	No	No

Sending image or audio content to a text-only model (for example, deepseek-v3.2) returns an error.
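To catch these errors before a request is sent, here is a minimal sketch of a client-side pre-flight check built from the table above. `MODEL_MODALITIES`, `PART_MODALITY`, and `check_modalities` are hypothetical names. The mapping is a hand-copied snapshot that can go stale, and the server's response remains authoritative. No video content-part type is documented on this page, so the sketch does not map one. Any unrecognized part type is rejected unless the model entry lists it explicitly.

```python
from typing import Any

# Snapshot of the table on this page, plus deepseek-v3.2 from the note on
# text-only models. Every chat model is assumed to accept text.
MODEL_MODALITIES: dict[str, set[str]] = {
    "gemini-3.1-pro-preview": {"text", "image", "audio", "video"},
    "gemini-3-pro-preview": {"text", "image", "audio", "video"},
    "gemini-3-flash-preview": {"text", "image", "audio", "video"},
    "gemini-3.1-flash-lite-preview": {"text", "image", "audio", "video"},
    "qwen3-vl-235b-a22b": {"text", "image"},
    "deepseek-v3.2": {"text"},
}

# Content-part types shown on this page and the modality each one carries.
# Any other part type is treated as its own modality name, so it is rejected
# unless a model entry lists it explicitly.
PART_MODALITY = {"text": "text", "image_url": "image", "input_audio": "audio"}


def check_modalities(model: str, messages: list[dict[str, Any]]) -> None:
    """Raise ValueError if any content part uses a modality the model lacks."""
    supported = MODEL_MODALITIES.get(model)
    if supported is None:
        raise ValueError(f"unknown model {model!r}; see the supported models page")
    for message in messages:
        content = message.get("content")
        if isinstance(content, str):
            continue  # plain string content is text only
        for part in content or []:
            part_type = part.get("type")
            modality = PART_MODALITY.get(part_type, part_type)
            if modality not in supported:
                raise ValueError(f"{model} does not accept {modality} input")
```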