Multimodal Capabilities
Multimodal Inputs: Images, Audio, and Video
Send images, audio, video, and text together in a single Gemini API call so the model can analyze them jointly.
Native Multimodality
Gemini is built from the ground up to understand multiple modalities. Unlike models that bolt on vision as a separate feature, Gemini processes text, images, audio, video, and code natively in the same model and the same API call.
Supported Input Types
| Type | Formats | Max Size |
|---|---|---|
| Image | JPEG, PNG, WebP, GIF, HEIC | 20 MB per request |
| Audio | MP3, WAV, FLAC, AAC, OGG | 20 MB, up to 9.5 hours |
| Video | MP4, MOV, AVI, WebM | 20 MB inline, up to 1 hour via File API |
| Text | Any UTF-8 text | Context window limit |
| PDF documents | PDF | 300 pages, 20 MB |
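Each inline part needs a `mimeType` alongside its data, so a small lookup from file extension to MIME type can be handy. The sketch below covers only a subset of the formats in the table; `mimeTypeFor` is a hypothetical helper, and the MIME strings are commonly used values rather than ones confirmed by this page, so check them against the API reference before relying on them.

```typescript
// Extension-to-MIME lookup for a subset of the formats listed in the table.
// Assumption: these are commonly used MIME strings, not values taken from this page.
const MIME_BY_EXTENSION = new Map<string, string>([
  ["jpg", "image/jpeg"],
  ["jpeg", "image/jpeg"],
  ["png", "image/png"],
  ["webp", "image/webp"],
  ["gif", "image/gif"],
  ["heic", "image/heic"],
  ["wav", "audio/wav"],
  ["flac", "audio/flac"],
  ["ogg", "audio/ogg"],
  ["mp4", "video/mp4"],
  ["webm", "video/webm"],
  ["pdf", "application/pdf"],
]);

// Hypothetical helper: returns undefined for unknown or missing extensions
// so the caller can decide how to handle unsupported files.
function mimeTypeFor(filename: string): string | undefined {
  const dot = filename.lastIndexOf(".");
  if (dot < 0) return undefined;
  return MIME_BY_EXTENSION.get(filename.slice(dot + 1).toLowerCase());
}

console.log(mimeTypeFor("screenshot.png")); // image/png
console.log(mimeTypeFor("notes"));          // undefined
```

Returning `undefined` instead of a default type keeps an unsupported file from being sent with a misleading label.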
Inline vs File API
When the request stays under 20 MB, pass files inline as base64-encoded data. For larger files, upload them through the File API first, then reference the returned file URI in the request.
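One detail worth keeping in mind is that base64 encoding grows the payload by roughly a third, so a file that is under 20 MB on disk may not fit inline once encoded. The sketch below shows one way to make that decision up front. `chooseTransport` and `INLINE_LIMIT_BYTES` are hypothetical names, and the code assumes that "20 MB" means 20 MiB and that the limit applies to the encoded request; neither is confirmed by this page.

```typescript
// Assumption: "20 MB" is read as 20 MiB and applies to the encoded request body.
const INLINE_LIMIT_BYTES = 20 * 1024 * 1024;

// Base64 emits 4 characters for every 3 input bytes, padding the last group.
function base64Length(rawBytes: number): number {
  return 4 * Math.ceil(rawBytes / 3);
}

// Hypothetical helper: pick a transport from the raw file size plus the size
// of everything else in the request (prompt text, other parts).
function chooseTransport(rawBytes: number, otherBytes = 0): "inline" | "file-api" {
  return base64Length(rawBytes) + otherBytes < INLINE_LIMIT_BYTES
    ? "inline"
    : "file-api";
}

console.log(chooseTransport(2 * 1024 * 1024));  // inline
console.log(chooseTransport(18 * 1024 * 1024)); // file-api (encoded size exceeds the limit)
```

Checking the encoded size is the conservative choice: at worst it sends a borderline file through the File API when inline would also have worked.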
Common Multimodal Use Cases
- Image analysis — Describe, classify, or extract data from images
- Document understanding — Parse PDFs, diagrams, forms, receipts
- Video analysis — Describe scenes, extract timestamps, transcribe audio
- Chart/graph reading — Extract data from visual charts
- OCR — Extract text from images and scanned documents
- Code screenshots — Analyze and explain code in images
Sending Multiple Images
You can include multiple images in a single request — useful for comparison tasks, product catalogs, or sequential analysis.
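As a rough sketch of a multi-image request, the helper below builds a parts array with one `inlineData` entry per image, followed by the text prompt, using the same part shape as the example on this page. `buildComparisonRequest`, the local `Part` types, and the placeholder base64 strings are all illustrative assumptions rather than SDK exports.

```typescript
// Local types mirroring the inline part shape used in this page's example.
interface InlineDataPart {
  inlineData: { data: string; mimeType: string };
}
type Part = InlineDataPart | string;

interface EncodedImage {
  base64: string;
  mimeType: string;
}

// Hypothetical helper: one inline part per image, then the prompt as the last part.
function buildComparisonRequest(images: EncodedImage[], prompt: string): Part[] {
  const imageParts: Part[] = images.map((img) => ({
    inlineData: { data: img.base64, mimeType: img.mimeType },
  }));
  return [...imageParts, prompt];
}

// Placeholder payloads; real code would base64-encode the image bytes.
const parts = buildComparisonRequest(
  [
    { base64: "PLACEHOLDER_BEFORE", mimeType: "image/png" },
    { base64: "PLACEHOLDER_AFTER", mimeType: "image/png" },
  ],
  "Compare these two screenshots and list the visual differences.",
);

console.log(parts.length); // 3
```

The resulting array can be passed to `model.generateContent(parts)` in the same way as the single-image calls. Keeping the images in a fixed order lets the prompt refer to them as "first" and "second".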
Example
```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";
import * as fs from "fs";

// Top-level await below requires an ES module context.
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: "gemini-1.5-pro" });

// --- Inline image from file ---
const imageBytes = fs.readFileSync("./screenshot.png");
const base64Image = imageBytes.toString("base64");
const imageResult = await model.generateContent([
  {
    inlineData: {
      data: base64Image,
      mimeType: "image/png",
    },
  },
  "Describe any UI issues you see in this screenshot.",
]);
console.log(imageResult.response.text());

// --- Analyze a chart ---
const chartBytes = fs.readFileSync("./chart.jpg");
const chartResult = await model.generateContent([
  { inlineData: { data: chartBytes.toString("base64"), mimeType: "image/jpeg" } },
  "Extract all data points from this chart as a JSON array with { label, value } objects.",
]);
console.log(chartResult.response.text());

// --- PDF document understanding ---
const pdfBytes = fs.readFileSync("./contract.pdf");
const pdfResult = await model.generateContent([
  { inlineData: { data: pdfBytes.toString("base64"), mimeType: "application/pdf" } },
  "Summarize the key obligations and deadlines in this contract.",
]);
console.log(pdfResult.response.text());
```
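The chart prompt asks for a JSON array, but the reply still arrives as plain text and may be wrapped in a Markdown code fence. The following is a minimal sketch of parsing and validating such a reply. `parseDataPoints` and `DataPoint` are hypothetical names, and the fence-stripping step is an assumption about how replies are sometimes formatted, not documented behavior.

```typescript
interface DataPoint {
  label: string;
  value: number;
}

// Hypothetical helper: strips an optional Markdown code fence, parses the JSON,
// and checks that every entry has a string label and a numeric value.
function parseDataPoints(reply: string): DataPoint[] {
  const trimmed = reply.trim();
  const fenced = trimmed.match(/^`{3}(?:json)?\s*([\s\S]*?)\s*`{3}$/);
  const jsonText = fenced ? fenced[1] : trimmed;

  const parsed: unknown = JSON.parse(jsonText);
  if (!Array.isArray(parsed)) {
    throw new Error("Expected a JSON array of data points");
  }
  return parsed.map((item, index) => {
    if (typeof item?.label !== "string" || typeof item?.value !== "number") {
      throw new Error(`Entry ${index} is not a { label, value } object`);
    }
    return { label: item.label, value: item.value };
  });
}

const sample = '[{"label":"Q1","value":10},{"label":"Q2","value":14}]';
console.log(parseDataPoints(sample).length); // 2
```

In the example on this page, the input would be `chartResult.response.text()`. Validating the shape before use turns a malformed reply into an explicit error instead of a silent `undefined` further down the line.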