How to Build a Multimodal RAG Application with Gemini API File Search: A Step-by-Step Developer Guide

Introduction

If you're building a retrieval-augmented generation (RAG) application that handles both text and images, the Gemini API File Search tool now makes it easier than ever. With the addition of Gemini Embedding 2, images like charts, product photos, and diagrams can be natively indexed and searched in the same store as your text documents—no separate OCR pipeline needed.

(Figure: article illustration. Source: dev.to)

In this step-by-step guide, you'll learn how to set up a multimodal File Search store, upload documents and images, perform queries with grounded generation, and extract image citations from the results. By the end, you'll have a fully functional RAG system that returns both text and visual answers with source references.

What You Need

  • A Google Cloud project with the Gemini API enabled
  • Python 3.9 or later installed on your machine
  • pip package manager
  • Latest google-genai Python SDK (pip install -U google-genai)
  • A Gemini API key (set as environment variable GOOGLE_API_KEY or configured in your client)
  • Sample files: at least one PDF and one image (e.g., product photo, chart, or diagram)
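Before writing any application code, it's worth verifying the basics are in place. Here's a small, illustrative sanity-check sketch (the `missing_prereqs` helper is our own, not part of any SDK; the google-genai client typically picks up GOOGLE_API_KEY or GEMINI_API_KEY from the environment):

```python
import os
import sys

def missing_prereqs(env: dict) -> list:
    """Return a list of human-readable problems with the local setup."""
    problems = []
    if sys.version_info < (3, 9):
        problems.append(f"Python 3.9+ required, found {sys.version.split()[0]}")
    if not env.get("GOOGLE_API_KEY") and not env.get("GEMINI_API_KEY"):
        problems.append("Set GOOGLE_API_KEY (or GEMINI_API_KEY) in your environment")
    return problems

if __name__ == "__main__":
    for problem in missing_prereqs(dict(os.environ)):
        print(f"WARNING: {problem}")
```

Running this before the steps below catches the two most common setup failures (wrong Python version, missing API key) early.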

Step-by-Step Instructions

Step 1: Create a File Search Store

A File Search Store is a managed, persistent container for your document embeddings. Think of it as a vector database that the API handles for you—chunking, embedding, indexing, and retrieval are all automated.

To enable multimodal search, you must specify gemini-embedding-2 as the embedding model. If you omit this parameter, the default gemini-embedding-001 (text-only) is used, and you cannot change it later. Use the following Python code:

from google import genai
from google.genai import types

client = genai.Client()

file_search_store = client.file_search_stores.create(
    config={
        "display_name": "product-catalog",
        "embedding_model": "models/gemini-embedding-2"
    }
)
print(f"Created store: {file_search_store.name}")

Once created, the store is ready to accept files. Note the display_name is for your reference; the returned name (e.g., projects/.../fileSearchStores/...) is used in subsequent steps.
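Because stores persist across sessions, you may want to reuse an existing one rather than create duplicates on every run. A minimal lookup sketch; it assumes `client.file_search_stores.list()` yields store objects with a `display_name` attribute, and the `find_store` helper is our own:

```python
from typing import Iterable

def find_store(stores: Iterable, display_name: str):
    """Return the first store whose display_name matches, or None."""
    for store in stores:
        if getattr(store, "display_name", None) == display_name:
            return store
    return None

# Usage (assumes `client` from Step 1):
# store = find_store(client.file_search_stores.list(), "product-catalog")
# if store is None:
#     store = client.file_search_stores.create(
#         config={"display_name": "product-catalog",
#                 "embedding_model": "models/gemini-embedding-2"}
#     )
```

This "find or create" pattern keeps repeated runs of your indexing script idempotent.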

Step 2: Upload Documents and Images

Next, upload your files (PDFs, images, etc.) to the Gemini API and associate them with your store. The API automatically chunks and indexes each file using the embedding model you chose.

First, upload a file using the client, then add it to the store. Here's an example for an image:

# Upload a file (PDF or image) to the Gemini API
image_file = client.files.upload(
    file="path/to/product_photo.jpg",
    config={"display_name": "Product Photo A"}
)

# Associate the file with your File Search store
client.file_search_stores.add_file(
    file_search_store=file_search_store.name,
    file=image_file
)

Repeat for each file you want to index. You can mix PDFs and images in the same store—the model handles both formats. For optimal results, ensure your images are clear and contain visual information the model can embed (avoid text-heavy images that rely solely on OCR).
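When indexing a whole folder, it helps to skip files the store can't use before uploading. A small filter sketch; the extension set below is an assumption based on the formats discussed in this guide (PDFs and common image types), not an exhaustive list from the API docs:

```python
from pathlib import Path
from typing import Iterable, Iterator

# Assumed set of indexable extensions for this guide's purposes.
INDEXABLE = {".pdf", ".jpg", ".jpeg", ".png", ".webp"}

def iter_indexable(paths: Iterable[str]) -> Iterator[str]:
    """Yield only paths whose extension looks indexable."""
    for p in paths:
        if Path(p).suffix.lower() in INDEXABLE:
            yield p

# Usage (assumes `client` and `file_search_store` from earlier steps):
# for path in iter_indexable(str(p) for p in Path("catalog/").rglob("*")):
#     f = client.files.upload(file=path)
#     client.file_search_stores.add_file(
#         file_search_store=file_search_store.name, file=f
#     )
```

Filtering first avoids wasted uploads and keeps your store free of unsupported clutter.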

Step 3: Query with Grounded Generation

Now you can ask questions that leverage both text and images from your store. Use the file_search tool in generate_content to let the model automatically retrieve relevant chunks and produce a grounded response.

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What are the key features shown in the product photo?",
    config=types.GenerateContentConfig(
        tools=[
            types.Tool(
                file_search=types.FileSearch(
                    file_search_store_names=[file_search_store.name]
                )
            )
        ]
    )
)
print(response.text)

The model will search the store, retrieve relevant text and image context, and generate an answer that references the source files. If your query is about an image, the model uses the embeddings to match the visual content—without needing OCR.


Step 4: Retrieve Image Citations

One of the key benefits of the File Search tool is built-in citations. When the model uses a specific file (especially images), the response includes grounding metadata with downloadable references. To extract image citations from the response object, use:

candidate = response.candidates[0]
if candidate.grounding_metadata:
    for chunk in candidate.grounding_metadata.grounding_chunks:
        if chunk.retrieved_context:
            # Each chunk's retrieved_context carries the file URI and metadata
            print(f"File: {chunk.retrieved_context.uri}")
            print(f"Title: {chunk.retrieved_context.title}")
            # For images, a temporary download link may be included
            mime = chunk.retrieved_context.mime_type or ""
            if mime.startswith("image/"):
                print(f"Download URL: {chunk.retrieved_context.signed_url}")

This code iterates through grounding chunks and prints details for each source file. The signed_url (if present) provides a temporary, authenticated link to download the image directly from the API's storage.
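If you want to surface citations in a UI, it's convenient to normalize the chunks into plain dictionaries first. Here's a sketch that works with any objects exposing the `retrieved_context` fields used above (the `extract_citations` helper is our own, not part of the SDK):

```python
def extract_citations(grounding_chunks) -> list:
    """Collect de-duplicated source references from grounding chunks."""
    seen = set()
    citations = []
    for chunk in grounding_chunks or []:
        ctx = getattr(chunk, "retrieved_context", None)
        if ctx is None or ctx.uri in seen:
            continue  # skip chunks without context, and repeated sources
        seen.add(ctx.uri)
        citations.append({
            "uri": ctx.uri,
            "title": getattr(ctx, "title", None),
            "is_image": (getattr(ctx, "mime_type", "") or "").startswith("image/"),
        })
    return citations
```

De-duplicating by URI matters because the model often retrieves several chunks from the same file, and users only need each source listed once.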

Tips for Success

  • Choose the right embedding model early – Once you create a store, you cannot change the embedding model. If you think you'll need image search later, always use gemini-embedding-2 from the start.
  • Optimize your files – PDFs should be well-structured with clear headings. Images should have high contrast and minimal text to avoid ambiguity during embedding.
  • Test with diverse queries – Try questions that mix text and visual context (e.g., “Compare the chart in the PDF to the product photo”) to verify the multimodal retrieval works as expected.
  • Handle large stores – If you need to index thousands of files, batch your uploads and monitor your API quota. Embedding storage is free, but indexing tokens are billable.
  • Inspect grounding metadata – Always check response.grounding_metadata to understand which sources the model used. This helps debug retrieval quality and ensures your citations are accurate.
  • Use AI Studio for prototyping – Before coding, try the AI Studio example app to see multimodal File Search in action. It gives you a no-code preview of the queries and citations.
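For the batching tip above, a simple retry-with-backoff wrapper keeps transient upload failures from aborting a long indexing run. A generic sketch (the exception types and delays are placeholders; tune them for the errors you actually observe):

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, retry_on=(Exception,)):
    """Call fn(); on failure, wait base_delay * 2**i seconds and retry."""
    for i in range(attempts):
        try:
            return fn()
        except retry_on:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** i))

# Usage (assumes `client` from Step 1):
# uploaded = with_retries(lambda: client.files.upload(file="catalog/photo.jpg"))
```

Exponential backoff (1s, 2s, 4s, ...) is gentle on quota limits, and re-raising on the final attempt means genuine failures still reach your error handling.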

By following these steps, you have built a powerful multimodal RAG endpoint that goes beyond text-only search. The Gemini API does the heavy lifting—chunking, embedding, indexing, and retrieval—so you can focus on crafting great user experiences.
