Generative AI Chatbots

Student: Jorge Enrique Castañeda Centurion

Course: Business Intelligence

Teacher: CUADROS QUIROGA, PATRICK JOSE

Gemini

Gemini is an advanced artificial intelligence model created by Google, designed to outperform its predecessor, PaLM, and to compete with OpenAI's GPT models. It aims to lead the market through its multimodal capacity, natively processing text, images, audio, and programming code, which makes it a highly versatile tool suited to many applications.

Gemini comes in three tiered versions: Ultra, Pro, and Nano. Ultra is the most advanced, designed to outperform GPT-4 on current benchmarks; Pro competes with GPT-3.5 and powers Google Bard; and Nano is optimized for mobile devices, allowing offline use, which represents a notable advantage over OpenAI's models.

The Google Gemini API allows developers to integrate the advanced capabilities of this model into their own applications and services. The API exposes Gemini's multimodal functions, working natively with text, images, audio, and programming code, which makes it well suited to building solutions that combine multiple data sources in a single workflow.
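As a minimal illustration, a typical call to the API from Python looks like the sketch below. It assumes the google-generativeai package is installed, a valid API key is available, and the image file name is a placeholder:

import google.generativeai as genai
from PIL import Image

# Configure the client with an API key (assumed to be available)
genai.configure(api_key="YOUR_API_KEY")

# Instantiate a model; "gemini-1.5-flash" is one of the available model names
model = genai.GenerativeModel("gemini-1.5-flash")

# Plain text generation
response = model.generate_content("Summarize in one sentence what Business Intelligence is.")
print(response.text)

# Multimodal generation: text and an image in the same request
image = Image.open("example.jpg")  # hypothetical local image
response = model.generate_content(["Describe the contents of this image.", image])
print(response.text)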

Example:

The code begins by importing the libraries required for its core functionality: google.generativeai to interact with the Gemini API, pandas to process tabular data, FAISS to handle vector indexes, and SentenceTransformer to generate embeddings. Additionally, pytesseract provides optical character recognition (OCR), googletrans handles text translation, and cv2 (OpenCV) performs image preprocessing. python-dotenv is also used to load the API key from a .env file, ensuring secure credential handling.

import os
from dotenv import load_dotenv
import google.generativeai as genai
import pandas as pd
import faiss
from sentence_transformers import SentenceTransformer
from PIL import Image
import pytesseract
from googletrans import Translator
import cv2
from langdetect import detect

The Gemini API key is loaded from the .env file and configured with genai.configure(). If the key is not available, the program raises an error, ensuring the API is ready before any request is made. This configuration gives the program access to the Gemini generative model for producing text-based content.

The program includes a feature to load a CSV file containing frequently asked questions and answers. It validates that the file contains the required columns (question and answer) before processing it. The questions are converted to embeddings with SentenceTransformer, and a FAISS vector index is built over them for fast similarity search, making it easy to find the stored question closest to a user-provided query.

For each user question, the program looks up the closest stored question in the FAISS index and asks Gemini to generate an answer grounded in that match. The prompt includes both the matched question and its associated answer, which helps the model produce more precise and relevant responses.

# Point pytesseract to the local Tesseract executable (default Windows install path)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Load the API key from the .env file
load_dotenv()
gemini_api_key = os.getenv("GEMINI_API_KEY")
if not gemini_api_key:
    raise ValueError("No se encontró la clave GEMINI_API_KEY en el archivo .env")

# Configure the Gemini API
genai.configure(api_key=gemini_api_key)

# Sentence-embedding model used to vectorize the questions
model = SentenceTransformer('all-MiniLM-L6-v2')

# Load FAQ data from a CSV file, validating the required columns
def load_faq_data(file_path):
    df = pd.read_csv(file_path)
    if "question" not in df.columns or "answer" not in df.columns:
        raise ValueError("El archivo CSV debe tener las columnas 'question' y 'answer'")
    return df

# Build a FAISS index over the question embeddings
def create_faiss_index(questions):
    embeddings = model.encode(questions)
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(embeddings)
    return index, embeddings

# Find the stored question closest to the user's query
def find_closest_question(user_question, index, questions, k=1):
    user_embedding = model.encode([user_question])
    distances, indices = index.search(user_embedding, k)
    closest_idx = indices[0][0]
    return questions[closest_idx], distances[0][0]

# Get an answer from Gemini based on a question and its retrieved context
def get_gemini_response(question, context):
    # Build the prompt, including the retrieved context
    prompt = (
        f"Contexto:\n{context}\n\n"
        f"Pregunta: {question}\n"
        "Por favor responde de manera clara y concisa:"
    )

    # Use the generative model to produce the content
    gen_model = genai.GenerativeModel("gemini-1.5-flash")
    response = gen_model.generate_content(prompt)

    # Return the generated text, with a fallback message if generation failed
    return response.text if response else "No se pudo generar una respuesta."

If the user provides an image instead of text, the program preprocesses the image with OpenCV to improve OCR quality and then extracts the text with pytesseract. It then detects the language of the extracted text with langdetect and, if it is English, translates it into Spanish with googletrans. This allows visual input to be converted into textual data useful to the user.

# New feature: process images
def process_image(image_path):
    # Check that the image file exists
    if not os.path.exists(image_path):
        raise FileNotFoundError(f"No se encontró el archivo de imagen '{image_path}'.")

    print(f"\nProcesando imagen: {image_path}")

    # Read the image and convert it to grayscale
    image = cv2.imread(image_path)
    if image is None:
        raise ValueError(f"No se pudo leer la imagen '{image_path}'.")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Apply Otsu binarization to improve contrast for OCR
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Extract text with Tesseract, assuming a uniform block of text (--psm 6)
    custom_config = r'--psm 6 -l eng+spa'
    extracted_text = pytesseract.image_to_string(binary, config=custom_config)

    # Detect the language of the text and translate it if necessary
    translator = Translator()
    try:
        detected_language = detect(extracted_text)
        if detected_language == "en":
            translated_text = translator.translate(extracted_text, src="en", dest="es").text
            print(f"Texto extraído (Inglés detectado):\n{extracted_text}")
            print(f"Traducción al español:\n{translated_text}")
        elif detected_language == "es":
            print(f"Texto extraído (Español detectado):\n{extracted_text}")
        else:
            print("Texto extraído (Idioma no identificado):")
            print(extracted_text)
    except Exception as e:
        print(f"Error al detectar/traducir el idioma: {e}")

# Main program: load the FAQ, build the index, and run the interactive loop
def main():
    faq_path = "faq.csv"
    if not os.path.exists(faq_path):
        raise FileNotFoundError(f"No se encontró el archivo '{faq_path}'.")

    print("Cargando datos de FAQ...")
    faq_data = load_faq_data(faq_path)
    questions = faq_data["question"].tolist()
    answers = faq_data["answer"].tolist()

    print("Creando índice vectorial...")
    index, embeddings = create_faiss_index(questions)

    print("\nEscribe tu pregunta o ruta de imagen (o 'salir' para terminar):")
    while True:
        user_input = input("> ").strip()
        if user_input.lower() == "salir":
            print("¡Hasta luego!")
            break

        if user_input.lower().endswith((".jpg", ".jpeg", ".png", ".bmp")):
            try:
                process_image(user_input)
            except Exception as e:
                print(f"Error al procesar la imagen: {e}")
        else:
            closest_question, distance = find_closest_question(user_input, index, questions)
            print(f"\nPregunta más cercana encontrada: {closest_question} (distancia: {distance:.2f})")

            context = f"Pregunta: {closest_question}\nRespuesta: {faq_data.iloc[questions.index(closest_question)]['answer']}"
            response = get_gemini_response(user_input, context)
            print(f"\nRespuesta del asistente: {response}\n")

if __name__ == "__main__":
    main()

Here we can see the code in execution: we enter our questions, the AI finds the closest stored question, and returns a generated answer based on it.
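For reference, a minimal faq.csv with the required question and answer columns could be generated like this (the rows are hypothetical examples, not part of the original project):

import pandas as pd

# Hypothetical sample rows; any CSV with 'question' and 'answer' columns will work
sample = pd.DataFrame({
    "question": [
        "¿Qué es la inteligencia de negocios?",
        "¿Qué es un chatbot?",
    ],
    "answer": [
        "Es el uso de datos y herramientas analíticas para apoyar decisiones empresariales.",
        "Es un programa que conversa con los usuarios y responde sus preguntas automáticamente.",
    ],
})
sample.to_csv("faq.csv", index=False)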

For the image feature, we simply enter the path where our image is located and the program does the rest: preprocessing, text extraction, language detection, and translation.

Conclusion:

Using the Google Gemini API to develop a chatbot that answers questions and translates text found in images from English into Spanish is an innovative and efficient solution. Thanks to its multimodal capabilities, Gemini can natively process text, images, and audio, which can reduce the need for external tools for these tasks. This not only simplifies chatbot development but also helps produce more accurate and consistent responses by combining multiple sources of information in real time.
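As an illustration of that multimodal capability, the OCR and translation steps that this example delegates to pytesseract and googletrans could in principle be handled by a single Gemini request. The following is a hedged sketch, not part of the project's code; the file name is a placeholder and the output quality would need to be verified:

import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # assumed to be available
model = genai.GenerativeModel("gemini-1.5-flash")

# Ask the model to extract the text from the image and translate it into Spanish
image = Image.open("sign.jpg")  # hypothetical image containing English text
response = model.generate_content(
    ["Extrae el texto de esta imagen y tradúcelo al español.", image]
)
print(response.text)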

Additionally, translating text embedded in images benefits from Gemini's advanced visual processing, which can interpret text inside graphics with high precision. This makes it a strong fit for education, technical support, and users looking for an interactive and versatile experience.
