PyVisionAI API Documentation


PyVisionAI is a Python library for extracting and describing content from documents using vision-capable large language models (Vision LLMs). It supports multiple file formats and provides both cloud-based and local image description.


Installation

pip install pyvisionai

Core Components

Content Extraction

The library provides extractors for the following file formats:

  • PDF documents
  • Microsoft Word (DOCX) files
  • PowerPoint (PPTX) presentations
  • HTML pages

Extractors support up to two methods:

  • page_as_image: Converts each page to an image and describes it using a Vision LLM
  • text_and_images: Extracts text and images separately (not available for HTML)

Image Description

Two Vision LLM options are available:

  • OpenAI's GPT-4 Vision (cloud-based, recommended)
  • Llama Vision model (local, via Ollama)

API Reference

Factory Function

from pyvisionai import create_extractor

def create_extractor(
    file_type: str,
    extractor_type: str = "page_as_image",
    model: str = "gpt4",
    api_key: Optional[str] = None,
    prompt: Optional[str] = None
) -> BaseExtractor:
    """
    Create an extractor instance for the specified file type.

    Args:
        file_type: Type of file to process ("pdf", "docx", "pptx", "html")
        extractor_type: Extraction method ("page_as_image" or "text_and_images")
        model: Vision model to use ("gpt4" or "llama")
        api_key: OpenAI API key (required for GPT-4 Vision)
        prompt: Custom prompt for image description

    Returns:
        An instance of the appropriate extractor class

    Raises:
        ValueError: If file_type or extractor_type is invalid
    """
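The documented argument values can be checked before calling the factory. The following is a minimal sketch under assumptions: check_extractor_args is a hypothetical helper, not part of PyVisionAI, and it only mirrors the values listed above plus the note that text_and_images is not available for HTML. The library's own validation may differ.

```python
# Hypothetical pre-check; the accepted values are copied from the reference above.
SUPPORTED_FILE_TYPES = {"pdf", "docx", "pptx", "html"}
SUPPORTED_METHODS = {"page_as_image", "text_and_images"}


def check_extractor_args(file_type: str, extractor_type: str = "page_as_image") -> None:
    """Raise ValueError if the arguments fall outside the documented values."""
    if file_type not in SUPPORTED_FILE_TYPES:
        raise ValueError(f"Unsupported file_type: {file_type!r}")
    if extractor_type not in SUPPORTED_METHODS:
        raise ValueError(f"Unsupported extractor_type: {extractor_type!r}")
    # The docs state that text_and_images is not available for HTML.
    if file_type == "html" and extractor_type == "text_and_images":
        raise ValueError("text_and_images is not available for HTML")
```

Calling check_extractor_args("docx", "text_and_images") returns silently, while check_extractor_args("html", "text_and_images") raises ValueError before any file is touched.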

Base Extractor

All extractors inherit from BaseExtractor and implement its interface:

class BaseExtractor:
    def extract(self, input_file: str, output_dir: str) -> str:
        """
        Extract content from a file.

        Args:
            input_file: Path to the input file
            output_dir: Directory to save extracted content

        Returns:
            str: Path to the generated markdown file

        Raises:
            FileNotFoundError: If input_file doesn't exist
            ExtractionError: If extraction fails
        """
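Because extract returns the path to the generated markdown file rather than its content, a small wrapper can read the result back in. This is a sketch, not library code: extract_to_text is a hypothetical helper, it accepts any object with the extract(input_file, output_dir) method described above, and it assumes the markdown file is UTF-8 encoded.

```python
from pathlib import Path


def extract_to_text(extractor, input_file: str, output_dir: str) -> str:
    """Run an extractor and return the generated markdown as a string.

    `extractor` is any object exposing extract(input_file, output_dir) -> str.
    UTF-8 is an assumption about the output encoding.
    """
    markdown_path = extractor.extract(input_file, output_dir)
    return Path(markdown_path).read_text(encoding="utf-8")
```

With the real library this might be called as extract_to_text(create_extractor("pdf"), "document.pdf", "output/").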

Image Description Functions

from pyvisionai import describe_image_openai, describe_image_ollama

def describe_image_openai(
    image_path: str,
    model: str = "gpt-4-vision-preview",
    api_key: Optional[str] = None,
    prompt: Optional[str] = None,
    max_tokens: int = 300
) -> str:
    """
    Describe an image using OpenAI's Vision model.

    Args:
        image_path: Path to the image file
        model: OpenAI model name
        api_key: OpenAI API key
        prompt: Custom prompt for image description
        max_tokens: Maximum tokens in response

    Returns:
        str: Description of the image

    Raises:
        FileNotFoundError: If image file doesn't exist
        APIError: If API call fails
    """

def describe_image_ollama(
    image_path: str,
    model: str = "llama3.2-vision",
    prompt: Optional[str] = None
) -> str:
    """
    Describe an image using local Llama model via Ollama.

    Args:
        image_path: Path to the image file
        model: Ollama model name
        prompt: Custom prompt for image description

    Returns:
        str: Description of the image

    Raises:
        FileNotFoundError: If image file doesn't exist
        ConnectionError: If Ollama server is not running
    """
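Since both functions take an image path and an optional prompt, one possible pattern is to try one backend and fall back to the other if it fails. The sketch below is an illustration only: describe_with_fallback is a hypothetical helper, the two backends are passed in as callables so the snippet runs without the library installed, and catching a broad Exception is a simplification because the import path of APIError is not documented here.

```python
from typing import Callable, Optional

# Any callable with the shape describe(image_path, prompt=...) -> str.
Describer = Callable[..., str]


def describe_with_fallback(
    image_path: str,
    primary: Describer,
    fallback: Describer,
    prompt: Optional[str] = None,
) -> str:
    """Try `primary`; if it fails for any reason other than a missing file, use `fallback`."""
    try:
        return primary(image_path, prompt=prompt)
    except FileNotFoundError:
        # A missing image would fail in the fallback too, so surface it directly.
        raise
    except Exception:
        return fallback(image_path, prompt=prompt)
```

With the library installed, a call could look like describe_with_fallback("image.jpg", describe_image_openai, describe_image_ollama), where "image.jpg" is a placeholder path. Errors raised by the fallback itself, such as ConnectionError when Ollama is not running, still propagate to the caller.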

Usage Examples

Basic Usage

from pyvisionai import create_extractor

# Extract from PDF using default settings (page_as_image + GPT-4 Vision)
extractor = create_extractor("pdf")
output_path = extractor.extract("document.pdf", "output/")

# Extract from DOCX using text_and_images method
extractor = create_extractor("docx", extractor_type="text_and_images")
output_path = extractor.extract("document.docx", "output/")

# Extract from HTML using local Llama model
extractor = create_extractor("html", model="llama")
output_path = extractor.extract("page.html", "output/")

Custom Image Description

from pyvisionai import describe_image_openai

# Describe an image with a custom prompt
# ("image.jpg" is a placeholder path)
description = describe_image_openai(
    "image.jpg",
    prompt="List all the objects visible in this image",
)

# Use the local Llama model
from pyvisionai import describe_image_ollama
description = describe_image_ollama(
    "image.jpg",
    prompt="Describe the main subject of this image",
)
Error Handling

from pyvisionai import create_extractor
from pyvisionai.exceptions import ExtractionError

try:
    extractor = create_extractor("pdf")
    output = extractor.extract("document.pdf", "output/")
except FileNotFoundError:
    print("Input file not found")
except ExtractionError as e:
    print(f"Extraction failed: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")


Environment Variables

  • OPENAI_API_KEY: Your OpenAI API key (required for GPT-4 Vision)
  • OLLAMA_HOST: Ollama server host (default: "http://localhost:11434")
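These variables can also be inspected from application code, for example to decide which model to request. The following is a rough sketch: read_settings and pick_model are hypothetical helpers, not library functions, and how PyVisionAI itself reads these variables is not shown in this document. The default host is the one listed above.

```python
import os
from typing import Mapping, Optional

DEFAULT_OLLAMA_HOST = "http://localhost:11434"  # default documented above


def read_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the two documented variables from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return {
        "openai_api_key": env.get("OPENAI_API_KEY"),
        "ollama_host": env.get("OLLAMA_HOST", DEFAULT_OLLAMA_HOST),
    }


def pick_model(settings: dict) -> str:
    """Return "gpt4" when an API key is set, otherwise "llama" (illustrative policy)."""
    return "gpt4" if settings["openai_api_key"] else "llama"
```

The result of pick_model(read_settings()) could then be passed as the model argument of create_extractor.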

Default Settings

  • Default extraction method: page_as_image
  • Default vision model: gpt4
  • Default image description prompt: "Extract the exact text as present in the image and write one sentence about each visual in the image"

Performance Considerations

  1. Memory Usage
     • Large PDF files may require significant memory when using the page_as_image method
     • Consider using the text_and_images method for large documents

  2. Processing Speed
     • Cloud-based GPT-4 Vision is generally faster than the local Llama model
     • HTML processing requires browser rendering and may be slower

  3. Batch Processing
     • Process multiple files in parallel for better performance
     • Monitor memory usage when processing large batches
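One way to process files in parallel while keeping memory bounded is a thread pool with a small worker limit. This is a sketch using only the standard library: extract_many is a hypothetical helper, the per-file work is passed in as a callable, and whether PyVisionAI extractors are safe to share between threads is not stated here, so the usage note creates one extractor per call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Union


def extract_many(
    extract_one: Callable[[str], str],
    input_files: Iterable[str],
    max_workers: int = 4,
) -> Dict[str, Union[str, Exception]]:
    """Run `extract_one` on each path concurrently.

    Returns a dict mapping each input path to its result, or to the
    exception it raised, so one failure does not abort the whole batch.
    A lower max_workers keeps fewer documents in memory at once.
    """
    results: Dict[str, Union[str, Exception]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_one, path): path for path in input_files}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                results[path] = exc
    return results
```

A possible call is extract_many(lambda p: create_extractor("pdf").extract(p, "output/"), ["a.pdf", "b.pdf"]), where the file names are placeholders.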

Best Practices

  1. File Types
     • Use text_and_images for text-heavy documents
     • Use page_as_image for documents with complex layouts
     • Always use page_as_image for HTML (only supported method)

  2. Image Description
     • Use specific prompts for better results
     • Consider token limits when using GPT-4 Vision
     • Ensure Ollama server is running when using local model

  3. Error Handling
     • Always implement proper error handling
     • Check input file existence before processing
     • Validate output directory permissions
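The error-handling checks above can be gathered into a single pre-flight step run before extraction. The following is a minimal sketch using only the standard library: preflight is a hypothetical helper, and creating the output directory when it is missing is a design choice of this example, not documented library behavior.

```python
import os


def preflight(input_file: str, output_dir: str) -> None:
    """Check that the input exists and the output directory is writable.

    Raises FileNotFoundError or PermissionError; returns None on success.
    """
    if not os.path.isfile(input_file):
        raise FileNotFoundError(f"Input file not found: {input_file}")
    # Create the output directory if needed (a choice made for this sketch).
    os.makedirs(output_dir, exist_ok=True)
    if not os.access(output_dir, os.W_OK):
        raise PermissionError(f"Output directory is not writable: {output_dir}")
```

Calling preflight("document.pdf", "output/") before extractor.extract(...) reports a missing file or an unwritable directory before any model call is made.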