# PyVisionAI API Documentation

## Overview

PyVisionAI is a Python library for extracting and describing content from documents using Vision Language Models (VLMs). It supports multiple file formats and provides both cloud-based and local image description capabilities.
## Installation
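Assuming the package is published on PyPI under the name `pyvisionai` (the library name lowercased):

```bash
pip install pyvisionai
```

To use the local model option, an Ollama server must also be running with a vision model available; see the `describe_image_ollama` defaults below.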
## Core Components

### Content Extraction

The library provides extractors for various file formats:

- PDF documents
- Microsoft Word (DOCX) files
- PowerPoint (PPTX) presentations
- HTML pages
Each extractor supports two methods:

- `page_as_image`: Converts each page to an image and describes it using a Vision LLM
- `text_and_images`: Extracts text and images separately (not available for HTML)
### Image Description

Two Vision LLM options are available:

- OpenAI's GPT-4 Vision (cloud-based, recommended)
- Llama Vision model (local, via Ollama)
## API Reference

### Factory Function

```python
from typing import Optional

from pyvisionai import create_extractor


def create_extractor(
    file_type: str,
    extractor_type: str = "page_as_image",
    model: str = "gpt4",
    api_key: Optional[str] = None,
    prompt: Optional[str] = None,
) -> BaseExtractor:
    """
    Create an extractor instance for the specified file type.

    Args:
        file_type: Type of file to process ("pdf", "docx", "pptx", "html")
        extractor_type: Extraction method ("page_as_image" or "text_and_images")
        model: Vision model to use ("gpt4" or "llama")
        api_key: OpenAI API key (required for GPT-4 Vision)
        prompt: Custom prompt for image description

    Returns:
        An instance of the appropriate extractor class

    Raises:
        ValueError: If file_type or extractor_type is invalid
    """
```
### Base Extractor

All extractors inherit from `BaseExtractor` and implement its interface:

```python
class BaseExtractor:
    def extract(self, input_file: str, output_dir: str) -> str:
        """
        Extract content from a file.

        Args:
            input_file: Path to the input file
            output_dir: Directory to save extracted content

        Returns:
            str: Path to the generated markdown file

        Raises:
            FileNotFoundError: If input_file doesn't exist
            ExtractionError: If extraction fails
        """
```
### Image Description Functions

```python
from typing import Optional

from pyvisionai import describe_image_openai, describe_image_ollama


def describe_image_openai(
    image_path: str,
    model: str = "gpt-4-vision-preview",
    api_key: Optional[str] = None,
    prompt: Optional[str] = None,
    max_tokens: int = 300,
) -> str:
    """
    Describe an image using OpenAI's Vision model.

    Args:
        image_path: Path to the image file
        model: OpenAI model name
        api_key: OpenAI API key
        prompt: Custom prompt for image description
        max_tokens: Maximum tokens in response

    Returns:
        str: Description of the image

    Raises:
        FileNotFoundError: If image file doesn't exist
        APIError: If API call fails
    """


def describe_image_ollama(
    image_path: str,
    model: str = "llama3.2-vision",
    prompt: Optional[str] = None,
) -> str:
    """
    Describe an image using a local Llama model via Ollama.

    Args:
        image_path: Path to the image file
        model: Ollama model name
        prompt: Custom prompt for image description

    Returns:
        str: Description of the image

    Raises:
        FileNotFoundError: If image file doesn't exist
        ConnectionError: If Ollama server is not running
    """
```
## Usage Examples

### Basic Usage

```python
from pyvisionai import create_extractor

# Extract from PDF using default settings (page_as_image + GPT-4 Vision)
extractor = create_extractor("pdf")
output_path = extractor.extract("document.pdf", "output/")

# Extract from DOCX using the text_and_images method
extractor = create_extractor("docx", extractor_type="text_and_images")
output_path = extractor.extract("document.docx", "output/")

# Extract from HTML using the local Llama model
extractor = create_extractor("html", model="llama")
output_path = extractor.extract("page.html", "output/")
```
### Custom Image Description

```python
from pyvisionai import describe_image_openai, describe_image_ollama

# Describe an image with a custom prompt
description = describe_image_openai(
    "image.jpg",
    prompt="List all the objects visible in this image",
    max_tokens=500,
)

# Use the local Llama model
description = describe_image_ollama(
    "image.jpg",
    prompt="Describe the main subject of this image",
)
```
### Error Handling

```python
from pyvisionai import create_extractor
from pyvisionai.exceptions import ExtractionError

try:
    extractor = create_extractor("pdf")
    output = extractor.extract("document.pdf", "output/")
except FileNotFoundError:
    print("Input file not found")
except ExtractionError as e:
    print(f"Extraction failed: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
```
## Configuration

### Environment Variables

- `OPENAI_API_KEY`: Your OpenAI API key (required for GPT-4 Vision)
- `OLLAMA_HOST`: Ollama server host (default: "http://localhost:11434")
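These variables can also be set from Python before the library is used. A minimal sketch, assuming the library falls back to them when no explicit `api_key` argument is passed (implied, but not stated, by the docs above):

```python
import os

# Set the variables before importing, in case they are read at import time.
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder; use your real key
os.environ["OLLAMA_HOST"] = "http://localhost:11434"  # the documented default

from pyvisionai import create_extractor

# Assumption: no api_key argument is needed when OPENAI_API_KEY is set.
extractor = create_extractor("pdf")
```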
### Default Settings

- Default extraction method: `page_as_image`
- Default vision model: `gpt4`
- Default image description prompt: "Extract the exact text as present in the image and write one sentence about each visual in the image"
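Any of these defaults can be overridden through the factory arguments documented above; for example, swapping in a custom prompt (the file names are illustrative):

```python
from pyvisionai import create_extractor

# Keep the default method (page_as_image) and model (gpt4),
# but replace the default description prompt.
extractor = create_extractor(
    "pdf",
    prompt="Extract all text verbatim and summarize each chart in one sentence",
)
output_path = extractor.extract("report.pdf", "output/")
```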
## Performance Considerations

- **Memory Usage**
  - Large PDF files may require significant memory when using the `page_as_image` method
  - Consider using the `text_and_images` method for large documents
- **Processing Speed**
  - Cloud-based GPT-4 Vision is generally faster than the local Llama model
  - HTML processing requires browser rendering and may be slower
- **Batch Processing**
  - Process multiple files in parallel for better performance (see the sketch after this list)
  - Monitor memory usage when processing large batches
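One way to parallelize a batch, sketched with the standard library's thread pool. The worker function and file list are illustrative, and thread-safety of a shared extractor is undocumented, so each task builds its own:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

from pyvisionai import create_extractor


def extract_one(path: str) -> str:
    # One extractor per task avoids sharing state across threads.
    extractor = create_extractor("pdf")
    return extractor.extract(path, "output/")


pdf_files = ["a.pdf", "b.pdf", "c.pdf"]  # illustrative file list

# Threads suit this I/O-bound work; a small pool also bounds peak memory.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(extract_one, path): path for path in pdf_files}
    for future in as_completed(futures):
        print(f"{futures[future]} -> {future.result()}")
```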
## Best Practices

- **File Types**
  - Use `text_and_images` for text-heavy documents
  - Use `page_as_image` for documents with complex layouts
  - Always use `page_as_image` for HTML (the only supported method)
- **Image Description**
  - Use specific prompts for better results
  - Consider token limits when using GPT-4 Vision
  - Ensure the Ollama server is running when using the local model
- **Error Handling**
  - Always implement proper error handling
  - Check input file existence before processing
  - Validate output directory permissions (see the sketch after this list)
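A defensive wrapper applying the last three points before calling the library. This is a sketch: `safe_extract` is a hypothetical helper, and the checks are generic Python rather than PyVisionAI APIs:

```python
import os
from typing import Optional

from pyvisionai import create_extractor
from pyvisionai.exceptions import ExtractionError


def safe_extract(input_file: str, output_dir: str) -> Optional[str]:
    # Check input existence before handing the file to the extractor.
    if not os.path.isfile(input_file):
        print(f"Input file not found: {input_file}")
        return None
    # Validate output directory permissions up front.
    os.makedirs(output_dir, exist_ok=True)
    if not os.access(output_dir, os.W_OK):
        print(f"Output directory is not writable: {output_dir}")
        return None
    try:
        extractor = create_extractor("pdf")
        return extractor.extract(input_file, output_dir)
    except ExtractionError as exc:
        print(f"Extraction failed: {exc}")
        return None
```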