Usage Guide
Command Line Interface
Command Parameters
file-extract
Command
file-extract [-h] -t TYPE -s SOURCE -o OUTPUT [-e EXTRACTOR] [-m MODEL] [-k API_KEY] [-v] [-p PROMPT]
Required Arguments:
-t, --type TYPE File type to process (pdf, docx, pptx, html)
-s, --source SOURCE Source file or directory path
-o, --output OUTPUT Output directory path
Optional Arguments:
-h, --help Show help message and exit
-e, --extractor EXTRACTOR Extraction method:
- page_as_image: Convert pages to images (default)
- text_and_images: Extract text and images separately
Note: HTML only supports page_as_image
-m, --model MODEL Vision model for image description:
- gpt4: GPT-4 Vision (default, recommended)
- llama: Local Llama model
-k, --api-key KEY OpenAI API key (can also be set via OPENAI_API_KEY env var)
-v, --verbose Enable verbose logging
-p, --prompt TEXT Custom prompt for image description
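The constraints listed above (four file types, two extraction methods, HTML limited to page_as_image) could be checked up front before any file is processed. The following is a minimal sketch of such a check; `resolve_extractor_type` and the constant names are hypothetical helpers written for illustration and are not part of PyVisionAI.

```python
# Hypothetical pre-flight check mirroring the documented CLI constraints.
# None of these names come from PyVisionAI itself.
SUPPORTED_TYPES = {"pdf", "docx", "pptx", "html"}
EXTRACTOR_TYPES = {"page_as_image", "text_and_images"}
DEFAULT_EXTRACTOR = "page_as_image"  # documented default for -e


def resolve_extractor_type(file_type, requested=None):
    """Return the extraction method to use, or raise ValueError."""
    if file_type not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported file type: {file_type!r}")
    method = requested or DEFAULT_EXTRACTOR
    if method not in EXTRACTOR_TYPES:
        raise ValueError(f"Unknown extraction method: {method!r}")
    # The guide states that HTML only supports page_as_image.
    if file_type == "html" and method != "page_as_image":
        raise ValueError("HTML only supports page_as_image")
    return method
```

Calling `resolve_extractor_type("docx", "text_and_images")` returns the requested method, while `resolve_extractor_type("html", "text_and_images")` raises `ValueError`.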
describe-image
Command
describe-image [-h] -i IMAGE [-m MODEL] [-k API_KEY] [-t MAX_TOKENS] [-v] [-p PROMPT]
Required Arguments:
-i, --image IMAGE Path to image file
Optional Arguments:
-h, --help Show help message and exit
-m, --model MODEL Vision model to use:
- gpt4: GPT-4 Vision (default, recommended)
- llama: Local Llama model
-k, --api-key KEY OpenAI API key (can also be set via OPENAI_API_KEY env var)
-t, --max-tokens NUM Maximum tokens for response (default: 300)
-p, --prompt TEXT Custom prompt for image description
-v, --verbose Enable verbose logging
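Both commands accept the API key either through `-k` or through the `OPENAI_API_KEY` environment variable. A wrapper script might resolve the key as sketched below. The helper name `resolve_api_key` is hypothetical, and the precedence shown (an explicit value wins over the environment) is an assumption rather than documented behavior.

```python
import os


def resolve_api_key(explicit_key=None, env=None):
    """Pick an OpenAI API key from an explicit value or the environment.

    Assumption: an explicitly passed key takes precedence over
    OPENAI_API_KEY. The `env` parameter exists only to make testing easy.
    """
    env = os.environ if env is None else env
    key = explicit_key or env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "No API key found: pass one explicitly or set OPENAI_API_KEY"
        )
    return key
```

The returned value can then be passed on as the `api_key` argument of `describe_image_openai`.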
Examples
File Extraction Examples
# Basic usage with defaults (page_as_image method, GPT-4 Vision)
file-extract -t pdf -s document.pdf -o output_dir
file-extract -t html -s webpage.html -o output_dir # HTML always uses page_as_image
# Specify extraction method (not applicable for HTML)
file-extract -t docx -s document.docx -o output_dir -e text_and_images
# Use local Llama model for image description
file-extract -t pptx -s slides.pptx -o output_dir -m llama
# Process all PDFs in a directory with verbose logging
file-extract -t pdf -s input_dir -o output_dir -v
# Use custom OpenAI API key
file-extract -t pdf -s document.pdf -o output_dir -k "your-api-key"
# Use custom prompt for image descriptions
file-extract -t pdf -s document.pdf -o output_dir -p "Focus on text content and layout"
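The directory example above can also be reproduced from Python by collecting the matching files and calling an extractor on each one. This is a rough sketch only: `find_sources` and `extract_all` are hypothetical helpers, the directory scan is assumed to be non-recursive, and the `extractor` argument is any object with the `extract(source, output_dir)` method shown in the library usage.

```python
from pathlib import Path


def find_sources(source, file_type):
    """Return a list of input files for a file path or a directory.

    Assumption: directories are scanned non-recursively and files are
    matched by extension, case-insensitively.
    """
    path = Path(source)
    if path.is_file():
        return [path]
    suffix = f".{file_type.lower()}"
    return sorted(
        p for p in path.iterdir()
        if p.is_file() and p.suffix.lower() == suffix
    )


def extract_all(extractor, source, output_dir, file_type):
    """Run extractor.extract() on every matching file.

    Returns a dict mapping each source path to the extractor's result.
    """
    results = {}
    for src in find_sources(source, file_type):
        results[str(src)] = extractor.extract(str(src), output_dir)
    return results
```

With PyVisionAI installed, this could be used as `extract_all(create_extractor("pdf"), "input_dir", "output_dir", "pdf")`.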
Image Description Examples
# Basic usage with defaults (GPT-4 Vision)
describe-image -i photo.jpg
# Use local Llama model
describe-image -i photo.jpg -m llama
# Use custom prompt
describe-image -i photo.jpg -p "List the main colors and their proportions"
# Customize token limit
describe-image -i photo.jpg -t 500
# Enable verbose logging
describe-image -i photo.jpg -v
# Use custom OpenAI API key
describe-image -i photo.jpg -k "your-api-key"
# Combine options
describe-image -i photo.jpg -m llama -p "Describe the lighting and shadows" -v
Python Library Usage
from pyvisionai import create_extractor, describe_image_openai, describe_image_ollama
# 1. Extract content from files
extractor = create_extractor("pdf") # or "docx", "pptx", or "html"
output_path = extractor.extract("input.pdf", "output_dir")
# With specific extraction method
extractor = create_extractor("pdf", extractor_type="text_and_images")
output_path = extractor.extract("input.pdf", "output_dir")
# Extract from HTML (always uses page_as_image method)
extractor = create_extractor("html")
output_path = extractor.extract("page.html", "output_dir")
# 2. Describe images
# Using GPT-4 Vision (default, recommended)
description = describe_image_openai(
    "image.jpg",
    model="gpt-4o-mini",  # default
    api_key="your-api-key",  # optional if set in environment
    max_tokens=300,  # default
    prompt="Describe this image focusing on colors and textures"  # optional custom prompt
)
# Using local Llama model
description = describe_image_ollama(
    "image.jpg",
    model="llama3.2-vision",  # default
    prompt="List the main objects in this image"  # optional custom prompt
)
Custom Prompts
PyVisionAI supports custom prompts for both file extraction and image description:
Using Custom Prompts
- CLI Usage
- Library Usage
# File extraction with custom prompt
extractor = create_extractor(
    "pdf",
    extractor_type="page_as_image",
    prompt="Extract all text exactly as it appears and provide detailed descriptions of any charts or diagrams"
)
output_path = extractor.extract("input.pdf", "output_dir")
# Image description with custom prompt
description = describe_image_openai(
    "image.jpg",
    prompt="Focus on spatial relationships between objects and any text content"
)
- Environment Variable
Writing Effective Prompts
- For Page-as-Image Method
  - Include instructions for both text extraction and visual description
  - Example: "Extract the exact text as it appears on the page and describe any images, diagrams, or visual elements in detail"
- For Text-and-Images Method
  - Focus only on image description since text is extracted separately
  - Example: "Describe the visual content, focusing on what the image represents and any visual elements it contains"
- For Image Description
  - Be specific about what aspects to focus on
  - Example: "Describe the main elements, their arrangement, and any text visible in the image"
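The guidance above can be captured in a small lookup that selects a prompt per mode and lets a caller override it. This is only a sketch: `EXAMPLE_PROMPTS` and `pick_prompt` are hypothetical names, and the strings are the example prompts from this guide, not necessarily the defaults PyVisionAI uses internally.

```python
# Example prompts copied from the guidance above, keyed by mode.
# The "image" key is an illustrative label for standalone image description.
EXAMPLE_PROMPTS = {
    "page_as_image": (
        "Extract the exact text as it appears on the page and describe "
        "any images, diagrams, or visual elements in detail"
    ),
    "text_and_images": (
        "Describe the visual content, focusing on what the image "
        "represents and any visual elements it contains"
    ),
    "image": (
        "Describe the main elements, their arrangement, and any text "
        "visible in the image"
    ),
}


def pick_prompt(mode, custom=None):
    """Return the caller's prompt if given, else the example for `mode`."""
    if custom and custom.strip():
        return custom.strip()
    try:
        return EXAMPLE_PROMPTS[mode]
    except KeyError:
        raise ValueError(f"Unknown mode: {mode!r}") from None
```

The result can be passed as the `prompt` argument of `create_extractor` or `describe_image_openai`.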
Supported File Types
PDF Files
- Text extraction
- Image description
- Table recognition
- Layout analysis
Word Documents (DOCX)
- Text and formatting
- Embedded images
- Tables and lists
PowerPoint (PPTX)
- Slide content
- Speaker notes
- Embedded media
HTML Pages
- Text content
- Images
- Dynamic content (with JavaScript rendering)
Best Practices
- Performance Optimization
  - Process files in batches when possible
  - Use appropriate model for your needs
  - Cache results for frequently accessed documents
- Error Handling
- Resource Management
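One way to act on the caching advice above is to key results by a hash of the input file, so an unchanged document is not sent to the vision model again. The sketch below is an illustration under stated assumptions: `file_digest` and `cached_extract` are hypothetical helpers, the cache is a plain JSON file, and the key ignores the prompt and extraction method, which a real cache would likely need to include.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_extract(extract_fn, source, output_dir, cache_file):
    """Call extract_fn(source, output_dir) unless a cached result exists.

    The cache maps content hashes to output paths. An entry is reused
    only if the recorded output path still exists on disk.
    """
    cache_path = Path(cache_file)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    key = file_digest(source)
    if key in cache and Path(cache[key]).exists():
        return cache[key]
    result = str(extract_fn(source, output_dir))
    cache[key] = result
    cache_path.write_text(json.dumps(cache, indent=2))
    return result
```

With PyVisionAI installed, `extract_fn` could be the bound method `create_extractor("pdf").extract`.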
Examples
Basic Extraction
from pyvisionai import create_extractor
# PDF extraction with default settings (page_as_image + GPT-4 Vision)
extractor = create_extractor("pdf")
output_path = extractor.extract("input.pdf", "output/pdf")
# DOCX extraction using text_and_images method
extractor = create_extractor("docx", extractor_type="text_and_images")
output_path = extractor.extract("input.docx", "output/docx")
# PPTX extraction with custom prompt
extractor = create_extractor(
    "pptx",
    prompt="List all text content and describe any diagrams or charts"
)
output_path = extractor.extract("input.pptx", "output/pptx")
# HTML extraction (always uses page_as_image)
extractor = create_extractor("html")
output_path = extractor.extract("https://example.com", "output/html")
Specialized Extraction
from pyvisionai import create_extractor, describe_image_openai
# Technical documentation extraction
extractor = create_extractor(
    "pdf",
    prompt=(
        "Extract all code snippets, technical terms, and command examples. "
        "For diagrams, describe the technical architecture and components shown."
    )
)
output_path = extractor.extract("technical_doc.pdf", "output/technical")
# Business report extraction
extractor = create_extractor(
    "pptx",
    prompt=(
        "Extract key business metrics, financial figures, and trends. "
        "For charts, provide detailed analysis of the data presented."
    )
)
output_path = extractor.extract("business_report.pptx", "output/business")
# Image description with custom prompt
description = describe_image_openai(
    "image.jpg",
    prompt=(
        "Analyze the chart type, axes labels, and data trends. "
        "Provide key insights and numerical values where visible."
    )
)