Usage Guide

Command Line Interface

Command Parameters

file-extract Command

file-extract [-h] -t TYPE -s SOURCE -o OUTPUT [-e EXTRACTOR] [-m MODEL] [-k API_KEY] [-v] [-p PROMPT]

Required Arguments:
  -t, --type TYPE         File type to process (pdf, docx, pptx, html)
  -s, --source SOURCE     Source file or directory path
  -o, --output OUTPUT     Output directory path

Optional Arguments:
  -h, --help             Show help message and exit
  -e, --extractor TYPE   Extraction method:
                         - page_as_image: Convert pages to images (default)
                         - text_and_images: Extract text and images separately
                         Note: HTML only supports page_as_image
  -m, --model MODEL      Vision model for image description:
                         - gpt4: GPT-4 Vision (default, recommended)
                         - llama: Local Llama model
  -k, --api-key KEY      OpenAI API key (can also be set via OPENAI_API_KEY env var)
  -v, --verbose          Enable verbose logging
  -p, --prompt TEXT      Custom prompt for image description

describe-image Command

describe-image [-h] -i IMAGE [-m MODEL] [-k API_KEY] [-t MAX_TOKENS] [-v] [-p PROMPT]

Required Arguments:
  -i, --image IMAGE      Path to image file

Optional Arguments:
  -h, --help            Show help message and exit
  -m, --model MODEL     Vision model to use:
                        - gpt4: GPT-4 Vision (default, recommended)
                        - llama: Local Llama model
  -k, --api-key KEY     OpenAI API key (can also be set via OPENAI_API_KEY env var)
  -t, --max-tokens NUM  Maximum tokens for response (default: 300)
  -p, --prompt TEXT     Custom prompt for image description
  -v, --verbose         Enable verbose logging

Examples

File Extraction Examples

# Basic usage with defaults (page_as_image method, GPT-4 Vision)
file-extract -t pdf -s document.pdf -o output_dir
file-extract -t html -s webpage.html -o output_dir  # HTML always uses page_as_image

# Specify extraction method (not applicable for HTML)
file-extract -t docx -s document.docx -o output_dir -e text_and_images

# Use local Llama model for image description
file-extract -t pptx -s slides.pptx -o output_dir -m llama

# Process all PDFs in a directory with verbose logging
file-extract -t pdf -s input_dir -o output_dir -v

# Use custom OpenAI API key
file-extract -t pdf -s document.pdf -o output_dir -k "your-api-key"

# Use custom prompt for image descriptions
file-extract -t pdf -s document.pdf -o output_dir -p "Focus on text content and layout"
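
When file-extract is driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the flags documented above; the helper names `build_file_extract_cmd` and `run_file_extract` are hypothetical and not part of PyVisionAI.

```python
import shlex
import subprocess
from typing import List, Optional

SUPPORTED_TYPES = {"pdf", "docx", "pptx", "html"}


def build_file_extract_cmd(
    file_type: str,
    source: str,
    output: str,
    extractor: Optional[str] = None,
    model: Optional[str] = None,
    prompt: Optional[str] = None,
    verbose: bool = False,
) -> List[str]:
    """Assemble an argv list for file-extract from the documented flags."""
    if file_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported file type: {file_type!r}")
    if file_type == "html" and extractor not in (None, "page_as_image"):
        # The help text notes that HTML only supports page_as_image.
        raise ValueError("HTML only supports the page_as_image extractor")

    cmd = ["file-extract", "-t", file_type, "-s", source, "-o", output]
    if extractor:
        cmd += ["-e", extractor]
    if model:
        cmd += ["-m", model]
    if prompt:
        cmd += ["-p", prompt]
    if verbose:
        cmd.append("-v")
    return cmd


def run_file_extract(**kwargs) -> None:
    """Run file-extract; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_file_extract_cmd(**kwargs), check=True)


if __name__ == "__main__":
    cmd = build_file_extract_cmd(
        "docx", "document.docx", "output_dir",
        extractor="text_and_images", verbose=True,
    )
    # Print the command in copy-pasteable shell form without running it.
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when a custom prompt contains spaces or quotes.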

Image Description Examples

# Basic usage with defaults (GPT-4 Vision)
describe-image -i photo.jpg

# Use local Llama model
describe-image -i photo.jpg -m llama

# Use custom prompt
describe-image -i photo.jpg -p "List the main colors and their proportions"

# Customize token limit
describe-image -i photo.jpg -t 500

# Enable verbose logging
describe-image -i photo.jpg -v

# Use custom OpenAI API key
describe-image -i photo.jpg -k "your-api-key"

# Combine options
describe-image -i photo.jpg -m llama -p "Describe the lighting and shadows" -v

Python Library Usage

from pyvisionai import create_extractor, describe_image_openai, describe_image_ollama

# 1. Extract content from files
extractor = create_extractor("pdf")  # or "docx", "pptx", or "html"
output_path = extractor.extract("input.pdf", "output_dir")

# With specific extraction method
extractor = create_extractor("pdf", extractor_type="text_and_images")
output_path = extractor.extract("input.pdf", "output_dir")

# Extract from HTML (always uses page_as_image method)
extractor = create_extractor("html")
output_path = extractor.extract("page.html", "output_dir")

# 2. Describe images
# Using GPT-4 Vision (default, recommended)
description = describe_image_openai(
    "image.jpg",
    model="gpt-4o-mini",  # default
    api_key="your-api-key",  # optional if set in environment
    max_tokens=300,  # default
    prompt="Describe this image focusing on colors and textures"  # optional custom prompt
)

# Using local Llama model
description = describe_image_ollama(
    "image.jpg",
    model="llama3.2-vision",  # default
    prompt="List the main objects in this image"  # optional custom prompt
)
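
The CLI selects a model by name (gpt4 or llama), while the library exposes one function per backend. A small dispatcher can bridge the two. This is a sketch under assumptions: the wrapper `describe` and the helper `load_backends` are hypothetical, and it assumes both library functions return the description as a string.

```python
from typing import Callable, Dict, Optional

DescribeFn = Callable[..., str]


def load_backends() -> Dict[str, DescribeFn]:
    """Map the CLI model names to the library functions shown above."""
    # Imported lazily so this module loads even without pyvisionai installed.
    from pyvisionai import describe_image_ollama, describe_image_openai

    return {"gpt4": describe_image_openai, "llama": describe_image_ollama}


def describe(
    image_path: str,
    backend: str = "gpt4",
    prompt: Optional[str] = None,
    backends: Optional[Dict[str, DescribeFn]] = None,
) -> str:
    """Describe an image with the backend selected by name."""
    available = backends if backends is not None else load_backends()
    if backend not in available:
        raise ValueError(
            f"unknown backend {backend!r}; expected one of {sorted(available)}"
        )
    # Only forward the prompt when one was given, so each backend
    # keeps its own default prompt otherwise.
    kwargs = {"prompt": prompt} if prompt is not None else {}
    return available[backend](image_path, **kwargs)
```

With PyVisionAI installed, `describe("image.jpg", backend="llama")` would route to `describe_image_ollama`. The `backends` parameter exists so the dispatcher can be exercised with stand-in functions in tests.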

Custom Prompts

PyVisionAI supports custom prompts for both file extraction and image description:

Using Custom Prompts

  1. CLI Usage

    # File extraction with custom prompt
    file-extract -t pdf -s document.pdf -o output_dir -p "Extract all text verbatim and describe any diagrams or images in detail"
    
    # Image description with custom prompt
    describe-image -i image.jpg -p "List the main colors and describe the layout of elements"
    

  2. Library Usage

    # File extraction with custom prompt
    extractor = create_extractor(
        "pdf",
        extractor_type="page_as_image",
        prompt="Extract all text exactly as it appears and provide detailed descriptions of any charts or diagrams"
    )
    output_path = extractor.extract("input.pdf", "output_dir")
    
    # Image description with custom prompt
    description = describe_image_openai(
        "image.jpg",
        prompt="Focus on spatial relationships between objects and any text content"
    )
    

  3. Environment Variable

    # Set default prompt via environment variable
    export FILE_EXTRACTOR_PROMPT="Extract text and describe visual elements with emphasis on layout"
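
A script that wants predictable behavior can resolve the prompt itself and pass it explicitly. The sketch below is not PyVisionAI's internal logic: the precedence (explicit argument, then FILE_EXTRACTOR_PROMPT, then a fallback) is an assumption, and `resolve_prompt` and `DEFAULT_PROMPT` are hypothetical names.

```python
import os
from typing import Mapping, Optional

# Hypothetical fallback used only by this sketch.
DEFAULT_PROMPT = "Extract text and describe visual elements"


def resolve_prompt(
    explicit: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Pick a prompt: explicit argument, then environment, then fallback."""
    if explicit:
        return explicit
    # An unset or empty variable falls through to the fallback.
    return env.get("FILE_EXTRACTOR_PROMPT") or DEFAULT_PROMPT
```

The result can then be handed to the library, for example `create_extractor("pdf", prompt=resolve_prompt())`.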
    

Writing Effective Prompts

  1. For Page-as-Image Method
     • Include instructions for both text extraction and visual description
     • Example: "Extract the exact text as it appears on the page and describe any images, diagrams, or visual elements in detail"

  2. For Text-and-Images Method
     • Focus only on image description, since text is extracted separately
     • Example: "Describe the visual content, focusing on what the image represents and any visual elements it contains"

  3. For Image Description
     • Be specific about which aspects to focus on
     • Example: "Describe the main elements, their arrangement, and any text visible in the image"
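
The guidance above can be captured as a lookup from extraction method to a starting prompt. This is an illustrative sketch: the prompt strings are the examples from this section, while `PROMPT_TEMPLATES`, `prompt_for`, and the `focus` parameter are hypothetical names.

```python
PROMPT_TEMPLATES = {
    # Page-as-image: the model sees the whole page, so ask for text and visuals.
    "page_as_image": (
        "Extract the exact text as it appears on the page and describe "
        "any images, diagrams, or visual elements in detail"
    ),
    # Text-and-images: text is extracted separately, so describe images only.
    "text_and_images": (
        "Describe the visual content, focusing on what the image "
        "represents and any visual elements it contains"
    ),
}


def prompt_for(extractor_type: str, focus: str = "") -> str:
    """Return the base prompt for a method, optionally narrowed by a focus."""
    try:
        base = PROMPT_TEMPLATES[extractor_type]
    except KeyError:
        raise ValueError(f"unknown extractor type: {extractor_type!r}") from None
    if focus:
        return f"{base}. Pay particular attention to {focus}."
    return base
```

For example, `create_extractor("pdf", extractor_type="text_and_images", prompt=prompt_for("text_and_images", focus="charts"))` keeps the prompt aligned with the chosen method.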

Supported File Types

PDF Files

  • Text extraction
  • Image description
  • Table recognition
  • Layout analysis

Word Documents (DOCX)

  • Text and formatting
  • Embedded images
  • Tables and lists

PowerPoint (PPTX)

  • Slide content
  • Speaker notes
  • Embedded media

HTML Pages

  • Text content
  • Images
  • Dynamic content (with JavaScript rendering)

Best Practices

  1. Performance Optimization
     • Process files in batches when possible
     • Choose the model that fits your needs: gpt4 (GPT-4 Vision, the recommended default) or llama (a local Llama model)
     • Cache results for frequently accessed documents

  2. Error Handling

    try:
        # Process your document
        extractor = create_extractor("pdf")
        result = extractor.extract("document.pdf", "output_dir")
    except FileNotFoundError:
        print("File not found")
    except PermissionError:
        print("Permission denied")
    except Exception as e:
        print(f"An error occurred: {e}")
    

  3. Resource Management

    # Initialize resources properly
    extractor = create_extractor("pdf")
    
    # Use appropriate configuration
    extractor = create_extractor(
        "pdf",
        extractor_type="text_and_images",
        prompt="Extract text and describe visual elements"
    )
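
The batching, caching, and error-handling advice in this section can be combined in one helper. The following is a minimal sketch, not a PyVisionAI feature: `extract_batch`, `file_digest`, and the JSON cache file are hypothetical, and it assumes `extract(source, output_dir)` returns the output path.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def extract_batch(
    sources: Iterable[str],
    output_dir: str,
    extract: Callable[[str, str], str],
    cache_file: str = "extract_cache.json",
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run extract over sources, skipping unchanged files and collecting errors."""
    cache_path = Path(cache_file)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    results: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for source in sources:
        try:
            # Key the cache on content, so renamed copies are not reprocessed.
            key = file_digest(Path(source))
            if key not in cache:
                cache[key] = str(extract(source, output_dir))
            results[source] = cache[key]
        except (FileNotFoundError, PermissionError) as exc:
            errors[source] = f"{type(exc).__name__}: {exc}"
        except Exception as exc:  # keep the batch going on unexpected failures
            errors[source] = f"unexpected error: {exc}"
    cache_path.write_text(json.dumps(cache, indent=2))
    return results, errors
```

With the library installed, a call might look like `results, errors = extract_batch(["a.pdf", "b.pdf"], "output_dir", create_extractor("pdf").extract)`. The cache ignores the prompt and output directory, so a real implementation would likely include them in the key.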
    

Examples

Basic Extraction

from pyvisionai import create_extractor

# PDF extraction with default settings (page_as_image + GPT-4 Vision)
extractor = create_extractor("pdf")
output_path = extractor.extract("input.pdf", "output/pdf")

# DOCX extraction using text_and_images method
extractor = create_extractor("docx", extractor_type="text_and_images")
output_path = extractor.extract("input.docx", "output/docx")

# PPTX extraction with custom prompt
extractor = create_extractor(
    "pptx",
    prompt="List all text content and describe any diagrams or charts"
)
output_path = extractor.extract("input.pptx", "output/pptx")

# HTML extraction (always uses page_as_image)
extractor = create_extractor("html")
output_path = extractor.extract("https://example.com", "output/html")

Specialized Extraction

from pyvisionai import create_extractor, describe_image_openai

# Technical documentation extraction
extractor = create_extractor(
    "pdf",
    prompt=(
        "Extract all code snippets, technical terms, and command examples. "
        "For diagrams, describe the technical architecture and components shown."
    )
)
output_path = extractor.extract("technical_doc.pdf", "output/technical")

# Business report extraction
extractor = create_extractor(
    "pptx",
    prompt=(
        "Extract key business metrics, financial figures, and trends. "
        "For charts, provide detailed analysis of the data presented."
    )
)
output_path = extractor.extract("business_report.pptx", "output/business")

# Image description with custom prompt
description = describe_image_openai(
    "image.jpg",
    prompt="Analyze the chart type, axes labels, and data trends. "
           "Provide key insights and numerical values where visible."
)