Mochan Shrestha (mochan.org)

Tutorials, Education, Learning and Knowledge

Reduce PDF Tool

Mochan Shrestha

2024/01/07

This is a tool to reduce the size of PDF files by removing or compressing images. It can also be asked to only output a section of pages from the original PDF file.

When uploading PDFs for LLMs, the file size is usually limited. Some PDFs contain a large number of high quality images that bump up the file size.

Whenever this happened, I would just copy/paste the text. But, I don’t know if the PDF processing on the server is getting better and that the visual context of the PDF is also being used. So, an easy solution is to have a tool that can reduce the size of the PDF file by compressing the images since high quality images are probably not necessary.

Reduce-PDF Icon

Source code is available on GitHub and the PyPi package is available here.

Usage

$ reduce-pdf <input-file> <output-file> [options]

Options

Implementation

The tool is implemented in Python using the PyPDF2.

A PDF file contains a collection of objects. When removing images, we just need to find the object and delete it.

# Open the PDF file
doc = fitz.open(pdf_file_path)

# Iterate through the pages
for page_no in range(len(doc)):
    page = doc.load_page(page_no)
    # Iterate through the images and delete them
    img_list = page.get_images(full=True)
    for img in img_list:
        xref = img[0]
        doc._deleteObject(xref)

# Save the changes
doc.save("images_removed.pdf", garbage=4, deflate=True, clean=True)
doc.close()

When compressing images, we need to find the images, get the image data, compress it, put it in the same place as the original image and then delete the original image.

# Open the original PDF file
doc = fitz.open(pdf_file_path)

# Iterate through the pages
for page_no in range(len(doc)):
    page = doc.load_page(page_no)

    # Get list of images on the page
    img_list = page.get_images(full=True)

    for img in img_list:
        xref = img[0]
        base_image = doc.extract_image(xref)
        img_bytes = base_image["image"]

        # Open the image using PIL
        image = Image.open(io.BytesIO(img_bytes))

        # Compress the image - adjust quality as needed
        with io.BytesIO() as output:
            image.save(output, format='JPEG', quality=25)  # Adjust quality for higher compression
            compressed_image_bytes = output.getvalue()

        # Replace the original image with the compressed one
        image_rect = page.get_image_rects(xref)[0]

        # Delete the original image from the PDF
        doc._deleteObject(xref)

        page.insert_image(rect=image_rect, stream=compressed_image_bytes)

# Save the changes to a new file
doc.save("images_compressed.pdf", garbage=4, deflate=True, clean=True)
doc.close()

When removing pages, if we just delete the pages the pages go away but the internal contents of the page is still in the PDF file and the file size is not reduced. So, we have to go into usused pages and delete all of the images there.

The full script for the tool is available here

Limitations