Extract images from PDF page

You can extract images from PDF page objects with the render method of the PDFiumImageObject class. The method takes a render option that selects the rendering engine: use the sharp engine to convert the bitmap image to PNG format, or the bitmap engine to get the raw bitmap image data as a buffer.

import { promises as fs } from 'fs';

// `library` is an initialized PDFium library instance and `buff` holds the PDF file bytes
const document = await library.loadDocument(buff);

let index = 0;

// Iterate over pages in the document
for (const page of document.pages()) {
  // Iterate over objects in the page
  for (const object of page.objects()) {
    // We are interested only in image objects
    if (object.type === "image") {
      // Render the image using the sharp rendering engine
      const { data: image } = await object.render({
        render: "sharp",
      });

      // Save the PNG image to the output folder
      // (`image` is already the rendered byte array, so it is passed directly)
      await fs.writeFile(`output/${index}.png`, Buffer.from(image));

      index++;
    }
  }
}
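The loop above writes into an output folder, and fs.writeFile fails if that folder does not exist. The following is a minimal sketch of a helper that creates the folder and builds the file path. `ensureOutputPath` is a hypothetical name used for illustration, not part of the library:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical helper (not part of @hyzyla/pdfium): creates the output
// folder if needed and returns the path for the image with the given index.
function ensureOutputPath(dir: string, index: number, ext = "png"): string {
  // `recursive: true` creates missing parent folders and does nothing
  // when the folder already exists
  fs.mkdirSync(dir, { recursive: true });
  return path.join(dir, `${index}.${ext}`);
}
```

With this in place, the first argument of fs.writeFile in the loop can be replaced with `ensureOutputPath("output", index)`.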

Reduce image size of extracted images

When images are extracted from PDF page objects, they are rendered without any compression by default. To reduce their size, you can pass a custom render function that uses sharp to compress the extracted images.

import sharp from 'sharp';
import { PDFiumPageRenderOptions } from '@hyzyla/pdfium';

const document = await library.loadDocument(buff);
const page = document.getPage(0);
const object = page.getObject(0);

const image = await object.render({
  render: async (options: PDFiumPageRenderOptions): Promise<Uint8Array> => {
    return await sharp(options.data, {
      raw: {
        width: options.width,
        height: options.height,
        channels: 4,
      },
    })
      .jpeg({ quality: 80 }) // Use JPEG format with 80% quality
      .toBuffer();
  },
});
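The raw options above declare 4 channels, so the buffer passed to sharp is expected to hold width × height × 4 bytes. The following is a minimal sketch of a sanity check that could run at the top of a custom render function. `expectedBitmapLength` and `assertBitmapSize` are hypothetical helpers, and the 4-bytes-per-pixel layout is taken from the channels value in the example rather than verified against the library:

```typescript
// Hypothetical helpers (not part of @hyzyla/pdfium or sharp).
// Assumes 4 bytes per pixel, matching `channels: 4` in the example above.
function expectedBitmapLength(width: number, height: number, channels = 4): number {
  return width * height * channels;
}

// Throws if the buffer length does not match the declared dimensions.
function assertBitmapSize(data: Uint8Array, width: number, height: number): void {
  const expected = expectedBitmapLength(width, height);
  if (data.length !== expected) {
    throw new Error(
      `Expected ${expected} bytes for ${width}x${height}, got ${data.length}`
    );
  }
}
```

Failing early like this tends to give a clearer error than a corrupted or skewed output image.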

Extract raw uncompressed image data

You can also extract the raw uncompressed image data from the PDF page objects using the getImageDataRaw method of the PDFiumImageObject class. The getImageDataRaw method returns the raw uncompressed image data as a buffer, along with the image's width, height, and the filters/decoders used to decode the image data.

const document = await library.loadDocument(buff);

const page = document.getPage(0);
const object = page.getObject(0);

const {
  data,
  width,
  height,
  filters,
} = object.getImageDataRaw();
/*
Example output:
{
  data: [...], // Raw uncompressed image data
  width: 100, // Image width
  height: 100, // Image height
  filters: ["DCTDecode"], // Filters/decoders used to decode the image data
}
*/

⚠️ Images in PDF documents are not stored in standalone image formats like PNG or JPEG. Instead, image bitmaps are stored as data encoded with filters such as DCTDecode and FlateDecode. You need to decode the image data with the matching decoder/filter to obtain the raw uncompressed image data. Alternatively, you can use the render method with a rendering engine like sharp to convert the image into a specific format like PNG or JPEG.
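As a rough illustration of the decoding step described above, the sketch below handles the two filters named in the warning using only Node's built-in zlib module. `decodeImageData` is a hypothetical helper, not part of the library. It assumes at most one filter, that FlateDecode data is a plain zlib stream with no predictor applied, and that DCTDecode data is already a JPEG stream that can be saved as a .jpg file as-is. Real PDFs may use filter chains or decode parameters that this sketch ignores.

```typescript
import * as zlib from "node:zlib";

type DecodedImage =
  | { kind: "jpeg"; bytes: Uint8Array } // DCTDecode: already a JPEG stream
  | { kind: "bitmap"; bytes: Uint8Array }; // unencoded or inflated sample data

// Hypothetical helper: decodes image data according to its PDF filter name.
function decodeImageData(data: Uint8Array, filters: string[]): DecodedImage {
  if (filters.length === 0) {
    // No filter listed: treat the data as already unencoded
    return { kind: "bitmap", bytes: data };
  }
  if (filters.length > 1) {
    throw new Error(`Unsupported filter chain: [${filters.join(", ")}]`);
  }
  switch (filters[0]) {
    case "DCTDecode":
      // Assumption: the bytes form a complete JPEG stream
      return { kind: "jpeg", bytes: data };
    case "FlateDecode":
      // Assumption: plain zlib stream, no predictor to undo afterwards
      return { kind: "bitmap", bytes: zlib.inflateSync(data) };
    default:
      throw new Error(`Unsupported filter: ${filters[0]}`);
  }
}
```

For anything beyond these simple cases, the render method is likely the more reliable route, since the rendering engine takes care of the decoding.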