
I don’t think there is much room for creativity when it comes to writing the intro paragraph for a post about extracting text from a PDF file. There is a PDF, there is text in it, we want the text out, and I am going to show you how to do that using Python.

In the first part, we are going to have a look at two Python libraries, PyPDF2 and PDFMiner. As their names suggest, they are libraries written specifically to work with PDF files. We will discuss the different classes and methods we need. Then, in the second part, we are going to work on a project: splitting a 708-page PDF file into separate smaller files, extracting the text, cleaning it, and then exporting it to easily readable text files. For more information on this project, please refer to my GitHub repo.

PyPDF2

As a first step, install the package:

pip install PyPDF2

The first object we need is a PdfFileReader:

reader = PyPDF2.PdfFileReader('Complete_Works_Lovecraft.pdf')

The parameter is the path to the PDF document we want to work with. This reader object gives you a range of general information about your document. For example, reader.documentInfo is an attribute that contains the document information dictionary, and reader.numPages gives you the total number of pages.

Perhaps the most important method is getPage(page_num), which returns one page of the file as a separate PageObject. Be careful: PageObjects are stored in a list, so the method uses a zero-based index, and the first page of the document is getPage(0).
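To make the zero-based indexing concrete, here is a minimal sketch of the reader calls discussed in this post. The helper names page_index, summarize and describe_pdf are made up for illustration and are not part of PyPDF2. The sketch also assumes an older PyPDF2 release (before 3.0), where PdfFileReader, documentInfo, numPages and getPage are still available under these names; later releases renamed them.

```python
def page_index(page_number, num_pages):
    """Convert a 1-based page number (as printed in a viewer) into the
    0-based index that getPage expects. Hypothetical helper."""
    if not 1 <= page_number <= num_pages:
        raise ValueError(
            "page %d is outside the range 1..%d" % (page_number, num_pages)
        )
    return page_number - 1


def summarize(reader, page_number):
    """Collect the basics from a reader object.

    Works with any object that exposes documentInfo, numPages and
    getPage(index), as the PdfFileReader in this post does.
    """
    index = page_index(page_number, reader.numPages)
    return {
        "info": reader.documentInfo,     # document information dictionary
        "num_pages": reader.numPages,    # total number of pages
        "page": reader.getPage(index),   # a single PageObject
    }


def describe_pdf(path, page_number=1):
    """Open a PDF with PyPDF2 and summarize it (assumes PyPDF2 < 3.0)."""
    # Imported here so the helpers above can be used without the package.
    import PyPDF2

    reader = PyPDF2.PdfFileReader(path)
    return summarize(reader, page_number)
```

Calling describe_pdf('Complete_Works_Lovecraft.pdf', page_number=1) would then fetch the first page through getPage(0), while asking for page 0 or page 709 of a 708-page file raises a ValueError.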
