Remove marks in pen from a scanned copy of a book

I have a scanned copy of a book in which many lines are underlined in pen, and there are also notes in the margins. I need a program that can remove these marks, or extract the text without losing the formatting, and save the result as a PDF document. The book was printed on dark paper. My OS is Windows 7. I would be very grateful for recommendations.

asked Jun 18, 2017 at 14:00

Are the pen markings in the same colour as the actual text? (Commented Jun 18, 2017 at 15:16)

No, the markings are dark blue and the text is black. (Commented Jun 18, 2017 at 15:21)

1 Answer

ImageMagick's convert can be used in batch mode to filter out the pen mark-up and, at the same time, reduce the images to monochrome (usually better for OCR in any case). I would first select a few typical scans and test on them to find the filter values you need. GIMP can be used to sample the ink colour(s), or you can use ImageMagick's histogram function to identify them.
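As a rough sketch of what such a batch filter could look like, the Python below only builds and runs an ImageMagick command for each scan. The ink colour `#1c2f6e`, the fuzz and threshold percentages, and the folder names are placeholders I made up, not values measured from your book; you would replace them with whatever your test scans suggest. It calls `magick` (the ImageMagick 7 entry point) rather than `convert`, because on Windows the name `convert` can clash with a system utility; pass `executable="convert"` if that is what your installation provides.

```python
import subprocess
from pathlib import Path


def build_clean_command(src, dst, ink_colour="#1c2f6e", fuzz=12,
                        threshold=50, executable="magick"):
    """Return the ImageMagick argument list for cleaning one scan.

    All default values are illustrative assumptions; tune them on samples.
    """
    return [
        executable, str(src),
        "-fuzz", f"{fuzz}%",            # tolerance around the sampled ink colour
        "-fill", "white",
        "-opaque", ink_colour,          # repaint ink-coloured pixels white
        "-colorspace", "Gray",
        "-threshold", f"{threshold}%",  # brighter than this -> white, else black
        str(dst),
    ]


def clean_folder(src_dir, dst_dir, pattern="*.png", **options):
    """Run the filter over every matching scan in src_dir."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob(pattern)):
        cmd = build_clean_command(src, dst_dir / src.name, **options)
        subprocess.run(cmd, check=True)  # raises if ImageMagick reports an error
```

Calling `clean_folder("scans", "cleaned", ink_colour="#102060", fuzz=10)` would process every PNG in a hypothetical `scans` folder. Dark blue ink and black text are close in colour, so expect to keep the fuzz value small and to check that the text itself is not being erased. Because the paper is dark, the threshold also needs to sit below the paper's brightness so that the background comes out white.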

Note, however, that the resulting PDF file will consist of cleaned-up scanned images. To make it searchable, you will need to run an OCR (Optical Character Recognition) program on the cleaned-up images.

OCR has varying levels of success, depending on the quality of the images, the font(s) used, the number of diagrams, the training of the program (some can be trained) and, to a degree, how obscure the text is. Many OCR programs try to correct their output based on spelling and context, so if you are OCRing science, maths or psychology, for example, you can expect a lot of errors because much of the terminology doesn't fit the standard English dictionary.

Tesseract is well worth a look for performing the OCR.
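For what it's worth, here is a minimal sketch of driving the Tesseract command-line tool from Python to turn each cleaned-up image into a PDF with a searchable text layer. It assumes `tesseract` is on your PATH and that the book is in English (`eng`); the folder names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_ocr_command(image, output_base, language="eng",
                      executable="tesseract"):
    """Return the Tesseract argument list for one page.

    Tesseract appends ".pdf" to output_base itself when the "pdf"
    output config is requested.
    """
    return [executable, str(image), str(output_base), "-l", language, "pdf"]


def ocr_folder(image_dir, pdf_dir, pattern="*.png", language="eng"):
    """OCR every matching image in image_dir into a one-page PDF."""
    pdf_dir = Path(pdf_dir)
    pdf_dir.mkdir(parents=True, exist_ok=True)
    for image in sorted(Path(image_dir).glob(pattern)):
        cmd = build_ocr_command(image, pdf_dir / image.stem, language)
        subprocess.run(cmd, check=True)  # raises if Tesseract fails
```

`ocr_folder("cleaned", "pdf_pages")` would leave one PDF per page in `pdf_pages`; joining them into a single document needs a separate PDF tool, which is not covered here.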

PS:

I have to say that in most cases, given the time and effort needed to do this for a substantial book, it would probably make sense to spend enough time working behind a bar, or in just about any part-time minimum-wage job, to purchase a brand-new copy of the same book from the publishers, as an ebook or PDF if available.