How to Remove All Text Boxes from a PDF

Posted on Sun Apr 09 2023

Recently, I needed a printed copy of a 100-year-old book that is in the public domain. However, the PDF had a small watermark on every page. It would have looked terrible in print because the book is in Hindi and Sanskrit, while the watermark was in English.

Fortunately, the watermark was just a text frame (or text box). I could have removed the frames one at a time in LibreOffice Draw, but the e-book had over 400 pages, so I looked for a way to remove them programmatically. After a long search, I found two viable solutions.

LibreOffice Draw - Macros

If you are unfamiliar with bash, this may be the simpler method. You only need to run this macro in LibreOffice Draw with the PDF open. Thank you to the gentleman from the LibreOffice help desk who suggested this macro.

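Sub removeAllTextShapeFromScannedDocument()
    ' Works on the document currently open in Draw
    doc0 = ThisComponent
    dPgs = doc0.DrawPages
    u1   = dPgs.Count - 1

    ' Visit every page of the document
    For k = 0 To u1
        dPg = dPgs(k)
        u2 = dPg.Count - 1

        ' Walk the shapes backwards so removals do not shift the remaining indices
        For j = u2 To 0 Step -1
            sh = dPg(j)
            ' Removes every text shape on the page, not only the watermark
            If sh.ShapeType = "com.sun.star.drawing.TextShape" Then dPg.remove(sh)
        Next j

    Next k
End Sub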

But, as I said, this e-book has over 400 pages, all of them scanned images, and that was simply too much for my laptop. LibreOffice Draw is a heavy program to begin with, so I needed something else for my case.

Ghostscript

Ghostscript is most likely already installed on your Linux distribution. It is an interpreter for PostScript and PDF. After a lengthy search on Stack Overflow, I found a question that matched my problem, and its answer solved it. Here is the command:

gs -o no-texts.pdf -sDEVICE=pdfwrite -dFILTERTEXT with-texts.pdf

Thank you to the gentleman from Stack Overflow who asked this question.
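
If you need to do this more than once, the Ghostscript command can be wrapped in a small shell function. This is only a sketch: `strip_text` and the `GS_BIN` variable are names made up for this example and are not part of Ghostscript. Keep in mind that `-dFILTERTEXT` drops all text from the output, not only the watermark, which is fine for a scanned book whose pages are images.

```shell
#!/bin/sh
# Sketch: reusable wrapper around the Ghostscript text-filter command.
# GS_BIN and strip_text are hypothetical names introduced for this example.
GS_BIN="${GS_BIN:-gs}"

strip_text() {
    # $1 = input PDF (with text boxes), $2 = output PDF (without them)
    if [ "$#" -ne 2 ]; then
        echo "usage: strip_text INPUT.pdf OUTPUT.pdf" >&2
        return 2
    fi
    # Refuse to overwrite the input, since the output file is written
    # while the input is still being read.
    if [ "$1" = "$2" ]; then
        echo "strip_text: input and output must be different files" >&2
        return 2
    fi
    # Make sure the Ghostscript executable can be found before calling it.
    if ! command -v "$GS_BIN" >/dev/null 2>&1; then
        echo "strip_text: '$GS_BIN' not found" >&2
        return 127
    fi
    # Same flags as the command shown in the post.
    "$GS_BIN" -o "$2" -sDEVICE=pdfwrite -dFILTERTEXT "$1"
}
```

After sourcing the file, `strip_text with-texts.pdf no-texts.pdf` runs the same command as above, and quoting the arguments keeps file names with spaces intact.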