Recently, I needed to get a printed version of a 100-year-old book that was in the public domain. However, the PDF contained a small watermark on each page of the book. It would have looked terrible in print because the book was in Hindi and Sanskrit but the watermark was in English.
Fortunately, the watermark was only a text frame (or text box). I could have removed the frames one by one from each page using LibreOffice Draw, but the e-book had over 400 pages, so I looked for a way to remove them programmatically. After a long search, I found two viable solutions.
If you are unfamiliar with bash, this may be the simpler method. You only need to run this macro in LibreOffice Draw. Thank you to the gentleman from the LibreOffice help desk who suggested this macro.
Sub removeAllTextShapeFromScannedDocument()
    doc0 = ThisComponent
    dPgs = doc0.DrawPages
    u1 = dPgs.Count - 1
    ' Visit every page of the document
    For k = 0 To u1
        dPg = dPgs(k)
        u2 = dPg.Count - 1
        ' Walk the shapes backwards so that removing one does not shift the indices still to be visited
        For j = u2 To 0 Step -1
            sh = dPg(j)
            If sh.ShapeType = "com.sun.star.drawing.TextShape" Then dPg.remove(sh)
        Next j
    Next k
End Sub
But, as I said earlier, this e-book has over 400 pages, all of which are scanned images. That was simply too much for my laptop, since LibreOffice Draw is itself a heavy program. So I needed something else for my case.
Ghostscript is most likely included with your Linux distribution. It is an interpreter for PostScript and PDF files. After a lengthy search on Stack Overflow, I found a question that matched my problem, and its answer solved it. Here is the command:
gs -o no-texts.pdf -sDEVICE=pdfwrite -dFILTERTEXT with-texts.pdf
Thank you to the gentleman from Stack Overflow who asked this question.
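If you plan to run the Ghostscript command on more than one book, it may help to wrap it in a small shell function. The following is only a sketch, not part of the original answer: the function names `default_output_name` and `strip_text` and the `-no-text` suffix are my own choices for illustration. Keep in mind that `-dFILTERTEXT` drops all text from the PDF, not just the watermark, so this fits scanned books where the pages themselves are images.

```shell
#!/usr/bin/env bash
# Sketch only: the function names and the "-no-text" suffix are hypothetical.

# Derive an output name from the input, e.g. book.pdf -> book-no-text.pdf,
# so the original file is never overwritten.
default_output_name() {
    printf '%s\n' "${1%.pdf}-no-text.pdf"
}

# Run the Ghostscript command from the post on one file.
# $1 = input PDF, $2 = optional output PDF.
strip_text() {
    local in="$1"
    local out="${2:-$(default_output_name "$in")}"

    if [ ! -f "$in" ]; then
        echo "input not found: $in" >&2
        return 1
    fi
    if ! command -v gs >/dev/null 2>&1; then
        echo "Ghostscript (gs) is not installed" >&2
        return 1
    fi

    # Same flags as the command above; -dFILTERTEXT removes ALL text.
    gs -o "$out" -sDEVICE=pdfwrite -dFILTERTEXT "$in"
}
```

After sourcing the file, `strip_text with-texts.pdf` would write `with-texts-no-text.pdf` next to the original, and `strip_text with-texts.pdf no-texts.pdf` reproduces the command above.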