#OCRmyPDF

@meatbag I'm on Linux, and the best tool I have found is #ocrmypdf: github.com/ocrmypdf/OCRmyPDF
It uses #tesseract under the hood, and for static text it's okay. For tables and other material that is difficult to parse, it's not useful.
When a PDF already has a text layer, the tools I use for reading it include #firefox and #evince.

When you find a webpage that offers you a book but you can't download it, and you can't right-click to save the images of its pages, well – the page has loaded the images. Therefore the images are somewhere in your browser. What to do?

Knowing a bit of how web pages are structured and built helps make the most of what you see online.

1. In your browser, open the developer tools (push F12).

2. Go to the "Network" tab and restrict the view to "Images" and "Media" (see the upper right side).

3. Zoom into the book to ensure the pages load at high resolution, then page through the book.

4. You will notice new rows appearing in the table of the "Network" tab of the Developer Tools.

5. Now move your mouse over them and the image may even be shown to you; in any case just right-click and save it.

There are scripts online to automate this, but if all you are after are a few pages, this suffices.

To montage the pages into a PDF, use e.g.:

$ img2pdf *jpg -o book.pdf

... and even OCR them if you like:

$ ocrmypdf book.pdf book-OCR.pdf

Both programs can be installed with:

$ sudo apt-get install img2pdf ocrmypdf

... in Ubuntu, Debian, and the like.

Or, import each into a page of a multi-page #Inkscape document and save it as a PDF.
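
One caveat with the img2pdf step above: the shell expands *jpg in lexical order, so page-10.jpg would land before page-2.jpg. A minimal sketch of one way around that, assuming GNU sort (for -V) and file names without newlines; the function name and the ./pages directory are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch: list page images in natural (numeric) order, so that
# page-2.jpg comes before page-10.jpg.
# Assumptions: GNU sort (-V) is available; file names contain no newlines.
list_pages_sorted() {
    local dir="${1:-.}"
    # -maxdepth 1: only look in the given directory
    # sort -V: "version" sort, which compares embedded numbers numerically
    find "$dir" -maxdepth 1 -type f -name '*.jpg' | sort -V
}

# Usage (hypothetical directory; requires img2pdf):
#   mapfile -t pages < <(list_pages_sorted ./pages)
#   img2pdf "${pages[@]}" -o book.pdf
```

If the saved files are already zero-padded (page-01.jpg, page-02.jpg, ...), the plain glob is enough and this step can be skipped.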

I am a rulebook hoarder. Whenever I take a closer look at a game, downloading the rulebook is the first thing I do. I have over 2500 boardgame-related PDF files. I access them using pdf-tools in #Emacs, index them using #recoll, and use a small hack to make M-x pdfgrep search using the recoll index. I use the #OCRmyPDF tool to OCR the ones that didn't come with embedded text.
#boardgames
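
For a large collection, picking out the PDFs that lack embedded text can be scripted. This is only a rough sketch, not the poster's actual setup: it assumes pdftotext (from poppler-utils) is installed alongside ocrmypdf, and the function names and the -OCR.pdf naming (borrowed from the book-OCR.pdf example earlier in the thread) are illustrative:

```shell
#!/usr/bin/env bash
# Sketch: OCR only the PDFs in a directory that have no extractable text.
# Assumptions: pdftotext (poppler-utils) and ocrmypdf are installed.

# Derive the output name, e.g. rules.pdf -> rules-OCR.pdf
ocr_output_name() {
    printf '%s-OCR.pdf\n' "${1%.pdf}"
}

# Succeeds if pdftotext extracts at least one letter or digit.
has_text() {
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# OCR every *.pdf in the given directory that lacks embedded text.
ocr_missing_text() {
    local dir="${1:-.}" pdf
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue                   # glob matched nothing
        case "$pdf" in *-OCR.pdf) continue ;; esac  # skip earlier outputs
        has_text "$pdf" || ocrmypdf "$pdf" "$(ocr_output_name "$pdf")"
    done
}

# Usage (hypothetical directory):
#   ocr_missing_text ~/rulebooks
```

The text check is deliberately crude: a scan with only a stray embedded character would count as "has text", so results are worth spot-checking.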