How to Make a Searchable PDF from Scanned Pages
Scanned PDF documents are fine for reading, but offer nothing beyond that. You cannot select text on a scanned page, copy a fragment to the clipboard, or search the file. This tutorial explains how to turn a scanned PDF into a searchable document using the Pdfium.Net SDK C# library and the Tesseract.Net SDK OCR library. To learn how to create a PDF from scanned pages, please read
this tutorial instead.
How it works
The idea is simple. We take the scanned pages of the original PDF, recognize them using an OCR (optical character recognition) library, and add an invisible layer containing all the recognized text to the PDF file, on top of the main visible layer with the scanned pages. This lets a user view and read the document as before, but also search the text, select it, copy the selection to the clipboard and so on.
How to do that
1. Enable required namespaces
To turn a scanned PDF into a searchable one, we need the following namespaces:
Code:using Patagames.Pdf.Net;
using Patagames.Pdf.Enums;
These ones are required to work with PDF documents.
Code:using Patagames.Ocr;
using Patagames.Ocr.Enums;
These ones provide OCR capabilities.
Code:using System;
using System.Drawing;
And we also need some standard ones.
2. Initialize libraries
We need to initialize the
Pdfium.Net SDK library and the
Tesseract.Net SDK OCR library.
PDFium is initialized by calling PdfCommon.Initialize(). The process has some nuances, because initialization is static. Read more about it
here.
Then we need to initialize the OCR library:
Code:var ocr = OcrApi.Create();
To create an instance of the
OcrApi class, we call the
Create() static method. The OcrApi class implements the IDisposable interface, so we either need to call ocr.Dispose() or simply use the using clause. After creating an instance of the OcrApi class, we need to initialize it. The
Init() method does this.
The method looks as follows:
Code:public void Init(
Languages language,
string dataPath = null,
OcrEngineMode oem = OcrEngineMode.OEM_DEFAULT,
string[] configs = null,
string[] varsVec = null,
string[] varsValues = null,
bool setOnlyNonDebugParams = false
);
We don’t need all of these parameters to convert a scanned PDF into a searchable one in our example. In fact, the call shown below passes only the language. However, other tasks may require the rest, so here is a brief description of what can be passed to Init.
language
This parameter specifies the language or languages for OCR. You can recognize multi-language documents, but the more languages you include, the more memory the app consumes. More languages also mean lower OCR quality. In our case we just stick with English, which is also the default language of the OCR engine.
Note: the tessdata folder should contain data files for all languages you use in the OCR. You can download Tesseract language modules
here.
dataPath
This is a path to the parent folder of the tessdata folder – the folder where the language data files are. This is either a full path, or a relative path. The path should end with a trailing backslash. For example, if the path to the tessdata folder is
c:\MyApp\tessdata\ the path passed in the dataPath parameter should be
c:\MyApp\.
If you don’t specify the parameter, the path defaults to the current folder of the app.
Note: if the current folder changes for any reason (for instance, when the user navigates to another folder in an Open or Save dialog), an omitted dataPath will point to the new location too! Therefore, it is good practice to specify the path explicitly in this parameter.
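One way to follow this advice is to build the path from the folder the application was started from rather than from the current folder. The sketch below is only an illustration: it assumes the tessdata folder sits next to the executable, and the TessDataLocator class and its methods are hypothetical helpers, not part of either SDK.

```csharp
using System;
using System.IO;

static class TessDataLocator
{
    // Appends a directory separator unless the path already ends with one.
    public static string EnsureTrailingSeparator(string path)
    {
        string sep = Path.DirectorySeparatorChar.ToString();
        string altSep = Path.AltDirectorySeparatorChar.ToString();
        if (path.EndsWith(sep, StringComparison.Ordinal) ||
            path.EndsWith(altSep, StringComparison.Ordinal))
        {
            return path;
        }
        return path + sep;
    }

    // Returns the absolute path of the folder assumed to contain "tessdata":
    // the application's base directory, which does not change when the
    // current folder changes.
    public static string GetDataPath()
    {
        return EnsureTrailingSeparator(AppDomain.CurrentDomain.BaseDirectory);
    }
}

static class Program
{
    static void Main()
    {
        string dataPath = TessDataLocator.GetDataPath();
        bool found = Directory.Exists(Path.Combine(dataPath, "tessdata"));

        Console.WriteLine("dataPath: " + dataPath);
        Console.WriteLine("tessdata folder found: " + found);

        // With the OCR library referenced, the value would be passed as the
        // second argument of the Init signature shown above:
        // ocr.Init(Languages.English, dataPath);
    }
}
```

On Windows the returned path ends with a backslash, as the dataPath description requires.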
oem
This parameter specifies the OcrEngineMode with the following available options:
OEM_TESSERACT_ONLY for the fastest OCR,
OEM_CUBE_ONLY for slower but accurate recognition,
OEM_TESSERACT_CUBE_COMBINED for extreme accuracy and
OEM_DEFAULT. The latter determines the OCR mode based on variables in the language-specific config or the command-line configs or, if none of them are present, defaults to
OEM_TESSERACT_ONLY.
Note: the tessdata folder must contain the corresponding language files for an OCR mode to initialize. Language filenames for the OCR modes are:
- *.traineddata – for the OEM_TESSERACT_ONLY mode;
- *.cube.* – for the OEM_CUBE_ONLY mode;
- *.tesseract_cube.* – for the OEM_TESSERACT_CUBE_COMBINED mode.
If the corresponding file is missing, initialization will fail.
configs
The array of configuration file names. The corresponding configuration files should be located in the
configs or
tessconfigs subfolder in the tessdata folder.
varsVec
The array of configuration variable names. This is an alternative way to configure Tesseract.
varsValues
The array of configuration variable values. The list of supported variables can be found
here. Variables passed this way take priority over configuration files. This allows you to override certain settings simply by passing the corresponding variables directly in the varsVec and varsValues parameters.
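A minimal sketch of preparing the two arrays, assuming they are matched by index (the name at position i goes with the value at position i), as in the underlying Tesseract API. The OcrVariables helper is hypothetical; tessedit_char_whitelist and preserve_interword_spaces are ordinary Tesseract variables used here purely as examples.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class OcrVariables
{
    // Splits name/value pairs into two arrays whose indexes line up.
    public static void Split(
        IEnumerable<KeyValuePair<string, string>> settings,
        out string[] names,
        out string[] values)
    {
        var list = settings.ToList();
        names = list.Select(s => s.Key).ToArray();
        values = list.Select(s => s.Value).ToArray();
    }
}

static class Program
{
    static void Main()
    {
        var settings = new[]
        {
            // Restrict recognition to digits.
            new KeyValuePair<string, string>("tessedit_char_whitelist", "0123456789"),
            // Keep the spacing between words as it appears on the page.
            new KeyValuePair<string, string>("preserve_interword_spaces", "1"),
        };

        string[] names, values;
        OcrVariables.Split(settings, out names, out values);

        for (int i = 0; i < names.Length; i++)
        {
            Console.WriteLine(names[i] + " = " + values[i]);
        }

        // With the OCR library referenced, the arrays would be passed using
        // the parameter names from the Init signature shown above:
        // ocr.Init(Languages.English, varsVec: names, varsValues: values);
    }
}
```

Keeping the settings as pairs until the last moment makes it harder for the two arrays to drift out of step.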
setOnlyNonDebugParams
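If enabled, only configuration variables whose names do not contain "debug" are applied. Disable it for debugging purposes; enable it for the final build.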
For now, we only use one parameter and initialize the OCR as follows:
Code:ocr.Init(Languages.English);
All other parameters are omitted and are set to their corresponding default values as described above.
So, here is the C# code we’ve got so far:
Code:using Patagames.Pdf.Net;
using Patagames.Pdf.Enums;
using Patagames.Ocr;
using Patagames.Ocr.Enums;
using System;
using System.Drawing;
namespace SearchablePdfFromScannedPdf
{
    class Program
    {
        static void Main(string[] args)
        {
            PdfCommon.Initialize();
            using (var ocr = OcrApi.Create())
            {
                ocr.Init(Languages.English);
                ...
            }
        }
    }
}
Once we are done with initialization it is time to do some work.
3. OCR pages
To build a PDF from images we need a renderer. In our case we need one specific renderer called
OcrPdfRenderer.
Code:var renderer = OcrPdfRenderer.Create(@"d:\0\output_pdf_file", @"tessdata\");
Here, we pass two arguments to the static
Create method: the name of the file to save the recognized PDF as (the first parameter) and the location of the language data files (the second parameter). Please note that unlike the initialization procedure above, this method needs the path to the tessdata folder itself, not to its parent folder.
The OcrPdfRenderer class implements IDisposable too, so don’t forget to call Dispose or stick with using as shown below.
Now that we have the renderer created, we pass pages of our scanned PDF file to it. For each page, we render it to a Bitmap, then we recognize it and add the recognized text to the page. Here is the fragment of code that does all of this:
Code:using (var renderer = OcrPdfRenderer.Create(@"d:\0\output_pdf_file", @"tessdata\"))
{
    renderer.BeginDocument("document title");
    using (var doc = PdfDocument.Load(@"d:\0\test_big.pdf"))
    {
        foreach (var page in doc.Pages)
        {
            int width = (int)(page.Width / 72.0 * 96);
            int height = (int)(page.Height / 72.0 * 96);
            using (var bitmap = new PdfBitmap(width, height, true))
            {
                bitmap.FillRect(0, 0, width, height, FS_COLOR.White);
                page.Render(bitmap, 0, 0, width, height, PageRotate.Normal, RenderFlags.FPDF_LCD_TEXT);
                using (var pix = OcrPix.FromBitmap(bitmap.Image as Bitmap))
                {
                    ocr.ProcessPage(pix, renderer);
                }
            }
        }
    }
    renderer.EndDocument();
}
Let’s walk through this chunk of code.
This line starts a new document:
Code:renderer.BeginDocument("document title");
The next line loads our source PDF document, the one with scanned images. The PdfDocument object requires the final Dispose(), hence the using clause again.
Code:using (var doc = PdfDocument.Load(@"d:\0\test_big.pdf"))
We want to create a bitmap of each page, so we calculate the required width and height of the bitmap in pixels from the dimensions of the PDF page in points. Each point is 1/72 of an inch, so we take the horizontal or vertical DPI of the image (96 in our example), multiply it by the corresponding dimension and divide by 72.
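The conversion can be tried on its own with a few lines of plain C#. This is just a sketch of the arithmetic: the PageGeometry helper is hypothetical, and the page size is an assumed A4 page of 595 x 842 points.

```csharp
using System;

static class PageGeometry
{
    // PDF pages are measured in points; one point is 1/72 of an inch.
    const double PointsPerInch = 72.0;

    // Converts a length in points to whole pixels at the given DPI,
    // truncating in the same way as the (int) cast in the tutorial code.
    public static int PointsToPixels(double points, double dpi)
    {
        return (int)(points / PointsPerInch * dpi);
    }
}

static class Program
{
    static void Main()
    {
        // Assumed page size for illustration: A4, 595 x 842 points.
        double pageWidth = 595, pageHeight = 842;

        foreach (double dpi in new[] { 96.0, 300.0 })
        {
            int w = PageGeometry.PointsToPixels(pageWidth, dpi);
            int h = PageGeometry.PointsToPixels(pageHeight, dpi);
            Console.WriteLine(dpi + " DPI: " + w + " x " + h + " pixels");
        }
    }
}
```

For this page the result is 793 x 1122 pixels at 96 DPI and 2479 x 3508 pixels at 300 DPI.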
Our next line creates a new
PdfBitmap using the dimensions we just computed. The last parameter of the constructor tells it to use true color mode.
Code:using (var bitmap = new PdfBitmap(width, height, true))
Then we fill the entire bitmap with white and render the page to it:
Code:bitmap.FillRect(0, 0, width, height, FS_COLOR.White);
page.Render(bitmap, 0, 0, width, height, PageRotate.Normal, RenderFlags.FPDF_LCD_TEXT);
Finally, the line that does all the work:
Code:ocr.ProcessPage(OcrPix.FromBitmap(bitmap.Image as Bitmap), renderer);
In its full form the method takes four parameters: the image to recognize, the debug configuration file we don’t currently need, the maximum timeout (zero means no timeout), and the renderer. The call above passes only the image and the renderer.
OcrPix takes a bitmap in the .NET format as the parameter, so we simply pass one using the bitmap.Image property. In the full listing the OcrPix object is wrapped in a using clause so that it is disposed once the page has been processed.
That was quite a long step, but the resulting code is pretty short. Here is the entire program:
Code:using Patagames.Pdf.Net;
using Patagames.Pdf.Enums;
using Patagames.Ocr;
using Patagames.Ocr.Enums;
using System;
using System.Drawing;
namespace SearchablePdfFromScannedPdf
{
    class Program
    {
        static void Main(string[] args)
        {
            // Initialize PDFium and the OCR engine.
            PdfCommon.Initialize();
            using (var ocr = OcrApi.Create())
            {
                ocr.Init(Languages.English);
                // The renderer writes the searchable PDF.
                using (var renderer = OcrPdfRenderer.Create(@"d:\0\output_pdf_file", "tessdata\\"))
                {
                    renderer.BeginDocument("document title");
                    using (var doc = PdfDocument.Load(@"d:\0\test_big.pdf"))
                    {
                        foreach (var page in doc.Pages)
                        {
                            // Page size in points (1/72 inch) converted to pixels at 96 DPI.
                            int width = (int)(page.Width / 72.0 * 96);
                            int height = (int)(page.Height / 72.0 * 96);
                            using (var bitmap = new PdfBitmap(width, height, true))
                            {
                                // Render the page onto a white background.
                                bitmap.FillRect(0, 0, width, height, FS_COLOR.White);
                                page.Render(bitmap, 0, 0, width, height, PageRotate.Normal, RenderFlags.FPDF_LCD_TEXT);
                                // Recognize the rendered page and add it to the output document.
                                using (var pix = OcrPix.FromBitmap(bitmap.Image as Bitmap))
                                {
                                    ocr.ProcessPage(pix, renderer);
                                }
                            }
                        }
                    }
                    // Finalize the output PDF.
                    renderer.EndDocument();
                }
            }
        }
    }
}
Final notes
The call to the
EndDocument() method is required to finalize the PDF document. Also note that Tesseract OCR cannot reliably recognize symbols smaller than 20 pixels, so make sure the DPI of the scanned pages is high enough to provide at least that line height.
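As a rough guide, the 20-pixel rule can be turned into a minimum DPI for a given line height. The sketch below assumes the line height is known in points (1/72 of an inch) and treats it as the height that must reach 20 pixels; the ScanResolution helper is hypothetical.

```csharp
using System;

static class ScanResolution
{
    const double PointsPerInch = 72.0;

    // Smallest whole DPI at which a line of the given height (in points)
    // is at least minPixels tall.
    public static int MinimumDpi(double lineHeightPoints, int minPixels = 20)
    {
        if (lineHeightPoints <= 0)
        {
            throw new ArgumentOutOfRangeException("lineHeightPoints");
        }
        return (int)Math.Ceiling(minPixels * PointsPerInch / lineHeightPoints);
    }
}

static class Program
{
    static void Main()
    {
        foreach (double points in new[] { 12.0, 10.0, 8.0 })
        {
            Console.WriteLine(points + " pt line: at least " +
                ScanResolution.MinimumDpi(points) + " DPI");
        }
    }
}
```

By this estimate a 12-point line needs at least 120 DPI, a 10-point line 144 DPI and an 8-point line 180 DPI, all above the 96 DPI used in the example.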
Edited by user Friday, November 3, 2023 4:26:31 AM(UTC)