How to Convert Image to PDF in .NET Application
If you build your own workflow for scanned documents, sooner or later you will need to convert those scanned images to PDF. The task doesn’t look complicated, but finding a turnkey solution turns out to be surprisingly hard. Is there a way to convert images to searchable PDF with a few lines of code in your .NET app? Let’s find out.
Introducing Tesseract
There is a great open-source OCR engine out there called Tesseract, and it is available to .NET applications through wrapper libraries. With its industry-leading OCR capabilities, Tesseract is basically all you need to add PDF export to an application. Its basic features include multi-language recognition with more than 60 languages supported as standard, including right-to-left text, logographic scripts like Chinese, and even multiple languages on the same page. The engine can analyze and recognize page layout and detect images, tables, and multi-column text. Tesseract recognizes fonts, styles, and paragraph parameters, allowing a developer to work with a scanned image at both high and low levels. Moreover, its recognition accuracy is excellent.
Simply put, Tesseract has everything you need to build decent image-to-PDF conversion into a .NET application. So, what’s the problem then?
There’s a simpler way to deal with scanned images
While the Tesseract .NET library is definitely not rocket science, it is still nothing more than a tool, a skeleton, the bare bones of your application’s OCR capabilities. But you also need some meat, don’t you?
That’s where Tesseract.Net SDK comes in. Powered by the most technologically advanced OCR engine, it is also surprisingly easy to embed in your application. This way you can quickly implement image-to-PDF conversion in your .NET application with (literally!) 5 lines of code. Aside from smoothing over the complexity of the Tesseract library, the SDK also adds some extras that let you take advantage of Tesseract’s intrinsic capabilities in the best and easiest way.
But let’s get back to the topic and see how you can bolt PDF conversion onto your .NET app.
How to convert scanned images to PDF in a .NET app
First of all, you should install the Tesseract.Net SDK package. It takes no more than 5 minutes, and the process is thoroughly described here.
Once you have installed the package, added all required files to the project, and configured Tesseract according to your preferences, it’s time to work some magic and put these five lines of code wherever appropriate:
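Code:
public void Tiff2Pdf()
{
    using (var api = OcrApi.Create())
    {
        // Initialize the OCR engine with the English ("eng") language data.
        api.Init(null, "eng");
        // The renderer writes output.pdf; the second argument is the tessdata folder.
        var renderer = OcrPdfRenderer.Create("output.pdf", @"c:\YourApp\tessdata\");
        // Recognize every page of the multi-page TIFF and send it to the renderer.
        api.ProcessPages(@"c:\multipage.tif", null, 0, renderer);
    }
}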
As you can see, working with Tesseract.Net SDK is simple. In this example we convert a multi-page TIFF to a PDF document. An instance of the OcrApi class gives you high-level access to all OCR functions of the library. The nuts and bolts of the process are hidden inside the SDK, so all you need to do is pass an image in one of the supported formats (JPEG, TIFF, and PNG are all welcome) to ProcessPages together with an OcrPdfRenderer instance. It is this class that does all the work of converting the supplied image to a PDF file.
Now let’s look at a slightly more complex piece of code.
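Code:
public void MultipleImages2Pdf(List<string> filenames)
{
    // Write the image paths, one per line, to a temporary list file.
    var tempfile = Path.GetTempFileName();
    try
    {
        File.WriteAllLines(tempfile, filenames.ToArray());
        using (var api = OcrApi.Create())
        {
            api.Init(null, "eng");
            var renderer = OcrPdfRenderer.Create("output.pdf", @"c:\YourApp\tessdata\");
            // Passing the list file renders every listed image into one PDF.
            api.ProcessPages(tempfile, null, 0, renderer);
        }
    }
    finally
    {
        // Remove the temporary list file once processing is done.
        File.Delete(tempfile);
    }
}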
In this example, we are converting a batch of individual images to PDF. Often, scans are not stitched together into a single TIFF but are scattered across multiple PNG or JPEG images. Hell, the formats may even vary across the stack! Here we pass a text file to the ProcessPages function to tell it that we need to render each image file listed there and put all of them into one PDF file. The plain text file should therefore contain the filenames of the images we want to convert, one per line:
C:\scanned_document\page001.png
C:\scanned_document\page002.png
C:\scanned_document\page003.png
C:\scanned_document\page004.png
...
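If the scans sit together in one folder, the list file can be generated rather than typed by hand. The following is only a minimal sketch: the ImageListFile class and its Create method are hypothetical helpers that are not part of Tesseract.Net SDK, the extension filter simply mirrors the formats mentioned in this article, and only standard System.IO and LINQ calls are used.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper (not part of Tesseract.Net SDK): builds the plain-text
// list file described above from all images found in one folder.
public static class ImageListFile
{
    // Assumption: these extensions cover the formats named in the article.
    private static readonly string[] Extensions =
        { ".png", ".jpg", ".jpeg", ".tif", ".tiff" };

    // Returns the path of a temporary text file that holds one absolute
    // image path per line, sorted by name.
    public static string Create(string folder)
    {
        var files = Directory.EnumerateFiles(Path.GetFullPath(folder))
            // Keep only files whose extension is in the list (case-insensitive).
            .Where(f => Extensions.Contains(Path.GetExtension(f),
                                            StringComparer.OrdinalIgnoreCase))
            // Plain ordinal sort: page001 comes before page002, and so on.
            .OrderBy(f => f, StringComparer.OrdinalIgnoreCase)
            .ToArray();

        var listFile = Path.GetTempFileName();
        File.WriteAllLines(listFile, files);
        return listFile;
    }
}
```

The returned path can be passed to ProcessPages in place of tempfile in the example above. Note that the sort is purely alphabetical, so page order is only reliable when the filenames are zero-padded (page001, page002, ...), as in the listing here; the caller is also responsible for deleting the list file afterwards.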
Final notes
With Tesseract.Net SDK, creating PDF files from scanned images in your .NET application is a matter of minutes, not hours. The library supports source images in a variety of formats, including TIFF, JPEG, and PNG. And if you need to fine-tune the PDF export, there are some advanced options as well.