Guest  
#1 Posted : Saturday, January 23, 2016 1:39:32 AM(UTC)
Hello,

While evaluating and using the Pdfium.NET SDK, we ran into a problem when extracting text from pages.


We have a document that can be opened by Adobe Acrobat Reader and Google Chrome (Pdf Viewer - Pdfium).

If we select the text on the first page in those viewers, the drawn rectangles (extracted glyphs) are correct.


When we extract the text from the first page with Patagames.Pdf (using PdfTextObject.GetCharRect()), the rectangles returned for the characters are incorrect.

However, the BoundingBox of the whole text-row (PdfTextObject) is correct.
Paul Rayman  
#2 Posted : Saturday, January 23, 2016 1:45:19 AM(UTC)
Hi,

This method returns the raw data without applying the transformation matrices.

You have to apply the text object's matrix yourself. Something like this:

Code:
var bb = obj.GetCharRect(i);
var matrix = obj.TextMatrix;
bb.left = bb.left * matrix.a + matrix.e;
bb.right = bb.right * matrix.a + matrix.e;
bb.top = bb.top * matrix.d + matrix.f;
bb.bottom = bb.bottom * matrix.d + matrix.f;
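
The snippet above uses only the a, d, e and f components of the matrix, so it covers scaling and translation. For rotated or skewed text, the b and c components matter as well. Below is a minimal sketch of the general case, assuming the usual PDF convention x' = a*x + c*y + e, y' = b*x + d*y + f. Matrix2D, CharBox and CharBoxTransform are hypothetical stand-ins written for this example, not SDK types. The idea is to transform all four corners and take their axis-aligned bounds.

```csharp
using System;

// Hypothetical stand-ins for the SDK's matrix and rectangle types (assumption).
public struct Matrix2D { public float a, b, c, d, e, f; }
public struct CharBox { public float left, top, right, bottom; }

public static class CharBoxTransform
{
    // Transforms all four corners of the box and returns their
    // axis-aligned bounds in PDF page space (y grows upward, so top >= bottom).
    public static CharBox Apply(CharBox r, Matrix2D m)
    {
        float[] xs = { r.left, r.right, r.right, r.left };
        float[] ys = { r.bottom, r.bottom, r.top, r.top };

        float minX = float.MaxValue, minY = float.MaxValue;
        float maxX = float.MinValue, maxY = float.MinValue;

        for (int i = 0; i < 4; i++)
        {
            // PDF affine transform: x' = a*x + c*y + e, y' = b*x + d*y + f
            float x = m.a * xs[i] + m.c * ys[i] + m.e;
            float y = m.b * xs[i] + m.d * ys[i] + m.f;

            minX = Math.Min(minX, x); maxX = Math.Max(maxX, x);
            minY = Math.Min(minY, y); maxY = Math.Max(maxY, y);
        }

        return new CharBox { left = minX, right = maxX, bottom = minY, top = maxY };
    }
}
```

When b and c are both zero and a and d are positive, this gives the same result as the four assignments above. Taking the min and max also keeps left <= right and bottom <= top when a or d is negative.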


Please take a look at the code below. It illustrates how to convert the raw char rect into page coordinates and then into the user control's coordinates.
I checked it on your file (incorrect_rectangles.pdf), and it correctly fills all letters on the current page.

Code:

private void button45_Click(object sender, EventArgs e)
{
    var page = pdfViewer1.Document.Pages.CurrentPage;
    int pageIndex = pdfViewer1.Document.Pages.CurrentIndex;

    // Create the drawing surface and the brush once, and dispose both when done.
    using (var g = Graphics.FromHwnd(pdfViewer1.Handle))
    using (var brush = new SolidBrush(Color.FromArgb(50, 99, 0, 0)))
    {
        foreach (var o in page.PageObjects)
        {
            // Only text objects have per-character rectangles.
            var obj = o as PdfTextObject;
            if (obj == null)
                continue;

            var matrix = obj.TextMatrix;
            for (int i = 0; i < obj.CharsCount; i++)
            {
                // Raw char rect -> page coordinates (scale and translation only).
                var bb = obj.GetCharRect(i);
                bb.left = bb.left * matrix.a + matrix.e;
                bb.right = bb.right * matrix.a + matrix.e;
                bb.top = bb.top * matrix.d + matrix.f;
                bb.bottom = bb.bottom * matrix.d + matrix.f;

                // Page coordinates -> client coordinates of the viewer control.
                var pt1 = pdfViewer1.PageToClient(
                    pageIndex,
                    new PointF(bb.left, bb.top));
                var pt2 = pdfViewer1.PageToClient(
                    pageIndex,
                    new PointF(bb.right, bb.bottom));

                // Semi-transparent fill over the character.
                g.FillRectangle(
                    brush,
                    pt1.X, pt1.Y, pt2.X - pt1.X, pt2.Y - pt1.Y);
            }
        }
    }
}

Edited by user Saturday, January 23, 2016 10:06:15 AM(UTC)  | Reason: Not specified
