rhnatiuk  
#1 Posted : Wednesday, July 17, 2019 3:58:55 AM(UTC)
rhnatiuk

Rank: Advanced Member

Groups: Registered
Joined: 4/30/2019(UTC)
Posts: 35
Man
Finland
Location: Raisio

Thanks: 9 times
We have bumped into a strange PDF file. Unfortunately, the file is confidential (from our customer), so I cannot post it here.

The PageObjects collection has only one child, of type PdfFormObject, which in turn contains about 15000 children.

QUESTION 1: Does this mean that to find all PdfTextRuns we have to traverse the PageObjects collection of a page, then its children, their children, and so on recursively?

The other problem is that those text runs contain exactly one character each. So the PDF text

Code:
CONT. ON
DRG 2


will be "reported" to us as the following PdfTextRuns (@location width x height: text):

Code:
@(561.2549, 829.8203) 6.301392 x 7.397034 : C
@(568.4346, 829.8203) 6.829834 x 7.397034 : O
@(576.6263, 829.7001) 5.613464 x 7.166809 : N
@(583.2775, 829.7001) 5.65332 x 7.166809 : T
@(590.0793, 823.5343) 0.9970703 x 1.000977 : .
@(591.961, 822.5333) 0 x 0 :
@(595.2524, 829.8203) 6.829834 x 7.397034 : O
@(603.3329, 829.7001) 5.613464 x 7.166809 : N
@(561.5441, 817.8964) 5.902527 x 7.166809 : D
@(568.7637, 817.8964) 6.291382 x 7.166809 : R
@(575.7059, 818.0165) 6.600464 x 7.397034 : G
@(582.949, 810.7296) 0 x 0 :
@(586.0271, 817.9264) 4.726074 x 7.196838 : 2


whereas we would expect those to be either two text runs ("CONT. ON", "DRG 2") or four ("CONT.", "ON", "DRG", "2").

Adobe and Foxit both allow searching and Ctrl+A selects all text runs.

QUESTION 2: Are we doing something wrong, or are Adobe and Foxit doing some kind of post-processing to join those text runs into "real" text runs? Is there anything we can do to rectify such a problem?

Edited by user Wednesday, July 17, 2019 3:59:45 AM(UTC)  | Reason: Not specified

Paul Rayman  
#2 Posted : Wednesday, July 17, 2019 4:27:55 AM(UTC)
Paul Rayman

Rank: Administration

Groups: Administrators
Joined: 1/5/2016(UTC)
Posts: 844

Thanks: 2 times
Was thanked: 103 time(s) in 101 post(s)
Originally Posted by: rhnatiuk

QUESTION 1: Does this mean that to find all PdfTextRuns we have to traverse the PageObjects collection of a page, then its children, their children, and so on recursively?


Yes, it does. A PdfFormObject may contain any page objects, including other PdfFormObjects.
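A minimal sketch of such a recursive walk is shown here. The types `PageObject`, `TextRun`, and `FormObject` are hypothetical stand-ins defined only for this example, not the SDK's own classes; the real page-object types would take their place, assuming a form object exposes its children as a collection in the way described above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the SDK's page-object types; only the
// container/leaf shape matters for this sketch.
public abstract class PageObject { }

public sealed class TextRun : PageObject
{
    public string Text { get; }
    public TextRun(string text) { Text = text; }
}

public sealed class FormObject : PageObject
{
    // A form object owns its own collection of page objects,
    // which may again contain form objects.
    public List<PageObject> PageObjects { get; } = new List<PageObject>();
}

public static class PageObjectWalker
{
    // Depth-first walk that yields every text run in document order.
    public static IEnumerable<TextRun> EnumerateTextRuns(IEnumerable<PageObject> objects)
    {
        foreach (PageObject obj in objects)
        {
            if (obj is TextRun run)
            {
                yield return run;
            }
            else if (obj is FormObject form)
            {
                foreach (TextRun nested in EnumerateTextRuns(form.PageObjects))
                {
                    yield return nested;
                }
            }
            // Paths, images, shadings and other objects are skipped.
        }
    }
}
```

Calling `PageObjectWalker.EnumerateTextRuns` on the top-level collection of a page visits nested form objects to any depth, so a page whose only child is a single form object is handled the same way as a flat page.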

Originally Posted by: rhnatiuk

QUESTION 2: Are we doing something wrong, or are Adobe and Foxit doing some kind of post-processing to join those text runs into "real" text runs? Is there anything we can do to rectify such a problem?


The algorithm for converting text objects into flow text is rather complicated. As you can see, text objects can take any form: they can contain whole words or single characters, and they can appear in any order. Neighboring objects can have coordinates pointing to completely different places on the page. Moreover, adjacent text objects can contain letters of different words, and words may be interrupted by other page objects such as paths, shadings, and images.
All of this is normal in terms of the PDF specification. There is nothing unusual here; unfortunately, this is an ordinary document.

Fortunately, such an algorithm is implemented in the engine. You can get the flow text (with a description of its location on the page) using the PdfPage.Text property.
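For readers who need to keep per-run coordinates, the following is a rough sketch of one simple way to join one-character runs into words. It is not the engine's algorithm. `Glyph`, `Word`, and `GlyphJoiner` are hypothetical names, and the sketch assumes that each reported location is the top-left corner of the run with y growing upward, that empty or whitespace runs mark word breaks, and that the tolerances (1 and 3 points) suit the document at hand.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one reported text run (not an SDK type).
public sealed class Glyph
{
    public double X { get; }
    public double Top { get; }
    public double Width { get; }
    public double Height { get; }
    public string Text { get; }

    public Glyph(double x, double top, double width, double height, string text)
    {
        X = x; Top = top; Width = width; Height = height; Text = text;
    }

    // Assumption: the location is the top-left corner and y grows upward.
    public double Bottom => Top - Height;
    public double Right => X + Width;
}

public sealed class Word
{
    public string Text { get; }
    public double Left { get; }
    public double Top { get; }
    public double Right { get; }
    public double Bottom { get; }

    public Word(List<Glyph> glyphs)
    {
        Text = string.Concat(glyphs.Select(g => g.Text));
        Left = glyphs.Min(g => g.X);
        Top = glyphs.Max(g => g.Top);
        Right = glyphs.Max(g => g.Right);
        Bottom = glyphs.Min(g => g.Bottom);
    }
}

public static class GlyphJoiner
{
    // Groups glyphs into lines by their bottom edge, then splits each line
    // into words at whitespace runs or at horizontal gaps wider than maxGap.
    public static List<Word> JoinIntoWords(
        IEnumerable<Glyph> glyphs, double lineTolerance = 1.0, double maxGap = 3.0)
    {
        var words = new List<Word>();
        var line = new List<Glyph>();
        double lineBottom = 0;
        foreach (Glyph glyph in glyphs.OrderByDescending(x => x.Bottom))
        {
            if (line.Count > 0 && lineBottom - glyph.Bottom > lineTolerance)
            {
                words.AddRange(SplitLine(line, maxGap));
                line = new List<Glyph>();
            }
            if (line.Count == 0) lineBottom = glyph.Bottom;
            line.Add(glyph);
        }
        if (line.Count > 0) words.AddRange(SplitLine(line, maxGap));
        return words;
    }

    private static List<Word> SplitLine(List<Glyph> line, double maxGap)
    {
        var result = new List<Word>();
        var current = new List<Glyph>();
        foreach (Glyph glyph in line.OrderBy(x => x.X))
        {
            bool isSeparator = string.IsNullOrWhiteSpace(glyph.Text);
            bool gapTooWide = current.Count > 0
                && glyph.X - current[current.Count - 1].Right > maxGap;
            if ((isSeparator || gapTooWide) && current.Count > 0)
            {
                result.Add(new Word(current));
                current = new List<Glyph>();
            }
            if (!isSeparator) current.Add(glyph);
        }
        if (current.Count > 0) result.Add(new Word(current));
        return result;
    }
}
```

Fed with the thirteen runs listed in the first post (treating the two 0 x 0 runs as spaces), this should produce the four words "CONT.", "ON", "DRG", and "2", each with a bounding box. Real drawings with rotated text or tight line spacing would need more than this.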
rhnatiuk  
#3 Posted : Wednesday, July 17, 2019 5:16:11 AM(UTC)
The issue with PdfPage.Text is that it is stripped of all coordinates and sizes. We are trying to find text runs in the page that refer to "real-world" objects (normally, our documents are engineering drawings) and then create links around those texts, which is why we need to keep the coordinates. We have our own algorithms for identifying such text labels, but with such a high number of one-character text runs this becomes quite taxing.

Anyway, thank you for your quick response!
Paul Rayman  
#4 Posted : Wednesday, July 17, 2019 5:53:55 AM(UTC)
Quote:
The issue with PdfPage.Text is that it is stripped of all coordinates and sizes.


But this is not so! Yes, there are no references to page objects, but there are bounding boxes that represent the position and size of the flow text. In fact, this is how selecting and copying text with the mouse works in the PdfViewer. Perhaps your task can still be solved with PdfText.

You can
1. Get all text on a page as a string;
2. Find the index of the first character and the length of the substring in this string;
3. Get the rectangles describing the found text (several rectangles mean that the found substring may be located on several rows);

Code:
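// idx: index of the first character in the page text; len: number of characters.
PdfTextInfo textInfo = tmpDoc.Pages[0].Text.GetTextInfo(idx, len);
string text = textInfo.Text;                  // the substring itself
IEnumerable<FS_RECTF> rects = textInfo.Rects; // its bounding rectangles on the page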
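Step 2 of the list above can be sketched as a plain string search. `TextLocator.FindAll` is a hypothetical helper written for this example, not part of the SDK; it only assumes that the page text is already available as a string.

```csharp
using System;
using System.Collections.Generic;

public static class TextLocator
{
    // Returns the start index of every non-overlapping occurrence
    // of label in pageText.
    public static List<int> FindAll(
        string pageText, string label,
        StringComparison comparison = StringComparison.Ordinal)
    {
        var hits = new List<int>();
        if (string.IsNullOrEmpty(label)) return hits;

        int index = pageText.IndexOf(label, comparison);
        while (index >= 0)
        {
            hits.Add(index);
            // Continue searching just past the occurrence found.
            index = pageText.IndexOf(label, index + label.Length, comparison);
        }
        return hits;
    }
}
```

Each returned index, together with `label.Length`, can then serve as the `idx` and `len` arguments of the `GetTextInfo` call shown above, which answers "where is this text located?" for every occurrence of a known label.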

Edited by user Wednesday, July 17, 2019 6:35:21 AM(UTC)  | Reason: Not specified

rhnatiuk  
#5 Posted : Wednesday, July 17, 2019 6:47:13 AM(UTC)
Interesting!

I tried it quickly:

1. The text "joining" algorithm seems to be quite aggressive. In the screenshot below I circled the text lines that textInfo.Text identified and drew dashed lines where new texts really start. As you can see, it joined "POS", "COMPONENT DESCRIPTION", and "SIZE" (!?), joined "1" and "Pipe ASME B36.10M BE SMLS ASTM A 106 GR. B", and then joined "3 50800016 9.6M 107.74", although those are clearly not together in the picture.

Annotation 2019-07-17 133153.png (96kb)

2. I got 118 text lines from textInfo.Text, but 1071 rectangles, and I am not sure how to make sense of those. If I read your explanation correctly ("several rectangles mean that the found substring may be located on several rows"), this approach works better for the question "What text is in this place?", whereas we need "Where is this text located?", i.e. the opposite direction.
Paul Rayman  
#6 Posted : Wednesday, July 17, 2019 7:32:30 AM(UTC)
Please look at this video


SmoothSelection.None illustrates how GetTextInfo actually works. As you can see, it returns rectangles that bound the text on the page.
You can implement any algorithm to smooth the selection.
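As an illustration of what a simple smoothing pass might look like, here is a sketch that merges consecutive rectangles lying on the same line into one rectangle per line. It is not the PdfViewer's own implementation. `Box` is a hypothetical stand-in for a rectangle with left, top, right, and bottom values, and the sketch assumes PDF page coordinates (top greater than bottom) and rectangles supplied in reading order.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a rectangle in page coordinates (Top > Bottom).
public struct Box
{
    public float Left, Top, Right, Bottom;

    public Box(float left, float top, float right, float bottom)
    {
        Left = left; Top = top; Right = right; Bottom = bottom;
    }
}

public static class RectMerger
{
    // Merges each rectangle into the previous one when both lie on the same line.
    public static List<Box> MergeByLine(IEnumerable<Box> rects)
    {
        var merged = new List<Box>();
        foreach (Box r in rects)
        {
            if (merged.Count > 0 && OnSameLine(merged[merged.Count - 1], r))
            {
                Box last = merged[merged.Count - 1];
                merged[merged.Count - 1] = new Box(
                    Math.Min(last.Left, r.Left), Math.Max(last.Top, r.Top),
                    Math.Max(last.Right, r.Right), Math.Min(last.Bottom, r.Bottom));
            }
            else
            {
                merged.Add(r);
            }
        }
        return merged;
    }

    // Two boxes share a line when their vertical overlap covers at least
    // half the height of the shorter box.
    private static bool OnSameLine(Box a, Box b)
    {
        float overlap = Math.Min(a.Top, b.Top) - Math.Max(a.Bottom, b.Bottom);
        float shorter = Math.Min(a.Top - a.Bottom, b.Top - b.Bottom);
        return overlap > 0 && overlap >= 0.5f * shorter;
    }
}
```

The 50% overlap threshold is an arbitrary choice for the sketch; a stricter or looser value may suit documents with mixed font sizes.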

PdfViewer currently uses SmoothSelection.ByLine.
You can see our implementation in the PdfViewer source code here:
https://github.com/Patag...blob/master/PdfViewer.cs

By the way, PdfViewer has a method that returns already smoothed rectangles, so you can use GetHighlightedRects instead of the low-level GetTextInfo.
Code:
HighlightInfo hi = new HighlightInfo();
hi.CharIndex = 10;
hi.CharsCount = 15;
var rect = pdfViewer1.GetHighlightedRects(0, hi);



Quote:
The text "joining" algorithm seems to be quite aggressive.

As I said, the algorithm is very complex and may not work perfectly on "terrible" text objects.
But it works about the same in all major viewers.

Edited by user Wednesday, July 17, 2019 7:41:18 AM(UTC)  | Reason: Not specified

ESchunk  
#7 Posted : Monday, September 23, 2019 5:57:42 PM(UTC)
ESchunk

Rank: Member

Groups: Registered
Joined: 8/13/2019(UTC)
Posts: 12
United States
Location: New York

Hi Paul,

I reviewed your response above and was hoping you could provide guidance on a related issue I'm facing. We are trying to create a tool to extract text from PDF files. The idea is to have the user enter key search words to find in the file and indicate whether something is above, below, etc. relative to other previously found search words. The problem is that when I get the text from the file with page.Text.GetText(), I do not have the coordinates, and in some of my sample files the last line a user would see on a page is the first line of text returned by the GetText method. I tried calling GetTextInfo for each character and sorting the characters by their bounding boxes. However, the bounding boxes cannot easily be used for sorting because you cannot tell how the characters would line up: there is padding/white space above and below each character when it is displayed to a user. Is there a way to determine how characters would line up if they were to appear in the same line of text, when the only information you have is from GetTextInfo for each character separately?

Thanks,

Ted
Paul Rayman  
#8 Posted : Wednesday, September 25, 2019 1:46:15 AM(UTC)
Hi,

From the post it is difficult to understand exactly what you want to achieve.
I don’t understand why you need to examine single characters instead of working with blocks of text as a whole.

Try looking at the PdfViewer.GetSelectedRects(idx, selInfo) method (it is available in the source code). This method returns the bounding box of a block of text in a line (several rectangles for several lines), taking into account the previous and subsequent characters in the line.
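If characters really have to be handled one by one, a possible approach is to compare vertical midpoints against a running line band rather than comparing top or bottom edges directly. The sketch below uses hypothetical names (`CharBox`, `ReadingOrder`), assumes y grows upward, and is only a heuristic: small marks such as apostrophes next to lowercase letters can still end up on a line of their own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one character and its bounding box (Top > Bottom).
public sealed class CharBox
{
    public char Char { get; }
    public double Left { get; }
    public double Top { get; }
    public double Bottom { get; }

    public CharBox(char c, double left, double top, double bottom)
    {
        Char = c; Left = left; Top = top; Bottom = bottom;
    }

    public double MidY => (Top + Bottom) / 2;
}

public static class ReadingOrder
{
    // Returns the lines of text from top to bottom, each read left to right.
    public static List<string> ToLines(IEnumerable<CharBox> chars)
    {
        var lines = new List<List<CharBox>>();
        double bandTop = 0, bandBottom = 0;
        foreach (CharBox c in chars.OrderByDescending(x => x.MidY))
        {
            double bandMid = (bandTop + bandBottom) / 2;
            // Same line if either midpoint falls inside the other's vertical extent.
            bool sameLine = lines.Count > 0 &&
                ((c.MidY <= bandTop && c.MidY >= bandBottom) ||
                 (bandMid <= c.Top && bandMid >= c.Bottom));
            if (sameLine)
            {
                bandTop = Math.Max(bandTop, c.Top);
                bandBottom = Math.Min(bandBottom, c.Bottom);
            }
            else
            {
                lines.Add(new List<CharBox>());
                bandTop = c.Top;
                bandBottom = c.Bottom;
            }
            lines[lines.Count - 1].Add(c);
        }
        return lines
            .Select(line => new string(
                line.OrderBy(x => x.Left).Select(x => x.Char).ToArray()))
            .ToList();
    }
}
```

Because the band grows to cover the tallest character seen so far on the line, a period with a tiny box still lands on the same line as the capitals around it, which is the padding problem described in the question.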