rhnatiuk  
#1 Posted : Wednesday, June 12, 2019 2:27:06 AM(UTC)
rhnatiuk

Rank: Member

Groups: Registered
Joined: 4/30/2019(UTC)
Posts: 11
Finland

Thanks: 5 times
Hi,

We have hit a major roadblock. After modifying a PDF page (adding a few link annotations, a few boxes, and corresponding rectangles), we call pdfPage.GenerateContent(), and it takes a long time: more than 1 minute on large files, and at least 5-6 seconds on smaller files.

The library we have used before Pdfium.Net had no such problem - the creation of modified PDFs was near-instant.

Any ideas? Can we get an improved GenerateContent, or can we somehow avoid using it? It seems that GenerateContent re-creates the whole PDF (but more than 1 minute?) instead of performing an incremental update that adds only the new objects.

Thank you!
Paul Rayman  
#2 Posted : Friday, June 14, 2019 5:45:07 AM(UTC)
Paul Rayman

Rank: Administration

Groups: Administrators
Joined: 1/5/2016(UTC)
Posts: 743

Thanks: 1 times
Was thanked: 90 time(s) in 89 post(s)
Are you using the latest version of the SDK? Page content generation was improved in version 3.38.2704.


1. GenerateContent() recreates the contents of the whole page. This is a feature of the PDF structure.
2. GenerateContent() works only on the page content, but not on the entire document.
3. Perhaps your page contains a lot of objects.
4. Perhaps you have subscribed to the notifications raised when each object is generated, and the slowdown occurs somewhere in that handler in your code.
5. If you add annotations only, then you should not call GenerateContent(), because annotations are not part of the page content.

Could you please provide your document for analysis?
rhnatiuk  
#3 Posted : Friday, June 14, 2019 6:07:46 AM(UTC)
Originally Posted by: Paul Rayman
Are you using the latest version of the SDK? Page content generation was improved in version 3.38.2704.


Yes, we are using the latest version, 4.7.2704 (from the web site or NuGet), although the DLL reports its file version as 4.2.3.471.

Originally Posted by: Paul Rayman
1. GenerateContent() recreates the contents of the whole page. This is a feature of the PDF structure.
2. GenerateContent() works only on the page content, but not on the entire document.


Yes, that is what we have figured out...

Originally Posted by: Paul Rayman
3. Perhaps your page contains a lot of objects.


With 1600 objects (a rather small drawing) it already takes about one second. The document itself is attached: 722-SWC-10.1.pdf (43kb). The same happens with all documents, on all computers. The longest I had to wait for GenerateContent was close to one minute (admittedly those files are big: 5 meters x 1 meter, with a lot of graphics, i.e. drawings), on a rather modern i7.

Originally Posted by: Paul Rayman
4. Perhaps you have subscribed to the notifications raised when each object is generated, and the slowdown occurs somewhere in that handler in your code.


No, we have not subscribed to anything; everything is left at its defaults.

Originally Posted by: Paul Rayman
5. If you add annotations only, then you should not call GenerateContent(), because annotations are not part of the page content.


Unfortunately, we have to add boxes for the links (background and border), and those are part of the page content.

Originally Posted by: Paul Rayman
Could you please provide your document for analysis?


Sure, it is attached. But, as I said, the problem reproduces with all other documents.

Our old library did not have this kind of problem (it had many others, though). Also, I do not quite understand why the whole page must be recreated, especially since we are only adding to it and not removing anything. I have seen that on the pdfium side it is possible to pass a list of objects to GenerateContent, so I thought some kind of incremental generation might be possible. But the necessary APIs seem to be missing.
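
For anyone who wants to reproduce timings like the ones above, here is a minimal sketch of how the call could be measured. The Measure helper is a hypothetical name of my own, not part of Pdfium.Net, and the sleep is only a placeholder workload; in the real application the lambda would wrap the pdfPage.GenerateContent() call instead.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class Timing
{
    // Hypothetical helper: runs the action once and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Action action)
    {
        var stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    public static void Main()
    {
        // Placeholder workload. In the real application this would be:
        //   Measure(() => pdfPage.GenerateContent())
        var elapsed = Measure(() => Thread.Sleep(50));
        Console.WriteLine("Elapsed: " + elapsed.TotalMilliseconds + " ms");
    }
}
```

Running the measurement several times on the same page and comparing the results helps separate a one-off warm-up cost from the actual generation time.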
rhnatiuk  
#4 Posted : Thursday, June 20, 2019 3:50:56 AM(UTC)
Hi Paul Rayman,

Any news on the topic?

We would like to use only Pdfium.Net in our product, rather than using Pdfium.Net to read the document and extract the necessary information, and then the old library to add links and rectangles and save the document. But so far, that two-library approach seems to be the only viable solution for us...

Thank you in advance!

Paul Rayman  
#5 Posted : Thursday, June 20, 2019 7:45:04 AM(UTC)
I looked at the document you sent. It has a rather complicated structure and a large number of objects on the page, so fully regenerating these objects takes some time. It is unlikely that this can be optimized.

However, in your situation there is a solution!
In fact, you do not need to regenerate all the objects, since you do not change them.

I can suggest several solutions:
1. Use the rectangle and/or line annotations to add a border around the link instead of adding objects to the page content; or
2. Use the annotation's NormalAppearance collection to draw a border instead of drawing in the page content. An example can be found here; or
3. Convert the page's content into an array of contents and add the new page content to this array. In this case, you do not need to regenerate all objects: only the objects you have added are generated, which is very fast. A detailed example follows:
Code:
// Namespaces as used by the Pdfium.Net SDK
using System;
using Patagames.Pdf;
using Patagames.Pdf.Enums;
using Patagames.Pdf.Net;
using Patagames.Pdf.Net.BasicTypes;

static void Main(string[] args)
{
    using (var doc = PdfDocument.Load(@"e:\0\722-SWC-10.1.pdf"))
    {
        var page = doc.Pages[0];
        // Make sure the page's Contents entry is an array we can append to
        var contentsArray = ConvertContentsToArray(page);
        using (var form = PdfFormObject.Create(page))
        {
            // Add the new objects to the form object, not to the page itself
            form.PageObjects.Add(PdfTextObject.Create("Sample Text", 150, 150, PdfFont.CreateStock(doc, FontStockNames.Arial), 45.0f));
            // Create an empty stream and add it to the list of indirect objects
            var stream = PdfTypeStream.Create();
            var list = PdfIndirectList.FromPdfDocument(doc);
            list.Add(stream);
            // Generate the content of the form object into that stream
            bool generated = Pdfium.FPDF_GenerateContentToStream(doc.Handle, form.PageObjects.Handle, stream.Handle, IntPtr.Zero);
            if (!generated)
                throw new InvalidOperationException("FPDF_GenerateContentToStream failed");
            // Append the stream to the page's contents array
            contentsArray.AddIndirect(list, stream);
        }
        doc.Save(@"e:\0\722-SWC-10.1_modified.pdf", SaveFlags.NoIncremental);
    }
}


and the ConvertContentsToArray helper:

Code:
public static PdfTypeArray ConvertContentsToArray(PdfPage page)
{
    var pageDict = page.Dictionary;
    var list = PdfIndirectList.FromPdfDocument(page.Document);

    // No Contents entry at all: start with an empty array
    if (!pageDict.ContainsKey("Contents"))
        return SetNewContentsArray(pageDict, list, null);

    var contents = pageDict["Contents"];

    // If Contents is a reference, resolve it to the object it points to
    var indirect = contents as PdfTypeIndirect;
    var direct = indirect != null ? indirect.Direct : contents;

    // Contents is already an array (direct or referenced): just return it
    var existing = direct as PdfTypeArray;
    if (existing != null)
        return existing;

    // Contents is a single stream (direct or referenced):
    // create a new array and insert the stream as its first element
    if (direct is PdfTypeStream)
        return SetNewContentsArray(pageDict, list, direct);

    throw new InvalidOperationException("Unexpected content type");
}

private static PdfTypeArray SetNewContentsArray(PdfTypeDictionary pageDict, PdfIndirectList list, PdfTypeBase firstStream)
{
    var array = PdfTypeArray.Create();
    if (firstStream != null)
        array.AddIndirect(list, firstStream);
    // Add the array to the list of indirect objects
    list.Add(array);
    // And set it as the contents of the page
    pageDict.SetIndirectAt("Contents", list, array);
    return array;
}
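
As for why appending is cheap: per the PDF specification, a page's Contents entry may be either a single stream or an array of streams, and a viewer processes the array as if the streams were concatenated in order. Adding one more stream therefore leaves the existing streams untouched. The sketch below models only that idea in plain C#. PageModel and ContentStream are hypothetical stand-ins, not Pdfium.Net types, and the operator strings are just sample PDF drawing operators.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one PDF content stream (a block of drawing operators).
public sealed class ContentStream
{
    public string Operators { get; }
    public ContentStream(string operators) { Operators = operators; }
}

// Hypothetical stand-in for a page whose Contents entry is an array of streams.
public sealed class PageModel
{
    private readonly List<ContentStream> _contents = new List<ContentStream>();

    public IReadOnlyList<ContentStream> Contents => _contents;

    // Appending touches only the new stream; existing streams are not regenerated.
    public void Append(ContentStream stream) => _contents.Add(stream);

    // A viewer reads the array as if all streams were concatenated in order.
    public string EffectiveContent() =>
        string.Join("\n", _contents.Select(s => s.Operators));
}

public static class Demo
{
    public static void Main()
    {
        var page = new PageModel();
        var original = new ContentStream("0 0 100 50 re S");
        page.Append(original);

        // The new drawing goes into its own stream, wrapped in q/Q
        // so that it does not depend on leftover graphics state.
        page.Append(new ContentStream("q 1 0 0 RG 10 10 80 30 re S Q"));

        Console.WriteLine(page.Contents.Count);
        Console.WriteLine(ReferenceEquals(page.Contents[0], original));
        Console.WriteLine(page.EffectiveContent());
    }
}
```

Because the streams behave like one concatenated stream, any graphics state left over from the earlier streams carries into the appended one, so it is worth checking that the added drawing is enclosed in a q ... Q (save/restore) pair.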

rhnatiuk  
#6 Posted : Thursday, June 20, 2019 8:16:34 AM(UTC)
Wow! You ARE a magician! :) Your third proposal with ConvertContentsToArray works very well (at least in a quick try)! Thank you very much!!!