PDF documents. Orphaned objects and references.
Google has published the source code of PDFium, the engine that renders PDF documents in the Chrome browser, under a BSD license. The engine is based on PDF technology originally developed by Foxit Software, a company that offers a number of proprietary PDF products. According to the developer, the software is at least three times faster than any other PDF processing software, including Adobe's own programs. Application developers can use PDFium to deliver PDF content to users regardless of the platform or device they read it on.
However, PDFium by Google has certain functional limitations. For instance, it does not offer any means to edit documents.
So we took the PDFium engine as a basis and developed our PDFium.NET SDK, which we keep extending with new functionality. The product adds PDF editing capabilities, specifically: working with FDF; editing bookmarks; access to page objects (creating, editing and deleting); access to dictionaries and inner objects of a PDF, including indirect objects, that is, objects referenced from other objects; access to the cross-reference table; incremental saving of PDF files, and so on.
We continue to extend the product's editing and content creation capabilities.
In our work we regularly run into certain issues, and we would like to shed light on some of them. Please note that we do not claim to know everything; this article is meant as a starting point for discussion.
In a PDF file, an object is not necessarily referenced from a single page (an object of the PAGE type). The same object may also be referenced from other pages of the document. This means that when a PDF document is edited, the shared object itself should not be changed, because the change would affect the document in several places at once. Often, this is not what you need. For instance, if you change the font on one page, it can also change on another page if that page refers to the same object. The layout of pages where no changes were intended may break as well. Therefore, it is advisable to create a copy of the object, modify the copy, and then replace the reference to the original object with a reference to the copy. At the same time, you should not delete the old object, because other pages of the document may still refer to it.
With this approach, the old object may at some point lose its last reference. Typically, PDF viewing and editing programs do not notice this situation and simply ignore such orphaned objects. As a result, in some cases the file size may grow without control.
The second problem is the opposite: some PDF documents contain references to non-existing objects. Such references may lead to errors while parsing and rendering the file. PDF rendering and editing programs usually prevent this, but such files are still found.
Below, we take a closer look at these two cases and offer a way to fix them.
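The copy-and-redirect idea can be illustrated with a minimal sketch. It does not use the PDFium.NET SDK: the two dictionaries and the ChangeFontOnPage helper are hypothetical stand-ins for the object model, chosen only to show that the shared object stays untouched.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory model, not the PDFium.NET SDK object model:
// "objects" maps an object number to a font name, and "pageFont" maps
// each page to the number of the font object it uses.
static class CopyOnWriteSketch
{
    // Gives the page its own copy of the font object and returns the new object number.
    public static int ChangeFontOnPage(
        Dictionary<int, string> objects,
        Dictionary<string, int> pageFont,
        string page,
        string newFont)
    {
        int copyNumber = 1;
        foreach (int n in objects.Keys)
            if (n >= copyNumber) copyNumber = n + 1;   // next free object number

        objects[copyNumber] = newFont;   // the modified copy; the original is not touched
        pageFont[page] = copyNumber;     // only this page is redirected to the copy
        return copyNumber;
    }

    static void Main()
    {
        var objects = new Dictionary<int, string> { [7] = "Helvetica" };
        var pageFont = new Dictionary<string, int> { ["page1"] = 7, ["page2"] = 7 };

        ChangeFontOnPage(objects, pageFont, "page1", "Courier");

        Console.WriteLine(objects[pageFont["page1"]]);   // Courier
        Console.WriteLine(objects[pageFont["page2"]]);   // Helvetica
    }
}
```

If page2 is later redirected as well, object 7 stays in the model with nothing pointing at it, which is exactly the orphaned object case.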
Some theory
A PDF file consists of 4 parts (Fig.1).
Fig.1 – Example of a PDF file
Note: Some PDF files may have compressed elements. To make such elements readable, as in the figure above, you can run the “qpdf” utility with the following parameter: “--stream-data=uncompress”.
The file starts with a one-line header that determines the version of the PDF specification this file meets.
Then, there is the body of the document, containing the objects that form the document (in this example these are lines from 2 to 26).
Then, there is a cross-reference table that contains information about indirect objects in the file (lines 27-36).
The document ends with a trailer that specifies the location of the cross-reference table and some other special objects within the body of the file.
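As a rough illustration of this layout, the sketch below reads the version from the header and the cross-reference table offset from the end of the file. It treats the file as text, which is a simplification (real files are binary), and the class name, method names and sample content are our own assumptions rather than part of any SDK.

```csharp
using System;
using System.Text.RegularExpressions;

static class PdfLayoutSketch
{
    // Returns the version from the "%PDF-x.y" header line, or null if it is missing.
    public static string ReadVersion(string pdfText)
    {
        Match m = Regex.Match(pdfText, @"^%PDF-(\d+\.\d+)");
        return m.Success ? m.Groups[1].Value : null;
    }

    // Returns the byte offset written after the last "startxref" keyword, or -1.
    public static long ReadStartXref(string pdfText)
    {
        int pos = pdfText.LastIndexOf("startxref", StringComparison.Ordinal);
        if (pos < 0) return -1;
        Match m = Regex.Match(pdfText.Substring(pos), @"startxref\s+(\d+)");
        return m.Success ? long.Parse(m.Groups[1].Value) : -1;
    }

    static void Main()
    {
        // Heavily shortened, made-up content; the offset 187 is illustrative only.
        string sample = "%PDF-1.4\n1 0 obj\n<</Type/Catalog>>\nendobj\n" +
                        "xref\n0 2\n...\ntrailer\n<</Root 1 0 R /Size 2>>\n" +
                        "startxref\n187\n%%EOF\n";

        Console.WriteLine(ReadVersion(sample));     // 1.4
        Console.WriteLine(ReadStartXref(sample));   // 187
    }
}
```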
The body of the PDF file consists of a set of indirect objects forming contents of the document.
A PDF document can be represented as a hierarchy of objects contained in the body section of the PDF file. The catalog object is at the root of this hierarchy. The reference to it is stored in the trailer of the PDF document, in the Root parameter.
Any object in a PDF file can be labeled as an indirect object. An indirect object has a unique identifier that other objects can use to refer to it. The identifier of the object consists of two parts:
- Positive integer object number.
- Non-negative integer generation number.
Together, the object number and the generation number unambiguously identify a given indirect object. The object retains the same object number and generation number throughout its lifetime.
Other objects can refer to an object from elsewhere in the file using an indirect reference, which consists of an object number, a generation number and the R keyword. For example: 2 0 R (Fig.2).
Such references can appear in dictionaries:
<</Outlines 2 0 R /Pages 3 0 R /Type/Catalog>>
and in arrays:
/Kids[ 4 0 R ]
Fig. 2 – References to objects
In this case, object 2 0 (lines 5-7) is referenced from object 1 0 (line 3), and the element of the Kids array (line 9) is a reference to object 4 0 (lines 11-13).
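To make the reference syntax concrete, here is a small sketch that extracts indirect references from a piece of serialized PDF syntax with a regular expression. The FindReferences helper is hypothetical, and a plain regular expression is only an approximation: it would also match text inside strings or streams, which a real parser must skip.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class IndirectRefSketch
{
    // Finds "<object number> <generation number> R" tokens in serialized PDF syntax.
    public static List<(int Number, int Generation)> FindReferences(string pdfSyntax)
    {
        var result = new List<(int Number, int Generation)>();
        foreach (Match m in Regex.Matches(pdfSyntax, @"(\d+)\s+(\d+)\s+R\b"))
            result.Add((int.Parse(m.Groups[1].Value), int.Parse(m.Groups[2].Value)));
        return result;
    }

    static void Main()
    {
        foreach (var r in FindReferences("<</Outlines 2 0 R /Pages 3 0 R /Type/Catalog>>"))
            Console.WriteLine($"{r.Number} {r.Generation} R");
        // Prints "2 0 R" and "3 0 R", each on its own line.
    }
}
```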
Note that an object can refer to another object that in turn refers to yet another object. The nesting of such references can be rather deep.
Therefore, to solve the issues formulated above, we need to recursively walk through the objects of the document and collect all existing references to objects. Then, based on this list, we can decide whether some objects or references need to be deleted.
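The recursive search can be sketched on a simplified model in which every object is reduced to the list of object numbers it references. This is only an illustration under that assumption: the graph, the CollectReferences helper and the choice of object 1 as the starting point are hypothetical, and the SDK performs the real walk starting from the trailer.

```csharp
using System;
using System.Collections.Generic;

static class ReferenceWalkSketch
{
    // graph: object number -> numbers of the objects it references.
    // Returns, for every referenced number, the list of objects that refer to it.
    public static Dictionary<int, List<int>> CollectReferences(
        Dictionary<int, int[]> graph, int rootNumber)
    {
        var referredBy = new Dictionary<int, List<int>>();
        var visited = new HashSet<int>();
        Visit(graph, rootNumber, visited, referredBy);
        return referredBy;
    }

    static void Visit(Dictionary<int, int[]> graph, int number,
        HashSet<int> visited, Dictionary<int, List<int>> referredBy)
    {
        // Stop on objects already seen (cycles such as /Parent) and on missing objects.
        if (!visited.Add(number) || !graph.TryGetValue(number, out int[] targets))
            return;

        foreach (int target in targets)
        {
            if (!referredBy.TryGetValue(target, out List<int> sources))
                referredBy[target] = sources = new List<int>();
            sources.Add(number);
            Visit(graph, target, visited, referredBy);
        }
    }

    static void Main()
    {
        var graph = new Dictionary<int, int[]>
        {
            [1] = new[] { 2, 3 },   // catalog -> outlines, pages
            [2] = new int[0],
            [3] = new[] { 4 },      // pages -> page
            [4] = new[] { 3, 9 },   // page -> parent, plus a reference to a missing object 9
            [5] = new int[0],       // nothing refers to this object
        };

        var referredBy = CollectReferences(graph, 1);
        foreach (int number in graph.Keys)
            if (number != 1 && !referredBy.ContainsKey(number))
                Console.WriteLine($"orphaned object: {number}");
        foreach (int number in referredBy.Keys)
            if (!graph.ContainsKey(number))
                Console.WriteLine($"reference to missing object: {number}");
    }
}
```

The visited set matters because PDF object graphs contain cycles (a page refers to its parent, which refers back to the page), so a walk without it would never terminate.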
All objects are recorded in the cross-reference table. This table contains the information needed for random access to indirect objects in the file, so there is no need to read the entire file to find a specific object.
The table contains a one-line record for each object that specifies the position of the object in the body of the file. In fact, this is just an offset in bytes from the beginning of the file.
Maintaining the integrity of the data in this table is crucial. This means that when objects are deleted from the body of the document, we need to calculate new offsets for the remaining objects and modify the cross-reference table accordingly.
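The offset bookkeeping can be sketched as follows. The example recomputes byte offsets for the objects that remain after a deletion and formats them as in-use table records (10-digit offset, 5-digit generation number, keyword n). The class and method names are hypothetical, and the sketch leaves out free entries, subsections and the trailer.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class XrefOffsetsSketch
{
    // Recomputes byte offsets for the objects that remain in the body.
    // bodies: object number -> serialized "N G obj ... endobj" text, in file order.
    public static Dictionary<int, long> ComputeOffsets(
        string header, IEnumerable<KeyValuePair<int, string>> bodies)
    {
        var offsets = new Dictionary<int, long>();
        long position = Encoding.ASCII.GetByteCount(header);
        foreach (var body in bodies)
        {
            offsets[body.Key] = position;   // the object starts at the current position
            position += Encoding.ASCII.GetByteCount(body.Value);
        }
        return offsets;
    }

    // Formats one in-use record: 10-digit offset, 5-digit generation number, keyword "n".
    public static string FormatEntry(long offset, int generation)
    {
        return $"{offset:D10} {generation:D5} n";
    }

    static void Main()
    {
        var bodies = new List<KeyValuePair<int, string>>
        {
            new KeyValuePair<int, string>(1, "1 0 obj\n<</Type/Catalog /Pages 3 0 R>>\nendobj\n"),
            // Object 2 has been deleted, so object 3 moves closer to the start of the file.
            new KeyValuePair<int, string>(3, "3 0 obj\n<</Type/Pages /Kids[] /Count 0>>\nendobj\n"),
        };

        foreach (var pair in ComputeOffsets("%PDF-1.4\n", bodies))
            Console.WriteLine($"object {pair.Key}: {FormatEntry(pair.Value, 0)}");
    }
}
```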
Solution
To solve the issues described above, our product offers the PdfRefObjectsCollection class. This class is a collection of all indirect objects used in the document, that is, objects that are referenced at least once from other objects of the PDF file.
An indirect object is described by the REFOBJ class, which has the following properties:
int ObjectNumber – the number of the indirect object.
PdfTypeBase ReferTo – the indirect object itself. If the object does not exist, the value of this property is NULL.
PdfTypeBase[] ReferredBy – an array of objects that refer to this indirect object.
If ObjectNumber has a non-zero value while ReferTo is NULL, the file contains a reference to a non-existing object. We can then find the objects that hold this reference through ReferredBy and remove the reference from them.
Note: The object number of a ReferredBy object may be equal to 0 if the reference comes from an element of a direct object (in the PDFium object model).
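For example, in the page dictionary below, the reference 7 0 R is held by the Font dictionary, which is a direct object nested in the Resources dictionary, which is itself an element of the page dictionary:
<</Contents 5 0 R /Parent 3 0 R /Resources<</Font<</F1 7 0 R >>/ProcSet 9 0 R >>/Type/Page>>
Now, let’s look at the solution in detail.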
1. Removing unnecessary objects (without references to them in the file).
One possible way to solve the orphaned objects problem is the following procedure:
- Load the document
Code:
var doc = PdfDocument.Load(@"d:\test.pdf");
- Get the cross-reference table and the list of all numbered objects in the document
Code:
var cross = PdfCrossReferenceTable.FromPdfDocument(doc);
var list = PdfIndirectList.FromPdfDocument(doc);
- Get the list of objects that have at least one reference in the document by calling the static method of the PdfRefObjectsCollection class.
Code:
var refObjects = PdfRefObjectsCollection.FromPdfDocument(doc);
This call runs the recursive algorithm explained above, starting from the trailer of the document.
- Then we enumerate the objects recorded in the cross-reference table and check whether each object number appears in the list of objects referenced in the document. If it does not, nothing refers to the object, and we handle the issue (delete the object).
Each deleted object must be removed from the cross-reference table too.
Code:
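//Collect the numbers first (List<int> requires System.Collections.Generic),
//so that the table is not modified while it is being enumerated
var orphans = new List<int>();
foreach (var obj in cross)
{
    REFOBJ refObj = refObjects.GetBy(obj.ObjectNumber);
    if (refObj == null)
        orphans.Add(obj.ObjectNumber);
}
//Delete each unreferenced object from the object list and from the cross-reference table
foreach (int number in orphans)
{
    list.Remove(number);
    cross.Remove(number);
}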
- Save the document
Code:
doc.Save(@"d:\test_modified.pdf", SaveFlags.NoIncremental);
We save to a different file because the original file is locked for writing: PdfDocument.Load reads the file contents on demand, so the file cannot be overwritten until the original document is closed.
Note: We enumerate objects using the cross-reference table, not the object collection obtained from PdfIndirectList. The reason is that the list of objects obtained from PdfIndirectList can be incomplete.
This is worth some elaboration:
You have probably noticed that even a very large PDF document with hundreds of thousands of pages loads very quickly. This is possible because there is no need to parse the entire document body. At first, only the cross-reference table is read, and the rest of the data is loaded when needed. As a result, if a page has not been loaded into memory, its object does not appear in the list. However, requesting an object with a certain number from that list causes the object to be parsed and added to the list. Therefore, with some juggling you can still use PdfIndirectList, but that makes the algorithm a bit more complex, and for the purposes of this article PdfCrossReferenceTable is enough. That said, we often come across files with a damaged cross-reference table, and in that case you will probably need to use the PdfIndirectList collection. By the way, a broken cross-reference table can be fixed by calling PdfCrossReferenceTable.Rebuild().
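The effect of on-demand loading on enumeration can be shown with a small model. It does not reproduce the SDK's internals: the Get helper, the Parse stand-in and the dictionary used as a cache are hypothetical, and they only demonstrate why a list that is filled lazily can look incomplete.

```csharp
using System;
using System.Collections.Generic;

static class LazyObjectListSketch
{
    // Stand-in for parsing the object at the offset recorded in the cross-reference table.
    static string Parse(int number) => $"object {number}";

    // Requesting an object by number parses it on first use and caches it in "loaded".
    public static string Get(Dictionary<int, string> loaded, int number)
    {
        if (!loaded.TryGetValue(number, out string value))
            loaded[number] = value = Parse(number);
        return value;
    }

    static void Main()
    {
        int[] crossReferenceNumbers = { 1, 2, 3, 4 };   // known up front from the table
        var loaded = new Dictionary<int, string>();     // filled only on demand

        Get(loaded, 1);
        Console.WriteLine(loaded.Count);   // 1: enumerating "loaded" now would miss 2, 3 and 4

        foreach (int number in crossReferenceNumbers)
            Get(loaded, number);
        Console.WriteLine(loaded.Count);   // 4
    }
}
```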
Hence, the entire program will look like this:
Code:
using System.Collections.Generic;
using Patagames.Pdf;
using Patagames.Pdf.Enums;
using Patagames.Pdf.Net;
using Patagames.Pdf.Net.BasicTypes;
namespace RemoveLinks
{
    class Program
    {
        static void Main(string[] args)
        {
            //Initialize the engine
            PdfCommon.Initialize();
            //Load the document
            using (var doc = PdfDocument.Load(@"d:\test.pdf"))
            {
                //Get the cross-reference table and the list of indirect objects in the document
                var cross = PdfCrossReferenceTable.FromPdfDocument(doc);
                var list = PdfIndirectList.FromPdfDocument(doc);
                //Get the list of objects that have at least one reference in the document
                var refObjects = PdfRefObjectsCollection.FromPdfDocument(doc);
                //Enumerate the cross-reference table and collect the numbers of objects
                //that are missing from the list of referenced objects; removal is deferred
                //so that the table is not modified while it is being enumerated
                var orphans = new List<int>();
                foreach (var obj in cross)
                {
                    //Look the object up among the referenced objects
                    REFOBJ refObj = refObjects.GetBy(obj.ObjectNumber);
                    if (refObj == null)
                        orphans.Add(obj.ObjectNumber);
                }
                //Delete each unreferenced object from the object list and from the cross-reference table
                foreach (int number in orphans)
                {
                    list.Remove(number);
                    cross.Remove(number);
                }
                //Save the document
                doc.Save(@"d:\test_modified.pdf", SaveFlags.NoIncremental);
            }
        }
    }
}
Note: When an object with no references to it is found in the document, you need to decide what to do with it. You can simply remove the object as shown above, but you should do this only if you are sure that all such objects must be deleted. If you are not completely sure, you can use a modified version of the algorithm based on the same objects and methods.
For example, one possible way to remove only the orphaned objects that appeared while the document was being edited or generated is as follows:
First, we collect the numbers of the objects in the PDF file that had at least one reference before the editing of the document started.
Next, we modify the PDF document.
Then, we collect the numbers of the objects that have at least one reference in the modified document.
Finally, we compare the two lists and remove from the modified document the objects whose numbers are present in the first list (prior to editing) but absent from the second list (after editing). Remember that objects must be removed both from the object model (the collection of all objects of the modified document) and from the cross-reference table.
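The comparison step itself is a set difference. A minimal sketch, assuming the two lists of object numbers have already been collected (for instance, from the ObjectNumber values of the referenced objects before and after editing); the FindNewOrphans helper and the sample numbers are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class OrphanDiffSketch
{
    // Objects that were referenced before the edit but are no longer referenced after it.
    public static List<int> FindNewOrphans(IEnumerable<int> referencedBefore,
                                           IEnumerable<int> referencedAfter)
    {
        return referencedBefore.Except(referencedAfter).OrderBy(n => n).ToList();
    }

    static void Main()
    {
        var before = new[] { 1, 2, 3, 4, 7 };
        var after = new[] { 1, 2, 3, 4, 8 };   // a page now points to copy 8 instead of 7

        Console.WriteLine(string.Join(", ", FindNewOrphans(before, after)));   // prints "7"
    }
}
```

Objects that had no references even before editing never appear in the first list, so this variant leaves them alone.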
2. Removing references to non-existing objects
The second problem, references to non-existing objects, can be solved with the following algorithm:
- As in the previous example, we load the document and obtain the list of objects in the document and the list of objects that have at least one reference in the document.
- Enumerate all objects in the reference list and select those that are missing from the file.
Code:
foreach (var refObj in refObjects)
{
    if (refObj.ReferTo == null)
    {
        . . .
    }
}
- For each such missing object, we enumerate all objects that hold a reference to it. In each of these objects we delete that reference.
Code:
int LostObjNum = refObj.ObjectNumber;
foreach (var o in refObj.ReferredBy)
    RemoveRefs(list, cross, o, LostObjNum);
- As in the previous example, the program ends with saving the document.
Code:
doc.Save(@"d:\test_modified.pdf", SaveFlags.NoIncremental);
The RemoveRefs function, which deletes references, takes a dictionary or an array, searches through its elements and removes every element that is the specified reference (the referTo parameter). If the referring object is itself a reference to the non-existing object (that is, the reference is not nested in an array or a dictionary), the referring object is removed from the object list and from the cross-reference table.
Code:
private static void RemoveRefs(PdfIndirectList list, PdfCrossReferenceTable cross, PdfTypeBase referredBy, int referTo)
{
    if (referredBy is PdfTypeArray)
    {
        var arr = referredBy as PdfTypeArray;
        //Walk backwards so that removing an element does not shift the ones still to be checked
        for (int i = arr.Count - 1; i >= 0; i--)
        {
            if ((arr[i] is PdfTypeIndirect) && ((arr[i] as PdfTypeIndirect).Number == referTo))
                arr.RemoveAt(i);
        }
    }
    else if (referredBy is PdfTypeDictionary)
    {
        var dict = referredBy as PdfTypeDictionary;
        //Enumerate a snapshot of the keys (ToList requires System.Linq),
        //so that entries can be removed without disturbing the enumeration
        foreach (var key in dict.Keys.ToList())
        {
            if ((dict[key] is PdfTypeIndirect) && ((dict[key] as PdfTypeIndirect).Number == referTo))
                dict.Remove(key);
        }
    }
    else if (referredBy is PdfTypeIndirect)
    {
        //The referring object is itself the reference: remove it entirely
        int n = (referredBy as PdfTypeIndirect).Number;
        list.Remove(n);
        cross.Remove(n);
    }
    else
        return;
}
Therefore, the entire program will look as follows:
Code:
using System.Linq;
using Patagames.Pdf;
using Patagames.Pdf.Enums;
using Patagames.Pdf.Net;
using Patagames.Pdf.Net.BasicTypes;
namespace RemoveLinks
{
    class Program
    {
        static void Main(string[] args)
        {
            //Initialize the engine
            PdfCommon.Initialize();
            //Load the document
            using (var doc = PdfDocument.Load(@"d:\test.pdf"))
            {
                //Get the cross-reference table and the list of indirect objects in the document
                var list = PdfIndirectList.FromPdfDocument(doc);
                var cross = PdfCrossReferenceTable.FromPdfDocument(doc);
                //Get the list of objects that have at least one reference in the document
                var refObjects = PdfRefObjectsCollection.FromPdfDocument(doc);
                foreach (var refObj in refObjects)
                {
                    if (refObj.ReferTo == null)
                    {
                        //Found a reference to a non-existing object
                        int LostObjNum = refObj.ObjectNumber;
                        //Every object in the refObj.ReferredBy array holds a reference to the missing object
                        foreach (var o in refObj.ReferredBy)
                            RemoveRefs(list, cross, o, LostObjNum);
                    }
                }
                //Save the document
                doc.Save(@"d:\test_modified.pdf", SaveFlags.NoIncremental);
            }
        }
        private static void RemoveRefs(PdfIndirectList list, PdfCrossReferenceTable cross, PdfTypeBase referredBy, int referTo)
        {
            if (referredBy is PdfTypeArray)
            {
                var arr = referredBy as PdfTypeArray;
                //Walk backwards so that removing an element does not shift the ones still to be checked
                for (int i = arr.Count - 1; i >= 0; i--)
                {
                    if ((arr[i] is PdfTypeIndirect) && ((arr[i] as PdfTypeIndirect).Number == referTo))
                        arr.RemoveAt(i);
                }
            }
            else if (referredBy is PdfTypeDictionary)
            {
                var dict = referredBy as PdfTypeDictionary;
                //Enumerate a snapshot of the keys, so that entries can be removed
                //without disturbing the enumeration
                foreach (var key in dict.Keys.ToList())
                {
                    if ((dict[key] is PdfTypeIndirect) && ((dict[key] as PdfTypeIndirect).Number == referTo))
                        dict.Remove(key);
                }
            }
            else if (referredBy is PdfTypeIndirect)
            {
                //The referring object is itself the reference: remove it entirely
                int n = (referredBy as PdfTypeIndirect).Number;
                list.Remove(n);
                cross.Remove(n);
            }
            else
                return;
        }
    }
}
Note: Keep in mind that whenever an object is changed in the object model, you may end up with some “unnecessary” objects, that is, objects with no references to them in the file. This brings us back to problem number one.
Conclusion
As a result of our work, you now have a functional product that helps you create useful applications and improve the structure of your files by removing unnecessary objects and references to non-existing objects.
These are merely two of the problems you can face while working with PDF documents. There are others, and different products offer different ways to solve them. For the problems described above we offer our own solution, but we do not claim that it is the only possible solution or a silver bullet. If you have thoughts on this topic, or other interesting cases, please share them with us.
Edited by moderator Sunday, October 13, 2019 6:18:24 AM(UTC)