PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)

APACHE-2.0 License

Stars: 1.5K
Committers: 32

PdfPig - Tamworth (Latest Release)

Published by EliotJones over 1 year ago

This is a release with various bug-fixes and quality of life improvements but no new major features. It adds many of the supporting classes necessary for PDF rendering.

Breaking Changes

  • IColor can now be of type PatternColor. This implementation throws an error when ToRGBValues() is called, so you may need to check that IColor.ColorSpace != ColorSpace.Pattern before calling it
  • Remove Details suffix from ColorSpaceDetails property names
  • AlternateColorSpaceDetails renamed to AlternateColorSpace
  • BaseColorSpaceDetails renamed to BaseColorSpace
  • Seal IColor implementations
  • Use double instead of decimal in color spaces and colors
  • Move IColorSpaceContext from IOperationContext to CurrentGraphicsState
  • Removed ColorSpace property from IPdfImage. Use ColorSpaceDetails.Type to get the enum value
  • IColorSpaceContext's CurrentStrokingColorSpace and CurrentNonStrokingColorSpace are now of type ColorSpaceDetails (not a ColorSpace enum anymore). Use CurrentStrokingColorSpace.Type or CurrentNonStrokingColorSpace.Type to get the enum value
  • Logic change to DefaultWordExtractor: a logic bug in the existing implementation was fixed, so the output of the default page.GetWords() may change in this version
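
The PatternColor change above can be handled with a guard like the following. This is a minimal sketch rather than the library's recommended pattern: the file name is a placeholder, and it assumes the letter's Color property is the IColor you want to inspect.

```csharp
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Graphics.Colors;

public static class Program
{
    public static void Main()
    {
        // "input.pdf" is a placeholder path, not a file from the release notes.
        using (var document = PdfDocument.Open("input.pdf"))
        {
            foreach (var page in document.GetPages())
            {
                foreach (var letter in page.Letters)
                {
                    IColor color = letter.Color;

                    // Pattern colors throw from ToRGBValues(), so skip them.
                    if (color == null || color.ColorSpace == ColorSpace.Pattern)
                    {
                        continue;
                    }

                    var (r, g, b) = color.ToRGBValues();
                    Console.WriteLine($"{letter.Value}: ({r}, {g}, {b})");
                }
            }
        }
    }
}
```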

.NET 4.5

Note that this version removes support for .NET 4.5. Consumers should upgrade to .NET 4.5.1 or 4.5.2.

Release notes

  • Fix support for using the ZapfDingbats Standard 14 font when creating files
  • Address issue with extracting CJK text from PDFs
  • Fix issue with writing ShowText operations to output files when the text contained parentheses
  • Error handling for Type 2 charstring parsing
  • New letter properties, TextRenderingMode, StrokeColor and FillColor
  • Fix for copying inline images to output files
  • Enums for PDF/A-3 compliance
  • Fix for library embedding PNGs with invalid information on output
  • Resolve PageSize enum for landscape orientation documents
  • Fix to rotation handling. The coordinates used for letters etc. are different now for rotated and/or cropped pages
  • Fix to calculated positions of annotations
  • Fix to adding JPG files to output documents
  • Add height to Type 3 font bounding boxes and default width/height for zero values
  • CreationDate and ModifiedDate are now available in DocumentInformationBuilder
  • Images can be added to document builder without specifying placement rectangle, this will place the image at 0,0 with full width and height
  • PdfAction exposed by Annotation class. InReplyTo property also added
  • GetFields extension method for AcroForm type
  • Fix for internal links when using existing documents with annotations with PdfDocumentBuilder
  • Handle name conflicts when using PdfDocumentBuilder with one or more existing documents
  • Swaps internal uses of Rijndael and RijndaelManaged to Aes since these were marked as obsolete

PdfPig - Gloucestershire Old Spots

Published by EliotJones almost 2 years ago

Changes since 0.1.6:

  • Add page.SetRotation for PdfPageBuilder
  • Add SkipMissingFonts to parsing options to ignore content where the font is not present or corrupt. Can result in content being missed during extraction but will enable partial extraction of retrievable content on page for corrupted files.
  • Multiple bug fixes thanks to @fnatzke
  • Fix to page number order bug on extraction thanks to @grinay
  • Various shape drawing utilities on PdfPageBuilder thanks to @Jonowa
  • Fix to issue in GrahamScan thanks to @BobLd
  • Remove stray Debugger.Break from the encryption handler
  • Various other bug fixes
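
As a rough illustration of the SkipMissingFonts option listed above, the sketch below opens a file with the option enabled and reports how much text was recovered per page. The file name is a placeholder, and what gets skipped depends on the document.

```csharp
using System;
using UglyToad.PdfPig;

public static class Program
{
    public static void Main()
    {
        // Ignore content whose font is missing or corrupt instead of failing.
        var options = new ParsingOptions { SkipMissingFonts = true };

        // "damaged.pdf" is a placeholder path.
        using (var document = PdfDocument.Open("damaged.pdf", options))
        {
            foreach (var page in document.GetPages())
            {
                // Some content may be absent from page.Text when fonts were skipped.
                Console.WriteLine($"Page {page.Number}: {page.Text.Length} characters");
            }
        }
    }
}
```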

PdfPig - Australian Yorkshire

Published by EliotJones over 2 years ago

Mainly bug fixes. There are some compatibility changes in the document layout analysis API. See here: https://github.com/UglyToad/PdfPig/wiki/Migration-to-0.1.6

  • Fix transparency being applied for PDF/A-1
  • Fixes to string handling
  • .NET 6.0 support
  • Handle null rather than missing encryption data
  • Fixes bug with size of JPG files in documents created by PdfPig
  • Better handling for unusual Type1 fonts
  • Support for invisible/hidden text in document builder
  • Fixes stack overflow when parsing page tree for some documents
  • Fixes bug in some glyph bounding boxes for Type2 fonts
  • Handle non-contiguous xref ranges when building a document
  • Better location of version headers for non-compliant documents

PdfPig - Finnish Landrace

Published by EliotJones about 3 years ago

PdfPig - 0.1.5 Second Alpha

Published by EliotJones over 3 years ago

Some more bug-fixes:

  • Fix for object streams in files which require brute force searching.
  • Handle NullToken presence when creating documents.
  • Support for PDFs where the filters are defined as indirect references (against specification).
  • Support for CMYK when generating PNG images from IPdfImage.
  • Support for indexed ColorSpaces where palette is stored in a string.
  • Handle UTF16 strings in encrypted document dictionaries.
  • Handle documents with an XMP metadata stream instead of an information dictionary.
  • CCITTFaxDecode filter support.
  • Tweaks to DefaultWordExtractor to try to detect word gap size based on preceding text instead of a global gap threshold.

Note that changes to DefaultWordExtractor may change the output of calls to Page.GetWords() in this version.
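
To see whether the extractor change affects your documents, it can help to dump the words for a page and compare the output between versions. A minimal sketch, assuming a placeholder file name:

```csharp
using System;
using UglyToad.PdfPig;

public static class Program
{
    public static void Main()
    {
        // "input.pdf" is a placeholder path.
        using (var document = PdfDocument.Open("input.pdf"))
        {
            foreach (var page in document.GetPages())
            {
                // GetWords() with no arguments uses the default word extractor.
                foreach (var word in page.GetWords())
                {
                    Console.WriteLine($"{page.Number}\t{word.Text}\t{word.BoundingBox}");
                }
            }
        }
    }
}
```

Saving this output under each library version and diffing the two files shows where word boundaries moved.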

PdfPig - 0.1.5 First Alpha

Published by EliotJones over 3 years ago

First alpha version of 0.1.5

  • Fix glyph bounding boxes and paths for Type1 fonts using flexpoints.
  • Fix stack overflow when merging some documents.
  • Support loading existing documents into PdfDocumentBuilder.
  • Performance improvements for multithreaded scenarios.
  • Fix checked value for AcroForm checkboxes where the checked state is appearance only.
  • New page.GetOptionalContents() method with partial support for optional content retrieval.
  • Partial support for colorspace details on IPdfImages.
  • Multiple bug-fixes for various font related issues.

Breaking changes:

  • PdfDocumentBuilder now implements IDisposable. Disposing it disposes the underlying stream by default, but since this is normally a MemoryStream there are no serious consequences if it is left undisposed.
  • PdfPageBuilder had the AdvancedEditing property removed. The API is now available in the ContentStream methods / properties (this was from #250).

PdfPig - British Lop

Published by EliotJones almost 4 years ago

  • Adds support for filling rectangles when using PdfDocumentBuilder. The DrawRectangle method now takes an optional boolean parameter, fill.
  • Fix bug recognising Standard 14 fonts with Arial MT naming.
  • Handle unusual object streams containing endobj tokens.
  • Support broken Differences arrays for encodings.
  • Support very long xref streams by making infinite loop detection more relaxed.
  • Fix issue with parsing Type0 fonts that are using indirect references.
  • Internal structure changes to support pdf to image work.

PdfPig - Göttingen Minipig

Published by EliotJones almost 4 years ago

  • Fixes a set of bugs for font handling and PDF parsing.
  • Improves font detection on Linux systems
  • Improves calculation of PointSize for letters accounting for rotation and other transformations
  • Improves document layout analysis results in some cases
  • Fixes writing UTF strings when using document builder
  • Improvements to PDF graphics path API

PdfPig - 0.1.3 First Alpha

Published by EliotJones about 4 years ago

First alpha version of 0.1.3

PdfPig - Happy Hog

Published by EliotJones over 4 years ago

Some new features, performance tweaks and improved Document Layout Analysis tools:

  • PDF/A compliance for PdfDocumentBuilder, use PdfDocumentBuilder.ArchiveStandard to select a PDF/A compliance level.
  • Performance improvements to parsing.
  • Clipping support for PdfPaths, now PdfSubpath. Use ParsingOptions.ClipPaths to enable clipping.
  • SVG Exporter in Document Layout Analysis
  • Improvements to Recursive XY Cut algorithm in Document Layout Analysis.
  • Fixes to PDF Merging to support more use-cases. Use PdfMerger.Merge to generate merged PDFs.
  • Proper support for letters and paths in rotated PDF documents, previous locations were incorrect when the page dictionary contained a rotation value.
  • Better support for guessing point size for letters.
  • ContentTextOrderExtractor in Document Layout Analysis uses the existing content order of text from the page's content stream to generate text as a string.
  • IPdfImage now supports TryGetBytes() instead of Bytes. TryGetBytes returns false for JPXDecode and DCTDecode image filters for which RawBytes represent a valid JPEG image.
  • Font flags such as bold and italic available on Letter.
  • Bugfix for CID fonts.
  • TextDirection is now TextOrientation, various fixes to the calculations of orientation and bounding box for Words.
  • Most Document Layout Analysis algorithms now take in a DlaOptions parameter to specify behaviour.
  • Bugfix to files with large amounts of trailing data.
  • Support for OpenType in CID fonts.
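
The PdfMerger.Merge entry above can be tried with a sketch like this one. The three file names are placeholders; the call shown is the overload that takes file paths and returns the merged document as bytes.

```csharp
using System.IO;
using UglyToad.PdfPig.Writer;

public static class Program
{
    public static void Main()
    {
        // "first.pdf" and "second.pdf" are placeholder input paths.
        byte[] merged = PdfMerger.Merge("first.pdf", "second.pdf");

        // Write the merged document to a placeholder output path.
        File.WriteAllBytes("merged.pdf", merged);
    }
}
```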

PdfPig - 0.1.2 Third Alpha

Published by EliotJones over 4 years ago

  • Many updates to document layout analysis algorithms
  • Bugfix for files with a large number of non-data trailing bytes
  • Bugfix for OpenType fonts
  • Paths and glyphs are now correctly rotated when the page itself has a rotation value

PdfPig - 0.1.2 Second Alpha

Published by EliotJones over 4 years ago

Adds letter font details and a couple of other bugfixes to the alpha version.

PdfPig - 0.1.2 First Alpha

Published by EliotJones over 4 years ago

First alpha version of 0.1.2

PdfPig - Cows That Move Backwards And Forwards

Published by EliotJones over 4 years ago

Many bug fixes for a whole range of document types. In addition:

  • Add support for JPG images in PdfDocumentBuilder using page.AddJpeg().
  • Access to marked content using page.GetMarkedContents()
  • Early access to PDF merging using PdfMerger.Merge()
  • Adds Doc-Comments back to the package.
  • Improvements to NearestNeighbourWordExtractor and other Document Layout Analysis classes to support rotated text.

PdfPig - 0.1.1 First Alpha

Published by EliotJones over 4 years ago

A whole bunch of bug fixes and other changes.

PdfPig - And It Comes Out As MIWK

Published by EliotJones almost 5 years ago

This version focuses on improving performance.

To enable this it replaces decimals with doubles for most of the public API. It also reorganizes the code internally to support access to font related classes.

For this reason consumers will need to update their code; see the migration guide on the wiki.

Other features:

  • Access to hyperlinks provides a convenience wrapper for retrieving annotations of type Link and their text content and destination. Use page.GetHyperlinks().
  • Bug fixes for glyph positions.
  • Access to the embedded files in the document. Use document.Advanced.TryGetEmbeddedFiles(out IReadOnlyList<EmbeddedFile> files).
  • Ability to provide a list of passwords to try when opening encrypted documents. Use ParsingOptions.Passwords to provide the list of passwords. Any password set in ParsingOptions.Password will be included in the list of passwords.
  • Many bug fixes for different documents.
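
The password list, hyperlink and embedded-file features above might be combined as follows. This is a sketch under assumptions: the file name and passwords are placeholders, and the Text, Uri and Name properties are used only for display.

```csharp
using System;
using System.Collections.Generic;
using UglyToad.PdfPig;

public static class Program
{
    public static void Main()
    {
        // Each password is tried in turn; both values here are placeholders.
        var options = new ParsingOptions
        {
            Passwords = new List<string> { "first-guess", "second-guess" }
        };

        // "encrypted.pdf" is a placeholder path.
        using (var document = PdfDocument.Open("encrypted.pdf", options))
        {
            foreach (var page in document.GetPages())
            {
                // Link annotations with their text content and destination.
                foreach (var link in page.GetHyperlinks())
                {
                    Console.WriteLine($"Page {page.Number}: {link.Text} -> {link.Uri}");
                }
            }

            // Returns false when the document has no embedded files.
            if (document.Advanced.TryGetEmbeddedFiles(out var files))
            {
                foreach (var file in files)
                {
                    Console.WriteLine($"Embedded file: {file.Name}");
                }
            }
        }
    }
}
```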

PdfPig - 0.1.0 Second Beta

Published by EliotJones almost 5 years ago

Updates the 0.1.0 beta version with many bug fixes.

PdfPig - 0.1.0 First Beta

Published by EliotJones almost 5 years ago

First release which moves internal numerics from decimal to double where appropriate.

Reorganises internal project structure.

See migration details in the wiki: https://github.com/UglyToad/PdfPig/wiki/Migration-0.0.X-to-0.1.0

PdfPig - Farms With Fields Which Cross The Border

Published by EliotJones almost 5 years ago

This release fixes a major performance regression in 0.0.10.

It also adds bug-fixes for several new issues as well as additional methods for the geometry objects PdfPath, PdfLine and PdfRectangle.

PdfPig - Mixed Together With Whiskey

Published by EliotJones almost 5 years ago

This release adds two main new features:

  • Access to form elements (AcroForms) such as text input, checkboxes, radio-buttons, etc. Use document.TryGetForm(out AcroForm form) to get the form for the document if it contains one.
  • Access to bookmarks which define the document structure by linking to chapters, etc. Use document.TryGetBookmarks(out Bookmarks bookmarks) to get the document's bookmarks tree if it contains one.
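
Both TryGet methods above return false when the document lacks the feature, so they fit naturally in an if statement. A minimal sketch, assuming a placeholder file name and using Fields, GetNodes(), Level and Title only to print a summary:

```csharp
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.AcroForms;
using UglyToad.PdfPig.Outline;

public static class Program
{
    public static void Main()
    {
        // "input.pdf" is a placeholder path.
        using (var document = PdfDocument.Open("input.pdf"))
        {
            if (document.TryGetForm(out AcroForm form))
            {
                Console.WriteLine($"Form with {form.Fields.Count} top-level fields");
            }

            if (document.TryGetBookmarks(out Bookmarks bookmarks))
            {
                // Indent each bookmark title by its depth in the tree.
                foreach (var node in bookmarks.GetNodes())
                {
                    Console.WriteLine($"{new string(' ', node.Level * 2)}{node.Title}");
                }
            }
        }
    }
}
```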

It also aims to improve performance for most content retrieval operations, with up to double the speed for the smallest documents.

It also adds bug-fixes, structure analysis tools and small improvements:

  • Adds document.GetPages() as a convenience method to enumerate all pages in a document.
  • Adds hOcr, AltoXml and PageXml format exporters to export the page content to standardized formats which can be used in other tools. These exporters implement the ITextExporter interface and are used to export each page to a compatible string.
  • Improves support for retrieving images from a page. The new page.GetImages() method enumerates all images on a page, images are either InlineImages or XObjectImages.
  • Adds support for extracting text which is defined in XObject forms (distinct from AcroForms) which was previously skipped, meaning text could have been missing from the page.Text on certain document types.
  • Adds support for vertical writing mode fonts (Japanese, etc).
  • Additional bug fixes.
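
The GetPages() and GetImages() additions above can be exercised together. The sketch below assumes a placeholder file name and prints the sample dimensions of each image, whether inline or XObject.

```csharp
using System;
using UglyToad.PdfPig;

public static class Program
{
    public static void Main()
    {
        // "input.pdf" is a placeholder path.
        using (var document = PdfDocument.Open("input.pdf"))
        {
            foreach (var page in document.GetPages())
            {
                // Enumerates both inline images and XObject images on the page.
                foreach (var image in page.GetImages())
                {
                    Console.WriteLine(
                        $"Page {page.Number}: {image.WidthInSamples}x{image.HeightInSamples} image");
                }
            }
        }
    }
}
```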