Extract and verify text from PDF with C#

Last Updated on by

Post summary: How to extract text from PDF in C#.

PDF verification is pretty rare case in automation testing. Still it could happen.

iTextSharp

iTextSharp is library that allows you to manipulate PDF files. We need very small of this library. It has build in reader that iterates through pages and returns only text.

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System.Text;

namespace PDFExtractor
{
	public class PDFExtractor
	{
		public static string ExtractTextFromPDF(string pdfFileName)
		{
			StringBuilder result = new StringBuilder();
			// Create a reader for the given PDF file
			using (PdfReader reader = new PdfReader(pdfFileName))
			{
				// Read pages
				for (int page = 1; page <= reader.NumberOfPages; page++)
				{
					SimpleTextExtractionStrategy strategy =
						new SimpleTextExtractionStrategy();
					string pageText =
						PdfTextExtractor.GetTextFromPage(reader, page, strategy);
					result.Append(pageText);
				}
			}
			return result.ToString();
		}
	}
}

Verification

Once extracted text can be verified against expected as described in Text verification post.

Read more...