
AI Parse PDF Invoice to JSON

Parse any invoice PDF into a structured InvoiceDocument JSON using AI. The endpoint accepts native PDFs, scanned documents, and image-based PDFs, then runs OCR, semantic field extraction, and EN 16931 mapping. Any embedded XML in the PDF is ignored. Use /v1/extract/json if you want the embedded XML parsed instead.

POST /v1/parse/json

Code Example

curl -X POST https://api.invoicexml.com/v1/parse/json \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@invoice.pdf"

Try it out online, no coding required

Upload any invoice and download the structured JSON instantly, right in your browser.

Request

Parameter Type Description
file * binary The PDF invoice file to convert.
strict boolean Defaults to false. When true, validation warnings are treated as errors: the request is rejected if any warning is raised, not only when an error is raised.

Content-Type: multipart/form-data

Headers

Header Value
Authorization * Bearer YOUR_API_KEY
Content-Type multipart/form-data
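
The request described above can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the function name build_parse_request is hypothetical, and sending strict as a form field with the literal value "true" is an assumption, since the table does not say how the boolean is encoded.

```python
import uuid
import urllib.request

API_URL = "https://api.invoicexml.com/v1/parse/json"


def build_parse_request(pdf_bytes, filename, api_key, strict=False):
    """Build (but do not send) a multipart/form-data POST for /v1/parse/json."""
    boundary = uuid.uuid4().hex
    body = b""

    # Optional "strict" field. Encoding it as the string "true" is an assumption.
    if strict:
        header = (
            f"--{boundary}\r\n"
            'Content-Disposition: form-data; name="strict"\r\n\r\n'
        )
        body += header.encode("utf-8") + b"true\r\n"

    # Required "file" field carrying the PDF bytes.
    header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    )
    body += header.encode("utf-8") + pdf_bytes + b"\r\n"

    # Closing boundary ends the multipart body.
    body += f"--{boundary}--\r\n".encode("utf-8")

    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to urllib.request.urlopen sends the request. Building it separately keeps the multipart encoding easy to inspect before any network call is made.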

Response

200 Extracted JSON

Returns the parsed invoice data as a structured JSON object.

Content-Type: application/json

The response filename is derived from the uploaded file: {original-name}.json. The Content-Disposition header is set to attachment for direct download.

The root object contains an invoice key with all extracted fields and a top-level validationIssues array listing any discrepancies found during the validation passes. An empty array means the invoice passed all checks.
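
A client might separate the two top-level keys like this. The sketch relies only on the invoice and validationIssues keys described above; the function name split_parse_response is hypothetical, and the entries of validationIssues are treated as opaque because their shape is not specified here.

```python
import json


def split_parse_response(body_text):
    """Return (invoice, issues, passed) from a /v1/parse/json response body."""
    payload = json.loads(body_text)
    invoice = payload["invoice"]
    # An empty (or absent) array is treated as "passed all checks".
    issues = payload.get("validationIssues", [])
    return invoice, issues, len(issues) == 0


# Illustrative body, not real API output.
sample = '{"invoice": {"invoiceNumber": "INV-1"}, "validationIssues": []}'
invoice, issues, passed = split_parse_response(sample)
```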

How the Parse Invoice JSON API Works

The API runs a four-stage pipeline on every uploaded document to produce a validated, integration-ready JSON object:

1. Document ingestion & OCR

The uploaded file is decoded and, if necessary, passed through an OCR engine. Native PDFs are parsed at the text layer; scanned PDFs, JPEG, PNG, TIFF, and WEBP files are processed via optical character recognition before any field extraction begins.

2. AI field extraction & semantic mapping

A large language model reads the full document and identifies every invoice field (seller, buyer, invoice number, date, line items, tax rates, payment terms, bank details) regardless of layout, language, or formatting. Fields are mapped to a normalised schema with a per-field confidence score.

3. Multi-layer validation & cross-referencing

Four validation passes run sequentially:

  • Schema: all mandatory EN 16931 fields present, correct data types, valid code-list values.
  • Tax numbers: VAT IDs and business registration numbers verified against country-specific format rules.
  • Addresses: seller and buyer addresses parsed into components and checked for internal consistency.
  • Arithmetic: unit price × quantity = line total; sum of line totals + tax amounts = invoice grand total.

Discrepancies are recorded in the validationIssues array rather than blocking the API response, so your application can decide how to handle borderline cases.
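
The arithmetic pass can be reproduced on the client side if you want to re-check totals yourself. The following is a sketch of the two equalities listed above, not the service's own implementation: the keys unitPrice, quantity, and lineTotal, the function name, and the 0.01 rounding tolerance are all assumptions made for illustration.

```python
from decimal import Decimal


def check_arithmetic(lines, tax_total, grand_total, tolerance=Decimal("0.01")):
    """Return a list of human-readable arithmetic discrepancies."""
    problems = []
    net_sum = Decimal("0")

    for index, line in enumerate(lines):
        # Check: unit price x quantity = line total.
        expected = Decimal(line["unitPrice"]) * Decimal(line["quantity"])
        actual = Decimal(line["lineTotal"])
        if abs(expected - actual) > tolerance:
            problems.append(f"line {index}: expected {expected}, found {actual}")
        net_sum += actual

    # Check: sum of line totals + tax amounts = invoice grand total.
    if abs(net_sum + Decimal(tax_total) - Decimal(grand_total)) > tolerance:
        problems.append(
            f"grand total: expected {net_sum + Decimal(tax_total)}, found {grand_total}"
        )
    return problems


# Hypothetical invoice lines with amounts given as strings.
lines = [
    {"unitPrice": "10.00", "quantity": "3", "lineTotal": "30.00"},
    {"unitPrice": "5.50", "quantity": "2", "lineTotal": "11.00"},
]
consistent = check_arithmetic(lines, "7.79", "48.79")
inconsistent = check_arithmetic(lines, "7.79", "50.00")
```

Decimal is used instead of float so that currency amounts such as 0.10 compare exactly.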

4. Structured JSON delivery

The validated invoice object is serialised as application/json and returned with Content-Disposition: attachment. Field names are consistent across all source documents, languages, and invoice formats, so no post-processing is required before API integration.

Frequently Asked Questions

Does this work with scanned or photographed invoices?

Yes, that is the primary use case. The AI reads the document like a human, recognising fields regardless of layout, orientation, scan quality, or language. It handles thermal-printed receipts, faxed documents, and smartphone photos of paper invoices.

How does this differ from /v1/extract/json?

/v1/extract/json deterministically parses embedded or uploaded XML. /v1/parse/json runs the AI pipeline on the PDF itself and ignores any embedded XML. Use parse when the PDF has no XML or when you want a fresh AI read regardless of what is embedded.

What does the JSON output contain?

The response is the BT-first InvoiceDocument: invoiceNumber, issueDate, currency, seller, buyer, paymentDetails, lines, totals, vatBreakdowns, and the rest of the EN 16931 model. It has the same shape as the /v1/extract/json output, so you can swap the endpoints without changing your client.

What file formats are accepted?

PDF only (native and scanned). Maximum file size is 20 MB.
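
A client can check both limits locally before uploading. This is a sketch under stated assumptions: "20 MB" is read as 20 × 1024 × 1024 bytes, which may differ from how the service counts, and the function name preflight is hypothetical. The PDF test only looks for the standard %PDF- marker near the start of the file.

```python
MAX_BYTES = 20 * 1024 * 1024  # assumption: "20 MB" interpreted as 20 MiB


def preflight(pdf_bytes):
    """Return None if the upload looks acceptable, else a short reason."""
    if len(pdf_bytes) > MAX_BYTES:
        return "file exceeds 20 MB"
    # PDF files carry a "%PDF-" header; allow a little leading junk.
    if b"%PDF-" not in pdf_bytes[:1024]:
        return "file does not look like a PDF"
    return None
```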

What happens if the document is not an invoice?

The AI classification step returns a 400 with errorCode 4008 (NotAnInvoice). If the PDF contains multiple invoices, errorCode 4009 (MultipleInvoices) is returned.
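
Client code might map these two codes to their names as follows. The layout of the error body is not documented on this page, so the sketch assumes a JSON object with a top-level errorCode key; the function name describe_error is hypothetical.

```python
import json

KNOWN_CODES = {4008: "NotAnInvoice", 4009: "MultipleInvoices"}


def describe_error(status, body_text):
    """Return the error name for a documented 400 response, else None."""
    if status != 400:
        return None
    try:
        # Assumption: the body is a JSON object with a top-level "errorCode".
        code = int(json.loads(body_text)["errorCode"])
    except (ValueError, KeyError, TypeError):
        return None
    return KNOWN_CODES.get(code)
```

Returning None for anything unrecognised lets the caller fall back to generic error handling.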