Invoice JSON Extraction API Reference

Read a structured XML invoice and return it as an InvoiceDocument JSON. The endpoint accepts either a PDF (whose embedded CII / UBL XML attachment is used) or a standalone XML file. Pure XML parsing, no AI. Use /v1/parse/json if your PDF has no embedded XML and you need the data extracted with AI.

POST /v1/extract/json

Code Example

curl -X POST https://api.invoicexml.com/v1/extract/json \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@invoice.pdf"

Request

Parameter   Type     Description
file *      binary   The invoice file to process.

Content-Type: multipart/form-data

The source and target formats are part of the endpoint path, and everything else (syntax, declared profile, specification identifier) is read from the document itself, so there is nothing more to configure.

Headers

Header            Value
Authorization *   Bearer YOUR_API_KEY
Content-Type      multipart/form-data
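
For callers who are not using curl, the request described above can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the function name build_extract_request and the application/octet-stream part type are assumptions made for illustration, while the URL, the file form field, and the two headers come from this page.

```python
import uuid
import urllib.request

API_URL = "https://api.invoicexml.com/v1/extract/json"  # endpoint from this page


def build_extract_request(file_bytes: bytes, filename: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the multipart/form-data POST for one invoice file."""
    boundary = uuid.uuid4().hex  # random boundary, unlikely to occur inside the file
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"  # assumption: generic binary part type
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=head + file_bytes + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to urllib.request.urlopen would send it; building and sending are kept separate here so the request can be inspected first.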

Response

200 Extracted JSON

Returns the parsed invoice data as a structured JSON object.

Content-Type: application/json

The response filename is derived from the uploaded file: {original-name}.json. The Content-Disposition header is set to attachment for direct download.

The root object contains an invoice key with all extracted fields and a top-level validationIssues array listing any discrepancies found during the validation passes. An empty array means the invoice passed all checks.
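
A short sketch of how a client might separate the two top-level keys described above. The helper name split_response is hypothetical, and the entries of validationIssues are treated as opaque because their shape is not documented here.

```python
import json


def split_response(raw: str):
    """Return (invoice, issues, passed) from the response body text."""
    payload = json.loads(raw)
    invoice = payload["invoice"]                  # all extracted fields
    issues = payload.get("validationIssues", [])  # empty list means all checks passed
    return invoice, issues, len(issues) == 0
```

For example, a body of {"invoice": {"invoiceNumber": "INV-1"}, "validationIssues": []} yields an invoice dict, an empty issue list, and passed set to True.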

How the Parse Invoice JSON API Works

The API runs a four-stage pipeline on every uploaded document to guarantee a validated, integration-ready JSON object:

1. Document ingestion & OCR

The uploaded file is decoded and, if necessary, passed through an OCR engine. Native PDFs are parsed at the text layer; scanned PDFs, JPEG, PNG, TIFF, and WEBP files are processed via optical character recognition before any field extraction begins.

2. AI field extraction & semantic mapping

A large language model reads the full document and identifies every invoice field (seller, buyer, invoice number, date, line items, tax rates, payment terms, bank details) regardless of layout, language, or formatting. Fields are mapped to a normalised schema with a per-field confidence score.

3. Multi-layer validation & cross-referencing

Four validation passes run sequentially:

  • Schema: all mandatory EN 16931 fields present, correct data types, valid code-list values.
  • Tax numbers: VAT IDs and business registration numbers verified against country-specific format rules.
  • Addresses: seller and buyer addresses parsed into components and checked for internal consistency.
  • Arithmetic: unit price × quantity = line total; sum of line totals + tax amounts = invoice grand total.

Discrepancies are recorded in the validationIssues array rather than blocking the API response, so your application can decide how to handle borderline cases.
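
The arithmetic pass can be illustrated with a small sketch that collects discrepancies instead of raising, mirroring the non-blocking behaviour described above. This is not the service's implementation: the function name and the keys unitPrice, quantity, and lineTotal are hypothetical, and the comparison is exact, whereas a production check would likely allow a rounding tolerance.

```python
from decimal import Decimal


def arithmetic_issues(lines, tax_total, grand_total):
    """Check unit price x quantity per line, then line totals + tax against the grand total.

    Amounts are passed as strings and converted to Decimal to avoid float rounding.
    """
    issues = []
    net_sum = Decimal("0")
    for i, line in enumerate(lines, start=1):
        expected = Decimal(line["unitPrice"]) * Decimal(line["quantity"])
        actual = Decimal(line["lineTotal"])
        if expected != actual:
            issues.append(f"line {i}: expected {expected}, found {actual}")
        net_sum += actual
    if net_sum + Decimal(tax_total) != Decimal(grand_total):
        issues.append(f"totals: {net_sum} + {tax_total} does not equal {grand_total}")
    return issues  # recorded, not raised, so the caller decides what to do
```

An empty return value corresponds to an invoice that passes this one check; the other three passes are not modelled here.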

4. Structured JSON delivery

The validated invoice object is serialised as application/json and returned with Content-Disposition: attachment. Field names are consistent across all source documents, languages, and invoice formats, so no post-processing is required before API integration.

Frequently Asked Questions

What file formats are accepted?

A PDF containing embedded CII or UBL XML (Factur-X, ZUGFeRD, or Peppol PDF/A-3), or a standalone XML file (CII D16B / UBL 2.1). Maximum file size is 20 MB.
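
A client can catch obvious mismatches before uploading. The sketch below is a convenience check under stated assumptions, not part of the API: the helper name preflight is hypothetical, "20 MB" is read as 20 × 1024 × 1024 bytes, and the file type is guessed from the extension only, so it cannot tell whether a PDF actually carries embedded XML.

```python
import os

MAX_BYTES = 20 * 1024 * 1024  # assumption: "20 MB" interpreted as 20 MiB
ALLOWED_EXTENSIONS = {".pdf", ".xml"}


def preflight(filename: str, size_bytes: int) -> list[str]:
    """Return a list of human-readable problems; an empty list means nothing obvious is wrong."""
    problems = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, above the {MAX_BYTES}-byte limit")
    return problems
```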

What happens if the PDF has no embedded XML?

The API returns a 400 response with errorCode 4006 (NoEmbeddedXml). For PDFs without an embedded XML attachment (typed, scanned, or photographed invoices), use POST /v1/parse/json instead, which runs the AI extraction pipeline.
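
The fallback described above could be wired up roughly as follows. This is a sketch under assumptions: the function name next_endpoint is hypothetical, and errorCode is assumed to be a top-level key of a JSON error body (the page names the code but does not show the body's shape).

```python
import json

NO_EMBEDDED_XML = "4006"  # errorCode named on this page


def next_endpoint(status: int, body: str):
    """Return the endpoint to retry against, or None when no fallback applies."""
    if status != 400:
        return None
    try:
        error = json.loads(body)
    except json.JSONDecodeError:
        return None
    # Compare as text so both 4006 and "4006" match; the wire type is not documented here.
    if isinstance(error, dict) and str(error.get("errorCode")) == NO_EMBEDDED_XML:
        return "/v1/parse/json"
    return None
```

Any other 400 response returns None, so unrelated client errors are not silently retried against the AI endpoint.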

Is the XML validated before parsing?

The XML is parsed against the CII or UBL schema, but EN 16931 Schematron rules are not checked here. For a full validation pass, use POST /v1/validate/{format} on the same input.

What does the JSON output contain?

The BT-first InvoiceDocument: invoiceNumber, issueDate, currency, seller, buyer, paymentDetails, lines, totals, vatBreakdowns, and the rest of the EN 16931 model. Field names mirror the request bodies accepted by /v1/create/*.

How does this differ from /v1/parse/json?

/v1/extract/json is deterministic XML parsing. It does not call the AI model and produces an exact mapping of the source XML. /v1/parse/json is AI-driven extraction from PDFs that do not have embedded XML.