/v1/extract/json
Extract Invoice Data to JSON
Parse any invoice document into a validated, integration-ready JSON object. The endpoint accepts native PDFs, scanned documents, and image files, then runs a four-stage AI pipeline: OCR, semantic field extraction, multi-layer validation (schema, tax numbers, addresses, line-item arithmetic), and normalised JSON delivery. Works with messy, low-quality, and multi-language invoices where template-based parsers fail.
How It Works
Every uploaded document passes through the same four stages before the JSON object is returned:
Document ingestion & OCR
The uploaded file is decoded and, if necessary, passed through an OCR engine. Native PDFs are parsed at the text layer; scanned PDFs, JPEG, PNG, TIFF, and WEBP files are processed via optical character recognition before any field extraction begins.
AI field extraction & semantic mapping
A large-language model reads the full document and identifies every invoice field — seller, buyer, invoice number, date, line items, tax rates, payment terms, bank details — regardless of layout, language, or formatting. Fields are mapped to a normalised schema with a per-field confidence score.
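As an illustration of how the per-field confidence scores might be consumed, the sketch below flags fields that fall under a review threshold. It assumes the confidenceScores map (named in the FAQ further down) uses field names as keys and floats between 0 and 1 as values. That shape, the 0.85 threshold, and the helper name are assumptions rather than documented behaviour.

```python
from typing import Mapping


def fields_needing_review(
    confidence_scores: Mapping[str, float], threshold: float = 0.85
) -> list[str]:
    """Return field names whose confidence is below the threshold, lowest first."""
    low = [(score, name) for name, score in confidence_scores.items() if score < threshold]
    return [name for score, name in sorted(low)]


# Hypothetical scores, for illustration only.
example_scores = {"invoiceNumber": 0.99, "issueDate": 0.97, "dueDate": 0.62, "currency": 0.80}
review_queue = fields_needing_review(example_scores)
print(review_queue)  # ['dueDate', 'currency']
```

A queue like this could feed a manual review step while high-confidence fields flow straight into the integration.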
Multi-layer validation & cross-referencing
Four validation passes run sequentially:
- Schema: all mandatory EN 16931 fields present, correct data types, valid code-list values.
- Tax numbers: VAT IDs and business registration numbers verified against country-specific format rules.
- Addresses: seller and buyer addresses parsed into components and checked for internal consistency.
- Arithmetic: unit price × quantity = line total; sum of line totals + tax amounts = invoice grand total.
Discrepancies are recorded in the validationIssues array rather than blocking the response, so your application can decide how to handle borderline cases.
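The arithmetic pass can be reproduced on the client side, for example to re-check an invoice after a manual correction. The following is a minimal sketch of the two equations listed above, not the service's own implementation. The field names unitPrice, quantity, and lineTotal come from the documented output; the function names, the 0.01 rounding tolerance, and the way tax and grand total are passed in are assumptions.

```python
from decimal import Decimal


def check_line_arithmetic(line_items, tolerance=Decimal("0.01")):
    """Return a message for each line where unitPrice x quantity differs from lineTotal."""
    issues = []
    for index, item in enumerate(line_items):
        expected = Decimal(str(item["unitPrice"])) * Decimal(str(item["quantity"]))
        actual = Decimal(str(item["lineTotal"]))
        if abs(expected - actual) > tolerance:
            issues.append(f"line {index}: expected {expected}, found {actual}")
    return issues


def check_grand_total(line_items, tax_total, grand_total, tolerance=Decimal("0.01")):
    """Check that the sum of line totals plus tax equals the grand total."""
    net = sum((Decimal(str(item["lineTotal"])) for item in line_items), Decimal("0"))
    return abs(net + Decimal(str(tax_total)) - Decimal(str(grand_total))) <= tolerance


# Hypothetical line items; the second line total is deliberately wrong.
items = [
    {"unitPrice": "19.99", "quantity": "3", "lineTotal": "59.97"},
    {"unitPrice": "5.00", "quantity": "2", "lineTotal": "10.50"},
]
print(check_line_arithmetic(items))  # ['line 1: expected 10.00, found 10.50']
```

Decimal is used instead of float so that currency amounts such as 19.99 are compared exactly rather than through binary rounding.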
Structured JSON delivery
The validated invoice object is serialised as application/json and returned with Content-Disposition: attachment. Field names are consistent across all source documents, languages, and invoice formats — no post-processing required before integration.
Request
| Parameter | Type | Description |
|---|---|---|
| file * | binary | The invoice file to process: PDF (native or scanned), JPEG, PNG, TIFF, or WEBP. |
Content-Type: multipart/form-data
Headers
| Header | Value |
|---|---|
| Authorization * | Bearer YOUR_API_KEY |
| Content-Type | multipart/form-data |
Response
The response filename is derived from the uploaded file: {original-name}.json.
The Content-Disposition header is set to attachment for direct download.
The root object contains an invoice key with all extracted fields and a top-level validationIssues array listing any discrepancies found during the validation passes. An empty array means the invoice passed all checks.
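A client will typically separate the invoice data from the validation result before deciding what to do next. The sketch below shows one way to do that with the two documented root keys. The helper name is hypothetical, and because the shape of individual validationIssues entries is not described here, they are passed through untouched.

```python
import json


def split_result(body: str):
    """Parse a response body into (invoice, issues, passed_all_checks)."""
    root = json.loads(body)
    invoice = root["invoice"]
    issues = root.get("validationIssues", [])
    # An empty validationIssues array means every validation pass succeeded.
    return invoice, issues, len(issues) == 0


# Hypothetical, heavily shortened response body.
sample_body = '{"invoice": {"invoiceNumber": "INV-001"}, "validationIssues": []}'
invoice, issues, passed = split_result(sample_body)
print(invoice["invoiceNumber"], passed)
```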
Code Example
curl -X POST https://api.invoicexml.com/v1/extract/json \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@invoice.pdf"
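The same request can be sketched in Python using only the standard library. The function below builds the multipart/form-data body by hand and returns a request object without sending it. The helper name, the invoice.pdf filename, and the application/octet-stream part type are assumptions; the URL, the file form field, and the Authorization header are taken from this page.

```python
import urllib.request
import uuid

API_URL = "https://api.invoicexml.com/v1/extract/json"


def build_extract_request(file_bytes: bytes, filename: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST with a single `file` part."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"  # assumed part type
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=head + file_bytes + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Placeholder bytes stand in for a real file read with open(path, "rb").
request = build_extract_request(b"%PDF-1.7 ...", "invoice.pdf", "YOUR_API_KEY")
# To send it: urllib.request.urlopen(request).read()
```

Building the request separately from sending it makes the headers and body easy to inspect before any network call is made.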
Frequently Asked Questions
Does this work with scanned or photographed invoices?
Yes, that is the primary use case. Scanned and photographed files go through OCR first, and the extraction model then identifies fields regardless of layout, orientation, scan quality, or language. This covers thermal-printed receipts, faxed documents, and smartphone photos of paper invoices.
What validation is applied to the extracted data?
Four passes run sequentially: (1) schema validation — all mandatory EN 16931 fields present and correctly typed; (2) VAT/tax number format verification against country-specific rules; (3) address plausibility checks on seller and buyer data; (4) full arithmetic verification — unit price × quantity = line total, sum of lines + tax = invoice grand total. Discrepancies are listed in the validationIssues array rather than blocking the response.
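To make "format verification against country-specific rules" concrete, here is a minimal sketch of a format-only VAT ID check for four EU countries. It is an illustration of the general technique, not the rule set the service uses, which is not documented here. It checks structure only and does not verify check digits or whether the number is actually registered. The function and table names are hypothetical.

```python
import re

# Structural patterns for a few EU member states; extend as needed.
VAT_PATTERNS = {
    "DE": re.compile(r"DE\d{9}"),
    "FR": re.compile(r"FR[0-9A-HJ-NP-Z]{2}\d{9}"),
    "IT": re.compile(r"IT\d{11}"),
    "NL": re.compile(r"NL\d{9}B\d{2}"),
}


def vat_format_ok(vat_id: str):
    """Return True/False for known countries, None when no pattern is registered."""
    # Strip spaces, dots, and hyphens, which often appear in printed VAT IDs.
    normalised = re.sub(r"[\s.\-]", "", vat_id).upper()
    pattern = VAT_PATTERNS.get(normalised[:2])
    if pattern is None:
        return None
    return pattern.fullmatch(normalised) is not None


print(vat_format_ok("DE 123 456 789"))  # True
print(vat_format_ok("DE12345678"))      # False (only 8 digits)
print(vat_format_ok("XX123"))           # None (no pattern for this prefix)
```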
What does the JSON output contain?
The root object has an invoice key with seller, buyer, invoiceNumber, issueDate, dueDate, currency, lineItems (description, quantity, unitPrice, lineTotal, taxRate, taxCategoryCode), taxBreakdown, paymentTerms, bankDetails, and a per-field confidenceScores map. A top-level validationIssues array lists any discrepancies found during validation.
How is this different from POST /v1/extract/xml?
The /v1/extract/xml endpoint extracts the embedded CII XML from a Factur-X or ZUGFeRD PDF, falling back to AI generation if no XML is found. The /v1/extract/json endpoint always uses the full AI pipeline and runs all four validation passes, returning a normalised JSON object. Use JSON for app integration and data pipelines; use XML when you need a standards-compliant e-invoice document.
What file formats are accepted?
PDF (native and scanned), JPEG, PNG, TIFF, and WEBP. Maximum file size is 20 MB. For Factur-X or ZUGFeRD PDFs, the embedded XML is used as the primary data source alongside the visual content, which improves accuracy further.
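A client can check both limits locally before uploading. The sketch below identifies the five accepted formats from their leading bytes and compares the size against the 20 MB cap. It reads "20 MB" as 20 × 1024 × 1024 bytes, which is an assumption, and the function names are hypothetical. How the API itself responds to an unsupported or oversized file is not covered here.

```python
MAX_BYTES = 20 * 1024 * 1024  # Assumption: "20 MB" read as 20 MiB.


def sniff_format(data: bytes):
    """Guess the file format from its leading bytes; None if unrecognised."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith((b"II*\x00", b"MM\x00*")):  # classic TIFF, either byte order
        return "tiff"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


def preflight(data: bytes):
    """Return a list of problems relative to the documented formats and size limit."""
    problems = []
    if sniff_format(data) is None:
        problems.append("unrecognised file format")
    if len(data) > MAX_BYTES:
        problems.append("file exceeds 20 MB")
    return problems


print(preflight(b"%PDF-1.7\n"))  # []
print(preflight(b"hello"))       # ['unrecognised file format']
```

Checking magic bytes rather than the file extension catches files that were renamed or exported with the wrong suffix.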