POST /v1/extract/xml

Extract XML from PDF

Extract structured XML from a PDF invoice. If the PDF contains an embedded XML attachment (Factur-X, ZUGFeRD), it is extracted and returned directly. If no embedded XML is found, the API uses AI to extract invoice data from the PDF, builds an EN 16931 compliant CII XML document, validates it against Schematron business rules, and returns the result.

How It Works

The endpoint uses a two-step strategy to return structured XML whether or not the PDF already contains it:

1. Extract embedded XML

The API inspects the PDF for an embedded XML attachment (e.g. factur-x.xml or zugferd-invoice.xml). If found, the XML is extracted and returned immediately — no AI processing required.

2. Generate XML from invoice data

If no embedded XML is found, the API extracts invoice data from the PDF using AI, maps it to the EN 16931 semantic model, builds a CII (Cross Industry Invoice) XML document, validates it against Schematron business rules, and returns the validated XML.

Request

Parameter   Type     Description
file *      binary   The PDF invoice file to convert.

* marks a required field.

Content-Type: multipart/form-data

Headers

Header            Value
Authorization *   Bearer YOUR_API_KEY
Content-Type      multipart/form-data

Response

200 Returns the extracted or generated XML document as a file download.
Content-Type: application/xml

The response filename is derived from the uploaded file: {original-name}.xml. The Content-Disposition header is set to attachment for direct download.
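A client that saves the download needs a filename. The following is a minimal sketch that reads it from the Content-Disposition header and otherwise falls back to the documented {original-name}.xml rule. The function name `output_filename` is illustrative and not part of the API, and the exact header format the server sends is an assumption.

```python
from email.message import Message
from pathlib import Path


def output_filename(content_disposition, uploaded_name):
    """Pick a name for the saved XML file.

    Prefers the filename parameter of the Content-Disposition header;
    if the header is missing or carries no filename, derives the name
    from the uploaded PDF (invoice-2025.pdf -> invoice-2025.xml).
    """
    if content_disposition:
        # The stdlib email parser understands header parameters,
        # including quoted values.
        msg = Message()
        msg["Content-Disposition"] = content_disposition
        name = msg.get_filename()
        if name:
            # Keep only the final path component as a precaution.
            return Path(name).name
    return Path(uploaded_name).with_suffix(".xml").name
```

Calling `output_filename(None, "invoice-2025.pdf")` yields `invoice-2025.xml`, which matches the naming rule described above.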

Code Example

curl -X POST https://api.invoicexml.com/v1/extract/xml \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@invoice.pdf"
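The same request can be sketched in Python using only the standard library. The URL, the `file` field, and the bearer header come from the request description above. The helper names `build_multipart`, `build_request`, and `extract_xml` are illustrative, and sending the part as `application/pdf` is an assumption rather than a documented requirement.

```python
import os
import urllib.request
import uuid

API_URL = "https://api.invoicexml.com/v1/extract/xml"


def build_multipart(field_name, filename, payload, boundary=None):
    """Encode one file as a multipart/form-data body.

    Returns the body bytes and the matching Content-Type header value.
    """
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n"  # assumed part type
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    content_type = f"multipart/form-data; boundary={boundary}"
    return head + payload + tail, content_type


def build_request(pdf_path, api_key):
    """Prepare the POST request without sending it."""
    with open(pdf_path, "rb") as fh:
        payload = fh.read()
    body, content_type = build_multipart(
        "file", os.path.basename(pdf_path), payload
    )
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": content_type,
        },
    )


def extract_xml(pdf_path, api_key):
    """Send the PDF and return the XML response body as bytes."""
    with urllib.request.urlopen(build_request(pdf_path, api_key)) as resp:
        return resp.read()
```

A call such as `extract_xml("invoice.pdf", "YOUR_API_KEY")` would return the XML bytes on success. `urllib` raises `urllib.error.HTTPError` for non-2xx responses, so callers may want to catch it.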

Frequently Asked Questions

What happens if the PDF already has embedded XML?

The API detects embedded XML attachments (factur-x.xml, zugferd-invoice.xml, or similar) inside PDF/A-3 containers. When found, the embedded XML is returned immediately without any AI processing — this is the fastest path.

What if the PDF has no embedded XML?

The API falls back to AI-based extraction. It reads the invoice text and layout, maps the data to the EN 16931 semantic model, generates a CII (Cross Industry Invoice) XML document, and validates it against Schematron business rules before returning it.

What XML format is generated?

When generating XML from scratch, the API produces UN/CEFACT CII D16B — the same syntax used inside ZUGFeRD and Factur-X files. This is one of the two official syntaxes of the EN 16931 European standard.
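To confirm on the client side that a response is a CII document, one option is to inspect the root element. The sketch below assumes the standard CII namespace `urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100`; the function name `is_cii_invoice` is illustrative. Embedded XML that the API returns as-is may use a different root, in which case this check reports False.

```python
import xml.etree.ElementTree as ET

# Namespace of the UN/CEFACT Cross Industry Invoice root element.
CII_NS = "urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100"


def is_cii_invoice(xml_bytes):
    """Return True if the document root is rsm:CrossIndustryInvoice."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError:
        # Not well-formed XML at all.
        return False
    # ElementTree expands prefixes to {namespace}localname.
    return root.tag == f"{{{CII_NS}}}CrossIndustryInvoice"
```

This only checks the root element. It does not validate the document against the schema or the Schematron rules.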

Is the generated XML validated?

Yes. When XML is generated via AI extraction, it is validated against EN 16931 Schematron rules before delivery. If validation fails, the API returns a 400 response with the specific rule violations. Extracted embedded XML is returned as-is.
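Client code therefore has to distinguish a successful XML download from a rejected request. The sketch below sorts a response by status code and content type. The function name `interpret_response` and the labels it returns are illustrative, and because the format of the 400 body is not specified here, the body is passed through as plain text rather than parsed.

```python
def interpret_response(status, content_type, body):
    """Classify a response from the extract endpoint.

    Returns a (label, payload) tuple:
      ("xml", bytes)       for a 200 response carrying application/xml
      ("rejected", str)    for a 400 response, e.g. rule violations
      ("unexpected", str)  for anything else
    """
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if status == 200 and media_type == "application/xml":
        return "xml", body
    if status == 400:
        # Body format is undocumented here, so keep it as text.
        return "rejected", body.decode("utf-8", errors="replace")
    return "unexpected", body.decode("utf-8", errors="replace")
```

With `urllib`, a 400 surfaces as `urllib.error.HTTPError`; its `code` attribute and `read()` method supply the status and body for this function.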

Can I use this to check if a PDF has embedded XML?

Yes. If the PDF contains embedded XML, the response is near-instant. If it falls back to AI generation, the response takes longer. You can use the response time as an indirect indicator, or compare the output against the original PDF metadata.
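The response-time heuristic can be sketched as follows. The helper names `timed_call` and `likely_embedded` are illustrative, and the 2-second threshold is an arbitrary assumption that would need calibrating against real requests. Network latency and file size also affect timing, so the result is a hint only.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def likely_embedded(elapsed_seconds, threshold_seconds=2.0):
    """Guess whether the fast path (embedded XML) was taken.

    The threshold is an assumption, not a documented value.
    """
    return elapsed_seconds < threshold_seconds
```

Wrapping whatever function sends the request in `timed_call` gives the elapsed time to pass to `likely_embedded`.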

What is the output filename?

The response filename is derived from the uploaded PDF: if you upload invoice-2025.pdf, you receive invoice-2025.xml. The Content-Disposition header is set to attachment for direct download.