Thursday 31 January 2008

Creating PDF Documents from XML Using Apache FOP, PHP Javabridge and PHP 5

Welcome to my first professional blog post!

I was recently tasked with building a versatile engine that creates PDF reports from XML markup and separate data arrays.

After some research I found Apache FOP, which is a very good Java XSL FO (Extensible Stylesheet Language Formatting Objects) renderer. To get this to work with PHP 5 I also needed to use the PHP/Java Bridge.

To get from static XML markup to a rendered PDF document I needed to follow these steps:

Parse the XML into a multidimensional array with a PHP XMLReader
Parse the array into valid XSL FO markup
Pass the XSL FO markup through the PHP/Java Bridge to an Apache FOP renderer

However, the implementation we needed had to be versatile enough to take a data array and an XML template, then knit them together to allow dynamic document generation.


First, the Java FOP Implementation

The PHP/Java Bridge installs as a set of Java libraries so it's easy to implement in a singleton 'wrapper' class. The wrapper also needed to utilise the JAXP libraries to handle the XML transformation:


// Java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.StringReader;

//JAXP
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.Source;
import javax.xml.transform.Result;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.sax.SAXResult;

// FOP
import org.apache.fop.apps.FOUserAgent;
import org.apache.fop.apps.Fop;
import org.apache.fop.apps.FOPException;
import org.apache.fop.apps.FopFactory;
import org.apache.fop.apps.MimeConstants;

public class FopWrapper
{
    protected String[] renderTypes = {MimeConstants.MIME_PDF, MimeConstants.MIME_POSTSCRIPT, MimeConstants.MIME_RTF, MimeConstants.MIME_PNG, MimeConstants.MIME_GIF, MimeConstants.MIME_JPEG};
    protected FopFactory fopFactory;
    protected static FopWrapper fopWrapper;

    protected FopWrapper()
    {
        fopFactory = FopFactory.newInstance();
    }

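    // synchronized so that concurrent requests through the bridge cannot create two instances
    public static synchronized FopWrapper getInstance()
    {
        if(FopWrapper.fopWrapper == null) FopWrapper.fopWrapper = new FopWrapper();
        return FopWrapper.fopWrapper;
    }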

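    public void render(String xmlInput, String outputFile, int renderType)
    {
        OutputStream output = null;
        try
        {
            FOUserAgent foUserAgent = fopFactory.newFOUserAgent();
            output = new BufferedOutputStream(new FileOutputStream(outputFile));
            // renderType is an index into renderTypes (0 = PDF)
            Fop fop = fopFactory.newFop(this.renderTypes[renderType], foUserAgent, output);
            // A transformer created without a stylesheet copies the XSL FO straight to FOP's SAX handler
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer = factory.newTransformer();
            Source src = new StreamSource(new StringReader(" "+xmlInput));
            Result res = new SAXResult(fop.getDefaultHandler());
            transformer.transform(src, res);
        }
        catch (Exception e)
        {
            System.out.println(e.getMessage());
        }
        finally
        {
            // Close the stream even when rendering fails, so the file handle is not leaked
            if (output != null)
            {
                try { output.close(); } catch (IOException e) { System.out.println(e.getMessage()); }
            }
        }
    }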
}


This wrapper can then be instantiated in PHP by the following code:

java_require($path_to_file);
$class = new JavaClass("FopWrapper");
$this->reporter = $class->getInstance();

And it can be made to output rendered XSL FO markup with a single call:

$this->reporter->render($xsl_markup, $outputFile, $renderType);

where:
$xsl_markup is a string of valid XSL FO markup
$outputFile is the output file (including path) to create
$renderType is an integer giving the type of file to render to (the index of the MIME type in the wrapper's renderTypes array, so 0 is PDF)

The Java wrapper needs to be packaged into a JAR file to be utilised by the PHP/Java Bridge.
And that's it! The rest of the work is done purely in PHP.


Pages, Blocks and Formatting Objects

W3Schools has a great XSL FO tutorial, detailing the elements needed to build a page in XSL FO, the tags and their arguments.

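Effectively every valid XSL FO document must have a <fo:root> node. Strictly speaking, the root holds a <fo:layout-master-set> describing the page geometry plus one or more <fo:page-sequence> nodes (XSL FO has no literal <fo:page> element; these are the 'pages' I refer to below). Each page sequence contains a <fo:flow> with one or more <fo:block> nodes, which may contain any number of formatting objects (plain text, tables, lists, images etc). These can be very nicely represented as PHP 5 objects.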
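For reference, here is a rough sketch of the smallest document skeleton that follows this structure, written in Java to match the wrapper above. The class and method names are made up for illustration and are not part of my implementation; the element names come from the XSL FO specification, where pages are declared through <fo:layout-master-set> and <fo:page-sequence>.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
class MinimalFoDocument
{
    static final String FO_NS = "http://www.w3.org/1999/XSL/Format";

    // Builds a one-page A4 document holding a single block.
    // The text is assumed to be XML-escaped already.
    static String build(String text)
    {
        return "<fo:root xmlns:fo=\"" + FO_NS + "\">"
             + "<fo:layout-master-set>"
             + "<fo:simple-page-master master-name=\"A4\" page-height=\"29.7cm\" page-width=\"21cm\">"
             + "<fo:region-body/>"
             + "</fo:simple-page-master>"
             + "</fo:layout-master-set>"
             + "<fo:page-sequence master-reference=\"A4\">"
             + "<fo:flow flow-name=\"xsl-region-body\">"
             + "<fo:block>" + text + "</fo:block>"
             + "</fo:flow>"
             + "</fo:page-sequence>"
             + "</fo:root>";
    }

    // Parses the markup to confirm it is well-formed and returns the root element's local name.
    static String rootName(String markup)
    {
        try
        {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(markup)));
            return doc.getDocumentElement().getLocalName();
        }
        catch (Exception e)
        {
            throw new IllegalStateException("Markup is not well-formed XML", e);
        }
    }
}
```

A string built this way could be handed to the wrapper's render method as the xmlInput argument.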

I constructed the objects so that they all implement an interface that simply defines an init() and a render() function, where render() always outputs the data as valid XSL FO.
Page objects can be added to Root objects, Block objects can be added to Page objects and FO objects can be added to Block objects. When the render function is called on one object, it calls the render function on all the objects it holds, recursing through the object tree.
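The recursion can be sketched roughly as follows. This is an illustration in Java rather than my PHP classes; the names FoNode, TextFo and ContainerNode are hypothetical, and init() is left out to keep it short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; the real PHP classes are not shown in this post.
interface FoNode
{
    String render();
}

// Leaf node: plain text inside a block.
class TextFo implements FoNode
{
    private final String text;

    TextFo(String text) { this.text = text; }

    public String render() { return text; }
}

// Container node: wraps the rendered output of its children in one XSL FO tag.
class ContainerNode implements FoNode
{
    private final String tag;
    private final List<FoNode> children = new ArrayList<>();

    ContainerNode(String tag) { this.tag = tag; }

    // Returns this so that calls can be chained.
    ContainerNode add(FoNode child)
    {
        children.add(child);
        return this;
    }

    public String render()
    {
        StringBuilder out = new StringBuilder("<" + tag + ">");
        for (FoNode child : children)
        {
            out.append(child.render()); // each child renders its own subtree
        }
        return out.append("</" + tag + ">").toString();
    }
}
```

Calling render() on the outermost node is then enough to produce the markup for everything beneath it.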

I won't go too much into the implementation of these objects. My implementation (within the APLC Repository) uses init functions to set default attributes (in this case I'm using 'attribute' to mean the arguments within the XML tags). It also has a setter function for each attribute, named by upper-casing the first letter of the attribute name and prefixing it with 'set'. The 'method_exists' function is useful here to check whether a setter method is defined for an attribute.
I pass the data and parameters to the formatting object or renderer's 'init' function as a single array (actually as an instance of Aplc_Registry, which is a feature-rich wrapper around a multidimensional array, but for the purposes of this blog that's an unnecessary complication).
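The setter naming convention might look something like this. It is a Java sketch with hypothetical class names, where reflection stands in for PHP's method_exists; my PHP code is not reproduced here.

```java
import java.lang.reflect.Method;
import java.util.Map;

// Hypothetical formatting object with two attributes and their defaults.
class BlockAttributes
{
    String align = "left";
    String fontsize = "10";

    public void setAlign(String value) { this.align = value; }
    public void setFontsize(String value) { this.fontsize = value; }
}

class AttributeSetter
{
    // "align" becomes "setAlign"
    static String setterName(String attribute)
    {
        return "set" + Character.toUpperCase(attribute.charAt(0)) + attribute.substring(1);
    }

    // Applies every attribute that has a matching setter and returns how many were applied.
    static int apply(Object target, Map<String, String> attributes)
    {
        int applied = 0;
        for (Map.Entry<String, String> entry : attributes.entrySet())
        {
            try
            {
                Method setter = target.getClass().getMethod(setterName(entry.getKey()), String.class);
                setter.invoke(target, entry.getValue());
                applied++;
            }
            catch (NoSuchMethodException e)
            {
                // No setter for this attribute, so it is skipped (the method_exists check in PHP)
            }
            catch (ReflectiveOperationException e)
            {
                throw new IllegalStateException(e);
            }
        }
        return applied;
    }
}
```

Unknown attributes are silently ignored in this sketch, which keeps a template with a stray attribute from breaking the whole render.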
If you are looking to create large multi-page reports it's useful to render each Page to a temporary file as it's added to the Root object.

So now we have a set of objects that can be created, updated, combined and rendered to produce an output document.
It is a fairly intuitive task then to create a document 'template' as a multidimensional array and parse it like so:



$renderedDocument = new Xslfo_Document();

$parsedHeader = new Xslfo_Header();
$parsedHeader->init($document['header']);
$renderedDocument->addHeader($parsedHeader);

$parsedFooter = new Xslfo_Footer();
$parsedFooter->init($document['footer']);
$renderedDocument->addFooter($parsedFooter);

foreach($document['pages'] as $page)
{
    $renderedPage = new Xslfo_Page();
    $renderedPage->init($page['attributes']);

    foreach($page['blocks'] as $block)
    {
        $renderedBlock = new Xslfo_Block();
        $renderedBlock->init($block['attributes']);

        foreach($block['formatting_objects'] as $fo)
        {
            $renderedFo = new Xslfo_Fo();
            $renderedFo->init($fo);
            $renderedBlock->addFo($renderedFo);
        }

        $renderedPage->addBlock($renderedBlock);
    }

    $renderedDocument->addPage($renderedPage);
}

$xsl = $renderedDocument->render();


This assumes that $document is the multidimensional array and that its 'header', 'footer' and 'formatting_objects' elements contain both formatting 'attributes' and data.


Bringing in the XML

I started by creating my own DTD to define my XML schema for these documents.
The DTD is not complicated. It simply defines which formatting attributes are allowed in document, header, footer, page and block tags, and also defines 'fo' (formatting object) and 'renderer' tags.
'Fo' and 'Renderer' tags contain the actual data to be put in the document or references to elements within the data array. They also name the class that will be used to process them and give class-specific parameters.
In simple terms the difference between an 'Fo' and a 'Renderer' is that an 'Fo' produces a standard element with the data it is passed from the template or the data array. A renderer will perform more advanced operations such as extracting and filtering data from the database according to the defined parameters. A renderer will create and return a formatting object.


$data = array(array(
    'renderFile' => '/tmp/leave.pdf',
    'renderType' => 0,
    'sender' => 8543,
    'recipient' => 7365,
    'senderaddress' => 65465,
    'recipientaddress' => 34566,
    'body' => 'Please make an appointment to see me regarding your son\'s behaviour. Frankly I didn\'t know you could do that with a melon and a block of soft cheese.',
    'image1' => 'http://www.thedaddy.org/images/banner.png'
));

<document>
<header align="right" fontsize="12" fontfamily="verdana" fontweight="bold">Letter Template</header>
<page>
    <block align="left">
        <fo type="Aplc_Report_Fo_Image">
            <data>image1</data>
        </fo>
    </block>
    <block align="right">
        <renderer type="Iris_Renderer_Site_Id">
            <data>senderaddress</data>
            <parameter name="newline">true</parameter>
            <parameter name="field">name</parameter>
            <parameter name="field">postcode</parameter>
        </renderer>
        <fo type="Aplc_Report_Fo_Break"></fo>
        <fo type="Aplc_Report_Fo_Date"></fo>
    </block>
    <block align="left">
        <renderer type="Iris_Renderer_Site_Id">
            <data>recipientaddress</data>
            <parameter name="newline">true</parameter>
            <parameter name="field">name</parameter>
            <parameter name="field">postcode</parameter>
        </renderer>
        <fo type="Aplc_Report_Fo_Break"></fo>
    </block>
    <block align="left" linefeedtreatment="none">
        Dear
        <renderer type="Iris_Renderer_User_Id">
            <data>recipient</data>
            <parameter name="field">fullname</parameter>
        </renderer>
    </block>
    <block align="justify">
        <fo type="Aplc_Report_Fo_Text">
            <data>body</data>
        </fo>
    </block>
    <block>
        <fo type="Aplc_Report_Fo_Break"></fo>
        Sincerely
        <fo type="Aplc_Report_Fo_Break"></fo>
        <renderer type="Iris_Renderer_User_Id">
            <data>sender</data>
            <parameter name="field">fullname</parameter>
        </renderer>
    </block>
</page>
</document>


This XML is parsed by a simple wrapper that we built in APLC around the PHP XMLReader class, which validates it against the DTD and then turns it into a multidimensional array.
This array is iterated over, creating the node objects, passing in the data and parameters and adding them to their parent nodes. For the Renderers and Formatting Objects I simply took the type name and checked whether it was valid using class_exists. If not, I ignored that whole node.
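As a rough illustration of that parsing step, the sketch below turns a template into nested maps and drops any 'fo' or 'renderer' node whose type is unknown. It is written in Java with the standard DOM parser rather than PHP's XMLReader, the TemplateParser class is hypothetical, a set of known type names stands in for class_exists, and DTD validation is left out. It also collects a node's text separately from its children, so the order of mixed text and tags is not preserved.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical stand-in for the APLC XMLReader wrapper.
class TemplateParser
{
    private final Set<String> knownTypes; // plays the role of class_exists()

    TemplateParser(Set<String> knownTypes) { this.knownTypes = knownTypes; }

    Map<String, Object> parse(String xml)
    {
        try
        {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            return toMap(root);
        }
        catch (Exception e)
        {
            throw new IllegalArgumentException("Template is not well-formed XML", e);
        }
    }

    // Returns null for an fo/renderer node of unknown type so that the caller drops it.
    private Map<String, Object> toMap(Element element)
    {
        String name = element.getTagName();
        boolean typed = name.equals("fo") || name.equals("renderer");
        if (typed && !knownTypes.contains(element.getAttribute("type")))
        {
            return null;
        }
        Map<String, String> attributes = new LinkedHashMap<>();
        NamedNodeMap raw = element.getAttributes();
        for (int i = 0; i < raw.getLength(); i++)
        {
            attributes.put(raw.item(i).getNodeName(), raw.item(i).getNodeValue());
        }
        List<Map<String, Object>> children = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling())
        {
            if (child instanceof Element)
            {
                Map<String, Object> parsed = toMap((Element) child);
                if (parsed != null) children.add(parsed);
            }
            else if (child.getNodeType() == Node.TEXT_NODE)
            {
                text.append(child.getNodeValue());
            }
        }
        Map<String, Object> node = new LinkedHashMap<>();
        node.put("name", name);
        node.put("attributes", attributes);
        node.put("text", text.toString().trim());
        node.put("children", children);
        return node;
    }
}
```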

Renderers and Formatting Objects

How these are implemented really just comes down to personal style. Formatting objects are quite simple, so long as it's ensured that valid XSL FO is returned by the render function. You'll see that I created some that take data to produce tables, lists and plain text and others that will simply render a line break or print the date.

The renderers use singleton 'data objects' which connect to the database, pull all the specified data into a two-dimensional array and then serialize it into a cache file. The renderers then pull the rows and columns asked for in the template or data. This greatly lowers processor and memory requirements.

Mostly the data you pass to the parser will be from the data array, but every now and then, particularly for plain text, you will want to embed it directly in the template. For this I used a simple trick: when the parser encounters a data tag it checks whether the contents of the tag are a key in the data array. If so, the data stored under that key is rendered; if not, the contents of the data tag itself are used. This also helps greatly with debugging, as any key that couldn't be found in the data array is printed to the document.
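The lookup itself is tiny. A minimal sketch in Java, with a hypothetical DataResolver class and a flat string map standing in for the data array:

```java
import java.util.Map;

class DataResolver
{
    // Returns the data-array value when the tag content is a key, otherwise the content itself.
    static String resolve(String tagContent, Map<String, String> data)
    {
        String key = tagContent.trim();
        return data.containsKey(key) ? data.get(key) : tagContent;
    }
}
```

With the template above, resolving 'body' would return the letter text from the data array, while a literal such as 'Sincerely' would come back unchanged.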

Gotchas

When blocks are added to a page in FOP they are always stacked vertically. There is no native way to specify that you want to align one block horizontally with another. To get around this I made another class to handle 'horizontal' tags. It simply creates a block containing an XSL FO table, and any blocks placed within it are set in table cells within a single row.
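The markup such a class produces could look roughly like the output of this sketch. It is a Java illustration with a hypothetical HorizontalLayout class, and the equal column widths are my assumption for the example rather than what the real class does.

```java
import java.util.List;

class HorizontalLayout
{
    // Wraps already-rendered <fo:block> strings in a one-row table so they sit side by side.
    static String render(List<String> renderedBlocks)
    {
        if (renderedBlocks.isEmpty())
        {
            throw new IllegalArgumentException("A horizontal group needs at least one block");
        }
        StringBuilder out = new StringBuilder("<fo:block><fo:table table-layout=\"fixed\" width=\"100%\">");
        // One equally weighted column per block (an assumption made for this sketch)
        for (int i = 0; i < renderedBlocks.size(); i++)
        {
            out.append("<fo:table-column column-width=\"proportional-column-width(1)\"/>");
        }
        out.append("<fo:table-body><fo:table-row>");
        for (String block : renderedBlocks)
        {
            out.append("<fo:table-cell>").append(block).append("</fo:table-cell>");
        }
        return out.append("</fo:table-row></fo:table-body></fo:table></fo:block>").toString();
    }
}
```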