Introduction

In a previous article, we became acquainted with the internal structure of an OpenXML package. It is now time to put that knowledge into practice by building a framework, written in PHP, that gives access to the contents of these files. The main objective of this framework is to let the developer who uses it access the contents of an OpenXML document through a simple API that hides the complexity of the package's internal structure.
The framework described in the following sections, although quite functional, is not intended for production use. Its purpose is to illustrate the concepts we have studied and possibly to serve as a basis for your own developments.
1. Technical specifications & prerequisites

To keep the code of our framework relatively simple and true to its educational purpose, we will limit it to reading OpenXML documents, leaving aside their creation and modification. The features of the framework come down to this list:

- Automatic recognition of Office 2007 OpenXML documents (Word, Excel)
- Access to document metadata
- Access to document properties
- Display of an extract (preview) of the document as HTML

Implementing these features requires a lot of XML manipulation. The SimpleXML extension seems the ideal candidate for this, as it keeps the complexity and length of the code under control. However, while SimpleXML is perfect for XML documents with a fairly simple schema, such as those containing the metadata and properties of an OpenXML document, it becomes problematic when it comes to generating the HTML extract, which is obtained from the main XML part of the document (main part), whose schema is particularly complex. To avoid having to walk through WordprocessingML or SpreadsheetML content with SimpleXML, we will therefore opt for a solution based on style sheets written in XSLT.
An Office 2007 OpenXML document is a tree of files contained in a zipped package. We therefore need a tool that gives access to any of these files while keeping the handling of the Zip format (decompression) as transparent and simple as possible for the client code, preferably without having to extract the package to the local file system. Among the Zip tools that PHP provides, the ZipArchive class meets all these conditions.
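As a rough illustration of reading a file straight out of a package without extracting it, here is a minimal sketch. The helper name readPackagePart() and the throwaway package it builds are assumptions made for the example; they are not part of the framework. Only ZipArchive::open(), addFromString(), getFromName() and close() are used.

```php
<?php
// Hypothetical helper (not part of the framework): reads one entry of a
// Zip package into a string without extracting anything to disk.
// Returns null when the entry does not exist.
function readPackagePart($packagePath, $partName)
{
    $zip = new ZipArchive();
    if ($zip->open($packagePath) !== true) {
        throw new RuntimeException("Unable to open package: $packagePath");
    }
    $content = $zip->getFromName($partName);
    $zip->close();
    return $content === false ? null : $content;
}

// Build a tiny throwaway package so the sketch can be run as-is.
if (class_exists('ZipArchive')) {
    $path = tempnam(sys_get_temp_dir(), 'opc');
    $zip = new ZipArchive();
    $zip->open($path, ZipArchive::CREATE | ZipArchive::OVERWRITE);
    $zip->addFromString('docProps/core.xml', '<coreProperties/>');
    $zip->close();

    echo readPackagePart($path, 'docProps/core.xml'), "\n";
    unlink($path);
}
```

With a real document, the same call would be made with the path of a .docx or .xlsx file and a part name such as the ones found in its relationships files.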
Finally, to make our framework robust and extensible, the classes that make it up must follow current object-oriented programming (OOP) practice (encapsulation, visibility, inheritance, etc.), which rules out PHP 4, whose object model is too rudimentary.
In short, the ideal platform for our framework is PHP 5.2 or higher, with the XSL and Zip (ZipArchive) extensions enabled.
On Windows, make sure that these two lines are present in php.ini:

    extension=php_xsl.dll
    extension=php_zip.dll

On Unix / Linux, PHP must be configured with the options --with-xsl and --enable-zip.
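Whether the platform meets these requirements can also be checked from PHP itself. The sketch below is only an illustration: the function name missingExtensions() is an assumption, and the check relies on the standard extension_loaded() function.

```php
<?php
// Hypothetical helper: returns the names of the required extensions
// that are not loaded in the running PHP interpreter.
function missingExtensions(array $required)
{
    $missing = array();
    foreach ($required as $name) {
        if (!extension_loaded($name)) {
            $missing[] = $name;
        }
    }
    return $missing;
}

// Extensions the framework relies on: XSL, Zip and SimpleXML.
$missing = missingExtensions(array('xsl', 'zip', 'SimpleXML'));
if (count($missing) > 0) {
    echo 'Missing extensions: ' . implode(', ', $missing) . "\n";
} else {
    echo "All required extensions are loaded\n";
}
```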
2. The structure of the framework

Each type of Office document handled by our framework, Word and Excel, has a dedicated class, named WordDocument and ExcelWorkbook respectively. Each instance of these classes represents one OpenXML document, which it accesses through a ZipArchive object. The methods exposed by these classes give read-only access to the document properties and display an extract (preview) of the document.
These two classes derive from an abstract parent class, OpenXMLDocument, which contains the code that manages the internal structure of the package, the methods that give access to the document's metadata (remember that metadata do not depend on the type of Office document they describe; they are common to all OpenXML files), and the code for the XSLT transformation.
Among the features listed in the technical specifications is the automatic recognition of the type of Office document by the framework. We must avoid situations where, for example, client code tries to create an instance of WordDocument while the underlying OpenXML document is an Excel workbook. To that end, the framework adopts the Factory design pattern, which is a perfect answer to this problem. Given an OpenXML document, the client code does not instantiate one of the two document classes itself; instead it passes the document name to the static method openDocument() of the OpenXMLDocumentFactory class, which is responsible for detecting the type of the document, instantiating the appropriate class, and returning the instance.
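The principle of the factory can be sketched independently of the Zip handling. In the example below, all class names (SketchDocument, SketchWordDocument, SketchExcelWorkbook, SketchDocumentFactory) are hypothetical stand-ins rather than the framework's classes, and the content type is passed in directly instead of being read from the package. The two content type strings are the ones Office uses for the main part of Word and Excel documents.

```php
<?php
// Hypothetical stand-ins for the framework's document classes.
abstract class SketchDocument
{
    protected $fileName;

    public function __construct($fileName)
    {
        $this->fileName = $fileName;
    }

    abstract public function kind();
}

class SketchWordDocument extends SketchDocument
{
    public function kind() { return 'word'; }
}

class SketchExcelWorkbook extends SketchDocument
{
    public function kind() { return 'excel'; }
}

// The factory maps the content type of the main part to a concrete class,
// so client code never chooses the class itself.
class SketchDocumentFactory
{
    private static $classes = array(
        'application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml' => 'SketchWordDocument',
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml' => 'SketchExcelWorkbook',
    );

    public static function create($fileName, $contentType)
    {
        if (!isset(self::$classes[$contentType])) {
            throw new InvalidArgumentException("Unknown document type: $contentType");
        }
        $class = self::$classes[$contentType];
        return new $class($fileName);
    }
}

$doc = SketchDocumentFactory::create(
    'sample1.xlsx',
    'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml'
);
echo get_class($doc), "\n";
```

A lookup table is used here instead of a switch only to keep the sketch short; either form expresses the same idea.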
Note that the developer can bypass this mechanism and insist on instantiating one of these classes directly: unlike Java, PHP has no "package" visibility which, applied to the constructors of the OpenXMLDocument subclasses, would force them to be instantiated by the factory. This is a small weakness of the PHP 5 object model which, in my opinion, deserves to be corrected in a later version...
Internal error handling in the framework is provided by two classes derived from the Exception class: OpenXMLException and OpenXMLFatalException. Although their implementations are similar, they do not come into play in the same scenarios: OpenXMLException is thrown when a logical error occurs while querying the structure of the package (a URI that cannot be found in a relationships file, for example), whereas OpenXMLFatalException is thrown on an input/output error or when XML parsing fails. Unlike OpenXMLFatalException, OpenXMLException does not necessarily indicate an error serious enough to stop processing. For instance, an attempt to read the metadata of a document that has none (in the table of section 1.4.2, this part is declared optional) throws an OpenXMLException. In the current framework, OpenXMLException exceptions are handled internally, and only OpenXMLFatalException exceptions reach the client code.

Note that all the methods of the OpenXMLDocument class that explore the structure of the package and access its parts are static, because the factory needs these methods to determine the document type at a point where no class of the framework has been instantiated yet.

3. The PHP code

This section reviews the main parts of the framework's code.

3.1. Reading and searching the relationships files

Before a part can be accessed, its name must be looked up in the relationships file attached to the source part that links to it (if the source is the root of the package, it is designated by the constant OpenXMLDocument::ROOT_PARTNAME). This lookup is performed by the static method OpenXMLDocument::getRelationTarget():

    static function getRelationTarget(ZipArchive $zip, $sourcePartName, $relationURI)
    {
        // Build the name of the relationships file according to the OPC
        // (Open Packaging Conventions) standard
        $relation_file = dirname($sourcePartName) . '/_rels/' . basename($sourcePartName) . '.rels';
        // Normalize the name: the \ returned by dirname() on a Windows
        // platform are replaced by /
        $relation_file = str_replace('\\', '/', $relation_file);
        // Strip the leading /: access to a zipped entry is always relative
        // to the root of the archive
        if ($relation_file[0] == '/') {
            $relation_file = substr($relation_file, 1);
        }
        $relations_xml = self::xml_getPart($zip, $relation_file);
        if (empty($relations_xml)) {
            throw new OpenXMLFatalException('Unable to parse the relationships file', __METHOD__);
        }
        $relations_xml->registerXPathNamespace('rns', self::RELATIONSHIPS_NS);
        $relation_targets = $relations_xml->xpath("/rns:Relationships/rns:Relationship[@Type='$relationURI']/@Target");
        if (empty($relation_targets)) {
            throw new OpenXMLException('Unable to locate the target of the relationship ' . $relationURI, __METHOD__);
        }
        return $relation_targets[0];
    }

The first task of this method is to rebuild the name of the relationships file from the name of the source part (for a description of the rule that defines the name of a relationships file, see section 1.4 of the article "Structure of an OpenXML document"). Once the name is known, the relationships file is opened and parsed by the method OpenXMLDocument::xml_getPart(), which returns a SimpleXMLElement object. The target of the relationship is then retrieved through an XPath query that takes the relationship URI as its argument.
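The two steps described above, naming the relationships file and then querying it, can be tried in isolation on a hand-written relationships document. This is a minimal sketch: the function names relationshipPartName() and findRelationTarget() are hypothetical, the source part name is assumed to be an empty string for the package root, and the sample XML is an assumption modelled on what Word typically produces.

```php
<?php
// Hypothetical helper: derives the name of the relationships file attached
// to a source part. An empty source name is assumed to mean the package root.
function relationshipPartName($sourcePartName)
{
    $source = ltrim(str_replace('\\', '/', $sourcePartName), '/');
    $slash = strrpos($source, '/');
    $dir = $slash === false ? '' : substr($source, 0, $slash + 1);
    $base = $slash === false ? $source : substr($source, $slash + 1);
    return $dir . '_rels/' . $base . '.rels';
}

// Hypothetical helper: returns the Target of the first relationship of the
// given type, or null when there is none.
function findRelationTarget($relsXml, $relationType)
{
    $rels = simplexml_load_string($relsXml);
    if ($rels === false) {
        return null;
    }
    $rels->registerXPathNamespace('r', 'http://schemas.openxmlformats.org/package/2006/relationships');
    $targets = $rels->xpath("/r:Relationships/r:Relationship[@Type='$relationType']/@Target");
    return empty($targets) ? null : (string) $targets[0];
}

// Sample root relationships file (illustrative content).
$relsXml = <<<'XML'
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
  <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties" Target="docProps/core.xml"/>
</Relationships>
XML;

echo relationshipPartName(''), "\n";
echo relationshipPartName('word/document.xml'), "\n";
echo findRelationTarget(
    $relsXml,
    'http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument'
), "\n";
```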
The parts themselves are read by the method OpenXMLDocument::xml_getPart():

    static function xml_getPart(ZipArchive $zip, $partName, $ns = '')
    {
        $part_content = $zip->getFromName($partName);
        if (empty($part_content)) {
            throw new OpenXMLFatalException('Unable to read the part ' . $partName, __METHOD__);
        }
        // The class name and options are passed explicitly with their
        // default values so that $ns can be given as the fourth argument
        $xml = simplexml_load_string($part_content, 'SimpleXMLElement', 0, $ns, FALSE);
        if (empty($xml)) {
            throw new OpenXMLFatalException('Unable to parse the part ' . $partName, __METHOD__);
        }
        return $xml;
    }

The only remarkable thing about this method is the rather unusual number of parameters passed to simplexml_load_string(): the $ns parameter, which contains a namespace URI, is needed to read XML documents whose root element is qualified by a namespace prefix. One example is the XML part containing the document's metadata (see below).
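The effect of the namespace argument can be observed on a small, hand-written metadata part. The sketch below is an illustration only; the sample XML and variable names are assumptions. It loads the same document with and without the namespace URI and also shows children(), which gives access to elements from another namespace.

```php
<?php
// Shortened, illustrative core-properties part with a prefixed root element.
$coreXml = <<<'XML'
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cp:coreProperties
    xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:creator>Eric Grimois</dc:creator>
  <cp:keywords>OpenXML, PHP</cp:keywords>
</cp:coreProperties>
XML;

$cpNs = 'http://schemas.openxmlformats.org/package/2006/metadata/core-properties';
$dcNs = 'http://purl.org/dc/elements/1.1/';

// Loaded without a namespace: plain property access does not see the
// prefixed cp:keywords element, so the cast yields an empty string.
$plain = simplexml_load_string($coreXml);
$withoutNs = (string) $plain->keywords;

// Loaded with the cp namespace URI as fourth argument (fifth argument false
// means "this is a URI, not a prefix"): cp elements become directly reachable.
$scoped = simplexml_load_string($coreXml, 'SimpleXMLElement', 0, $cpNs, false);
$keywords = (string) $scoped->keywords;

// Elements from another namespace are reached through children().
$creator = (string) $scoped->children($dcNs)->creator;

var_dump($withoutNs, $keywords, $creator);
```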
3.2. Determining the document type

The document class is instantiated by the static method OpenXMLDocumentFactory::openDocument(), which must be the first method of the framework called by the client code:

    static function openDocument($fileName)
    {
        $zip = new ZipArchive();
        if ($zip->open($fileName) !== TRUE) {
            throw new OpenXMLFatalException('Unable to open the file ' . $fileName, __METHOD__);
        }
        // Look up the content type of the main part of the document
        $type = OpenXMLDocument::getMainPartContentType($zip);
        $zip->close();
        // Instantiate and return the document class matching the content type
        switch ($type) {
            case OpenXMLDocument::WORD_DOCUMENT_CONTENT_TYPE:
                return new WordDocument($fileName);
            case OpenXMLDocument::EXCEL_WORKBOOK_CONTENT_TYPE:
                return new ExcelWorkbook($fileName);
            default:
                throw new OpenXMLFatalException('The document type ' . $type . ' is unknown', __METHOD__);
        }
    }

To identify the document type, this factory method relies on two static methods of OpenXMLDocument:

    static function getMainPartContentType(ZipArchive $zip)
    {
        $main_part = self::getRelationTarget($zip, self::ROOT_PARTNAME, self::OFFICE_DOCUMENT_ROOT_REL);
        $type = self::getContentType($zip, $main_part);
        return $type;
    }

    static function getContentType(ZipArchive $zip, $partName)
    {
        $contents_xml = self::xml_getPart($zip, '[Content_Types].xml');
        $contents_xml->registerXPathNamespace('cns', self::CONTENT_TYPES_NS);
        $types = $contents_xml->xpath("/cns:Types/cns:Override[@PartName='/$partName']/@ContentType");
        if (empty($types)) {
            // No Override element matches the requested part:
            // look among the default types for the one matching the part's extension
            $extension = substr(strrchr($partName, '.'), 1);
            $types = $contents_xml->xpath("/cns:Types/cns:Default[@Extension='$extension']/@ContentType");
            if (empty($types)) {
                throw new OpenXMLException('Unable to determine the content type of ' . $partName, __METHOD__);
            }
        }
        return $types[0];
    }

OpenXMLDocument::getMainPartContentType() returns the MIME type of the part containing the body of the document, and OpenXMLDocument::getContentType() returns the MIME type of the part whose name is passed as a parameter.
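The Override-then-Default lookup can be exercised on a small [Content_Types].xml written by hand. The function name resolveContentType() and the sample XML below are assumptions made for this sketch; unlike the framework's method, it returns null instead of throwing when nothing matches.

```php
<?php
// Hypothetical helper: resolves the content type of a part, first through an
// Override element, then through the Default element for its extension.
function resolveContentType($contentTypesXml, $partName)
{
    $types = simplexml_load_string($contentTypesXml);
    if ($types === false) {
        throw new RuntimeException('Unable to parse [Content_Types].xml');
    }
    $types->registerXPathNamespace('ct', 'http://schemas.openxmlformats.org/package/2006/content-types');

    $found = $types->xpath("/ct:Types/ct:Override[@PartName='/$partName']/@ContentType");
    if (empty($found)) {
        // No Override for this part: fall back to the Default for its extension.
        $extension = pathinfo($partName, PATHINFO_EXTENSION);
        $found = $types->xpath("/ct:Types/ct:Default[@Extension='$extension']/@ContentType");
    }
    return empty($found) ? null : (string) $found[0];
}

// Illustrative content types part.
$contentTypesXml = <<<'XML'
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>
XML;

echo resolveContentType($contentTypesXml, 'word/document.xml'), "\n";
echo resolveContentType($contentTypesXml, 'docProps/custom.xml'), "\n";
```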
3.3. Reading the metadata

The metadata are read by the method OpenXMLDocument::readCoreProperties():

    private function readCoreProperties()
    {
        $corePropertiesPartName = self::getRelationTarget($this->zip, self::ROOT_PARTNAME, self::CORE_PROPERTIES_REL);
        $document = self::xml_getPart($this->zip, $corePropertiesPartName, self::CORE_PROPERTIES_NS);
        $this->keywords = $document->keywords;
        $this->last_writer = $document->lastModifiedBy;
        $this->revision = $document->revision;
        $dc_elements = $document->children(self::DUBLIN_CORE_NS);
        $this->creator = $dc_elements->creator;
        $dc_elements = $document->children(self::DUBLIN_CORE_TERMS_NS);
        $this->date_modified = $dc_elements->modified;
        $this->date_created = $dc_elements->created;
    }

The structure of the part containing the document's metadata forces us into some acrobatics with SimpleXML. The elements holding the metadata are spread across three namespaces, as this example shows:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <cp:coreProperties
        xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:dcterms="http://purl.org/dc/terms/"
        xmlns:dcmitype="http://purl.org/dc/dcmitype/"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <dc:title>Manipulation OpenXML files with PHP</dc:title>
      <dc:subject>Programming</dc:subject>
      <dc:creator>Eric Grimois</dc:creator>
      <cp:keywords>OpenXML, PHP, framework, OPC</cp:keywords>
      <dc:description>description OpenXML format and a PHP framework for its handling</dc:description>
      <cp:lastModifiedBy>Eric Grimois</cp:lastModifiedBy>
      <cp:revision>2</cp:revision>
      <dcterms:created xsi:type="dcterms:W3CDTF">2007-01-15T15:44:00Z</dcterms:created>
      <dcterms:modified xsi:type="dcterms:W3CDTF">2007-01-28T18:44:00Z</dcterms:modified>
    </cp:coreProperties>

The root element of the document, coreProperties, is qualified by a namespace prefix. The namespace bound to this prefix must be passed as a parameter to simplexml_load_string() (through xml_getPart()); otherwise the elements of that namespace are not reachable through plain property access on the returned object. The other elements are accessible through the method SimpleXMLElement::children(), which takes as a parameter the namespace they belong to.

3.4. Reading the extended properties

The part containing the so-called "extended" properties of the document is located and retrieved by the method OpenXMLDocument::xml_getExtendedProperties():

    protected function xml_getExtendedProperties()
    {
        $extendedPropertiesPartName = self::getRelationTarget($this->zip, self::ROOT_PARTNAME, self::EXTENDED_PROPERTIES_REL);
        return self::xml_getPart($this->zip, $extendedPropertiesPartName);
    }

On the other hand, reading the information contained in this part is specific to each document class. Because its schema is less complex than the one describing the metadata, this part is easier to handle with SimpleXML. For the WordDocument class, the method WordDocument::readExtendedProperties() takes care of it:

    function readExtendedProperties()
    {
        $document = parent::xml_getExtendedProperties();
        $this->application = $document->Application;
        $this->nb_paragraphs = $document->Paragraphs;
        $this->nb_characters = $document->Characters;
        $this->nb_characters_with_spaces = $document->CharactersWithSpaces;
        $this->nb_pages = $document->Pages;
        $this->nb_words = $document->Words;
    }

3.5. XSLT transformation

The method OpenXMLDocument::getXSLTTransformedDocument() takes care of transforming the main part of the document with the style sheet passed as a parameter:

    protected function getXSLTTransformedDocument($stylesheetName)
    {
        $xsl = new XSLTProcessor();
        $stylesheet = new DOMDocument();
        if ($stylesheet->load($stylesheetName) == FALSE) {
            throw new OpenXMLFatalException('Unable to load the stylesheet ' . $stylesheetName, __METHOD__);
        }
        $xsl->importStyleSheet($stylesheet);
        $mainPartName = self::getRelationTarget($this->zip, self::ROOT_PARTNAME, self::OFFICE_DOCUMENT_ROOT_REL);
        $mainPartContent = $this->zip->getFromName($mainPartName);
        if (empty($mainPartContent)) {
            throw new OpenXMLFatalException('Unable to read the part ' . $mainPartName, __METHOD__);
        }
        $document = new DOMDocument();
        if ($document->loadXML($mainPartContent) == FALSE) {
            throw new OpenXMLFatalException('Unable to load the main part of the document', __METHOD__);
        }
        // The original listing is cut off at this point; the method
        // presumably ends by applying the style sheet to the document
        return $xsl->transformToXML($document);
    }

This is the only method of the framework that requires the DOM. Its use is nevertheless limited to loading the document to transform and the style sheet.
The transformation is applied only to the main part of the document, and the framework, in its current design, does not give access to the other parts, such as those containing the page header and footer. A complete framework would need a mechanism that lets the XSLT processor reach any part of the document from the main part.
In each subclass of OpenXMLDocument, this transformation is used by the getHTMLPreview() method to return an excerpt of the document:

    function getHTMLPreview()
    {
        return parent::getXSLTTransformedDocument('preview-word.xslt');
    }

The style sheet is of course specific to each type of document. Using XSLT, and therefore keeping the presentation separate from the PHP code, allows this presentation to be changed completely without any risk to the integrity of the framework's code. The style sheet associated with Word documents is very short and simply returns the first non-empty paragraph:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/3/main"
        exclude-result-prefixes="w">
      <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes" omit-xml-declaration="yes"/>
      <xsl:template match="/">
        <p>
          <xsl:value-of select="w:document/w:body/w:p[normalize-space(.) != ''][1]"/>
        </p>
      </xsl:template>
    </xsl:stylesheet>
The namespace http://schemas.openxmlformats.org/wordprocessingml/2006/3/main is used by Word files generated by the Beta 2 version of Office 2007 (which is also the version of the sample files supplied with the framework). If you use the final version of Office 2007, replace that namespace in the style sheet with the definitive one: http://schemas.openxmlformats.org/wordprocessingml/2006/main
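To see the transformation at work outside the framework, the following sketch applies the same kind of style sheet, using the final namespace, to a tiny hand-written WordprocessingML document. The function name applyStylesheet() and the sample document are assumptions; only DOMDocument::loadXML(), XSLTProcessor::importStylesheet() and XSLTProcessor::transformToXml() are used.

```php
<?php
// Hypothetical helper: applies an XSLT style sheet (given as a string)
// to an XML document (given as a string) and returns the result.
function applyStylesheet($documentXml, $stylesheetXml)
{
    $stylesheet = new DOMDocument();
    if (!$stylesheet->loadXML($stylesheetXml)) {
        throw new RuntimeException('Unable to load the style sheet');
    }
    $document = new DOMDocument();
    if (!$document->loadXML($documentXml)) {
        throw new RuntimeException('Unable to load the document');
    }
    $xsl = new XSLTProcessor();
    $xsl->importStylesheet($stylesheet);
    return $xsl->transformToXml($document);
}

// Illustrative main part: one empty paragraph followed by two with text.
$documentXml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p/>
    <w:p><w:r><w:t>Hello OpenXML</w:t></w:r></w:p>
    <w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>
  </w:body>
</w:document>
XML;

// Same logic as the preview style sheet, with the final namespace.
$stylesheetXml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    exclude-result-prefixes="w">
  <xsl:output method="xml" encoding="UTF-8" omit-xml-declaration="yes"/>
  <xsl:template match="/">
    <p><xsl:value-of select="w:document/w:body/w:p[normalize-space(.) != ''][1]"/></p>
  </xsl:template>
</xsl:stylesheet>
XML;

// The XSL extension is optional in PHP, so the demo only runs when present.
if (class_exists('XSLTProcessor') && class_exists('DOMDocument')) {
    echo applyStylesheet($documentXml, $stylesheetXml), "\n";
}
```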
To keep it simple, this framework does not provide a preview for Excel.

4. Application example

Here is a simple example of an application using the framework:

    <?php
    require_once('openxml.class.php');

    $documents = array('sample1.docx', 'sample1.xlsx');

    foreach ($documents as $document) {
        echo "<b><u>$document</u></b><br/>";
        try {
            $mydoc = OpenXMLDocumentFactory::openDocument($document);

            echo '<br/><i>Metadata:</i><br/>';
            echo 'Creator: ' . $mydoc->getCreator() . '<br/>';
            echo 'Subject: ' . $mydoc->getSubject() . '<br/>';
            echo 'Keywords: ' . $mydoc->getKeywords() . '<br/>';
            echo 'Description: ' . $mydoc->getDescription() . '<br/>';
            echo 'Creation date: ' . $mydoc->getCreationDate() . '<br/>';
            echo 'Last update: ' . $mydoc->getLastModificationDate() . '<br/>';
            echo 'Last modified by: ' . $mydoc->getLastWriter() . '<br/>';
            echo 'Revision: ' . $mydoc->getRevision() . '<br/>';

            echo '<br/><i>Document properties:</i><br/>';
            echo 'Generated by: ' . $mydoc->getApplication() . '<br/>';
            $document_class = get_class($mydoc);
            if ($document_class == 'WordDocument') {
                echo 'Number of paragraphs: ' . $mydoc->getNbOfParagraphs() . '<br/>';
                echo 'Number of characters: ' . $mydoc->getNbOfCharacters() . '<br/>';
                echo 'Number of characters (with spaces): ' . $mydoc->getNbOfCharactersWithSpaces() . '<br/>';
                echo 'Number of pages: ' . $mydoc->getNbOfPages() . '<br/>';
                echo 'Number of words: ' . $mydoc->getNbOfWords() . '<br/>';
            }

            echo '<br/><i>Document preview:</i><br/>';
            echo $mydoc->getHTMLPreview();
        } catch (OpenXMLFatalException $e) {
            echo $e->getMessage();
        }
        echo '<br/>';
    }
    ?>

Conclusion

In this tutorial, we described a PHP framework for reading Office 2007 files. Although quite functional, this framework remains rudimentary; it is up to you to extend it and enrich it with your own ideas.