REXML: Parsing XML Documents in Ruby
Thursday, 15 May 2008
I. Getting started with tree parsing

We'll start with the tree parsing API, which resembles the DOM but is more intuitive. Here is a first example, which reads the file bibliography.xml:

code1.rb - Displaying an XML file

    require 'rexml/document'
    include REXML

    file = File.new("bibliography.xml")
    doc = Document.new(file)
    puts doc

The require statement loads the REXML library. We then include the REXML namespace so that we no longer have to write names like "REXML::Document" all the time. Next we open the existing file "bibliography.xml" and parse its contents into a Document object. Finally, we print the document to the screen. When you run the command "ruby code1.rb", the content of the XML document is displayed.

You may get this error message instead:

    example1.rb:1:in `require': No such file to load -- rexml/document (LoadError)
            from example1.rb:1

In that case REXML was not installed along with Ruby, which happens with some package managers, such as Debian's APT, that ship it as a separate package. Install the missing package, then try again.

The Document.new method takes an IO, Document or String object as its argument. The argument specifies the source from which the XML document is read. In the first example we used an IO object, or more precisely a File object, whose class inherits from IO. Another descendant of IO is the Socket class, which can be used with Document.new to fetch an XML file over a network connection.

If the Document constructor is given a Document object, that document is fully cloned into the new Document. If the constructor is given a String, the string is expected to contain an XML document. A small example:

code2.rb - Displaying XML content held in a string

    require 'rexml/document'
    include REXML

    string = <<EOF
    <?xml version="1.0" encoding="ISO-8859-15"?>
    <!DOCTYPE bibliography PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
      "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
    <bibliography>
      <biblioentry id="FHIW13C-1234">
        <author>
          <firstname>Godfrey</firstname>
          <surname>Vesey</surname>
        </author>
        <title>Personal Identity: A Philosophical Analysis</title>
        <publisher>
          <publishername>Cornell University Press</publishername>
        </publisher>
        <pubdate>1977</pubdate>
      </biblioentry>
    </bibliography>
    EOF
    doc = Document.new(string)
    puts doc

Here the string is written as a here document: all the characters between <<EOF and EOF, newlines included, are part of the string.

II. Accessing elements and attributes

From now on we'll use irb, Ruby's interactive shell, for the examples of using the REXML library. At the irb prompt we load the file bibliography.xml into a document. After that, we can run commands to access the elements and attributes of our document interactively.

    koan$ irb
    irb(main):001:0> require 'rexml/document'
    => true
    irb(main):002:0> include REXML
    => Object
    irb(main):003:0> doc = Document.new(File.new("bibliography.xml"))
    => <UNDEFINED> ... </>

Now we can explore the document very easily. Let's look at a typical irb session with our XML file:

    irb(main):004:0> root = doc.root
    => <bibliography id='personal_identity'> ... </>
    irb(main):005:0> root.attributes['id']
    => "personal_identity"
    irb(main):006:0> puts root.elements[1].elements["author"]
    <author>
      <firstname>Godfrey</firstname>
      <surname>Vesey</surname>
    </author>
    irb(main):007:0> puts root.elements["biblioentry[1]/author"]
    <author>
      <firstname>Godfrey</firstname>
      <surname>Vesey</surname>
    </author>
    irb(main):008:0> puts root.elements["biblioentry[@id='FHIW13C-1260']"]
    <biblioentry id='FHIW13C-1260'>
      <author>
        <firstname>Sydney</firstname>
        <surname>Shoemaker</surname>
      </author>
      <author>
        <firstname>Richard</firstname>
        <surname>Swinburne</surname>
      </author>
      <title>Personal Identity</title>
      <publisher>
        <publishername>Basil Blackwell</publishername>
      </publisher>
      <pubdate>1984</pubdate>
    </biblioentry>
    => nil
    irb(main):009:0> root.each_element('//author') {|author| puts author}
    <author>
      <firstname>Godfrey</firstname>
      <surname>Vesey</surname>
    </author>
    <author>
      <firstname>René</firstname>
      <surname>Marres</surname>
    </author>
    <author>
      <firstname>James</firstname>
      <surname>Baillie</surname>
    </author>
    <author>
      <firstname>Brian</firstname>
      <surname>Garrett</surname>
    </author>
    <author>
      <firstname>John</firstname>
      <surname>Perry</surname>
    </author>
    <author>
      <firstname>Geoffrey</firstname>
      <surname>Madell</surname>
    </author>
    <author>
      <firstname>Sydney</firstname>
      <surname>Shoemaker</surname>
    </author>
    <author>
      <firstname>Richard</firstname>
      <surname>Swinburne</surname>
    </author>
    <author>
      <firstname>Jonathan</firstname>
      <surname>Glover</surname>
    </author>
    <author>
      <firstname>Harold</firstname>
      <othername>W.</othername>
      <surname>Noonan</surname>
    </author>
    => [<author> ... </>, <author> ... </>, <author> ... </>, <author> ... </>, <author> ... </>, <author> ... </>, <author> ... </>, <author> ... </>, <author> ... </>, <author> ... </>]

First, we use the "root" method to reach the root of our document. Here, the document root is the bibliography element.
Every Element object has an Attributes object called "attributes", which acts as a hash with the attribute names as keys and the attribute values as values. With root.attributes['id'] we therefore get the value of the id attribute of the root element.

In the same way, every Element object contains an Elements object called "elements", and we can reach the child elements using its each and [] methods. The [] method takes an index or an XPath expression as its argument and returns the child element that matches it. The XPath expression works as a filter that decides which elements are returned. Note that root.elements[1] is the first child element, because XPath indexes start at 1, not 0. In fact, root.elements[1] is equivalent to root.elements["*[1]"], where *[1] is the XPath expression for the first child.

The each method of the Elements class iterates over all child elements, optionally filtered by an XPath expression. The code block is executed on every iteration. In addition, Element.each_element is a shortcut for Element.elements.each.
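The access methods described in this section can be tried outside irb as well. The script below is a minimal sketch, not the author's code: it assumes a shortened, two-entry version of the bibliography, inlined as a string so that no file is needed, and the variable names (root_id, first_title, entry, surnames) are chosen only for illustration.

```ruby
require 'rexml/document'

# Hypothetical two-entry excerpt of the bibliography, inlined for the sketch.
xml = <<~XML
  <bibliography id="personal_identity">
    <biblioentry id="FHIW13C-1234">
      <author><firstname>Godfrey</firstname><surname>Vesey</surname></author>
      <title>Personal Identity: A Philosophical Analysis</title>
    </biblioentry>
    <biblioentry id="FHIW13C-1260">
      <author><firstname>Sydney</firstname><surname>Shoemaker</surname></author>
      <author><firstname>Richard</firstname><surname>Swinburne</surname></author>
      <title>Personal Identity</title>
    </biblioentry>
  </bibliography>
XML

doc  = REXML::Document.new(xml)
root = doc.root

# Attribute lookup by name on the root element.
root_id = root.attributes['id']

# Index access is 1-based: elements[1] is the first child element.
first_title = root.elements[1].elements['title'].text

# XPath predicate: select the entry with a given id attribute.
entry = root.elements["biblioentry[@id='FHIW13C-1260']"]

# Iterate over every author in the document and collect the surnames.
surnames = []
root.each_element('//author') { |author| surnames << author.elements['surname'].text }

puts root_id
puts first_title
puts entry.attributes['id']
puts surnames.join(', ')
```

Writing REXML::Document in full instead of calling "include REXML" keeps the sketch from adding REXML's names to the top-level namespace, which can matter in larger scripts.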
III. Creating and inserting elements and attributes

We will now create a small bibliography consisting of a single entry. Here is how it goes:

    irb(main):010:0> doc2 = Document.new
    => <UNDEFINED/>
    irb(main):011:0> doc2.add_element("bibliography", {"id" => "philosophy"})
    => <bibliography id='philosophy'/>
    irb(main):012:0> doc2.root.add_element("biblioentry")
    => <biblioentry/>
    irb(main):013:0> biblioentry = doc2.root.elements[1]
    => <biblioentry/>
    irb(main):014:0> author = Element.new("author")
    => <author/>
    irb(main):015:0> author.add_element("firstname")
    => <firstname/>
    irb(main):016:0> author.elements["firstname"].text = "Bertrand"
    => "Bertrand"
    irb(main):017:0> author.add_element("surname")
    => <surname/>
    irb(main):018:0> author.elements["surname"].text = "Russell"
    => "Russell"
    irb(main):019:0> biblioentry.elements << author
    => <author> ... </>
    irb(main):020:0> title = Element.new("title")
    => <title/>
    irb(main):021:0> title.text = "The Problems of Philosophy"
    => "The Problems of Philosophy"
    irb(main):022:0> biblioentry.elements << title
    => <title> ... </>
    irb(main):023:0> biblioentry.elements << Element.new("pubdate")
    => <pubdate/>
    irb(main):024:0> biblioentry.elements["pubdate"].text = "1912"
    => "1912"
    irb(main):025:0> biblioentry.add_attribute("id", "ISBN0-19-285423-2")
    => "ISBN0-19-285423-2"
    irb(main):026:0> puts doc2
    <bibliography id='philosophy'>
      <biblioentry id='ISBN0-19-285423-2'>
        <author>
          <firstname>Bertrand</firstname>
          <surname>Russell</surname>
        </author>
        <title>The Problems of Philosophy</title>
        <pubdate>1912</pubdate>
      </biblioentry>
    </bibliography>
    => nil

As you can see, we create a new, empty document and add an element to it. This element becomes the root element. The add_element method takes the name of the element as its first argument and, optionally, a hash of attribute name/value pairs as its second.
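The two-argument form of add_element can also be used for child elements, which saves a separate add_attribute call. The following is a minimal sketch under that assumption; it rebuilds a reduced version of the bibliography from the session above, and the variable names are illustrative only.

```ruby
require 'rexml/document'

doc = REXML::Document.new

# The first add_element on an empty document creates the root element;
# the hash supplies its attributes.
root = doc.add_element('bibliography', 'id' => 'philosophy')

# The same form works on any element: name first, attribute hash second.
entry = root.add_element('biblioentry', 'id' => 'ISBN0-19-285423-2')

# add_element returns the new element, so its text can be set right away.
entry.add_element('title').text = 'The Problems of Philosophy'

puts doc
```

Because each call returns the element it created, the tree can be built top-down without looking elements up again through elements[].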
This method adds a new child to the document or to an element, and can optionally set the attributes of the new element as well. You can also create a new element first, as we did with "author", and add it to any element afterwards: if add_element is given an Element object, that object is added to the parent element. Instead of add_element, you can also use the << method on Element.elements. Both methods return the element that was added.

With the add_attribute method you can add an attribute to an existing element. The first parameter is the name of the attribute, the second is its value. As line 025 of the session shows, the call returns the value that was set. The text of an element can easily be changed with Element.text= or with the add_text method.

If you want to insert an element at a specific position, you can use the methods insert_before and insert_after:

    irb(main):027:0> publisher = Element.new("publisher")
    => <publisher/>
    irb(main):028:0> publishername = Element.new("publishername")
    => <publishername/>
    irb(main):029:0> publishername.add_text("Oxford University Press")
    => <publishername> ... </>
    irb(main):030:0> publisher << publishername
    => <publishername> ... </>
    irb(main):031:0> doc2.root.insert_before("//pubdate", publisher)
    => <bibliography id='philosophy'> ... </>
    irb(main):032:0> puts doc2
    <bibliography id='philosophy'>
      <biblioentry id='ISBN0-19-285423-2'>
        <author>
          <firstname>Bertrand</firstname>
          <surname>Russell</surname>
        </author>
        <title>The Problems of Philosophy</title>
        <publisher>
          <publishername>Oxford University Press</publishername>
        </publisher>
        <pubdate>1912</pubdate>
      </biblioentry>
    </bibliography>
    => nil

IV. Deleting elements and attributes

The methods add_element and add_attribute each have a counterpart for removing elements and attributes. Here is how it works with attributes:

    irb(main):033:0> doc2.root.delete_attribute('id')
    => <bibliography> ... </>
    irb(main):034:0> puts doc2
    <bibliography>
      <biblioentry id='ISBN0-19-285423-2'>
        <author>
          <firstname>Bertrand</firstname>
          <surname>Russell</surname>
        </author>
        <title>The Problems of Philosophy</title>
        <publisher>
          <publishername>Oxford University Press</publishername>
        </publisher>
        <pubdate>1912</pubdate>
      </biblioentry>
    </bibliography>
    => nil

The delete_attribute method is described as returning the removed attribute, although in the session above irb echoes the element the attribute was removed from. The delete_element method can take an Element object, a string or an index as its argument:

    irb(main):034:0> doc2.delete_element("//publisher")
    => <publisher> ... </>
    irb(main):035:0> puts doc2
    <bibliography>
      <biblioentry id='ISBN0-19-285423-2'>
        <author>
          <firstname>Bertrand</firstname>
          <surname>Russell</surname>
        </author>
        <title>The Problems of Philosophy</title>
        <pubdate>1912</pubdate>
      </biblioentry>
    </bibliography>
    => nil
    irb(main):036:0> doc2.root.delete_element(1)
    => <biblioentry id='ISBN0-19-285423-2'> ... </>
    irb(main):037:0> puts doc2
    <bibliography/>
    => nil

The first call to delete_element in our example uses an XPath expression to locate the element to remove. The second call uses the index 1, which means that the first child element of the document root is removed. The delete_element method returns the removed element.

V. Text nodes and entity processing

We have already used text nodes in the previous examples. In this section we look at some more advanced aspects of text nodes. Specifically: how does REXML handle entities?

REXML is a non-validating parser, and therefore it is not required to expand external entities. External entities are not replaced by their values, but internal entities are: when REXML parses an XML document, it processes the DTD and builds a table of the internal entities and their values. When one of these entities is encountered in the document, REXML replaces it with its value. An example:

    irb(main):038:0> doc3 = Document.new('<!DOCTYPE testentity [
    irb(main):039:1' <!ENTITY entity "test">]>
    irb(main):040:1' <testentity>&entity; the entity</testentity>')
    => <UNDEFINED> ... </>
    irb(main):041:0> puts doc3
    <!DOCTYPE testentity [
    <!ENTITY entity "test">]>
    <testentity>&entity; the entity</testentity>
    => nil
    irb(main):042:0> doc3.root.text
    => "test the entity"

You can see that the XML document, when printed, still contains the entity reference. When you access the text, the entity "&entity;" is correctly expanded to "test". However, REXML evaluates entities lazily rather than thoroughly. As a result, the following problem occurs:

    irb(main):043:0> doc3.root.text = "test the &entity;"
    => "test the &entity;"
    irb(main):044:0> puts doc3
    <!DOCTYPE testentity [
    <!ENTITY entity "test">
    ]>
    <testentity>&entity; the &entity;</testentity>
    => nil
    irb(main):045:0> doc3.root.text
    => "test the test"

As you can see, the text "test the &entity;" has been changed to "&entity; the &entity;". If you later change the value of the entity, the result will differ from what you expect: more will change in your document than you intended. If this is a problem for your application, you can set the :raw flag on any Text or Element node, and even on the Document node. The entities in that node are then not processed, and you have to handle them yourself. An example:

    irb(main):046:0> doc3 = Document.new('<!DOCTYPE testentity [
    irb(main):047:1' <!ENTITY entity "test">]>
    irb(main):048:1' <testentity>test the &entity;</testentity>', {:raw => :all})
    => <UNDEFINED> ... </>
    irb(main):049:0> puts doc3
    <!DOCTYPE testentity [
    <!ENTITY entity "test">
    ]>
    <testentity>test the &entity;</testentity>
    => nil
    irb(main):050:0> doc3.root.text
    => "test the test"

Special characters such as "&", "<", ">", the double quote (") and the apostrophe (') are converted automatically. In other words, if you write one of these characters in a text node or an attribute, REXML converts it to the equivalent entity. For example, "&" becomes "&amp;".

VI. Event-based (stream) parsing

Event-based parsing is faster than tree parsing. If speed matters, event-based parsing may be useful. However, features such as XPath are not available. You have to define a listener class, and whenever REXML encounters an event (start tag, end tag, text, and so on), the listener is notified of it. An example program:

code3.rb - Event-based parsing in action

    require 'rexml/document'
    require 'rexml/streamlistener'
    include REXML

    class Listener
      include StreamListener

      def tag_start(name, attributes)
        puts "Start #{name}"
      end

      def tag_end(name)
        puts "End #{name}"
      end
    end

    listener = Listener.new
    parser = Parsers::StreamParser.new(File.new("bibliography2.xml"), listener)
    parser.parse

Running code3.rb on the file bibliography2.xml gives this output:

    koan$ ruby code3.rb
    Start bibliography
    Start biblioentry
    Start author
    Start firstname
    End firstname
    Start surname
    End surname
    End author
    Start title
    End title
    Start publisher
    Start publishername
    End publishername
    End publisher
    Start pubdate
    End pubdate
    End biblioentry
    End bibliography

VII. Conclusion

Ruby and XML make a good team. The REXML processor lets you create, access and modify XML documents quickly and quite intuitively. With the help of irb, Ruby's interactive shell, you can also explore your XML documents very easily.
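As a follow-up to the event-based parsing of section VI, a listener can also react to text events, which code3.rb does not use. The sketch below is an illustration under stated assumptions rather than part of the original program: the class name TitleCollector, its titles accessor and the one-entry inline document are all hypothetical; only the StreamListener callbacks tag_start, text and tag_end come from REXML.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener that collects the text found inside <title> elements.
class TitleCollector
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @titles = []
    @in_title = false
  end

  # Called for every start tag; open a new, empty title buffer.
  def tag_start(name, _attributes)
    return unless name == 'title'
    @in_title = true
    @titles << String.new
  end

  # Called for character data; text may arrive in several pieces,
  # so it is appended to the current buffer.
  def text(content)
    @titles.last << content if @in_title
  end

  # Called for every end tag.
  def tag_end(name)
    @in_title = false if name == 'title'
  end
end

# Hypothetical one-entry document, inlined so the sketch needs no file.
xml = '<bibliography><biblioentry><title>Personal Identity</title></biblioentry></bibliography>'

collector = TitleCollector.new
REXML::Parsers::StreamParser.new(xml, collector).parse
puts collector.titles
```

Since XPath is not available in this mode, the listener has to track where it is in the document itself, here with the @in_title flag.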