Sunday, February 20, 2011

Working with Office 2007 (OOXML) files using PL/SQL

Starting with Office 2007, Microsoft switched to an XML-based format called Office Open XML (OOXML).

There has been some debate as to how "open" this format really is, given that the specs run to around 7,000 pages (!).

Be that as it may, it's a fact of life that a lot of people use Microsoft's Office suite, and that means we have to deal with this new format in a lot of situations.

The OOXML format is, as it turns out, not so difficult to deal with. The main concept is that an Office document, whether it is a Word document (.docx), Excel spreadsheet (.xlsx) or Powerpoint presentation (.pptx), is actually a compressed (.zip) file that contains a number of XML documents (as well as any image files the user has included in the document).

So to work with OOXML files, we need to be able to zip and unzip files, and to parse and generate XML. Oracle (and PL/SQL) has had good support for XML for a number of years, but (even though there is a UTL_COMPRESS package in the database) there is no built-in zip/unzip support. Of course you could load some Java classes into the database to do it, but dealing with the Java stuff is always a bit of a hassle. But some time ago the good gentleman Anton Scheffer published a PL/SQL implementation based on UTL_COMPRESS that supports zipping and unzipping.

Based on this I have written a package for working with OOXML documents. It's called OOXML_UTIL_PKG and you can download it as part of (you guessed it) the Alexandria utility library for PL/SQL.

Let's see what this package allows us to do.


Get document properties from a Word (docx) file

First we fire up Word and create a test document:


By the way, you can read and write the new OOXML formats using an older version of Office (as I do in the screenshot above), by downloading the Microsoft Office Compatibility Pack from Microsoft.

After saving the document, we can then extract the document properties using the GET_DOCX_PROPERTIES function, which returns a custom record type called T_DOCX_PROPERTIES.



Extract plain text from a Word (docx) file

Using our test document again, we can extract the plain text of the document using the GET_DOCX_PLAINTEXT function, which returns a CLOB.


This is of course very useful if you want to search and/or index (just) the text of a document, or otherwise work with the content.

Get document properties from an Excel (xlsx) file

Let's first create an Excel test file (again, using Excel 2003 but saving in Excel 2007 format):



(This has to be one of the lamest spreadsheets of all time, but it will do fine as an example. It has some text, some numbers, and a formula.)

Similar to the Word document, we can now use the GET_XLSX_PROPERTIES function, which returns a custom record type called T_XLSX_PROPERTIES.


You'll notice that Word documents and Excel spreadsheets have slightly different properties.

Extract a cell value from an Excel (xlsx) file

The GET_XLSX_CELL_VALUE function allows us to retrieve a single value from a named cell in a specific worksheet, like this:


Technical Detail: In the XLSX format, strings (as opposed to numbers) are not stored in the actual cell where they are entered, but rather in a "shared strings" section. The cell just contains a reference back to this "shared strings" section. The GET_XLSX_CELL_VALUE function handles this for you, so you don't have to worry about that.


Extract multiple cell values from an Excel (xlsx) file

Since the function that extracts a single value from a spreadsheet must open the file, unzip it, and parse the XML content every time you call that function, there is another function (GET_XLSX_CELL_VALUES, notice the plural) that allows you to retrieve multiple values in one call. In other words, the file is unzipped and the contents parsed as XML just once, which is obviously more efficient.

Simply specify the names of multiple cells using an array of strings:





Write contents into OOXML file using PL/SQL

Since the contents of OOXML files are XML files, you can manipulate the existing content, or generate new content, and then save it back to the zip file that contains your document.

The following demonstrates one approach; it uses a Powerpoint file, but this technique will also work with Word and Excel files.

We create a Powerpoint 2007 file (.pptx) and put in some tags that we want to replace via code. In other words, this becomes a template that we can fill with dynamic values from the database.



The GET_FILE_FROM_TEMPLATE function takes a template file as input, and two string arrays: The tag names and actual values to replace the tags with. It unzips the file, performs the substitutions, writes back the file to the zip archive, and returns the file, which you can then save back to disk (or, more likely, store in the database or send to a web browser).

The code is trivial:



Here is the result when opening the output file:



So the next time you do a presentation, you could actually update your Powerpoint slides with the latest sales figures (or whatever) from within SQL*Plus...


Conclusion

Working with Office 2007 (OOXML) files from PL/SQL is easy and opens up many possibilities, both for extracting information from documents and storing them in the database, as well as generating or modifying OOXML files in the database server.

20 comments:

mnolan said...

Hi Morten

That's pretty damn cool, I've had quite a few requests in the past to generate XLSX and DOCX files and I've always put it in the too hard basket.

Not anymore... thanks again for sharing.

P.S. I have no idea where you find the time to do all this majic.

Cheers
Matt

Martin Giffy D'Souza said...

Hi Morten,

Thanks a lot for posting this! I was looking for something like this last week.

Martin

Patrick Wolf said...

Hi Morten,

looks really interesting! Well done!

Patrick

blogje said...

Have you seen this blog of mine?
Using my unzip package you can query the plain text of a OOXML Word document.
Anton

Morten Braten said...

@Anton: I have added a link to your blog post under the OOXML section of the Alexandria library page.

- Morten

dipakchandnani said...

Hi Morten,

You have a great idea.
Before that i did not found functionality to access Microsoft Documents from PLSQL. So it's a magic for me also.
I have not OOXML_UTIL_PKG package in my database.
Can I have a script of package to run in my database.

Morten Braten said...

@dipakchandnani: The link to download the OOXML_UTIL_PKG is provided in the blog post.

- Morten

Anonymous said...

his is excellent. I have managed to take a word doc as a template, substitute text values and save the result.
But I'm struggling with embedding line feeds into the substitution strings, for a letter address for example where the whole address is one item (to avoid blank lines)

Tried the chr(10)||chr(13) thing but just puts everything on one line.
Hope someone can help.

Regards
Mark

Anonymous said...

Hi,

Having a problem if I try to replace text in a docx with something containing the GBP £ character.
The resulting file will not open in Word saying that the £ is an illegal XML character.

Any suggestions?

Jiten Patel said...

Hi Morten,

Your blog is awesome, i learn few things which i never heard about. Actually, I want to read data from table and create ppt according to it. Can you please help me on this.

Thanks,
Jiten

Morten Braten said...

@Jiten: Creating a Powerpoint file from scratch would be a lot of work, therefore it is easier to start with an existing file, and add contents to that, via placeholders as I have demonstrated.

- Morten

David said...

Hi Morten,

that's a pretty solution. Thanks a lot!
But also I have issues by inserting special characters (like €, ä, ü, ö, ß) to documents. Can you let us know how to handle it?

Thanks a lot!
David

Anonymous said...

Hi Morten,

I have almost same problem like one of the guys above that has a problem if he tries to replace text in a docx with something containing the GBP £ character. I have the same problem when I use cyrillic characters.
The resulting file will not open in Word saying that the £ is an illegal XML character.

Any suggestions?
Thanks in advance.

Stojan Gacov Bulgaria

Morten Braten said...

@Stojan: Try encoding the value using the following function:

dbms_xmlgen.convert (l_your_value, dbms_xmlgen.entity_encode)

http://docs.oracle.com/cd/B19306_01/appdev.102/b14258/d_xmlgen.htm#i1013100


- Morten

Anonymous said...

Hi Morten, Stojan,

I had exactly the same issue and tryed to print special characters by using dbms_xmlgen.convert. It again failed and word is giving illegal XML character error.

So I fixed it by creating following function:
-- function converts string to hex what can be used
-- in ooxml document
-- created by: David Michel
-- created on: 19.02.2013
create or replace
function func_string_to_xmlhex(
p_string IN NVARCHAR2
)
RETURN CLOB
IS
v_hex_string CLOB;
v_return CLOB;
BEGIN
FOR i in 1..length(p_string) LOOP
SELECT case length(RAWTONHEX(substr(p_string, i, 1)))
when 2 then '&#x00'||RAWTONHEX(substr(p_string, i, 1))||';'
when 3 then '&#x0'||RAWTONHEX(substr(p_string, i, 1))||';'
else '&#x'||RAWTONHEX(substr(p_string, i, 1))||';'
end
INTO v_return
FROM dual;

v_hex_string := v_hex_string||v_return;
END LOOP;
v_return := v_hex_string;

RETURN v_return;
END;

Regards,
David

Anonymous said...

Morten - this is a great tool. When you use the template feature with excel is there a way to write a value rather than text?

Anonymous said...

Hi Morton,

Really great solution, but I'm not sure how to get a string (Adress) across multiple lines. CHR(13)||CHR(10) doesn't seem to work as somebody else also noted. Any idea how to solve this?

Regards.

Anonymous said...

I have the same problem... I don't know if some one has the solution??

I can put Chr(13)||Chr(10) :(

Anonymous said...

THIS IS THE ANSWERE http://msdn.microsoft.com/en-us/library/office/ee922775(v=office.14).aspx

xD

Anonymous said...

Great little utility...

But, I've got some issues writing to substitution strings that exist in a table in the document. Only the first column ever gets written, the other columns are ignored.

Also, it seems leading and trailing spaces get removed before substitution. Is there a call to trim somewhere in the code?

Nick