
Reading docx files with Ruby

Just a quick post about handling docx (and other *x formats) with Ruby.

These files are basically XML packed with zip, so reading them is as simple as extracting the zip and parsing the XML files. Here is a snippet that gets the page count from a docx file:
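A minimal sketch of such a snippet, assuming the rubyzip gem (`require 'zip'`) for unpacking and REXML for parsing. The page count is read from the `Pages` element in docProps/app.xml, which is a cached value written by whatever application last saved the file, so it may be missing or stale. The method names `pages_from_app_xml` and `docx_page_count` are made up for illustration.

```ruby
require 'rexml/document'

# Pull the cached page count out of the contents of docProps/app.xml.
# Returns nil when the file has no <Pages> element.
def pages_from_app_xml(xml)
  doc = REXML::Document.new(xml)
  # Compare local names so the default namespace on <Properties> does not matter.
  node = doc.root.elements.find { |el| el.name == 'Pages' }
  node && node.text.to_i
end

# Open the docx as a zip archive and read the page count from it.
# Assumes the rubyzip gem is installed; it is loaded lazily here.
def docx_page_count(path)
  require 'zip'
  Zip::File.open(path) do |zip|
    entry = zip.find_entry('docProps/app.xml')
    entry && pages_from_app_xml(entry.get_input_stream.read)
  end
end
```

With rubyzip installed, `docx_page_count('report.docx')` (a hypothetical file name) should return an integer, or nil if the archive carries no page count.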

Reading the text is also easy – it’s located in word/document.xml.
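As a rough sketch of what that could look like: in word/document.xml the visible text sits in `w:t` elements grouped under `w:p` paragraph elements, so collecting them in document order gives the plain text. This again assumes rubyzip and REXML; `text_from_document_xml` and `docx_text` are illustrative names, and the sketch ignores tabs, line breaks inside paragraphs, headers, footers and footnotes.

```ruby
require 'rexml/document'

# Namespace URI used by WordprocessingML elements (the usual "w:" prefix).
W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

# Turn the contents of word/document.xml into plain text,
# one line per paragraph.
def text_from_document_xml(xml)
  doc = REXML::Document.new(xml)
  paragraphs = []
  # each_recursive visits elements depth-first, i.e. in document order.
  doc.root.each_recursive do |el|
    next unless el.namespace == W_NS
    case el.name
    when 'p'
      paragraphs << String.new          # start a new paragraph
    when 't'
      paragraphs << String.new if paragraphs.empty?
      paragraphs.last << el.text.to_s   # append this text run
    end
  end
  paragraphs.join("\n")
end

# Read the main document part from the archive and extract its text.
# Assumes the rubyzip gem is installed; it is loaded lazily here.
def docx_text(path)
  require 'zip'
  Zip::File.open(path) do |zip|
    entry = zip.find_entry('word/document.xml')
    entry && text_from_document_xml(entry.get_input_stream.read)
  end
end
```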

I must say that this is a step in the right direction. Reading older Microsoft formats without the Windows API is hard and painful… Although it is possible to run MS Office under Wine and access it using Win32::OLE from Perl (I’ve tried it once, don’t ask…), there are licensing problems as far as I know.
