Extracting Text From a Corrupted DOCX Without Installing Additional Programs

Ported from Old Blog

Note: This only works on Mac and *nix.

This afternoon I received a call from my best friend. Her 100-page thesis file is corrupted. She has spent the last 2 months in Nepal and North-east India doing research. The backup copy of that file is on an SD card which is currently malfunctioning. She doesn’t have access to the internet. Panic ensued.

This makes for a very challenging problem for the following reasons:

  1. Lack of access to the internet - She can’t send me the file to fix or download file recovery applications.
  2. Mac OS - No “Open and Repair” option in Microsoft Word for Mac (at least according to her). No pre-installed GCC, so writing a C program is out of the question. I’m far from proficient at Unix commands beyond basics like “cd”, “rm”, and “cat”.

I vaguely remembered reading an article saying that a DOCX file is in XML format. That turned out to be partially true: a DOCX file is really a zip archive containing many XML files. Combining that fact with some googling of Unix commands and fiddling with regular expressions, I came up with the following steps, which I sent to her over SMS:

Step 1

Make a back-up of your corrupted docx file.

Step 2

Change the file extension to .zip, e.g. from my_paper.docx to my_paper.zip.

Step 3

Extract the zip file into another folder. Now go inside that folder. You should see a word folder. Go inside word and check that there’s a file called document.xml

Step 4

Fire up your Terminal. Navigate to the folder containing document.xml. Use cd, cd .., and ls for this step. (See? I know some unix commands too.)
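
For anyone who would rather stay in the Terminal, Steps 1 to 4 can be sketched as a small shell function. This is only an illustration: the function name extract_docx and the _backup and _extracted suffixes are made up for this example, and it assumes the unzip command is available. A badly damaged file may still make unzip complain, so the function keeps going and simply checks whether document.xml came out.

```shell
# Sketch of Steps 1-4 as a shell function. The name extract_docx and the
# _backup / _extracted suffixes are hypothetical, chosen for this example.
extract_docx() {
    docx="$1"
    base="${docx%.docx}"

    if [ ! -f "$docx" ]; then
        echo "not found: $docx" >&2
        return 1
    fi

    # Step 1: keep an untouched copy of the corrupted file.
    cp "$docx" "${base}_backup.docx" || return 1

    # Steps 2-3: unzip reads the archive whatever its extension is, so the
    # rename to .zip is not needed here. A damaged archive may make unzip
    # complain; carry on in case document.xml was extracted anyway.
    unzip -o -q "$docx" -d "${base}_extracted" ||
        echo "unzip reported problems with $docx" >&2

    # Step 4: confirm that word/document.xml exists and print its path.
    ls "${base}_extracted/word/document.xml"
}
```

With the file from Step 2, running extract_docx my_paper.docx followed by cd my_paper_extracted/word should leave you in the folder that Step 5 expects.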

Step 5

Copy and paste these 2 lines into the Terminal:

sed -r 's/<(\/?)w:p(\s*)(\w*)>/&\n/g' <document.xml >temp.txt
sed -r 's/<(\/?)(\w+)[^>]*(\/?)>//g' <temp.txt >recover.txt

The extracted text will be in recover.txt.

The first line adds a newline character after each <w:p blahblah> and </w:p> tag in document.xml, so that every paragraph ends up on its own line. <w:p> denotes a paragraph in DOCX. The second line then removes all the XML tags in the file, leaving just the text.
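
To see what the two commands do, here is a small self-contained run on a made-up snippet. The file names sample.xml, sample_temp.txt and sample_recover.txt and the XML content are invented for this example so that nothing touches a real document.xml. The -r flag and the \s, \w and \n escapes are GNU sed extensions, so this sketch assumes GNU sed; other sed implementations may behave differently.

```shell
# A tiny stand-in for word/document.xml: two paragraphs on a single line.
# The content and the sample_* file names are hypothetical.
printf '%s\n' '<w:document><w:body><w:p><w:r><w:t>Hello world</w:t></w:r></w:p><w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p></w:body></w:document>' >sample.xml

# First command: append a newline after every <w:p> and </w:p> tag.
sed -r 's/<(\/?)w:p(\s*)(\w*)>/&\n/g' <sample.xml >sample_temp.txt

# Second command: delete every remaining tag, keeping only the text.
sed -r 's/<(\/?)(\w+)[^>]*(\/?)>//g' <sample_temp.txt >sample_recover.txt

# Show the result.
cat sample_recover.txt
```

With GNU sed, sample_recover.txt should contain "Hello world" and "Second paragraph" on separate lines. Lines that held nothing but tags are left behind as blank lines, which is why the recovered text has some empty lines between paragraphs.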

There you have it :) Fully recovered text without resorting to installing any other program.