Extracting Text From Corrupted DOCX Without Installing Additional Program
1 December 2012. Ported from Old Blog.
Note: This only works on Mac and *nix.
This afternoon I received a call from my best friend. Her 100-page thesis file is corrupted. She has spent the last 2 months in Nepal and North-east India doing research. The backup copy of that file is on an SD card, which is currently malfunctioning. She doesn’t have access to the internet. Panic ensued.
This makes for a very challenging problem for the following reasons:
- Lack of access to the internet - She can’t send me the file to fix or download file recovery applications.
- Mac OS - No “Open and recover” option in Microsoft Word for Mac (at least according to her). No pre-installed GCC, so writing a C program is out of the question. I’m far from proficient at Unix commands beyond basics like “cd”, “rm”, and “cat”.
I vaguely remember reading an article saying that a docx file is in XML format. That turned out to be partially true. A docx is more like a zip file containing many XML files. Combining that fact with some googling of Unix commands and fiddling with regular expressions, I came up with the following steps, which I sent to her over SMS:
- Make a backup of your corrupted docx file.
- Change the file extension to .zip (e.g. from .docx to .zip).
- Extract the zip file into another folder. Now go inside that folder. You should see a word folder. Go inside word and check that there’s a file called document.xml.
- Fire up your Terminal. Navigate to the folder containing document.xml. You will need cd, cd .., and ls for this step. (See? I know some unix commands too.)
- Copy and paste these 2 lines:
  sed -r 's/<(\/?)w:p(\s*)(\w*)>/&\n/g' <document.xml >temp.txt
  sed -r 's/<(\/?)(\w+)[^>]*(\/?)>//g' <temp.txt >recover.txt
- The extracted text will be in recover.txt.
The first line puts a newline character after every <w:p blahblah> and </w:p> in document.xml, so each paragraph starts on a fresh line (<w:p> denotes a paragraph in docx). The second line removes all the XML tags in the file, leaving just the text.
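To see what each substitution does, here is a small sketch that runs them on a made-up snippet instead of a real document.xml. The sample string and the variable names are invented for illustration. It also assumes a sed that understands -r, \s, \w and \n in the replacement (GNU sed does); other seds may need the patterns adjusted.

```shell
# Hypothetical sample: two paragraphs roughly as they might appear in document.xml
sample='<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p><w:p><w:r><w:t>World</w:t></w:r></w:p></w:body>'

# Step 1: keep each paragraph tag (that is what & does) and add a newline after it
step1=$(printf '%s\n' "$sample" | sed -r 's/<(\/?)w:p(\s*)(\w*)>/&\n/g')

# Step 2: delete every tag, paragraph tags included, leaving only the text
step2=$(printf '%s\n' "$step1" | sed -r 's/<(\/?)(\w+)[^>]*(\/?)>//g')

printf '%s\n' "$step2"
```

With GNU sed, Hello and World should end up on separate lines, with blank lines where lines held nothing but tags. Those blank lines are harmless and easy to clean up once the text is back in a word processor.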
There you have it:) Fully recovered text without resorting to installing any other program.
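As an afterthought, the renaming and extracting could probably be skipped too. The sketch below is not what I sent over SMS. It assumes unzip is already on the machine, that the corrupted archive is still readable enough for unzip to find word/document.xml, and that sed accepts the same GNU-style patterns as above. The function names xml_to_text and docx_to_text and the file name thesis.docx are hypothetical.

```shell
# xml_to_text: read document.xml content on stdin, write plain text on stdout.
# Uses the same two substitutions as the manual steps, chained with a pipe
# so no temp.txt is needed.
xml_to_text() {
  sed -r 's/<(\/?)w:p(\s*)(\w*)>/&\n/g' |
    sed -r 's/<(\/?)(\w+)[^>]*(\/?)>//g'
}

# docx_to_text: hypothetical wrapper around the steps.
# unzip -p writes a single archive member to stdout, so the docx does not
# have to be renamed to .zip or extracted into a folder first.
docx_to_text() {
  unzip -p "$1" word/document.xml | xml_to_text
}

# Usage (thesis.docx is a placeholder name):
#   docx_to_text thesis.docx > recover.txt
```

Work on a backup copy here as well. If unzip cannot read the archive at all, this shortcut will not help.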