Extracting OLE Objects from Word Documents
12/06/08 18:12 Filed in: Linux
Many people have asked me how to extract the file embedded inside an OLE object that has been inserted into a Microsoft Word document, or similar.
I reverse-engineered the file format, it’s very simple. Not this code doesn’t always appear to work, but it gets 95% of them out.
Use it at your own peril. Please credit me (Julian Field jules@jules.fm) where/when/if you use this code or any derivative of it, including translations into other languages.
$byte = "";
$buffer = "";
#$infh = new FileHandle;
#sysopen $infh, "$explodeinto/$inname", O_RDONLY;
Open the infh filehandle with the "inname" file containing the OLE object.
sysseek $infh, 6, SEEK_SET; # Skip 1st 6 bytes
Skip the first 6 bytes, these appear to be useless
$outname = "";
$finished = 0;
$length = 0;
until ($byte eq "\0" || $finished || $length>1000) {
# Read a C-string into $outname
sysread($infh, $byte, 1) or $finished = 1;
$outname .= $byte;
$length++;
}
Read a null-terminated string of bytes,
this becomes the output filename.
next OLEFILE if $length>1000; # Bail out if it went wrong
If the filename was way too long, this is probably corrupt.
$finished = 0;
$byte = 1;
$length = 0;
until ($byte eq "\0" || $finished || $length>1000) { # Throw away a C-string
sysread($infh, $byte, 1) or $finished = 1;
$length++;
}
Throw away the next null-terminated string of bytes.
next OLEFILE if $length>1000; # Bail out if it went wrong
If the string was way too long, this is probably corrupt.
sysseek $infh, 4, Fcntl::SEEK_CUR or next OLEFILE; # Skip next 4 bytes
Skip the next 4 bytes of the file.
sysread $infh, $number, 4 or next OLEFILE;
$number = unpack 'V', $number;
Read the next 4 bytes into a 4-byte int called "$number".
#print STDERR "Skipping $number bytes of header filename\n";
if ($number>0 && $number<1_000_000) {
sysseek $infh, $number, Fcntl::SEEK_CUR; # Skip the next bit of header (C-string)
} else {
next OLEFILE;
}
If the number $number was a reasonable size,
skip that many bytes of the file.
sysread $infh, $number, 4 or next OLEFILE;
$number = unpack 'V', $number;
Read the next 4 bytes in a 4-byte int called "$number".
This is the length of the real embedded file we want to extract.
#print STDERR "Reading $number bytes of file data\n";
sysread $infh, $buffer, $number
if $number>0 && $number < $size; # Sanity check
Read the $number number of bytes into memory into a chunk
of memory allocated which is at least $number bytes long.
Do a sanity check that the number of bytes we have asked it to read
is less than the total length of the input file.
$outfh = new FileHandle;
$outsafe = $this->MakeNameSafe($outname, $explodeinto);
sysopen $outfh, "$explodeinto/$outsafe", (O_CREAT | O_WRONLY)
or next OLEFILE;
Create an output file with a filename which is a sanitised safe
version of the filename we read at the top of this bit of code.
if ($number>0 && $number<1_000_000_000) { # Number must be reasonable!
syswrite $outfh, $buffer, $number or next OLEFILE;
}
close $outfh;
If the output file is less than 1Gbyte long, write out the data we just read.
This creates the file containing the embedded file we wanted to extract.
Then close that output file.
I reverse-engineered the file format, it’s very simple. Not this code doesn’t always appear to work, but it gets 95% of them out.
Use it at your own peril. Please credit me (Julian Field jules@jules.fm) where/when/if you use this code or any derivative of it, including translations into other languages.
$byte = "";
$buffer = "";
#$infh = new FileHandle;
#sysopen $infh, "$explodeinto/$inname", O_RDONLY;
Open the infh filehandle with the "inname" file containing the OLE object.
sysseek $infh, 6, SEEK_SET; # Skip 1st 6 bytes
Skip the first 6 bytes, these appear to be useless
$outname = "";
$finished = 0;
$length = 0;
until ($byte eq "\0" || $finished || $length>1000) {
# Read a C-string into $outname
sysread($infh, $byte, 1) or $finished = 1;
$outname .= $byte;
$length++;
}
Read a null-terminated string of bytes,
this becomes the output filename.
next OLEFILE if $length>1000; # Bail out if it went wrong
If the filename was way too long, this is probably corrupt.
$finished = 0;
$byte = 1;
$length = 0;
until ($byte eq "\0" || $finished || $length>1000) { # Throw away a C-string
sysread($infh, $byte, 1) or $finished = 1;
$length++;
}
Throw away the next null-terminated string of bytes.
next OLEFILE if $length>1000; # Bail out if it went wrong
If the string was way too long, this is probably corrupt.
sysseek $infh, 4, Fcntl::SEEK_CUR or next OLEFILE; # Skip next 4 bytes
Skip the next 4 bytes of the file.
sysread $infh, $number, 4 or next OLEFILE;
$number = unpack 'V', $number;
Read the next 4 bytes into a 4-byte int called "$number".
#print STDERR "Skipping $number bytes of header filename\n";
if ($number>0 && $number<1_000_000) {
sysseek $infh, $number, Fcntl::SEEK_CUR; # Skip the next bit of header (C-string)
} else {
next OLEFILE;
}
If the number $number was a reasonable size,
skip that many bytes of the file.
sysread $infh, $number, 4 or next OLEFILE;
$number = unpack 'V', $number;
Read the next 4 bytes in a 4-byte int called "$number".
This is the length of the real embedded file we want to extract.
#print STDERR "Reading $number bytes of file data\n";
sysread $infh, $buffer, $number
if $number>0 && $number < $size; # Sanity check
Read the $number number of bytes into memory into a chunk
of memory allocated which is at least $number bytes long.
Do a sanity check that the number of bytes we have asked it to read
is less than the total length of the input file.
$outfh = new FileHandle;
$outsafe = $this->MakeNameSafe($outname, $explodeinto);
sysopen $outfh, "$explodeinto/$outsafe", (O_CREAT | O_WRONLY)
or next OLEFILE;
Create an output file with a filename which is a sanitised safe
version of the filename we read at the top of this bit of code.
if ($number>0 && $number<1_000_000_000) { # Number must be reasonable!
syswrite $outfh, $buffer, $number or next OLEFILE;
}
close $outfh;
If the output file is less than 1Gbyte long, write out the data we just read.
This creates the file containing the embedded file we wanted to extract.
Then close that output file.
Comments