Dowst.Dev

Extracting images from Word

Extracting images from a Word document is relatively straightforward. All you have to do is rename the document from .docx to .zip and extract it. Once you do that, all the images will be in the subfolder word\media.

However, with the help of PowerShell, we can not only automate the extraction but also copy the images to a new location and list the caption information for each one.

The first thing you need to do is rename the Word document with the .zip extension. To ensure the original Word document remains untouched, we’ll copy it to a temporary folder and rename it.
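
As a minimal sketch of this step, the helper below copies a document into a temporary working folder with a .zip extension and leaves the original alone. The function name Copy-WordDocumentAsZip is made up for illustration, and it assumes the user's temp directory is writable.

```powershell
# Illustrative helper (hypothetical name): copies a .docx to a temp folder as a .zip
function Copy-WordDocumentAsZip {
    param(
        [Parameter(Mandatory = $true)]
        [string]$DocumentPath
    )

    # Build a working folder under the temp directory, named after the document
    $baseName = [System.IO.Path]::GetFileNameWithoutExtension($DocumentPath)
    $tempRoot = Join-Path ([System.IO.Path]::GetTempPath()) 'mediaExtract'
    $extractPath = Join-Path $tempRoot $baseName

    # Start from an empty folder so files from an earlier run do not linger
    if (Test-Path $extractPath) {
        Remove-Item -Path $extractPath -Recurse -Force
    }
    New-Item -ItemType Directory -Path $extractPath | Out-Null

    # Copy rather than rename, so the original document remains untouched
    $zipPath = Join-Path $extractPath "$baseName.zip"
    Copy-Item -Path $DocumentPath -Destination $zipPath -Force -PassThru
}
```

Calling it with the path of your own document, for example Copy-WordDocumentAsZip -DocumentPath "D:\scripts\ImageExamples.docx", should return the file object for the zip copy.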

Once you have the zip file, you can run a simple Expand-Archive command to extract the contents of the Word document. You will find the images in the subfolder word\media.
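
The extraction itself can be sketched as a small wrapper around Expand-Archive that hands back the media folder. The name Expand-WordZip is hypothetical. Note that the returned folder may not exist if the document contains no images, so check it before using it.

```powershell
# Illustrative helper (hypothetical name): expands the renamed document and
# returns the folder where the embedded images are stored
function Expand-WordZip {
    param(
        [Parameter(Mandatory = $true)]
        [string]$ZipPath,
        [Parameter(Mandatory = $true)]
        [string]$ExtractPath
    )

    # A .docx is a zip package, so Expand-Archive can unpack the renamed copy
    Expand-Archive -Path $ZipPath -DestinationPath $ExtractPath -Force

    # Embedded images are stored under word\media inside the package
    Join-Path (Join-Path $ExtractPath 'word') 'media'
}
```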

Then you can have PowerShell copy the files to another directory. And if that is all you wanted to do, you are done.
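
A rough sketch of the copy step follows. Copy-ExtractedMedia is a made-up name, and the guard for a missing media folder is an extra precaution rather than something the steps above require.

```powershell
# Illustrative helper (hypothetical name): copies extracted images to a destination folder
function Copy-ExtractedMedia {
    param(
        [Parameter(Mandatory = $true)]
        [string]$MediaPath,
        [Parameter(Mandatory = $true)]
        [string]$Destination
    )

    # Nothing to copy when the media folder is missing
    if (-not (Test-Path $MediaPath)) {
        return
    }

    # Create the destination folder on first use
    if (-not (Test-Path $Destination)) {
        New-Item -ItemType Directory -Path $Destination | Out-Null
    }

    # -PassThru returns the copied files so later steps can work with them
    Get-ChildItem -Path $MediaPath -File | Copy-Item -Destination $Destination -PassThru
}
```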

However, we can take things a step further and parse the Word document to display the captions for each image.

To do this, you will need to load the document.xml file, found in the word subfolder, into a PowerShell XML object. This XML contains all the configuration and references for the Word document. You can then parse through each paragraph to find the ones that are images and the ones that are captions. Images will have a drawing element under the paragraph, and captions will have a fldSimple element.
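
To make the idea concrete, here is a small sketch that classifies paragraphs. The XML is a hand-written, heavily simplified stand-in for a real document.xml, and the Kind, DrawingId, and Field property names are invented for this example.

```powershell
# Simplified, hand-written sample; a real word\document.xml carries far more markup
[xml]$docXml = @'
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing">
  <w:body>
    <w:p>
      <w:r><w:drawing><wp:inline><wp:docPr id="1" name="Picture 1" /></wp:inline></w:drawing></w:r>
    </w:p>
    <w:p>
      <w:r><w:t xml:space="preserve">Figure </w:t></w:r>
      <w:fldSimple w:instr=" SEQ Figure \* ARABIC "><w:r><w:t>1</w:t></w:r></w:fldSimple>
      <w:r><w:t xml:space="preserve"> - Sample caption</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>Ordinary body text</w:t></w:r>
    </w:p>
  </w:body>
</w:document>
'@

# Label each paragraph by what it contains
$paragraphKinds = foreach ($p in $docXml.document.body.p) {
    $kind = 'Text'
    if ($p.r.drawing) { $kind = 'Image' }          # a run holds a drawing
    elseif ($p.fldSimple) { $kind = 'Caption' }    # a caption number field is present

    [pscustomobject]@{
        Kind      = $kind
        DrawingId = $p.r.drawing.inline.docPr.id
        Field     = $p.fldSimple.instr
    }
}

$paragraphKinds | Format-Table -AutoSize
```

With this sample, the three paragraphs should come back as Image, Caption, and Text, in that order.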

A child node named keepNext, found under the paragraph properties (pPr), lets you determine if a caption is above or below the picture. When the caption is below, the image will have the keepNext node, but when the caption is above, the caption paragraph will have the keepNext node. If there is no caption, neither will have the node.
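
The keepNext check can be sketched as below. The XML is again a simplified, hand-written sample covering one caption below an image and one caption above, and Test-KeepNext is a hypothetical helper name. It looks at the child node names of pPr instead of reading a value, since an empty element such as keepNext has no value to test.

```powershell
# Simplified, hand-written sample: image + caption below, then caption above + image
[xml]$docXml = @'
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing">
  <w:body>
    <w:p>
      <w:pPr><w:keepNext /></w:pPr>
      <w:r><w:drawing><wp:inline><wp:docPr id="1" name="Picture 1" /></wp:inline></w:drawing></w:r>
    </w:p>
    <w:p>
      <w:pPr><w:pStyle w:val="Caption" /></w:pPr>
      <w:fldSimple w:instr=" SEQ Figure \* ARABIC "><w:r><w:t>1</w:t></w:r></w:fldSimple>
    </w:p>
    <w:p>
      <w:pPr><w:pStyle w:val="Caption" /><w:keepNext /></w:pPr>
      <w:fldSimple w:instr=" SEQ Figure \* ARABIC "><w:r><w:t>2</w:t></w:r></w:fldSimple>
    </w:p>
    <w:p>
      <w:r><w:drawing><wp:inline><wp:docPr id="2" name="Picture 2" /></wp:inline></w:drawing></w:r>
    </w:p>
  </w:body>
</w:document>
'@

# Illustrative helper (hypothetical name): true when the paragraph properties contain keepNext
function Test-KeepNext {
    param([System.Xml.XmlElement]$Paragraph)
    @($Paragraph.pPr.ChildNodes.LocalName) -contains 'keepNext'
}

$paragraphs = @($docXml.document.body.p)
$summary = for ($i = 0; $i -lt $paragraphs.Count; $i++) {
    $p = $paragraphs[$i]
    [pscustomobject]@{
        Index    = $i
        IsImage  = [bool]($p.r.drawing)
        KeepNext = Test-KeepNext -Paragraph $p
    }
}

$summary | Format-Table -AutoSize
```

In this sample, keepNext should be reported on the first image (caption below) and on the second caption (caption above), and on nothing else.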

You can see this in the output below. Figures 1 and 3 have the captions below. Figure 2 has the caption above, and figure 4 does not have a caption.

Now all you need to do is parse through each image, match it with its appropriate caption, and output the results.
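
Before looking at the full function, the matching rule on its own can be sketched with plain objects. The sample list is invented and mirrors the four-figure layout described above. Get-CaptionIndex and the IsImage, KeepNext, and Text properties are hypothetical names used only for this illustration.

```powershell
# Illustrative helper (hypothetical name): returns the index of the caption
# paragraph for the image at $Index, or -1 when no caption is linked to it
function Get-CaptionIndex {
    param(
        [object[]]$Paragraphs,
        [int]$Index
    )

    $current = $Paragraphs[$Index]
    if (-not $current.IsImage) { return -1 }

    # The image carries keepNext: the caption is the paragraph below
    if ($current.KeepNext -and ($Index + 1) -lt $Paragraphs.Count) {
        return ($Index + 1)
    }
    # The previous paragraph carries keepNext: the caption is the paragraph above
    if ($Index -gt 0 -and $Paragraphs[$Index - 1].KeepNext) {
        return ($Index - 1)
    }
    return -1
}

# Made-up paragraph list: caption below, caption above, caption below, no caption
$sample = @(
    [pscustomobject]@{ IsImage = $true;  KeepNext = $true;  Text = '' }
    [pscustomobject]@{ IsImage = $false; KeepNext = $false; Text = 'Figure 1' }
    [pscustomobject]@{ IsImage = $false; KeepNext = $true;  Text = 'Figure 2' }
    [pscustomobject]@{ IsImage = $true;  KeepNext = $false; Text = '' }
    [pscustomobject]@{ IsImage = $true;  KeepNext = $true;  Text = '' }
    [pscustomobject]@{ IsImage = $false; KeepNext = $false; Text = 'Figure 3' }
    [pscustomobject]@{ IsImage = $true;  KeepNext = $false; Text = '' }
)

# Pair every image paragraph with its caption text, or an empty string
$results = for ($i = 0; $i -lt $sample.Count; $i++) {
    if ($sample[$i].IsImage) {
        $capIndex = Get-CaptionIndex -Paragraphs $sample -Index $i
        [pscustomobject]@{
            ImageIndex = $i
            Caption    = if ($capIndex -ge 0) { $sample[$capIndex].Text } else { '' }
        }
    }
}

$results
```

The bounds checks are there so that the first and last paragraphs never look outside the list when searching for a neighbor.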

You can find the full code below. Also, since it parses the XML and not Word itself, this function does not require Word to be installed.

Function Export-ImagesFromWord {
    <#
.SYNOPSIS
Extracts images from a Word document and copies them to a new location

.DESCRIPTION
Extracts images from a Word document and copies them to a new location. 
After the extraction, the caption information will be output to the screen

.PARAMETER DocumentPath
The path of the Word Document

.PARAMETER Destination
The folder to copy the file into

.EXAMPLE
Export-ImagesFromWord -DocumentPath "D:\scripts\ImageExamples.docx" -Destination "D:\scripts\images"

.NOTES
Does not require Word to be installed
#>
    [CmdletBinding()]
    [OutputType([PSCustomObject])]
    param(
        [Parameter(Mandatory = $true)]
        [string]$DocumentPath,
        [Parameter(Mandatory = $true)]
        [string]$Destination
    )

    # Create a temporary folder to hold the extracted files
    $BaseName = [System.IO.Path]::GetFileNameWithoutExtension($documentPath) 
    $extractPath = Join-Path $env:Temp "mediaExtract\$($BaseName)"
    If (Test-Path $extractPath) {
        Remove-Item -Path $extractPath -Force -Recurse | Out-Null
    }
    New-Item -type directory -Path $extractPath | Out-Null

    # Copy the Word document as a zip and expand it
    $zipPath = Join-Path $extractPath "$($BaseName).zip"
    $zip = Copy-Item $documentPath $zipPath -Force -PassThru
    Expand-Archive -Path $zip.FullName -DestinationPath $extractPath -Force

    # Get the media files extracted and copy them to the output folder
    $mediaPath = Join-Path $extractPath 'word\media'
    If (-not(Test-Path $Destination)) {
        New-Item -type directory -Path $Destination | Out-Null
    }
    $extractedfigures = Get-ChildItem $mediaPath -File | Copy-Item -Destination $Destination -PassThru | Select-Object Name, @{l = 'Figure'; e = { $null } }, 
        @{l = 'Caption'; e = { '' } }, @{l = 'Id'; e = { [int]$($_.BaseName.Replace('image', '')) } }, FullName

    # Get the document configuration
    $documentXmlPath = Join-Path $extractPath 'word\document.xml'
    [xml]$docXml = Get-Content $documentXmlPath -Raw

    # Get all the paragraphs to find the images and captions
    $paragraphs = $docXml.document.body.p | Select-Object @{l = 'keepNext'; e = { @($_.pPr.ChildNodes.LocalName).Contains('keepNext') } }, 
        @{l = 'Id'; e = { $_.r.drawing.inline.docPr.id } }, @{l = 'CaptionId'; e = { $_.fldSimple.r.t } }, @{l = 'Prefix'; e = { $_.r[0].t.'#text' } }, 
        @{l = 'Text'; e = { $_.r[-1].t.'#text' } }, @{l = 'instr'; e = { $_.fldSimple.instr } }

    # Parse through each paragraph to match the caption to the image
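    for ($i = 0; $i -lt $paragraphs.Count; $i++) {
        $capId = -1
        if ($paragraphs[$i].Id -gt 0 -and $paragraphs[$i].keepNext -eq $true) {
            # The image keeps with the next paragraph, so the caption is below it
            $capId = $i + 1
        }
        elseif ($i -gt 0 -and $paragraphs[$i].Id -gt 0 -and $paragraphs[$i - 1].keepNext -eq $true) {
            # The previous paragraph keeps with the image, so the caption is above it.
            # The $i -gt 0 check stops index -1 from wrapping around to the last paragraph.
            $capId = $i - 1
        }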

        if ($capId -gt -1) {
            $extractedfigures | Where-Object { $_.Id -eq $paragraphs[$i].Id } | ForEach-Object {
                $_.Figure = $paragraphs[$capId].CaptionId
                $_.Caption = "$($paragraphs[$capId].Prefix)$($paragraphs[$capId].CaptionId)$($paragraphs[$capId].Text)"
            }
        }
    }

    $extractedfigures | Select-Object Name, Figure, Caption, FullName
}

This post is part of the series Automation Authoring. Refer to the main article for more details on use cases and additional content in the series.
