One of the apps I find myself writing (repeatedly) is a tool for generating MS Word documents using an existing document as a template.
The most interesting practical application I can think of is software release notes. If you’re using dev-ops tooling such as Octopus, Bitbucket and Jira, you can pull in details about releases (Octopus), commits (Bitbucket) and issues (Jira) and assemble them into your document. The document is based on a template (branding and general information) and the specifics are filled in programmatically. Please note, I always advocate the need for a human reviewer.
This solution uses placeholders for simple text and repeating paragraphs in the Word document template. The placeholders are defined in the template and not in the code. Please consider the below.
The document on the left is an example of a completed (filled-out) document and the one on the right is the template. I prefer using the double-braces for placeholder “tags”.
The scope of my solution is:
Let’s get started.
0. Quick background
MS Word document files with the DOCX extension are actually ZIP files filled with XML, stylesheets and resources. OpenOffice Writer’s ODT files are also ZIP files, but the XML schema and layout are different. The code could be adapted to work with ODT files with a few small adjustments.
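To see this for yourself, here is a minimal sketch that lists the entries inside a DOCX. It is not part of the solution below: it uses the built-in System.IO.Compression classes rather than SharpZipLib, and the `DocxInspector` class and `ListEntries` method are names I made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class DocxInspector
{
    // Returns the entry names stored in a DOCX (or any other ZIP) stream.
    public static List<string> ListEntries(Stream docxStream)
    {
        var names = new List<string>();
        // leaveOpen: true keeps the caller's stream usable after we are done.
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                names.Add(entry.FullName);
            }
        }
        return names;
    }
}
```

Passing in `File.OpenRead` on a real document should show entries such as `[Content_Types].xml`, `word/document.xml` and `word/styles.xml`.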
1. Getting started
The DOCX is a ZIP file containing XML files and associated resources. We want to break open that archive and get to the thing we care about, the document text (“word\document.xml”). We can ignore everything else. We do this with the help of one of my favourite packages, SharpZipLib.
Below I’m going to extract all the files to a temporary location. You can do this in-memory but for this demo I’m keeping it simple.
using System;
using System.IO;
using ICSharpCode.SharpZipLib.Core;
using ICSharpCode.SharpZipLib.Zip;

public static void ExtractArchive(string docFilename, string tempPath)
{
    // The using block closes the stream even if extraction throws.
    using (ZipInputStream zipStream = new ZipInputStream(File.OpenRead(docFilename)))
    {
        byte[] buffer = new byte[4096];
        ZipEntry theEntry;
        while ((theEntry = zipStream.GetNextEntry()) != null)
        {
            // Skip directory entries and anything without a usable file name.
            if (!theEntry.IsFile || string.IsNullOrWhiteSpace(Path.GetFileName(theEntry.Name)))
            {
                continue;
            }

            string fullPath = tempPath.TrimEnd('\\') + "\\" + theEntry.Name.TrimStart('\\', '/');

            // CreateDirectory does nothing if the folder already exists.
            Directory.CreateDirectory(Path.GetDirectoryName(fullPath));

            // Copy the current entry's bytes out to the file.
            using (FileStream streamWriter = File.Create(fullPath))
            {
                StreamUtils.Copy(zipStream, streamWriter, buffer);
            }
        }
    }
}

string tempPath = Path.GetTempPath().TrimEnd('\\') + "\\hiim-" + DateTime.Now.ToString("yyyy-MM-dd-HHmmss") + "\\";
// A verbatim string (@"...") is needed here because "\d" is not a valid escape sequence.
ExtractArchive(@"C:\document1.docx", tempPath);
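As for the in-memory route mentioned earlier, here is a rough sketch of how it could look. It is an assumption on my part rather than the code used in this post: it swaps SharpZipLib for the built-in System.IO.Compression classes, and `DocxInMemory` and `ReadDocumentXml` are hypothetical names.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class DocxInMemory
{
    // Reads the main document part from a DOCX held in a stream,
    // without extracting anything to disk.
    public static string ReadDocumentXml(Stream docxStream)
    {
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            // ZIP entry names use forward slashes, whatever the operating system.
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException("word/document.xml not found; is this a DOCX?");
            }
            using (var reader = new StreamReader(entry.Open(), Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

This only covers reading. Writing the result back would need the archive opened in update mode, which is why I find the extract-to-a-folder approach easier to follow in a demo.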
I like to test the code for packing the files back into an archive at this point. I always feel like I should compartmentalise when working with different packages. Open the file in MS Word to ensure that there are no issues.
public static void CreateArchive(string outPathname, string folderName, string password = "")
{
    // Disposing the ZipOutputStream finishes the archive; the using blocks
    // also make sure the output file is closed if something throws.
    using (FileStream fsOut = File.Create(outPathname))
    using (ZipOutputStream zipStream = new ZipOutputStream(fsOut))
    {
        zipStream.SetLevel(9); // 0 = store only, 9 = best compression
        if (!string.IsNullOrEmpty(password))
        {
            zipStream.Password = password;
        }
        // Number of leading characters to strip so entry names are relative to folderName.
        int folderOffset = folderName.Length + (folderName.EndsWith("\\") ? 0 : 1);
        CompressFolder(folderName, zipStream, folderOffset);
    }
}

private static void CompressFolder(string path, ZipOutputStream zipStream, int folderOffset)
{
    byte[] buffer = new byte[4096];
    foreach (string filename in Directory.GetFiles(path))
    {
        FileInfo fi = new FileInfo(filename);
        // CleanName converts the relative path to ZIP conventions (forward slashes).
        string entryName = ZipEntry.CleanName(filename.Substring(folderOffset));
        ZipEntry newEntry = new ZipEntry(entryName);
        newEntry.DateTime = fi.LastWriteTime;
        newEntry.Size = fi.Length;
        zipStream.PutNextEntry(newEntry);
        using (FileStream streamReader = File.OpenRead(filename))
        {
            StreamUtils.Copy(streamReader, zipStream, buffer);
        }
        zipStream.CloseEntry();
    }
    // Recurse into sub-folders such as word\ and _rels\.
    foreach (string folder in Directory.GetDirectories(path))
    {
        CompressFolder(folder, zipStream, folderOffset);
    }
}

CreateArchive(@"C:\document2-test.docx", tempPath);
Directory.Delete(tempPath, true);
2. Simple text replacement
Although the document is XML, I’m going to treat it as plain text and manipulate it as a string. I’ve found this is the easiest way.
We still have to respect that XML special characters need to be escaped.
I’m starting with a simple find-and-replace for the linear placeholders.
using System.IO;
using System.Security;
using System.Text;

StringBuilder sb = new StringBuilder();
sb.Append(File.ReadAllText(tempPath + "word\\document.xml"));
// SecurityElement.Escape turns <, >, &, " and ' into XML entities.
sb.Replace("{{today_date}}", SecurityElement.Escape("01/01/2001"));
sb.Replace("{{name}}", SecurityElement.Escape("Ray"));
sb.Replace("{{dob}}", SecurityElement.Escape("01/01/1901"));
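If the list of placeholders grows, the same find-and-replace can be driven from a dictionary. This is a minimal sketch under my own assumptions: `PlaceholderFiller` and `Fill` are hypothetical names, and the keys are the tag names without the braces.

```csharp
using System;
using System.Collections.Generic;
using System.Security;
using System.Text;

public static class PlaceholderFiller
{
    // Replaces every {{key}} in the XML text with the XML-escaped value.
    public static string Fill(string xml, IDictionary<string, string> values)
    {
        var sb = new StringBuilder(xml);
        foreach (KeyValuePair<string, string> pair in values)
        {
            // A null value is treated as an empty string.
            string escaped = SecurityElement.Escape(pair.Value ?? string.Empty);
            sb.Replace("{{" + pair.Key + "}}", escaped);
        }
        return sb.ToString();
    }
}
```

The escaping matters as soon as a value contains something like `&` or `<`, which would otherwise leave the XML broken and the document unreadable. If a tag stubbornly refuses to be replaced, it may be that Word has split it across several runs in the XML; retyping the tag in one go in the template usually sorts that out.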
3. Find the repeating blocks of text
Next, I’m going to find the paragraphs that repeat and take them out of the document. Each extracted paragraph is treated as in step 2: its placeholders are replaced, once per item, and the results are inserted back into the document.
I’ve chosen to collate the repeated paragraphs but you could insert them directly into the document.
I’m using Tuples because Tuples are cool. Each one holds a start index and a length.
// Returns (start, length) of the span from the start of the paragraph containing the
// first marker to the end of the paragraph containing the second marker.
protected Tuple<int, int> getOuterParagraph(string fullText, string findTerm)
{
    string headTerm = "<w:p "; // the trailing space avoids matching <w:pPr>
    string tailTerm = "</w:p>";
    int headIndex = fullText.IndexOf(findTerm);
    if (headIndex < 0)
    {
        return null;
    }
    int tailIndex = fullText.IndexOf(findTerm, headIndex + findTerm.Length);
    if (tailIndex < 0)
    {
        return null;
    }
    // Walk back from the first marker to the opening tag of its paragraph.
    headIndex = fullText.LastIndexOf(headTerm, headIndex);
    if (headIndex < 0)
    {
        return null;
    }
    // Walk forward from the second marker to the closing tag of its paragraph.
    tailIndex = fullText.IndexOf(tailTerm, tailIndex);
    if (tailIndex < 0)
    {
        return null;
    }
    tailIndex += tailTerm.Length;
    return new Tuple<int, int>(headIndex, tailIndex - headIndex);
}

// Returns (start, length) of whatever sits between the two marker paragraphs.
protected Tuple<int, int> getInnerParagraph(string fullText, string findTerm)
{
    string headTerm = "<w:p ";
    string tailTerm = "</w:p>";
    int headIndex = fullText.IndexOf(findTerm);
    if (headIndex < 0)
    {
        return null;
    }
    int tailIndex = fullText.IndexOf(findTerm, headIndex + findTerm.Length);
    if (tailIndex < 0)
    {
        return null;
    }
    // The inner block starts just after the first marker's paragraph closes...
    headIndex = fullText.IndexOf(tailTerm, headIndex);
    if (headIndex < 0)
    {
        return null;
    }
    headIndex += tailTerm.Length;
    // ...and ends where the second marker's paragraph opens.
    tailIndex = fullText.LastIndexOf(headTerm, tailIndex);
    if (tailIndex < headIndex)
    {
        // Also covers "not found" (-1) and both markers sharing one paragraph.
        return null;
    }
    return new Tuple<int, int>(headIndex, tailIndex - headIndex);
}

string[] interests = new string[] { "Eating", "Being fat", "Looking for stuff to eat" };
string paragraph = sb.ToString();
Tuple<int, int> outerCoord = getOuterParagraph(paragraph, "{{repeat_interests}}");
if (outerCoord != null)
{
    string outerParagraph = paragraph.Substring(outerCoord.Item1, outerCoord.Item2);
    Tuple<int, int> innerCoord = getInnerParagraph(outerParagraph, "{{repeat_interests}}");
    if (innerCoord != null)
    {
        string innerParagraph = outerParagraph.Substring(innerCoord.Item1, innerCoord.Item2);
        // Build one copy of the inner paragraph per item.
        StringBuilder innerText = new StringBuilder();
        foreach (string interest in interests)
        {
            innerText.Append(innerParagraph.Replace("{{interest_item}}", SecurityElement.Escape(interest)));
        }
        // Swap the whole marker-to-marker block for the collated copies.
        sb.Remove(outerCoord.Item1, outerCoord.Item2);
        sb.Insert(outerCoord.Item1, innerText.ToString());
    }
}
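To make the template layout concrete, here is a condensed sketch of the same idea as a single function, with a tiny hand-written XML fragment to try it on. It is an illustration under assumptions, not a drop-in replacement: `RepeatExpander` and `Expand` are hypothetical names, and like the code above it assumes every paragraph opens with `<w:p ` followed by attributes.

```csharp
using System;
using System.Collections.Generic;
using System.Security;
using System.Text;

public static class RepeatExpander
{
    private const string ParagraphOpen = "<w:p ";   // assumes paragraphs carry attributes
    private const string ParagraphClose = "</w:p>";

    // Replaces the block bounded by two repeatTag paragraphs with one copy of the
    // paragraphs in between per item. Returns the input unchanged if the markers
    // cannot be located.
    public static string Expand(string xml, string repeatTag, string itemTag, IEnumerable<string> items)
    {
        int first = xml.IndexOf(repeatTag, StringComparison.Ordinal);
        if (first < 0) return xml;
        int second = xml.IndexOf(repeatTag, first + repeatTag.Length, StringComparison.Ordinal);
        if (second < 0) return xml;

        int outerStart = xml.LastIndexOf(ParagraphOpen, first, StringComparison.Ordinal);
        int outerClose = xml.IndexOf(ParagraphClose, second, StringComparison.Ordinal);
        int innerStart = xml.IndexOf(ParagraphClose, first, StringComparison.Ordinal);
        int innerEnd = xml.LastIndexOf(ParagraphOpen, second, StringComparison.Ordinal);
        if (outerStart < 0 || outerClose < 0 || innerStart < 0 || innerEnd < 0) return xml;

        int outerEnd = outerClose + ParagraphClose.Length;
        innerStart += ParagraphClose.Length;
        if (innerEnd < innerStart) return xml; // both markers share one paragraph

        string template = xml.Substring(innerStart, innerEnd - innerStart);
        var sb = new StringBuilder();
        sb.Append(xml, 0, outerStart);                 // everything before the block
        foreach (string item in items)
        {
            sb.Append(template.Replace(itemTag, SecurityElement.Escape(item)));
        }
        sb.Append(xml, outerEnd, xml.Length - outerEnd); // everything after the block
        return sb.ToString();
    }
}
```

In the template this means three paragraphs: one holding `{{repeat_interests}}`, one holding `{{interest_item}}`, and a closing one holding `{{repeat_interests}}` again. Only the middle paragraph survives, once per item.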
4. Finish
To finish, overwrite the original XML file and pack the files back into a ZIP.
File.WriteAllText(tempPath + "word\\document.xml", sb.ToString());
CreateArchive(@"C:\document2.docx", tempPath);
Directory.Delete(tempPath, true);
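Because the XML is handled as a plain string, one stray character is enough for Word to refuse the file. A quick well-formedness check on the string before writing it out can catch that early. This is a sketch of one way to do it, using the standard System.Xml.Linq parser; `XmlSanityCheck` and `IsWellFormed` are names I invented for the example.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XmlSanityCheck
{
    // Returns true if the text parses as XML; otherwise false, with the
    // parser's message (which includes line and position) in error.
    public static bool IsWellFormed(string xml, out string error)
    {
        try
        {
            XDocument.Parse(xml);
            error = null;
            return true;
        }
        catch (XmlException ex)
        {
            error = ex.Message;
            return false;
        }
    }
}
```

Calling it on `sb.ToString()` only proves the XML is well-formed, not that Word will like the content, so opening the result in Word is still the real test.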
5. Done
Being able to generate Word documents has been one of the most useful things I’ve built. I can appreciate the elegance of using the Office interop but there’s something very primal about ripping an archive apart and using string manipulation.
This kind of code might not be for every occasion but I really hope someone finds this interesting or useful.
Posted on Sat 13th Jan 2018
Modified on Sun 13th Mar 2022