Mackenzie Solutions Ltd

Style over substance? – Quality and consistency of styling in MS Word. Part 3 – A standards-based solution

18/11/2019

This is the last of three blog posts that look at challenges with Word styles (for background and a description of typical issues see Part 1; Part 2 describes typical solutions). This final part describes a standards-based approach that can be taken to check and normalize styles in a document (or batch of documents).

Part 3 – A standards-based solution

For many years the only way to do anything programmatically with a Word document was to use the MS Word application itself. This was due to the flexibility and complexity of the application (and therefore of the underlying data) and the lack of a publicly published specification for the file format (.doc). Any solution to these issues had to be built on top of Word itself using macros, Word plug-ins or Word automation. As Word is a client-side tool, all of these solutions had to be deployed on client machines and would not operate on a server. Any business rules for styles and content were embedded in the code of the application. This increased the complexity of roll-out and installation, and especially of maintenance where templates are updated frequently.

In 2003 Microsoft published a public XML specification (Microsoft Office XML) for documents that could be imported into or exported from MS Word 2003. For the first time, developers could safely generate (or more easily adapt/transform) Word documents outside of the MS Word application. This allowed automation solutions to be developed for business challenges such as:
  • conversion from Word to XML for publishers;
  • creation of customized contracts (with appropriate clauses inserted based on the information gathered) whose style reflects the corporate Word template; and
  • personalized reporting/marketing material (e.g. “your pension performance explained”).

The single-file format became a favorite of XML developers, who could transform it via XSLT into whatever output was required, but this approach was rarely adopted outside the publishing community or bespoke products.

Microsoft later replaced that format with “Office Open XML” (standardized as ISO/IEC 29500), which ultimately became the default read and write format for MS Word (i.e. “.docx”). Docx files are basically a zipped set of folders containing the XML files for the text, styles and comments (plus graphics) required for a Word document. This new format allows developers to work directly with the core document format of MS Word, but the developer needs to be able to “unpack” the package and update multiple files before repackaging them as a “.docx”. As XSLT cannot open ZIP files, many XSLT developers stuck to the old format.
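
The unpack, update and repackage cycle described above can be illustrated in a few lines. The following is a minimal Python sketch (it is not part of the tool described in this post) that uses only the standard zipfile module. The helper names and the two-part sample package are assumptions made for illustration; a package that Word can actually open needs further parts such as [Content_Types].xml.

```python
import io
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_sample_package():
    """Build a tiny stand-in for a .docx (illustrative only, not a complete document)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("word/document.xml", f'<w:document xmlns:w="{W}"/>')
        package.writestr("word/styles.xml", f'<w:styles xmlns:w="{W}"/>')
    return buffer.getvalue()


def list_parts(docx_bytes):
    """Return the names of the files stored inside the package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return package.namelist()


def read_part(docx_bytes, name):
    """Return one part of the package as text."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return package.read(name).decode("utf-8")


def replace_part(docx_bytes, name, new_text):
    """Repackage the file, copying every part and swapping in new content for one."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as source, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
        for part_name in source.namelist():
            if part_name == name:
                target.writestr(part_name, new_text.encode("utf-8"))
            else:
                target.writestr(part_name, source.read(part_name))
    return output.getvalue()
```

For example, replace_part(build_sample_package(), "word/styles.xml", new_styles) returns a new package in which only the styles part differs; every other part is copied unchanged.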

When developing a new standards-based solution for checking, reporting and fixing Word issues I turned to XProc. XProc is an XML language that allows users to define a “pipeline” of processing steps in a declarative way. XProc provides many built-in steps that can be combined according to your needs, which makes it well suited to processing Docx files. These steps include the ability to unzip, validate, compare, merge and manipulate XML, transform via XSLT and zip the results.

So, having dealt with the zipping and unzipping of documents, I needed a way to check the consistency and quality of the document style and content. While it is easy to validate the individual Word XML files against a schema (the “Office Open XML” schema), this only checks that the XML structure within the file matches what a Word document should have; it does not check compliance with any business-specific rules such as style conformance or mandatory text content.

Fortunately, there is another way to check rules in an XML document that DOES allow such business-specific checks. Schematron allows a document analyst to define whatever rules, simple or complex, are required to check the quality of a document and to provide information back to the business users on how to correct any issues. An example of a Schematron rule to test that a paragraph with the paragraph style “Heading 3” is immediately preceded by a paragraph with the style “Heading 2” is as follows.

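<!-- context must be a pattern that selects nodes; the w: prefix is assumed to be declared for the Word XML namespace -->
<sch:rule context="w:p[cm:getThisParaStyle(.)='Heading3']">
    <sch:assert test="cm:getParaBeforeStyle(.)='Heading2'" id="H3afterH2">Heading 3 must be immediately preceded by Heading 2</sch:assert>
</sch:rule>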
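
To make the logic of this rule concrete, here is a rough Python equivalent. It is only a sketch under stated assumptions: it is not how the Schematron helper functions are implemented, the function names are invented for illustration, and it looks only at paragraphs that are direct children of the document body (so paragraphs inside tables are ignored).

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def para_style(paragraph):
    """Return the paragraph style ID (w:pPr/w:pStyle/@w:val), or None if unstyled."""
    style = paragraph.find("w:pPr/w:pStyle", NS)
    return None if style is None else style.get(f"{{{W}}}val")


def check_h3_after_h2(document_xml):
    """Return the 1-based positions of Heading3 paragraphs not directly preceded by Heading2."""
    body = ET.fromstring(document_xml).find("w:body", NS)
    failures = []
    previous = None
    for position, paragraph in enumerate(body.findall("w:p", NS), start=1):
        current = para_style(paragraph)
        if current == "Heading3" and previous != "Heading2":
            failures.append(position)
        previous = current
    return failures


def sample_document(styles):
    """Build a minimal document.xml with one empty paragraph per style ID (test data only)."""
    paragraphs = "".join(
        f'<w:p><w:pPr><w:pStyle w:val="{style}"/></w:pPr></w:p>' for style in styles
    )
    return f'<w:document xmlns:w="{W}"><w:body>{paragraphs}</w:body></w:document>'


# The second paragraph is a Heading3 that follows a Heading1, so it is reported.
print(check_h3_after_h2(sample_document(["Heading1", "Heading3", "Heading2", "Heading3"])))
```

The difference from the Schematron version is the point of the comparison: here the rule is buried in a loop, whereas the declarative rule states only the context and the assertion.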

As these rules are declarative and separate from any logic used to process the Word file itself, a document analyst is free to develop and maintain them without having to be an expert programmer. The Schematron format is an open standard with plenty of documentation and training guides on the web, and it uses the XPath standard to identify the content whose validity is to be tested. I have also developed some simple helper functions such as “getThisParaStyle” to help document analysts identify content without needing a deep understanding of the underlying Word XML format. These rules can check for the existence and validity of fields and metadata, or check that content of a certain type has text that fits a particular pattern (regular expressions). If required, a library of these tests can be created and re-used.

Once a document has been processed by the tool, the errors or warnings from Schematron are presented back to the user as Word comments with the location of the comment providing the context for the error. Users can utilize Word’s review toolbar to navigate their way through the comments.
Once a user remedies the issue (e.g. by changing the style to the correct one or by moving an existing paragraph into the correct position), the file can be reprocessed: the existing errors/warnings are stripped and any new or remaining issues are added as new comments.

This is not the first solution to suggest using Schematron with Office documents, with feedback as comments, for this purpose (see Andrew Sales’s presentation at XML London), but I have tried to push the concept further:
  • Focusing on business cases other than those of supporting XML conversion from Word.
  • Enhancing the usability of the feedback provided to the users.
  • Performing the checks on native .docx files.
  • Detecting the type of document and selecting the correct Schematron rule files to use to check that file (therefore supporting general rules, corporate rules and template/content specific rules).
  • Checking that the styles used in the document match those in a reference master style file.
  • Providing options to strip existing user-generated comments (important before final delivery of a document) or to keep those comments.
  • Running fixes in XSLT steps prior to checking quality, where fixes can be automated.
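
As an illustration of the style comparison mentioned in the list above, the following Python sketch compares the style definitions in a document's styles.xml with those in a master styles.xml. This is an assumption about how such a check could work, not a description of the tool: the function names, the result keys and the sample styles are invented, and a fuller check would probably also consider document defaults and only the styles actually in use.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def style_definitions(styles_xml):
    """Map each style ID to a canonical serialization of its definition."""
    root = ET.fromstring(styles_xml)
    definitions = {}
    for style in root.findall(f"{{{W}}}style"):
        style_id = style.get(f"{{{W}}}styleId")
        serialized = ET.tostring(style, encoding="unicode")
        # Canonical XML sorts attributes; strip_text drops insignificant whitespace.
        definitions[style_id] = ET.canonicalize(serialized, strip_text=True)
    return definitions


def compare_to_master(document_styles_xml, master_styles_xml):
    """Report styles missing from the master and styles whose definitions differ."""
    document = style_definitions(document_styles_xml)
    master = style_definitions(master_styles_xml)
    unknown = sorted(set(document) - set(master))
    changed = sorted(
        style_id
        for style_id in document.keys() & master.keys()
        if document[style_id] != master[style_id]
    )
    return {"unknown": unknown, "changed": changed, "match": not unknown and not changed}


def sample_styles(*entries):
    """Build a minimal styles.xml from (style ID, spacing-before) pairs (test data only)."""
    body = "".join(
        f'<w:style w:type="paragraph" w:styleId="{style_id}">'
        f'<w:pPr><w:spacing w:before="{before}"/></w:pPr></w:style>'
        for style_id, before in entries
    )
    return f'<w:styles xmlns:w="{W}">{body}</w:styles>'


MASTER = sample_styles(("Heading1", 240), ("Heading2", 120))
DOCUMENT = sample_styles(("Heading1", 240), ("Heading2", 200), ("MyAdHocStyle", 0))
print(compare_to_master(DOCUMENT, MASTER))
```

In this sample, “Heading2” is reported as changed because its spacing differs from the master, and “MyAdHocStyle” is reported as unknown because the master does not define it.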

The XProc process can also record the quality of each document in an XML log file so that an entire library of documents can be checked for style conformance (especially important when beginning a new project that presumes consistency of content). The log can be queried or transformed (e.g. for loading into Excel) to provide business intelligence on a batch of documents.

<log date="2019-11-12">
   <entry stylesMatch="true"
          errorCount="5"
          warnCount="1"
          issues="H1notfirst H3afterH2 Bullet2 NumOne FirstCelltext"
          warnings="NoI"
          filename="test.docx"
          startDateTime="2019-11-12T16:37:49.614Z"
          endDateTime="2019-11-12T16:37:49.621Z"/>
</log>
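
As a hedged example of querying such a log, the Python sketch below turns the entries into CSV (which Excel can open) and counts how often each issue ID occurs across a batch. The column selection and function names are assumptions made for illustration, and the closing log tag is assumed in order to make the sample well-formed.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Sample log copied from the post (closing tag assumed).
LOG = """<log date="2019-11-12">
   <entry stylesMatch="true"
          errorCount="5"
          warnCount="1"
          issues="H1notfirst H3afterH2 Bullet2 NumOne FirstCelltext"
          warnings="NoI"
          filename="test.docx"
          startDateTime="2019-11-12T16:37:49.614Z"
          endDateTime="2019-11-12T16:37:49.621Z"/>
</log>"""

COLUMNS = ["filename", "stylesMatch", "errorCount", "warnCount", "issues", "warnings"]


def log_to_csv(log_xml):
    """Return the log entries as CSV text, one row per checked document."""
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    writer.writerow(COLUMNS)
    for entry in ET.fromstring(log_xml).iter("entry"):
        writer.writerow([entry.get(column, "") for column in COLUMNS])
    return output.getvalue()


def issue_counts(log_xml):
    """Count how many documents in the log raised each issue ID."""
    counts = Counter()
    for entry in ET.fromstring(log_xml).iter("entry"):
        counts.update(entry.get("issues", "").split())
    return counts


print(log_to_csv(LOG))
print(issue_counts(LOG).most_common())
```

On a log covering a whole library, the issue counts give a quick view of which rules fail most often and therefore which fixes are worth automating first.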
 
This XProc process could be invoked in a number of ways depending on the business requirement and IT limitations:
  • Run on the current Word file from a custom macro in Word (client side or posted to a server application).
  • Invoked from a workflow, content management or publishing solution as part of a “check” stage using Java or by running a BAT file.
  • Run from PowerShell when a file arrives in a specific network folder.
  • Run from a BAT file on a hierarchical folder full of Word files.
  • Run from XML processing tools such as Oxygen.
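
For the batch scenario in the list above, the folder walk itself is simple. The Python sketch below is an illustration only (the post mentions BAT files and PowerShell, not Python): it collects every .docx file under a root folder and passes each one to a checking function. The check parameter is a hypothetical callable standing in for whatever invokes the pipeline.

```python
import tempfile
from pathlib import Path


def find_docx_files(root):
    """Return every .docx file under root, skipping Word's "~$" lock files."""
    return sorted(
        path
        for path in Path(root).rglob("*.docx")
        if path.is_file() and not path.name.startswith("~$")
    )


def run_batch(root, check):
    """Apply check (a hypothetical callable) to each document and collect the results."""
    return {path.as_posix(): check(path) for path in find_docx_files(root)}


def demo():
    """Create a small folder tree and list the documents that would be checked."""
    with tempfile.TemporaryDirectory() as root:
        (Path(root) / "contracts" / "2019").mkdir(parents=True)
        for name in ["report.docx", "~$report.docx", "notes.txt", "contracts/2019/lease.docx"]:
            (Path(root) / name).write_bytes(b"")
        return [path.relative_to(root).as_posix() for path in find_docx_files(root)]


print(demo())
```

Note that the “*.docx” match is case-sensitive on some platforms, so files named with an upper-case extension would need an extra pattern.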

Final thoughts:
  • It is perfectly possible to achieve much of the same functionality in C# or VB .Net (especially using the Open XML SDK), but developing with open standards inspires us to think of new standards-based approaches that can deliver real business benefits (such as having declarative rules in Schematron and not spaghetti rules embedded in impenetrable code).
  • Without a well-thought-out Word template, it is not really possible to infer or validate styles.
  • While it is possible to automate some fixes (e.g. swapping out-of-date style names for new ones or selecting a more modern template), manual intervention will be required where content is missing or where a human must decide how best to rearrange it.
  • Where you need to capture complex semantic information, you have difficult publishing requirements, or you want to take advantage of component-based re-use and translation, you should consider authoring in an XML editor with schema and Schematron validation performed at source.

I will keep working on this solution to provide additional features (such as a way to allow users to select from a number of possible fixes) and to develop a library of functions to make development of rules and fixes easier. I hope to present the solution at an XML conference next year.
If you have any use cases you would like to suggest, any questions you would like to raise or if your company would like to use this approach and engage my services, please get in touch.


Style over substance? – Quality and consistency of styling in MS Word. Part 2 – Typical solutions

12/11/2019

This is the second of three small blog posts (for background and a description of typical issues see Part 1) that look at challenges with Word styles and the current approaches used to address those issues, concluding (see Part 3) with a description of some standards-based approaches that can be taken to check and normalize styles in a document (or batch of documents).

Part 2 – Typical solutions
​

Despite the volume of licenses sold to the corporate market, Microsoft have not focused on providing product features to increase the quality/consistency of styling in documents created by Word. There are manual ways (not obvious to many users) of checking style usage including:
  • View the document in draft mode (having first set the “Style area pane width” in “Advanced Options”, otherwise you will not see the style names) and look at the style names displayed;
  • Print a list of styles used in the document (when in Print Settings change “Print All Pages” selection to “Styles”); or
  • Manually click on each paragraph and view the style name in the Styles panel.

These manual approaches are not ideal as manual means “subject to human error” and they do not tell you if the latest/correct version of the style itself is in use.
Microsoft would probably state that by supporting templates, macros and APIs they have always enabled corporate and third-party developers to build whatever functionality is required. For reasons to be discussed in part 3 (standards-based solutions) and to provide the user with an interactive experience, historically styling solutions were all based around macros/plug-ins within Word or client-side automation using Word.

Typical approaches taken to ensure the quality of styles mostly fall into the following categories:
  • Template management: forcing the user to pick from one of a number of centrally managed templates or auto-loading a central template from a network drive when creating a new document.
    • But what if the user opens an old document or one sent in from a third-party and then saves it with a new name?
  • Customized editing experience: providing custom ribbons and dialogue boxes that aid the user by applying the correct style (somehow made more obvious via an icon?) of the many approved styles to a given paragraph.
    • But what if a user applies styles or formatting manually (if users are not trained in Word they will almost certainly get little training in any add-ons), does not apply any style or even does not enter content that is considered mandatory in a given scenario (e.g. all groups of “Warning Paras” must be preceded by a “Warning Title”)?
  • Document analysis and repair: providing reports on style use and a custom user interface to allow users to manually apply a selected style to one or more paragraphs. Some of these tools can also spot hard-coded textual references (e.g. “see clause 4.2”) and replace them with dynamic Word cross-references.
    • Can the “rules” for the styles be easily kept up to date as the template(s) changes?

When I have worked with clients who utilize these solutions, there has tended, over time, to be an issue maintaining them. The issues have included:
  • The solution no longer works since Word was upgraded (incompatible macros/plug-ins).
  • The solution no longer works since the template was upgraded (the template designer does not understand the style solution and IT do not understand complex Word templates).
  • Security changes (in Windows or in the organization) mean that the client-side code no longer runs.

If the code that is trying to identify and/or fix style issues has to contain the rules itself, then the process logic and business logic get muddled. Some tools utilize configuration files listing the style names that are allowed and the old style names that should be mapped to new style names. However, logic such as “do not allow a ‘Clause Level 2’ unless it is preceded by a ‘Clause Level 1’” is not easily expressed in a simple look-up table, never mind more complex logic that may look up multiple paragraphs in order to decide what is valid and may also use text pattern matching (e.g. if a heading matches the pattern “Appendix [A-Z]” then it should use the “Appendix heading” style).
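
The contrast can be seen in a small Python sketch. The look-up tables handle allowed and renamed styles easily, but the ordering and pattern rules have to be written as code inside the checker, which is exactly the muddling described above. Everything here is hypothetical: the style names in the tables, the issue IDs and the reading of “preceded by” as “a Clause Level 1 has appeared somewhere earlier in the document”.

```python
import re

# Look-up tables of the kind a configuration file could hold (hypothetical values).
RENAMED_STYLES = {"Clause L1 (old)": "Clause Level 1"}
ALLOWED_STYLES = {"Clause Level 1", "Clause Level 2", "Appendix heading", "Normal"}

# Rules that do not fit a look-up table end up hard-coded in the program.
APPENDIX_PATTERN = re.compile(r"Appendix [A-Z]")


def check_paragraphs(paragraphs):
    """Check (style name, text) pairs and return (position, issue ID) pairs."""
    issues = []
    seen_level_1 = False
    for position, (style, text) in enumerate(paragraphs, start=1):
        if style in RENAMED_STYLES:
            issues.append((position, "OldStyle"))
            style = RENAMED_STYLES[style]
        if style not in ALLOWED_STYLES:
            issues.append((position, "UnknownStyle"))
        if style == "Clause Level 1":
            seen_level_1 = True
        elif style == "Clause Level 2" and not seen_level_1:
            issues.append((position, "L2beforeL1"))
        if APPENDIX_PATTERN.match(text) and style != "Appendix heading":
            issues.append((position, "AppendixStyle"))
    return issues


SAMPLE = [
    ("Clause Level 2", "1.1 Scope"),
    ("Clause L1 (old)", "2 Terms"),
    ("Clause Level 2", "2.1 Payment"),
    ("Normal", "Appendix A Definitions"),
    ("Fancy Title", "Extra"),
]
print(check_paragraphs(SAMPLE))
```

Changing the clause rule, or adding a new one, means editing and redeploying this function, whereas the two tables could be updated by someone who never reads the code.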

Is there a way that we can:
  • Utilize a standard language to define what the rules are for the styling and content of a document in a way that supports the maintenance of the logic separately from both Word and the program that utilizes these rules?
  • Check that the latest style definitions are in use?
  • Find a process that does not need to be installed on the client machine so it is easier to maintain?
  • Apply fixes wherever automatable?
  • Report issues back to users using standard Word features?
  • Provide reports on libraries of documents summarizing the level of compatibility to current style rules?

We will explore the answers to these questions in part 3 of this blog.

Style over substance? – Quality and consistency of styling in MS Word. Part 1 – The problem with styles

11/11/2019

This is the first of three small blog posts that look at challenges with Word styles and the current approaches used to address those issues (see Part 2), concluding (see Part 3) with a description of some standards-based approaches that can be taken to check and normalize styles in a document (or batch of documents).

Part 1 – The problem with styles

There are few ubiquitous tools in IT, but Microsoft Word™ probably comes as close as any. With only a few exceptions (where a web-only deliverable means the content is authored directly in HTML, or where complex re-use and professional publication requirements mandate the use of XML) we all use Word to author important documents. Whether the documents are internal reports, legal contracts or consultancy proposals, it is vital to your organization that the documents:
  • reflect the latest corporate brand;
  • are consistent with other documents being delivered;
  • use the agreed numbering system (using auto-numbered paragraphs that can be dynamically referenced and chapter/appendix prefixes);
  • automatically create the correct table of contents (and table of figures/tables if required);
  • can be easily edited by others; and
  • are able to have content extracted and re-used in other documents or libraries of information.

This is only achievable in Word via the consistent use of styles in well-managed templates. However, even if your organization has developed and maintained these templates, documents will frequently have their consistency (and therefore quality) reduced due to:
  • use of old templates;
  • creation of Word documents from existing documents that do not use the latest template;
  • editing of the document outside of the organization-controlled environment (e.g. sending contracts to “the other side”); and
  • user error where formatting is applied manually (via buttons, format painter etc.) or where ad-hoc styles are created and used.

It is vital that we do not underestimate the issue of user error. Most business users are NEVER trained in Word as, in its simplest form, it is so easy to use. But it is not easy to use Word in the right way to achieve consistency, even in the most macro-heavy templated environment. Many of us have had to take over complex Word documents from business users in order to try to decipher what has gone wrong and make last-minute edits before deadlines. In many other cases these last-minute edits are made blind (“I just changed things till it looked right”) at the cost of consistency and of any other users of the document.

With typical Word workflows, errors in the styles being applied will directly result in presentation errors in the final delivered documents (as the delivery format is Word or PDF). In more complex workflows the Word documents may be:
  • converted and formatted using InDesign;
  • converted to HTML for web publishing; or
  • converted to XML for enrichment and/or multi-format delivery.

In all of these more complex workflows the correct use of Word styles is pivotal to the success of the process, because the styles are used to convert, brand or structure the data appropriately; missing or misused styles lead to invalid or substandard content.

So how do you know if your document has issues and is there any way to prevent or correct them? I will cover some of the options in my next post before suggesting a standards-based alternative that will not only identify problems in a single document but that can be run on an entire library of documents. Lack of style consistency/quality across thousands of documents would substantially increase the cost of any project designed to utilize that library as a consistent data set (and may even call the financial viability of the project into doubt).

If you have issues with Word that impact your production processes please let me know and I will try to discuss them in future posts. Issues that I have encountered include:
  • application of styles to wrong content/in wrong order;
  • use of manual mark-up instead of styles (or overriding styles to mimic other styles);
  • creation and use of non-supported styles (styles not defined in master template);
  • use of out-of-date styles/templates;
  • manual numbering (and chapter/appendix prefixing); and
  • lack of metadata (missing or incomplete properties or fields).

    Author

    Colin Mackenzie
    Thoughts on document computing and XML publishing (when time permits!)
