
Data Quality For Publishers – Part 3: 10 Things That Can Go Wrong

21/10/2015

This is the third in a series of short posts discussing the subject of data quality for publishers. In this post we will look at 10 challenges to maintaining data quality over the content lifecycle and during delivery to multiple products.

In my previous post (http://tinyurl.com/od5xae7) we looked at decisions made during the early days of a data modelling and conversion project. But that does not mean potential quality issues are at an end.
Let’s assume you have:
  • an XSD or RELAX NG schema to guide authors and validate results;
  • all your content nicely converted into XML and it validates against the schema;
  • content uploaded into a content management system;
  • an XML editing solution to update the content and add new content; and
  • a publishing process that takes the XML content and pushes it through a transformation pipeline to produce one or more products (e.g. searchable HTML on a website).
So what can go wrong?

Here are 10 things to watch out for and some brief suggestions on how to mitigate quality risks:

1. Re-importing External Content

Existing content is resupplied from an external source (not updated in your CMS/XML) but may not be of the required quality. This is not a difficult problem if it is just a few files (you can import them and put them through your normal enrichment/approval/QA procedures), but what if there are hundreds or thousands of files? Before accepting the content (perhaps it needs to be sent back to the outsourcers) and ingesting it into your “live” content pool (and potentially causing a massive clean-up effort later), consider at least some of the following:
  • Perform bulk validation against your DTD/XSD/RNG schema to ensure that the files are at least structurally correct (a rough sketch of such a bulk check follows this list).
  • Perform bulk validation using Schematron to ensure that the business rules within each file are met (there may be combinations of elements and attributes that the schema allows but that should never occur, or more advanced rules regarding the naming conventions of IDs).
  • Import the data into an XML database like BaseX (free) and run XQuery queries that will provide confidence that the data matches your expectations (e.g. how many documents do you expect to be badged with metadata as being “about” some subject or containing a certain phrase in the title?). You can also check whether data that you do not expect to find occurs (e.g. if a paragraph contains nothing but text inside an emphasis element then perhaps this is a title that has not been semantically identified). Such queries can be saved and used in future as a quality report (also for content generated within the CMS or exported as part of a bulk publishing process). Note: if your main CMS is XML aware then you could always do the import into a test document collection and run the queries there.
  • If you have no linguistic consistency tools to help, perform a manual “spot check” of the files, checking spelling, grammar and any controlled vocabulary.
  • If your publishing processes are automated and can be run standalone (i.e. not only from within your CMS) for regression tests or functional tests (see later) then push the test data through the process to ensure it delivers what is needed for your end products.
  • When you do accept a new batch of content into the system, ensure that the batch is clearly marked with metadata so that if any errors are found later you can easily isolate the content in the CMS and take appropriate action (bulk correction, roll-back to previous version etc.).
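As a rough illustration of the first check above, here is a minimal sketch in Python using lxml that walks a batch of resupplied files and reports any that are not well-formed or not schema-valid. The directory and schema file names ("delivery/", "content.xsd") are assumptions for the example, not part of any particular toolchain.

    # bulk_validate.py - minimal sketch: validate a batch of resupplied XML files
    # against an XSD before they are accepted into the "live" content pool.
    # "delivery/" and "content.xsd" are illustrative names only.
    from pathlib import Path
    from lxml import etree

    schema = etree.XMLSchema(etree.parse("content.xsd"))

    failures = []
    for xml_file in sorted(Path("delivery").glob("**/*.xml")):
        try:
            doc = etree.parse(str(xml_file))
        except etree.XMLSyntaxError as err:
            failures.append((xml_file, f"not well-formed: {err}"))
            continue
        if not schema.validate(doc):
            # error_log records why each file failed validation
            failures.append((xml_file, schema.error_log.last_error))

    for name, reason in failures:
        print(f"FAIL {name}: {reason}")
    print(f"{len(failures)} of the supplied files failed validation")

The same loop can be extended with lxml's RELAX NG and Schematron classes to cover the business-rule checks, and the resulting report kept alongside the batch metadata described in the last point.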

2. Reliance on WYSIWYG Editing

To make life easier for your authors you provide a WYSIWYG (What You See Is What You Get) authoring mode in your XML editor. WYSIWYG is a strange concept when used with XML single-source publishing, as the content may be published into multiple formats/styles or combined with other information sources, so the XML on screen is only a sub-component. Typically the “WYSIWYG” view will match a principal output (e.g. the content in a web product), with some extra facility provided for PDF rendering (e.g. using an XSL-FO processor or an XML+CSS processor). While these facilities help ease of use, you cannot rely on previews to ensure that the data being generated is semantically correct. The content may not be as rich as you would like (perhaps a formatting element like “strong” is used instead of an element like “cost”), but this is masked by the fact that in the main output (and the WYSIWYG view) they are formatted the same. Tags may be abused to make the content appear visually correct on screen while actually being semantically dubious. Ultimately you need to train your authors to understand the semantics (and the consequences of not expressing them), and provide a QA process to ensure that the semantics are correct. You may also consider providing a core authoring mode that helps disambiguate elements that are normally formatted the same on output, by adding prompts (like a form input) and/or different styles.

3. Check-In of Invalid Content to Your CMS

Does your CMS allow you to check in content that does not validate against your structural and/or business rules? On one hand, it may be helpful for authors to be able to safely and centrally store incomplete work (or to pass it over to another user via workflow etc.), but if that content is not easily identifiable as “complete and valid” then it could be published to your products, causing major issues and delays. This is normally solved by only publishing particular versions of a content item that have been through some QA and have reached a “published” workflow stage. Most commercial CMS systems will support such workflow/metadata/publishing features, but if this is not the case for your home-grown CMS then you should consider implementing them.

4. Metadata

Your XML schema has elements that are intended to contain metadata describing the information contained in the rest of the file. This may include the content title, subject matter classification, security or licensing information, version numbers and release dates. Other metadata may only be contained in fields in your CMS. So what can go wrong?
  • Metadata inside and outside the XML may get out of sync – the publishing process may need access to the metadata in the XML files, whereas for a non-XML-aware CMS the data may need to be extracted into fields for indexing so that it can be used for quick searches and for exporting certain categories of content. Either the data should only be held in one place (this may work for an XML CMS) or code must be written that automatically seeds the metadata from the master (database field or XML element) into the slave, either in the CMS or at the point of publishing.
  • Metadata classification can become corrupt/bloated with incorrect entries – hopefully you will have developed a taxonomy allowing your subject matter to be described according to an agreed constrained set of choices (so that authors do not describe one content subject as “cycles” and another similar content as “bicycles”) and to the appropriate level of detail. In some cases such constraints can be expressed as a choice using an XML schema/DTD, but this may not be enough to maintain consistency if:
    • The metadata is held only in the CMS database fields (see the previous point on “sync”) and not entered into the XML.
    • The taxonomy is hierarchical, complex and growing and is therefore held externally from the schema. In this case specific tools may need to be developed to help guide the user or to validate their choices against definitions held in a centrally maintained data store (e.g. an external XML file checked using Schematron); a rough sketch of such a check follows this list.
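As a minimal sketch of that kind of taxonomy check, the following compares each subject value in the content against a centrally maintained list. The file and element names ("taxonomy.xml", "term", "subject", "content/") are assumptions purely for illustration.

    # check_subjects.py - minimal sketch: flag subject metadata values that are not
    # in the centrally maintained taxonomy. File and element names are assumptions.
    from pathlib import Path
    from lxml import etree

    taxonomy = etree.parse("taxonomy.xml")
    allowed = {term.text.strip() for term in taxonomy.iter("term") if term.text}

    for xml_file in Path("content").glob("*.xml"):
        doc = etree.parse(str(xml_file))
        for subject in doc.iter("subject"):
            value = (subject.text or "").strip()
            if value not in allowed:
                print(f"{xml_file}: unknown subject '{value}'")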

5. Links - Cross-References and Transclusions

In most XML sources there is a need to support links. There are many types of links, including cross-references (link from this part of the document to that part of the document), images (put this image at this point in the content) and transclusions (get the content whose ID is the target of this link and include it at this point in the content). XML schemas only have very simple mechanisms for validating such constructs (e.g. ID/IDREF or key/keyref) that only work within the scope of the current XML file. Links can be complex and provide vital functionality for end products and for document assembly (e.g. maprefs in DITA), and they need to be maintained during the content lifecycle. What happens if a user tries to delete a section of file B that is the target of a link in file A? Many XML CMS systems support some validation for such scenarios (especially those based on DITA's comprehensive linking and component re-use standards), but simple content stores will not.
A simpler quality issue with links is the use of links to external URLs and the need to check that the URL is correct and functional.
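A minimal sketch of both checks is shown below, assuming (for illustration only) that cross-references use href attributes with “#id” fragments, that targets carry xml:id attributes, and that external links start with http. A real DITA- or CMS-based solution would be considerably more sophisticated.

    # check_links.py - minimal sketch: report cross-references whose target ID is not
    # defined anywhere in the collection, plus external URLs that do not respond.
    # The href/xml:id conventions and the "content/" directory are assumptions.
    from pathlib import Path
    from urllib.request import urlopen
    from urllib.error import URLError
    from lxml import etree

    XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
    docs = {f: etree.parse(str(f)) for f in Path("content").glob("*.xml")}

    # collect every ID declared anywhere in the collection
    known_ids = {el.get(XML_ID) for doc in docs.values()
                 for el in doc.iter("*") if el.get(XML_ID)}

    for f, doc in docs.items():
        for el in doc.iter("*"):
            href = el.get("href")
            if not href:
                continue
            if href.startswith("#") and href[1:] not in known_ids:
                print(f"{f}: broken cross-reference {href}")
            elif href.startswith("http"):
                try:
                    with urlopen(href, timeout=10):
                        pass
                except (URLError, ValueError) as err:
                    print(f"{f}: unreachable URL {href} ({err})")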

6. Evolving Business Need / Evolving Quality Expectations

As discussed in previous posts, it is common to find that product requirements change and evolve but that the content is unfortunately not agile enough to support the new functionality without enrichment. If you have a large corpus then it is unlikely that you will be able to convert and/or manually enrich all of the content at the same time. It is also quite likely that business-as-usual changes (fixes to existing content, urgent new content) will occur during the same period, meaning that content conforming to several different “profiles” needs to be re-published during the transition. This can cause major issues “downstream” in the publishing process, where code and systems that rely on the content may have to support multiple profiles at once, or perhaps simply cannot be updated in time to support the new content.
These problems can be avoided by ensuring that the final product process is isolated from the CMS/publishing process. The CMS publishing process should be able to create content that is backwards compatible with an agreed delivery schema, to ensure old products do not break. This schema can then be used to validate the generated content prior to it being pushed through the different product delivery pipelines. If a validation issue occurs at this stage then it is the author's responsibility to fix it, whereas if issues are found in the product then the investigation should start with the product developers.
This process should be implemented via an XML pipeline such as XProc or Ant that allows multiple stages of XSLT transformation and validation.
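In XProc or Ant the stages would be declared as pipeline steps; the sketch below uses Python and lxml purely to illustrate the “transform, then gate on the agreed delivery schema” sequence. The file names ("source.xml", "to-delivery.xsl", "delivery.xsd") are assumptions for the example.

    # publish_pipeline.py - minimal sketch of a transform-then-validate publishing gate.
    from lxml import etree

    source = etree.parse("source.xml")

    # Stage 1: transform CMS content into the agreed delivery format
    to_delivery = etree.XSLT(etree.parse("to-delivery.xsl"))
    delivery_doc = to_delivery(source)

    # Stage 2: gate on the delivery schema before any product pipeline sees the content
    delivery_schema = etree.XMLSchema(etree.parse("delivery.xsd"))
    if not delivery_schema.validate(delivery_doc):
        raise SystemExit(f"Delivery validation failed:\n{delivery_schema.error_log}")

    # Stage 3: only validated content is handed to the product-specific pipelines
    delivery_doc.write("delivery.xml", xml_declaration=True, encoding="UTF-8")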

7. Internal Data Can Leak into the Outside World

In your schema you may have the ability to gather authors' comments and change notes in a structured way. You can easily make the dropping of such content part of your publishing process. But what happens if an alternative process is developed that does not re-use the original code that drops these comments? Alternatively, what happens if a tiny change to the priority of an XSLT template means that the code that drops the internal comments is no longer run? If you have a “delivery” schema that does NOT allow these comment elements, and a standard publishing pipeline that always checks the content prior to it being pushed through product-specific publishing processes, then any such mistakes will be identified automatically before it is too late.
This approach can also be used to check if published output contains any material marked as being of an inappropriate security classification or end-user/product licensing scheme.
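Alongside the delivery-schema check, a quick targeted scan of the generated output can act as a belt-and-braces guard. The sketch below (the element names and the "delivery/" directory are assumptions for illustration) fails the publish if any internal-only elements survive into the delivered XML.

    # check_leaks.py - minimal sketch: fail the publish if internal-only elements
    # survive into the delivered XML. Element names and directory are assumptions.
    from pathlib import Path
    from lxml import etree

    FORBIDDEN = ["authorComment", "changeNote", "internalOnly"]

    leaks = []
    for xml_file in Path("delivery").glob("*.xml"):
        doc = etree.parse(str(xml_file))
        for name in FORBIDDEN:
            if doc.findall(f".//{name}"):
                leaks.append((xml_file, name))

    for xml_file, name in leaks:
        print(f"LEAK {xml_file}: contains <{name}>")
    if leaks:
        raise SystemExit(1)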

8. Different Products May Need Different or Additional Content

Different products may need the same source published to different schemas, or additional derived files generated alongside the main output. Even if the original source XML is good, subtle changes can cause these “derived files” to be wrong. In this case validation scripts and schemas can be created to validate the quality of these alternative files.
Examples of this would include (a sketch of one such check follows this list):
  • XML delivered to a customer-facing schema
  • Additional files describing contents of collections (e.g. SCORM manifests)
  • Additional files containing supplementary information (e.g. a pre-processed table of contents, or lists of words to appear in “word wheels” in search dialogs).
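As a rough example of validating such a derived file, the sketch below checks that every resource referenced from a package manifest actually exists in the package. The manifest layout, the "package/" directory and the href attribute are assumptions; for real SCORM packages you would also validate imsmanifest.xml against its schemas.

    # check_manifest.py - minimal sketch: confirm that every file referenced from a
    # package manifest actually exists in the package directory.
    from pathlib import Path
    from lxml import etree

    package = Path("package")
    manifest = etree.parse(str(package / "imsmanifest.xml"))

    missing = []
    for el in manifest.iter("*"):
        href = el.get("href")
        if href and not href.startswith("http") and not (package / href).exists():
            missing.append(href)

    for href in missing:
        print(f"Missing resource referenced by manifest: {href}")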

9. Avoiding Duplication

So your CMS is now the master content and all products are created via single-source publishing of that content, great. How do you avoid duplicate content being created within your CMS, which may lead to vital updates being performed on one copy but not another? This can be tricky to solve, especially where you are storing information components rather than entire documents. If you are also importing legacy material from a time before you had a re-use strategy then this makes things even harder. While there are some tools that perform linguistic analysis and comparison which may help identify duplication, the first step is to have a clearly defined re-use strategy with user instructions on the process to follow before creating new resources.
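Dedicated linguistic-analysis tools go much further, but even a crude comparison of normalised component text can surface candidate duplicates for review. The sketch below uses Python's difflib; the 0.9 similarity threshold and the "components/" directory are assumptions for illustration.

    # find_duplicates.py - minimal sketch: flag pairs of components whose normalised
    # text is very similar, as candidates for manual de-duplication review.
    from itertools import combinations
    from pathlib import Path
    from difflib import SequenceMatcher
    from lxml import etree

    def normalised_text(path):
        doc = etree.parse(str(path))
        return " ".join("".join(doc.getroot().itertext()).split()).lower()

    texts = {p: normalised_text(p) for p in Path("components").glob("*.xml")}

    for (a, text_a), (b, text_b) in combinations(texts.items(), 2):
        ratio = SequenceMatcher(None, text_a, text_b).ratio()
        if ratio > 0.9:
            print(f"Possible duplicates ({ratio:.2f}): {a} and {b}")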

10. Globally Unique and Persistent IDs

You have always created IDs for some content items (in fact they may even be mandatory for some elements) as the content is authored. These IDs will be unique within the XML file you edit (see point 5 above) but may not be unique when your content is combined with other content during the publishing phase. Additionally, internal IDs may be replaced (or new IDs added to elements that do not have them) as the content is published, with IDs generated at publish time (e.g. using the XSLT generate-id function for content that is the target of an automated table of contents or other delivery features). This may meet the needs of the products being created until such time as an end user needs to be able to reliably target a piece of content in a delivered interactive product using an ID (e.g. a bookmark or personal annotation function on a website). Every time the content is republished, generated IDs will be created with different values, causing the end users' bookmarks/annotations to fail. Inconsistency of IDs across publishing cycles also causes issues with automated regression testing of the delivered XML.
Wherever possible, permanent globally unique IDs should be created as early in the process as possible and stored in the content so they remain persistent across content publishes.
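One rough way to do this is to stamp a permanent ID onto every element that end users might target, at (or soon after) authoring time, rather than relying on IDs generated at publish time. The element names ("section", "topic"), the xml:id attribute and the "content/" directory in the sketch below are assumptions for illustration.

    # add_persistent_ids.py - minimal sketch: give every section-level element a
    # permanent ID at authoring/ingestion time if it does not already have one, so
    # that the ID survives future publishes.
    import uuid
    from pathlib import Path
    from lxml import etree

    XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

    for xml_file in Path("content").glob("*.xml"):
        doc = etree.parse(str(xml_file))
        changed = False
        for el in doc.getroot().iter("section", "topic"):
            if not el.get(XML_ID):
                el.set(XML_ID, "id-" + uuid.uuid4().hex)
                changed = True
        if changed:
            doc.write(str(xml_file), xml_declaration=True, encoding="UTF-8")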
A common factor in the issues above is the need to make content processes resilient to inevitable change. One of the best ways to do this is to implement content regression tests. In a future post I will discuss technical challenges and solutions for content regression testing.
​
Previous: Quality And Content Enrichment

Data Quality For Publishers – Part 2: Quality and Content Enrichment

14/10/2015

This is the second in a series of short posts discussing the subject of data quality for publishers. In this post we will look at resourcing and decision-making during the initial content analysis and modelling phases and how this can affect the quality and maintainability of your data.

As outlined in the first post (http://tinyurl.com/nnhbxcj) of this series, content developers are often under pressure to deliver quick results when adding a new source to a publisher’s corpus of valuable data. These demands may not be unreasonable as timelines are often driven by an urgent business need such as competitive pressure to deliver a new product. Another driver for quick content results may be that the product is being developed in an Agile way, and initial data is needed to support deliverables from development sprints. This article will discuss how a content developer can approach the analysis and modelling tasks within such strictures and how the business should compensate for the impact of any short cuts taken to avoid a permanent effect on quality.

The scenario is a common one. The business has acquired some content (or has decided to move away from a legacy non-XML workflow) in an unstructured, inconsistent format (MS Word, FrameMaker, HTML etc.) and the new content is to be added to the corporate repository and in future delivered to customers in a variety of formats and products.

The ideal way forward is to follow best practice for the content analysis, logical model and physical schema (ideally re-using a suitable existing standard or through customization of a framework like DITA). Following the proper identification of the full data semantics, you then convert the data (including manual enrichment where required) to match your schema and fully validate its quality in various iterations using various techniques. However, time or initial cost constraints may prevent best practice being followed, or may mean that enrichment tasks perceived as not currently delivering value (i.e. not required for the current product deliverable) are not performed because they are not high priority. So what should you do?
  • No matter what the constraints are, do concentrate on talking to subject matter experts; really understand the data (not just how it is currently styled or marked up) and its consistency before documenting the full logical semantic model. Play the semantic model back to the subject matter experts in order to confirm your assumptions (these data experts may not be around when you come to review/improve the data in future). If you are part of an Agile project, you can perform these tasks as part of early sprints, delivering knowledge and designs without necessarily having to deliver misleading data.
  • If you need to make compromises in what will be delivered, document the short-term model as a comparison to the full semantic model and illustrate the effect of these decisions. This is vital: even if the decision is ultimately made not to convert/improve the data to its full semantic potential initially, the business cannot understand what knowledge and opportunities may be lost unless it understands the knowledge that should exist within the content.
  • Ensure that you establish that the newly converted XML content will be the authoritative and maintained single source for new publications. This will allow ongoing semantic improvements (along with keeping content current) to be performed over time. If the content is NOT the master content (and the source will be updated externally or in some other form), ensure the business understands that costs incurred for the initial project will be repeated EVERY time the content is updated and that ongoing semantic enrichment will not be cost effective in most cases.
  • While you cannot rely on styling mark-up in the source material to fully reveal semantics (many items may be headings or bold in this presentation format but for different underlying reasons), the style hints will provide a record of what content deserved special treatment (formatting). If short cuts in conversion are made and this styling is discarded during conversion (and not replaced with appropriate full semantic mark-up), then hints may be lost that could be used for content improvement at a later date (e.g. over time you find out that the bold text inside the first item of a list inside a specific element actually means something specific and needs to be styled, searched or extracted). Consider keeping such additional “hints” in the converted source using general mark-up (the equivalent of an HTML “span” with a “class” attribute) that can be ignored until a future project that wants to enrich the content needs it (a rough sketch of this idea appears after this list). In many cases it may be impossible to go back to the original source for such information once the converted content has its own editorial lifecycle (as the original source is now out of date). Keeping these hints may also be helpful if there is a requirement to “round trip” updated content back to its original format.
  • Avoid “tag abuse” (incorrectly marking up semantic content) which will ultimately lead to applications and products losing quality even if the initial product deliverable (e.g. a simple HTML view) functions correctly.
    • Do not “crowbar” your new content into some existing supported schema that does not really describe the new data.
    • Do not “flatten” important semantics that can automatically be derived from information in the source to presentational mark-up (e.g. bold, heading etc.) even if you have no initial use for that semantic.
  • Even if the content is to be converted from the source and delivered directly to a product format without an intermediate/ongoing XML editorial stage, do make a schema (or customize an existing framework) and record what semantics you do identify for an intermediate XML stage. This can provide business value and quality indicators in a number of ways. Even if the XML is not highly structured, bulk validation against known expected “styles” in a large corpus can act as a means to identify data anomalies or gaps in your understanding of the data that would not be obvious simply by looking at the results of an XSLT conversion.
  • Where compromises are made, do record the “technical debt” (see https://en.wikipedia.org/wiki/Technical_debt) which should be a concept most IT organizations understand. The key difference with content enrichment technical debt is that unlike remedial action for programming language source code (whose impact may be limited to simply rewriting the code and interfaces to other systems), the remedial actions for content may need a massive manual effort to add in semantics that are not part of the current data set. Even with off-shoring this can be costly and time consuming and affect the business viability of future deliverable products. The detail of what was done and why may be lost in the mists of time.
    How can we make proper business decisions without understanding the impact of such decisions now and in the future?
  • Identify where the approach taken for a given project varies from corporate standard approaches (content quality, duplication not re-use, schemas etc.) and may lead to inconsistency in the corporate knowledge store. This is especially dangerous where the lower quality/inconsistent quality material is later combined with other better sources as part of interactive or print products and customer-facing errors occur that are incorrectly perceived as being product issues. This mixing of good and bad content is equivalent to “polluting your content reservoir”. Ensure that any content that does not meet normal standards (but for business reasons needs to be delivered quickly for use in specific ways for specific projects) is clearly marked with searchable metadata identifying the specific source/batch and its “quality grade”. While this does not fix the content it does make it more traceable so that it can be found programmatically and remedied at a later date. When remedial action does occur to batches of the content, the metadata can then be updated allowing XSLT or other code to change its behaviour based on the quality found.
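As a minimal sketch of the “keep the style hints” idea above, the following rewrites any leftover presentational elements as generic spans with a class attribute so that the hint survives for future enrichment, while current outputs can simply ignore it. The element names and file names are assumptions for illustration, not a prescribed conversion approach.

    # preserve_hints.py - minimal sketch: keep presentational "hints" from the source
    # as generic span/class mark-up instead of discarding them during up-conversion.
    from lxml import etree

    STYLE_HINTS = {"b": "bold", "i": "italic"}

    doc = etree.parse("converted.xml")
    for tag, hint in STYLE_HINTS.items():
        for el in list(doc.getroot().iter(tag)):
            el.tag = "span"            # generic wrapper, ignored by current outputs
            el.set("class", hint)      # the original styling kept as a hint

    doc.write("converted-with-hints.xml", xml_declaration=True, encoding="UTF-8")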

While any project rarely gets to do the perfect job first time, when dealing with high volumes of valuable data it is vital that the impact of early decisions is understood and managed by the business as a whole, rather than causing an unwelcome surprise and costly delays as future needs evolve.

An alternative scenario to the one given at the start of this post is the case where the master source content is always updated outside of your control and re-supplied each time in a less structured format than is ideal. This can be most frustrating (and expensive) as errors constantly re-appear and require repeated correction.
​
In the next post I will discuss approaches that can be taken to check and maintain the quality of content throughout its lifecycle and some challenges that are often encountered.

Previous : The Cost of Quality
Next: 10 Things That Can Go Wrong

Data Quality for Publishers – Part 1: The Cost of Quality

25/9/2015


This is the first in a series of short posts discussing the subject of data quality for publishers. Subsequent posts in this series will describe the difficulty of ensuring the ongoing quality of structured information in a cost-effective manner and outline some effective solutions that can be applied.


All professional information publishers aspire to create and deliver good quality output. Whether you are a commercial publisher (your primary products are published data/information systems) or a corporate publisher (publishing in support of other products/services that you deliver), your organisation's name is stamped all over your output and its overall reputation is therefore intertwined with customers' perception of the content you deliver.

If you are an ultimate “luxury brand” you may go to enormous expense to ensure this reputation cannot be tarnished (the world's most expensive toilet roll is hand signed and dated by the maker and inspected by the firm's president: http://tinyurl.com/pyrm5ts), but for most organisations maintaining quality has to compete for budget. This can be an ever more challenging task where there is a drive to reduce costs and speed up time to market.
This pressure may lead to a re-assessment of what is “acceptable data quality” for a given project or data source. 

Data value 

 When deciding on the appropriate level of investment in data quality it is vital to assess the current and potential value of the data. Consider the following:
  • Accuracy - Is it vital that the information is highly accurate and accurately represented within the delivered product (and are you culpable if not)?
  • Exclusivity - Are you the sole owner/provider of the data? If so, customers may “take what they can get” now, but over time enrichment may be an even greater income generator as clients cannot get the data elsewhere.
  • Usefulness - Does the data provide a substantial benefit for your customers? 
  • Timeliness - Does the data only have value for a short period of time and therefore must be delivered quickly?
  • Longevity - Is the data something that would be kept and provide value over a long period of time?
  • Intelligence - Does the data become more valuable if it is enriched so that it is more functional (searchable, can answer questions, can be better manipulated by specific delivery mechanisms)?
  • Relationships - Can information within this content be linked to other information in your corpus (or that is publicly accessible or commercially available) in order to provide better value?
  • Re-use - At a document level, could this data be seen as part of a larger corpus for one or many applications or publications, published to one or more media types (e.g. online and print)? At a micro level, could this data be seen as a best-practice information component that could be re-used within many documents?

What is “acceptable quality” for your data?

When considering quality you need not only to ensure that your data is fit for purpose now, but also to consider how that fitness can be maintained over time.

As business requirements change (for example data that is only delivered for occasional reference on a low-value simple web-page may in future have to be delivered as part of an expensive printed book or delivered in a semantically rich form to third parties for re-purposing or analysis), it is likely the expectation within your business will be that the content is adaptable enough to meet these new needs in a timely fashion without excessive further investment.

This may create a dichotomy of opinion within your organisation with regard to minimal investment now (for current requirements only) versus ongoing investment assuring the data against potential future requirements (that may or may not ever arise). Taking an Agile perspective (a lean practice associated with software development), you only focus on delivering functionality that is needed now. However the difference with data (especially large volumes) is that once decisions are made which initially limit the quality and expressiveness of the data, the effort (often manual or semi-manual) and time taken to upgrade it later (often in many iterations) may far outweigh the cost of creating something more maintainable and adaptive initially.

This lack of quality and content agility can lurk unnoticed in your delivered products until such a time as a new product (or a change to an existing product's functionality) brings the issue to the surface. This can cause severe delays in delivering the new functionality, or delivery can commence but with potential reputational damage as customers notice irregularities. The remedial action can be time consuming and expensive, with the added complication that expert knowledge from the original source may have been lost, as the new master data set is the less semantically rich version now in your content management system.

As a long-time advocate of the use of structured information (XML and before that SGML) in publishing, I fully recommend its usage to make your content more robust and adaptive to change. It should be stressed however that the decision to use XML alone will not ensure quality adaptable content. These articles will focus on the importance of making the right choices for semantic data and providing processes and tools to ensure long-term success.

Note: Quality is of course not just about the structure of the data but also relates to the information itself. Users demand content that has the best information explained in a clear and consistent way, using the most appropriate language with correct spelling and grammar. Some of this can be controlled by the use of editing and linguistic analysis tools, but only if a clear “house style” is already in place.

Next: Quality and Content Enrichment

Welcome to the website...

17/4/2015

As the briefest of glances will tell you, this website is really new and is pretty devoid of any content other than that describing who I/we are and what we do. The full list of projects and experience will be maintained in my LinkedIn profile, which is accessible from the icon on the top right of every page.
Note: Apologies if the whole "I/we" thing gets confusing! When you are a single consultant represented by a company website, it is hard to know what to use when.

The website is, like other consultants' sites, primarily a marketing exercise to allow you and me to find each other for our mutual benefit. I do not intend to create an XML/XSLT/publishing resource centre here (there are many of those already) but I will try to add thoughts, practical experiences and links to helpful information held elsewhere as they occur in my working life.

So for now, welcome, and please let me know if you have any issues with the site or any suggested blog topics!  

    Author

    Colin Mackenzie
    Thoughts on XML publishing (when time permits!)

