
Data Quality For Publishers – Part 3: 10 Things That Can Go Wrong

21/10/2015

This is the third in a series of short posts discussing the subject of data quality for publishers. In this post we will look at 10 challenges to maintaining data quality over the content lifecycle and during delivery to multiple products.

In my previous post (http://tinyurl.com/od5xae7) we looked at decisions made during the early days of a data modelling and conversion project. But that does not mean potential quality issues are at an end.
Let’s assume you have:
  • an XSD or Relax NG schema to guide authors and validate results;
  • all your content nicely converted into XML and it validates against the schema;
  • content uploaded into a content management system;
  • an XML editing solution to update the content and add new content; and
  • a publishing process that takes the XML content and pushes it through a transformation pipeline to produce one or more products (e.g. searchable HTML on a website).
So what can go wrong?

Here are 10 things to watch out for and some brief suggestions on how to mitigate quality risks:

1. Re-importing External Content

Existing content is resupplied from an external source (not updated in your CMS/XML) but may not be of the required quality. This is not a difficult problem if it is just a few files (you can import them and put them through your normal enrichment/approval/QA procedures), but what if there are hundreds or thousands of files? Before accepting the content (perhaps it needs to be sent back to outsourcers) and ingesting it into your “live” content pool (potentially causing a massive clean-up effort later), consider at least some of the following:
  • Perform bulk validation against your DTD/XSD/RNG schema to ensure that the files are at least structurally correct (a minimal sketch of such a bulk check follows this list).
  • Perform bulk validation using Schematron to ensure that business rules within each file are met (there may be combinations of elements and attributes that the schema allows but that should never occur, or more advanced rules regarding naming conventions for IDs).
  • Import the data into an XML database like BaseX (free) and run XQuery queries that give you confidence that the data matches your expectations (e.g. how many documents do you expect to be badged with metadata as being “about” some subject, or to contain a certain phrase in the title?). You can also check whether data that you do not expect to find occurs (e.g. if a paragraph contains nothing but text inside an emphasis, perhaps it is a title that has not been semantically identified). Such queries can be saved and used in future as a quality report (also for content generated within the CMS or exported as part of a bulk publishing process). Note: if your main CMS is XML-aware then you could always do the import into a test document collection and run the tests there.
  • If you have no linguistic consistency tools to help, perform a manual “spot check” of the files, checking spelling, grammar and any controlled vocabulary.
  • If your publishing processes are automated and can be run standalone (i.e. not only from within your CMS) for regression or functional tests (see later), then push the test data through the process to ensure it delivers what is needed for your end products.
  • When you do accept a new batch of content into the system, ensure that the batch is clearly marked with metadata so that if any errors are found later you can easily isolate the content in the CMS and take appropriate action (bulk correction, roll-back to previous version etc.).
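As an illustration of the bulk checks suggested above, here is a minimal sketch assuming Python with lxml; the schema locations, the incoming-batch folder and the RELAX NG/Schematron pairing are placeholders for whatever your own project uses:

    # Bulk validation sketch: structural rules (RELAX NG) plus business rules (Schematron).
    # Paths and schema choices are hypothetical; substitute your own.
    from pathlib import Path
    from lxml import etree, isoschematron

    relaxng = etree.RelaxNG(etree.parse("schemas/content.rng"))
    schematron = isoschematron.Schematron(etree.parse("schemas/business-rules.sch"))

    failures = []
    for path in sorted(Path("incoming-batch").glob("**/*.xml")):
        try:
            doc = etree.parse(str(path))
        except etree.XMLSyntaxError as err:
            failures.append((path, f"not well-formed: {err}"))
            continue
        if not relaxng.validate(doc):
            failures.append((path, f"schema: {relaxng.error_log.last_error}"))
        if not schematron.validate(doc):
            failures.append((path, "business rule violation"))

    for path, reason in failures:
        print(f"{path}: {reason}")
    print(f"{len(failures)} problem(s) found in batch")

The same report can be re-run every time a batch is resupplied, which is far cheaper than discovering the problems after the content has entered your live pool.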

2. Reliance on WYSIWYG Editing

To make life easier for your authors you provide a WYSIWYG (What You See Is What You Get) authoring mode in your XML editor. WYSIWYG is a strange concept when used with XML single-source publishing, as the content may be published into multiple formats/styles or combined with other information sources, so the XML on screen is only a sub-component. Typically the “WYSIWYG” view will match a principal output (e.g. the content in a web product), with some extra facility provided for PDF rendering (e.g. using an XSL-FO processor or XML+CSS processors). While these facilities help ease of use, you cannot rely on previews to ensure that the data being generated is semantically correct. The content may not be as rich as you would like (perhaps formatting elements like “strong” are used instead of an element like “cost”) but this is masked by the fact that in the main output (and WYSIWYG view) they are formatted the same. Tags may be abused to make the content appear visually correct on screen while being semantically dubious. Ultimately you need to train your authors to understand the semantics (and the consequences of not expressing them), and provide a QA process to ensure that the semantics are correct. You may consider providing a core authoring mode that helps disambiguate elements that are normally formatted the same on output by adding prompts (like a form input) and/or different styles.

3. Check-In of Invalid Content to your CMS

Does your CMS allow you to check in content that does not validate against your structural and/or business rules? On one hand, it may be helpful for authors to be able to safely and centrally store incomplete work (or to pass it over to another user via workflow etc.), but if that content is not easily identifiable as “complete and valid” then it could be published to your products, causing major issues and delays. This is normally solved by only publishing particular versions of a content item that have been through some QA and have reached a “published” workflow stage. Most commercial CMS systems will support such workflow/metadata/publishing features, but if this is not the case for your home-grown CMS then you should consider implementing them.

4. Metadata

Your XML schema has elements that are intended to contain metadata describing the information contained in the rest of the file. This may include the content title, subject matter classification, security or licensing information, version numbers and release dates. Other metadata may only be contained in fields in your CMS. So what can go wrong?
  • Metadata inside and outside the XML may get out of sync – the publishing process may need access to the metadata in the XML files, whereas for a non-XML-aware CMS the data may need to be extracted into fields for indexing so that it can be used for quick searches and for exporting certain categories of content. Either the data should only be held in one place (this may work for an XML CMS) or code must be written that automatically seeds the metadata from the master (database field or XML element) into the slave, either in the CMS or at the point of publishing.
  • Metadata classification can become corrupt/bloated with incorrect entries – hopefully you will have developed a taxonomy allowing your subject matter to be described according to an agreed constrained set of choices (so that authors do not describe one content subject as “cycles” and another similar content as “bicycles”) and to the appropriate level of detail. In some cases such constraints can be expressed as a choice in an XML schema/DTD, but this may not be enough to maintain consistency if:
    • The metadata is held only in the CMS database fields (see previous point on “sync”) and not entered into the XML.
    • The taxonomy is hierarchical, complex and growing, and is therefore held externally from the schema. In this case specific tools may need to be developed to help guide the user or to validate their choices against definitions held in a centrally maintained data store (e.g. an external XML file checked using Schematron); a minimal sketch of such a check follows.
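As a minimal sketch of that kind of externally-held check, assuming Python with lxml and hypothetical term/subject element names and file locations:

    # Controlled-vocabulary check sketch: the taxonomy file layout and the
    # <subject> metadata element are hypothetical stand-ins for your own model.
    from pathlib import Path
    from lxml import etree

    vocab = etree.parse("taxonomy/subjects.xml")
    allowed = {t.text.strip() for t in vocab.iter("term") if t.text}

    for path in sorted(Path("content").glob("*.xml")):
        doc = etree.parse(str(path))
        for subject in doc.iter("subject"):
            value = (subject.text or "").strip()
            if value not in allowed:
                print(f"{path}: unknown subject '{value}'")

The same rule could equally be expressed in Schematron with a document() lookup; the point is that the vocabulary lives in one centrally maintained place.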

5. Links - Cross-references and Transclusions

In most XML sources there is a need to support links. There are many types of links, including cross-references (link from this part of the document to that part of the document), images (put this image at this point in the content) and transclusions (get the content whose ID is the target of this link and include it at this point in the content). XML schemas only have very simple mechanisms for validating such constructs (e.g. ID/IDREF or key/keyref), and these only work within the scope of the current XML file. Links can be complex and provide vital functionality for end products and for document assembly (e.g. maprefs in DITA), and they need to be maintained during the content lifecycle. What happens if a user tries to delete a section of file B that is the target of a link in file A? Many XML CMS systems support some validation for such scenarios (especially those based on DITA's comprehensive linking and component re-use standards) but simple content stores will not.
A simpler quality issue with links is the use of links to external URLs and the need to check that the URL is correct and functional.
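A rough sketch of both checks, assuming Python with lxml and assuming links use id/idref/href attributes (your own link vocabulary will differ):

    # Link-checking sketch: collect declared IDs across a folder of XML files,
    # then verify cross-reference targets exist and external URLs respond.
    # The id/idref/href attribute names and the folder are assumptions.
    import urllib.request
    from pathlib import Path
    from lxml import etree

    docs = {p: etree.parse(str(p)) for p in Path("content").glob("*.xml")}
    all_ids = {e.get("id") for doc in docs.values() for e in doc.iter() if e.get("id")}

    for path, doc in docs.items():
        for e in doc.iter():
            target = e.get("idref")
            if target and target not in all_ids:
                print(f"{path}: broken cross-reference to '{target}'")
            href = e.get("href")
            if href and href.startswith("http"):
                try:
                    urllib.request.urlopen(href, timeout=10)
                except Exception as err:
                    print(f"{path}: external link {href} failed ({err})")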

6. Evolving Business Need / Evolving Quality Expectations

As discussed in previous posts, it is common to find that product requirements change and evolve but that the content is unfortunately not agile enough to support the new functionality without enrichment. If you have a large corpus then it is unlikely that you will be able to convert and/or manually enrich all of the content at the same time. It is also quite likely that business-as-usual changes (fixes to existing content, urgent new content added) will occur during the same period, meaning that the content needs to be re-published even though the source is valid according to various different “profiles”. This can cause major issues “downstream” in the publishing process, where code and systems that rely on the content may have to support multiple profiles at once, or perhaps simply cannot be updated in time to support the new content.
These problems can be avoided by ensuring that the final product process is isolated from the CMS/publishing process. The CMS publishing process should be able to create content that is backwards compatible with an agreed delivery schema to ensure old products do not break. This schema can then be used to validate the generated content before it is pushed through the different products' delivery pipelines. If a validation issue occurs at this stage then it is the author's responsibility to fix it, whereas if issues are found in the product then the investigation should start with the product developers.
This process should be implemented via an XML pipeline technology like XProc or Ant that allows multiple stages of XSLT transformation and validation.
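XProc or Ant is the natural fit here; purely to illustrate the shape of the gate, the same idea can be sketched in Python with lxml, with hypothetical file and folder names:

    # "Transform, then gate on the delivery schema" sketch. A file only reaches
    # the product delivery folder if it validates; otherwise it is reported for
    # an author-side fix. File names and folders are placeholders.
    from pathlib import Path
    from lxml import etree

    transform = etree.XSLT(etree.parse("pipeline/to-delivery.xsl"))
    delivery_schema = etree.RelaxNG(etree.parse("schemas/delivery.rng"))
    Path("delivery").mkdir(exist_ok=True)

    for path in sorted(Path("published").glob("*.xml")):
        result = transform(etree.parse(str(path)))
        if delivery_schema.validate(result):
            Path("delivery", path.name).write_bytes(etree.tostring(result))
        else:
            print(f"{path}: fails delivery schema: {delivery_schema.error_log.last_error}")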

7. Internal Data Can Leak into the Outside World

In your schema you may have the ability to gather authors' comments and change notes in a structured way. You can easily make the dropping of such content part of your publishing process. But what happens if an alternative process is developed that does not re-use the original code that drops these comments? Alternatively, what happens if a tiny change to the priority of an XSLT template means that the code that drops the internal comments is no longer run? If you have a “delivery” schema that does NOT allow these comment elements, and a standard publishing pipeline that always checks the content before it is pushed through product-specific publishing processes, then any such mistakes will be identified automatically before it is too late.
This approach can also be used to check if published output contains any material marked as being of an inappropriate security classification or end-user/product licensing scheme.
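A sketch of such a post-publish scan, assuming Python with lxml; the internal-comment element and the classification attribute values are hypothetical names for whatever your schema actually uses:

    # Leak-check sketch: scan generated output for internal-only markup and for
    # security classifications that should never reach this product.
    from pathlib import Path
    from lxml import etree

    ALLOWED_CLASSIFICATIONS = {"public"}

    for path in sorted(Path("delivery").glob("*.xml")):
        doc = etree.parse(str(path))
        if doc.xpath("//internal-comment"):
            print(f"{path}: internal comments present in delivered output")
        for e in doc.xpath("//*[@classification]"):
            if e.get("classification") not in ALLOWED_CLASSIFICATIONS:
                print(f"{path}: disallowed classification '{e.get('classification')}'")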

8. Different Products May Need Different or Additional Content

Even if the original source XML is good, subtle changes may cause the “derived files” needed by particular products to be wrong. In this case validation scripts and schemas can be created to check the quality of these alternative files (a sketch follows the list below).
Examples of this would include:
  • XML delivered to a customer-facing schema
  • Additional files describing the contents of collections (e.g. SCORM manifests)
  • Additional files containing supplementary information (e.g. a pre-processed table of contents, or lists of words to appear in “word wheels” in search dialogs)
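As a small example of what such a check might look like for a generated table of contents, assuming Python with lxml and a hypothetical entry/target layout for the ToC file:

    # Derived-file check sketch: every entry in a generated table of contents
    # should point at an ID that exists in the published content.
    # <entry target="..."> is a hypothetical stand-in for your own ToC format.
    from pathlib import Path
    from lxml import etree

    content_ids = set()
    for path in Path("delivery").glob("*.xml"):
        for e in etree.parse(str(path)).iter():
            if e.get("id"):
                content_ids.add(e.get("id"))

    toc = etree.parse("delivery/toc.xml")
    for entry in toc.iter("entry"):
        target = entry.get("target")
        if target and target not in content_ids:
            print(f"table-of-contents entry points at missing ID '{target}'")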

9. Avoiding Duplication

So your CMS is now the master content and all products are created via single-source publishing of that content, great. How do you avoid duplicate content being created within your CMS, which may lead to vital updates being performed on one copy but not the other? This can be tricky to solve, especially where you are storing information components rather than entire documents. If you are also importing legacy material from a time before you had a re-use strategy then this makes things even harder. While there are some tools that perform linguistic analysis and comparison which may help identify duplication, the first step is to have a clearly defined re-use strategy with user instructions on the process to follow before creating new resources.
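There is no silver bullet here, but even a simple textual comparison can produce a first-pass report of suspiciously similar components. A sketch assuming Python (the content folder and the 90% similarity threshold are arbitrary assumptions):

    # Near-duplicate report sketch: flags pairs of components whose flattened
    # text is very similar. Slow on large corpora; intended as rough triage only.
    from difflib import SequenceMatcher
    from itertools import combinations
    from pathlib import Path
    from lxml import etree

    def text_of(path):
        # Flatten the element text content for a rough comparison.
        return " ".join(etree.parse(str(path)).getroot().itertext())

    texts = {p: text_of(p) for p in Path("content").glob("*.xml")}

    for (a, ta), (b, tb) in combinations(texts.items(), 2):
        ratio = SequenceMatcher(None, ta, tb).ratio()
        if ratio > 0.9:
            print(f"possible duplicates: {a} and {b} ({ratio:.0%} similar)")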

10. Globally Unique and Persistent IDs

You have always created IDs for some content items (in fact they may even be mandatory for some elements) as the content is authored. These IDs will be unique within the XML file you edit (see point 5 above) but may not be unique when your content is combined with other content during the publishing phase. Additionally, internal IDs may be replaced (or new IDs added to elements that do not have them) as the content is published, with IDs generated at publish time (e.g. using the XSLT generate-id function for content that is the target of automated table-of-contents or other delivery features). This may meet the needs of the products being created until an end user needs to be able to reliably target a piece of content in a delivered interactive product using an ID (e.g. a bookmark or personal annotation function on a website). Every time the content is republished, generated IDs will be added with different values, causing the end users' bookmarks/annotations to fail. Inconsistency of IDs across publishing cycles also causes issues with automated regression testing of the delivered XML.
Wherever possible, permanent globally unique IDs should be created as early in the process as possible and stored in the content so they remain persistent across content publishes.
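A sketch of assigning such IDs at source, assuming Python with lxml; the list of “addressable” element names is an assumption you would replace with your own:

    # Persistent-ID sketch: give addressable elements a UUID-based @id if they do
    # not already have one, and write it back into the source so the same ID
    # survives every publish.
    import uuid
    from pathlib import Path
    from lxml import etree

    ADDRESSABLE = {"section", "table", "figure"}   # hypothetical element names

    for path in Path("content").glob("*.xml"):
        doc = etree.parse(str(path))
        changed = False
        for e in doc.iter():
            if not isinstance(e.tag, str):
                continue  # skip comments and processing instructions
            if etree.QName(e).localname in ADDRESSABLE and not e.get("id"):
                e.set("id", f"id-{uuid.uuid4()}")
                changed = True
        if changed:
            doc.write(str(path), xml_declaration=True, encoding="utf-8")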
A common factor in the issues above is the need to make content processes resilient to inevitable change. One of the best ways to do this is to implement content regression tests. In a future post I will discuss technical challenges and solutions for content regression testing.

Data Quality For Publishers – Part 2: Quality and Content Enrichment

14/10/2015



This is the second in a series of short posts discussing the subject of data quality for publishers. In this post we will look at resourcing and decision-making during the initial content analysis and modelling phases and how this can affect the quality and maintainability of your data.

As outlined in the first post (http://tinyurl.com/nnhbxcj) of this series, content developers are often under pressure to deliver quick results when adding a new source to a publisher’s corpus of valuable data. These demands may not be unreasonable as timelines are often driven by an urgent business need such as competitive pressure to deliver a new product. Another driver for quick content results may be that the product is being developed in an Agile way, and initial data is needed to support deliverables from development sprints. This article will discuss how a content developer can approach the analysis and modelling tasks within such strictures and how the business should compensate for the impact of any short cuts taken to avoid a permanent effect on quality.

The scenario is a common one. The business has acquired some content (or has decided to move away from a legacy non-XML workflow) in a non-structured inconsistent format (MS Word, Framemaker, HTML etc.) and the new content is to be added to the corporate repository and in future delivered to customers in a variety of formats and products.

The ideal way forward is to follow best practice for the content analysis, logical model and physical schema (ideally re-using a suitable existing standard or customizing a framework like DITA). Following the proper identification of the full data semantics you then convert the data (including manual enrichment where required) to match your schema and fully validate its quality in various iterations using various techniques. However, time or initial cost constraints may prevent best practices from being followed, or may mean that enrichment tasks perceived as not currently delivering value (not required for the current product deliverable) are dropped as low priority. So what should you do?
  • No matter what the constraints are, do concentrate on talking to subject matter experts; really understand the data (not just how it is currently styled or marked up) and its consistency before documenting the full logical semantic model. Play the semantic model back to the subject matter experts in order to confirm your assumptions (these data experts may not be around when you come to review/improve the data in future). If you are part of an Agile project, you can perform these tasks as part of early sprints, delivering knowledge and designs without necessarily having to deliver misleading data.
  • If you need to make compromises in what will be delivered, document the short-term model as a comparison to the full semantic model and illustrate the effect of those decisions. This is vital: even if the decision is ultimately made not to convert/improve the data to its full semantic potential initially, the business cannot judge what knowledge and opportunities may be lost if it does not understand the knowledge that should exist within the content.
  • Ensure that you establish that the newly converted XML content will be the authoritative and maintained single source for new publications. This will allow ongoing semantic improvements (along with keeping content current) to be performed over time. If the content is NOT the master content (and the source will be updated externally or in some other form), ensure the business understands that costs incurred for the initial project will be repeated EVERY time the content is updated and that ongoing semantic enrichment will not be cost effective in most cases.
  • While you cannot rely on styling mark-up in the source material to fully reveal semantics (many items may be headings or bold in this presentation format but for different underlying reasons), the style hints do provide a record of what content deserved special treatment (formatting). If short cuts in conversion are made and this styling is discarded during conversion (and not replaced with appropriate full semantic mark-up) then hints may be lost that could be used for content improvement at a later date (e.g. over time you find out that the bold text inside the first item of a list inside a specific element actually means something specific and needs to be styled, searched or extracted). Consider keeping such additional “hints” in the converted source using general mark-up (the equivalent of an HTML “span” with a “class” attribute) that can be ignored until a future project that wants to enrich the content needs it (a sketch follows this list). In many cases it may be impossible to go back to the original source for such information once the converted content has its own editorial lifecycle (as the original source is now out of date). This may also be helpful if a requirement to “round trip” updated content back to its original format arises.
  • Avoid “tag abuse” (incorrectly marking up semantic content) which will ultimately lead to applications and products losing quality even if the initial product deliverable (e.g. a simple HTML view) functions correctly.
    • Do not “crowbar” your new content into some existing supported schema that does not really describe the new data.
    • Do not “flatten” important semantics that can automatically be derived from information in the source into presentational mark-up (e.g. bold, heading etc.), even if you have no initial use for that semantic.
  • Even if the content is to be converted from the source and delivered directly to a product format without an intermediate/ongoing XML editorial stage, do make a schema (or customize an existing framework) and record what semantics you do identify for an intermediate XML stage. This can provide business value and quality indicators in a number of ways. Even if the XML is not highly structured, bulk validation against known expected “styles” in a large corpus can act as a means to identify data anomalies or gaps in your understanding of the data that would not be obvious simply by looking at the results of an XSLT conversion.
  • Where compromises are made, do record the “technical debt” (see https://en.wikipedia.org/wiki/Technical_debt) which should be a concept most IT organizations understand. The key difference with content enrichment technical debt is that unlike remedial action for programming language source code (whose impact may be limited to simply rewriting the code and interfaces to other systems), the remedial actions for content may need a massive manual effort to add in semantics that are not part of the current data set. Even with off-shoring this can be costly and time consuming and affect the business viability of future deliverable products. The detail of what was done and why may be lost in the mists of time.
    How can we make proper business decisions without understanding the impact of such decisions now and in the future?
  • Identify where the approach taken for a given project varies from corporate standard approaches (content quality, duplication not re-use, schemas etc.) and may lead to inconsistency in the corporate knowledge store. This is especially dangerous where the lower quality/inconsistent quality material is later combined with other better sources as part of interactive or print products and customer-facing errors occur that are incorrectly perceived as being product issues. This mixing of good and bad content is equivalent to “polluting your content reservoir”. Ensure that any content that does not meet normal standards (but for business reasons needs to be delivered quickly for use in specific ways for specific projects) is clearly marked with searchable metadata identifying the specific source/batch and its “quality grade”. While this does not fix the content it does make it more traceable so that it can be found programmatically and remedied at a later date. When remedial action does occur to batches of the content, the metadata can then be updated allowing XSLT or other code to change its behaviour based on the quality found.
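To make the “keep the hints” suggestion above concrete, here is a minimal sketch of the idea, assuming Python with lxml; the hint element and the b/i/u source tags are hypothetical stand-ins for whatever your conversion actually encounters:

    # Conversion-hint sketch: presentational runs with no identified semantics are
    # kept as a generic <hint> element recording the original style, rather than
    # being discarded during conversion.
    from lxml import etree

    PRESENTATIONAL = {"b": "bold", "i": "italic", "u": "underline"}

    def preserve_hints(root):
        for tag, style in PRESENTATIONAL.items():
            for e in list(root.iter(tag)):
                e.tag = "hint"          # generic wrapper, ignored by current outputs
                e.set("style", style)   # the original styling intent is retained
        return root

    doc = etree.fromstring("<p>The fee is <b>150 GBP</b> per <i>annum</i>.</p>")
    print(etree.tostring(preserve_hints(doc)).decode())
    # <p>The fee is <hint style="bold">150 GBP</hint> per <hint style="italic">annum</hint>.</p>

Downstream stylesheets can simply ignore hint elements until a later enrichment project decides what (if anything) they really mean.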

While a project rarely gets to do a perfect job first time, when dealing with high volumes of valuable data it is vital that the impact of early decisions is understood and managed by the business as a whole, rather than causing an unwelcome surprise and costly delays as future needs evolve.

An alternative scenario to the one given at the start of this post is the case where the master source content is always updated outside of your control and re-supplied each time in a less structured format than is ideal. This can be most frustrating (and expensive) as errors constantly re-appear and require repeated correction.
In the next post I will discuss approaches that can be taken to check and maintain the quality of content throughout its lifecycle and some challenges that are often encountered.

