This is the third in a series of short posts discussing the subject of data quality for publishers. In this post we will look at 10 challenges to maintaining data quality over the content lifecycle and during delivery to multiple products.
In my previous post (http://tinyurl.com/od5xae7) we looked at decisions made during the early days of a data modelling and conversion project. But that does not mean potential quality issues are at an end.
Let’s assume you have:
- an XSD or Relax NG schema to guide authors and validate results;
- all your content nicely converted into XML and it validates against the schema;
- content uploaded into a content management system;
- an XML editing solution to update the content and add new content; and
- a publishing process that takes the XML content and pushes it through a transformation pipeline to produce one or more products (e.g. searchable HTML on a website).
Here are 10 things to watch out for and some brief suggestions on how to mitigate quality risks:
1. Re-importing External Content
Existing content is resupplied from an external source (not updated in your CMS/XML) but may not be of the required quality. This is not a difficult problem if it is just a few files (you can import them and put them through your normal enrichment/approval/QA procedures), but what if there are hundreds or thousands of files? Before accepting the content (perhaps it needs to be sent back to the outsourcers) and ingesting it into your “live” content pool (potentially causing a massive clean-up effort later), consider at least some of the following:
- Perform bulk validation against your DTD/XSD/RNG schema to ensure that the files are at least structurally correct.
- Perform bulk validation using Schematron to ensure that business rules within each file are met (there may be combinations of elements and attributes that the schema allows but that should never occur, or more advanced rules regarding naming conventions for IDs).
- Import the data into an XML database like BaseX (free) and run XQuery queries that will give you confidence that the data matches your expectations (e.g. how many documents do you expect to be badged with metadata as being “about” a given subject, or to contain a certain phrase in the title?). You can also check whether data occurs that you do not expect to find (e.g. if a paragraph contains nothing but text inside an emphasis then perhaps it is a title that has not been semantically identified). Such queries can be saved and reused in future as a quality report (also for content generated within the CMS or exported as part of a bulk publishing process); a sketch follows this list. Note: if your main CMS is XML-aware then you could always do the import into a test document collection and run the tests there.
- If you have no linguistic consistency tools to help, perform a manual “spot check” of the files, checking spelling, grammar and any controlled vocabulary.
- If your publishing processes are automated and can be run standalone (i.e. not only from within your CMS) for regression or functional tests (see later), then push the test data through the process to ensure it delivers what is needed for your end products.
- When you do accept a new batch of content into the system, ensure that the batch is clearly marked with metadata so that if any errors are found later you can easily isolate the content in the CMS and take appropriate action (bulk correction, roll-back to previous version etc.).
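To make the XQuery suggestion above concrete, here is a minimal sketch of the kind of bulk checks you might run in BaseX. It assumes the resupplied batch has been loaded into a database called "resupply" and uses hypothetical subject, p and em element names; substitute the names from your own schema.

```xquery
(: Sketch only: the database name and element names (subject, p, em)
   are illustrative and will differ in your schema. :)
let $docs := collection("resupply")

(: How many documents are classified against each subject value? :)
let $by-subject :=
  for $s in distinct-values($docs//subject)
  order by $s
  return <subject value="{$s}" docs="{count($docs[.//subject = $s])}"/>

(: Paragraphs whose only content is a single emphasis: possibly titles
   that were never semantically identified. :)
let $suspect := $docs//p[count(node()) = 1 and em]

return
  <quality-report>
    <subjects>{ $by-subject }</subjects>
    <suspect-paragraphs count="{count($suspect)}"/>
  </quality-report>
```

Saved alongside the batch, a query like this doubles as a repeatable quality report for future resupplies.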
2. Reliance on WYSIWYG Editing
To make life easier for your authors you provide a WYSIWYG (What You See Is What You Get) authoring mode in your XML editor. WYSIWYG is a strange concept in XML single-source publishing, as the content may be published into multiple formats/styles or combined with other information sources, so the XML on screen is only a sub-component. Typically the “WYSIWYG” view will match a principal output (e.g. the content in a web product), with some extra facility provided for PDF rendering (e.g. using an XSL-FO processor or an XML+CSS processor). While these facilities help ease of use, you cannot rely on previews to ensure that the data being generated is semantically correct. The content may not be as rich as you would like (perhaps formatting elements like “strong” are used instead of an element like “cost”), but this is masked by the fact that in the main output (and the WYSIWYG view) they are formatted the same. Tags may be abused to make the content appear visually correct on screen while actually being semantically dubious. Ultimately you need to train your authors to understand the semantics (and the consequences of not expressing them), and provide a QA process to ensure that the semantics are correct. You may also consider providing a core authoring mode that helps disambiguate elements that are normally formatted the same on output, by adding prompts (like a form input) and/or different styles.
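As a very rough illustration of the kind of check that can back up author training, the sketch below flags generic formatting elements whose content looks like a monetary amount and which might be better expressed with a richer element such as “cost”. The strong element name and the currency pattern are assumptions; matches are only candidates for review, not errors.

```xquery
(: Sketch only: flags <strong> content that looks like a currency amount,
   as a candidate for richer semantic markup such as <cost>. :)
for $hit in collection("content")//strong[matches(normalize-space(.), "^[£$€]\s*\d")]
return <candidate doc="{base-uri($hit)}" text="{normalize-space($hit)}"/>
```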
3. Check-In of Invalid Content to your CMS
Does your CMS allow you to check in content that does not validate against your structural and/or business rules? On one hand, it may be helpful for authors to be able to safely and centrally store incomplete work (or to pass it over to another user via workflow etc.), but if that content is not easily identifiable as “complete and valid” then it could be published to your products, causing major issues and delays. This is normally solved by only publishing particular versions of a content item that have been through some QA and have reached a “published” workflow stage. Most commercial CMS systems will support such workflow/metadata/publishing features, but if this is not the case for your home-grown CMS then you should consider implementing them.
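If the workflow status is recorded as metadata the XML layer can see, a periodic “published but invalid” report is straightforward. The sketch below assumes a hypothetical meta/status element, a schema called content.xsd and BaseX’s validation module; your CMS may offer an equivalent facility of its own.

```xquery
(: Sketch only: element names, the schema location and the "published"
   status value are illustrative. Requires BaseX's validate module. :)
for $doc in collection("content")[.//meta/status = "published"]
let $report := validate:xsd-report($doc, "schemas/content.xsd")
where $report/status = "invalid"
return <invalid doc="{base-uri($doc)}">{ $report/message }</invalid>
```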
4. Metadata
Your XML schema has elements that are intended to contain metadata describing the information contained in the rest of the file. This may include the content title, subject matter classification, security or licensing information, version numbers and release dates. Other metadata may only be held in fields in your CMS. So what can go wrong?
- Metadata inside and outside the XML may get out of sync – the publishing process may need access to the metadata in the XML files, whereas for a non-XML-aware CMS the data may need to be extracted into fields for indexing so that it can be used for quick searches and for exporting certain categories of content. Either the data should only be held in one place (this may work for an XML CMS) or code must be written that automatically seeds the metadata from the master (database field or XML element) into the slave, either in the CMS or at the point of publishing.
- Metadata classification can become corrupt or bloated with incorrect entries – hopefully you will have developed a taxonomy allowing your subject matter to be described according to an agreed, constrained set of choices (so that authors do not describe one piece of content as being about “cycles” and another similar piece as being about “bicycles”) and to the appropriate level of detail. In some cases such constraints can be expressed as a choice in an XML schema/DTD, but this may not be enough to maintain consistency if:
- The metadata is held only in the CMS database fields (see previous point on ”sync”) and not entered into the XML.
- The taxonomy is hierarchical, complex and growing, and is therefore held externally from the schema. In this case specific tools may need to be developed to guide the user or to validate their choices against definitions held in a centrally maintained data store (e.g. an external XML file checked using Schematron); a sketch of such a check follows this list.
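Where the taxonomy lives outside the schema, the check against it can still be automated. The sketch below (expressed in XQuery rather than Schematron, for brevity) assumes a hypothetical taxonomy.xml containing term elements and subject classification elements in the content; substitute your own names and structure.

```xquery
(: Sketch only: taxonomy.xml, <term> and <subject> are illustrative names. :)
let $allowed := doc("taxonomy.xml")//term ! normalize-space(.)
for $subject in collection("content")//subject
where not(normalize-space($subject) = $allowed)
return <unknown-subject doc="{base-uri($subject)}" value="{normalize-space($subject)}"/>
```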
5. Links - Cross references and transclusions
In most XML sources there is a need to support links. There are many types of links, including cross-references (link from this part of the document to that part of the document), images (put this image at this point in the content) and transclusions (get the content whose ID is the target of this link and include it at this point in the content). XML schemas only have very simple mechanisms for validating such constructs (e.g. ID/IDREF or key/keyref), and these only work within the scope of the current XML file. Links can be complex, provide vital functionality for end products and for document assembly (e.g. maprefs in DITA), and need to be maintained throughout the content lifecycle. What happens if a user tries to delete a section of file B that is the target of a link in file A? Many XML CMS systems support some validation for such scenarios (especially those built on DITA, with its comprehensive linking and component re-use standards) but simple content stores will not.
A simpler quality issue is the use of links to external URLs and the need to check that each URL is correct and functional.
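A sketch of the cross-file part of this problem: collect every @id in the collection and report link targets that do not resolve. The xref element and target attribute names are assumptions, and checking external URLs would additionally need an HTTP client, which is out of scope here.

```xquery
(: Sketch only: <xref target="..."> is an illustrative linking convention. :)
let $docs      := collection("content")
let $known-ids := $docs//@id/string()
for $xref in $docs//xref[@target]
where not($xref/@target = $known-ids)
return <dangling-link doc="{base-uri($xref)}" target="{$xref/@target}"/>
```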
6. Evolving Business Needs / Evolving Quality Expectations
As discussed in previous posts, it is common to find that product requirements change and evolve but that the content is unfortunately not agile enough to support the new functionality without enrichment. If you have a large corpus then it is unlikely that you will be able to convert and/or manually enrich all of the content at the same time. It is also quite likely that business-as-usual changes (fixes to existing content, urgent new content) will occur during the same period, meaning that content needs to be re-published while different parts of the source are valid against different “profiles”. This can cause major issues “downstream” in the publishing process, where code and systems that rely on the content may have to support multiple profiles at once, or perhaps simply cannot be updated in time to support the new content.
These problems can be avoided by ensuring that the final product process is isolated from the CMS/publishing process. The CMS publishing process should be able to create content that is backwards compatible with an agreed delivery schema, to ensure old products do not break. This schema can then be used to validate the generated content before it is pushed through the different product delivery pipelines. If a validation issue occurs at this stage then it is the author’s responsibility to fix it, whereas if issues are found in the product then the investigation should start with the product developers.
This process should be implemented via an XML pipeline such as XProc or Ant that allows multiple stages of XSLT transformation and validation.
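Whether the pipeline is XProc, Ant or something else, the key step is the schema gate on the generated output. A minimal sketch of that gate is shown below as an XQuery job, assuming BaseX’s file and validation modules, a hypothetical output/ directory of generated files and a delivery.xsd; in a real pipeline this would simply be one stage among the transformation steps.

```xquery
(: Sketch only: paths and the schema name are illustrative.
   Requires BaseX's file and validate modules. :)
for $path in file:list("output/", true(), "*.xml")
let $report := validate:xsd-report(doc("output/" || $path), "delivery.xsd")
where $report/status = "invalid"
return <gate-failure file="{$path}">{ $report/message }</gate-failure>
```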
7. Internal data can leak into the outside world
In your schema you may have the ability to gather author’s comments and change notes in a structured way. You can easily make the dropping of such content part of your publishing process. But what happens if an alternative process is developed that does not re-use the original code that drops these comments? Alternatively what happens if a tiny change to the priority of an XSLT template means that the code that drops the internal comments is no longer run? If you have a “delivery” schema that does NOT allow these comment elements and a standard publishing pipeline that always checks the content prior to it being pushed through product-specific publishing processes then any such mistakes would be identified automatically before it is too late.
This approach can also be used to check if published output contains any material marked as being of an inappropriate security classification or end-user/product licensing scheme.
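The delivery-schema gate can be complemented by targeted leak checks over the generated output. The sketch below assumes hypothetical internal-comment elements and a security attribute with a “restricted” value; substitute whatever your schema uses for internal notes and classification.

```xquery
(: Sketch only: the element and attribute names are illustrative. :)
let $published := collection("published-output")
return
  <leak-report>
    <internal-comments count="{count($published//internal-comment)}"/>
    <restricted-material count="{count($published//*[@security = 'restricted'])}"/>
  </leak-report>
```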
8. Different products may need different or additional content
Different products may require additional files derived from the master XML. Even if the original source XML is good, subtle changes may have caused these derived files to be wrong. In this case validation scripts and schemas can be created to check the quality of these alternative files (see the sketch after the examples below). Examples of such files include:
- XML delivered to a customer-facing schema
- Additional files describing contents of collections (e.g. SCORM manifests)
- Additional files containing supplementary information (e.g. a pre-processed table of contents, or lists of words to appear in “word wheels” in search dialogs)
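As an example of validating such a derived file, the sketch below checks that everything referenced from a SCORM-style manifest actually exists among the delivered files. The manifest location, the resource/@href convention and the output/ directory are all assumptions.

```xquery
(: Sketch only: file locations and the resource/@href convention are
   illustrative; real SCORM manifests have a richer structure. :)
let $manifest  := doc("output/imsmanifest.xml")
let $delivered := file:list("output/", true(), "*")
for $href in $manifest//*:resource/@href ! string()
where not($href = $delivered)
return <missing-resource href="{$href}"/>
```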
9. Avoiding duplication
So your CMS is now the content master and all products are created via single-source publishing of that content. Great. How do you avoid duplicate content being created within your CMS, which may lead to vital updates being performed on one copy but not the others? This can be tricky to solve, especially where you are storing information components rather than entire documents. If you are also importing legacy material from a time before you had a re-use strategy then this makes things even harder. While there are some tools that perform linguistic analysis and comparison, which may help identify duplication, the first step is to have a clearly defined re-use strategy with user instructions on the process to follow before creating new resources.
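Sophisticated similarity detection needs dedicated linguistic tools, but verbatim copies can be caught with something as crude as the sketch below, which groups content components by their normalised text. The component element name is an assumption.

```xquery
(: Sketch only: <component> is an illustrative name for a re-usable unit;
   this only catches exact (normalised) duplicates. :)
for $component in collection("content")//component
group by $text := normalize-space($component)
where count($component) > 1
return
  <duplicates copies="{count($component)}">{
    for $c in $component return <instance doc="{base-uri($c)}"/>
  }</duplicates>
```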
10. Globally Unique and Persistent IDs
You have always created IDs for some content items (in fact they may even be mandatory for some elements) as the content is authored. These IDs will be unique within the XML file you edit (see point 5 above) but may not be unique when your content is combined with other content during the publishing phase. Additionally, internal IDs may be replaced (or new IDs added to elements that do not have them) as the content is published, with IDs generated at publish time (e.g. using the XSLT generate-id() function for content that is the target of an automated table of contents or other delivery features). This may meet the needs of the products being created until such time as an end user needs to reliably target a piece of content in a delivered interactive product using an ID (e.g. a bookmark or personal annotation function on a website). Every time the content is republished, generated IDs will be added with different values, causing the end user’s bookmarks/annotations to fail. Inconsistency of IDs across publishing cycles also causes issues with automated regression testing of the delivered XML.
Wherever possible, permanent globally unique IDs should be created as early in the process as possible and stored in the content so they remain persistent across content publishes.
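A simple report can keep an eye on both halves of this problem: IDs that clash when content is combined, and content items that still have no persistent ID at all. The section element name in the sketch below is an assumption; point it at whichever elements your products need to target reliably.

```xquery
(: Sketch only: <section> is an illustrative element that should carry a
   persistent @id. :)
let $docs := collection("content")
return
  <id-report>
    {
      for $id in $docs//@id
      group by $value := string($id)
      where count($id) > 1
      return <clash id="{$value}" occurrences="{count($id)}"/>
    }
    <missing-ids count="{count($docs//section[not(@id)])}"/>
  </id-report>
```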
A common factor in the issues above is the need to make content processes resilient to inevitable change. One of the best ways to do this is to implement content regression tests. In a future post I will discuss technical challenges and solutions for content regression testing.
Previous: Quality And Content Enrichment