
Data Quality For Publishers – Part 3: 10 Things That Can Go Wrong

21/10/2015

This is the third in a series of short posts discussing the subject of data quality for publishers. In this post we will look at 10 challenges to maintaining data quality over the content lifecycle and during delivery to multiple products.

In my previous post (http://tinyurl.com/od5xae7) we looked at decisions made during the early days of a data modelling and conversion project. But that does not mean potential quality issues are at an end.
Let’s assume you have:
  • an XSD or RELAX NG schema to guide authors and validate results;
  • all your content nicely converted into XML and it validates against the schema;
  • content uploaded into a content management system;
  • an XML editing solution to update the content and add new content; and
  • a publishing process that takes the XML content and pushes it through a transformation pipeline to produce one or more products (e.g. searchable HTML on a website).
So what can go wrong?

Here are 10 things to watch out for and some brief suggestions on how to mitigate quality risks:

1. Re-importing External Content

Existing content is resupplied from an external source (not updated in your CMS/XML) but may not be of the required quality. This is not a difficult problem if it is just a few files (you can import them and put them through your normal enrichment/approval/QA procedures), but what if there are hundreds or thousands of files? Before accepting the content (perhaps it needs to be sent back to the outsourcers) and ingesting it into your “live” content pool (and potentially causing a massive clean-up effort later), consider at least some of the following:
  • Perform bulk validation against your DTD/XSD/RNG schema to ensure that the files are at least structurally correct (a rough sketch of such a bulk check follows this list).
  • Perform bulk validation using Schematron to ensure that the business rules within each file are met (there may be combinations of elements and attributes that the schema allows but that should never occur, or more advanced rules regarding the naming conventions of IDs).
  • Import the data into an XML database like BaseX (free) and run XQuery queries that will provide confidence that the data matches your expectations (e.g. how many documents do you expect to be badged with metadata as being “about” some subject or containing a certain phrase in the title?). You can also check whether data that you do not expect to find occurs (e.g. if a paragraph contains nothing but text inside an emphasis element then perhaps this is a title that has not been semantically identified). Such queries can be saved and used in future as a quality report (also for content generated within the CMS or exported as part of a bulk publishing process). Note: if your main CMS is XML aware then you could always do the import into a test document collection and run the queries there.
  • If you have no linguistic consistency tools to help, perform a manual “spot check” of the files, checking spelling, grammar and any controlled vocabulary.
  • If your publishing processes are automated and can be run standalone (i.e. not only from within your CMS) for regression tests or functional tests (see later) then push the test data through the process to ensure it delivers what is needed for your end products.
  • When you do accept a new batch of content into the system, ensure that the batch is clearly marked with metadata so that if any errors are found later you can easily isolate the content in the CMS and take appropriate action (bulk correction, roll-back to previous version etc.).
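As a rough illustration of the first check above, here is a minimal sketch in Python using lxml that walks a batch of resupplied files and reports any that are not well-formed or not schema-valid. The directory and schema file names ("delivery/", "content.xsd") are assumptions for the example, not part of any particular toolchain.

    # bulk_validate.py - minimal sketch: validate a batch of resupplied XML files
    # against an XSD before they are accepted into the "live" content pool.
    # "delivery/" and "content.xsd" are illustrative names only.
    from pathlib import Path
    from lxml import etree

    schema = etree.XMLSchema(etree.parse("content.xsd"))

    failures = []
    for xml_file in sorted(Path("delivery").glob("**/*.xml")):
        try:
            doc = etree.parse(str(xml_file))
        except etree.XMLSyntaxError as err:
            failures.append((xml_file, f"not well-formed: {err}"))
            continue
        if not schema.validate(doc):
            # error_log records why each file failed validation
            failures.append((xml_file, schema.error_log.last_error))

    for name, reason in failures:
        print(f"FAIL {name}: {reason}")
    print(f"{len(failures)} of the supplied files failed validation")

The same loop can be extended with lxml's RELAX NG and Schematron classes to cover the business-rule checks, and the resulting report kept alongside the batch metadata described in the last point.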

2. Reliance on WYSIWYG Editing

To make life easier for your authors you provide a WYSIWYG (What You See Is What You Get) authoring mode in your XML editor. WYSIWYG is a strange concept when used with XML single-source publishing, as the content may be published into multiple formats/styles or combined with other information sources, so the XML on screen is only a sub-component. Typically the “WYSIWYG” view will match a principal output (e.g. the content in a web product), with some extra facility provided for PDF rendering (e.g. using an XSL-FO processor or an XML+CSS processor). While these facilities help ease of use, you cannot rely on previews to ensure that the data being generated is semantically correct. The content may not be as rich as you would like (perhaps a formatting element like “strong” is used instead of an element like “cost”), but this is masked by the fact that in the main output (and the WYSIWYG view) they are formatted the same. Tags may be abused to make the content appear visually correct on screen while actually being semantically dubious. Ultimately you need to train your authors to understand the semantics (and the consequences of not expressing them), and provide a QA process to ensure that the semantics are correct. You may also consider providing a core authoring mode that helps disambiguate elements that are normally formatted the same on output, by adding prompts (like a form input) and/or different styles.

3. Check-In of Invalid Content to Your CMS

Does your CMS allow you to check in content that does not validate against your structural and/or business rules? On one hand, it may be helpful for authors to be able to safely and centrally store incomplete work (or to pass it over to another user via workflow etc.), but if that content is not easily identifiable as “complete and valid” then it could be published to your products, causing major issues and delays. This is normally solved by only publishing particular versions of a content item that have been through some QA and have reached a “published” workflow stage. Most commercial CMS systems will support such workflow/metadata/publishing features, but if this is not the case for your home-grown CMS then you should consider implementing them.

4. Metadata

Your XML schema has elements that are intended to contain metadata describing the information contained in the rest of the file. This may include the content title, subject matter classification, security or licensing information, version numbers and release dates. Other metadata may only be contained in fields in your CMS. So what can go wrong?
  • Metadata inside and outside the XML may get out of sync – the publishing process may need access to the metadata in the XML files, whereas for a non-XML-aware CMS the data may need to be extracted into fields for indexing so that it can be used for quick searches and for exporting certain categories of content. Either the data should only be held in one place (this may work for an XML CMS) or code must be written that automatically seeds the metadata from the master (database field or XML element) into the slave, either in the CMS or at the point of publishing.
  • Metadata classification can become corrupt/bloated with incorrect entries – hopefully you will have developed a taxonomy allowing your subject matter to be described according to an agreed constrained set of choices (so that authors do not describe one content subject as “cycles” and another similar content as “bicycles”) and to the appropriate level of detail. In some cases such constraints can be expressed as a choice using an XML schema/DTD, but this may not be enough to maintain consistency if:
    • The metadata is held only in the CMS database fields (see the previous point on “sync”) and not entered into the XML.
    • The taxonomy is hierarchical, complex and growing and is therefore held externally from the schema. In this case specific tools may need to be developed to help guide the user or to validate their choices against definitions held in a centrally maintained data store (e.g. an external XML file checked using Schematron); a rough sketch of such a check follows this list.
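As a minimal sketch of that kind of taxonomy check, the following compares each subject value in the content against a centrally maintained list. The file and element names ("taxonomy.xml", "term", "subject", "content/") are assumptions purely for illustration.

    # check_subjects.py - minimal sketch: flag subject metadata values that are not
    # in the centrally maintained taxonomy. File and element names are assumptions.
    from pathlib import Path
    from lxml import etree

    taxonomy = etree.parse("taxonomy.xml")
    allowed = {term.text.strip() for term in taxonomy.iter("term") if term.text}

    for xml_file in Path("content").glob("*.xml"):
        doc = etree.parse(str(xml_file))
        for subject in doc.iter("subject"):
            value = (subject.text or "").strip()
            if value not in allowed:
                print(f"{xml_file}: unknown subject '{value}'")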

5. Links - Cross-References and Transclusions

In most XML sources there is a need to support links. There are many types of links, including cross-references (link from this part of the document to that part of the document), images (put this image at this point in the content) and transclusions (get the content whose ID is the target of this link and include it at this point in the content). XML schemas only have very simple mechanisms for validating such constructs (e.g. ID/IDREF or key/keyref) that only work within the scope of the current XML file. Links can be complex and provide vital functionality for end products and for document assembly (e.g. maprefs in DITA), and they need to be maintained during the content lifecycle. What happens if a user tries to delete a section of file B that is the target of a link in file A? Many XML CMS systems support some validation for such scenarios (especially those based on DITA's comprehensive linking and component re-use standards), but simple content stores will not.
A simpler quality issue with links is the use of links to external URLs and the need to check that the URL is correct and functional.
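A minimal sketch of both checks is shown below, assuming (for illustration only) that cross-references use href attributes with “#id” fragments, that targets carry xml:id attributes, and that external links start with http. A real DITA- or CMS-based solution would be considerably more sophisticated.

    # check_links.py - minimal sketch: report cross-references whose target ID is not
    # defined anywhere in the collection, plus external URLs that do not respond.
    # The href/xml:id conventions and the "content/" directory are assumptions.
    from pathlib import Path
    from urllib.request import urlopen
    from urllib.error import URLError
    from lxml import etree

    XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
    docs = {f: etree.parse(str(f)) for f in Path("content").glob("*.xml")}

    # collect every ID declared anywhere in the collection
    known_ids = {el.get(XML_ID) for doc in docs.values()
                 for el in doc.iter("*") if el.get(XML_ID)}

    for f, doc in docs.items():
        for el in doc.iter("*"):
            href = el.get("href")
            if not href:
                continue
            if href.startswith("#") and href[1:] not in known_ids:
                print(f"{f}: broken cross-reference {href}")
            elif href.startswith("http"):
                try:
                    with urlopen(href, timeout=10):
                        pass
                except (URLError, ValueError) as err:
                    print(f"{f}: unreachable URL {href} ({err})")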

6. Evolving Business Need / Evolving Quality Expectations

As discussed in previous posts, it is common to find that product requirements change and evolve but that the content is unfortunately not agile enough to support the new functionality without enrichment. If you have a large corpus then it is unlikely that you will be able to convert and/or manually enrich all of the content at the same time. It is also quite likely that business-as-usual changes (fixes to existing content, urgent new content) will occur during the same period, meaning that content conforming to several different “profiles” needs to be re-published during the transition. This can cause major issues “downstream” in the publishing process, where code and systems that rely on the content may have to support multiple profiles at once, or perhaps simply cannot be updated in time to support the new content.
These problems can be avoided by ensuring that the final product process is isolated from the CMS/publishing process. The CMS publishing process should be able to create content that is backwards compatible with an agreed delivery schema, to ensure old products do not break. This schema can then be used to validate the generated content prior to it being pushed through the different product delivery pipelines. If a validation issue occurs at this stage then it is the author's responsibility to fix it, whereas if issues are found in the product then the investigation should start with the product developers.
This process should be implemented via an XML pipeline such as XProc or Ant that allows multiple stages of XSLT transformation and validation.
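In XProc or Ant the stages would be declared as pipeline steps; the sketch below uses Python and lxml purely to illustrate the “transform, then gate on the agreed delivery schema” sequence. The file names ("source.xml", "to-delivery.xsl", "delivery.xsd") are assumptions for the example.

    # publish_pipeline.py - minimal sketch of a transform-then-validate publishing gate.
    from lxml import etree

    source = etree.parse("source.xml")

    # Stage 1: transform CMS content into the agreed delivery format
    to_delivery = etree.XSLT(etree.parse("to-delivery.xsl"))
    delivery_doc = to_delivery(source)

    # Stage 2: gate on the delivery schema before any product pipeline sees the content
    delivery_schema = etree.XMLSchema(etree.parse("delivery.xsd"))
    if not delivery_schema.validate(delivery_doc):
        raise SystemExit(f"Delivery validation failed:\n{delivery_schema.error_log}")

    # Stage 3: only validated content is handed to the product-specific pipelines
    delivery_doc.write("delivery.xml", xml_declaration=True, encoding="UTF-8")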

7. Internal Data Can Leak into the Outside World

In your schema you may have the ability to gather authors' comments and change notes in a structured way. You can easily make the dropping of such content part of your publishing process. But what happens if an alternative process is developed that does not re-use the original code that drops these comments? Alternatively, what happens if a tiny change to the priority of an XSLT template means that the code that drops the internal comments is no longer run? If you have a “delivery” schema that does NOT allow these comment elements, and a standard publishing pipeline that always checks the content prior to it being pushed through product-specific publishing processes, then any such mistakes will be identified automatically before it is too late.
This approach can also be used to check if published output contains any material marked as being of an inappropriate security classification or end-user/product licensing scheme.
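Alongside the delivery-schema check, a quick targeted scan of the generated output can act as a belt-and-braces guard. The sketch below (the element names and the "delivery/" directory are assumptions for illustration) fails the publish if any internal-only elements survive into the delivered XML.

    # check_leaks.py - minimal sketch: fail the publish if internal-only elements
    # survive into the delivered XML. Element names and directory are assumptions.
    from pathlib import Path
    from lxml import etree

    FORBIDDEN = ["authorComment", "changeNote", "internalOnly"]

    leaks = []
    for xml_file in Path("delivery").glob("*.xml"):
        doc = etree.parse(str(xml_file))
        for name in FORBIDDEN:
            if doc.findall(f".//{name}"):
                leaks.append((xml_file, name))

    for xml_file, name in leaks:
        print(f"LEAK {xml_file}: contains <{name}>")
    if leaks:
        raise SystemExit(1)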

8. Different Products May Need Different or Additional Content

Different products may need the same source published to different schemas, or additional derived files generated alongside the main output. Even if the original source XML is good, subtle changes can cause these “derived files” to be wrong. In this case validation scripts and schemas can be created to validate the quality of these alternative files.
Examples of this would include (a sketch of one such check follows this list):
  • XML delivered to a customer-facing schema
  • Additional files describing contents of collections (e.g. SCORM manifests)
  • Additional files containing supplementary information (e.g. a pre-processed table of contents, or lists of words to appear in “word wheels” in search dialogs).
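As a rough example of validating such a derived file, the sketch below checks that every resource referenced from a package manifest actually exists in the package. The manifest layout, the "package/" directory and the href attribute are assumptions; for real SCORM packages you would also validate imsmanifest.xml against its schemas.

    # check_manifest.py - minimal sketch: confirm that every file referenced from a
    # package manifest actually exists in the package directory.
    from pathlib import Path
    from lxml import etree

    package = Path("package")
    manifest = etree.parse(str(package / "imsmanifest.xml"))

    missing = []
    for el in manifest.iter("*"):
        href = el.get("href")
        if href and not href.startswith("http") and not (package / href).exists():
            missing.append(href)

    for href in missing:
        print(f"Missing resource referenced by manifest: {href}")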

9. Avoiding Duplication

So your CMS is now the master content and all products are created via single-source publishing of that content, great. How do you avoid duplicate content being created within your CMS, which may lead to vital updates being performed on one copy but not another? This can be tricky to solve, especially where you are storing information components rather than entire documents. If you are also importing legacy material from a time before you had a re-use strategy then this makes things even harder. While there are some tools that perform linguistic analysis and comparison which may help identify duplication, the first step is to have a clearly defined re-use strategy with user instructions on the process to follow before creating new resources.
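Dedicated linguistic-analysis tools go much further, but even a crude comparison of normalised component text can surface candidate duplicates for review. The sketch below uses Python's difflib; the 0.9 similarity threshold and the "components/" directory are assumptions for illustration.

    # find_duplicates.py - minimal sketch: flag pairs of components whose normalised
    # text is very similar, as candidates for manual de-duplication review.
    from itertools import combinations
    from pathlib import Path
    from difflib import SequenceMatcher
    from lxml import etree

    def normalised_text(path):
        doc = etree.parse(str(path))
        return " ".join("".join(doc.getroot().itertext()).split()).lower()

    texts = {p: normalised_text(p) for p in Path("components").glob("*.xml")}

    for (a, text_a), (b, text_b) in combinations(texts.items(), 2):
        ratio = SequenceMatcher(None, text_a, text_b).ratio()
        if ratio > 0.9:
            print(f"Possible duplicates ({ratio:.2f}): {a} and {b}")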

10. Globally Unique and Persistent IDs

You have always created IDs for some content items (in fact they may even be mandatory for some elements) as the content is authored. These IDs will be unique within the XML file you edit (see point 5 above) but may not be unique when your content is combined with other content during the publishing phase. Additionally, internal IDs may be replaced (or new IDs added to elements that do not have them) as the content is published, with IDs generated at publish time (e.g. using the XSLT generate-id function for content that is the target of an automated table of contents or other delivery features). This may meet the needs of the products being created until such time as an end user needs to be able to reliably target a piece of content in a delivered interactive product using an ID (e.g. a bookmark or personal annotation function on a website). Every time the content is republished, generated IDs will be created with different values, causing the end users' bookmarks/annotations to fail. Inconsistency of IDs across publishing cycles also causes issues with automated regression testing of the delivered XML.
Wherever possible, permanent globally unique IDs should be created as early in the process as possible and stored in the content so they remain persistent across content publishes.
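One rough way to do this is to stamp a permanent ID onto every element that end users might target, at (or soon after) authoring time, rather than relying on IDs generated at publish time. The element names ("section", "topic"), the xml:id attribute and the "content/" directory in the sketch below are assumptions for illustration.

    # add_persistent_ids.py - minimal sketch: give every section-level element a
    # permanent ID at authoring/ingestion time if it does not already have one, so
    # that the ID survives future publishes.
    import uuid
    from pathlib import Path
    from lxml import etree

    XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

    for xml_file in Path("content").glob("*.xml"):
        doc = etree.parse(str(xml_file))
        changed = False
        for el in doc.getroot().iter("section", "topic"):
            if not el.get(XML_ID):
                el.set(XML_ID, "id-" + uuid.uuid4().hex)
                changed = True
        if changed:
            doc.write(str(xml_file), xml_declaration=True, encoding="UTF-8")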
A common factor in the issues above is the need to make content processes resilient to inevitable change. One of the best ways to do this is to implement content regression tests. In a future post I will discuss technical challenges and solutions for content regression testing.
​
Previous: Quality And Content Enrichment

Data Quality For Publishers – Part 2: Quality and Content Enrichment

14/10/2015

This is the second in a series of short posts discussing the subject of data quality for publishers. In this post we will look at resourcing and decision-making during the initial content analysis and modelling phases and how this can affect the quality and maintainability of your data.

As outlined in the first post (http://tinyurl.com/nnhbxcj) of this series, content developers are often under pressure to deliver quick results when adding a new source to a publisher’s corpus of valuable data. These demands may not be unreasonable as timelines are often driven by an urgent business need such as competitive pressure to deliver a new product. Another driver for quick content results may be that the product is being developed in an Agile way, and initial data is needed to support deliverables from development sprints. This article will discuss how a content developer can approach the analysis and modelling tasks within such strictures and how the business should compensate for the impact of any short cuts taken to avoid a permanent effect on quality.

The scenario is a common one. The business has acquired some content (or has decided to move away from a legacy non-XML workflow) in an unstructured, inconsistent format (MS Word, FrameMaker, HTML etc.) and the new content is to be added to the corporate repository and in future delivered to customers in a variety of formats and products.

The ideal way forward is to follow best practice for the content analysis, logical model and physical schema (ideally re-using a suitable existing standard or through customization of a framework like DITA). Following the proper identification of the full data semantics, you then convert the data (including manual enrichment where required) to match your schema and fully validate its quality in various iterations using various techniques. However, time or initial cost constraints may prevent best practice being followed, or may mean that enrichment tasks perceived as not currently delivering value (i.e. not required for the current product deliverable) are not performed because they are not high priority. So what should you do?
  • No matter what the constraints are, do concentrate on talking to subject matter experts; really understand the data (not just how it is currently styled or marked up) and its consistency before documenting the full logical semantic model. Play the semantic model back to the subject matter experts in order to confirm your assumptions (these data experts may not be around when you come to review/improve the data in future). If you are part of an Agile project, you can perform these tasks as part of early sprints, delivering knowledge and designs without necessarily having to deliver misleading data.
  • If you need to make compromises in what will be delivered, document the short-term model as a comparison to the full semantic model and illustrate the effect of these decisions. This is vital: even if the decision is ultimately made not to convert/improve the data to its full semantic potential initially, the business cannot understand what knowledge and opportunities may be lost unless it understands the knowledge that should exist within the content.
  • Ensure that you establish that the newly converted XML content will be the authoritative and maintained single source for new publications. This will allow ongoing semantic improvements (along with keeping content current) to be performed over time. If the content is NOT the master content (and the source will be updated externally or in some other form), ensure the business understands that costs incurred for the initial project will be repeated EVERY time the content is updated and that ongoing semantic enrichment will not be cost effective in most cases.
  • While you cannot rely on styling mark-up in the source material to fully reveal semantics (many items may be headings or bold in this presentation format but for different underlying reasons), the style hints will provide a record of what content deserved special treatment (formatting). If short cuts in conversion are made and this styling is discarded during conversion (and not replaced with appropriate full semantic mark-up), then hints may be lost that could be used for content improvement at a later date (e.g. over time you find out that the bold text inside the first item of a list inside a specific element actually means something specific and needs to be styled, searched or extracted). Consider keeping such additional “hints” in the converted source using general mark-up (the equivalent of an HTML “span” with a “class” attribute) that can be ignored until a future project that wants to enrich the content needs it (a rough sketch of this idea appears after this list). In many cases it may be impossible to go back to the original source for such information once the converted content has its own editorial lifecycle (as the original source is now out of date). Keeping these hints may also be helpful if there is a requirement to “round trip” updated content back to its original format.
  • Avoid “tag abuse” (incorrectly marking up semantic content) which will ultimately lead to applications and products losing quality even if the initial product deliverable (e.g. a simple HTML view) functions correctly.
    • Do not “crowbar” your new content into some existing supported schema that does not really describe the new data.
    • Do not “flatten” important semantics that can automatically be derived from information in the source to presentational mark-up (e.g. bold, heading etc.) even if you have no initial use for that semantic.
  • Even if the content is to be converted from the source and delivered directly to a product format without an intermediate/ongoing XML editorial stage, do make a schema (or customize an existing framework) and record what semantics you do identify for an intermediate XML stage. This can provide business value and quality indicators in a number of ways. Even if the XML is not highly structured, bulk validation against known expected “styles” in a large corpus can act as a means to identify data anomalies or gaps in your understanding of the data that would not be obvious simply by looking at the results of an XSLT conversion.
  • Where compromises are made, do record the “technical debt” (see https://en.wikipedia.org/wiki/Technical_debt) which should be a concept most IT organizations understand. The key difference with content enrichment technical debt is that unlike remedial action for programming language source code (whose impact may be limited to simply rewriting the code and interfaces to other systems), the remedial actions for content may need a massive manual effort to add in semantics that are not part of the current data set. Even with off-shoring this can be costly and time consuming and affect the business viability of future deliverable products. The detail of what was done and why may be lost in the mists of time.
    How can we make proper business decisions without understanding the impact of such decisions now and in the future?
  • Identify where the approach taken for a given project varies from corporate standard approaches (content quality, duplication not re-use, schemas etc.) and may lead to inconsistency in the corporate knowledge store. This is especially dangerous where the lower quality/inconsistent quality material is later combined with other better sources as part of interactive or print products and customer-facing errors occur that are incorrectly perceived as being product issues. This mixing of good and bad content is equivalent to “polluting your content reservoir”. Ensure that any content that does not meet normal standards (but for business reasons needs to be delivered quickly for use in specific ways for specific projects) is clearly marked with searchable metadata identifying the specific source/batch and its “quality grade”. While this does not fix the content it does make it more traceable so that it can be found programmatically and remedied at a later date. When remedial action does occur to batches of the content, the metadata can then be updated allowing XSLT or other code to change its behaviour based on the quality found.
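As a minimal sketch of the “keep the style hints” idea above, the following rewrites any leftover presentational elements as generic spans with a class attribute so that the hint survives for future enrichment, while current outputs can simply ignore it. The element names and file names are assumptions for illustration, not a prescribed conversion approach.

    # preserve_hints.py - minimal sketch: keep presentational "hints" from the source
    # as generic span/class mark-up instead of discarding them during up-conversion.
    from lxml import etree

    STYLE_HINTS = {"b": "bold", "i": "italic"}

    doc = etree.parse("converted.xml")
    for tag, hint in STYLE_HINTS.items():
        for el in list(doc.getroot().iter(tag)):
            el.tag = "span"            # generic wrapper, ignored by current outputs
            el.set("class", hint)      # the original styling kept as a hint

    doc.write("converted-with-hints.xml", xml_declaration=True, encoding="UTF-8")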

While any project rarely gets to do the perfect job first time, when dealing with high volumes of valuable data it is vital that the impact of early decisions is understood and managed by the business as a whole, rather than causing an unwelcome surprise and costly delays as future needs evolve.

An alternative scenario to the one given at the start of this post is the case where the master source content is always updated outside of your control and re-supplied each time in a less structured format than is ideal. This can be most frustrating (and expensive) as errors constantly re-appear and require repeated correction.
​
In the next post I will discuss approaches that can be taken to check and maintain the quality of content throughout its lifecycle and some challenges that are often encountered.

Previous : The Cost of Quality
Next: 10 Things That Can Go Wrong

Data Quality for Publishers – Part 1: The Cost of Quality

25/9/2015


This is the first in a series of short posts discussing the subject of data quality for publishers. Subsequent posts in this series will describe the difficulty of ensuring the ongoing quality of structured information in a cost-effective manner and outline some effective solutions that can be applied.


All professional information publishers aspire to create and deliver good quality output. Whether you are a commercial publisher (your primary products are published data/information systems) or a corporate publisher (publishing in support of other products/services that you deliver), your organisation's name is stamped all over your output and its overall reputation is therefore intertwined with customers' perception of the content you deliver.

If you are an ultimate “luxury brand” you may go to enormous expense to ensure this reputation cannot be tarnished (the world's most expensive toilet roll is hand signed and dated by the maker and inspected by the firm's president: http://tinyurl.com/pyrm5ts), but for most organisations maintaining quality has to compete for budget. This can be an ever more challenging task where there is a drive to reduce costs and speed up time to market.
This pressure may lead to a re-assessment of what is “acceptable data quality” for a given project or data source. 

Data value 

 When deciding on the appropriate level of investment in data quality it is vital to assess the current and potential value of the data. Consider the following:
  • Accuracy - Is it vital that the information is highly accurate and accurately represented within the delivered product (and are you culpable if not)?
  • Exclusivity - Are you the sole owner/provider of the data? If so, customers may “take what they can get” now, but over time enrichment may be an even greater income generator as clients cannot get the data elsewhere.
  • Usefulness - Does the data provide a substantial benefit for your customers? 
  • Timeliness - Does the data only have value for a short period of time and therefore must be delivered quickly?
  • Longevity - Is the data something that would be kept and provide value over a long period of time?
  • Intelligence - Does the data become more valuable if it is enriched so that it is more functional (searchable, can answer questions, can be better manipulated by specific delivery mechanisms)?
  • Relationships - Can information within this content be linked to other information in your corpus (or that is publicly accessible or commercially available) in order to provide better value?
  • Re-use - At a document level, could this data be seen as part of a larger corpus for one or many applications or publications, published to one or more media types (e.g. online and print)? At a micro level, could this data be seen as a best-practice information component that could be re-used within many documents?

What is “acceptable quality” for your data?

When considering quality you need not only to ensure that your data is fit for purpose now, but also to consider how that fitness can be maintained over time.

As business requirements change (for example data that is only delivered for occasional reference on a low-value simple web-page may in future have to be delivered as part of an expensive printed book or delivered in a semantically rich form to third parties for re-purposing or analysis), it is likely the expectation within your business will be that the content is adaptable enough to meet these new needs in a timely fashion without excessive further investment.

This may create a dichotomy of opinion within your organisation with regard to minimal investment now (for current requirements only) versus ongoing investment assuring the data against potential future requirements (that may or may not ever arise). Taking an Agile perspective (a lean practice associated with software development), you only focus on delivering functionality that is needed now. However the difference with data (especially large volumes) is that once decisions are made which initially limit the quality and expressiveness of the data, the effort (often manual or semi-manual) and time taken to upgrade it later (often in many iterations) may far outweigh the cost of creating something more maintainable and adaptive initially.

This lack of quality and content agility can lurk unnoticed in your delivered products until such a time as a new product (or a change to an existing product's functionality) brings the issue to the surface. This can cause severe delays in delivering the new functionality, or delivery can commence but with potential reputational damage as customers notice irregularities. The remedial action can be time consuming and expensive, with the added complication that expert knowledge from the original source may have been lost, as the new master data set is the less semantically rich version now in your content management system.

As a long-time advocate of the use of structured information (XML and before that SGML) in publishing, I fully recommend its usage to make your content more robust and adaptive to change. It should be stressed however that the decision to use XML alone will not ensure quality adaptable content. These articles will focus on the importance of making the right choices for semantic data and providing processes and tools to ensure long-term success.

Note: Quality is of course not just about the structure of the data but also relates to the information itself. Users demand content that has the best information explained in a clear and consistent way, using the most appropriate language with correct spelling and grammar. Some of this can be controlled by the use of editing and linguistic analysis tools, but only if a clear “house style” is already in place.

Next: Quality and Content Enrichment

Welcome to the website...

17/4/2015

As the briefest of glances will tell you, this website is really new and is pretty devoid of any content other than that describing who I/we are and what we do. The full list of projects and experience will be maintained in my LinkedIn profile, which is accessible from the icon on the top right of every page.
Note: Apologies if the whole "I/we" thing gets confusing! When you are a single consultant represented by a company website, it is hard to know what to use when.

The website is, like other consultants' sites, primarily a marketing exercise to allow you and me to find each other for our mutual benefit. I do not intend to create an XML/XSLT/publishing resource centre here (there are many of those already) but I will try to add thoughts, practical experiences and links to helpful information held elsewhere as they occur in my working life.

So for now, welcome, and please let me know if you have any issues with the site or any suggested blog topics!  

    Author

    Colin Mackenzie
    Thoughts on XML publishing (when time permits!)

