In my previous post (http://tinyurl.com/od5xae7) we looked at decisions made during the early days of a data modelling and conversion project. But that does not mean potential quality issues are at an end.
Let’s assume you have:
- an XSD or Relax NG schema to guide authors and validate results;
- all your content nicely converted into XML and it validates against the schema;
- content uploaded into a content management system;
- an XML editing solution to update the content and add new content; and
- a publishing process that takes the XML content and pushes it through a transformation pipeline to produce one or more products (e.g. searchable HTML on a website).
Here are 10 things to watch out for and some brief suggestions on how to mitigate quality risks:
1. Re-importing external content
- Perform bulk validation against your DTD/XSD/RNG schema to ensure that the files are at least structurally correct (see the validation sketch after this list).
- Perform bulk validation using Schematron to ensure that the business rules within each file are met; there may be combinations of elements and attributes that the schema allows but that should never occur, or more advanced rules such as naming conventions for IDs.
- Import the data into an XML database like BaseX (free) and run XQuery queries that give you confidence the data matches your expectations (e.g. how many documents do you expect to be badged with metadata as being “about” some subject, or to contain a certain phrase in the title?). You can also check for data that you do not expect to find (e.g. a paragraph containing nothing but text inside an emphasis element is perhaps a title that has not been semantically identified). Such queries can be saved and reused as a quality report (also for content generated within the CMS or exported as part of a bulk publishing process); see the query sketch after this list. Note: if your main CMS is XML-aware then you could always do the import into a test document collection and run the tests there.
- If you have no linguistic consistency tools to help, perform a manual “spot check” of the files, checking spelling, grammar and any controlled vocabulary.
- If your publishing processes are automated and can be run standalone (i.e. not only from within your CMS) for regression or functional tests (see later), push the test data through the process to ensure it delivers what is needed for your end products.
- When you do accept a new batch of content into the system, ensure that the batch is clearly marked with metadata so that if any errors are found later you can easily isolate the content in the CMS and take appropriate action (bulk correction, roll-back to previous version etc.).
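The first two checks can be scripted in a few lines. A minimal sketch using Python and lxml; the schema and batch directory paths are assumptions for illustration:

```python
from pathlib import Path
from lxml import etree
from lxml.isoschematron import Schematron

# Paths are assumptions -- substitute your own schema and batch locations.
xsd = etree.XMLSchema(etree.parse("schemas/content.xsd"))
rules = Schematron(etree.parse("schemas/business-rules.sch"))

for path in Path("imported-batch").rglob("*.xml"):
    try:
        doc = etree.parse(str(path))
    except etree.XMLSyntaxError as err:
        print(f"NOT WELL-FORMED: {path}: {err}")
        continue
    if not xsd.validate(doc):              # structural check
        print(f"SCHEMA FAIL:   {path}: {xsd.error_log.last_error}")
    if not rules.validate(doc):            # business-rule check
        print(f"BUSINESS FAIL: {path}: {rules.error_log.last_error}")
```

For a Relax NG schema, swap etree.XMLSchema for etree.RelaxNG; the rest is unchanged.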
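For the expectation queries, BaseX and XQuery are the tools suggested above; purely to show the shape of such checks, here is a lighter stand-in using XPath from Python (element names such as subject and emphasis are assumptions about your schema):

```python
from pathlib import Path
from lxml import etree

docs = [etree.parse(str(p)) for p in Path("imported-batch").rglob("*.xml")]

# Expectation check: count documents badged as being "about" a subject
# and compare against what you expected the batch to contain.
tagged = [d for d in docs if d.xpath('//metadata/subject[@value="cycling"]')]
print(f'{len(tagged)} documents badged with subject "cycling"')

# Smell test: a paragraph whose only content is a single emphasis
# element is often a title that was never semantically identified.
for d in docs:
    if d.xpath('//p[count(*)=1 and emphasis and not(text()[normalize-space()])]'):
        print("Possible mis-tagged title in:", d.docinfo.URL)
```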
2. Reliance on WYSIWYG editing
3. Check-In of Invalid Content to your CMS
4. Metadata consistency
- Metadata inside and outside the XML may get out of sync – the publishing process may need access to the metadata in the XML files, whereas for a non-XML-aware CMS the data may need to be extracted into database fields for indexing, so that it can be used for quick searches and for exporting certain categories of content. Either the data should be held in only one place (this may work for an XML CMS), or code must be written that automatically seeds the metadata from the master (database field or XML element) into the slave, either in the CMS or at the point of publishing.
- Metadata classification can become corrupt/bloated with incorrect entries – hopefully you will have developed a taxonomy allowing your subject matter to be described according to an agreed, constrained set of choices (so that authors do not describe one piece of content as “cycles” and another similar piece as “bicycles”) and to the appropriate level of detail. In some cases such constraints can be expressed as a choice in an XML schema/DTD, but this may not be enough to maintain consistency if:
- The metadata is held only in the CMS database fields (see the previous point on “sync”) and not entered into the XML.
- The taxonomy is hierarchical, complex and growing, and is therefore held externally from the schema. In this case specific tools may need to be developed to help guide the user or to validate their choices against definitions held in a centrally maintained data store (e.g. an external XML file checked using Schematron).
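Where the taxonomy is held externally, the validation can be as simple as a script that checks every classification term against the central vocabulary. A rough sketch; the vocabulary file format and the subject metadata element are hypothetical:

```python
from pathlib import Path
from lxml import etree

# Hypothetical centrally maintained vocabulary: <term id="bicycles"/> etc.
vocab = {t.get("id") for t in etree.parse("taxonomy/vocabulary.xml").iter("term")}

for path in Path("content").rglob("*.xml"):
    doc = etree.parse(str(path))
    for subject in doc.iter("subject"):   # assumed metadata element
        term = subject.get("value")
        if term not in vocab:
            print(f"{path}: unknown subject term {term!r}")
```

The same lookup can be expressed as a Schematron rule using the document() function, which keeps the check inside your existing validation step.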
5. Links - cross-references and transclusions
A simpler quality issue with links is the use of links to external URLs: each URL needs to be checked to confirm it is correct and still functional.
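This is straightforward to automate. A minimal sketch using Python's requests library, assuming links are carried in href attributes (adjust the XPath to your schema):

```python
from pathlib import Path
from lxml import etree
import requests

checked = set()
for path in Path("content").rglob("*.xml"):
    doc = etree.parse(str(path))
    for href in doc.xpath("//@href"):
        if not href.startswith(("http://", "https://")) or href in checked:
            continue
        checked.add(href)
        try:
            # HEAD keeps the check cheap; some servers only answer GET.
            status = requests.head(href, allow_redirects=True, timeout=10).status_code
        except requests.RequestException as err:
            status = str(err)
        if status != 200:
            print(f"{path}: {href} -> {status}")
```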
6. Evolving business needs / evolving quality expectations
These problems can be avoided by ensuring that the final product process is isolated from the CMS/publishing process. The CMS publishing process should be able to create content that is backwards compatible with an agreed delivery schema, to ensure old products do not break. This schema can then be used to validate the generated content before it is pushed through the different product delivery pipelines. If a validation issue occurs at this stage then it is the author's responsibility to fix it, whereas if issues are found in the product then the investigation should start with the product developers.
This process should be implemented via an XML pipeline tool such as XProc or Ant that allows multiple stages of XSLT transformation and validation.
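XProc or Ant is the right home for this; purely to illustrate the gate itself, here is the same idea reduced to a few lines of Python with lxml (the stylesheet and schema names are assumptions):

```python
from lxml import etree

# Transform CMS-internal XML into the agreed delivery format...
to_delivery = etree.XSLT(etree.parse("pipeline/to-delivery.xsl"))
delivery_schema = etree.XMLSchema(etree.parse("schemas/delivery.xsd"))

def publish(source_path):
    result = to_delivery(etree.parse(source_path))
    # ...and refuse to hand anything invalid to the product pipelines.
    # A failure here is the author's problem; a failure found later in
    # a product starts with the product developers.
    delivery_schema.assertValid(result)
    return result
```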
7. Internal data can leak into the outside world
This approach can also be used to check whether published output contains any material marked with an inappropriate security classification or end-user/product licensing scheme.
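Such a check might look like the following sketch, assuming a hypothetical security attribute carries the classification:

```python
from lxml import etree

def assert_no_leaks(tree):
    """Fail the publish if internally classified material survives.

    The security attribute and its "public" value are assumptions --
    substitute whatever your metadata scheme actually uses.
    """
    leaks = tree.xpath('//*[@security and @security!="public"]')
    if leaks:
        raise ValueError(f"{len(leaks)} element(s) carry a non-public classification")
```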
8. Different products may need different or additional content
Examples of this would include:
- XML delivered to a customer-facing schema
- Additional files describing contents of collections (e.g. SCORM manifests)
- Additional files containing supplementary information (e.g. pre-processed tables of contents, lists of words to appear in “word wheels” in search dialogs).
9. Avoiding duplication
10. Globally Unique and Persistent IDs
Permanent, globally unique IDs should be created as early in the process as possible and stored in the content so that they remain persistent across content publishes.
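A sketch of stamping such IDs at import time; the UUID scheme, the xml:id attribute and the list of addressable elements are all illustrative choices:

```python
import uuid
from lxml import etree

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # the xml:id attribute

def ensure_ids(tree):
    # Stamp a permanent ID on every addressable unit that lacks one.
    # Existing IDs are never regenerated, so they persist across publishes.
    for elem in tree.iter("section", "figure", "table"):  # assumed units
        if elem.get(XML_ID) is None:
            elem.set(XML_ID, "id-" + uuid.uuid4().hex)
```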