This is the second in a series of short posts discussing the subject of data quality for publishers. In this post we will look at resourcing and decision-making during the initial content analysis and modelling phases, and at how these can affect the quality and maintainability of your data.
As outlined in the first post (http://tinyurl.com/nnhbxcj) of this series, content developers are often under pressure to deliver quick results when adding a new source to a publisher’s corpus of valuable data. These demands may not be unreasonable, as timelines are often driven by an urgent business need such as competitive pressure to deliver a new product. Another driver for quick content results may be that the product is being developed in an Agile way, and initial data is needed to support deliverables from development sprints. This article will discuss how a content developer can approach the analysis and modelling tasks within such strictures, and how the business should compensate for the impact of any short cuts so that they do not have a permanent effect on quality.
The scenario is a common one. The business has acquired some content (or has decided to move away from a legacy non-XML workflow) in an unstructured, inconsistent format (MS Word, FrameMaker, HTML, etc.), and the new content is to be added to the corporate repository and, in future, delivered to customers in a variety of formats and products.
The ideal way forward is to follow best practice for the content analysis, logical model and physical schema (ideally re-using a suitable existing standard or customizing a framework such as DITA). Once the full data semantics have been properly identified, you then convert the data (including manual enrichment where required) to match your schema and fully validate its quality over several iterations using a variety of techniques. However, time or initial cost constraints may prevent best practices from being followed, or may mean that enrichment tasks perceived as not currently delivering value (i.e. not required for the current product deliverable) are dropped as low priority. So what should you do?
- No matter what the constraints are, do concentrate on talking to subject matter experts; really understand the data (not just how it is currently styled or marked up) and its consistency before documenting the full logical semantic model. Play the semantic model back to the subject matter experts to confirm your assumptions (these data experts may not be around when you come to review or improve the data in future). If you are part of an Agile project, you can perform these tasks as part of early sprints, delivering knowledge and designs without necessarily having to deliver misleading data.
- If you need to make compromises in what will be delivered, document the short-term model alongside the full semantic model and illustrate the effect of each compromise. This is vital: even if the business ultimately decides not to convert or improve the data to its full semantic potential initially, it cannot judge what knowledge and opportunities may be lost unless it understands the knowledge that should exist within the content.
- Ensure that you establish that the newly converted XML content will be the authoritative, maintained single source for new publications. This will allow ongoing semantic improvements (along with keeping content current) to be performed over time. If the content is NOT the master content (i.e. the source will be updated externally or in some other form), ensure the business understands that the costs incurred for the initial project will be repeated EVERY time the content is updated, and that ongoing semantic enrichment will not be cost effective in most cases.
- While you cannot rely on styling mark-up in the source material to fully reveal semantics (items may be headings or bold in the presentation format for quite different underlying reasons), the style hints do provide a record of which content deserved special treatment (formatting). If short cuts are made during conversion and this styling is discarded (and not replaced with appropriate full semantic mark-up), hints may be lost that could have been used for content improvement at a later date (e.g. over time you discover that the bold text inside the first item of a list inside a specific element actually means something specific and needs to be styled, searched or extracted). Consider keeping such additional “hints” in the converted source using general mark-up (the equivalent of an HTML “span” with a “class” attribute) that can be ignored until a future enrichment project needs it (see the first sketch after this list). In many cases it is impossible to go back to the original source for such information once the converted content has its own editorial lifecycle (as the original source is now out of date). Keeping the hints may also help if updated content must be “round tripped” back to its original format.
- Avoid “tag abuse” (marking up content with elements that do not match its real meaning), which will ultimately lead to applications and products losing quality even if the initial product deliverable (e.g. a simple HTML view) functions correctly.
- Do not “crowbar” your new content into some existing supported schema that does not really describe the new data.
- Do not “flatten” important semantics into presentational mark-up (e.g. bold, heading, etc.) when they can be derived automatically from information in the source, even if you have no initial use for those semantics.
- Even if the content is to be converted from the source and delivered directly to a product format without an intermediate or ongoing XML editorial stage, do create a schema (or customize an existing framework) and record whatever semantics you do identify for an intermediate XML stage. This can provide business value and quality indicators in a number of ways: even if the XML is not highly structured, bulk validation of a large corpus against known expected “styles” can identify data anomalies, or gaps in your understanding of the data, that would not be obvious simply by looking at the results of an XSLT conversion (see the second sketch after this list).
- Where compromises are made, do record the “technical debt” (see https://en.wikipedia.org/wiki/Technical_debt), a concept most IT organizations understand. The key difference with content-enrichment technical debt is that, unlike remedial action for programming language source code (whose impact may be limited to rewriting the code and its interfaces to other systems), remedial action for content may require a massive manual effort to add semantics that are not part of the current data set. Even with off-shoring this can be costly and time consuming, and it can affect the business viability of future deliverable products. The detail of what was done and why may otherwise be lost in the mists of time: how can the business make proper decisions without understanding the impact of those decisions, now and in the future?
- Identify where the approach taken for a given project varies from corporate standard approaches (content quality, duplication rather than re-use, schemas, etc.) and may lead to inconsistency in the corporate knowledge store. This is especially dangerous where lower-quality or inconsistent material is later combined with better sources in interactive or print products, and customer-facing errors occur that are incorrectly perceived as product issues. This mixing of good and bad content is equivalent to “polluting your content reservoir”. Ensure that any content that does not meet normal standards (but for business reasons needs to be delivered quickly for use in specific ways for specific projects) is clearly marked with searchable metadata identifying the specific source/batch and its “quality grade” (see the third sketch after this list). While this does not fix the content, it does make it traceable, so that it can be found programmatically and remedied at a later date. When remedial action is applied to batches of the content, the metadata can be updated, allowing XSLT or other code to change its behaviour based on the quality found.
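To make the style-hints point concrete, here is a minimal sketch of a converted fragment that keeps a record of source styling whose meaning is not yet understood. The names used (“phrase”, “role”, “src-bold”) are hypothetical, not taken from any particular standard:

```xml
<!-- Converted list item: we do not yet know WHY this run was bold in the
     source, so the styling is recorded as a neutral hint (the equivalent
     of an HTML span with a class) rather than discarded or guessed at. -->
<list>
  <item><phrase role="src-bold">XJ-200</phrase> replacement filter cartridge</item>
  <item>Standard filter housing</item>
</list>
```

If you later establish that bold text in the first list item is, say, a part number, these hints can be upconverted to true semantic mark-up in bulk; without them, recovering that knowledge would mean manual rework against a source that is by then out of date.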
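The bulk validation against expected “styles” mentioned above could, as one possible technique, be expressed as a Schematron check run across the whole corpus. This sketch assumes the hypothetical element names from the fragment above:

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="item">
      <!-- In our (assumed) analysis, at most one bold run per item is
           expected; more than one suggests a source anomaly or a gap in
           our understanding of the data. -->
      <report test="count(phrase[@role='src-bold']) > 1">
        Unexpected extra bold run: review this item against the source.
      </report>
    </rule>
  </pattern>
</schema>
```

Failures reported in bulk like this surface patterns that would never be spotted by eyeballing the output of an XSLT conversion.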
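Finally, the batch and “quality grade” metadata from the last point might look something like the sketch below. The attribute names and grade values are illustrative assumptions, as is the XSLT that varies its behaviour based on the grade:

```xml
<!-- Hypothetical batch metadata recorded on each converted document -->
<document source-batch="legacy-2015-06" quality-grade="unreviewed">
  <title>Filter maintenance</title>
</document>
```

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Render fast-tracked, not-yet-remediated batches with a visible
       marker; once a batch is remediated and its metadata updated, the
       plain template below applies instead. -->
  <xsl:template match="document[@quality-grade='unreviewed']">
    <div class="draft-quality">
      <xsl:apply-templates/>
    </div>
  </xsl:template>
  <xsl:template match="document">
    <div><xsl:apply-templates/></div>
  </xsl:template>
</xsl:stylesheet>
```

Because the grade lives in searchable metadata rather than in people’s memories, both remediation planning and delivery code can act on it programmatically.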
While a project rarely gets to do the perfect job first time, when dealing with high volumes of valuable data it is vital that the impact of early decisions is understood and managed by the business as a whole, rather than causing unwelcome surprises and costly delays as future needs evolve.
An alternative scenario to the one given at the start of this post is the case where the master source content is always updated outside of your control and re-supplied each time in a less structured format than is ideal. This can be most frustrating (and expensive), as errors constantly re-appear and require repeated correction.
In the next post I will discuss approaches that can be taken to check and maintain the quality of content throughout its lifecycle and some challenges that are often encountered.
Previous: The Cost of Quality
Next: 10 Things That Can Go Wrong