This is the first in a series of short posts discussing the subject of data quality for publishers. Subsequent posts in this series will describe the difficulty of ensuring the ongoing quality of structured information in a cost-effective manner and outline some effective solutions that can be applied.
All professional information publishers aspire to create and deliver good quality output. Whether you are a commercial publisher (your primary products are published data/information systems) or a corporate publisher (publishing in support of other products/services that you deliver), your organisation’s name is stamped all over your output, and its reputation is therefore intertwined with customers’ perception of the content you deliver.
If you are an ultimate “luxury brand” you may go to enormous expense to ensure this reputation cannot be tarnished (the world’s most expensive toilet roll is hand signed and dated by the maker and inspected by the firm’s president http://tinyurl.com/pyrm5ts), but for most organisations maintaining quality has to compete for budget. This can be an ever more challenging task where there is a drive to reduce costs and speed up time to market.
This pressure may lead to a re-assessment of what is “acceptable data quality” for a given project or data source.
When deciding on the appropriate level of investment in data quality it is vital to assess the current and potential value of the data. Consider the following:
- Accuracy - Is it vital that the information is highly accurate and accurately represented within the delivered product (and are you culpable if not)?
- Exclusivity - Are you the sole owner/provider of the data? If so, customers may “take what they can get” now, but over time enrichment may become an even greater income generator, as clients cannot get the data elsewhere.
- Usefulness - Does the data provide a substantial benefit for your customers?
- Timeliness - Does the data only have value for a short period of time and therefore must be delivered quickly?
- Longevity - Is the data something that would be kept and provide value over a long period of time?
- Intelligence - Does the data become more valuable if it is enriched so that it is more functional (searchable, can answer questions, can be better manipulated by specific delivery mechanisms)?
- Relationships - Can information within this content be linked to other information in your corpus (or that is publicly accessible or commercially available) in order to provide better value?
- Re-use - At a document level, could this data form part of a larger corpus, or of one or many applications or publications published to one or more media types (e.g. online and print)? At a micro level, could this data be treated as a best-practice information component that could be re-used within many documents?
What is “acceptable quality” for your data?
When considering quality you need to ensure not only that your data is fit for purpose now but also that it can be kept fit for purpose over time.
As business requirements change (for example, data that today is delivered only for occasional reference on a simple, low-value web page may in future have to be delivered as part of an expensive printed book, or in a semantically rich form to third parties for re-purposing or analysis), it is likely that the expectation within your business will be that the content is adaptable enough to meet these new needs in a timely fashion without excessive further investment.
This may create a dichotomy of opinion within your organisation between minimal investment now (for current requirements only) and ongoing investment that future-proofs the data against requirements that may or may not ever arise. Taking an Agile perspective (a lean practice associated with software development), you would focus only on delivering functionality that is needed now. However, the difference with data (especially large volumes of it) is that once decisions are made which limit the quality and expressiveness of the data, the effort (often manual or semi-manual) and time taken to upgrade it later (often over many iterations) may far outweigh the cost of creating something more maintainable and adaptable initially.
This lack of quality and content agility can lurk unnoticed in your delivered products until a new product (or a change to an existing product's functionality) brings the issue to the surface. This can cause severe delays in delivering the new functionality, or delivery can go ahead but with potential reputational damage as customers notice irregularities. The remedial action can be time-consuming and expensive, with the added complication that expert knowledge from the original source may have been lost, as the new master data set is the less semantically rich version now in your content management system.
As a long-time advocate of the use of structured information (XML and before that SGML) in publishing, I fully recommend its use to make your content more robust and adaptable to change. It should be stressed, however, that the decision to use XML alone will not ensure quality, adaptable content. These articles will focus on the importance of making the right choices for semantic data and providing processes and tools to ensure long-term success.
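As a minimal sketch of what “the right choices for semantic data” can mean in practice (the element and attribute names below are purely illustrative and are not taken from any particular DTD or schema), compare a presentational capture of a piece of content with a semantically enriched version of the same content:

```xml
<!-- Presentational capture: reads correctly on the page, but to software it is
     just a paragraph of styled text -->
<para><bold>Aspirin</bold> 300–600 mg every 4 hours as required.</para>

<!-- Semantic capture (illustrative markup): the same content, now searchable,
     validatable and re-usable across products -->
<dosage>
  <substance id="aspirin">Aspirin</substance>
  <dose min="300" max="600" unit="mg"/>
  <frequency every="4" unit="hours" asRequired="true"/>
</dosage>
```

The second form can answer questions (“which entries allow a dose above 500 mg?”), can be linked to other data sets via the identifier, and can be rendered to print or online outputs without re-keying; the first can only ever be displayed as written. This is the kind of expressiveness that is cheap to capture up front and expensive to retrofit later.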
Note: Quality is of course not just about the structure of the data but also relates to the information itself. Users demand content that presents the best information in a clear and consistent way, using the most appropriate language with correct spelling and grammar. Some of this can be controlled by the use of editing and linguistic analysis tools, but only if a clear “house style” is already in place.
Next: Quality and Content Enrichment