Automation is today’s hot topic. The never-ending flood of information makes it impossible to maintain each file and each dataset individually, let alone manually. Meta data are the key to solving this problem. They allow information to be grouped and batch processed according to specified properties. For such processes to run smoothly, meta data must be captured in a structured form. This article explains why a structure, i.e. a meta data schema, is so important and what needs to be considered when developing one.
Why Do I Need a Meta Data Schema?
Machines are not designed to process unstructured data, whether they are short, simple scripts or AIs, because they lack the ability to interpret it logically. They need a specific, fixed structure. The more context there is for a particular bit of information and the more precisely its structure and meaning are defined, the lower the effort for automated processing will be and the more reliable and meaningful the results. A meta data schema is basically nothing more than a definition whose purpose is to make such context available for machine processing.
However, a schema isn’t just good for the use of meta data. It is also beneficial for data capture. Since a meta data schema defines what data must look like, many errors can be detected at the time the data is entered, regardless of whether that is done manually or (partially) automatically. In addition to avoiding errors, a good schema will also reduce the amount of work you have to put in: when the meaning and the relationships of the meta data are clearly defined, much of that data can be captured automatically or generated from other (meta) data.
The bottom line: A meta data schema…
- …facilitates effective, automated data processing and maintenance;
- …increases the quality of the meta data and with it their value;
- …reduces costs for capturing the meta data.
What Makes a Good Meta Data Schema?
The best schema is the one that best supports data input and data processing and makes both steps as easy as possible. A few basic rules will help you to develop a schema that optimally matches your data and its purpose.
1. Defining the Area of Use
What type of data should the meta data schema be applied to? A schema that matches all available data allows all of that data to be processed with the same automated routines. Very varied data, on the other hand, will have very few properties in common. Think about what kind of data you want to process (manage, search) together. That set of data should share one schema. The schema then does not have to take other types of data and formats into account. There is, of course, no reason not to reuse parts of the schema for other data.
2. Selecting the Right Fields
A meta data schema consists of so-called ‘fields’, each of which contains exactly one defined piece of information. It is well worth your while to think about which fields you will need and where you want the data to come from. The key question here is: What will be the purpose of the meta data? It is a complete waste of time to define a field that isn’t needed at all. The same goes for fields that can’t be filled out for a large portion of the datasets, because obtaining that information would be too costly or not possible at all.
The data should be split into its smallest possible components, because it is much easier and less error-prone to join two clearly defined fields than it is to break down the content of one field. You should therefore check, for each individual field you want to use, whether it combines two or more independent bits of information. If a particular combination of data is frequently needed in that form, you can always add another field for it, but that field should then be populated automatically to prevent contradictions.
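A minimal Python sketch of this idea: the record stores only the smallest components, and the combined value is computed from them rather than typed in. The class name `AuthorRecord` and its field names are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthorRecord:
    """Hypothetical meta data record that stores only atomic fields."""

    first_name: str
    last_name: str

    @property
    def display_name(self) -> str:
        # Derived automatically from the atomic fields, so it can never
        # contradict them the way a separately typed-in value could.
        return f"{self.first_name} {self.last_name}"


record = AuthorRecord(first_name="Ada", last_name="Lovelace")
print(record.display_name)
```

Joining the two fields is a one-line operation. Splitting a single free-text name field back into its parts would require guessing about middle names, titles and word order.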
3. Don’t Re-Invent the Wheel
Meta data has been in use for quite some time and in many areas. The need to exchange data has led to the development of robust, well-documented meta data schemas and exchange formats, which cover most of the requirements of a specific sector. Using a standard schema has a lot of advantages. Data provided by external sources can be used immediately and without any modifications, provided the same standard schema was used for its capture. Various tools and input forms are available for commonly used schemas, which further simplify data maintenance. And of course you save a lot of the time and effort you would otherwise have spent on creating your own schema. If you find that iiRDS, Dublin Core or MODS offers everything you need, then choosing one of these will in all likelihood be a better idea than developing your own schema tailored specifically to your data.
4. As Tight and Exact as Possible
The fewer selection options and freedoms a schema offers, the better. Every selectable option represents an opportunity for human error. Specify exactly what information must be entered in a field and how. Data types, drop-down lists and regular expressions (a language for describing character strings) are a great help here. They help you avoid typos and make sure that identical information always appears in the same format. But there are even simpler measures that offer plenty of benefits. In a “Ranking” field, for example, you can allow only numerical input from 1 to 6. A short explanation of the exact type of information a field refers to can also be very helpful.
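The following Python sketch shows how these three restrictions might look in practice. Only the “Ranking” field with its range of 1 to 6 comes from the text above. The `language` and `created` fields, the list of allowed languages and the date format are assumptions made for the sake of the example.

```python
import re
from enum import Enum


class Language(Enum):
    """Hypothetical drop-down list: only these values may be selected."""

    EN = "en"
    DE = "de"
    FR = "fr"


# Assumed format YYYY-MM-DD. The pattern checks the shape only,
# not whether the date actually exists in the calendar.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def validate(record: dict) -> list[str]:
    """Return a list of error messages; an empty list means the record is valid."""
    errors = []

    # Data type plus value range: an integer from 1 to 6.
    ranking = record.get("ranking")
    if isinstance(ranking, bool) or not isinstance(ranking, int) or not 1 <= ranking <= 6:
        errors.append("ranking must be an integer from 1 to 6")

    # Drop-down list: the value must be one of the predefined options.
    if record.get("language") not in {lang.value for lang in Language}:
        errors.append("language must be one of the predefined options")

    # Regular expression: the string must match the agreed format.
    created = record.get("created")
    if not isinstance(created, str) or not DATE_PATTERN.fullmatch(created):
        errors.append("created must have the format YYYY-MM-DD")

    return errors


print(validate({"ranking": 3, "language": "en", "created": "2024-01-31"}))
print(validate({"ranking": 7, "language": "english", "created": "31.01.2024"}))
```

The first record passes without errors, while the second violates all three rules. Because the check runs at input time, such errors can be reported before the record is ever stored.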
5. Optional or Mandatory
If you are planning to capture meta data automatically or to have it captured by experts, then every field that you know applies to all instances must be mandatory. Every person has a name, every file a format and every digital text an encoding. If such a field remains empty, the dataset either cannot be processed by all the processes that access it or will at least require special treatment. That significantly reduces the benefit of the schema.
There is, however, one case in which restricting the schema by keeping the number of mandatory fields as high as possible can also be a drawback: when the meta data is entered manually by people whose main responsibility is not the maintenance of that data. Too many mandatory tasks cost a lot of time, which can lead to a drop in motivation and with it to careless, faulty or even accidental input. Where that is the case, it may become necessary to think about how much time spent on data input makes sense in order to achieve the best possible data quality.
Optional fields will, of course, also be useful in automated data capture processes. A “Most recent renovation” field is a good idea in a meta dataset about a house, but it is not applicable to a new construction. Optional fields make sense where the fact that an input is missing is itself a statement.
In addition to all these basic rules, the rule of implementability must also be applied. If the cost of creating and maintaining a drop-down list is simply too high, or the technical implementation of the perfect schema would take too much time, then some compromise in terms of specificity will be unavoidable. But anyone who isn’t really sure right from the start what the perfect meta data schema should look like will find it difficult to implement the best possible schema anyway.
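The distinction can be sketched in Python roughly as follows, using the house example from above. The field names `address`, `year_built` and `last_renovation_year` are hypothetical, as is the choice of which of them are mandatory.

```python
# Assumed mandatory fields: they apply to every house without exception.
MANDATORY_FIELDS = ("address", "year_built")


def missing_mandatory_fields(record: dict) -> list[str]:
    """Return the names of mandatory fields that are absent or empty."""
    return [name for name in MANDATORY_FIELDS if record.get(name) in (None, "")]


def describe_renovation(record: dict) -> str:
    """Interpret the optional field: a missing value is itself a statement."""
    year = record.get("last_renovation_year")
    if year is None:
        return "never renovated"
    return f"renovated in {year}"


new_building = {"address": "1 Main Street", "year_built": 2023}
incomplete = {"address": "2 Main Street"}

print(missing_mandatory_fields(new_building))
print(describe_renovation(new_building))
print(missing_mandatory_fields(incomplete))
```

A process reading these records can rely on the mandatory fields without special treatment, and it can read the absence of the optional field as information rather than as an error.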
Done with your meta data schema? Then it is time for the next step: Capturing! Or would you rather stick with Create?