I am still poking around data mining theory, continuing my read of Principles of Data Mining. It seems like the math keeps getting more and more complicated as we go, but luckily there are some terrific appendices that help a lot. So, here goes another “half-baked” post containing fewer facts and more fiction.
The first things I came across the other day were the Naive Bayes and Nearest Neighbour classifiers, which were not completely new to me, as I had encountered them once or twice in the past without going into great depth. I refreshed my classifier theory a bit and immediately started thinking (I am guessing you will find something relevant out there if you look) about how these types of classification algorithms could be used to classify metadata instances: predicting their completeness (because predicting overall quality seems a bit far-fetched) and acting as a predictor of metadata record quality, raising red flags over records that follow a similar problematic pattern.
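To make the idea a bit more concrete, here is a minimal sketch of how a categorical Naive Bayes classifier could flag records. The record fields (resource type, presence of a title, presence of keywords) and the "complete"/"incomplete" labels are entirely made up for illustration — this is not the book's example nor my actual data:

```python
from collections import Counter

# Invented training data: (resource_type, has_title, has_keywords, label).
records = [
    ("image", True,  True,  "complete"),
    ("image", True,  False, "incomplete"),
    ("text",  True,  True,  "complete"),
    ("text",  False, False, "incomplete"),
    ("text",  True,  True,  "complete"),
    ("image", False, False, "incomplete"),
]

def naive_bayes_predict(records, features):
    """Classify a new record with a plain categorical Naive Bayes."""
    labels = Counter(r[-1] for r in records)
    scores = {}
    for label, label_count in labels.items():
        # Start from the class prior P(label).
        score = label_count / len(records)
        for i, value in enumerate(features):
            # Multiply by P(feature_i = value | label), with add-one
            # smoothing (each toy feature here has two possible values).
            matches = sum(1 for r in records if r[-1] == label and r[i] == value)
            score *= (matches + 1) / (label_count + 2)
        scores[label] = score
    return max(scores, key=scores.get)

print(naive_bayes_predict(records, ("image", True, True)))  # → complete
```

The "naive" part is the assumption that the fields are conditionally independent given the class — almost certainly false for real metadata, yet the classifier often works well enough to be useful as a red-flag mechanism.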
On top of that, thinking about the examples I read in the book, it could be interesting to carry out an experiment that links annotators’ expertise to the actual metadata instances, to see whether specific metadata can be correlated with specific characteristics of human annotators. It seems reasonable, but again, finding a connection and proving it is more difficult than just discussing it. Luckily, I have a bunch of metadata records from my PhD with names and addresses on them, which means I could try to see how a person correlates with the records they produce in terms of completeness. For example, does a specific annotator regularly ignore field X, or do they only complete it for specific types of resources because it’s easier to figure out? Answers to questions like these could really help annotation tasks.
With this in mind, I dove into the chapter about decision trees, which were more familiar to me, as I had worked with them a lot more in the past. It really seems that decision trees are the kind of thing we do very often in our everyday lives — or maybe it’s just me: we often find ourselves weighing alternatives and outcomes much like a decision tree chooses an attribute to split on. Reading about entropy as a means of selecting the attribute to split the tree on, I found myself on familiar ground, as entropy is also frequently used to assess the richness of the descriptions (attribute values) in a metadata field. It also made great sense to try to maximize information gain when selecting the attribute to split the decision tree on.
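As a quick refresher for myself, entropy and the information gain of a candidate split can be computed like this (the metadata-flavoured example values are invented for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction obtained by splitting on the attribute at attr_index."""
    base = entropy(labels)
    # Partition the labels by the attribute's value.
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attr_index], []).append(label)
    # Weighted average entropy of the resulting subsets.
    weighted = sum(len(subset) / len(labels) * entropy(subset)
                   for subset in by_value.values())
    return base - weighted

# Toy example: resource type perfectly separates the two classes,
# so splitting on it recovers the full 1 bit of entropy.
rows = [("image",), ("image",), ("text",), ("text",)]
labels = ["complete", "complete", "incomplete", "incomplete"]
print(information_gain(rows, labels, 0))  # → 1.0
```

Choosing the split that maximizes this gain is exactly the greedy step of algorithms like ID3.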
Then I read about the Gini Index of Diversity, which can also aid attribute selection by acting as a measure of the impurity of a dataset. When all instances in a dataset belong to the same class, its value is zero; it takes its largest value, 1 − 1/K for K classes, when the instances are distributed evenly among the classes. Last but not least, Gain Ratio is also a useful tool for attribute selection, as it reduces the inductive bias introduced by plain information gain, which tends to favour attributes with many distinct values.
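The Gini index itself is a one-liner; a small sketch, just to check the boundary values mentioned above:

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

print(gini_index(["a", "a", "a", "a"]))  # pure dataset → 0.0
print(gini_index(["a", "b", "c"]))       # K=3 evenly spread → 1 - 1/3
```

For attribute selection it is used just like entropy: pick the split with the lowest weighted average impurity across the resulting subsets.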
Then the book went on to estimating the predictive accuracy of a classifier, which was also an interesting part. The discussion covered the classic “train and test” approach that most of us are already familiar with, but also the different cross-validation approaches. I really benefited from the part on the confusion matrix (a term I was not aware of) and also took a trip down memory lane with true and false positives.
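Just to pin the term down for myself, here is a toy binary confusion matrix for a hypothetical "complete"/"incomplete" setting — both the actual labels and the predictions below are invented:

```python
def confusion_matrix(actual, predicted, positive="complete"):
    """2x2 confusion counts for a binary classifier."""
    pairs = list(zip(actual, predicted))
    return {
        "TP": sum(a == positive and p == positive for a, p in pairs),
        "FP": sum(a != positive and p == positive for a, p in pairs),
        "FN": sum(a == positive and p != positive for a, p in pairs),
        "TN": sum(a != positive and p != positive for a, p in pairs),
    }

actual    = ["complete", "complete", "incomplete", "incomplete", "complete"]
predicted = ["complete", "incomplete", "incomplete", "complete", "complete"]

cm = confusion_matrix(actual, predicted)
accuracy = (cm["TP"] + cm["TN"]) / len(actual)
print(cm, accuracy)  # → {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 1} 0.6
```

Accuracy is just the diagonal of the matrix over the total, which is why the book pairs it with cross-validation: the same counts computed on held-out folds give a far more honest estimate than counts on the training data.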
Contemplating all of the above, I am thinking that I would be interested in a system where metadata records could be classified upon creation into existing classes of records: classes that would automatically cause specific metadata quality assurance methods to be invoked and applied to them, and that would allow repository managers to take the necessary measures to preserve the records, enhance them, and make sure they are retrofitted to the specific context in which they will be used.
Anyhow, it has been a really useful read so far, giving me quite a few ideas and directions related to metadata assessment. Some of them have already been explored in the relevant literature and experiments, but others have not, so I am really looking forward to reading the next chapters of this book and trying to connect some dots between data mining and metadata quality.
As I said before, this is a “half-baked”, ongoing post, so I would love to look at any related literature you may be kind enough to share with me. I am now moving on to the chapter about continuous attributes… Wish me luck! 😉