I have been working with metadata for the biggest part of my professional life so far, and I plan to keep working with it (old habits die hard, I guess). My experience revolves mainly around application profile design, application profile implementation and metadata assessment. In my PhD I used some simplified statistics for the latter and came up with some really interesting results (story – more info – presentation).
Although in my PhD these statistics were enough, since my contribution was mainly the process (MQACP) used to ensure metadata quality, in my professional life after the PhD they are not. A lot of work has been carried out, and is currently ongoing, on metadata assessment (interesting). And although metadata in education (my field) is not big as big data goes, it is not small either. And it needs to be looked at seriously…
So, being more of a metadata-design person with a certain technical background, I decided to dig a bit deeper into data mining and into the related statistical measures and methods.
I will do so knowing that data mining can come back and aid metadata design in my case, but also enhance data (and metadata) quality. You see, back in grad school I was not really interested in statistics, and the day has come that I regret it! Anyway, no reason to cry over spilt milk.
To do something about this, I have opened a book or two on data mining, and while reading my third one (Principles of Data Mining) I decided to start blogging about some of it in parallel: just to keep my notes up here, in case some of you out there have found yourselves in the same situation as me. So, there goes nothing!
First of all, when looking at big data, you are probably looking to do one of the following: classify, predict, associate and cluster, or simply calculate potential numerical values. The data you will be using will either be labelled or unlabelled. When the data are labelled, the data mining process is called supervised learning; when they are not, it is called unsupervised learning. In supervised learning, when the attribute you are predicting is categorical you are talking about classification of data, and when it is continuous you are talking about regression. Nice? Easy? So far, yes!
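To make that distinction concrete, here is a minimal sketch in Python. The metadata-flavoured attributes and labels are toy examples of my own, not from the book; the point is only the shape of the data in each setting.

```python
# Labelled data with a categorical target -> classification (supervised).
# Each record: (attribute values) -> class label.
classification_data = [
    ((42, 5), "good"),   # (title_length, num_keywords) -> quality label
    ((10, 0), "poor"),
    ((35, 3), "good"),
]

# Labelled data with a continuous target -> regression (supervised).
regression_data = [
    ((42, 5), 0.9),      # (title_length, num_keywords) -> completeness score
    ((10, 0), 0.2),
    ((35, 3), 0.7),
]

# Unlabelled data -> unsupervised learning (clustering, association rules).
unlabelled_data = [(42, 5), (10, 0), (35, 3)]
```

Same attributes in all three cases; what changes is whether a label exists at all, and whether it is a category or a number.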
To classify in supervised learning, you may use Nearest Neighbour Matching, Classification Rules, or Classification (or Decision) Trees. To predict a numerical value, you may use Neural Networks. In unsupervised learning, you may use a set of data to discover association rules between the data (usually with a probability attached to them): you state that IF something happens, THEN something else happens or applies, with a probability of X%.
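Two of these ideas are small enough to sketch directly: nearest-neighbour matching (here the simplest 1-nearest-neighbour variant, with Euclidean distance) and the probability attached to an IF/THEN association rule, estimated as rule confidence. All data and names below are made up for illustration.

```python
import math

def nearest_neighbour_classify(training_set, new_instance):
    """Return the label of the training record closest to new_instance
    (1-nearest-neighbour, Euclidean distance)."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, label = min(training_set, key=lambda rec: distance(rec[0], new_instance))
    return label

# Hypothetical labelled records: (attribute values, class)
training = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"),
]
print(nearest_neighbour_classify(training, (0.9, 1.1)))  # → A

# Association rule "IF milk THEN bread": its probability (confidence)
# is estimated as count(milk and bread) / count(milk).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
]
with_milk = [t for t in transactions if "milk" in t]
confidence = sum("bread" in t for t in with_milk) / len(with_milk)
print(f"IF milk THEN bread, with probability {confidence:.0%}")  # → 67%
```

Real nearest-neighbour methods usually vote over the k closest records rather than just one, but the matching idea is the same.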
Clustering can also be used in unsupervised learning to group things/data/records/etc. that are similar. Then, when the next record comes along, you can assign it to a group based on the existing clustering and its characteristics.
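That last step, assigning a new record to an existing cluster, can be sketched by measuring the distance to each cluster's centre. The centroids below are hypothetical, as if produced by an earlier clustering run over metadata records.

```python
import math

def assign_to_cluster(centroids, record):
    """Assign a record to the cluster whose centroid is nearest
    (Euclidean distance)."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(centroids, key=lambda name: distance(centroids[name], record))

# Hypothetical centroids from an earlier clustering of (title_length, num_keywords)
centroids = {
    "sparse records": (10.0, 1.0),
    "rich records": (40.0, 6.0),
}
print(assign_to_cluster(centroids, (38.0, 5.0)))  # → rich records
```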
And just to be clear about a thing or two: in data mining we have objects that have variables or attributes (like the height [variable] of a person [object]), so variable and attribute will be used interchangeably. Each set of attribute values is considered an instance or a record for this object. Pretty straightforward.
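In code, that terminology maps onto something as simple as a dictionary per object (the attribute names here are my own toy example):

```python
# One object (a person), described by variables/attributes.
person = {"height": 1.78, "age": 34, "eye_colour": "brown"}

attributes = list(person.keys())   # the variable/attribute names
instance = tuple(person.values())  # one instance (record) of the object
```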
So, for the time being I think this is enough. I will be coming back with the continuation of this series of posts, increasing the numbering to Data Mining 102, Data Mining 103, etc. Hopefully by the end I will have nice documentation of my learning path and a nice guide for anyone who wishes to get a quick overview of data mining in general.
DISCLAIMER: These are my notes from reading various books and are not copied in any way straight from the publications.