Artificial intelligence (AI) has been a rising technology for several years. For decades, computer experts had to write rules and conditions by hand to make software behave according to a specification. With artificial intelligence, we have the opportunity to automate this step and let computers identify the rules. Experts no longer need full knowledge of a specific domain. Instead, they ask computers to learn and interpret the structure of real-world data.
Using a data set of real-world data, a classification algorithm can be trained to assign unseen data to one of several categories. The data set has a simple structure: every element is represented by its attributes (so-called features) and its category. The algorithm takes the features and analyses them for specific criteria. Those criteria help it to distinguish between the categories and to assign unseen data to them.
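To make the training step more concrete, here is a minimal sketch in Python. It assumes a multinomial Naive Bayes model with the words of a document as features. This is a common textbook technique chosen only for illustration, not the classifier described in this article. The function names and the four training sentences are made up.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split on whitespace; each word is one feature."""
    return text.lower().split()


def train(examples):
    """Count words per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter()              # category -> number of documents
    vocabulary = set()
    for text, category in examples:
        words = tokenize(text)
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary


def classify(model, text):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        # Prior: share of training documents that belong to this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # words never seen during training carry no signal
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Illustrative, made-up training data: (document text, category).
training_data = [
    ("we offer the following price for the requested services", "offer"),
    ("our offer is valid for thirty days", "offer"),
    ("the parties agree to the terms of this contract", "contract"),
    ("this contract is signed by both parties", "contract"),
]
model = train(training_data)
prediction = classify(model, "both parties signed the terms")
print(prediction)
```

In this sketch the "criteria" are simply word frequencies per category: words such as "parties" and "signed" occur only in the contract examples, so the unseen sentence is assigned to that category.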
Using artificial intelligence to support content migration requires us to solve two major challenges. First, a pre-classified training data set is often not available. It must be created by a domain expert, which can take days. This conflicts with one of the key goals of content migration: reducing expenses. Therefore, only a small data set will be available in most cases. Second, documents often span multiple pages of text. The more text there is, the more difficult it is to find the right features.
Together, the two challenges leave us with a non-trivial problem: small training data sets with a large vocabulary. Unfortunately, very good success rates are typically achieved on large data sets with a small vocabulary. Based on our internal research, we found a way to overcome this problem and correctly classify over 80 % of the documents.
We validated our classifier with several data sets. One of them contained 1806 documents in six categories. Three of those categories are offer, concept and contract. We found that these categories often have very similar content, and even a human was not able to classify their documents precisely. Nevertheless, our solution is so confident in its predictions that three out of four correctly classified documents were predicted with a confidence value of over 90 %. This value represents the confidence of the algorithm in its own prediction. The higher the confidence value, the higher the probability that the prediction is correct.
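The article does not state how the confidence value is calculated. A minimal sketch, assuming the classifier outputs one probability per category and the confidence is the probability of the top category, could look as follows. The helper names and all numbers are hypothetical and are not results from our validation.

```python
def confidence(probabilities):
    """Return (predicted category, confidence) for one document.

    `probabilities` maps each category to the classifier's estimated
    probability; the confidence is the probability of the top category.
    """
    category = max(probabilities, key=probabilities.get)
    return category, probabilities[category]


def share_of_confident_hits(predictions, true_labels, threshold=0.9):
    """Among correctly classified documents, return the fraction whose
    confidence value exceeds the threshold."""
    correct = [
        conf
        for (category, conf), truth in zip(predictions, true_labels)
        if category == truth
    ]
    if not correct:
        return 0.0
    return sum(1 for conf in correct if conf > threshold) / len(correct)


# Made-up probability outputs for five documents (illustrative only).
outputs = [
    {"offer": 0.95, "concept": 0.03, "contract": 0.02},
    {"offer": 0.10, "concept": 0.85, "contract": 0.05},
    {"offer": 0.02, "concept": 0.02, "contract": 0.96},
    {"offer": 0.93, "concept": 0.04, "contract": 0.03},
    {"offer": 0.55, "concept": 0.40, "contract": 0.05},
]
true_labels = ["offer", "concept", "contract", "offer", "concept"]

predictions = [confidence(p) for p in outputs]
share = share_of_confident_hits(predictions, true_labels)
print(share)
```

In this toy data, four of the five documents are classified correctly, and three of those four have a confidence value above 0.9, which mirrors the "three out of four" statement above.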