Author
Johannes Giere
AI Expert @ fme AG
October 10, 2018
Artificial intelligence (AI) has been a rising technology for several years. For decades, computer experts had to create rules and conditions to make software work according to a specification. With artificial intelligence, we have the opportunity to automate this step and let computers identify the rules. Experts no longer need full knowledge of a specific domain; they simply ask computers to understand and interpret the structure of real-world data.
By training a classification algorithm on a data set of real-world data, the algorithm learns to assign unseen data to one of several categories. The data set has a simple structure: every element is represented by its attributes (so-called features) and its category. The algorithm examines the features and analyses them for specific criteria. These criteria help it to distinguish between the categories and to assign unseen data to one of them.
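To make the structure of such a training data set more concrete, here is a minimal sketch in Python using scikit-learn. It is purely illustrative and not the implementation from our research: the document texts serve as the features, the labels as the categories, and all example texts are invented.

```python
# Minimal, illustrative sketch: training a text classifier on a labelled data set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Every element of the data set: its features (here, the document text) and its category.
documents = [
    "Offer for software licences and implementation services ...",
    "Concept draft describing the planned project phases ...",
    "Maintenance contract between customer and supplier ...",
]
categories = ["offer", "concept", "contract"]

# The pipeline turns text into features and learns criteria that separate the categories.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(documents, categories)

# Unseen data is assigned to one of the known categories.
print(classifier.predict(["Framework contract for support services ..."]))
```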
Using artificial intelligence to support the process of content migration requires us to solve two major challenges. First, a pre-classified training data set is often not available. Such a data set must be created by a domain expert, which can take days. This contradicts one of the key goals of content migration: reducing expenses. Therefore, only a small data set will be available in most cases. Second, documents often span multiple pages of text, and the more text is available, the more difficult it is to find the right features.
These two challenges leave us with a non-trivial issue to overcome: small training data sets combined with a large vocabulary. Unfortunately, very good success rates are typically achieved on large data sets with a small vocabulary. Based on our internal research, we found a way to overcome this issue and achieve correct classification rates of over 80 %.
We validated our classifier with several data sets. One of the data sets contained 1806 documents in six categories, three of which were offer, concept and contract. We found that these categories often have very similar content, and even a human was not able to classify the documents of these categories precisely. Nevertheless, our solution is so confident in its predictions that three out of four correctly classified documents were predicted with a confidence value of over 90 %. This confidence value represents the confidence of the algorithm in its own predictions: the higher the confidence value, the higher the probability that the prediction is correct.
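To give a rough idea of where such a confidence value can come from: many classifiers report a probability per category, and the highest of these probabilities can be used as the confidence of a prediction. The snippet below reuses the illustrative classifier from the sketch above; it is an assumption about a typical setup, not the API of our module.

```python
import numpy as np

# Probability per category for an unseen document (invented example text).
text = ["Offer for consulting services in the cloud migration project ..."]
probabilities = classifier.predict_proba(text)

# The highest class probability serves as the confidence value, e.g. 0.93 -> 93 %.
confidence = np.max(probabilities, axis=1)
prediction = classifier.predict(text)
print(prediction[0], f"{confidence[0]:.0%}")
```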
We plan to build on this algorithm and publish a module for migration-center next year. The module can support numerous use cases.
For example, a target system may require a value for a specific attribute that the source system does not provide. Using the auto-classification module, these values can be generated automatically. The documents already stored in the target system can be used to train the classifier. In this way, users do not face the time-consuming task of creating a training data set, although they still have the opportunity to create one of their own.
Furthermore, the confidence values can be used to build more accurate transformation rules.
Let us look at another example: an algorithm has been trained, and after testing it we know that 89 % of the test data is predicted correctly. Of that correctly classified data, 90 % is predicted with a confidence value of over 75 %. In addition, the test shows that wrongly classified data always has a confidence value of under 75 %. This value of 75 % gives us the opportunity to define a transformation rule: all documents with a confidence value of over 75 % are assigned the predicted category, while documents with a lower confidence value are held back. A user can then investigate and validate the predictions for these low-confidence documents.
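The sketch below shows how such a transformation rule could look in code. It again reuses the illustrative classifier from the first sketch; the threshold of 0.75, the function name and the return structure are assumptions for demonstration and do not reflect the actual rule syntax of migration-center.

```python
# Illustrative transformation rule: assign the predicted category only when the
# confidence value reaches 75 %; hold everything else back for manual review.
THRESHOLD = 0.75

def apply_rule(texts):
    probabilities = classifier.predict_proba(texts)
    predictions = classifier.predict(texts)
    auto_assigned, held_back = [], []
    for text, label, probs in zip(texts, predictions, probabilities):
        if probs.max() >= THRESHOLD:
            auto_assigned.append((text, label))   # confident enough: assign the category
        else:
            held_back.append(text)                # a user validates these predictions
    return auto_assigned, held_back

assigned, needs_review = apply_rule([
    "Offer for the migration of the document archive ...",
    "Internal meeting notes ...",
])
```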