
DCL Automates the Process of Getting Structured Data from Complex Docs

Jennifer Zaino, Semantic Web

Introducing DCL's Automated Conversion System

Documents, documents everywhere, and not a [good] way to search them. With apologies to Samuel Taylor Coleridge, that’s pretty much the situation many enterprises find themselves in. And it gets harder as more and more documents are stored with, or as, hard-to-index and hard-to-reuse images. How to address the problem? Data Conversion Laboratory (DCL) is trying to make the job easier with the recent introduction of its Automated Conversion System, which takes documents of varying visual quality, including those containing imagery, and converts them into structured data.

Its technology transforms these documents into searchable XML, with extracted metadata, for storage in, and access by, content-management and other end-user systems.

The non-textual content of complex documents tends to confuse OCR (optical character recognition) technologies, and that results in degraded accuracy, DCL says. Its fully automated solution for digitizing and converting documents into structured data aims to overcome that issue with methods that automatically extract the extraneous content, reinserting it later into the converted document.

The system includes an integrated communication layer, a workflow engine, and a multi-step processing approach that extracts each artifact, such as a diagram, manages it individually, and then reassembles the document. The end result is to render previously inaccessible data usable at large scale. “It really gets documents at a very high level of accuracy and we are able to produce XML on the fly without human intervention at very high volume,” says Mark Gross, president of DCL.
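To make the extract-then-reassemble idea concrete, here is a minimal sketch in Python. It is not DCL's implementation, which the article does not describe in detail. The page model (a list of text and figure blocks), the `fake_ocr` stand-in, and the output element names are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def extract_artifacts(blocks):
    """Swap each non-text block for a placeholder and set the original aside."""
    text_only, artifacts = [], {}
    for kind, payload in blocks:
        if kind == "text":
            text_only.append(("text", payload))
        else:
            key = f"artifact-{len(artifacts) + 1}"
            artifacts[key] = (kind, payload)
            text_only.append(("placeholder", key))
    return text_only, artifacts

def fake_ocr(payload):
    """Stand-in for a real OCR step (assumption): only normalizes whitespace."""
    return " ".join(payload.split())

def reassemble(text_only, artifacts):
    """Build XML from the recognized text, reinserting artifacts in place."""
    root = ET.Element("document")
    for kind, payload in text_only:
        if kind == "text":
            ET.SubElement(root, "para").text = fake_ocr(payload)
        else:
            art_kind, ref = artifacts[payload]
            ET.SubElement(root, art_kind, {"id": payload, "href": ref})
    return root

# Hypothetical page: two text regions around one diagram.
blocks = [
    ("text", "Loan  terms\napply."),
    ("figure", "diagram1.png"),
    ("text", "See above."),
]
text_only, artifacts = extract_artifacts(blocks)
xml_out = ET.tostring(reassemble(text_only, artifacts), encoding="unicode")
print(xml_out)
```

The point of the placeholder step is that the text-recognition stage never sees the diagram, so the diagram cannot degrade its accuracy, yet the diagram still ends up in its original position in the output.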

This large-scale automated conversion includes some level of semantic tagging, he notes. “Frankly, when you are dealing with very large volumes of content in this big data world, you cannot do it without some level of semantic tagging,” he says.

The automated process generally delivers about 99 percent accuracy, at about a tenth of the cost of reaching the 100 percent accuracy that is possible when a human is added to oversee the process, he says. In work with clients before the announcement, “we see upwards of 99 or 99.5 percent accuracy automatically, and then your search works very well,” he says. “You can find almost anything you need.” The cloud service also supports fuzzy matching to further improve the accuracy of results, he says.
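Fuzzy matching lets a search succeed even when the query, or the converted text, is off by a character or two. The article does not say how DCL's service implements it. The sketch below uses Python's standard `difflib` module, and the term list, function name, and similarity cutoff are hypothetical.

```python
import difflib

def fuzzy_find(query, terms, cutoff=0.8):
    """Return indexed terms whose similarity to the query is at least `cutoff`."""
    lowered = {term.lower(): term for term in terms}
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=3, cutoff=cutoff)
    return [lowered[hit] for hit in hits]

# Hypothetical index terms taken from converted documents.
index_terms = ["collateral", "amortization", "interest rate", "borrower"]

# A query with a dropped letter still finds the intended term.
matches = fuzzy_find("amortizaton", index_terms)
print(matches)
```

A higher cutoff returns fewer, safer matches. A lower one tolerates more errors at the risk of unrelated hits.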

The service is also customizable to specific needs. “The more we know about what document types the clients are using, the better we can automate the process for that client,” he says. For example, if the customer is a bank dealing with loan agreements, DCL can modify its software to track standardized terms used in such agreements. “So we could semantically tag different parts of the loan agreement so that when it goes up online and becomes searchable, users can easily find the sections they need,” he says.
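As a rough picture of what semantically tagging a loan agreement could look like, the sketch below wraps each section in an XML element chosen from its heading. The keyword-to-tag mapping, the element names, and the sample text are invented for illustration and do not come from DCL.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical mapping from heading keywords to semantic element names.
SECTION_TAGS = {
    "interest": "interest-terms",
    "collateral": "collateral",
    "default": "default-events",
}

def tag_sections(text):
    """Split text on blank lines and wrap each section in a semantic element."""
    root = ET.Element("loan-agreement")
    for block in re.split(r"\n\s*\n", text.strip()):
        heading, _, body = block.partition("\n")
        # Fall back to a generic element when no keyword matches the heading.
        tag = next(
            (name for key, name in SECTION_TAGS.items() if key in heading.lower()),
            "section",
        )
        section = ET.SubElement(root, tag, {"title": heading.strip()})
        section.text = body.strip()
    return root

sample = (
    "Interest Rate\nThe rate is fixed.\n\n"
    "Collateral\nThe property secures the loan.\n\n"
    "Notices\nSend by mail."
)
tagged = tag_sections(sample)

# A search system could now jump straight to the section a user asks for.
print(tagged.find("collateral").text)
```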


Read the entire article at Semantic Web.