Magazine Article | July 17, 2006

XML: The Future Of Content Management?

An outsourcing trend prompted this integrator to offer XML (extensible markup language) conversion services to gain more business within its client base.

Business Solutions, August 2006

Quality Associates, Inc. (QAI) is celebrating its 20th year as a document management integrator in 2006. The company has a history of steady growth over those 20 years, highlighted by an 81% spike in revenue in 2005 and an additional 25% revenue gain projected for this year. QAI’s longevity and success are largely due to its knack for identifying emerging trends in the industry and its ability to quickly ramp up new technologies and services to capitalize on those trends. The most recent example is the integrator’s investment in XML (extensible markup language) conversion capabilities.

Inshore XML Conversion Has Its Advantages

QAI’s core competency is developing in-house document management and imaging systems for a client base that consists primarily of federal, state, and local government agencies. Approximately two years ago, QAI realized that many of its government clients not only had a need to image documents, but also desired to convert their paper records, images, and electronic files into XML format. XML is a markup language that allows richly structured documents to be shared over the Web and easily repurposed. XML allows designers to create their own customized tags to indicate what role specific content plays in a document. For example:

<Client>
  <name>Quality Associates Inc</name>
  <street>9017 Red Branch Road</street>
  <city>Columbia</city>
  <state>MD</state>
  <zip>21045</zip>
  <phone>410-884-9100</phone>
</Client>

Because tags like these conform to agreed standards, applications and organizations can define, transmit, validate, and interpret the data consistently when they exchange it.
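To illustrate how a receiving application might interpret the client record shown above, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. The function name client_to_dict is hypothetical and chosen for illustration only; it is not part of QAI's system.

```python
import xml.etree.ElementTree as ET

# The client record from the article's example, as a single XML string.
CLIENT_XML = """
<Client>
  <name>Quality Associates Inc</name>
  <street>9017 Red Branch Road</street>
  <city>Columbia</city>
  <state>MD</state>
  <zip>21045</zip>
  <phone>410-884-9100</phone>
</Client>
"""


def client_to_dict(xml_text):
    """Parse a <Client> record and return its child elements as a dict.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text.strip())
    if root.tag != "Client":
        raise ValueError("expected a <Client> element, got <%s>" % root.tag)
    # Each child tag names the role of its text, so the reader does not
    # have to guess meaning from position or layout.
    return {child.tag: (child.text or "").strip() for child in root}


record = client_to_dict(CLIENT_XML)
print(record["city"], record["state"], record["zip"])
```

Because each value is addressed by its tag name rather than its position, the same code would keep working if the sender reordered the fields.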

There are several benefits to using XML as a document representation format. Perhaps the most important is that text elements are identified not by what they look like but by their significance in the context of a document. This opens up new possibilities for highly efficient information search and retrieval (e.g., intelligent data mining). Furthermore, because XML is plain text made up of ASCII (American Standard Code for Information Interchange) and Unicode characters, XML data can be moved freely among hardware, software, and operating system platforms. For example, this allows organizations to easily exchange information between an ECM (enterprise content management) system and a CRM (customer relationship management) system.
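As a rough illustration of searching by what an element means rather than how it looks, the sketch below queries a small collection of client records for everyone in a given state. It assumes Python's standard xml.etree.ElementTree module. The <Clients> wrapper, the second record ("Example Co"), and the function name names_in_state are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical collection: the first record abbreviates the article's
# example; the second is made-up placeholder data.
CLIENTS_XML = """
<Clients>
  <Client>
    <name>Quality Associates Inc</name>
    <city>Columbia</city>
    <state>MD</state>
  </Client>
  <Client>
    <name>Example Co</name>
    <city>Springfield</city>
    <state>VA</state>
  </Client>
</Clients>
"""


def names_in_state(xml_text, state):
    """Return the <name> of every <Client> whose <state> equals `state`.

    Hypothetical helper for illustration only. The state value is
    interpolated directly into the path, so it must not contain quotes.
    """
    root = ET.fromstring(xml_text.strip())
    # The predicate matches on the element's role (<state>), not on
    # where the text happens to appear on a page.
    matches = root.findall("./Client[state='%s']" % state)
    return [client.findtext("name") for client in matches]


print(names_in_state(CLIENTS_XML, "MD"))
```

A plain-text search for "MD" could just as easily hit a street name or a person's title; the tagged query only matches values that were marked up as a state.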

Many of QAI’s clients turned to offshore service providers to fulfill their XML conversion needs because they wanted to avoid the slow, labor-intensive process of manually converting documents into XML in-house. Rekeying and hand-tagging content can also introduce both typographic and syntactic errors during conversion. The labor cost of quality control also made it difficult for U.S.-based companies to earn significant margins from XML conversion, so few service providers offer the service domestically. QAI realized it had an opportunity to earn significant revenue from existing and new customers if it could find a way to provide XML conversion services cost-effectively.

“Many businesses and government agencies are uncomfortable using offshore resources for data conversion because they are concerned about the security of the information,” says Scott Swidersky, director of information systems for QAI. “In fact, agencies such as the DoD won’t even consider offshore XML conversion because of the sensitivity of the data. Several logistics and quality control issues also come into play with overseas service providers. Many of these concerns would be alleviated by using an inshore source for XML conversion.”

Four Steps To XML Conversion

QAI spent over a year and several hundred thousand dollars developing full-service XML conversion capabilities in-house. Capital expenses included numerous servers, a variety of XML authoring tools, a cleanup component, a quality control module, custom software development and integration costs, and additional physical square footage for workflow processes. QAI also needed to invest in a few key staff members with XML, SGML (standard generalized markup language), and quality control backgrounds.

“Our goal from the outset was to develop an XML infrastructure that would be able to support the conversion of tens of millions of records,” says Swidersky. “To accomplish this, while still making the service affordable to clients and profitable for our business, required us to automate several steps of the process and drastically reduce the amount of manual labor typically involved in XML conversion. We feel like we’ve developed a solution that has been able to accomplish that.”

QAI built an infrastructure that can convert any file that can be printed to PostScript or PDF into XML. The solution uses visual cues to uncover a document’s structure, much the same way humans do. Documents can be submitted to QAI for XML conversion in hard copy, PDF, or a variety of other formats. Paper documents are scanned into electronic files by QAI prior to XML conversion. The initial step in converting these files to XML is to “block” the document. This step involves drawing color-coded boxes around sections of each page to define text, tables, and images. Once blocking has been validated, OCR (optical character recognition) technologies are used to automatically capture content from the designated text areas on each page. This text is then verified and edited, and the content is saved in a PDF Normal format before being sent through an XML processing engine.

Content must then go through four processes within the XML processing engine to be converted into valid XML output:

1. A document’s PostScript or PDF representation must be analyzed to extract all information about the appearance of a document. This includes the characters in the document and their typography, and any other visual objects. This process extracts text directly from the input data stream, so all content is accurately retained during conversion.

2. Basic building blocks of document structure, including important visual cues and large-scale layout areas of each page, are identified.

3. Identified document building blocks are placed into a tree structure. This phase identifies sections, paragraphs, quotes, lists, tables, footnotes, and other graphical objects, forming a complete internal representation of the structured document.

4. The internal representation of the document is used to export an XML file that presents the document’s content in a logical structure and retains all relevant formatting information.
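Steps 3 and 4 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not QAI's processing engine: it assumes steps 1 and 2 have already produced a flat list of (kind, text) blocks, and the block kinds, element names, and the function blocks_to_xml are all hypothetical.

```python
import xml.etree.ElementTree as ET


def blocks_to_xml(blocks):
    """Arrange flat (kind, text) blocks into a tree and serialize it as XML.

    Hypothetical sketch: `kind` is one of "heading", "paragraph",
    or "list_item".
    """
    doc = ET.Element("document")
    section = None       # current <section>; created on demand
    current_list = None  # open <list> while consecutive list items arrive

    for kind, text in blocks:
        if kind == "heading":
            # A heading starts a new section of the tree.
            section = ET.SubElement(doc, "section")
            ET.SubElement(section, "title").text = text
            current_list = None
            continue
        if section is None:
            # Content before any heading still needs a parent section.
            section = ET.SubElement(doc, "section")
        if kind == "list_item":
            # Consecutive list items are grouped under one <list>.
            if current_list is None:
                current_list = ET.SubElement(section, "list")
            ET.SubElement(current_list, "item").text = text
        elif kind == "paragraph":
            current_list = None
            ET.SubElement(section, "para").text = text
        else:
            raise ValueError("unknown block kind: %r" % (kind,))

    # Step 4: export the internal tree as an XML string.
    return ET.tostring(doc, encoding="unicode")


sample_blocks = [
    ("heading", "Overview"),
    ("paragraph", "Intro text."),
    ("list_item", "First"),
    ("list_item", "Second"),
    ("paragraph", "Closing text."),
]
print(blocks_to_xml(sample_blocks))
```

Keeping the tree-building separate from the export means the same internal representation could, in principle, be written out against different client-specified element names.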

QAI charges clients on a per-page or per-kilocharacter basis for its XML conversion services. A kilocharacter is a unit of file size equal to 1,024 characters. The XML files QAI generates are output in a client-specified schema or DTD (document type definition) and are either placed on optical media (e.g., DVD) or delivered to the customer over a secure VPN (virtual private network) connection.
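The per-kilocharacter arithmetic can be sketched as follows. The 1,024 figure comes from the article; everything else is an assumption made for illustration, including the rate, the choice to bill fractional kilocharacters, and the function names. The article does not state QAI's actual prices or rounding rules.

```python
CHARS_PER_KILOCHARACTER = 1024  # figure given in the article


def kilocharacters(text):
    """Return the size of `text` in kilocharacters (may be fractional)."""
    return len(text) / CHARS_PER_KILOCHARACTER


def conversion_charge(text, rate_per_kilocharacter):
    """Return a charge for converting `text` at a hypothetical rate.

    Assumes fractional kilocharacters are billed proportionally; a real
    contract might round up instead.
    """
    return kilocharacters(text) * rate_per_kilocharacter


sample = "x" * 2048  # stand-in for a document's extracted text
print(kilocharacters(sample))           # 2.0
print(conversion_charge(sample, 0.50))  # 1.0 at an invented rate of 0.50
```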

Because QAI’s XML conversion service offering required a significant upfront investment and is still in its early stages of availability, it is unclear how successful the initiative will be for QAI. However, the integrator is extremely optimistic.

“Our XML conversion services have generated a lot of interest within our existing client base,” says Swidersky. “We currently have six clients using the service, each of which has entrusted us to convert anywhere from 500,000 to 1 million records to XML. We feel this is going to be a very profitable stream of business for us.”