Amazon Textract Saves Hours by Automatically Extracting Key Data from Documents


Textract, Amazon’s cloud-based managed service that uses machine learning and character recognition to extract data from documents, has been launched for all corporate customers of Amazon Web Services.

The automated service saves time for people who spend long hours processing data from tables, PDFs, photos and other documents to build spreadsheets, searchable lists and databases.

Textract’s benefits fall into sharp relief considering the time spent on tedious chores performed daily by millions of office workers. They must trawl through tax forms, expense reports, medical files and other documents to cull data, or use optical character recognition (OCR) software, which recognizes text but not formats and typically produces jumbled streams of text.

The data service seamlessly merges the extracted information into Amazon’s database and analytics services such as Elasticsearch, DynamoDB and Athena as well as machine-learning services like Comprehend, Comprehend Medical and SageMaker. Pages of formatted information are emitted in just minutes.

No need to write code

The service’s pre-trained machine learning models mean there’s no need to write code for data extraction, Amazon says, because the models have already been trained to recognize what the company says are tens of millions of documents from a broad range of industries.

It reads files already stored in the Amazon cloud, identifies data such as line items and totals, and delivers the results through an API (application programming interface) that can handle them along with thousands of other sets of data. The data is returned as JSON, a text-based data interchange format, together with identifying information like page numbers, sections and data types.
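As a rough illustration of what working with that JSON might look like, the sketch below walks a small, hand-written sample and pulls out each line of text with its page number and confidence score. The sample is hypothetical, not real service output; the field names (`Blocks`, `BlockType`, `Text`, `Confidence`, `Page`) follow the general shape of a Textract response, and the helper `extract_lines` is an illustrative name that does not come from Amazon.

```python
import json

# Hypothetical, hand-written sample in the general shape of a Textract
# response: a top-level "Blocks" list whose items carry a type, text,
# a confidence score and a page number. This is not real service output.
SAMPLE_RESPONSE = json.loads("""
{
  "Blocks": [
    {"BlockType": "PAGE", "Id": "p1", "Page": 1},
    {"BlockType": "LINE", "Id": "l1", "Page": 1,
     "Text": "Invoice 1042", "Confidence": 99.2},
    {"BlockType": "LINE", "Id": "l2", "Page": 1,
     "Text": "Total: $318.00", "Confidence": 91.4},
    {"BlockType": "WORD", "Id": "w1", "Page": 1,
     "Text": "Invoice", "Confidence": 99.5}
  ]
}
""")


def extract_lines(response):
    """Return (page, text, confidence) tuples for every LINE block."""
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE, WORD and any other block types
        # Assumption: treat a missing page number as page 1.
        lines.append((block.get("Page", 1), block["Text"], block["Confidence"]))
    return lines


if __name__ == "__main__":
    for page, text, confidence in extract_lines(SAMPLE_RESPONSE):
        print(f"page {page}: {text} ({confidence:.1f}%)")
```

Rows in this form could then be loaded into a database or search index, as the article goes on to describe.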

The extracted data can then be used to build smart searches on large archives of documents or loaded into a database for applications running accounting, auditing or compliance software.

“The power of Amazon Textract is that it accurately extracts text and structured data from virtually any document with no machine learning experience required,” says Swami Sivasubramanian, an Amazon vice president who oversees the company’s machine learning department.

“Subsequently, developers can analyze and query the extracted text and data using our database and analytics services,” he says, “and integrate with other machine learning services to help customers derive deeper meaning from the extracted text and data.”

He adds that on top of integration with AWS services, a developing community of partners is helping customers in areas such as extracting meaning from files, operating more efficiently, improving their security procedures, automating data entry and making faster business decisions.

OCR is a mature technology that is built into many applications. Other offerings in the sector include Microsoft’s OneNote, but it can handle only simpler documents than Textract can.

A preference for ‘pre-trained’ docs

Experiments for accuracy showed Textract has a clear preference for documents “pre-trained” with formats familiar to its databases.

Amazon admits the results can vary but says you can rely on the confidence score attached to each result to ensure accuracy. The system flags a document for worker review if its confidence score falls under a pre-set percentage, for example 95%.
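The review rule described above can be sketched in a few lines. This is a minimal illustration, not Amazon’s implementation: it assumes each extracted item arrives as a (text, confidence) pair with confidence expressed as a percentage, uses the article’s example threshold of 95, and flags a document when any item falls below it. The function names are hypothetical.

```python
REVIEW_THRESHOLD = 95.0  # percent; the example value quoted in the article


def low_confidence_items(items, threshold=REVIEW_THRESHOLD):
    """Return the texts whose confidence falls under the threshold."""
    return [text for text, confidence in items if confidence < threshold]


def needs_review(items, threshold=REVIEW_THRESHOLD):
    """Assumption: a document is flagged if any item is under the threshold."""
    return len(low_confidence_items(items, threshold)) > 0


# Hypothetical extracted items: (text, confidence in percent).
sample_items = [
    ("Invoice 1042", 99.2),
    ("Total: $318.00", 91.4),
    ("Due in 30 days", 95.0),
]

if __name__ == "__main__":
    if needs_review(sample_items):
        print("Send to a human reviewer:", low_confidence_items(sample_items))
    else:
        print("Accepted automatically")
```

An item scoring exactly 95 passes in this sketch, since the rule flags only scores that fall under the threshold.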

Several Textract customers in its pre-public launch period praised the service. Users include Canada’s Globe and Mail national newspaper, Britain’s national weather service and the London-based auditor PricewaterhouseCoopers, or PwC.

Michael O’Neill, a digital editor at the Globe and Mail, describes how his newsroom uses Textract to extract more information from government agencies that want to keep secrets. “As a news media company,” he says, “we rely on many PDFs or scanned-source documents such as FOIs (freedom of information requests) that have important information contained in tables.

“These documents have been under-utilized because journalists were not able to access them easily or didn’t know they existed,” he says. “Using Amazon Textract, we are able to extract information from tables in PDFs and easily output” the data to a searchable file format for journalists. “This increases efficient access to information for our journalists by tenfold.”

Siddhartha Bhattacharya of PwC says, “We’ve integrated Amazon Textract into our solution for the pharmaceutical industry, to automate document processing for various (Food and Drug Administration) forms like MedWatch and CIOMS.”

In the past, the forms were processed manually, he says, with each form taking hours to complete. Now the processing time for each is reduced to minutes.

Robotics vendor UiPath of New York, which sells software to automate business processes, says Textract will differentiate its platform. A spokesman said it will help “our customers to unlock critical business data from documents (and) transform that data into actionable business insights.”

Can’t do it all

Textract has its limits, however. The service accepts bitmap images of no more than 5MB, and PDFs can’t exceed 500MB in size. In addition, the system can’t detect handwriting.
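A team preparing files for upload might check them against those size limits first. The sketch below is illustrative only: the 5MB and 500MB figures are the ones quoted in this article, while the list of file extensions, the use of binary megabytes and the function name `within_size_limit` are assumptions made for the example.

```python
import os

MB = 1024 * 1024  # assumption: binary megabytes

# Limits as quoted in the article; the extension list is an assumption.
SIZE_LIMITS = {
    ".png": 5 * MB,
    ".jpg": 5 * MB,
    ".jpeg": 5 * MB,
    ".pdf": 500 * MB,
}


def within_size_limit(filename, size_bytes):
    """Return True if the file type is known and the size is within its limit."""
    extension = os.path.splitext(filename)[1].lower()
    limit = SIZE_LIMITS.get(extension)
    if limit is None:
        return False  # file type not covered by this sketch
    return size_bytes <= limit


if __name__ == "__main__":
    print(within_size_limit("receipt.png", 3 * MB))
    print(within_size_limit("scan.JPG", 6 * MB))
    print(within_size_limit("report.pdf", 120 * MB))
```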

Textract’s options include a free tier of up to 1,000 pages per month using the Detect Document Text API and up to 100 pages per month using the Analyze Document API.

So far, the service has been made available only in certain regions – Ohio, northern Virginia and Oregon in the United States as well as Ireland in the European Union. Amazon plans to expand the service to the company’s other regions over the next 12 months.

The news follows other product launches from the tech giant as it pushes its cloud presence.

In the past week, it has made its antenna-for-hire satellite and storage service, Amazon Ground Station, widely available and has announced Amazon Managed Streaming for Apache Kafka (Amazon MSK), a managed service for the open-source Apache Kafka platform. That service allows developers to build and run applications based on Apache Kafka without the need to manage underlying infrastructure.