White paper from Krohne and Inovex on PDF documents: Salvaging data treasures and making them available in a machine-readable form

Christine Neubauer, 15 June 2022

By Nils Klute, IT Editor and Project Manager Communication Cloud Services at EuroCloud Germany

Integrating data into business models and tapping into new sources of revenue: many companies are unsure how to make this work. Analogue formats, heterogeneous systems and closed silo structures further cloud the view. This article looks at why it pays to salvage data treasures and how, for example, service knowledge can be extracted from PDF files.

Sharing data to realise new digital products and services – according to a study by the German Economic Institute (IW), this remains the exception rather than the rule: Of the companies surveyed, 73 per cent do not manage their data jointly with others. And 71 per cent of the companies do not even meet the requirements that are necessary to be able to manage data efficiently. Take, for example, the cloud: “The cloud is the central instrument for storing, exchanging and jointly exploiting data economically,” says Barbara Engels, Senior Economist at IW, in an interview on eurocloud.de. However, only 40 per cent are using the technology, as the autumn 2021 survey among German companies from industry and industry-related service providers shows.

Companies do not integrate data into business models

Whether sensor values, machine parameters or GPS locations – according to the IW, it is not clear to many companies how data can be integrated into business models. The result: For the majority, information remains a means to optimise internal processes and not raw material to profit from economically. The barriers for companies: “Legal issues are the biggest obstacle for 68 per cent of respondents,” says Jan Büchel, Economist at IW on eurocloud.de. But only ostensibly. Büchel: “The legal framework within which data can be shared and managed is mostly unknown.”

Heterogeneous and analogue: No efficient evaluation possible

Even where legal concerns can be resolved, problems remain elsewhere: “Many continue to maintain analogue filing systems or do not record their data at all,” says Büchel. “Only half of the companies store production and process data digitally, for example. Financial, product and customer master data are the most likely to be available in this way. Heterogeneous silos and analogue systems – those who rely on them can be sure that data cannot be processed efficiently in the first place.” The result: “Potential data treasures remain unused and undiscovered,” says Büchel.

Example PDF: Evaluating documents with Python libraries

The AI project Service-Meister is no different: In industrial services, too, data is too often not available in the form it needs to be in to be shared or evaluated via smart algorithms. Take PDF documents: If manuals, operating and repair instructions or maintenance reports are stored in this format, then texts, tables and images can only be extracted from them in a roundabout way. The problem is that PDFs are, in a sense, only intended for the human eye and thus for presentation on displays. The solution: Python libraries are able to evaluate and process PDFs automatically. In a German-language white paper, Krohne and Inovex summarise the results from their Service-Meister speedboat project.

Recognise logical structure and store information in a machine-readable way

The libraries provide typical analysis functionality in the Python programming language. Krohne and Inovex compared the different libraries and determined how suitable their algorithms are for analysing PDF files. The comparison is worthwhile because extraction is error-prone: “Therefore, problems often arise when extracting content from documents. The recognition of words and paragraphs has to be done by heuristics and is mostly done based on distances between characters, but this introduces errors,” the authors note. “Often, there are characters that are not visible but are extracted, e.g. because the text was adjusted at the last minute. Furthermore, additional spaces appear between characters, for example in headings.” The white paper shows which approaches are available for extracting content from PDF documents, recognising their logical structure and storing the information in machine-readable formats such as XML and JSON.
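To make the quoted problem more tangible, here is a minimal sketch of a distance-based word heuristic of the kind the authors describe. It is not the method from the white paper and does not use any of the libraries compared there: the names Char, group_into_words, make_line and the gap_threshold value are hypothetical, and the character boxes are synthetic test data rather than real PDF output. The sketch only illustrates why widely spaced heading letters can end up split into single characters, and how a result could be stored as JSON.

```python
import json
from dataclasses import dataclass


@dataclass
class Char:
    """One extracted glyph with its horizontal extent (hypothetical structure)."""
    text: str
    x0: float  # left edge of the glyph
    x1: float  # right edge of the glyph


def group_into_words(chars, gap_threshold=1.5):
    """Join glyphs into words; a gap wider than gap_threshold starts a new word."""
    words = []
    current = ""
    prev_x1 = None
    for ch in sorted(chars, key=lambda c: c.x0):
        if not ch.text.strip():
            # Ignore whitespace glyphs: word breaks are inferred from geometry only.
            continue
        if prev_x1 is not None and ch.x0 - prev_x1 > gap_threshold:
            words.append(current)
            current = ""
        current += ch.text
        prev_x1 = ch.x1
    if current:
        words.append(current)
    return words


def make_line(text, char_width=5.0, letter_spacing=0.5, space_width=3.0):
    """Build synthetic glyph boxes for one line of text (assumed test data)."""
    chars = []
    x = 0.0
    for t in text:
        if t == " ":
            x += space_width  # a space only moves the position, it emits no glyph
            continue
        chars.append(Char(t, x, x + char_width))
        x += char_width + letter_spacing
    return chars


# Normal body text: letter gaps (0.5) stay below the threshold, word gaps (3.5) exceed it.
body = group_into_words(make_line("Flow meter manual"))

# A heading set with wide letter spacing (2.0): every letter gap exceeds the threshold.
heading = group_into_words(make_line("Manual", letter_spacing=2.0))

# Store the result in a machine-readable form.
record = {"heading": heading, "body": body}
print(json.dumps(record))
```

With these assumed values, the body line comes out as three words, while the heading is split into six single letters. A fixed threshold is the simplest possible choice; scaling it with the font size would be one way to reduce such errors, at the cost of another parameter to tune.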

Experimenting and testing: Businesses need room to manoeuvre

Extracting and homogenising information to use it for AI analyses: what helps the speedboat projects at Service-Meister should also support companies elsewhere. For example, the AI project has recently made blueprints for data-based and smart solutions in industrial service available online for third parties. And whether a company works with Excel lists, notepads or paper files, Engels from the IW is certain: “It is neither sensible nor possible for all companies to be exclusively data-based,” the economist writes in her commentary on the specialist portal Bigdata-Insider. It is crucial for the future viability of companies, however, to know what data they collect and how they store and process it digitally, securely and in verified quality. After all: “Companies need to have the scope to experiment and look at where they can go data-driven and digital and where it makes sense to stay analogue and where the analogue can be combined with the digital.”

