Ingesting data from various sources
Hello
Our customer has dozens of Excel and PDF files. These files come in various formats, and the formats may change over time. For example, some files provide data in a standard tabular structure, some use pivot-style Excel layouts, and others follow more complex or semi-structured formats.
We need to extract information from the files and ingest them into normalized tables.
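To illustrate what we mean by normalization, here is a minimal sketch of one step we have in mind, using pandas to unpivot a pivot-style Excel sheet into a long, normalized table (the column names `product`, `period`, and `value` are just hypothetical examples, not our actual schema):

```python
import pandas as pd

def normalize_pivot(df: pd.DataFrame, id_col: str) -> pd.DataFrame:
    """Unpivot a wide, pivot-style sheet into a normalized long table.

    id_col is the column identifying each row (hypothetical name).
    """
    long_df = df.melt(id_vars=[id_col], var_name="period", value_name="value")
    # Drop cells that were empty in the original sheet.
    return long_df.dropna(subset=["value"]).reset_index(drop=True)

# Example: a pivot-style layout with months spread across columns,
# as it might look after pd.read_excel on one of our files.
wide = pd.DataFrame({
    "product": ["A", "B"],
    "2024-01": [10, 20],
    "2024-02": [30, None],
})
normalized = normalize_pivot(wide, "product")
# normalized now has one row per (product, period) pair with a value.
```

Handling a single known layout like this is easy; our problem is that the layout itself varies between templates and has to be inferred first.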
So our need is to automatically infer the structure of these files, extract the required values, and ingest them into Databricks tables. There are dozens of different templates, and new templates may appear over time. What should our pipeline and architecture look like?