D2RML (Data to RDF Mapping Language) is a language for defining complex data transformation workflows that convert data obtained from diverse sources, and in diverse formats, into RDF datasets to be published as linked data.
A D2RML transformation workflow is specified in a D2RML document, which is itself an RDF document. The document prescribes the entire data processing workflow: the description of the data sources and the interactions needed to obtain the desired data, the interpretation of the obtained data, the ways in which data from different sources should be combined, and the transformation rules that convert data elements into RDF triples and form RDF graphs. A D2RML transformation workflow may even be dynamic, that is, determined by the data themselves.
The current version of D2RML supports several kinds of data sources, including relational databases, REST APIs, SPARQL endpoints, and local and remote files, as well as several commonly used data formats, including XML, JSON, CSV, Excel spreadsheets, plain text, and the various RDF serializations. Data archive formats and inline RDF data are also supported. Data items are extracted from the data sources and files using the relevant standard expression language (SQL, SPARQL, XPath, or JSONPath) or regular expressions.
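To make the idea of format-specific extraction concrete, the following is a minimal Python sketch, not D2RML syntax and not the D2RML processor's code. It assumes two hypothetical inputs describing the same books, one in XML and one in plain text, and extracts the same data items from each: with the limited XPath subset of Python's standard `xml.etree.ElementTree` for the XML, and with a regular expression for the text. All names (`XML_SOURCE`, `TEXT_SOURCE`, `extract_from_xml`, `extract_from_text`) are illustrative assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample inputs; they stand in for a remote file or API response.
XML_SOURCE = """<books>
  <book id="b1"><title>Dune</title><year>1965</year></book>
  <book id="b2"><title>Emma</title><year>1815</year></book>
</books>"""

TEXT_SOURCE = "b1: Dune (1965)\nb2: Emma (1815)\n"


def extract_from_xml(document: str) -> list[dict[str, str]]:
    """Select one data element per <book> using ElementTree's XPath subset."""
    root = ET.fromstring(document)
    return [
        {
            "id": book.get("id"),
            "title": book.findtext("title"),
            "year": book.findtext("year"),
        }
        for book in root.findall("./book")
    ]


# One named group per field; MULTILINE lets ^ and $ match at each line.
LINE_PATTERN = re.compile(
    r"^(?P<id>\w+): (?P<title>.+) \((?P<year>\d{4})\)$", re.MULTILINE
)


def extract_from_text(document: str) -> list[dict[str, str]]:
    """Select one data element per matching line using a regular expression."""
    return [match.groupdict() for match in LINE_PATTERN.finditer(document)]


xml_items = extract_from_xml(XML_SOURCE)
text_items = extract_from_text(TEXT_SOURCE)
```

Both functions return the same list of dictionaries, which illustrates why a uniform notion of "data element" is useful: once extracted, the items can be processed in the same way regardless of the format they came from.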
The data elements extracted from the data sources are iterated over so that transformation rules can be applied. A transformation rule, in its simplest form, is a recipe for generating the subject, predicate, object and, optionally, the named graph of an RDF triple from one or more data elements. Transformations may be complex and may involve conditional or switch statements, function invocations, or even interaction with further data sources. The available functions include standard string, date and numerical manipulation functions, including several relevant XPath functions. Multiple transformation rules for a single data element, as well as value pivoting, are supported.
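The notion of a transformation rule can be illustrated with a small Python sketch. This is not D2RML's vocabulary or the processor's implementation; it only mirrors the description above under stated assumptions: each data element is a dictionary, a rule (the hypothetical `Rule` class) builds a subject and an object from the element, carries a fixed predicate, and may have an optional condition. The `http://example.org/` namespace and all function names are made up for the example, and named graphs are left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional

Row = dict[str, str]
Triple = tuple[str, str, str]

EX = "http://example.org/"  # hypothetical namespace, for illustration only
XSD_GYEAR = "http://www.w3.org/2001/XMLSchema#gYear"


@dataclass
class Rule:
    """A recipe for one triple: subject and object are computed per element."""
    subject: Callable[[Row], str]
    predicate: str
    obj: Callable[[Row], str]
    condition: Optional[Callable[[Row], bool]] = None  # rule fires only if true


def apply_rules(rows: Iterable[Row], rules: list[Rule]) -> Iterator[Triple]:
    """Iterate over the data elements and apply every rule to each of them."""
    for row in rows:
        for rule in rules:
            if rule.condition is None or rule.condition(row):
                yield (rule.subject(row), rule.predicate, rule.obj(row))


def book_iri(row: Row) -> str:
    return f"<{EX}book/{row['id']}>"


def title_literal(row: Row) -> str:
    # A string function (strip) applied before building the literal;
    # backslashes and quotes are escaped as N-Triples requires.
    text = row["title"].strip().replace("\\", "\\\\").replace('"', '\\"')
    return f'"{text}"'


def year_literal(row: Row) -> str:
    return f'"{row["year"]}"^^<{XSD_GYEAR}>'


RULES = [
    Rule(book_iri, f"<{EX}title>", title_literal),
    # Conditional rule: emit the year only when the value is numeric.
    Rule(book_iri, f"<{EX}year>", year_literal,
         condition=lambda row: row["year"].isdigit()),
]

rows = [
    {"id": "b1", "title": " Dune ", "year": "1965"},
    {"id": "b2", "title": "Emma", "year": "n/a"},
]
triples = list(apply_rules(rows, RULES))
ntriples = "\n".join(f"{s} {p} {o} ." for s, p, o in triples)
```

Here two rules are applied to each data element, and the second element yields only a title triple because its year fails the condition. Because `apply_rules` is a generator, the triples could also be written out one at a time rather than collected in a list.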
D2RML is complemented by the D2RML processor, the software that executes a D2RML document. The processor orchestrates the communication and interaction with the data sources and the retrieval of data from them, applies the data transformation rules, and produces the final RDF data, which may be either serialized and written to files or inserted directly into a triple store. The D2RML processor has been designed to be scalable, can make use of data caching, and can seamlessly handle data sources providing millions of data elements.