Data virtualization is a data-integration approach that lets an application retrieve and transform data without knowing how it is formatted at the source or where it is physically located. Today many organizations choose best-of-breed products over best-of-suite. This has a significant impact on the IT landscape: data ends up in disparate sources scattered across on-premises systems and the cloud (PaaS, SaaS).
Traditionally, many organizations implemented a chain of databases, such as a staging area, a central data warehouse, and several data marts, with ETL (Extract, Transform, Load) batch processes pushing data through the chain from one database to the next, thereby duplicating data across the landscape.
That used to be acceptable, but it no longer is. Organizations now need real-time insight into their operational data across agile architectures, and that is where data virtualization technology comes into play.
Data virtualization creates an abstraction layer based on views, with a common provisioning point for all “data consumers” through standard APIs (JDBC, ODBC, REST, OData, SOAP, etc.). Red Hat JBoss Data Virtualization is an example of an open source product available for this purpose.
It is not an in-memory data grid, since it does not replicate data or hold it in memory. There are options to enable caching for better performance, but caching is not its main purpose.
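To make the idea of a view-based abstraction layer a bit more concrete, here is a minimal sketch in Python using the standard sqlite3 module. It is not how JBoss Data Virtualization works internally. The two attached in-memory databases (crm and billing), their tables, and the customer_revenue view are all hypothetical stand-ins for separate source systems. The point is only that the consumer queries one view, and that the view holds a definition rather than a copy of the data.

```python
import sqlite3


def build_virtual_layer() -> sqlite3.Connection:
    """Attach two stand-in sources and expose one view over both."""
    conn = sqlite3.connect(":memory:")
    # Each attached in-memory database plays the role of a separate
    # physical source system (hypothetical names and schemas).
    conn.execute("ATTACH DATABASE ':memory:' AS crm")
    conn.execute("ATTACH DATABASE ':memory:' AS billing")
    conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO crm.customers VALUES (?, ?)",
        [(1, "Acme"), (2, "Globex")],
    )
    conn.executemany(
        "INSERT INTO billing.invoices VALUES (?, ?)",
        [(1, 100.0), (1, 50.0), (2, 75.0)],
    )
    # The view stores only its definition, not rows. A TEMP view is used
    # because SQLite only lets temporary views span attached databases.
    conn.execute(
        """
        CREATE TEMP VIEW customer_revenue AS
        SELECT c.name AS customer, SUM(i.amount) AS revenue
        FROM crm.customers AS c
        JOIN billing.invoices AS i ON i.customer_id = c.id
        GROUP BY c.id, c.name
        """
    )
    return conn


def customer_revenue(conn: sqlite3.Connection) -> list:
    """What a data consumer sees: one view, no knowledge of the sources."""
    query = "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = build_virtual_layer()
    print(customer_revenue(conn))
    # A change in a source is visible on the next query, because the
    # view is evaluated on demand instead of being loaded by a batch job.
    conn.execute("INSERT INTO billing.invoices VALUES (2, 25.0)")
    print(customer_revenue(conn))
```

The second query reflects the new invoice without any reload step, which is the contrast with an ETL chain that would only pick up the change on its next batch run.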
The architecture consists of at least four layers (Rick F. van der Lans, 2016):
- Virtual Base Layer
The virtual base layer contains views that expose the data stored in the source systems. For each physical table or file in a source system, a view is created. Each view definition may contain cleansing specifications to improve the quality of the data coming from the source systems. Apart from these corrections, the virtual contents of such a view are identical to the contents of the source system.
- Enterprise Data Layer
Views at the second layer present an integrated view of all the data in the source systems, hence the name enterprise data layer. The structure of each view is “neutral.” In other words, it’s not aimed at the needs of one data consumer, but at supporting as many forms of usage as possible. If possible, each view is structured according to the third normal form.
- Shared Specification Layer
To avoid too many duplicate and possibly inconsistent specifications in the data consumption views, the third layer contains shared specifications. The purpose of this layer is to make the environment as agile as possible by avoiding duplication of specifications. Optionally, the shared specifications layer contains authorization specifications (who is allowed to use which view).
- Data Consumption Layer
The structure of each view at the data consumption layer focuses on simplifying data access for data consumers. For example, for some data consumers it may be useful to see the data organized as a star schema, whereas others prefer to see all the data they require in one view consisting of a large set of columns. Filters, projections, transformations, and aggregations are specified at this layer to show data consumers only relevant data, at the right aggregation level and in the right form.
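The four layers above can be sketched as a stack of SQL views. The example below is only an illustration under stated assumptions: it uses Python's sqlite3 module with a single in-memory database, and the source tables, column names, the cleansing rules, and the "valid order means amount > 0" rule are all hypothetical. A real data virtualization product would define similar views over actual remote sources.

```python
import sqlite3

LAYERS = """
-- Stand-ins for two physical source systems with different column names.
CREATE TABLE src_web_orders (order_id INTEGER, country TEXT, amount REAL);
CREATE TABLE src_store_orders (ticket_no INTEGER, country_code TEXT, total REAL);

-- 1. Virtual base layer: one view per source table, with light cleansing.
CREATE VIEW base_web_orders AS
SELECT order_id, UPPER(TRIM(country)) AS country, amount
FROM src_web_orders;
CREATE VIEW base_store_orders AS
SELECT ticket_no, UPPER(TRIM(country_code)) AS country_code, total
FROM src_store_orders;

-- 2. Enterprise data layer: one integrated, consumer-neutral structure.
CREATE VIEW ent_orders AS
SELECT 'web' AS channel, order_id, country, amount FROM base_web_orders
UNION ALL
SELECT 'store', ticket_no, country_code, total FROM base_store_orders;

-- 3. Shared specification layer: a rule defined once and reused.
CREATE VIEW shared_valid_orders AS
SELECT channel, order_id, country, amount FROM ent_orders WHERE amount > 0;

-- 4. Data consumption layer: shaped for specific consumers.
CREATE VIEW report_revenue_by_country AS
SELECT country, SUM(amount) AS revenue
FROM shared_valid_orders GROUP BY country;
CREATE VIEW report_web_orders AS
SELECT order_id, country, amount
FROM shared_valid_orders WHERE channel = 'web';
"""


def build_layers() -> sqlite3.Connection:
    """Create the source tables, the layered views, and some sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(LAYERS)
    conn.executemany(
        "INSERT INTO src_web_orders VALUES (?, ?, ?)",
        [(1, " nl ", 120.0), (2, "DE", 80.0), (3, "nl", -5.0)],
    )
    conn.executemany(
        "INSERT INTO src_store_orders VALUES (?, ?, ?)",
        [(10, "NL", 40.0), (11, "de ", 60.0)],
    )
    return conn


def revenue_by_country(conn: sqlite3.Connection) -> list:
    """Aggregated consumption view, as a reporting tool might query it."""
    query = "SELECT country, revenue FROM report_revenue_by_country ORDER BY country"
    return conn.execute(query).fetchall()


def web_orders(conn: sqlite3.Connection) -> list:
    """Row-level consumption view that reuses the same shared rule."""
    query = "SELECT order_id, country, amount FROM report_web_orders ORDER BY order_id"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = build_layers()
    print(revenue_by_country(conn))
    print(web_orders(conn))
```

Both consumption views build on shared_valid_orders, so changing the validity rule in that one place changes it for every consumer. That is the kind of agility the shared specification layer is meant to provide.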
In the next blog post I may go into more detail about the architectural principles or the implementation strategy (top-down, bottom-up, inside-out).