Before you can analyse your data, you need to get it loaded into your database. Vertica supports multiple methods of data loading. We can load structured and semi-structured data directly from delimited data files, or we can collect data from other data sources using client connectors such as ODBC, JDBC, and ADO.NET.
We can connect to a Hadoop cluster or move data from Hadoop into Vertica. We can even load real-time streaming data using Kafka.
This blog covers how Vertica quickly loads and stores data from any source using a hybrid data storage model.
Before we can talk about loading data into the Vertica database, we need a quick review of the construct the data is loaded into: projections. Remember, Vertica supports tables as a logical concept. Since the SQL language is common and familiar, you can interact with your data directly using standard SQL commands that reference tables. However, that is not where the data is stored. The data is physically stored in projections: column-oriented structures based on the tables. Each projection contains a sorted subset of the columns in your table.
There are two basic types of projections. 1) A superprojection contains all the columns in your table. For every table, Vertica creates at least one superprojection. While this projection type allows you to query any data from the table, it may not give you the best query performance. For that we can create 2) query-specific projections, which contain only a portion of the table's columns. The columns are ordered and sorted based on your most commonly run queries, increasing performance even further.
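As a sketch of what this looks like in SQL, here is a hypothetical table with one query-specific projection. The table, column, and projection names are all illustrative; Vertica still creates the superprojection automatically when data is first loaded.

```sql
-- Illustrative table of web click events
CREATE TABLE clicks (
    user_id  INT,
    page_url VARCHAR(500),
    click_ts TIMESTAMP
);

-- A query-specific projection holding only the columns needed by
-- queries that filter clicks by user and time range, sorted to
-- match those queries
CREATE PROJECTION clicks_by_user (user_id, click_ts)
AS SELECT user_id, click_ts
   FROM clicks
   ORDER BY user_id, click_ts
   SEGMENTED BY HASH(user_id) ALL NODES;
```

Sorting by `user_id, click_ts` lets queries that filter on those columns read a narrow, pre-sorted range instead of scanning the whole superprojection.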
VERTICA OBJECT HIERARCHY
To understand what happens when data is loaded into Vertica, you will first need to understand the object hierarchy. While Vertica runs in a multi-node environment with the database distributed as evenly as possible across all the nodes, for simplicity's sake we are showing the object hierarchy on only one node. We've already discussed tables, which define the data you want to store, and projections, which organize that data based on your most commonly run queries to increase performance. But these are only structures for organizing your data and accessing it through a common language.
What happens when data is loaded into your database?
Every time data is written to disk, Vertica creates a data storage structure on disk called a container. Here we are showing only the containers created by four data loads for the first projection, although of course each projection gets its own containers on each write to disk. To maximize query performance, Vertica limits each projection to 1,024 containers per node. There is no limit on the size of the containers.
Within each container, a file is created for each column in the projection. The .gt format of the file keeps the data sorted, encoded, and identifiable across nodes. This is what allows Vertica to distribute data across multiple nodes.
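You can watch containers accumulate per projection through Vertica's monitoring schema. A sketch (the exact column set of `v_monitor.storage_containers` may vary between Vertica versions):

```sql
-- Count containers and total bytes per projection on each node
SELECT node_name,
       projection_name,
       COUNT(*)        AS containers,
       SUM(used_bytes) AS total_bytes
FROM v_monitor.storage_containers
GROUP BY node_name, projection_name
ORDER BY node_name, projection_name;
```

Watching the container count approach the per-node limit is a useful signal that the Tuple Mover (discussed below) has consolidation work to do.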
VERTICA'S HYBRID STORAGE MODEL
To support loading data into the database intermixed with queries in a typical data warehouse workload, Vertica implements the storage model shown below. This model is the same on each Vertica node.
The Write Optimized Store (WOS) is a memory-resident data store. Temporarily storing data in memory speeds up the loading process and reduces fragmentation on disk, and the data is still available for queries. For organizations that continually load small amounts of data, loading to memory first is faster than writing to disk, making the data accessible quickly.
The Read Optimized Store (ROS) is a disk-resident data store. When the Tuple Mover's MOVEOUT task runs, containers are created in the ROS and the data is organized into projections on disk. The Tuple Mover is the Vertica database optimizer component that manages the movement of data from memory (WOS) to disk (ROS), as well as the number of containers per projection in the ROS. The Tuple Mover task MOVEOUT moves data from memory to disk; the task MERGEOUT combines the containers on disk.
LOADING DATA: Trickle and Bulk Loads
Why did Vertica build this loading model? To support different types and sizes of data loads. If you have small, frequent data loads, or trickle loads, the best practice is to load the records into memory (into the WOS). Data loaded into the WOS is still available for query results. The size of the WOS is limited to 25% of the available RAM or 2GB, whichever is smaller. If the amount of data loaded into the WOS exceeds this size, the data automatically spills to disk, into the ROS. For the initial bulk data load and for subsequent large loads, the best practice is to load the data directly to disk, where it is stored in the ROS. This process leads to the most efficient loading with the least demand on cluster resources.
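In practice, the load method is chosen on the COPY statement. A sketch using the illustrative table and file paths from earlier (AUTO writes to the WOS and spills to the ROS when the WOS fills; DIRECT bypasses the WOS and writes straight to disk):

```sql
-- Trickle load: small batch goes to memory (WOS) first
COPY clicks FROM '/data/clicks_hourly.csv' DELIMITER ',' AUTO;

-- Bulk load: large file goes straight to disk (ROS),
-- skipping the WOS
COPY clicks FROM '/data/clicks_history.csv' DELIMITER ',' DIRECT;
```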
The Tuple Mover: The MOVEOUT Task
Here we show how data is moved from memory to disk using the Tuple Mover task MOVEOUT. Initially, some data has already been written to disk; you can see the containers in the ROS. By default, small data batches are loaded into the WOS, where the data is written to this memory-resident store in row format. This allows you to query data as soon as it is loaded. Data in the WOS is available to be queried, but it is unoptimized and volatile, so it will take longer to return a result from data in the WOS.
When the WOS reaches its maximum capacity (25% of the available RAM per node, up to 2GB), or at regular configured intervals (by default, every 5 minutes), the Tuple Mover task MOVEOUT writes the data to a temporary memory space, where it is reorganized into the columnar, sorted, and encoded format defined by the projection definitions.
Once the data reorganization is complete, the MOVEOUT task writes the data to disk, into containers in the ROS. When the data is committed to disk, it is removed from the temporary space.
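Although MOVEOUT normally runs on its own schedule, it can also be invoked manually, for example before a heavy query against recently trickle-loaded data. A sketch, assuming the illustrative table name `clicks`:

```sql
-- Ask the Tuple Mover to flush WOS data for one table to the ROS
SELECT do_tm_task('moveout', 'clicks');

-- Or move out all WOS data on the node
SELECT do_tm_task('moveout');
```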
The Tuple Mover: The MERGEOUT Task
Because the process of querying data from many files can be relatively slow, the MERGEOUT task consolidates multiple containers into fewer containers, allowing queries to run more efficiently. By default, the MERGEOUT task runs automatically every 10 minutes. Initially, there have been multiple loads of data into the ROS, creating multiple containers.
The MERGEOUT task first moves the containers into temp space where they are condensed. The data is resorted and re-encoded.
The condensed containers are written back to the ROS. This process creates new, larger containers.
Finally, the data is removed from temp space.
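Like MOVEOUT, the MERGEOUT task can be triggered manually when a projection has accumulated many small containers. A sketch, using the illustrative table name from earlier:

```sql
-- Consolidate ROS containers for one table's projections
SELECT do_tm_task('mergeout', 'clicks');
```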