Quantcast
Channel: Simple Talk
Viewing all articles
Browse latest Browse all 182

Making Temporal Databases Work. Part 3: Saving Data

$
0
0

The temporal database is a database that can keep information on time when the facts represented in the database were, are, or will be valid. We briefly described major concepts of temporal databases and discussed types of queries that such databases can support in part 1 of this article(Saving Data Historically with Temporal Tables: Part 1: Queries) and discussed how aggregation can work in temporal databases in the part 2 (Making Temporal Databases Work. Part 2: Computing Aggregates Across Temporal Versions).

In both parts 1 and 2 we assumed that data are somehow stored in the temporal tables and concentrated on retrieval of these data. In this part 3 we discuss how data are saved and modified in a temporal database and what kind of integrity constraints are needed or desired for temporal tables.

The content of part 1 is essential for understanding the part 3. The examples in the part 2 are based on table definitions and data from part 1 of this article (there is a script for creating these objects and data in an appendix to this article if you want to reset your objects.)

Reminders

To avoid confusion, we need to briefly emphasize a few things. First, the difference between temporary and temporal: the former means that the item (e.g. table) is kept for limited period of time (usually temporary tables are dropped at the end of a session), while the latter means that an item (table keeps versions of the data that were valid at different points in time.

We also need to remind you that a row of the temporal table has a period of validity represented as a pair of timestamps. The lower boundary (start of the period is included into period and the upper boundary (the end of the period) is not. A validity period is associated with every value, but we assume that periods of all attributes of a row are the same.

The semantics of the time period may be different depending on the time dimension (such as system time or valid time). The behavior of data manipulation operators (INSERT, UPDATE, and DELETE) may vary for different time dimensions. It is essential for aggregations discussed in part 2 that periods represent valid time, but we’ll discuss different time dimensions in this part 3.

Integrity Constraints

In a properly designed database, a row in a table describes some properties of an entity in real world, and any entity is described with only one row. The one-to-one correspondence between table rows and real objects is usually ensured with a primary key (which is unique and is not null).

In a temporal table this requirement reads as: a real entity is described by a single row at any point in time. To ensure this kind of uniqueness, we can include columns defining the period into definition of the primary key. Unfortunately, this does not work because different periods may overlap.

Uniqueness

Instead, the non-temporal UNIQUE constraint is replaced with requirement that periods of rows containing same value have non-overlapping periods.

In PostgreSQL this type of constraint is expressed with EXCLUDE clause. The complete definition of our temporal_emp table (defined in the part 1 without constraints) looks as follows:

create table emp_temporal (
emp_no int not null, 
name text not null,
salary int not null,
project text,
emp_row_id bigint  not null primary key
    generated by default as identity,
emp_ts timestamptz not null,
emp_te timestamptz not null,
EXCLUDE USING GIST (emp_no WITH =, tstzrange(emp_ts,emp_te) WITH &&)
);

Or to add this to the table that already exists, use an ALTER TABLE statement:

ALTER TABLE emp_temporal
  add constraint emp_temporal_unique_emp_no_rows
  EXCLUDE USING GIST (emp_no WITH =, tstzrange(emp_ts,emp_te) 
                                                     WITH &&
  )

Note that you may need to install an extension for this to execute by executing this code:

CREATE EXTENSION btree_gist;

If you may get the following error:

SQL Error [42704]: ERROR: data type integer has no default operator class for access method "gist"

Hint: You must specify an operator class for the index or define a default operator class for the data type.

The EXCLUDE clause contains a list of columns or expressions togeth4er with comparison operators. The && operator compares two ranges and returns TRUE if these ranges overlap. The function tstzrange constructs a period (that is, a range of timestamps) from two timestamps with time zone passed as arguments. Finally, the keyword GIST specifies the type of index that supports this constraint. One GIST index can support multiple values (two in the above table definition).

The EXCLUDE constraint requires that for any two rows in the table at least one of comparisons return FALSE for any row of the table.

Note that if all comparison operators are equality (=), then EXCLUDE is equivalent to UNIQUE constraint.

So, the EXCLUDE constraint can be viewed as temporal version of UNIQUE constraint. If periods do not overlap, they are not equal. Therefore, if all columns in the EXCLUDE clause are NOT NULL, then this combination of columns can also serve as a primary key. However, we define surrogate primary key emp_row_id.

Foreign Keys

The temporal equivalent of the FOREIGN KEY constraint is trickier. In a non-temporal database this constraint requires that the value of a column (or columns) constrained as foreign key must occur as a value of referenced column (or columns) in another or same table. An additional requirement must be satisfied for temporal tables: the referenced value must exist at any point in time when the foreign key exists.

We need one more table to illustrate the temporal foreign key constraint. The temporal_project table is defined as follows:

create table project_temporalt (
proj_code text not null, 
proj_title text not null,
proj_status text not null,
proj_row_id bigint  not null primary key
    generated by default as identity,
proj_ts timestamptz not null,
proj_te timestamptz not null,
EXCLUDE USING GIST (proj_code WITH =, tstzrange(proj_ts,proj_te) WITH &&)
);

According to any textbook on project management, each project must have start and end point in time (obviously, this are point in valid time). So, we populate this table with rows for projects that are already in the emp_temporal table:

insert into project_temporal (
proj_code, proj_title, proj_status, proj_ts, proj_te
) values
('p11', 'temporal join', 'active', '2023-01-01', '2023-08-01'),
('p15', 'temporal aggregates', 'active', '2023-01-01', '2024-01-01'),
('p20', 'temporal update', 'active', '2023-01-01', '2024-01-01')
;

The requirement of the project management is now satisfied. However, we are having the following problems:

  • Project is not visible after their end date.
  • The rows in the emp_temporal table cannot reference to the project_temporal table after termination of a project.

To address these problems in a non-temporal database, we would change status of a project instead of deleting it from the database. In the temporal database we insert additional rows into project_temporal table indicating the change of status:

insert into project_temporal (
proj_code, proj_title, proj_status, proj_ts, proj_te
) values
('p11', 'temporal join', 'closed','2023-08-01', 'infinity'),
('p15', 'temporal aggregates', 'closed','2024-01-01', 'infinity'),
('p20', 'temporal update', 'closed','2024-01-01', 'infinity')
;

We can now re-formulate the temporal foreign key constraint:

The period of a foreign key column value must be contained in the union of periods of the rows containing the referenced value.

For example, the latest row for Timo references the project p20 and its period is contained in the union of periods in both rows for this project in the project_temporal table and similarly for other rows in the emp_temporal_emp table. So, the emp_temoral.project column can be a foreign key referencing project_temporal.proj_code column.

Two remaining SQL constraints (NOT NULL and CHECK) look easier at the first glance: these constraints can be used as in non-temporal database. However, if such constraints are specified with ALTER statement (rather than when a table is created), then only rows with periods that intersect with period of a constraint should be checked.

For example, if the NOT NULL constraint was not specified for the project column when the table was created but must be enforced starting from Oct. 14, 2023, then the constraint can be added with the following SQL statement:

ALTER TABLE emp_temporal add constraint temporal_not_null_project
  CHECK (project is not null or NOT (emp_ts, emp_te) 
                     overlaps ('2023-10-14','infinity'));

Similarly, the same condition on row period can be OR-ed in the CHECK constraint.

Modifying Data in Temporal Tables

We are now ready to discuss how data can be modified in a temporal table. The SQL standard specifies syntax extensions for operations on temporal tables, but we do not use these extensions in this article. Instead, we show how temporal operations can be expressed with SQL statements operating on temporal tables. In some cases, we’ll need several SQL statements to express single temporal modification.

Modifications in a non-temporal database are visible (and valid) from the time when the statement was executed indefinitely, that is, until next modification of the same data item. This behavior is modelled with period from current point in time until infinity in the temporal database. The interpretation of the current point in time depends on the time dimension: it is always the timestamp of the transaction for SYSTEM time and may be specified by user for other time dimensions.

Inserting new rows can be performed as in a non-temporal table as long as the periods of the rows being inserted do not overlap with existing rows with same business key. For example, the following statement inserts new project:

insert into project_temporal (
proj_code, proj_title, proj_status, proj_ts, proj_te
) values
('p24', 'temporal FK', 'active','2023-10-01', 'infinity');

The DELETE operation makes rows invisible after the point in time when the deletion is executed. To obtain this behavior in the temporal database the rows are not removed from the database. Instead, the of the period is set to the time of temporal deletion. In other words, a DELETE operation is replaced with UPDATE of period end in the tamoral table. For example, assume that Antti leaves the company on Jan.1 2024 and his data should not be visible from this date. The temporal DELETE just updates the end of time period:

--Temporal delete row after 2024-01-01
update emp_temporal set emp_te = '2024-01-01'
where name='Antti'
and '2024-01-01'>= emp_ts and '2024-01-01' < emp_te;

Note that WHERE clause contains condition on time period (same as the condition for point-in-time queries). This condition ensures that only one row (per business key) is affected. Note that an attempt to set the period boundary to the same value (for same business key) would result in violation of EXCLUDE constraint because updated periods would inevitably overlap.

Any UPDATE operation in non-temporal database is logically equivalent to a pair of a delete operation (implemented with an UPDATE statement) removing the old values and INSERT operation inserting new values. The same idea works for temporal databases using temporal versions of DELETE (that is, update of the period) and INSERT described above. For example, the end date for just inserted project ‘p24’is not defined. The following statements will make the project manager happy:

--- Temporal DELETE
update project_temporal set proj_te = '2024-04-01' 
Where proj_code = 'p24'
and '2024-01-01'>= proj_ts and '2024-01-01' < proj_te;

--- Temporal INSERT
insert into project_temporal (
proj_code, proj_title, proj_status, proj_ts, proj_te
) values
('p24', 'temporal FK', 'closed','2024-04-01', 'infinity');

We can express temporal update as a single SQL statement using common table expression that contains UPDATE with RETURNING clause. For example, the following statement performs temporal update changing project assignment for Timo:

with temporal_delete as (
update emp_temporal set emp_te = '2023-10-01'
where name ='Timo'
   and emp_ts <= '2023-10-01'
   and '2023-10-01'< emp_te
returning *
)
insert into emp_temporal (
   emp_no, name, salary, project, emp_ts, emp_te) 
select 
   emp_no, name, salary, 'p24', emp_te, 'infinity'
from temporal_delete;

All examples above modified only one row. Of course, any of operations can process several rows as specified in WHERE clause or in the INSERT statement exactly as it can be specified in non-temporal databases.

Time Dimensions

We will remind you that there are dozens of ways to interpret time (usually called time dimensions). The most widely known are system (or transaction) time and valid (application) time dimensions. The model in [5] introduced asserted dimension that represents time when user believes the facts stored in the database are true. The valid time is also called application and effective time dimension.

System Time

If system time is in use, the start of a period is time when the value was stored in the database and the end of period is the time when the value was replaced with another value or logically deleted.

The advantages of system time are:

  • Easy migration of applications from non-temporal database as periods are obtained from system clock.
  • Any modification may touch only latest version of a row (that is, the latest row for each business key).
  • Any value stored in the database is never changed or deleted as only ends of periods need to be changed.

Temporal databases supporting system time are call historical. However, they represent history of a database, rather than a history of the pert of the real world modelled with the database. The asserted time is like system: it does not allow modifications of the past but decouples changes from the actual database transactions.

Valid Time

In contrast with system or asserted time, the database supporting valid time keeps time when changes occur in the real world. This time dimension is much more useful from the modelling perspective but the values for period boundaries must be supplied from the application.

For example, a decision to increase the salary starting from July 1, 2023 (valid time) might be finalized on June 15, 2023 (asserted time) and the database is updated in the next day June 16, 2023 (system time).

Within valid time it is possible and makes sense to update rows with periods in the past to correct errors. Such update may require modification of several rows (if new period covers multiple periods in the previous state of the table). So, the database is not historical if only valid time is supported.

Bi-temporal time

The bi-temporal databases support two-dimensions of time: system (or asserted) and valid time. Each row in a bi-temporal table has two associated time periods and the EXCLUDE constraint includes both. Modifications can either create new periods in valid dimension or provide new values for already existing valid periods. The former are interpreted as changes in the real world, while latter are corrections. A correction is stored as a new row, so all previous states of the data are still available (similar to historical databases). All queries discussed in parts 1 and 2 can be executed in a bi-temporal database as point-in-system-time and periods described in parts 1 and 2 as valid time periods.

The SQL standard defines system time dimension and provides basic support for one more dimension. So, bi-temporal tables can be created on top of standard SQL temporal features.

Performance Considerations

Any attempt to use temporal databases faces skeptical reactions: the performance is expected to be unacceptably poor. Indeed, the size of a temporal database is definitely much larger than the size of a database containing latest versions of data only.

However, performance expectations are often based on an intuition rather than rigorous estimations or measurements. This intuition is often misleading for the following reasons:

  • The cost of index search depends on the table size logarithmically (rather than linear), and the cost of index scan does not depend on the size of table. Therefore, the extra cost needed for filtering on time (for example, extraction of actual state) only) is negligible.
  • The cost of subsequent operations (for example, joins and grouping) depends on the size of data after filtering, rather than the size of stored tables. So, the size of data to be processed is approximately same as for non-temporal database (after proper filtering).
  • Temporal tables are pretty stable: any row can be updated at most once (when the period is closed due to update or deletion). Therefore, the amount of housekeeping procedures (such as PostgreSQL vacuum) is significantly reduced. Additional indexes (on periods) are not over-expensive for the same reason.

Of course, achieving good performance requires database tuning, but this is true for any database with high workload, no matter temporal or not.

Finally, the extra performance costs should be compared with extra benefits provided by temporal databases.

Conclusions

In this series of articles, we discuss how an application can obtain benefits from a temporal database. We started from general discussion of temporal databases, what information can be obtained and how to query temporal databases in part 1. The part 2 covers grouping and aggregation. The final part 3 (this part) discusses integrity constraints, modifications (INSERT, DELETE, and UPDATE), as well as relationships of modification with time dimensions.

Finally, although our examples use PostgreSQL-specific features, we tried to reduce dependency on the specific DBMS.

References

During first decades of research on temporal database a complete bibliography was maintained. More information on this bibliography is available in [1]. The book [2] highlights the major outcome of that research. A systematic presentation of theoretical viewpoint on temporal databases can be found in [3].

An article [4] provides an overview of temporal features in SQL Standard 2011 (that weren’t significantly changed in subsequent editions of the Standard). It also contains rationale for decisions made in the Standard.

One of several practical approaches to implementation of temporal features is described in [5]. The authors introduce asserted time dimensions and describe advantages of bi-temporal data model based on asserted and effective time dimensions.

An article [6] introduces an alignment operation that provides an extension of relational algebra supporting temporal operations for one-dimensional time.

Finally, an emotionally rich annotated bibliography is available at [7].

  1. Michael D. Soo. 1991. Bibliography on temporal databases. SIGMOD Rec. 20, 1 (March 1991), 14–23. https://doi.org/10.1145/122050.122054
  2. Abdullah Tansel, James Clifford, Shashi Gadia, Sushil Jajodia, Arie Segev, and Richard T. Snodgrass (editors). Temporal Databases: Theory, Design, and Implementation. 1993. 
  3. C. J. Date, Hugh Darwen, Nikos Lorentzos. Time and Relational Theory, Second Edition: Temporal Databases in the Relational Model and SQL. 2nd edition, 2014.
  4. Krishna Kulkarni and Jan-Eike Michels. “Temporal Features in SQL:2011”. SIGMOD Record, September 2012
  5. Tom Johnston and Randall Weis. Managing Time in Relational Databases: How to Design, Update and Query Temporal Data. 2010.
  6. Anton Dignös, Michael H. Böhlen, and Johann Gamper. 2012. Temporal alignment. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD ’12). Association for Computing Machinery, New York, NY, USA, 433–444. https://doi.org/10.1145/2213836.2213886
  7. Temporal Databases Annotated Bibliography. https://illuminatedcomputing.com/posts/2017/12/temporal-databases-bibliography/

Appendix:

Script to initialize the objects you will need for the examples in this article:


-- Using any database you wish drop schema if exists temp_agg cascade; create schema temp_agg; -- all one part named objects will be in temp_agg by default set search_path to temp_agg; create table emp_temporal ( emp_no int not null, name text not null, salary int not null, project text not null, emp_row_id bigint not null primary key generated by default as identity, emp_ts timestamptz not null, emp_te timestamptz not null, EXCLUDE USING GIST (emp_no WITH =, tstzrange(emp_ts,emp_te) WITH &&) ); insert into emp_temporal (emp_no, name, salary, project, emp_ts, emp_te) values (1,'Anne', 4000, 'p11','2023-01-01', '2023-05-01'), (1,'Anne', 4000, 'p15','2023-05-01', 'infinity'), (2,'Esa', 3200, 'p20','2023-01-01', '2023-04-16'), (2,'Esa', 3600, 'p11','2023-04-16', '2023-08-01'), (3,'Eja', 4500, 'p15','2023-01-01', '2023-04-20'), (3,'Eja', 4800, 'p15','2023-04-20', 'infinity'), (4,'Timo', 4800, 'p15','2023-01-01', '2023-06-15'), (4,'Timo', 4800, 'p20','2023-06-15', '2023-07-01'), (4,'Timo', 5000, 'p20','2023-07-01', 'infinity'), (5,'Antti', 3600, 'p15','2023-06-01', '2023-09-01'), (5,'Antti', 3900, 'p15','2023-09-01', 'infinity'), (6,'Merja', 4200, 'p11','2023-01-01', '2023-03-01'), (6,'Merja', 4200, 'p15','2023-03-01', '2023-07-01'), (6,'Merja', 4200, 'p20','2023-07-01', '2023-09-01'), (6,'Merja', 4800, 'p20','2023-09-01', 'infinity') ; create table project_temporal ( proj_code text not null, proj_title text not null, proj_status text not null, proj_row_id bigint not null primary key generated by default as identity, proj_ts timestamptz not null, proj_te timestamptz not null, EXCLUDE USING GIST (proj_code WITH =, tstzrange(proj_ts,proj_te) WITH &&) ); insert into project_temporal ( proj_code, proj_title, proj_status, proj_ts, proj_te ) values ('p11', 'temporal join', 'closed','2023-08-01', 'infinity'), ('p15', 'temporal aggregates', 'closed','2024-01-01', 'infinity'), ('p20', 'temporal update', 'closed','2024-01-01', 'infinity') ; create table qu ( code text not null, qu_ts timestamptz not null, qu_te timestamptz not null, EXCLUDE USING GIST (code WITH =, tstzrange(qu_ts, qu_te) WITH &&) ); insert into qu (code, qu_ts, qu_te) values ('q1','2023-01-01','2023-04-01'), ('q2','2023-04-01','2023-07-01'), ('q3','2023-07-01','2023-10-01'), ('q4','2023-10-01','2024-01-01'); Create view emp_current as select emp_no, name, salary, project, emp_row_id, emp_ts, emp_te from emp_temporal where emp_ts <=current_timestamp and current_timestamp < emp_te;

The post Making Temporal Databases Work. Part 3: Saving Data appeared first on Simple Talk.


Viewing all articles
Browse latest Browse all 182

Trending Articles