PhD defense Dean De Leo (Database Architectures) — CWI Amsterdam

Everyone is invited to attend Dean De Leo’s public defense of his thesis titled ‘Analyze dynamic graphs on sparse arrays‘.

Promoters: prof.dr. PA Boncz (CWI, VU) and dr. Hannes Mühleisen (CWI)

Link to attend the defense: https://vu.nl/en/events/2022/phd-defence-d-de-leo

Efficient chart analysis is difficult due to irregular chart structures and becomes even more complex as updates occur during chart analysis. Addressing this scenario is the subject of this thesis.

This research was carried out by Dean de Leo in the CWI Database Architectures group and advised by Peter Boncz and Hannes Mühleisen. It studies the properties of a data structure called Packed Memory Arrays, proposes a new variant called Rewired Memory Arrays, and uses these data structures in a new open-source transactional graph analysis system called Teseo. Dean de Leo now works for graph database company Relational.ai

The committee is formed by Dr. Semih Salihiglu (associate professor at the University of Waterloo), dr. Petra Selmer (graph query languages ​​team leader at neo4j), prof. George Fletcher (TU/e), prof. Stefan Manegold (University of Leiden and CWI) and prof. Henri Bal (VU) and will be chaired by prof. Jaap Heringa (VU).

Hope to see you there!

== Analysis of Dynamic Graphs on Sparse Arrays – Dean de Leo PhD Thesis ==

Frameworks optimized for graph analysis tend to rely on write-unfriendly data structures, often assuming that the graph being examined is static and immutable.

Traditionally, small transactions and detailed updates have mostly been served by different DBMSs, explicitly designed for online transaction processing (OLTP). At a later stage, data would be extracted from these systems and loaded into a graphical framework designed for analytical queries (OLAP). This pipeline, however, is not suitable for time-sensitive scenarios that require analyzing dynamic graphs, where their components (vertices, edges) and properties can be changed continuously in real time. These scenarios are commonly referred to as hybrid workloads or Hybrid Transactional Analytical Processing (HTAP) and are a general open problem in research: the same techniques and data structures that make systems so efficient in analytical queries are often the primary culprits that render them ineffective in fine updates.

In the thesis, we study a data structure, the Packed Memory Array (PMA), also called sparse array, which promises to bridge the performance of analytical queries, by allowing fast sequential analyses, with a potentially high throughput in updates. day. PMA is an existing data structure, but so far it has been the subject of essentially theoretical work. In our early experiments, we find that the PMA, in practice, suffers from a variety of structural problems than the developed theory implies. We present a new data structure, named Rewired Memory Array (RMA), derived from PMA, aiming to solve its problems by improving or inventing a new set of techniques. Finally, we prove that RMA is competitive with hand-tuned B+ trees in terms of fine updates and that it can achieve close performance in (range) scans to static dense arrays, in sequential and parallel environments.

From the RMA, we develop a new system explicitly intended for the context of graph analysis. Computations on graphs tend to present a number of challenges, caused by frequent random memory jumps, high memory to computation ratio, and partitioning issues. The Compressed Sparse Row (CSR) is an efficient representation established to operate on graphs, already adopted by a number of frameworks, but it is read-only: any modification of the graph requires rebuilding the entire data structure. Instead, most related work, which explicitly targets graph analysis on dynamic graphs, is based on a common design, which physically mimics the same structure as the adjacency list model. We note, however, that these systems have several functional limitations, related to their capabilities and consistency and isolation guarantees, and non-functional limitations, related to update bias and missed opportunities in scan access patterns. . We present a new system, named Teseo, to analyze dynamic graphs. Teseo adopts a design based on fat trees, B+ trees with very large leaves, where the leaves are sparse tables, and offers complete transactional solutions.
Support. We show that Teseo can overcome most of the functional constraints of existing systems to analyze dynamic graphs, while achieving comparable performance, depending on the access models, either to the CSR or, in the worst case, to other existing systems. .

Sparse arrays and large trees also represent two new building blocks for making generic systems equally capable of handling OLAP and OLTP workloads. Yet column stores, widely used in analytical systems, leverage additional techniques, including vertical partitioning and compression, that can further improve their efficiency in analyzes at the cost of updates. Although these techniques could be directly mapped to the data structures presented in the thesis, they would also compromise their updateability and, therefore, serve as main points for future work.