An exploration of dependent types for data-centric programming

Capstone project for the Bachelor of Computer Science course at the Insitute of Mathematics and Statistics of the University of São Paulo (IME-USP).

Student: Eduardo Sandalo Porto <eduardo@sandalo.dev>
Supervisor: Prof. Dr. Ana Cristina Vieira de Melo.

This page contains the the main monograph, the thesis proposal and the banner presented at the university.

Abstract

Type systems define rules that assign types to expressions of a language, and are the subject of study of type theory. The first theory of types was devised by Bertrand Russel in the early 20th century to be a logical foundation of mathematics, although it couldn't fulfill its promise. Dependently-typed programming languages restore this goal through the Curry-Howard correspondence, summarized by the "propositions as types" motto: types correspond to logical propositions and programs to their proofs. As such, they can be used not only to write programs, but also to formally verify them.

The goal of this project is to study dependent types from the ground up and apply them to a real-world scenario, namely, data-centric programming. The study of types starts at its conception, later deepening into how it intertwined with the field of programming languages through the lambda calculus. To do so, we first look at its untyped formulation, then the simply-typed lambda calculus, its extensions within the lambda cube, and we finally study dependently-typed theories.

The primary goals of data-centric programming are to represent, query, and transform data. A classic example is the SQL language and related environments, however for this project we focused on exploratory data analysis tools like the pandas library. We re-implemented a data frame system in the dependently-typed language Lean 4 basing it on an existing implementation in Idris 2. The result was a verified domain-specific language to load, query, and transform tabular data.

We conclude that writing verified programs can be tricky, including our data frame implementation and its usage, but can be beneficial for a few use cases. Exploratory processes can be improved by disallowing semantically-invalid operations, but our solution shines most in writing data operations for critical use-cases. A next step for the project is to implement a transpiling engine able to transform queries from our language into well-known systems like pandas and R.