TCC Eduardo

LKML Queryable format public dataset

The development of the Linux Kernel has been conducted in the open since the project was first introduced on a public mailing list. Public archives of these discussions are available in a couple of places, and they provide a way to download and dig deep into the discussions. However, most archives provide data in HTML or mbox formats, with limited search support.

RQ1: Are the emails in the public mailing lists a good resource for answering other questions about the development of the Linux Kernel?
- There are a few articles analyzing patches and reviews. See references.

RQ2: Can we produce a new archive in an open, queryable columnar format such as Apache Parquet or Apache ORC, similar to the Software Heritage Archive?

Completing RQ2 allows us to answer RQ1 easily using SQL.
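
As an illustration of the kind of question SQL could answer once the dataset exists, here is a minimal sketch. It uses an in-memory SQLite database as a stand-in for the columnar dataset (the real target would be Parquet or ORC files queried by an engine that reads them). The table name `messages`, its columns, and the sample rows are all hypothetical and are not part of any existing schema.

```python
import sqlite3

# In-memory stand-in for the future dataset (assumption: the real data
# would live in Parquet/ORC files, not in SQLite).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE messages (
        message_id   TEXT PRIMARY KEY,  -- hypothetical column names
        author_email TEXT,
        sent_year    INTEGER,
        is_patch     INTEGER            -- 1 if the subject looked like a patch
    )
    """
)

# Made-up sample rows, only to make the query runnable.
rows = [
    ("<1@example>", "alice@example.org", 2024, 1),
    ("<2@example>", "alice@example.org", 2024, 1),
    ("<3@example>", "bob@example.org", 2024, 1),
    ("<4@example>", "bob@example.org", 2024, 0),
    ("<5@example>", "carol@example.org", 2023, 1),
]
conn.executemany("INSERT INTO messages VALUES (?, ?, ?, ?)", rows)

# Example question: who sent the most patches in a given year?
query = """
    SELECT author_email, COUNT(*) AS patches
    FROM messages
    WHERE is_patch = 1 AND sent_year = 2024
    GROUP BY author_email
    ORDER BY patches DESC, author_email
"""
result = conn.execute(query).fetchall()
print(result)
```

The same query text should carry over with little change to any SQL engine that can read the final columnar files.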

Resources needed:
By the nature of the format used in these discussions (plain-text emails), this task does not need specialized hardware to be completed.
The official kernel.org mailing list archive estimates the necessary space at "20+ GB" (for messages going back to 1998). This is easy to maintain, even if the estimate is off by 10x.

The download instructions are available at: kernel.org/lore
The software used to host and mirror the mailing lists is also available at the same link.
This should be useful for creating the scripts that keep the archive up to date.
Other archives may also contain emails older than 1998, and this could be explored if deemed useful.

Additional sources for obtaining the archive:

(official) lore.kernel.org/lkml
lkml.org/lkml
marc.info
lkml.indiana.edu/hypermail/linux/kernel/
lore.kernel.org/linux-arch/

Deliverables:

(RQ1) Answer questions related to the maintainers for the other paper.
(RQ2) Provide an initial public dataset, like the Software Heritage one, and the scripts necessary to reproduce and continue to expand the archive.

Main tasks:
- understand the format used in existing archives
- download all the kernel.org archive (+ others if useful)
- create a formal table schema to store our archive.
-> The desired output will very likely be a multi-table design, with message content and metadata stored in different tables.
-> the format should accommodate different query types, like statistical analysis on authors, threads, or content
-> by its nature, a graph representation could be useful. The similarity to the Software Heritage archive could also yield useful insights.
- Add additional information:
-> Most (or at least some) mailing lists specify a tag format for message subjects. There are also well-known tags like "[PATCH]" and version tags such as "v1, v2, vX". This information is not precise, but the script should apply heuristics to try to identify these tags and add them to the schema as extra metadata.
- Create the dataset in a queryable format like Apache Parquet or Apache ORC.
-> The dataset should use Hive-style folder partitioning for dates. This is well documented in various places, for example: https://duckdb.org/docs/stable/data/partitioning/hive_partitioning.html
-> make it available in a public hosting platform or in the IME infrastructure
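
A minimal sketch of two of the tasks above, using only the Python standard library: turning one raw email into a metadata row with best-effort subject tags, and deriving a Hive-style partition folder from the message date. The function names (`parse_subject_tags`, `message_to_row`, `hive_partition_path`), the column names, the `year=/month=` layout, and the sample message are all hypothetical choices made for illustration. The regular expressions are simple heuristics and will miss list-specific tag conventions.

```python
import re
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

BRACKET_RE = re.compile(r"\[([^\]]*)\]")          # "[PATCH v2 1/3]" groups
VERSION_RE = re.compile(r"v(\d+)\b", re.IGNORECASE)  # "v2", also "PATCHv2"
SERIES_RE = re.compile(r"\b(\d+)/(\d+)\b")        # "1/3" = part 1 of 3
RFC_RE = re.compile(r"\bRFC\b", re.IGNORECASE)


def parse_subject_tags(subject):
    """Best-effort extraction of patch-related tags from a subject line."""
    tags = {"is_patch": False, "is_rfc": False,
            "version": None, "part": None, "total": None}
    for group in BRACKET_RE.findall(subject):
        has_patch = "PATCH" in group.upper()
        has_rfc = RFC_RE.search(group) is not None
        if not (has_patch or has_rfc):
            continue  # unrelated bracket group; ignore it
        tags["is_patch"] = tags["is_patch"] or has_patch
        tags["is_rfc"] = tags["is_rfc"] or has_rfc
        version = VERSION_RE.search(group)
        if version:
            tags["version"] = int(version.group(1))
        series = SERIES_RE.search(group)
        if series:
            tags["part"] = int(series.group(1))
            tags["total"] = int(series.group(2))
    return tags


def message_to_row(raw):
    """Build one metadata row (a dict) from a raw RFC 5322 message string."""
    msg = message_from_string(raw)
    subject = msg.get("Subject", "")
    row = {
        "message_id": msg.get("Message-ID", "").strip(),
        "in_reply_to": msg.get("In-Reply-To"),  # None for thread roots
        "author_email": parseaddr(msg.get("From", ""))[1],
        # Raises on a missing or malformed Date header; real archives
        # would need explicit handling for such messages.
        "sent_at": parsedate_to_datetime(msg["Date"]),
        "subject": subject,
    }
    row.update(parse_subject_tags(subject))
    return row


def hive_partition_path(root, sent_at):
    """Folder for a message, in key=value (Hive-style) form."""
    return f"{root}/year={sent_at.year}/month={sent_at.month}"


# Made-up message, only for demonstration.
SAMPLE = """\
From: Jane Doe <jane@example.org>
Date: Mon, 05 Feb 2024 10:30:00 +0000
Message-ID: <123@example.org>
Subject: [PATCH v2 1/3] mm: fix off-by-one in example

Body text.
"""

row = message_to_row(SAMPLE)
print(row)
print(hive_partition_path("lkml", row["sent_at"]))
```

Keeping the tags as separate nullable columns, instead of one free-text field, is one way to make them usable in SQL filters while still signalling that the extraction is imprecise.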

References:

- Daniel Schneider, Scott Spurlock, and Megan Squire. 2016. Differentiating Communication Styles of Leaders on the Linux Kernel Mailing List. In Proceedings of the 12th International Symposium on Open Collaboration (OpenSym '16). Association for Computing Machinery, New York, NY, USA, Article 2, 1-10. doi.org
- Used the lkml.indiana.edu archive.
- Downloaded messages and stored them in MySQL for analysis
- Cited by 19 other articles, some of which used the same technique: dl.acm.org

Most recent one, with the same technique.