Modeling Git Commits with Neo4j

Table Of Contents

For the coderadar project, I’m currently searching for a way to create a persistent model of a git commit history that contains the relationships between all commits and the files that were touched within these commits. And since coderadar is a code quality server, the model should also be able to express the history of (code quality) metrics on all files throughout all commits.

My first reflex was to model this with JPA entities in a relational database and build some really complex HQL queries to access the data. Turns out it works, but the relational schema required me to create a join table between the table representing commits and the table representing the files that were touched within the commits. For each commit, this table contained one entry for each file that exists at the time of the commit, even if the file was not modified within the commit, so that I could run queries on it! A test with a git repository containing about 7 500 files and a couple thousand commits resulted in 70 million entries in that join table! There has to be a different solution that does not waste millions of bytes in join tables. Hence, I had a look at Neo4j.

Why Neo4j?

Neo4j is a graph database. A graph allows modelling of entities (nodes) and their relationships to each other. A git commit history is also a graph of commits with parent/child relationships. Throw in relationships between files and commits and relationships between files and code quality metrics and we have the model I’m looking for.

Also, Neo4j has pretty good support with Spring Data, which is used in the coderadar project. In addition, the learning curve is not as steep as I initially thought, having only worked with relational databases before. The Getting Started Guide is quite helpful and I was able to learn the basics of Neo4j within just one afternoon.

The Graph

It turns out that I found modelling a graph database is much more fun than modelling a relational database, since you just draw nodes and edges and you have a model which can then be easily transferred into code using Spring Data Neo4j and Neo4j’s Object Graph Mapper (OGM). The following Graph shows the model I came up with after some drawing with pen & paper.

![Coderadar Graph]({{ base }}/assets/img/posts/coderadar-graph.png)

Commit

A commit node represents a commit in a git history. Every commit is a CHILD OF one or more other commits (except the first commit, which has no parent) and TOUCHES one or more files and file snapshots. A commit has a timestamp and a sha1-hash which serves as identifier.

File

A file node represents a file during all of its life within the git repository. A file comes into existence when it is ADDED in a commit and can be MODIFIED or RENAMED over several following commits and finally it can be DELETED in a final commit. Thus, each file node is connected to one or more commit nodes via a relationship that specified the type of change the file experienced within that commit. If a file is not modified within a certain commit, there will be no relationship between the two nodes.

The other way around, a commit TOUCHES certain files. This relationship is optional from a model point of view, since we already have the relationship in the other direction. However, bi-directional relationships make working with the Object Graph Mapper easier in some cases. For example, if you want to store a node, OGM automatically also stores the nodes that are connected to that node by outgoing relationships. I can now simply store a Commit node and all TOUCHED file nodes will be saved, too, all within a single call to the database.

A file node has a single fileId as attribute, which just serves as an identifier for the file.

FileSnapshot

While a file node represents a file all over it’s lifetime, a file snapshot node represents a file at the point in time of a certain commit only. This is necessary, since a file can be renamed during it’s lifetime, and we need some way to identify a file by it’s name. So, a file snaphot node has a path attribute that contains the file’s path at the point in time of a certain commit.

Metric

A metric node represents a certain code quality metric (like cyclomatic complexity). It has a metricId attribute which serves as identifier for the metrics. A metric node can have relationships with multiple file snapshot nodes, which represent that the file snapshot has been MEASURED with this metric. This relationship has the attribute value which specified the value of the metric in the file snapshot.

Querying the Graph

So, what’s the big thing? Couldn’t we have done the same with a relational database. Admittedly, you can model database tables very similar to the graph in the image above. However, note that we only have to to connect commit nodes with files and file snapshots of files that were touched in that commit and not with ALL files that exist at the time of the commit, thus saving a quadratic amount of storage space.

Also, querying the graph is much easier with Neo4j’s query language Cypher than it is with SQL (or HQL for that matter). Take this query, which recursively loads all files that were TOUCHED in any commit previous to a specified commit but have not yet been DELETED.

MATCH 
  (child:Commit {name:<COMMIT_NAME>})-[:IS_CHILD_OF*]->(parent:Commit),
  (parent)-[:TOUCHED]->(file:File)
WHERE NOT 
  ((file)-[:DELETED_IN_COMMIT]->())
RETURN 
  file

Try to find an equivalent in SQL for this query that does not rely on having a fully filled join table between commits and files! I guess you could create such a query if the database supports hierarchical queries, but those queries would be much harder to create and much less readable.

Wrap-Up

My first experience with Neo4j has been quite refreshing. Knowing a lot about JPA and relational databases I had to open up to the graph concept but I was quickly convinced of the expressive nature of a graph database and the elegance in which I could create and query a graph database for my use case.

I’m going to continue building a graph database for coderadar to evaluate performance and maintainability and may report the current state again in a later post. If you want to have a first-hand look at code, you may want to look at the coderadar-graph module of coderadar. It contains some unit tests that show how to work with Spring Data together with Neo4j.

Written By:

Tom Hombergs

Written By:

Tom Hombergs

As a professional software engineer, consultant, architect, general problem solver, I've been practicing the software craft for more than fifteen years and I'm still learning something new every day. I love sharing the things I learned, so you (and future me) can get a head start. That's why I founded reflectoring.io.

Recent Posts

Inheritance, Polymorphism, and Encapsulation in Kotlin

In the realm of object-oriented programming (OOP), Kotlin stands out as an expressive language that seamlessly integrates modern features with a concise syntax.

Read more

Publisher-Subscriber Pattern Using AWS SNS and SQS in Spring Boot

In an event-driven architecture where multiple microservices need to communicate with each other, the publisher-subscriber pattern provides an asynchronous communication model to achieve this.

Read more

Optimizing Node.js Application Performance with Caching

Endpoints or APIs that perform complex computations and handle large amounts of data face several performance and responsiveness challenges. This occurs because each request initiates a computation or data retrieval process from scratch, which can take time.

Read more