Illustration of probabilities as areas

Building a simple recommendation engine in Neo4j

Published on April 8, 2022

We created a recommendation engine to predict the area for a task in DayCaptain. We followed a simple approach, and it works astonishingly well - even compared to complex approaches like deep neural networks. The solution can be implemented in a handful of lines of Cypher and runs directly in Neo4j - no overhead for managing additional systems.

In this article, I describe the simple maths behind it and how it works.
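The core idea - treating conditional probabilities as the share of "area" each outcome occupies - can be sketched outside the database as well. The following is a minimal sketch, not the article's actual Cypher implementation; the task keywords and area names are made-up example data:

```python
from collections import Counter, defaultdict

# Hypothetical history of (task keyword, assigned area) pairs.
history = [
    ("meeting", "Work"), ("meeting", "Work"), ("gym", "Health"),
    ("meeting", "Private"), ("gym", "Health"), ("invoice", "Work"),
]

# Count how often each keyword co-occurs with each area.
counts = defaultdict(Counter)
for keyword, area in history:
    counts[keyword][area] += 1

def predict_area(keyword):
    """Return the area with the highest relative frequency for the keyword,
    i.e. the largest slice of P(area | keyword)."""
    area_counts = counts[keyword]
    total = sum(area_counts.values())
    return max(area_counts, key=lambda a: area_counts[a] / total)

print(predict_area("meeting"))  # prints "Work" (2 of 3 occurrences)
```

In Neo4j, the same counting reduces to matching task-area relationships and aggregating with `count(*)`, which is why the whole thing fits in a few lines of Cypher.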

Sampling the context of a node in a graph

Graph Embeddings: How nodes get mapped to vectors

Published on February 18, 2022

Understand how node2vec maps graph-structured data to numerical vectors - the key to unlock the powerful toolbox of traditional machine learning algorithms for graphs.
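The "context" that node2vec feeds into a word2vec-style model comes from random walks over the graph. A minimal sketch of such walk sampling, using a uniform first-order walk (node2vec itself biases the walk with return and in-out parameters, which this sketch omits) and a made-up toy graph:

```python
import random

# Tiny example graph as an adjacency list (hypothetical data).
graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}

def random_walk(graph, start, length, seed=None):
    """Sample a fixed-length walk from `start`; the visited nodes form the
    node's context, analogous to a sentence in word2vec."""
    rng = random.Random(seed)
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

walks = [random_walk(graph, node, 5, seed=i) for i, node in enumerate(graph)]
# These walks would then be fed to a skip-gram model to learn the embeddings.
```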

Storage model of a graph database

What's the hype about graph databases? A technical interpretation

Published on November 11, 2021

One question that I have come across many times recently, and which I have given some careful thought, is: What's the deal with graph databases, and how are they different?

Comparison of batch and stream processing of event streams

What is Stream Processing?

Published on July 5, 2019

Today, new technologies arise and new things become possible. People are talking about stream processing and real-time data processing, and everyone wants to adopt new technologies to invest in the future. Even though I personally think this is a reasonable idea, I am also convinced that one first has to understand these technologies and what they were intended to be used for.

Illustration how a query is transformed into a physical execution plan

The bigger picture: How SparkSQL relates to Spark core (RDD API)

Published on January 21, 2019

Apache Spark's high-level API SparkSQL offers a concise and very expressive way to execute structured queries on distributed data. Even though it builds on top of the Spark core API, it is often not clear how the two are related. In this post, I will draw the bigger picture and illustrate how they fit together.

Table showing available data models

Choosing the right data model

Published on January 10, 2019

One essential choice to make when building data-driven applications is the technology of the underlying data persistence layer. Let's explore the criteria for choosing among the available data models.

Illustration of the Spark merge phase

Understanding Apache Spark Hash-Shuffle

Published on October 25, 2018

As described in Understanding Spark Shuffle, there are currently three shuffle implementations in Spark, each of which implements the interface ShuffleWriter. The goal of a shuffle writer is to take an iterator of records and write them to a partitioned file on disk - the so-called map output file.
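Stripped of Spark's machinery, the essence of a hash-based shuffle write is routing each record to an output file chosen by hashing its key. A minimal sketch (not Spark's actual implementation, which writes a single indexed map output file and is in Scala; file names and data are made up):

```python
import os
import tempfile

def hash_shuffle_write(records, num_partitions, out_dir):
    """Write each (key, value) record to the file of the partition its
    key hashes to - a simplified picture of a hash shuffle writer."""
    files = [open(os.path.join(out_dir, f"part-{p}"), "w")
             for p in range(num_partitions)]
    try:
        for key, value in records:
            p = hash(key) % num_partitions  # partition chosen by key hash
            files[p].write(f"{key}\t{value}\n")
    finally:
        for f in files:
            f.close()

out_dir = tempfile.mkdtemp()
hash_shuffle_write([("a", 1), ("b", 2), ("a", 3)], 2, out_dir)
# All records with the same key end up in the same partition file,
# so a downstream reducer can fetch exactly one file per partition.
```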

Illustration of Apache Spark physical plan stages

Understanding Apache Spark Shuffle

Published on September 10, 2018

This article is dedicated to one of the most fundamental processes in Spark - the shuffle. To understand what a shuffle actually is and when it occurs, we will first look at the Spark execution model from a higher level. Next, we will go on a journey inside the Spark core and explore how the core components work together to execute shuffles. Finally, we will elaborate on why shuffles are crucial for the performance of a Spark application.