HashGNN (#GNN) is a node embedding technique that was recently implemented in the Neo4j Graph Data Science (GDS) library. It considers node-local properties and employs concepts from Message Passing Neural Networks (MPNNs) to capture higher-order proximity. It is significantly faster to compute than traditional Graph Neural Networks because it relies on an approximation technique called MinHashing. As a hash-based approach, it therefore introduces a trade-off between efficiency and accuracy. In this article, we will unpack what all of that means and explore how the algorithm works using a small example.
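To give a first impression of how this looks in practice, here is a minimal sketch that streams HashGNN embeddings from Scala via the Neo4j Java driver. The connection details, the projected graph name 'papers', and all parameter values are made up for illustration, and the exact procedure name and options depend on your GDS version (older releases expose it as gds.beta.hashgnn).

```scala
import org.neo4j.driver.{AuthTokens, GraphDatabase}

object HashGnnSketch extends App {
  // Hypothetical connection details and projected graph name ("papers").
  val driver  = GraphDatabase.driver("bolt://localhost:7687", AuthTokens.basic("neo4j", "secret"))
  val session = driver.session()

  // Stream HashGNN embeddings for every node of the projected graph.
  val cypher =
    """CALL gds.hashgnn.stream('papers', {
      |  iterations: 2,                                     // message-passing rounds
      |  embeddingDensity: 4,                               // MinHash samples per iteration
      |  generateFeatures: {dimension: 8, densityLevel: 2}  // random binary input features
      |})
      |YIELD nodeId, embedding
      |RETURN nodeId, embedding""".stripMargin

  session.run(cypher).forEachRemaining { record =>
    println(s"${record.get("nodeId")} -> ${record.get("embedding")}")
  }

  session.close()
  driver.close()
}
```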
The Neo4j GDS Machine Learning pipelines are a convenient way to execute complex machine learning workflows directly in the Neo4j infrastructure, which saves a lot of the effort of managing external infrastructure and dependencies. In this post we will explore a common Graph Machine Learning task: Link Prediction. We will use a popular data set (the Citation Network) to prepare the input data and configure a pipeline step by step. I will elaborate on the main concepts and pitfalls of working with such a pipeline, so that we understand what is happening under the hood. Finally, I'll share some thoughts on current limitations.
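To give a flavour of what configuring such a pipeline involves, here is a heavily condensed sketch of the main Cypher calls, again issued from Scala through the Neo4j Java driver. The pipeline, graph, model and relationship names are placeholders, and procedure names and required parameters vary between GDS versions.

```scala
import org.neo4j.driver.{AuthTokens, GraphDatabase}

object LinkPredictionPipelineSketch extends App {
  // Hypothetical connection; pipeline, graph and model names are placeholders.
  val driver  = GraphDatabase.driver("bolt://localhost:7687", AuthTokens.basic("neo4j", "secret"))
  val session = driver.session()

  val steps = Seq(
    // 1. Create an empty link prediction pipeline.
    "CALL gds.beta.pipeline.linkPrediction.create('citation-pipe')",
    // 2. Add a node property step that computes FastRP embeddings.
    """CALL gds.beta.pipeline.linkPrediction.addNodeProperty('citation-pipe', 'fastRP', {
      |  mutateProperty: 'embedding', embeddingDimension: 64
      |})""".stripMargin,
    // 3. Combine node properties into link features.
    """CALL gds.beta.pipeline.linkPrediction.addFeature('citation-pipe', 'hadamard', {
      |  nodeProperties: ['embedding']
      |})""".stripMargin,
    // 4. Add a model candidate and train on the projected graph.
    "CALL gds.beta.pipeline.linkPrediction.addLogisticRegression('citation-pipe')",
    """CALL gds.beta.pipeline.linkPrediction.train('citationGraph', {
      |  pipeline: 'citation-pipe',
      |  modelName: 'cites-model',
      |  targetRelationshipType: 'CITES'
      |}) YIELD modelInfo""".stripMargin
  )

  steps.foreach(q => session.run(q))

  session.close()
  driver.close()
}
```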
We created a recommendation engine to predict the area of a task in DayCaptain. We followed a simple approach, and it works astonishingly well - even compared to complex approaches like deep neural networks. The solution can be implemented in a handful of lines of Cypher and runs directly in Neo4j - no overhead for managing additional systems. In this article, I describe the simple math behind it and how it works.
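As a rough illustration of the general idea (not the exact query from the article), a simple count-based suggestion of the most likely area for a new task could look like the following. The schema - Task and Area nodes, a BELONGS_TO relationship, and the title and name properties - is an assumption for this sketch.

```scala
import org.neo4j.driver.{AuthTokens, GraphDatabase, Values}

object AreaRecommendationSketch extends App {
  // Hypothetical schema: (:Task {title})-[:BELONGS_TO]->(:Area {name}).
  val driver  = GraphDatabase.driver("bolt://localhost:7687", AuthTokens.basic("neo4j", "secret"))
  val session = driver.session()

  // Score areas by how often tasks with similar words ended up in them,
  // and suggest the area with the highest count.
  val cypher =
    """WITH split(toLower($newTaskTitle), ' ') AS words
      |MATCH (t:Task)-[:BELONGS_TO]->(a:Area)
      |WHERE any(w IN words WHERE toLower(t.title) CONTAINS w)
      |RETURN a.name AS area, count(t) AS score
      |ORDER BY score DESC
      |LIMIT 1""".stripMargin

  val result = session.run(cypher, Values.parameters("newTaskTitle", "prepare quarterly report"))
  result.forEachRemaining(r => println(s"${r.get("area").asString()} (score ${r.get("score").asLong()})"))

  session.close()
  driver.close()
}
```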
Today, new technologies arise and new things become possible. People are talking about stream processing and real-time data processing, and everyone wants to adopt new technologies to invest in the future. While I personally think this is reasonable, I am also convinced that one first has to understand these technologies and what they were intended to be used for.
Apache Spark's high-level API, SparkSQL, offers a concise and very expressive way to execute structured queries on distributed data. Even though it builds on top of the Spark core API, it is often not clear how the two are related. In this post I will try to draw the bigger picture and illustrate how these pieces fit together.
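To make the relationship concrete, here is a small, self-contained sketch that expresses the same aggregation once through the SparkSQL/DataFrame API and once through the core RDD API; the data is made up. The explain() call hints at how the declarative query is ultimately planned down to the same low-level operations.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlVsRdd extends App {
  val spark = SparkSession.builder().appName("sparksql-vs-rdd").master("local[*]").getOrCreate()
  import spark.implicits._

  val orders = Seq(("alice", 10.0), ("bob", 5.0), ("alice", 7.5)).toDF("customer", "amount")

  // Declarative: SparkSQL/DataFrame API - the Catalyst optimizer plans the execution.
  val viaSql = orders.groupBy("customer").sum("amount")
  viaSql.show()
  viaSql.explain() // the physical plan reveals the underlying exchange (shuffle)

  // Imperative: the same aggregation expressed directly on the core RDD API.
  val viaRdd = orders.rdd
    .map(row => (row.getString(0), row.getDouble(1)))
    .reduceByKey(_ + _)
  viaRdd.collect().foreach(println)

  spark.stop()
}
```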
As described in Understanding Spark Shuffle, there are currently three shuffle implementations in Spark, each of which implements the interface ShuffleWriter. The goal of a shuffle writer is to take an iterator of records and write them to a partitioned file on disk - the so-called map output file.
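The following is a toy illustration of that contract, not Spark's actual ShuffleWriter implementation: records arrive as an iterator of key-value pairs, each record is assigned to a reduce partition, and everything is written to a single file laid out partition by partition.

```scala
import java.io.{File, PrintWriter}
import scala.collection.mutable.ArrayBuffer

// Toy sketch of the shuffle-writer idea: bucket records by their target
// reduce partition and write one file segment per partition.
object ToyShuffleWriter {

  def write[K, V](records: Iterator[(K, V)], numPartitions: Int, output: File): Unit = {
    // Bucket records by their target reduce partition (hash partitioning).
    val buckets = Array.fill(numPartitions)(ArrayBuffer.empty[(K, V)])
    records.foreach { case (k, v) =>
      val partition = math.abs(k.hashCode % numPartitions)
      buckets(partition) += ((k, v))
    }

    // Write the buckets consecutively - this mimics the "map output file"
    // from which each reducer later fetches exactly its own segment.
    val writer = new PrintWriter(output)
    try {
      buckets.zipWithIndex.foreach { case (bucket, partition) =>
        bucket.foreach { case (k, v) => writer.println(s"$partition\t$k\t$v") }
      }
    } finally writer.close()
  }
}

// Example usage:
// ToyShuffleWriter.write(Iterator("a" -> 1, "b" -> 2, "c" -> 3), numPartitions = 2, new File("map-output.txt"))
```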
This article is dedicated to one of the most fundamental processes in Spark - the shuffle. To understand what a shuffle actually is and when it occurs, we will first look at the Spark execution model from a high level. Next, we will go on a journey inside the Spark core and explore how the core components work together to execute shuffles. Finally, we will elaborate on why shuffles are crucial for the performance of a Spark application.
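As a small teaser, the self-contained snippet below shows a wide transformation (reduceByKey) that forces a shuffle, and how the resulting stage boundary becomes visible in the RDD's lineage via toDebugString; the example data is made up.

```scala
import org.apache.spark.sql.SparkSession

object ShuffleTeaser extends App {
  val spark = SparkSession.builder().appName("shuffle-teaser").master("local[*]").getOrCreate()
  val sc    = spark.sparkContext

  val words = sc.parallelize(Seq("spark", "shuffle", "spark", "stage", "shuffle", "spark"), numSlices = 3)

  // Narrow transformation: each output partition depends on exactly one input partition.
  val pairs = words.map(word => (word, 1))

  // Wide transformation: values must be grouped by key across partitions - this triggers
  // a shuffle and splits the job into two stages (map side and reduce side).
  val counts = pairs.reduceByKey(_ + _)

  counts.collect().foreach(println)
  // The ShuffledRDD and the stage boundary show up in the lineage:
  println(counts.toDebugString)

  spark.stop()
}
```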