HashGNN (#GNN) is a node embedding technique that was recently implemented in the Neo4j Graph Data Science (GDS) library. It considers node-local properties and employs concepts from Message Passing Neural Networks (MPNNs) to capture higher-order proximity. It is significantly faster to compute than traditional Graph Neural Networks because it relies on an approximation technique called MinHashing. As a hash-based approach, it therefore introduces a trade-off between efficiency and accuracy. In this article, we will unpack what all of that means and explore how the algorithm works using a small example.
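To give a first impression of how this looks in practice, here is a minimal sketch that streams HashGNN embeddings from Scala via the Neo4j Java driver. The connection details, the projected graph name 'papers', and all parameter values are made up for illustration, and the exact procedure name and options depend on your GDS version (older releases expose it as gds.beta.hashgnn).

```scala
import org.neo4j.driver.{AuthTokens, GraphDatabase}

object HashGnnSketch extends App {
  // Hypothetical connection details and projected graph name ("papers").
  val driver  = GraphDatabase.driver("bolt://localhost:7687", AuthTokens.basic("neo4j", "secret"))
  val session = driver.session()

  // Stream HashGNN embeddings for every node of the projected graph.
  val cypher =
    """CALL gds.hashgnn.stream('papers', {
      |  iterations: 2,                                     // message-passing rounds
      |  embeddingDensity: 4,                               // MinHash samples per iteration
      |  generateFeatures: {dimension: 8, densityLevel: 2}  // random binary input features
      |})
      |YIELD nodeId, embedding
      |RETURN nodeId, embedding""".stripMargin

  session.run(cypher).forEachRemaining { record =>
    println(s"${record.get("nodeId")} -> ${record.get("embedding")}")
  }

  session.close()
  driver.close()
}
```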
The Neo4j GDS Machine Learning pipelines are a convenient way to execute complex machine learning workflows directly in the Neo4j infrastructure, which saves a lot of the effort of managing external infrastructure and dependencies. In this post we will explore a common Graph Machine Learning task: Link Prediction. We will use a popular data set (the Citation Network) to prepare the input data and configure a pipeline step by step. I will elaborate on the main concepts and pitfalls of working with such a pipeline, so that we understand what is happening under the hood. Finally, I'll share some thoughts on current limitations.
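To give a flavour of what configuring such a pipeline involves, here is a heavily condensed sketch of the main Cypher calls, again issued from Scala through the Neo4j Java driver. The pipeline, graph, model and relationship names are placeholders, and procedure names and required parameters vary between GDS versions.

```scala
import org.neo4j.driver.{AuthTokens, GraphDatabase}

object LinkPredictionPipelineSketch extends App {
  // Hypothetical connection; pipeline, graph and model names are placeholders.
  val driver  = GraphDatabase.driver("bolt://localhost:7687", AuthTokens.basic("neo4j", "secret"))
  val session = driver.session()

  val steps = Seq(
    // 1. Create an empty link prediction pipeline.
    "CALL gds.beta.pipeline.linkPrediction.create('citation-pipe')",
    // 2. Add a node property step that computes FastRP embeddings.
    """CALL gds.beta.pipeline.linkPrediction.addNodeProperty('citation-pipe', 'fastRP', {
      |  mutateProperty: 'embedding', embeddingDimension: 64
      |})""".stripMargin,
    // 3. Combine node properties into link features.
    """CALL gds.beta.pipeline.linkPrediction.addFeature('citation-pipe', 'hadamard', {
      |  nodeProperties: ['embedding']
      |})""".stripMargin,
    // 4. Add a model candidate and train on the projected graph.
    "CALL gds.beta.pipeline.linkPrediction.addLogisticRegression('citation-pipe')",
    """CALL gds.beta.pipeline.linkPrediction.train('citationGraph', {
      |  pipeline: 'citation-pipe',
      |  modelName: 'cites-model',
      |  targetRelationshipType: 'CITES'
      |}) YIELD modelInfo""".stripMargin
  )

  steps.foreach(q => session.run(q))

  session.close()
  driver.close()
}
```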
We created a recommendation engine to predict the area of a task in DayCaptain. We followed a simple approach, and it works astonishingly well - even compared to complex approaches like deep neural networks. The solution can be implemented in a handful of lines of Cypher and runs directly in Neo4j - no overhead for managing additional systems. In this article, I describe the simple math behind it and how it works.
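As a rough illustration of the general idea (not the exact query from the article), a simple count-based suggestion of the most likely area for a new task could look like the following. The schema - Task and Area nodes, a BELONGS_TO relationship, and the title and name properties - is an assumption for this sketch.

```scala
import org.neo4j.driver.{AuthTokens, GraphDatabase, Values}

object AreaRecommendationSketch extends App {
  // Hypothetical schema: (:Task {title})-[:BELONGS_TO]->(:Area {name}).
  val driver  = GraphDatabase.driver("bolt://localhost:7687", AuthTokens.basic("neo4j", "secret"))
  val session = driver.session()

  // Score areas by how often tasks with similar words ended up in them,
  // and suggest the area with the highest count.
  val cypher =
    """WITH split(toLower($newTaskTitle), ' ') AS words
      |MATCH (t:Task)-[:BELONGS_TO]->(a:Area)
      |WHERE any(w IN words WHERE toLower(t.title) CONTAINS w)
      |RETURN a.name AS area, count(t) AS score
      |ORDER BY score DESC
      |LIMIT 1""".stripMargin

  val result = session.run(cypher, Values.parameters("newTaskTitle", "prepare quarterly report"))
  result.forEachRemaining(r => println(s"${r.get("area").asString()} (score ${r.get("score").asLong()})"))

  session.close()
  driver.close()
}
```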
Today, new technologies arise and new things become possible. People are talking about stream processing and real-time data processing, and everyone wants to adopt new technologies to invest in the future. While I personally think this is reasonable, I am also convinced that one first has to understand these technologies and what they were intended to be used for.
Apache Spark's high-level API, SparkSQL, offers a concise and very expressive way to execute structured queries on distributed data. Even though it builds on top of the Spark core API, it is often not clear how the two are related. In this post I will try to draw the bigger picture and illustrate how these pieces fit together.
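To make the relationship concrete, here is a small, self-contained sketch that expresses the same aggregation once through the SparkSQL/DataFrame API and once through the core RDD API; the data is made up. The explain() call hints at how the declarative query is ultimately planned down to the same low-level operations.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlVsRdd extends App {
  val spark = SparkSession.builder().appName("sparksql-vs-rdd").master("local[*]").getOrCreate()
  import spark.implicits._

  val orders = Seq(("alice", 10.0), ("bob", 5.0), ("alice", 7.5)).toDF("customer", "amount")

  // Declarative: SparkSQL/DataFrame API - the Catalyst optimizer plans the execution.
  val viaSql = orders.groupBy("customer").sum("amount")
  viaSql.show()
  viaSql.explain() // the physical plan reveals the underlying exchange (shuffle)

  // Imperative: the same aggregation expressed directly on the core RDD API.
  val viaRdd = orders.rdd
    .map(row => (row.getString(0), row.getDouble(1)))
    .reduceByKey(_ + _)
  viaRdd.collect().foreach(println)

  spark.stop()
}
```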
As described in Understanding Spark Shuffle, there are currently three shuffle implementations in Spark, each of which implements the interface ShuffleWriter. The goal of a shuffle writer is to take an iterator of records and write them to a partitioned file on disk - the so-called map output file.
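The following is a toy illustration of that contract, not Spark's actual ShuffleWriter implementation: records arrive as an iterator of key-value pairs, each record is assigned to a reduce partition, and everything is written to a single file laid out partition by partition.

```scala
import java.io.{File, PrintWriter}
import scala.collection.mutable.ArrayBuffer

// Toy sketch of the shuffle-writer idea: bucket records by their target
// reduce partition and write one file segment per partition.
object ToyShuffleWriter {

  def write[K, V](records: Iterator[(K, V)], numPartitions: Int, output: File): Unit = {
    // Bucket records by their target reduce partition (hash partitioning).
    val buckets = Array.fill(numPartitions)(ArrayBuffer.empty[(K, V)])
    records.foreach { case (k, v) =>
      val partition = math.abs(k.hashCode % numPartitions)
      buckets(partition) += ((k, v))
    }

    // Write the buckets consecutively - this mimics the "map output file"
    // from which each reducer later fetches exactly its own segment.
    val writer = new PrintWriter(output)
    try {
      buckets.zipWithIndex.foreach { case (bucket, partition) =>
        bucket.foreach { case (k, v) => writer.println(s"$partition\t$k\t$v") }
      }
    } finally writer.close()
  }
}

// Example usage:
// ToyShuffleWriter.write(Iterator("a" -> 1, "b" -> 2, "c" -> 3), numPartitions = 2, new File("map-output.txt"))
```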
This article is dedicated to one of the most fundamental processes in Spark - the shuffle. To understand what a shuffle actually is and when it occurs, we will first look at the Spark execution model from a high level. Next, we will go on a journey inside the Spark core and explore how the core components work together to execute shuffles. Finally, we will elaborate on why shuffles are crucial for the performance of a Spark application.
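As a small teaser, the self-contained snippet below shows a wide transformation (reduceByKey) that forces a shuffle, and how the resulting stage boundary becomes visible in the RDD's lineage via toDebugString; the example data is made up.

```scala
import org.apache.spark.sql.SparkSession

object ShuffleTeaser extends App {
  val spark = SparkSession.builder().appName("shuffle-teaser").master("local[*]").getOrCreate()
  val sc    = spark.sparkContext

  val words = sc.parallelize(Seq("spark", "shuffle", "spark", "stage", "shuffle", "spark"), numSlices = 3)

  // Narrow transformation: each output partition depends on exactly one input partition.
  val pairs = words.map(word => (word, 1))

  // Wide transformation: values must be grouped by key across partitions - this triggers
  // a shuffle and splits the job into two stages (map side and reduce side).
  val counts = pairs.reduceByKey(_ + _)

  counts.collect().foreach(println)
  // The ShuffledRDD and the stage boundary show up in the lineage:
  println(counts.toDebugString)

  spark.stop()
}
```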