Graph processing: a problem with no clear victor


We all depend on the Internet to search for potential solutions to technical problems. For big data problems, for example, five minutes on Google will tell you that Spark may help. Even if you have no idea what Spark is, you will come across the name. Something similar happens with TensorFlow when you search for deep learning solutions, Kubernetes for cloud, Docker for containers… It seems there is always one platform, framework, or library for every buzzword in computer science. However, try looking for a graph processing solution. You will find that there is no clear victor. And I find this quite surprising.

In 2015, my colleagues at Inria and I published an article proposing a middleware that could inspire developers to offer a generic framework for implementing distributed graph processing solutions. We did it because we had a strong feeling that there was no consistent proposal to accelerate the development of massive graph processing solutions. This is surprising if we consider that the Graph500 benchmark includes some compute-intensive graph problems. The explosion of social networks after the birth of Facebook and Twitter captured the attention of the research community and put new computation and scalability problems on the table. Additionally, a vast number of problems use graphs as their underlying data structure: fraud detection, game theory, and many other data-related problems.

There is a massive ecosystem of graph libraries. To mention just a few that I have experience with: graph-tool, SNAP, igraph, NetworkX, BGL, network for R, etc. Then there are modules or extensions for more popular platforms, such as GraphX for Spark and Giraph for Hadoop, as well as Neo4j, which is really a database rather than a library.

Then we jump into a more obscure world of solutions that claim to be the most efficient in some respect. A ton of work has been carried out by research facilities and universities. However, I would not advise any developer to adopt any of these solutions. Why? Because they are hardly maintained, poorly documented, and will eventually vanish. There are some notable examples, such as Google’s Pregel, which inspired Apache Giraph and was never made publicly available (correct me if I’m wrong). GraphLab’s code was available and slowly vanished until it became part of Turi.

If you are a developer and you need to include some graph processing, consider these points:

  • The programming language you want to use may reduce your options.
  • Are you looking for a specific algorithm, or do you want to explore and/or implement your own solution? The first case simplifies the search somewhat, because you can look for the best-performing implementation of that algorithm. If you want to do your own exploration, look at the next point.
  • How large is your graph? This is extremely important. Most of the libraries I cited above will struggle with millions of vertices and edges. It is even worse if your graph does not fit into memory.
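
To make the last point concrete, here is a rough back-of-the-envelope sketch for checking whether a graph is likely to fit into memory. It assumes a compressed sparse row (CSR) layout with fixed-width integer indices, which is my assumption and not what any particular library above uses. The function names and default sizes are hypothetical, and real libraries usually need more memory than this lower bound.

```python
def estimate_csr_bytes(num_vertices, num_edges, index_bytes=8,
                       weight_bytes=0, directed=True):
    """Rough lower bound, in bytes, for a graph stored in CSR form.

    Assumption: one offsets array with num_vertices + 1 entries and one
    targets array with one entry per stored edge. Vertex and edge
    properties other than an optional fixed-size weight are ignored.
    """
    # An undirected edge is stored twice, once in each direction.
    stored_edges = num_edges if directed else 2 * num_edges
    offsets = (num_vertices + 1) * index_bytes
    targets = stored_edges * index_bytes
    weights = stored_edges * weight_bytes
    return offsets + targets + weights


def fits_in_memory(num_vertices, num_edges, available_bytes, **layout):
    """Return True if the CSR estimate is within the given memory budget."""
    return estimate_csr_bytes(num_vertices, num_edges, **layout) <= available_bytes


if __name__ == "__main__":
    # Hypothetical graph: 1 billion vertices, 10 billion directed edges.
    needed = estimate_csr_bytes(10**9, 10**10)
    print(f"at least {needed / 1e9:.0f} GB")
    print("fits in 64 GB:", fits_in_memory(10**9, 10**10, 64 * 10**9))
```

If the estimate is already close to your available memory, an in-memory library is probably the wrong starting point.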

Unfortunately, at the time of writing this post we do not have a clear victor in the world of graph processing solutions. In fact, most solutions used in companies are tailor-made from scratch. Developers code the same algorithms again and again, adding to the mayhem of existing implementations. I am strongly convinced that there is room (and a need) for a new paradigm that helps developers model distributed graph algorithms, simplifying development and maintenance while increasing performance. Something similar to what TensorFlow did for neural networks by abstracting common pieces and concentrating the community's efforts on a common solution.
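
To illustrate what "abstracting common pieces" could look like, here is a toy sketch of the vertex-centric style that Pregel popularized: the runtime owns the loop and the message passing, and the developer only writes a small per-vertex function. This is a single-process illustration under my own assumptions, not the API of Pregel, Giraph, or any other system. The names `run_supersteps` and `min_label` are hypothetical, and the adjacency dictionary is assumed to be symmetric (undirected) with every vertex present as a key.

```python
def run_supersteps(adjacency, init, compute, max_supersteps=100):
    """Run a per-vertex compute function in synchronous rounds.

    adjacency: dict mapping each vertex to a list of its neighbors.
    init:      function giving the initial value of a vertex.
    compute:   function (current_value, messages) -> new_value.
    """
    values = {v: init(v) for v in adjacency}

    # Round 0: every vertex announces its initial value to its neighbors.
    inbox = {v: [] for v in adjacency}
    for v, neighbors in adjacency.items():
        for n in neighbors:
            inbox[n].append(values[v])

    for _ in range(max_supersteps):
        next_inbox = {v: [] for v in adjacency}
        any_change = False
        for v, messages in inbox.items():
            if not messages:
                continue  # a vertex without messages stays idle this round
            new_value = compute(values[v], messages)
            if new_value != values[v]:
                values[v] = new_value
                any_change = True
                # Only vertices whose value changed notify their neighbors.
                for n in adjacency[v]:
                    next_inbox[n].append(new_value)
        if not any_change:
            break  # nothing changed, so the computation has converged
        inbox = next_inbox
    return values


def min_label(current, messages):
    """Connected components: keep the smallest label seen so far."""
    return min(current, min(messages))


# Hypothetical undirected graph with two components: {0, 1, 2} and {3, 4}.
graph = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
components = run_supersteps(graph, init=lambda v: v, compute=min_label)
print(components)
```

The algorithm-specific part is the two-line `min_label` function. Everything else is the kind of shared plumbing that, in my opinion, developers should not have to rewrite for every project.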

I would like to hear your comments and experiences.
