Everyone is Connected

Measuring Social Geometry with Graph Databases

Social Geometry is a system of theory developed by Dr. Black that measures and accounts for different social behaviors. He suggests directly-measurable components of location, distance, and direction. Some of these can be measured or estimated directly: on wealth, demographic, age, language ability, or any other form of social indicator for which data is available. In many cases, though, these data aren’t directly available, and must be approximated through other means.

What these data have in common is a focus on relationships between things, rather than focusing on the things themselves. The value of money is relative to who else has what amounts of it; as are the value of vocabulary, and the value of a piece of real estate. These relationships can be easy to measure, though difficult to sort through.

Defining Graphs

There’s a mathematical model that fits these kind of data very well. Graphs are representations of things (called “nodes”) and the relationships between them (called “edges”). Some graphs are directed, meaning each edge between nodes represents only a one-way relationship. Some graphs are undirected, meaning that each edge refers to the same relationship between A and B as exists between B and A.

A drawing of a graph, source: Wikipedia

What does social media have to do with it?

Social media is full of graphs! When you friend someone on Facebook, that adds an edge to Facebook’s graph connecting your node to your friend’s node. That’s an undirected graph, as both parties have to agree to be friends before a friendship is made. For Twitter, the graph is directed — you can follow someone without them having to agree to follow you back.

Other data sets that work well with graph databases include:

  • Social Networks: Any site that connects users implements some kind of graph model.
  • Software Version Control: Git stores every software change as a node in a graph, with edges connecting each change to the change it built upon.
  • Geographic route planning: Streets and the intersections between them are a classic graph data set example.
  • Internet routing: Finding the shortest path between computers on the Internet.
  • Wikipedia article links: Another classic data set used to test graph systems.

Knowing there are tools and data available to work with graph-style information, using these tools to measure Social Geometry becomes a question of what measurements are available to work with. Some social geometry measurements of note include:

  • Frequency of interactions: How regularly are two people communicating?
  • Shared symbolic expressions: How are memes propagating through the social network?
  • Distribution of Resources: Who has more or less of any given scarce resource?
  • Organization: How densely or loosely connected are the people in a social network?

How you can use graph databases

In Owning Conversation, the central questions are “What does conversation own?” and, “Who or what, if anything or anyone, owns conversation?” To perform this analysis, I developed a model called Sociolinguistics Space that allows calibrated sociolinguistic measurement between different individuals. The distance between any two words in a person’s vocabulary can be expressed as:

“Multidimensional Pythagorean Distance Derivation”
(source: “Owning Conversation”, Stayskal 2009, p. 109)

Once these words are measured with a single person, they can be compared between people to answer the shared symbolic expressions question above: how do words and other memes propagate within a social network, and how do those meanings change between individuals?

Second, we can measure clustering coefficients of a network. These measurements tell us how dense or sparse a network is in connections within itself. For an entire graph, the clustering coefficient can be expressed as:

Clustering Coefficient Formula: Wikipedia

Finally, we can traverse graphs, whether they’re directed or undirected using Dijkstra’s Algorithm. When your GPS is telling you how to get to a store from your house, this algorithm (or something similar) is generally what it’s using. It tells a program what the shortest path is between nodes that are connected by edges in a graph, and can also be used to measure the asymmetry of needs category of data. Whether we’re talking about cars moving through a city, dollars moving through a corporate-political system, or data packets moving through the network, the math behind it all is remarkably similar.

When dealing with graph data, however, traditional relational databases fall short. Relational databases such as MySQL, Oragle, and PostgreSQL concentrate on working with the data themselves before focusing on the relationships between these data. What’s needed to work with graph data effectively, then, is a new kind of database: the graph database.

Many large web applications rely on graphs to store relationships between data. Facebook even makes their graph data available to developers, through their graph.facebook.com API. Twitter uses graphs to store who follows who, LinkedIn uses graphs to let you know how you know other professionals and what abilities you might have in common with them. These kinds of data can be stored relationally, though as we’ll see shortly, graph databases trump relational databases in speed and versatility.

Graph databases such as Neo4J are robust and work with a wide variety of systems. They’re seldom used as the only database in a systems architecture, but are well-adapted to integrate with other data systems that might be struggling with large numbers of complicated relationships between other data points.

When building out sample graph systems in both Neo4J and MySQL, we can test the effectiveness of these two data management strategies. Whereas MySQL uses a variant of ANSI 97 standard SQL syntax, Neo4J uses a language called Cipher to query data that’s been stored. When calculating a clustering coefficients for a network through Cipher, the query is:

START a = node(*)
MATCH (a)--(b)
WITH a, count(distinct b) as n
MATCH (a)--()-[r]-()--(a)
WHERE a.name! = "startnode"
RETURN n, count(distinct r) as r

Researchers regularly run tests between these two common database platforms that we’ve verified through our own systems. For a sample graph of 200,000 nodes, the clustering coefficient on a standard laptop took around 3 hours of CPU time. In Neo4J, the same calculation took 700 milliseconds.

When your systems architecture requires calculations like this on graph data, social or otherwise, graph databases like Neo4J are great architectural components to consider.

Written by Danne Stayskal on 2013-05-03