## NetworkX Tutorial Summary¶

This is a summary of the NetworkX tutorial retrieved from the following website: Programming Historian Tutorial

```
!pip3 install networkx
```

```
'''
Before beginning this tutorial, you will need to download two files that together constitute our network dataset.
The file quakers_nodelist.csv is a list of early modern Quakers (nodes),
and the file quakers_edgelist.csv is a list of relationships between those Quakers (edges).
When you open the edge file, you will see that we use the names from the node file to identify the nodes connected by each edge.
These edges begin at a source node and end at a target node.
While this language derives from so-called directed network structures, we will be using our data as an undirected network: if Person A knows Person B, then Person B must also know Person A.
In directed networks, relationships need not be reciprocal (Person A can send a letter to B without getting one back), but in undirected networks the connections are always reciprocal, or symmetric.
'''
import csv
from operator import itemgetter

import networkx as nx
from networkx.algorithms import community  # this part of networkx, for community detection, needs to be imported separately

with open('quakers_nodelist.csv', 'r') as nodecsv:  # Open the file
    nodereader = csv.reader(nodecsv)  # Read the csv
    # Retrieve the data (using a Python list comprehension and list slicing to remove the header row, see footnote 3)
    nodes = [n for n in nodereader][1:]
node_names = [n[0] for n in nodes]  # Get a list of only the node names

with open('quakers_edgelist.csv', 'r') as edgecsv:  # Open the file
    edgereader = csv.reader(edgecsv)  # Read the csv
    edges = [tuple(e) for e in edgereader][1:]  # Retrieve the data

'''
[NOTE]
Format of data:
nodes as printed - [['Joseph Wyeth', 'religious writer', 'male', '1663', '1731', '10013191'], ...]
node_names = ['Joseph Wyeth', 'Alexander Skene of Newtyle', ...]
edges as printed - [('George Keith', 'Robert Barclay'), ...]
'''
print(len(node_names))  # should be 119 nodes
print(len(edges))  # should be 174 edges

G = nx.Graph()
G.add_nodes_from(node_names)  # importing nodes
G.add_edges_from(edges)  # importing edges
'''
This is one of several ways to add data to a network object.
See https://networkx.github.io/documentation/stable/tutorial.html#adding-attributes-to-graphs-nodes-and-edges
'''
print(G)  # the original tutorial uses print(nx.info(G)); nx.info() was removed in NetworkX 3.0
```

```
import matplotlib.pyplot as plt
plt.figure(figsize=(6,6),dpi=150)
nx.draw(G, with_labels=True,font_size=5)
```

### Adding Attributes¶

```
hist_sig_dict = {}
gender_dict = {}
birth_dict = {}
death_dict = {}
id_dict = {}
for node in nodes:  # Loop through the list, one row at a time
    hist_sig_dict[node[0]] = node[1]
    gender_dict[node[0]] = node[2]
    birth_dict[node[0]] = node[3]
    death_dict[node[0]] = node[4]
    id_dict[node[0]] = node[5]
nx.set_node_attributes(G, hist_sig_dict, 'historical_significance')
nx.set_node_attributes(G, gender_dict, 'gender')
nx.set_node_attributes(G, birth_dict, 'birth_year')
nx.set_node_attributes(G, death_dict, 'death_year')
nx.set_node_attributes(G, id_dict, 'sdfb_id')
for n in G.nodes():  # Loop through every node; in our data "n" will be the name of the person
    print(n, G.nodes[n]['birth_year'])  # Access every node by its name, and then by the attribute "birth_year"
```

### Density of network¶

```
density = nx.density(G)
print("Network density:", density)
```

- The output of density is a number, so that’s what you’ll see when you print the value.
- In this case, the density of our network is approximately 0.0248. On a scale of 0 to 1, that is not a very dense network, which comports with what you can see in the visualization.
- A 0 would mean that there are no connections at all, and a 1 would indicate that all possible edges are present (a perfectly connected network): this Quaker network is on the lower end of that scale, but still far from 0.
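
Density is the ratio of actual edges to possible edges, which for an undirected graph of n nodes is n(n-1)/2. A minimal sketch on a made-up toy graph (not the Quaker data) checks `nx.density` against that formula:

```python
import networkx as nx

# Toy graph: 4 nodes, 3 edges (a path a-b-c-d)
T = nx.Graph()
T.add_edges_from([("a", "b"), ("b", "c"), ("c", "d")])

n = T.number_of_nodes()
e = T.number_of_edges()
possible_edges = n * (n - 1) / 2           # 6 possible edges among 4 nodes
print(nx.density(T))                       # 3 actual / 6 possible = 0.5
print(e / possible_edges)                  # same value, computed by hand
```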

### Shortest path measurement¶

- A shortest path measurement is a bit more complex.
- It calculates the shortest possible series of nodes and edges that stand between any two nodes, something hard to see in large network visualizations.
- This measure is essentially finding friends-of-friends—if my mother knows someone that I don’t, then mom is the shortest path between me and that person.
- The Six Degrees of Kevin Bacon game, from which our project takes its name, is basically a game of finding shortest paths (with a path length of six or less) from Kevin Bacon to any other actor.

```
fell_whitehead_path = nx.shortest_path(G, source="Margaret Fell", target="George Whitehead")
print("Shortest path between Fell and Whitehead:", fell_whitehead_path)
```

```
print("Length of that path:", len(fell_whitehead_path)-1)
```
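
The `-1` above is needed because `shortest_path` returns a list of nodes, while path length is conventionally counted in edges. A toy example (the node names are invented) makes the off-by-one explicit:

```python
import networkx as nx

# A simple chain: a-b-c-d
T = nx.Graph()
T.add_edges_from([("a", "b"), ("b", "c"), ("c", "d")])

path = nx.shortest_path(T, source="a", target="d")
print(path)           # ['a', 'b', 'c', 'd'] — four nodes...
print(len(path) - 1)  # ...but only three edges, so the path length is 3
```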

### Diameter¶

- There are many network metrics derived from shortest path lengths.
- One such measure is diameter, which is the longest of all shortest paths.
- After calculating all shortest paths between every possible pair of nodes in the network, diameter is the length of the path between the two nodes that are furthest apart.
- The measure is designed to give you a sense of the network's overall size, the distance from one end of the network to another.

```
# If your Graph has more than one component, this will return False:
print(nx.is_connected(G))
# Next, use nx.connected_components to get the list of components,
# then use the max() command to find the largest one:
components = nx.connected_components(G)
largest_component = max(components, key=len)
# Create a "subgraph" of just the largest component
# Then calculate the diameter of the subgraph, just like you did with density.
#
subgraph = G.subgraph(largest_component)
diameter = nx.diameter(subgraph)
print("Network diameter of largest component:", diameter)
```

Unlike density, which is scaled from 0 to 1, it is difficult to know from this number alone whether a diameter of 8 is large or small. For some global metrics, it can be best to compare against networks of similar size and shape.

### Triadic closure¶

- Triadic closure supposes that if two people know the same person, they are likely to know each other.
- If Fox knows both Fell and Whitehead, then Fell and Whitehead may very well know each other, completing a triangle in the visualization of three edges connecting Fox, Fell, and Whitehead.
- The number of these enclosed triangles in the network can be used to find clusters and communities of individuals that all know each other fairly well.
- One way of measuring triadic closure is called the clustering coefficient, named for this clustering tendency, but the structural, network-level measure you will learn here is known as transitivity.
- Transitivity is the ratio of all triangles over all possible triangles.
- A possible triangle exists when one person (Fox) knows two people (Fell and Whitehead).
- So transitivity, like density, expresses how interconnected a graph is in terms of a ratio of actual over possible connections.
- Remember, measurements like transitivity and density concern likelihoods rather than certainties.
- All the outputs of your Python script must be interpreted, like any other object of research.
- Transitivity allows you a way of thinking about all the relationships in your graph that may exist but currently do not.

```
triadic_closure = nx.transitivity(G)
print("Triadic closure:", triadic_closure)
```

Also like density, transitivity is scaled from 0 to 1, and you can see that the network’s transitivity is about 0.1694, somewhat higher than its 0.0248 density.

Because the graph is not very dense, there are fewer possible triangles to begin with, which may result in slightly higher transitivity.

That is, nodes that already have lots of connections are likely to be part of these enclosed triangles. To back this up, you’ll want to know more about nodes with many connections.
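
The "actual over possible triangles" ratio can be verified by hand on a toy graph. This is a sketch on invented data: one closed triangle (a, b, c) plus an open triad through d, counting possible triangles as one per node that knows two people:

```python
import networkx as nx
from math import comb

T = nx.Graph()
T.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")])

closed = sum(nx.triangles(T).values()) / 3         # each triangle is counted once per member node
possible = sum(comb(d, 2) for _, d in T.degree())  # open + closed triads: pairs of neighbors per node
print(nx.transitivity(T))     # 0.5
print(3 * closed / possible)  # same ratio, computed by hand: 3 * 1 / 6
```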

### Centrality¶

- After getting some basic measures of the entire network structure, a good next step is to find which nodes are the most important ones in your network.
- In network analysis, measures of the importance of nodes are referred to as centrality measures.
- Because there are many ways of approaching the question "which nodes are the most important?" there are many different ways of calculating centrality.
- Here you will learn about three of the most common centrality measures: degree, betweenness centrality, and eigenvector centrality.

#### Degree¶

Degree is the simplest and the most common way of finding important nodes.

- A node's degree is the sum of its edges.
- If a node has three lines extending from it to other nodes, its degree is three.
- If it has five edges, its degree is five.
- The nodes with the highest degree in a social network are the people who know the most people.
- These nodes are often referred to as hubs, and calculating degree is the quickest way of identifying hubs.
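
A minimal sketch of degree-as-edge-count and hub-finding on a made-up graph (node names are invented, not from the Quaker data):

```python
import networkx as nx

T = nx.Graph()
T.add_edges_from([("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")])

print(T.degree("hub"))  # three edges extend from "hub", so its degree is 3
# The quickest way to spot a hub: the node with the maximum degree
hub, deg = max(T.degree(), key=lambda pair: pair[1])
print(hub, deg)
```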

```
print(G.nodes['William Penn'])
```

```
degree_dict = dict(G.degree(G.nodes()))
nx.set_node_attributes(G, degree_dict, 'degree')
print(G.nodes['William Penn'])
```

```
sorted_degree = sorted(degree_dict.items(), key=itemgetter(1), reverse=True)
print("Top 20 nodes by degree:")
for d in sorted_degree[:20]:
    print(d)
```

```
betweenness_dict = nx.betweenness_centrality(G) # Run betweenness centrality
eigenvector_dict = nx.eigenvector_centrality(G) # Run eigenvector centrality
# Assign each to an attribute in your network
nx.set_node_attributes(G, betweenness_dict, 'betweenness')
nx.set_node_attributes(G, eigenvector_dict, 'eigenvector')
```

```
sorted_betweenness = sorted(betweenness_dict.items(), key=itemgetter(1), reverse=True)
print("Top 20 nodes by betweenness centrality:")
for b in sorted_betweenness[:20]:
    print(b)
```

- As you can see, Penn’s degree is 18, relatively high for this network.
- But printing out this ranking information illustrates the limitations of degree as a centrality measure.
- You probably didn’t need NetworkX to tell you that William Penn, Quaker leader and founder of Pennsylvania, was important.
- Most social networks will have just a few hubs of very high degree, with the rest of the nodes at similar, much lower degrees.
- Degree can tell you about the biggest hubs, but it can’t tell you that much about the rest of the nodes.
- And in many cases, those hubs it’s telling you about (like Penn or Quakerism co-founder Margaret Fell, with a degree of 13) are not especially surprising.
- In this case almost all of the hubs are founders of the religion or otherwise important political figures.

### Eigenvector centrality and Betweenness centrality¶

#### Eigenvector Centrality¶

- It looks at a combination of a node’s edges and the edges of that node’s neighbors.
- Eigenvector centrality cares if you are a hub, but it also cares how many hubs you are connected to.
- It’s calculated as a value from 0 to 1: the closer to one, the greater the centrality.
- Eigenvector centrality is useful for understanding which nodes can get information to many other nodes quickly.
- If you know a lot of well-connected people, you could spread a message very efficiently.
- If you’ve used Google, then you’re already somewhat familiar with Eigenvector centrality.
- Their PageRank algorithm uses an extension of this formula to decide which webpages get to the top of its search results.
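
The "how many hubs you are connected to" intuition can be illustrated on a made-up graph: two nodes with the same degree score differently because one of them neighbors hubs. This is a sketch with invented node names, not the tutorial's own code:

```python
import networkx as nx

T = nx.Graph()
# Two hubs, each with three leaves
T.add_edges_from([("H1", "a"), ("H1", "b"), ("H1", "c"),
                  ("H2", "d"), ("H2", "e"), ("H2", "f")])
T.add_edges_from([("X", "H1"), ("X", "H2")])  # X connects to both hubs
T.add_edges_from([("Y", "a"), ("Y", "b")])    # Y connects to two leaves

ec = nx.eigenvector_centrality(T, max_iter=1000)
print(T.degree("X"), T.degree("Y"))  # both have degree 2...
print(ec["X"] > ec["Y"])             # ...but X scores higher: its neighbors are hubs
```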

#### Betweenness Centrality¶

- Betweenness centrality is a bit different from the other two measures in that it doesn’t care about the number of edges any one node or set of nodes has.
- Betweenness centrality looks at all the shortest paths that pass through a particular node (see above).
- To do this, it must first calculate every possible shortest path in your network, so keep in mind that betweenness centrality will take longer to calculate than other centrality measures (but it won’t be an issue in a dataset of this size).
- Betweenness centrality, which is also expressed on a scale of 0 to 1, is fairly good at finding nodes that connect two otherwise disparate parts of a network.
- If you’re the only thing connecting two clusters, every communication between those clusters has to pass through you.
- In contrast to a hub, this sort of node is often referred to as a broker.
- Betweenness centrality is not the only way of finding brokerage (and other methods are more systematic), but it’s a quick way of giving you a sense of which nodes are important not because they have lots of connections themselves but because they stand between groups, giving the network connectivity and cohesion.
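
A broker's high betweenness despite low degree can be sketched with two invented clusters joined by a single bridge node (a toy example, not the Quaker data):

```python
import networkx as nx
from itertools import combinations

T = nx.Graph()
# Two tight clusters of four people each (every pair connected)
T.add_edges_from(combinations(["a", "b", "c", "d"], 2))
T.add_edges_from(combinations(["e", "f", "g", "h"], 2))
# One broker, "B", is the only link between the clusters
T.add_edges_from([("a", "B"), ("B", "e")])

bc = nx.betweenness_centrality(T)
broker = max(bc, key=bc.get)
print(broker, T.degree("B"))  # "B" tops betweenness with only 2 edges
```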

```
#First get the top 20 nodes by betweenness as a list
top_betweenness = sorted_betweenness[:20]
#Then find and print their degree
for tb in top_betweenness: # Loop through top_betweenness
    degree = degree_dict[tb[0]] # Use degree_dict to access a node's degree, see footnote 2
    print("Name:", tb[0], "| Betweenness Centrality:", tb[1], "| Degree:", degree)
```

## Advanced NetworkX: Community detection with modularity¶

- Another common thing to ask about a network dataset is what the subgroups or communities are within the larger social structure.
- Is your network one big, happy family where everyone knows everyone else? Or is it a collection of smaller subgroups that are only connected by one or two intermediaries?
- The field of community detection in networks is designed to answer these questions.
- There are many ways of calculating communities, cliques, and clusters in your network, but the most popular method currently is modularity.
- Modularity is a measure of relative density in your network: a community (called a module or modularity class) has high density relative to other nodes within its module but low density with those outside.
- Modularity gives you an overall score of how fractious your network is, and that score can be used to partition the network and return the individual communities.
- Very dense networks are often more difficult to split into sensible partitions.
- Luckily, as you discovered earlier, this network is not all that dense.
- There aren’t nearly as many actual connections as possible connections, and there are several altogether disconnected components.
- It’s worthwhile to partition this sparse network with modularity and see if the results make historical and analytical sense.
- Community detection and partitioning in NetworkX requires a little more setup than some of the other metrics.
- There are some built-in approaches to community detection (like minimum cut), but modularity is not among NetworkX's default algorithms.
- Fortunately, the community module from networkx.algorithms, which you imported at the beginning of this tutorial, provides modularity-based community detection.
- You can read the full documentation for all of the functions it offers, but for the purposes of this tutorial you'll only need greedy_modularity_communities():

```
communities = community.greedy_modularity_communities(G) #tries to determine the number of communities appropriate for the graph, and groups all nodes into subsets based on these communities.
'''
the above code will not create a dictionary. Instead it creates a list of special “frozenset” objects (similar to lists).
There’s one set for each group, and the sets contain the names of the people in each group.
In order to add this information to your network in the now-familiar way, you must first create a dictionary that labels each person with a number value for the group to which they belong:
'''
modularity_dict = {} # Create a blank dictionary
for i,c in enumerate(communities): # Loop through the list of communities, keeping track of the number for the community
    for name in c: # Loop through each person in a community
        modularity_dict[name] = i # Create an entry in the dictionary for the person, where the value is which group they belong to.
# Now you can add modularity information like we did the other metrics
nx.set_node_attributes(G, modularity_dict, 'modularity')
# First get a list of just the nodes in that class
class0 = [n for n in G.nodes() if G.nodes[n]['modularity'] == 0]
# Then create a dictionary of the eigenvector centralities of those nodes
class0_eigenvector = {n:G.nodes[n]['eigenvector'] for n in class0}
# Then sort that dictionary and print the first 5 results
class0_sorted_by_eigenvector = sorted(class0_eigenvector.items(), key=itemgetter(1), reverse=True)
print("Modularity Class 0 Sorted by Eigenvector Centrality:")
for node in class0_sorted_by_eigenvector[:5]:
    print("Name:", node[0], "| Eigenvector Centrality:", node[1])
```

- Using eigenvector centrality as a ranking can give you a sense of the important people within this modularity class.
- You’ll notice that some of these people, especially William Penn, William Bradford (not the Plymouth founder you’re thinking of), and James Logan, spent lots of time in America.
- Also, Bradford and Tace Sowle were both prominent Quaker printers. With just a little bit of digging, we can discover that there are both geographical and occupational reasons that this group of people belongs together.
- This is an indication that modularity is probably working as expected.
- In smaller networks like this one, a common task is to find and list all of the modularity classes and their members.
- You can do this by looping through the communities list:

```
for i,c in enumerate(communities): # Loop through the list of communities
    if len(c) > 2: # Filter out modularity classes with 2 or fewer nodes
        print('Class '+str(i)+':', list(c)) # Print out the classes and their members
```

### Exporting Data¶

The GEXF format is readable by Gephi, an open-source network visualization program.

```
nx.write_gexf(G, 'quaker_network.gexf')
```
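
A round-trip sketch on a throwaway graph, using a temporary file (the filename and attribute values are invented), to confirm the exported file can be read back:

```python
import os
import tempfile
import networkx as nx

T = nx.Graph()
T.add_edge("a", "b")
nx.set_node_attributes(T, {"a": "writer", "b": "printer"}, "occupation")

path = os.path.join(tempfile.mkdtemp(), "toy_network.gexf")
nx.write_gexf(T, path)  # export, e.g. for Gephi
R = nx.read_gexf(path)  # read it back to verify the round trip
print(R.number_of_nodes(), R.nodes["a"]["occupation"])
```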
