Clustering network data
Clustering with the Infomap algorithm
- class semanticlayertools.clustering.infomap.Clustering(inpath: str, outpath: str, recreate: bool = False, silent: bool = True, num_trials: int = 5, flow_model: str = 'undirected', debug: bool = False)
Cluster multilayer time-dependent networks using the Infomap algorithm.
Calculates clusters using the Infomap algorithm. Input files are assumed to be in multilayer Pajek format and to contain the year as four digits in the filename. The default settings for running the method assume an undirected multilayer network and will use at most 5 optimization runs.
- Parameters:
inpath (str) – Path to input pajek files
outpath (str) – Path for writing resulting cluster data
recreate (bool) – Toggle recreation of already existing files
silent (bool) – Toggle verbose mode for Infomap
num_trials (int) – Number of runs for the infomap routine
flow_model (str) – Model for flow, directed or undirected.
debug (bool) – Toggle writing of debug info to standard output.
See also
Martin Rosvall and Carl T. Bergstrom (2008). Maps of information flow reveal community structure in complex networks. PNAS, 105, 1118. 10.1073/pnas.0706851105
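For orientation, a minimal multilayer Pajek file in the intra/inter-edge style read by Infomap could look as follows; the file content, filename, and node labels here are illustrative, and the authoritative format definition is the mapequation documentation. The four-digit year is expected in the filename:

```python
import re
import tempfile
from pathlib import Path

# A minimal multilayer network in the *Intra/*Inter Pajek style read by
# Infomap (illustrative content; see the mapequation documentation for
# the full format specification).
pajek = """*Vertices 3
1 "paperA"
2 "paperB"
3 "paperC"
*Intra
# layer node node weight
1 1 2 1
1 2 3 1
2 1 3 1
*Inter
# layer node layer weight
1 1 2 1
"""

# The four-digit year is expected to appear in the filename.
tmpdir = Path(tempfile.mkdtemp())
infile = tmpdir / "multilayer_1987.net"
infile.write_text(pajek)

year = re.search(r"\d{4}", infile.name).group(0)
print(year)  # -> 1987
```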
- calcInfomap(inFilePath, writeStates=False, depthLevel=1)
Calculate clusters for one pajek file.
Writes found cluster (i.e. module) information in CLU and FlowTree file format to output path.
- Parameters:
inFilePath (str) – Path to input pajek file
- Raises:
OSError – If one of the output files for this year already exists.
- Returns:
Writes two files with the found cluster information; the method itself returns nothing
- Return type:
None
See also
Infomap Python documentation on the mapequation Infomap module
- run(states=False, depth=1)
Calculate infomap clustering for all pajek files in input path.
Clustering using the Leiden algorithm
- class semanticlayertools.clustering.leiden.TimeCluster(inpath: str, outpath: str, timerange: tuple = (1945, 2005), useGC: bool = True, debug: bool = False)
Cluster time-sliced data with the Leiden algorithm.
Calculates temporal clusters of e.g. time-sliced cocitation or citation data using the Leiden algorithm. Two nodes in different year slices are assumed to be identical if they share the same node name, e.g. a bibcode or DOI.
Input files are assumed to include the year in the filename, carry the ending _GC.net to denote that they contain the giant component, and be in Pajek format. Alternatively, NCOL format is supported with the ending .ncol for general network data in column format.
The resolution parameter can be seen as a limiting density above which neighbouring nodes are considered a cluster. The interslice coupling describes the influence of the yearly order on the clustering process. See the Leiden algorithm documentation for more details.
- Parameters:
inpath (str) – Path for input network data
outpath (str) – Path for writing output data
resolution (float) – Main parameter for the clustering quality function (Constant Pots Model)
intersliceCoupling (float) – Coupling parameter between two year slices, also influences cluster detection
timerange (tuple) – The time range for considering input data (default=(1945, 2005))
useGC (bool) – If True use giant component for input data (format Pajek), if False use full network data in NCOL format.
- Raises:
OSError – If the output file already exists at class instantiation
See also
Traag, V.A., Waltman, L., & Van Eck, N.J. (2019). From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9(1), 5233. 10.1038/s41598-019-41695-z
- optimize(clusterSizeCompare: int = 1000, resolution: float = 3e-06, intersliceCoupling: float = 0.3, maxComSize: bool = False)
Optimize clusters across time slices.
This runs the actual clustering and can be very time- and memory-consuming for large networks. Depending on the obtained cluster results, this method has to be run iteratively with a varying resolution parameter. Output is written to file, with the filename containing the chosen parameters.
The output CSV contains information on which node in which year belongs to which cluster. As a first measure of returned clustering, the method prints the number of clusters found above a threshold defined by clusterSizeCompare. This does not influence the output clustering.
- Parameters:
clusterSizeCompare (int) – Cluster size threshold above which clusters are counted in the printed summary
- Returns:
Tuple of output file path and list of found clusters in tuple format (node, year, cluster)
- Return type:
tuple
See also
Documentation of time-layer creation routine: Leiden documentation
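The filename and node-identity conventions described above can be illustrated with a small self-contained sketch; the filenames, time range, and DOIs below are made up for illustration:

```python
import re

# Hypothetical input filenames following the conventions described above:
# the year appears in the filename, and the suffix marks the format.
filenames = [
    "cocite_1945_GC.net",   # giant component, Pajek format (useGC=True)
    "cocite_1946_GC.net",
    "cocite_1947.ncol",     # full network, NCOL column format (useGC=False)
]

timerange = (1945, 1946)

# Select files whose four-digit year falls inside the time range,
# mirroring how the class filters its input directory.
selected = [
    f for f in filenames
    if timerange[0] <= int(re.search(r"\d{4}", f).group(0)) <= timerange[1]
]
print(selected)  # -> ['cocite_1945_GC.net', 'cocite_1946_GC.net']

# Nodes are matched across year slices by name: the same DOI appearing
# in two slices is treated as the same node when the slices are coupled.
slice_1945 = {"10.1000/a", "10.1000/b"}
slice_1946 = {"10.1000/b", "10.1000/c"}
shared = slice_1945 & slice_1946
print(shared)  # -> {'10.1000/b'}
```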
Generating reports for time clusters
- class semanticlayertools.clustering.reports.ClusterReports(infile: str, metadatapath: str, outpath: str, textcolumn: str = 'title', authorColumnName: str = 'author', affiliationColumnName: str = 'aff', publicationIDcolumn: str = 'nodeID', numberProc: int = 0, languageModel: str = 'en_core_web_lg', minClusterSize: int = 1000, timerange: tuple = (1945, 2005), rerun: bool = False, debug: bool = False)
Generate reporting on time-clusters.
Generate reports describing the content of all found clusters above a minimal size by collecting metadata for all publications in each cluster, finding the top 20 authors and affiliations of authors involved in the cluster publications, and running basic NMF topic modelling with N=20 and N=50 topics (English language models are used!). For each cluster a report file is written to the output path.
Input CSV filename is used to create the output folder in output path. For each cluster above the limit, a subfolder is created to contain all metadata for the cluster. The metadata files are assumed to be in JSONL format and contain the year in the filename.
- Parameters:
infile (str) – Path to input CSV file containing information on nodeid, clusterid, and year
metadatapath (str) – Path to JSONL (JSON lines) formatted metadata files.
outpath (str) – Path to create output folder in, foldername reflects input filename
textcolumn (str) – The metadata dataframe column containing the textual data for topic modelling (default=title)
numberProc (int) – Number of CPUs the routine will use (default = all!)
minClusterSize (int) – The minimal cluster size above which clusters are considered (default=1000)
timerange (tuple) – Time range to evaluate clusters for (useful for limiting computation time, default=(1945, 2005))
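The author-counting step mentioned above can be sketched with the standard library; the rows below are illustrative stand-ins for the JSONL metadata, using the class's default column names author and aff:

```python
from collections import Counter

# Illustrative metadata rows as they might appear in the JSONL files;
# the keys ('author', 'aff') follow the class defaults described above.
publications = [
    {"author": ["Smith, J.", "Doe, A."], "aff": ["MIT"]},
    {"author": ["Doe, A."], "aff": ["MIT", "CERN"]},
    {"author": ["Lee, K.", "Doe, A."], "aff": ["CERN"]},
]

# Count author occurrences over all cluster publications and keep the
# most frequent ones (the report uses the top 20; here the top 2).
author_counts = Counter(a for pub in publications for a in pub["author"])
top_authors = author_counts.most_common(2)
print(top_authors)  # -> [('Doe, A.', 3), ('Smith, J.', 1)]
```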
- create_corpus(dataframe, cluster)
Create corpus out of dataframe.
Uses the text contained in the cluster metadata to generate a corpus. After some basic preprocessing, each text is turned into a spaCy doc, of which only the lemmatized words, excluding stop words, are kept.
- Parameters:
dataframe – Input dataframe
- Returns:
A textacy corpus with English as the base language
- Return type:
textacy.Corpus
- find_topics(corpus_titles: list, n_topics: int, top_words: int)
Calculate topics in corpus.
Uses the NMF algorithm to calculate topics in the corpus file for n_topics topics, returning the top_words most common words for each topic. Each word has to occur at least twice in the corpus and in at most 95% of all documents.
- Parameters:
corpus_titles (textacy.Corpus) – The corpus containing the preprocessed texts.
n_topics (int) – Number of considered topics
top_words (int) – Number of returned words for each found topic
- Returns:
List of found topics with the top occurring words
- Return type:
str
- fullReport(cluster)
Generate full cluster report for one cluster.
- Parameters:
cluster (int or str) – The cluster number to process
- Raises:
ValueError – If input cluster data can not be read.
- Returns:
Report text with all gathered information
- Return type:
str
- gatherClusterMetadata()
Initial gathering of metadata for clusters.
For all files in the metadata path, call _mergeData if the year found in the filename falls within the chosen time range.
This step needs to be run once, then all cluster metadata is generated and can be reused.
- writeReports()
Generate reports and write to output path.