Utility functions for visualizations

Some of these methods require installing the package with the extra requirements for text embedding and clustering:

pip install semanticlayertools[embeddml]

Representing temporal cluster evolution with a streamgraph

This utility function supports the visualization of calculated temporal clusters. The parameters to vary are the smoothing (bool) and the minimal cluster size to consider (default=1000).

streamgraph(file, smooth, minClusterSize)
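The data preparation behind such a plot can be sketched as follows. This is a minimal illustration and not the package's implementation: the function name `cluster_sizes_per_year`, the `(year, cluster_id)` input format, and the choice to apply the size threshold to a cluster's total size over all years are assumptions made for this example.

```python
from collections import Counter, defaultdict


def cluster_sizes_per_year(records, min_cluster_size=1000):
    """Count cluster sizes per year, keeping only sufficiently large clusters.

    records: iterable of (year, cluster_id) pairs, one pair per publication.
    Returns (series, years), where series maps each kept cluster to its
    yearly sizes in the order given by years.
    """
    records = list(records)
    totals = Counter(cluster for _, cluster in records)
    # Assumption: the threshold applies to the total size over all years.
    kept = {c for c, n in totals.items() if n >= min_cluster_size}
    sizes = defaultdict(Counter)
    for year, cluster in records:
        if cluster in kept:
            sizes[cluster][year] += 1
    years = sorted({year for year, _ in records})
    # Years without publications in a cluster are reported as size 0.
    series = {c: [sizes[c][y] for y in years] for c in sorted(kept)}
    return series, years
```

Each list in `series` is one layer of the streamgraph; a plotting library can stack these layers over `years`.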

Embedding a text corpus in 2 dimensions

Visualizes a corpus in 2D by embedding a text column with the SentenceTransformer approach of SBERT and reducing the result with UMAP. This method is time-consuming!

embeddedTextPlotting(infolderpath, columnName, outpath, umapNeighors)

Clustering texts using SentenceEmbedding

Similar to the above method, but extended to help find large-scale structures in a given text corpus. The approach resembles topic modelling and additionally makes use of HDBSCAN clustering. It reuses a previously generated embedding of the corpus.

embeddedTextClustering(
    infolderpath, columnName, embeddingspath, outpath,
    umapNeighors, umapComponents, hdbscanMinCluster
)

See also

HDBSCAN docs

Generate citation and reference tree graph

Using the Dimensions AI dataset, this routine generates a structure that starts from a source publication and represents its references and their references as well as its citations and their citations. Visualizations of this structure show academic roots and conduits and can display disciplinary pathways.

class semanticlayertools.visual.citationnet.GenerateTree(verbose: bool = False, api_key='')

Generate tree for citationnet visualization.

For a given input document, its references and citations are evaluated. In a second step, citations of citations and references of references are extracted. This information is used to generate a tree like network for visualization.
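The two-step expansion can be sketched on a toy in-memory citation graph. This is an illustration only and does not query Dimensions: the function `build_tree`, the two input dictionaries, and the edge labels other than `cite_l1` are hypothetical names chosen for this example.

```python
def build_tree(source, references, citations):
    """Collect two levels of references and citations around a source.

    references: dict mapping a publication id to the ids it cites.
    citations: dict mapping a publication id to the ids citing it.
    Returns a list of (citing_id, cited_id, link_type) edges.
    """
    edges = []
    # Step 1: direct references and citations of the source document.
    ref_l1 = references.get(source, [])
    cite_l1 = citations.get(source, [])
    edges += [(source, ref, "ref_l1") for ref in ref_l1]
    edges += [(citing, source, "cite_l1") for citing in cite_l1]
    # Step 2: references of references and citations of citations.
    for ref in ref_l1:
        edges += [(ref, ref2, "ref_l2") for ref2 in references.get(ref, [])]
    for citing in cite_l1:
        edges += [
            (citing2, citing, "cite_l2")
            for citing2 in citations.get(citing, [])
        ]
    return edges
```

The resulting edge list, together with the set of ids appearing in it, is the kind of node and edge information a tree-like network visualization needs.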

_cleanTitleString(row)

Clean non-JSON characters from titles.

Removes newline characters, double backslashes and double quotation marks (").
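A minimal sketch of such a cleaning step, assuming the title arrives as a plain string. The standalone function `clean_title` is a hypothetical stand-in; the actual method operates on a dataframe row.

```python
def clean_title(title):
    """Remove characters that would break a JSON string literal."""
    # Newlines, double backslashes and double quotes are dropped entirely.
    for unwanted in ("\n", "\\\\", '"'):
        title = title.replace(unwanted, "")
    return title
```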

_editDF(inputdf, dftype='cite_l1', level2List=None)

Return reformatted dataframe.

_formatFOR(row)

Format existing FOR codes.

Each publication has a total value of one. Only first level parts of codes are counted. If no FOR code exists, return ‘00:1’.

Example: “02, 0201, 0204, 06” yields “02:0.75;06:0.25”
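The weighting rule can be sketched as below, assuming the codes arrive as one comma-separated string and that the first level of a code is its first two digits. The function `format_for_codes` is a hypothetical stand-in for the row-based method.

```python
from collections import Counter


def format_for_codes(codes):
    """Turn a comma-separated FOR code string into first-level shares."""
    parts = [code.strip() for code in codes.split(",") if code.strip()]
    if not parts:
        # No FOR code available: assign the full weight to '00'.
        return "00:1"
    # Assumption: the first two digits identify the first-level code.
    counts = Counter(code[:2] for code in parts)
    total = sum(counts.values())
    # Shares sum to one per publication.
    return ";".join(
        f"{code}:{count / total:g}" for code, count in sorted(counts.items())
    )
```

For the input "02, 0201, 0204, 06", three of the four codes belong to first-level code 02 and one to 06, which gives shares of 0.75 and 0.25.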

_getMissing(idlist)

Get metadata for second level reference nodes.

generateNetworkFiles(outfolder)

Generates JSON with nodes and edges lists.

query(startDoi='', citationLimit=100)

Return all links as dataframe.

Plotting routines for 3D graphs and streamgraphs

A 3D routine generates multiplex or multilayer network plots from sets of dataframes. It uses edge bundling for clearer visuals and allows manual setting of cluster colors.

Another routine creates 3D graphs for clustered centrality measures.

To compare the time clusters found, a third routine plots streamgraphs of the cluster sizes across time.