Utility functions for visualizations

Some of these methods require installing the package with the extra requirements for text embedding and clustering:

pip install semanticlayertools[ml]

Plotting routines for 3D graphs and streamgraphs

A 3D routine generates multiplex or multilayer network plots from sets of dataframes. It uses edge bundling for clearer visuals and allows manual setting of cluster colors.

Another routine creates 3D graphs of clustered centrality measures.

To compare the time clusters found, a third routine plots streamgraphs of the cluster sizes over time.

class semanticlayertools.visual.plotting.Multilayer3D(dataframes, graphLabels, comColors=((0.12156862745098039, 0.4666666666666667, 0.7058823529411765), (1.0, 0.4980392156862745, 0.054901960784313725), (0.17254901960784313, 0.6274509803921569, 0.17254901960784313), (0.8392156862745098, 0.15294117647058825, 0.1568627450980392), (0.5803921568627451, 0.403921568627451, 0.7411764705882353), (0.5490196078431373, 0.33725490196078434, 0.29411764705882354), (0.8901960784313725, 0.4666666666666667, 0.7607843137254902), (0.4980392156862745, 0.4980392156862745, 0.4980392156862745), (0.7372549019607844, 0.7411764705882353, 0.13333333333333333), (0.09019607843137255, 0.7450980392156863, 0.8117647058823529)))

Plot multiplex network.

This solution is based on this StackOverflow answer: https://stackoverflow.com/questions/60392940/multi-layer-graph-in-networkx/60416989

createComposedGraphData()

Create graph data.

In each layer graph, edges are bundled using hammer bundling. The resulting edge paths are exported as segments with an additional third dimension.

A Louvain clustering on the composed graph gives communities of nodes. Each community is assigned a color.
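The color assignment can be sketched as follows. This is a minimal illustration and not the library's code: the `partition` dictionary (node to community ID), the helper name `assign_community_colors`, and the choice to reuse colors cyclically when there are more communities than colors are all assumptions made for the example.

```python
from itertools import cycle

# First three default comColors from the class signature, rounded for brevity.
PALETTE = (
    (0.122, 0.467, 0.706),
    (1.0, 0.498, 0.055),
    (0.173, 0.627, 0.173),
)


def assign_community_colors(partition, palette=PALETTE):
    """Map every node to the color of its community.

    `partition` is a hypothetical dict {node: community_id}, for example
    the result of a Louvain run on the composed graph.
    """
    community_ids = sorted(set(partition.values()))
    # Reuse colors cyclically if there are more communities than colors.
    color_of = dict(zip(community_ids, cycle(palette)))
    return {node: color_of[com] for node, com in partition.items()}


partition = {"a": 0, "b": 0, "c": 1, "d": 3}
node_colors = assign_community_colors(partition)
```

Nodes in the same community share a color, so clusters stay recognizable across layers.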

draw(textposition=(0.1, 1.1), labelPrefix='Layer ', ax=False)

Draw figure with layer labels.

draw_edges(edges, *args, **kwargs)

Routine for edge drawing.

draw_edges_from_path(edgesData, *args, **kwargs)

Create line collections from bundled edges.

Since Line3DCollection expects a list of segments, this routine reformats the bundled data accordingly.
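As a rough sketch of this reformatting step, the snippet below turns one edge path into consecutive 3D segments. The input format (a list of `(x, y)` points plus a layer height `z`) and the helper name `path_to_segments` are assumptions for illustration; the actual structure of the bundled data may differ.

```python
def path_to_segments(path_xy, z):
    """Turn one bundled edge path into a list of 3D line segments.

    `path_xy` is a hypothetical list of (x, y) points along the bundled
    edge; `z` is the height of the layer the edge belongs to.
    """
    points = [(x, y, z) for x, y in path_xy]
    # Pair each point with its successor: n points give n - 1 segments.
    return list(zip(points[:-1], points[1:]))


segments = path_to_segments([(0.0, 0.0), (0.5, 0.2), (1.0, 0.0)], z=2)
```

A list of this shape, where each segment is a pair of `(x, y, z)` points, is what matplotlib's `mpl_toolkits.mplot3d.art3d.Line3DCollection` accepts as its first argument.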

draw_nodes(nodes, *args, **kwargs)

Routine for node drawing.

draw_plane(z, *args, **kwargs)

Create layer plane.

get_edges_between_layers()

Determine edges between layers.

Nodes in subsequent layers are thought to be connected if they have the same ID.
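The matching rule can be illustrated with a short sketch. It assumes nodes are given as `(node ID, layer)` tuples with integer layer indices; the function name `edges_between_layers` is hypothetical and the code is not taken from the library.

```python
def edges_between_layers(nodes):
    """Connect nodes with the same ID in consecutive layers.

    `nodes` is a list of (node_id, layer) tuples with integer layer
    indices; this input format is assumed for the example.
    """
    node_set = set(nodes)
    edges = []
    for node_id, layer in nodes:
        # Only look one layer up, so each inter-layer edge appears once.
        if (node_id, layer + 1) in node_set:
            edges.append(((node_id, layer), (node_id, layer + 1)))
    return edges


nodes = [("a", 0), ("b", 0), ("a", 1), ("c", 1), ("a", 2)]
inter_edges = edges_between_layers(nodes)
```

Here only node `a` appears in adjacent layers, so it is the only node that receives inter-layer edges.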

get_edges_within_layers()

Remap edges in the individual layers to the internal representations of the node IDs.

get_extent(pad=0.1)

Calculate the extent of the plot.

get_node_positions(*args, **kwargs)

Get the node positions in the layered layout.

get_nodes()

Construct an internal representation of nodes with the format (node ID, layer).

prepareNetwork()

Prepare all necessary data.

semanticlayertools.visual.plotting.gaussian_smooth(x, y, grid, sd)
semanticlayertools.visual.plotting.generateLogPlot(outpath, centrality, clusterNr, centralityDF, clusterDF, additionalDF=False, save=False, upperLimitBin=1.0, lowerLimitBin=1e-07)

Generate 3D plot of chosen centrality with focus on specific clusters.

Optionally create additional highlights for specific institutions or actors in cluster by providing extra additional centralities for these.

semanticlayertools.visual.plotting.log_tick_formatter(val, pos=None)
semanticlayertools.visual.plotting.streamgraph(filepath: str, smooth: smoothing = False, excludeCluster: excludeClu = False, minClusterSize: int = 1000, showNthGrid: int = 5)

Plot streamgraph of cluster sizes vs years.

Input is the timeclusters CSV output of the clustering.leiden.TimeClusters class. To exclude specific clusters from the plot, pass an exclusion list (excludeCluster). Only clusters with sizes larger than minClusterSize are considered for plotting.
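The selection logic can be sketched in plain Python. Everything about the data layout here is an assumption: the rows are hypothetical `(year, cluster, size)` tuples rather than the actual CSV columns, and the threshold is read as applying to a cluster's total size over all years, which may not match the library's behavior.

```python
from collections import defaultdict


def select_clusters(rows, min_cluster_size=1000, exclude=()):
    """Build per-cluster yearly size series, dropping unwanted clusters.

    `rows` is a hypothetical list of (year, cluster, size) tuples. A
    cluster is kept if it is not in `exclude` and its total size over all
    years is larger than `min_cluster_size` (an assumed reading of the
    threshold).
    """
    series = defaultdict(dict)
    for year, cluster, size in rows:
        series[cluster][year] = series[cluster].get(year, 0) + size
    return {
        cluster: dict(sorted(by_year.items()))
        for cluster, by_year in series.items()
        if cluster not in exclude and sum(by_year.values()) > min_cluster_size
    }


rows = [
    (2000, "A", 800), (2001, "A", 900),
    (2000, "B", 100), (2001, "B", 200),
    (2000, "C", 700), (2001, "C", 600),
]
kept = select_clusters(rows, min_cluster_size=1000, exclude=("C",))
```

Cluster `B` is dropped for being too small and `C` because it is on the exclusion list, leaving one series per remaining cluster to stack in the streamgraph.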

For a cleaner visualization, consider enabling the smooth parameter, which applies Gaussian smoothing.
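One common way to smooth yearly counts with a Gaussian kernel is sketched below. It is a standard-library illustration of the general technique, not the body of `gaussian_smooth`: the function name `gaussian_smooth_sketch` and the per-observation normalization are assumptions, although the argument names mirror the signature documented above.

```python
import math


def gaussian_smooth_sketch(x, y, grid, sd):
    """Spread each observation (x[j], y[j]) over `grid` with a Gaussian.

    Every kernel is normalized over the grid, so the smoothed values sum
    to the same total as `y`.
    """
    smoothed = [0.0] * len(grid)
    for xj, yj in zip(x, y):
        kernel = [math.exp(-0.5 * ((g - xj) / sd) ** 2) for g in grid]
        total = sum(kernel)
        for i, k in enumerate(kernel):
            smoothed[i] += yj * k / total
    return smoothed


years = [2000, 2001, 2002, 2003]
sizes = [0, 10, 0, 0]
grid = [2000 + 0.25 * i for i in range(13)]  # 2000.0 to 2003.0 in quarter steps
smooth_sizes = gaussian_smooth_sketch(years, sizes, grid, sd=0.5)
```

A larger `sd` gives broader, flatter bands, while a finer `grid` gives smoother outlines at the cost of more points to draw.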

Based on https://www.python-graph-gallery.com/streamchart-basic-matplotlib

Embedding routines for text

A BERTopic-based routine that first generates embeddings, then topics, finds descriptions of these topics using large language models, and finally creates an interactive visualization to help researchers find structures in large corpora of mixed content.

Create and use text embeddings.

class semanticlayertools.visual.embedding.TextEmbedder(inputDataframe: DataFrame, outputBasepath: Path, *, textColumnName: str, titleColumnName: str, corpusLanguage: str = 'German', topicsLanguage: str = 'German', subsample: bool = False, prompt: str = '', modelName: str = 'meta-llama/Meta-Llama-3-8B-Instruct', modelDir: str = '~/.cache/huggingface/hub/', device: str = 'cuda', embeddingName: str = 'BAAI/bge-m3', umapNeighbors: int = 15, umapComponents: int = 5, hdbMinClusterSize: int = 50, bertopicNrDocs: int = 10, bertopicNrWords: int = 10)

A text embedder creating visualizations.

Creates a base embedding, then highlights documents and clusters for each concept.

Returns output path and reduced 2D embedding.

_contains_keyword(row: int, keyword: str) bool

Return True if the keyword occurs in the text column or the title.

run(concepts: tuple = (), dateColumnName: str = 'None') str

Run embedding, clustering and visualization.

class semanticlayertools.visual.embedding.TopicExplorerMap(inputfolder: Path, sourceDocDF: DataFrame, outputPath: Path, plotTitle: str, displayColumn: str, searchTextColumn: str, *, useZotero: bool = False, documentIDColumn: str = 'None', zoteroGroupe: str = 'None', zoteroGroupeID: str = 'None')

Generate explorable map of topics.

Used after generating data with the TextEmbedder. Uses datamapplot to generate a map of the embedding space with LLM-described topic names. The map is searchable and zoomable, and can be connected to Zotero to display document metadata.

semanticlayertools.visual.embedding.explainTopic(inputDataframe: DataFrame, textColumnName: str, *, corpusLanguage: str = 'English', topicsLanguage: str = 'English', prompt: str = '', modelName: str = 'meta-llama/Meta-Llama-3-8B-Instruct', modelDir: str = '~/.cache/huggingface/hub/', device: str = 'cuda', bertopicNrDocs: int = 10, topicsNr: int = 15) None

Find suitable labels for collections of texts and words.

Use an LLM to find the labels for a given corpus. Returns text explaining each topic.

Embedding a text corpus in 2 dimensions

Meant for visualizing a corpus in 2D by embedding a text column using the SentenceTransformer approach of SBERT and UMAP. This method is time-consuming!

embeddedTextPlotting(infolderpath, columnName, outpath, umapNeighors)

Clustering texts using SentenceEmbedding

Similar to the above method, but extended to help find large-scale structures in a given text corpus. Like topic modelling, but additionally makes use of HDBSCAN clustering. Reuses a previously generated embedding of the corpus.

embeddedTextClustering(
    infolderpath, columnName, embeddingspath, outpath,
    umapNeighors, umapComponents, hdbscanMinCluster
)

See also

HDBSCAN docs