Pipelines for workflows

Cocitation clustering pipeline

semanticlayertools.pipelines.cocitetimeclusters.run(inputFilepath: str, outputPath: str, resolution: float, intersliceCoupling: float, inputFileType: str = 'files', minClusterSize: int = 1000, timerange: tuple = (1945, 2005), timeWindow: int = 3, pubIDColumnName: str = 'nodeID', referenceColumnName: str = 'reference', yearColumnName: str = 'year', numberproc: int = 2, limitRefLength: bool = False, useGC: bool = True, skipCocite: bool = False, skipClustering: bool = False, skipReporting: bool = False, timeclusterfile: str = '', debug: bool = False)

Runs all steps of the temporal clustering pipeline.

Creates cocitation networks, finds temporal clusters, writes report files for large clusters.

The default time range is 1945 to 2005, and the minimal size for considered clusters is 1000 nodes. Lists of references are assumed to be contained in the column “reference”.

By default this routine uses all available CPU cores. Set this to a lower value to leave resources free for other tasks running in parallel.

Parameters:
  • inputFilepath (str) – Path to corpora input data

  • inputFileType (str) – Type of input data (files or dataframe, default: files)

  • outputPath (str) – Output path for all generated data (cocitation networks, time clusters, and report files)

  • resolution (float) – Main parameter for the clustering quality function (Constant Potts Model)

  • intersliceCoupling (float) – Coupling parameter between two year slices, also influences cluster detection

  • minClusterSize (int) – Minimal cluster size; only clusters of at least this size are considered (default: 1000)

  • timerange (tuple) – Time range to evaluate clusters for (useful for limiting computation time, default: (1945, 2005))

  • timeWindow (int) – Time window to join publications into (default: 3)

  • pubIDColumnName (str) – Column name containing the IDs of publications

  • referenceColumnName (str) – Column name containing the references of a publication

  • yearColumnName (str) – Column name containing the publication year in integer format; only used for the input type “dataframe”

  • numberproc (int) – Number of CPU cores the package is allowed to use (default: all available cores)

  • limitRefLength (bool or int) – Either False or an integer giving the maximum number of references a considered publication is allowed to contain
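
A minimal invocation sketch is shown below; the paths and numeric parameter values are illustrative placeholders, not recommended settings:

    from semanticlayertools.pipelines import cocitetimeclusters

    cocitetimeclusters.run(
        inputFilepath='/path/to/corpus/',   # folder of input files (inputFileType='files')
        outputPath='/path/to/output/',      # cocitation networks, time clusters and reports are written here
        resolution=0.003,                   # Constant Potts Model resolution (placeholder value)
        intersliceCoupling=0.5,             # coupling between year slices (placeholder value)
        minClusterSize=1000,                # only clusters with at least 1000 nodes are reported
        timerange=(1945, 2005),             # restrict evaluation to this period
        numberproc=4,                       # limit CPU usage to leave resources for other tasks
    )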

Wordscore-Multilayer pipeline

semanticlayertools.pipelines.wordscorenet.run(dataframe, tempFiles: bool = True, outPath: str = './', windowsize: int = 3, textColumn: str = 'text', yearColumn: str = 'year', authorColumn: str = 'author', pubIDColumn: str = 'publicationID', ngramRange: tuple = (2, 5), tokenMinLength: int = 2, normalize: bool = True, scoreLimit: float = 0.1, numTrials: int = 5, flowModel: str = 'undirected', recreate: bool = True, skipClean: bool = False)

Run all steps for multilayer network generation using wordscoring.

Calculates word scores for the corpus documents, creates a multilayer network by linking co-authors, authors, their publications, and the used ngrams, and calculates clusters for each time slice using the Infomap algorithm.

By default, temporary folders are used so that only the found clusters are returned.

For details of the ngram method, refer to the module documentation.

Parameters:
  • dataframe (pandas.DataFrame) – The input corpus dataframe.

  • tempFiles (bool) – Use temporary files during the pipeline run.

  • outPath (str) – Path for writing the resulting cluster data, or all intermediate data if temporary files are not used

  • windowsize (int) – Length of the year window in which the text corpus is joined and network files are created

  • textColumn (str) – Column name to use for ngram calculation

  • authorColumn (str) – Column name to use for author names; assumes a string of co-author names joined by semicolons (;)

  • pubIDColumn (str) – Column name to use for publication identification (assumed to be unique)

  • yearColumn (str) – Column name used for temporally ordering publications when writing the scoring files

  • ngramRange (tuple) – Range of considered ngrams (default: (2,5), i.e. 2- to 5-grams)

  • tokenMinLength (int) – Minimal token (i.e. word) length to consider in the analysis (default: 2)

  • normalize (bool) – Trigger normalization of ngram scores for each year slice. Default is True; the maximal score in each year slice is then 1.0

  • scoreLimit (float) – Minimal weight in each slice corpus to consider an ngram score (default: 0.1)

  • numTrials (int) – Number of iterations of the infomap algorithm, default is 5

  • flowModel (str) – Flow model for the infomap algorithm, defaults to “undirected”

  • recreate (bool) – Set the recreate parameter for all parts of the pipeline, i.e. existing files are overwritten, defaults to True

  • skipClean (bool) – Skip the text cleaning part of the pipeline.
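
A minimal invocation sketch, assuming a corpus dataframe with the default column names; the toy dataframe and parameter values are illustrative only:

    import pandas as pd

    from semanticlayertools.pipelines import wordscorenet

    # Toy corpus with the default column names expected by the pipeline.
    corpus = pd.DataFrame({
        'publicationID': ['p1', 'p2'],
        'year': [1995, 1996],
        'author': ['Doe, J.;Roe, R.', 'Roe, R.'],
        'text': ['full text of the first publication', 'full text of the second publication'],
    })

    clusters = wordscorenet.run(
        corpus,
        tempFiles=True,        # keep intermediate data in temporary folders
        outPath='./',          # only the resulting cluster data is written here
        windowsize=3,          # join texts in three-year windows
        ngramRange=(2, 5),     # consider 2- to 5-grams
        scoreLimit=0.1,        # drop ngram scores below 0.1 in each year slice
    )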