Pipelines for workflows

Cocitation clustering pipeline

semanticlayertools.pipelines.cocitetimeclusters.run(inputFilepath: str, outputPath: str, resolution: float, intersliceCoupling: float, inputFileType: str = 'files', minClusterSize: int = 1000, timerange: tuple = (1945, 2005), timeWindow: int = 3, pubIDColumnName: str = 'nodeID', referenceColumnName: str = 'reference', yearColumnName: str = 'year', numberproc: int = 2, limitRefLength: bool = False, useGC: bool = True, skipCocite: bool = False, skipClustering: bool = False, skipReporting: bool = False, timeclusterfile: str = '', debug: bool = False)

Runs all steps of the temporal clustering pipeline.

Creates cocitation networks, finds temporal clusters, writes report files for large clusters.

The default time range is 1945 to 2005, and the minimal size for considered clusters is 1000 nodes. Lists of references are assumed to be contained in the column “reference”.

By default this routine uses all available CPU cores. Set this to a lower value to leave resources free for other tasks running in parallel.

Parameters:
  • inputFilepath (str) – Path to corpora input data

  • inputFileType (str) – Type of input data (files or dataframe, default: files)

  • outputPath (str) – Output path for all generated data (cocitation networks, time clusters, and report files)

  • resolution (float) – Main parameter for the clustering quality function (Constant Potts Model)

  • intersliceCoupling (float) – Coupling parameter between two year slices, also influences cluster detection

  • minClusterSize (int) – Minimal cluster size; only clusters of at least this size are considered (default: 1000)

  • timerange (tuple) – Time range to evaluate clusters for (useful for limiting computation time, default: (1945, 2005))

  • timeWindow (int) – Time window to join publications into (default: 3)

  • pubIDColumnName (str) – Column name containing the IDs of publications

  • referenceColumnName (str) – Column name containing the references of a publication

  • yearColumnName (str) – Column name containing the publication year in integer format; only used for the input type “dataframe”

  • numberproc (int) – Number of CPU cores the package is allowed to use (default: all available cores)

  • limitRefLength (bool or int) – Either False or an integer giving the maximum number of references a considered publication is allowed to contain
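
A minimal invocation sketch is shown below; the paths and numeric parameter values are illustrative placeholders, not recommended settings:

    from semanticlayertools.pipelines import cocitetimeclusters

    cocitetimeclusters.run(
        inputFilepath='/path/to/corpus/',   # folder of input files (inputFileType='files')
        outputPath='/path/to/output/',      # cocitation networks, time clusters and reports are written here
        resolution=0.003,                   # Constant Potts Model resolution (placeholder value)
        intersliceCoupling=0.5,             # coupling between year slices (placeholder value)
        minClusterSize=1000,                # only clusters with at least 1000 nodes are reported
        timerange=(1945, 2005),             # restrict evaluation to this period
        numberproc=4,                       # limit CPU usage to leave resources for other tasks
    )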

Wordscore-Multilayer pipeline

semanticlayertools.pipelines.wordscorenet.run(dataframe, tempFiles: bool = True, outPath: str = './', windowsize: int = 3, textColumn: str = 'text', yearColumn: str = 'year', authorColumn: str = 'author', pubIDColumn: str = 'publicationID', ngramRange: tuple = (2, 5), tokenMinLength: int = 2, normalize: bool = True, scoreLimit: float = 0.1, numTrials: int = 5, flowModel: str = 'undirected', recreate: bool = True, skipClean: bool = False)

Run all steps for multilayer network generation using wordscoring.

Calculates word scores for the corpus documents, creates a multilayer network by linking co-authors, authors, their publications, and the used ngrams, and calculates clusters for each time slice using the Infomap algorithm.

By default, temporary folders are used so that only the found clusters are returned.

For details of the ngram method, refer to the module documentation.

Parameters:
  • dataframe (pandas.DataFrame) – The input corpus dataframe.

  • tempFiles (bool) – Use temporary files during the pipeline run.

  • outPath (str) – Path for writing the resulting cluster data, or all intermediate data if temporary files are not used

  • windowsize (int) – Length of the year window in which the text corpus is joined and network files are created

  • textColumn (str) – Column name to use for ngram calculation

  • authorColumn (str) – Column name to use for author names; assumes a string of co-author names joined by semicolons (;)

  • pubIDColumn (str) – Column name to use for publication identification (assumed to be unique)

  • yearColumn (str) – Column name used for temporally ordering publications when writing the scoring files

  • ngramRange (tuple) – Range of considered ngrams (default: (2,5), i.e. 2- to 5-grams)

  • tokenMinLength (int) – Minimal token (i.e. word) length to consider in the analysis (default: 2)

  • normalize (bool) – Trigger normalization of ngram scores for each year slice. Default is True; the maximal score in each year slice is then 1.0

  • scoreLimit (float) – Minimal weight in each slice corpus to consider an ngram score (default: 0.1)

  • numTrials (int) – Number of iterations of the infomap algorithm, default is 5

  • flowModel (str) – Flow model for the infomap algorithm, defaults to “undirected”

  • recreate (bool) – Set the recreate parameter for all parts of the pipeline, i.e. existing files are overwritten, defaults to True

  • skipClean (bool) – Skip the text cleaning part of the pipeline.
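
A minimal invocation sketch, assuming a corpus dataframe with the default column names; the toy dataframe and parameter values are illustrative only:

    import pandas as pd

    from semanticlayertools.pipelines import wordscorenet

    # Toy corpus with the default column names expected by the pipeline.
    corpus = pd.DataFrame({
        'publicationID': ['p1', 'p2'],
        'year': [1995, 1996],
        'author': ['Doe, J.;Roe, R.', 'Roe, R.'],
        'text': ['full text of the first publication', 'full text of the second publication'],
    })

    clusters = wordscorenet.run(
        corpus,
        tempFiles=True,        # keep intermediate data in temporary folders
        outPath='./',          # only the resulting cluster data is written here
        windowsize=3,          # join texts in three-year windows
        ngramRange=(2, 5),     # consider 2- to 5-grams
        scoreLimit=0.1,        # drop ngram scores below 0.1 in each year slice
    )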