Word scoring and linkage

Link papers by Kullback-Leibler divergence

Calculate various corpus linguistic measures.

class semanticlayertools.linkage.worddistributions.CalculateKDL(targetData: DataFrame, compareData: DataFrame, yearColumnTarget: str = 'year', yearColumnCompare: str = 'year', tokenColumnTarget: str = 'tokens', tokenColumnCompare: str = 'tokens', *, debug: bool = False)

Calculates KDL scores for time slices.

See also

Stefania Degaetano-Ortlieb and Elke Teich. 2017. Modeling intra-textual variation with entropy and surprisal: topical vs. stylistic patterns. In Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 68-77, Vancouver, Canada. Association for Computational Linguistics.

getKDLRelations(windowSize: int = 3, minNgramNr: int = 5, specialChar: str = '#') list

Calculate KDL relations.

Parameters:

specialChar (str) – Special character used to delimit tokens in ngrams (default=#)

getNgramPatterns(windowSize: int = 3, specialChar: str = '#') None

Create dictionaries of occurring ngrams.

Parameters:

specialChar (str) – Special character used to delimit tokens in ngrams (default=#)

class semanticlayertools.linkage.worddistributions.UnigramKLD(data: DataFrame, targetName: str, lambdaParam: float = 0.995, yearCol: str = 'Year', authorCol: str = 'Author', tokenCol: str = 'tokens', docIDCol: str = 'bibcode', windowSize: int = 1, epsilon: float = 1e-10)

Calculate unigram KLD.

calculateJMS(term: str, targetUM: dict, fullUM: dict) float

Jelinek-Mercer smoothing with a lower bound.
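As an illustration of the idea, the following is a minimal standalone sketch of Jelinek-Mercer smoothing, not the library's implementation. It assumes that both language models are given as dictionaries of raw term counts and that lambda weights the target model; the function name jelinek_mercer and the example data are hypothetical. The defaults mirror lambdaParam and epsilon from the class signature.

```python
def jelinek_mercer(term, target_counts, full_counts,
                   lambda_param=0.995, epsilon=1e-10):
    """Interpolate a term's probability in the target model with the full model."""
    target_total = sum(target_counts.values())
    full_total = sum(full_counts.values())
    # Relative frequencies; a term missing from a model has probability 0.
    p_target = target_counts.get(term, 0) / target_total if target_total else 0.0
    p_full = full_counts.get(term, 0) / full_total if full_total else 0.0
    # Assumption: lambda weights the target model, (1 - lambda) the full model.
    smoothed = lambda_param * p_target + (1 - lambda_param) * p_full
    # The lower bound keeps later logarithms finite.
    return max(smoothed, epsilon)


target = {"galaxy": 3, "star": 1}
full = {"galaxy": 5, "star": 10, "planet": 5}

# "planet" is absent from the target model but still gets a small probability.
print(jelinek_mercer("planet", target, full))
# A term unknown to both models falls back to the lower bound.
print(jelinek_mercer("quasar", target, full))
```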

calculateKLD(languageModelType: str = 'unigram', timeOrder: str = 'synchron') tuple

Calculate synchronous or asynchronous comparison with a lower bound.
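For orientation, a Kullback-Leibler divergence with a lower bound can be sketched as below. This is a generic textbook formulation under the assumption that both models are dictionaries of term probabilities; the function name kl_divergence is hypothetical, and the way the library selects the two models for each time order is not shown.

```python
import math


def kl_divergence(p, q, epsilon=1e-10):
    """Return D(p || q) in bits for two term-probability dictionaries."""
    total = 0.0
    for term, p_term in p.items():
        if p_term <= 0:
            continue  # zero-probability terms contribute nothing
        # The lower bound avoids division by zero for terms unseen in q.
        q_term = max(q.get(term, 0.0), epsilon)
        total += p_term * math.log2(p_term / q_term)
    return total


p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}

print(kl_divergence(p, q))  # positive, the distributions differ
print(kl_divergence(p, p))  # 0.0, identical distributions
```

Note that the divergence is not symmetric: kl_divergence(p, q) and kl_divergence(q, p) generally differ.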

perform_stat_test(test_type: str = 'welch', languageModelType: str = 'unigram', timeOrder: str = 'synchron') tuple

Calculate significance using specified statistical test.

Calculate trajectories of embeddings

Calculate change of publication densities.

class semanticlayertools.linkage.densities.EmbeddingDensities(file_path: Path, filters: dict, embedding_cols: list, *, year_col: str = 'Year', text_col: str = 'Title')

Compare embedding densities for filters.

For a given set of embedded text data, the filter selects publications to compare to the overall density in the embedding space of all other publications. This makes it possible to trace the evolution of groups of publications in relation to the popularity of the local topic space.

Assumes the Parquet file format, which allows filtering in very large text databases. Filters are given as a dictionary, where keys are the column titles to filter on and values are the strings or values that should be contained in the column. embedding_cols lists the column titles of the 2D embedding space, e.g. [x, y].
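The filter format can be illustrated with a small sketch on plain dictionaries instead of a Parquet file. The helper matches_filters and the example records are hypothetical, and the matching rule (substring match for strings, equality otherwise) is an assumption; the class may apply its filters differently.

```python
def matches_filters(record, filters):
    """Return True if every filtered column contains its filter value."""
    for column, wanted in filters.items():
        value = record.get(column)
        if value is None:
            return False
        if isinstance(wanted, str):
            # Assumption: string filters are substring matches.
            if wanted not in str(value):
                return False
        elif value != wanted:
            return False
    return True


records = [
    {"Title": "Dark matter halos", "Year": 1999, "x": 0.1, "y": 0.4},
    {"Title": "Stellar winds", "Year": 2001, "x": 0.7, "y": 0.2},
]
filters = {"Title": "matter"}
embedding_cols = ["x", "y"]

selected = [r for r in records if matches_filters(r, filters)]
# Coordinates of the selected publications in the 2D embedding space.
points = [tuple(r[col] for col in embedding_cols) for r in selected]
print(points)
```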

compute_densities(plot_title: str, *, show_legend: bool) Figure

Compute densities and generate traces to plot.

create_figure(*, save_fig: bool = False, plot_title: str = 'Density change over time', show_legend: bool = True) Figure

Run all routines and create the actual plotly figure.

The plot title can be adjusted, as well as whether a legend should be shown. If save_fig is set to a path, the figure is exported to an HTML file and not plotted.

shorten_title(title: str, max_length: int = 30) str

Shorten the title to a maximum length with ‘…’ at the end if needed.
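A standalone sketch of this kind of truncation is shown below. It assumes that the ellipsis counts towards max_length; whether the method itself does so is not stated here.

```python
def shorten_title(title, max_length=30):
    """Truncate title to at most max_length characters, ending in an ellipsis."""
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    if len(title) <= max_length:
        return title
    # Assumption: the ellipsis is part of the allowed length.
    return title[: max_length - 1] + "…"


print(shorten_title("Short title"))
print(shorten_title("A very long publication title that will not fit"))
```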

Link papers by Ngram scoring

class semanticlayertools.linkage.wordscore.CalculateScores(sourceDataframe, pubIDColumn: str = 'pubID', yearColumn: str = 'year', tokenColumn: str = 'tokens', debug: bool = False)

Calculates ngram scores for documents.

All texts of the corpus are tokenized and POS tags are generated. A global dictionary of counts of different ngrams is built in counts. The ngram relations of every text are listed in outputDict.

Scoring is based on counts of occurrences of different words left and right of each single token in each ngram, weighted by ngram size. For details, see the reference.

Parameters:
  • sourceDataframe (class:pandas.DataFrame) – Dataframe containing the basic corpus

  • pubIDColumn (str) – Column name to use for publication identification (assumed to be unique)

  • yearColumn (str) – Column name for temporal ordering of publications, used during writing the scoring files

  • ngramsize (int) – Maximum of considered ngrams (default: 5-gram)

See also

Abe H., Tsumoto S. (2011). Evaluating a Temporal Pattern Detection Method for Finding Research Keys in Bibliographical Data. In: Peters J.F. et al. (eds) Transactions on Rough Sets XIV. Lecture Notes in Computer Science, vol 6600. Springer, Berlin, Heidelberg. 10.1007/978-3-642-21563-6_1

getScore(target, specialChar='#')

Calculate ngram score.

getTermPatterns(year, dataframe, specialChar='#')

Create dictionaries of occurring ngrams.

getTfiDF(year)

run(windowsize: int = 3, write: bool = False, outpath: str = './', recreate: bool = False, tokenMinCount: int = 5, limitCPUs: bool = True)

Get score for all documents.

class semanticlayertools.linkage.wordscore.CalculateSurprise(sourceDataframe, pubIDColumn: str = 'pubID', yearColumn: str = 'year', tokenColumn: str = 'tokens', debug: bool = False)

Calculates surprise scores for documents.

The source dataframe is expected to contain pre-calculated ngrams (tokens) for each document in the form of lists of 1-grams, joined by a special character (default is “#” (hash)). For surprise calculation of e.g. 1- and 2-grams the precalculated n-grams need to contain at least 5-grams, to evaluate the surprise context of 1- and 2-grams, see reference for details. The main routine of this class is run().
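The token layout described above can be illustrated with a short sketch that builds all n-grams up to a maximum length and joins their tokens with the special character. The helper build_ngrams is hypothetical, and returning one flat list per document is an assumption about the expected column content.

```python
def build_ngrams(tokens, max_n=5, special_char="#"):
    """Return all 1- to max_n-grams of tokens, joined by special_char."""
    ngrams = []
    for n in range(1, max_n + 1):
        # Slide a window of length n over the token list.
        for start in range(len(tokens) - n + 1):
            ngrams.append(special_char.join(tokens[start:start + n]))
    return ngrams


tokens = ["the", "last", "experiment"]
print(build_ngrams(tokens, max_n=3))
# ['the', 'last', 'experiment', 'the#last', 'last#experiment', 'the#last#experiment']
```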

Parameters:
  • sourceDataframe (class:pandas.DataFrame) – Dataframe containing the basic corpus

  • pubIDColumn (str) – Column name to use for publication identification (assumed to be unique)

  • yearColumn (str) – Column name for temporal ordering of publications, used during writing the scoring files

  • tokenColumn (str) – Column name for tokens

See also

Stefania Degaetano-Ortlieb and Elke Teich. 2017. Modeling intra-textual variation with entropy and surprisal: topical vs. stylistic patterns. In Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 68–77, Vancouver, Canada. Association for Computational Linguistics.

getNgramPatterns(year, dataframe, ngramLimit=2, specialChar='#')

Create dictionaries of occurring ngrams.

Parameters:
  • year (int) – Current year for calculations

  • dataframe (class:pandas.DataFrame) – Current slice of main dataframe

  • ngramLimit (int) – Maximal ngram to consider for surprise calculation (default=2)

  • specialChar (str) – Special character used to delimit tokens in ngrams (default=#)

getSurprise(target: list, ngramNr: int = 1, specialChar: str = '#')

Calculate surprise score.

The surprise for a 1- or 2-gram is calculated based on the group of 4- or 5-grams which contain the target in the last or the last two positions (e.g. experiment and the#last#experiment). It is defined as the sum over the base-two logarithm of the probabilities for these 4- or 5-grams. The probability, e.g. for a given 4-gram, is the number of realizations of that 4-gram divided by the number of possible 4-grams.

Parameters:
  • target (list) – Target list of tuples to use for surprise calculation.

  • ngramNr (int) – ngram length to use for calculation (1 or 2)

  • minNgramNr (int) – Minimal number of occurrences of a 1- or 2-gram in the corpus to consider calculations (default=5)

  • specialChar (str) – Special character used to delimit tokens in ngrams (default=#)
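The definition above can be sketched in a few lines. This is an illustration rather than the library's code: the function name surprise is hypothetical, the "number of possible 4-grams" is taken to be the total number of observed context n-grams, and the raw sum of logarithms is returned without a sign change, so the result is never positive.

```python
import math
from collections import Counter


def surprise(target, context_ngrams, special_char="#"):
    """Sum log2 probabilities of all context n-grams that end in target."""
    counts = Counter(context_ngrams)
    # Assumption: probabilities are relative to all observed context n-grams.
    total = sum(counts.values())
    target_parts = target.split(special_char)
    score = 0.0
    for ngram, count in counts.items():
        parts = ngram.split(special_char)
        # Keep only n-grams with the target in the last position(s).
        if parts[-len(target_parts):] == target_parts:
            score += math.log2(count / total)
    return score


fourgrams = [
    "we#ran#the#experiment",
    "we#ran#the#experiment",
    "repeat#the#last#experiment",
    "results#of#the#survey",
]

print(surprise("experiment", fourgrams))       # -3.0
print(surprise("last#experiment", fourgrams))  # -2.0
```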

getTfiDF(year)

Calculate augmented term-frequency inverse document frequency.
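For reference, one common textbook form of augmented tf-idf is sketched below: the term frequency is scaled as 0.5 + 0.5 * f / max(f) within each document and multiplied by the natural logarithm of the inverse document frequency. The function name augmented_tfidf and the example documents are hypothetical, and the exact variant used by the method may differ.

```python
import math
from collections import Counter


def augmented_tfidf(documents):
    """Return {doc_id: {term: score}} using augmented TF and log IDF."""
    n_docs = len(documents)
    # Number of documents each term appears in.
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)
        max_count = max(counts.values()) if counts else 0
        scores[doc_id] = {
            term: (0.5 + 0.5 * count / max_count)
            * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


docs = {"p1": ["dark", "matter", "dark"], "p2": ["matter", "halo"]}
tfidf = augmented_tfidf(docs)
# "matter" occurs in every document, so its score is zero.
print(tfidf["p1"])
```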

run(windowsize: int = 3, write: bool = False, outpath: str = './', recreate: bool = False, maxNgram: int = 2, minNgramNr: int = 5, limitCPUs: bool = True)

Calculate surprise for all documents.

The base corpus is sliced with a rolling window (see windowsize). For each slice, the ngram distributions are created and surprise and tfidf scores are calculated. The results are returned or saved.

Parameters:

minNgramNr (int) – Minimal number of occurrences of a 1- or 2-gram in the corpus to consider calculations (default=5)
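The rolling-window slicing mentioned above can be pictured with the following sketch. It assumes that the window advances by one year and covers windowsize consecutive years; the helper rolling_year_slices is hypothetical and the library's alignment of windows may differ.

```python
def rolling_year_slices(years, windowsize=3):
    """Return one list of consecutive years per window start."""
    first, last = min(years), max(years)
    return [
        list(range(start, start + windowsize))
        # The last window starts so that it still ends at the final year.
        for start in range(first, last - windowsize + 2)
    ]


print(rolling_year_slices([1990, 1991, 1992, 1993, 1994], windowsize=3))
# [[1990, 1991, 1992], [1991, 1992, 1993], [1992, 1993, 1994]]
```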

class semanticlayertools.linkage.wordscore.LinksOverTime(dataframe: DataFrame, authorColumn: str = 'authors', pubIDColumn: str = 'pubID', yearColumn: str = 'year', debug: bool = False)

Create multilayer pajek files for corpus.

To keep track of nodes over time, we need a global register of node names. This class takes care of this, by adding new keys of authors, papers or ngrams to the register. Central routine is “writeLinks”.

Parameters:
  • dataframe (class:pandas.DataFrame) – Source dataframe containing metadata of texts (authors, publicationID and year)

  • authorColumn (str) – Column name for author information, author names are assumed to be separated by semicolons

  • pubIDColumn (str) – Column name to identify publications

  • yearColumn (str) – Column name with year information (year encoded as integer)

createNodeRegister(scorePath: str, scoreLimit: float, scoreType: str = 'score')

Create multilayer node register for all time slices.

run(windowsize: int = 3, normalize: bool = True, scoreType: str = 'score', coauthorValue: float = 0.0, authorValue: float = 0.0, recreate: bool = False, scorePath: str = './', outPath: str = './', scoreLimit: float = 0.1)

Create data for all slices.

To be consistent, the slice window size needs to correspond to the one used for calculating the scores.

Choose normalize=True (default) to normalize ngram weights. In this case the maximal score for each time slice is 1.0. Choose the score limit accordingly.

Write multilayer links to file in Pajek format.

For ngrams with score above the limit, the corresponding tfidf value is extracted. If no preset value is given, links between coauthors and authors and publications are set to the median of the score values of the time slice. The created graphs are saved as pajek files, containing the information on node names and layers (1: authors, 2: publications, 3: ngrams).

Parameters:
  • sl (list) – Year slice of calculation

  • scorePath (str) – Path to score files.

  • scoreLimit (float) – Lower limit of scores to consider for network creation

  • normalize (bool) – Normalize the scores (True/False)

  • coauthorValue (float) – Set manual value for coauthor weight (default: Median of score weight)

  • authorValue (float) – Set manual value for author to publication weight (default: Median of score weight)

  • outPath (str) – Path to write multilayer pajek files (default = ‘./’)

  • recreate (bool) – Rewrite existing files (default = False)
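The interplay of normalize and scoreLimit can be sketched as follows. The helper normalize_and_filter and the example scores are hypothetical; the sketch only shows that, after dividing by the slice maximum, the top score is 1.0 and the limit is applied to the normalized values.

```python
def normalize_and_filter(scores, score_limit=0.1, normalize=True):
    """Scale scores so the slice maximum is 1.0 and keep those above the limit."""
    if not scores:
        return {}
    if normalize:
        top = max(scores.values())
        if top > 0:
            scores = {ngram: value / top for ngram, value in scores.items()}
    # Only ngrams strictly above the limit are kept for the network.
    return {ngram: value for ngram, value in scores.items() if value > score_limit}


slice_scores = {"dark#matter": 8.0, "stellar#wind": 2.0, "of#the": 0.4}
kept = normalize_and_filter(slice_scores, score_limit=0.1)
print(kept)  # {'dark#matter': 1.0, 'stellar#wind': 0.25}
```

Without normalization the same limit of 0.1 would keep all three ngrams, which is why the limit should be chosen with the normalize setting in mind.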

Generate network of citations

Calculate different bibliometric measures.

class semanticlayertools.linkage.citation.Couplings(inpath: inPath, outpath: str, inputType: str = 'files', pubIDColumn: str = 'nodeID', referenceColumn: str = 'reference', dateColumn: str = 'year', timerange: [int, int] = (1945, 2005), timeWindow: int = 3, numberProc: int = 2, *, limitRefLength: limitRefLength = False, debug: bool = False)

Calculate different coupling networks based on citation data.

The expected input in INPATH is a set of JSONL files whose names contain the year. The files themselves should contain IDs for each publication (PUBIDCOLUMN) and information on its references (REFERENCESCOLUMN). For the years in the TIMERANGE (default 1945-2005), files within TIMEWINDOW (default 3 years) are joined together.

Parameters:
  • inpath (str) – Path to input JSONL files

  • inputType (str) – Type of input data, files or dataframe (default: files)

  • outpath (str) – Path to write output to

  • pubIDColumn (str) – Column for unique publication IDs (default: nodeID)

  • referencesColumn (str) – Column for references data (default: reference)

  • timerange (tuple) – Time range for analysis, tuple of integers (default (1945, 2005))

  • dateColumn (str) – Column for year data in case of inputType = dataframes (default: year)

  • timeWindow (int) – Rolling window in years (default: 3)

  • numberProc (int) – Number of CPU processes for parallelization (default: 2)

  • limitRefLength (bool) – Limit the maximal length of reference list (default: False)

  • debug (bool) – Switch on additional debug messages (default: False)

See also

Rajmund Kleminski, Przemysław Kazienko, and Tomasz Kajdanowicz (2020) Analysis of direct citation, co-citation and bibliographic coupling in scientific topic identification. J of Information Science, 48, 3. 10.1177/0165551520962775

getBibliometricCoupling() None

Calculate bibliometric coupling.

For all publications in each time slice, combinations of two publications are created. For each combination the overlap between the references is determined. If the overlap is larger than 1, an edge between the two publications is generated. All edges are saved in NCOL format. The list of all edges is read in as a graph, and the giant component is saved in Pajek format.

Due to the nature of the combinatorics, this routine can be time-intensive. Switch on debugging messages to get a rough estimate of the runtime in hours.
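The pairwise procedure described above can be sketched with the standard library. The helper bibliographic_coupling and the example data are hypothetical, and using the overlap size as edge weight is an assumption; graph construction and giant component extraction are left out.

```python
from itertools import combinations


def bibliographic_coupling(references, min_overlap=2):
    """Return (pub_a, pub_b, overlap) edges for pairs sharing enough references."""
    edges = []
    for pub_a, pub_b in combinations(sorted(references), 2):
        overlap = len(set(references[pub_a]) & set(references[pub_b]))
        # "Larger than 1" means at least two shared references.
        if overlap >= min_overlap:
            edges.append((pub_a, pub_b, overlap))
    return edges


slice_references = {
    "p1": ["r1", "r2", "r3"],
    "p2": ["r2", "r3", "r4"],
    "p3": ["r3", "r5"],
}
edges = bibliographic_coupling(slice_references)
# NCOL stores one edge per line: two node names and an optional weight.
ncol_lines = [f"{a} {b} {weight}" for a, b, weight in edges]
print(ncol_lines)  # ['p1 p2 2']
```

The number of pairs grows quadratically with the number of publications in a slice, which is the source of the long runtimes mentioned above.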

getCitationCoupling() None

Calculate direct citation coupling.

For each time slice, direct citation links are created if a publication of a specific time slice is cited in the same time slice by another publication. The edge has a weight of one. Giant component behaviour is highly unlikely, therefore only information about the components is written to the output path. The full network is saved in NCOL format.
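A minimal sketch of direct citation links within one time slice is given below. It assumes that reference entries use the same identifiers as the publications themselves; the helper direct_citation_edges and the example data are hypothetical.

```python
def direct_citation_edges(references):
    """Return (citing, cited, 1) edges where both publications are in the slice."""
    in_slice = set(references)
    edges = []
    for citing in sorted(references):
        for cited in references[citing]:
            # Keep only citations to publications of the same slice.
            if cited in in_slice and cited != citing:
                edges.append((citing, cited, 1))
    return edges


slice_references = {"p1": ["p2", "x9"], "p2": [], "p3": ["p1", "p2"]}
print(direct_citation_edges(slice_references))
# [('p1', 'p2', 1), ('p3', 'p1', 1), ('p3', 'p2', 1)]
```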

getCocitationCoupling() None

Calculate cocitation coupling.

Creates three files: a metadata file with information on all components, the giant component network data in Pajek format, and the full graph data in edgelist format.

The input dataframe is split in chunks depending on the available cpu processes. All possible combinations for all elements of the reference column are calculated. The resulting values are counted to define the weight of two papers being cocited in the source dataframe.

Returns:

A tuple of GC information: Number of nodes and percentage of total, Number of edges and percentage of total

Return type:

tuple
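The counting step of co-citation coupling can be sketched as follows. The helper cocitation_weights and the example data are hypothetical, and the splitting into chunks across CPU processes is omitted; only the pair counting is shown.

```python
from collections import Counter
from itertools import combinations


def cocitation_weights(reference_lists):
    """Count how often each pair of references is cited together."""
    weights = Counter()
    for refs in reference_lists:
        # Sorting makes (a, b) and (b, a) the same key; set() drops duplicates.
        for pair in combinations(sorted(set(refs)), 2):
            weights[pair] += 1
    return weights


reference_lists = [["r1", "r2", "r3"], ["r2", "r3"], ["r1", "r3"]]
weights = cocitation_weights(reference_lists)
print(weights[("r2", "r3")])  # 2
```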