Text and data cleaning

semanticlayertools.cleaning.text.htmlTags(text)

Reformat HTML tags in text using a replacement list.

Certain HTML formatting interferes with sentence and token boundary detection. This method returns the cleaned text after applying a replacement list.

Parameters:

text (str) – Input text
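
A minimal usage sketch (the example input is illustrative; the actual replacement list is internal to the library):

    from semanticlayertools.cleaning.text import htmlTags

    raw = "First sentence.<br/>Second <i>sentence</i>."
    cleaned = htmlTags(raw)  # returns the text with HTML tags replaced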

semanticlayertools.cleaning.text.lemmaSpacy(text)

Clean text using the spaCy English language model.

A spaCy Doc is created from the text. For each token that is not a stopword and is longer than three letters, the lowercased lemma is returned. For historical reasons, the input can also be a single-element list, e.g. text = ["Actual text"], which sometimes results from data harvesting. In that case only the first element is considered!

Parameters:

text (str) – Input text
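
A minimal usage sketch (assuming the spaCy English model is installed; the example inputs are illustrative):

    from semanticlayertools.cleaning.text import lemmaSpacy

    # Stopwords and tokens of three letters or fewer are dropped;
    # the remaining tokens are returned as lowercased lemmas.
    lemmas = lemmaSpacy("The runners were running quickly through the cities.")

    # For historical reasons a single-element list is also accepted;
    # only the first element is considered.
    lemmas = lemmaSpacy(["The runners were running quickly."])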

semanticlayertools.cleaning.text.tokenize(text, languageModel=<spacy.lang.en.English object>, ngramRange=(1, 5), limitPOS=False, excludeStopWords=False, excludePunctuation=False, excludeNumerical=False, excludeNonAlphabetic=False, tokenMinLength=1, debug=False)

Tokenize the provided text using the specified spaCy language model.

Limit tokens to specific parts of speech by providing a list, e.g. limitPOS=["NN", "NNS", "NNP", "NNPS", "JJ", "JJR", "JJS"]. Generated n-grams are joined with the special character "#" (hash), which needs to be taken into account in later steps of processing pipelines.

Exclude stop words by setting excludeStopWords=True.

Parameters:
  • text (str) – Input text

  • languageModel (spacy.language.Language) – The spaCy language model used for tokenizing.

  • ngramRange (tuple) – Range of n-grams to be returned, default 1- to 5-grams.

  • limitPOS (bool or list) – Limit returned tokens to specific parts of speech; False disables the filter.

  • excludeStopWords (bool) – Exclude stop words from the returned tokens.

  • excludePunctuation (bool) – Exclude punctuation from the returned tokens.

  • excludeNumerical (bool) – Exclude numerical tokens from the returned tokens.

  • excludeNonAlphabetic (bool) – Exclude non-alphabetic tokens from the returned tokens.

  • tokenMinLength (int) – Minimal length of returned tokens.

  • debug (bool) – Enable debug output.
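
A usage sketch combining several of the options above (assuming the en_core_web_sm model is installed; the input text is illustrative):

    import spacy

    from semanticlayertools.cleaning.text import tokenize

    nlp = spacy.load("en_core_web_sm")
    tokens = tokenize(
        "Complex networks evolve over long periods of time.",
        languageModel=nlp,
        ngramRange=(1, 2),
        limitPOS=["NN", "NNS", "JJ"],   # keep nouns and adjectives only
        excludeStopWords=True,
        tokenMinLength=3,
    )
    # Returned bigrams are joined with the "#" character.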