Text and data cleaning

semanticlayertools.cleaning.text.htmlTags(text)

Reformat HTML tags in text using a replacement list.

Certain HTML formatting interferes with sentence and token boundary detection. This method returns the cleaned text after applying a replacement list.

Parameters:

text (str) – Input text
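
A minimal usage sketch (the example input is illustrative; the actual replacement list is internal to the library):

    from semanticlayertools.cleaning.text import htmlTags

    raw = "First sentence.<br/>Second <i>sentence</i>."
    cleaned = htmlTags(raw)  # returns the text with HTML tags replaced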

semanticlayertools.cleaning.text.lemmaSpacy(text)

Clean text using the spaCy English language model.

A spaCy Doc is created from the text. For each token that is not a stopword and is longer than three letters, the lowercased lemma is returned. For historical reasons, the input can also be a single-element list, e.g. text = ["Actual text"], which sometimes results from data harvesting. In that case only the first element is considered!

Parameters:

text (str) – Input text
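
A minimal usage sketch (assuming the spaCy English model is installed; the example inputs are illustrative):

    from semanticlayertools.cleaning.text import lemmaSpacy

    # Stopwords and tokens of three letters or fewer are dropped;
    # the remaining tokens are returned as lowercased lemmas.
    lemmas = lemmaSpacy("The runners were running quickly through the cities.")

    # For historical reasons a single-element list is also accepted;
    # only the first element is considered.
    lemmas = lemmaSpacy(["The runners were running quickly."])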

semanticlayertools.cleaning.text.tokenize(text, languageModel=<spacy.lang.en.English object>, ngramRange=(1, 5), limitPOS=False, excludeStopWords=False, excludePunctuation=False, excludeNumerical=False, excludeNonAlphabetic=False, tokenMinLength=1, debug=False)

Tokenize the provided text using the specified spaCy language model.

Limit tokens to specific parts of speech by providing a list, e.g. limitPOS=["NN", "NNS", "NNP", "NNPS", "JJ", "JJR", "JJS"]. Generated n-grams are joined with the special character "#" (hash), which needs to be taken into account in later steps of processing pipelines.

Exclude stop words by setting excludeStopWords=True.

Parameters:
  • text (str) – Input text

  • languageModel (spacy.language.Language) – The spaCy language model used for tokenizing.

  • ngramRange (tuple) – Range of n-grams to be returned, default 1- to 5-grams.

  • limitPOS (bool or list) – Limit returned tokens to specific parts of speech; False disables the filter.

  • excludeStopWords (bool) – Exclude stop words from the returned tokens.

  • excludePunctuation (bool) – Exclude punctuation from the returned tokens.

  • excludeNumerical (bool) – Exclude numerical tokens from the returned tokens.

  • excludeNonAlphabetic (bool) – Exclude non-alphabetic tokens from the returned tokens.

  • tokenMinLength (int) – Minimal length of returned tokens.

  • debug (bool) – Enable debug output.
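
A usage sketch combining several of the options above (assuming the en_core_web_sm model is installed; the input text is illustrative):

    import spacy

    from semanticlayertools.cleaning.text import tokenize

    nlp = spacy.load("en_core_web_sm")
    tokens = tokenize(
        "Complex networks evolve over long periods of time.",
        languageModel=nlp,
        ngramRange=(1, 2),
        limitPOS=["NN", "NNS", "JJ"],   # keep nouns and adjectives only
        excludeStopWords=True,
        tokenMinLength=3,
    )
    # Returned bigrams are joined with the "#" character.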