public class Utils
extends java.lang.Object
Constructor | Description |
---|---|
Utils() | |
Modifier and Type | Method | Description |
---|---|---|
static it.unimi.dsi.fastutil.objects.Object2IntMap<java.lang.String> | calculateTermFreq(java.util.List<java.lang.String> tokens, java.lang.String prefix, boolean freqWeights) | Calculates a vector of attributes from a list of tokens. |
static java.util.List<java.lang.String> | calculateTokenNgram(java.util.List<java.lang.String> tokens, int n) | Calculates token n-grams from a sequence of tokens. |
static java.util.List<java.lang.String> | clustList(java.util.List<java.lang.String> tokens, java.util.Map<java.lang.String,java.lang.String> dict) | Calculates a sequence of word-clusters from a list of tokens and a dictionary. |
static java.util.List<java.lang.String> | extractCharNgram(java.lang.String content, int n) | Calculates character n-grams from a String. |
static java.util.List<java.lang.String> | negateTokens(java.util.List<java.lang.String> tokens, java.util.Set<java.lang.String> set) | Adds a negation prefix to the tokens that follow a negation word until the next punctuation mark. |
static java.util.List<java.lang.String> | tokenize(java.lang.String content, boolean toLowerCase, boolean standarizeUrlsUsers, boolean reduceRepeatedLetters, Tokenizer tokenizer, Stemmer stemmer, StopwordsHandler stop) | Tokenizes a String. |
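
To make the n-gram descriptions above concrete, here is a minimal, self-contained sketch of what token and character n-gram extraction can look like. It does not call this library and is not its implementation: the class `NgramSketch`, both method names, and the `-` separator used to join tokens are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

public class NgramSketch {

    // Hypothetical stand-in for calculateTokenNgram: joins every window of
    // n consecutive tokens. The "-" separator is an assumption.
    public static List<String> tokenNgrams(List<String> tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join("-", tokens.subList(i, i + n)));
        }
        return out;
    }

    // Hypothetical stand-in for extractCharNgram: collects every substring
    // of length n, moving one character at a time.
    public static List<String> charNgrams(String content, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= content.length(); i++) {
            out.add(content.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenNgrams(List.of("i", "love", "java"), 2)); // [i-love, love-java]
        System.out.println(charNgrams("java", 3));                        // [jav, ava]
    }
}
```

In this sketch, an input shorter than `n` yields an empty list; whether the real methods behave the same way is not stated in this page.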
public static java.util.List<java.lang.String> negateTokens(java.util.List<java.lang.String> tokens, java.util.Set<java.lang.String> set)
Parameters:
tokens - the list of tokens to negate
set - the set of negation words to use

public static java.util.List<java.lang.String> clustList(java.util.List<java.lang.String> tokens, java.util.Map<java.lang.String,java.lang.String> dict)
Parameters:
tokens - the input tokens
dict - the dictionary with the word clusters

public static it.unimi.dsi.fastutil.objects.Object2IntMap<java.lang.String> calculateTermFreq(java.util.List<java.lang.String> tokens, java.lang.String prefix, boolean freqWeights)
Parameters:
tokens - the input tokens
prefix - the prefix of each vector attribute
freqWeights - true to use term-frequency weights (boolean weights are used otherwise)

public static java.util.List<java.lang.String> calculateTokenNgram(java.util.List<java.lang.String> tokens, int n)
Parameters:
tokens - the input tokens from which the word n-grams will be calculated
n - the size of the word n-gram

public static java.util.List<java.lang.String> extractCharNgram(java.lang.String content, int n)
Parameters:
content - the input String
n - the size of the character n-gram

public static java.util.List<java.lang.String> tokenize(java.lang.String content, boolean toLowerCase, boolean standarizeUrlsUsers, boolean reduceRepeatedLetters, Tokenizer tokenizer, Stemmer stemmer, StopwordsHandler stop)
Parameters:
content - the content to tokenize
toLowerCase - true to lowercase the content
standarizeUrlsUsers - true to standardize URLs and users
reduceRepeatedLetters - true to reduce repeated letters
tokenizer - the tokenizer
stemmer - the stemmer
stop - the stopwords handler
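
The negation and term-frequency behavior described for `negateTokens` and `calculateTermFreq` can be illustrated with a small standalone sketch. Everything in it is an assumption rather than this library's code: the class `NegationSketch`, the `NEG-` prefix, the set of characters treated as punctuation, the choice to leave the punctuation token itself unprefixed, and the use of `java.util.Map` in place of fastutil's `Object2IntMap` so that the example has no dependencies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class NegationSketch {

    // Assumed prefix; the library's actual prefix is not stated in this page.
    static final String NEG_PREFIX = "NEG-";

    // Prefixes tokens that follow a negation word until a punctuation token.
    public static List<String> negate(List<String> tokens, Set<String> negationWords) {
        List<String> out = new ArrayList<>();
        boolean inScope = false;
        for (String token : tokens) {
            if (token.matches("[.,:;!?]+")) {
                // Assumption: punctuation closes the scope and is left unprefixed.
                inScope = false;
                out.add(token);
            } else if (inScope) {
                out.add(NEG_PREFIX + token);
            } else {
                out.add(token);
                if (negationWords.contains(token)) {
                    inScope = true; // the negation word itself stays unchanged
                }
            }
        }
        return out;
    }

    // Maps prefix + token to its count (freqWeights) or to 1 (boolean weights).
    public static Map<String, Integer> termFreq(List<String> tokens, String prefix,
                                                boolean freqWeights) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        for (String token : tokens) {
            String attribute = prefix + token;
            if (freqWeights) {
                weights.merge(attribute, 1, Integer::sum);
            } else {
                weights.put(attribute, 1);
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("i", "do", "not", "like", "it", ".", "great");
        System.out.println(negate(tokens, Set.of("not")));
        // [i, do, not, NEG-like, NEG-it, ., great]
        System.out.println(termFreq(List.of("a", "b", "a"), "W-", true));
        // {W-a=2, W-b=1}
    }
}
```

With `freqWeights` set to `false`, the same input yields a weight of 1 for every distinct attribute, which corresponds to the boolean weighting mentioned in the parameter description.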