TweetToSparseFeatureVector

java.lang.Object
- weka.filters.Filter
  - weka.filters.SimpleFilter
    - weka.filters.SimpleBatchFilter
      - weka.filters.unsupervised.attribute.TweetToFeatureVector
        - weka.filters.unsupervised.attribute.TweetToSparseFeatureVector

All Implemented Interfaces:

java.io.Serializable, CapabilitiesHandler, CapabilitiesIgnorer, CommandlineRunnable, OptionHandler, RevisionHandler
```
public class TweetToSparseFeatureVector
extends TweetToFeatureVector
```
An attribute filter that calculates different types of sparse features for a tweet represented as a string attribute. The feature types include word n-grams, character n-grams, POS tags, and Brown word clusters. The size of the attribute space depends on the training dataset.

BibTeX:
```
@Article{NRCJAIR14,
  Title   = {Sentiment analysis of short informal texts},
  Author  = {Kiritchenko, Svetlana and Zhu, Xiaodan and Mohammad, Saif M},
  Journal = {Journal of Artificial Intelligence Research},
  Year    = {2014},
  Pages   = {723--762},
  Volume  = {50}
}
```
Version:

$Revision: 2 $

Author:

Felipe Bravo-Marquez (fbravoma@waikato.ac.nz)

See Also:

Serialized Form

Field Summary

Fields
Modifier and Type	Field	Description
`static java.lang.String`	`RESOURCES_FOLDER_NAME`	Default path to where resources are stored.

Constructor Summary

Constructors
Constructor	Description
`TweetToSparseFeatureVector()`

Method Summary

Modifier and Type	Method	Description
`it.unimi.dsi.fastutil.objects.Object2IntMap<java.lang.String>`	`calculateDocVec(java.lang.String content)`	Calculates a vector of attributes from a String
`int`	`getCharNgramMaxDim()`
`int`	`getCharNgramMinDim()`
`int`	`getClustNgramMaxDim()`
`int`	`getMinAttDocs()`
`int`	`getPosNgramMaxDim()`
`java.util.List<java.lang.String>`	`getPOStags(java.util.List<java.lang.String> tokens)`	Returns POS tags from a List of tokens using the CMU TweetNLP tool
`java.io.File`	`getTaggerFile()`
`TechnicalInformation`	`getTechnicalInformation()`	Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.
`java.io.File`	`getWordClustFile()`
`int`	`getWordNgramMaxDim()`
`java.lang.String`	`globalInfo()`	Returns a string describing this filter.
`void`	`initializeTagger()`	Initializes the POS tagger
`void`	`initiliazeNegationEvaluator()`	Initializes the NegationEvaluator object
`boolean`	`isCalculateCharNgram()`
`boolean`	`isFreqWeights()`
`boolean`	`isNegateTokens()`
`static void`	`main(java.lang.String[] args)`	Main method for testing this class.
`void`	`setCalculateCharNgram(boolean calculateCharNgram)`
`void`	`setCharNgramMaxDim(int charNgramMaxDim)`
`void`	`setCharNgramMinDim(int charNgramMinDim)`
`void`	`setClustNgramMaxDim(int clustNgramMaxDim)`
`void`	`setFreqWeights(boolean freqWeights)`
`void`	`setMinAttDocs(int minAttDocs)`
`void`	`setNegateTokens(boolean negateTokens)`
`void`	`setPosNgramMaxDim(int posNgramMaxDim)`
`void`	`setTaggerFile(java.io.File taggerFile)`
`void`	`setWordClustFile(java.io.File wordClustFile)`
`void`	`setWordNgramMaxDim(int wordNgramMaxDim)`
`void`	`tweetsToVectors(Instances tweetInstances)`	Processes a batch of tweets.

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait

Methods inherited from class weka.filters.SimpleBatchFilter
batchFinished, input

Methods inherited from class weka.filters.SimpleFilter
setInputFormat

Methods inherited from class weka.filters.unsupervised.attribute.TweetToFeatureVector
allowAccessToFullInputFormat, getCapabilities, getOptions, getStemmer, getStopwordsHandler, getTextIndex, getTokenizer, isReduceRepeatedLetters, isStandarizeUrlsUsers, isToLowerCase, listOptions, setOptions, setReduceRepeatedLetters, setStandarizeUrlsUsers, setStemmer, setStopwordsHandler, setTextIndex, setTokenizer, setToLowerCase

Field Detail
RESOURCES_FOLDER_NAME
```
public static java.lang.String RESOURCES_FOLDER_NAME
```
Default path to where resources are stored.

Constructor Detail
TweetToSparseFeatureVector
```
public TweetToSparseFeatureVector()
```

Method Detail

globalInfo
```
public java.lang.String globalInfo()
```
Returns a string describing this filter.

Specified by:

globalInfo in class SimpleFilter

Returns:

a description of the filter suitable for displaying in the Explorer/Experimenter GUI

getTechnicalInformation
```
public TechnicalInformation getTechnicalInformation()
```
Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.

Returns:

the technical information about this class

initializeTagger
```
public void initializeTagger()
```
Initializes the POS tagger.

initiliazeNegationEvaluator
```
public void initiliazeNegationEvaluator()
```
Initializes the NegationEvaluator object.

getPOStags
```
public java.util.List<java.lang.String> getPOStags(java.util.List<java.lang.String> tokens)
```
Returns POS tags for a List of tokens using the CMU TweetNLP tool.

Parameters:

tokens - the input tokens

Returns:

the list of POS tags

calculateDocVec
```
public it.unimi.dsi.fastutil.objects.Object2IntMap<java.lang.String> calculateDocVec(java.lang.String content)
```
Calculates a vector of attributes from a String.

Parameters:

content - the input

Returns:

an Object2IntMap object mapping the attributes to their values

tweetsToVectors
```
public void tweetsToVectors(Instances tweetInstances)
```
Processes a batch of tweets.

Parameters:

tweetInstances - the input tweets

getMinAttDocs

@OptionMetadata(displayName="minAttDocs",
                description="Minimum frequency of a sparse attribute to be considered in the attribute space.",
                commandLineParamName="M",
                commandLineParamSynopsis="-M <int>",
                displayOrder=6)
public int getMinAttDocs()
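
The effect of a minimum-frequency threshold on the attribute space can be illustrated with a small standalone sketch. It assumes the threshold counts the number of tweets in which an attribute occurs (the name minAttDocs suggests this, but this page does not confirm it). The class `MinAttDocsSketch` and its method are hypothetical and are not part of this filter.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not the filter's actual implementation.
final class MinAttDocsSketch {

    /** Keeps the attributes that occur in at least minAttDocs document vectors. */
    static Set<String> attributeSpace(List<Map<String, Integer>> docVecs, int minAttDocs) {
        // Count in how many documents each attribute appears (assumed semantics).
        Map<String, Integer> docCounts = new HashMap<>();
        for (Map<String, Integer> vec : docVecs) {
            for (String att : vec.keySet()) {
                docCounts.merge(att, 1, Integer::sum);
            }
        }
        // Retain only attributes that reach the threshold.
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : docCounts.entrySet()) {
            if (e.getValue() >= minAttDocs) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> first = new HashMap<>();
        first.put("good", 1);
        first.put("day", 1);
        Map<String, Integer> second = new HashMap<>();
        second.put("good", 2);
        System.out.println(attributeSpace(Arrays.asList(first, second), 2));
    }
}
```

Raising the threshold shrinks the attribute space, which is one reason its size depends on the training dataset.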

setMinAttDocs

public void setMinAttDocs(int minAttDocs)

isFreqWeights

@OptionMetadata(displayName="freqWeights",
                description="True if the value of each feature is set to its frequency in the tweet. Boolean weights are used otherwise.\n",
                commandLineParamIsFlag=true,
                commandLineParamName="F",
                commandLineParamSynopsis="-F",
                displayOrder=7)
public boolean isFreqWeights()
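
The difference between frequency weights and boolean weights can be shown with a minimal sketch. The class `FreqWeightsSketch` is hypothetical and only mirrors the option description; it uses a plain `java.util.Map` rather than the `Object2IntMap` returned by `calculateDocVec`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the filter's actual implementation.
final class FreqWeightsSketch {

    /** Maps each feature to its count in the tweet, or to 1 when freqWeights is false. */
    static Map<String, Integer> weigh(List<String> features, boolean freqWeights) {
        Map<String, Integer> vec = new LinkedHashMap<>();
        for (String f : features) {
            if (freqWeights) {
                vec.merge(f, 1, Integer::sum); // accumulate occurrences
            } else {
                vec.put(f, 1);                 // presence only
            }
        }
        return vec;
    }

    public static void main(String[] args) {
        List<String> features = Arrays.asList("so", "so", "good");
        System.out.println(weigh(features, true));
        System.out.println(weigh(features, false));
    }
}
```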

setFreqWeights

public void setFreqWeights(boolean freqWeights)

getWordNgramMaxDim

@OptionMetadata(displayName="wordNgramMaxDim",
                description="Maximum size for the word n-gram features. \n\t Set this variable to zero for no word n-gram attributes. All word n-grams from i=1 to this value will be extracted.",
                commandLineParamName="Q",
                commandLineParamSynopsis="-Q <int>",
                displayOrder=8)
public int getWordNgramMaxDim()
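
As a rough sketch of what "all word n-grams from i=1 to this value" means, the standalone class below enumerates n-grams over a token list. The class name `WordNgramSketch` and the choice of a space as the joining character are assumptions made for illustration; the filter's own attribute naming may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the filter's actual implementation.
final class WordNgramSketch {

    /** Returns all word n-grams for n = 1..maxDim; maxDim = 0 yields no n-grams. */
    static List<String> wordNgrams(List<String> tokens, int maxDim) {
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= maxDim; n++) {
            // Slide a window of n tokens across the tweet.
            for (int i = 0; i + n <= tokens.size(); i++) {
                out.add(String.join(" ", tokens.subList(i, i + n)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(wordNgrams(Arrays.asList("i", "like", "it"), 2));
    }
}
```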

setWordNgramMaxDim

public void setWordNgramMaxDim(int wordNgramMaxDim)

isNegateTokens

@OptionMetadata(displayName="negateTokens",
                description="Add a prefix to words occurring in negated contexts e.g., I don\'t like you => I don\'t NEG-like NEG-you.\n \t The prefixes only affect word n-gram features. The scope of negation finishes with the next punctuation mark.",
                commandLineParamIsFlag=true,
                commandLineParamName="R",
                commandLineParamSynopsis="-R",
                displayOrder=9)
public boolean isNegateTokens()
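
The negation behaviour described above can be sketched as follows. The list of negation words and the set of punctuation marks are assumptions chosen for this example; the filter delegates this to its NegationEvaluator object, whose actual rules are not documented on this page. The class `NegationSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not the filter's NegationEvaluator.
final class NegationSketch {

    // Assumed negation cues; the real list may differ.
    private static final Set<String> NEGATORS = new HashSet<>(
            Arrays.asList("not", "no", "never", "don't", "can't", "won't"));

    /** Prefixes tokens that follow a negation cue with "NEG-" until the next punctuation mark. */
    static List<String> negate(List<String> tokens) {
        List<String> out = new ArrayList<>();
        boolean inScope = false;
        for (String tok : tokens) {
            if (tok.matches("[.,:;!?]+")) {
                inScope = false;          // punctuation closes the negation scope
                out.add(tok);
            } else if (inScope) {
                out.add("NEG-" + tok);    // token inside a negated context
            } else {
                out.add(tok);
                if (NEGATORS.contains(tok.toLowerCase(Locale.ROOT))) {
                    inScope = true;       // scope opens after the cue itself
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(negate(Arrays.asList("I", "don't", "like", "you", ".", "bye")));
    }
}
```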

setNegateTokens

public void setNegateTokens(boolean negateTokens)

isCalculateCharNgram

@OptionMetadata(displayName="calculateCharNgram",
                description="Calculate character n-gram features.",
                commandLineParamIsFlag=true,
                commandLineParamName="A",
                commandLineParamSynopsis="-A",
                displayOrder=10)
public boolean isCalculateCharNgram()

setCalculateCharNgram

public void setCalculateCharNgram(boolean calculateCharNgram)

getCharNgramMinDim

@OptionMetadata(displayName="charNgramMinDim",
                description="The minimum dimension for character n-grams.",
                commandLineParamName="D",
                commandLineParamSynopsis="-D <int>",
                displayOrder=11)
public int getCharNgramMinDim()

setCharNgramMinDim

public void setCharNgramMinDim(int charNgramMinDim)

getCharNgramMaxDim

@OptionMetadata(displayName="charNgramMaxDim",
                description="The maximum dimension for character n-grams.",
                commandLineParamName="E",
                commandLineParamSynopsis="-E <int>",
                displayOrder=12)
public int getCharNgramMaxDim()
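
How the minimum and maximum dimensions bound the extracted character n-grams can be sketched with a small standalone class. `CharNgramSketch` is hypothetical, and whether the filter extracts n-grams over the raw tweet or over a normalized form is not stated on this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the filter's actual implementation.
final class CharNgramSketch {

    /** Returns every substring of length n for minDim <= n <= maxDim. */
    static List<String> charNgrams(String text, int minDim, int maxDim) {
        List<String> out = new ArrayList<>();
        // Guard against non-positive lengths, which would not form an n-gram.
        for (int n = Math.max(1, minDim); n <= maxDim; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                out.add(text.substring(i, i + n));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(charNgrams("good", 3, 4));
    }
}
```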

setCharNgramMaxDim

public void setCharNgramMaxDim(int charNgramMaxDim)

getPosNgramMaxDim

@OptionMetadata(displayName="posNgramMaxDim",
                description="The maximum size for POS n-grams. Set this variable to zero for no POS attributes. \n\t The tweets are POS-tagged using the CMU TweetNLP tool.",
                commandLineParamName="G",
                commandLineParamSynopsis="-G <int>",
                displayOrder=13)
public int getPosNgramMaxDim()

setPosNgramMaxDim

public void setPosNgramMaxDim(int posNgramMaxDim)

getClustNgramMaxDim

@OptionMetadata(displayName="clustNgramMaxDim",
                description="The maximum dimension for n-grams calculated with Brown word clusters.\n\t Set this variable to zero for no word-clusters attributes. \n\t The word clusters are taken from the CMU Tweet NLP tool.",
                commandLineParamName="I",
                commandLineParamSynopsis="-I <int>",
                displayOrder=14)
public int getClustNgramMaxDim()
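
One way to picture cluster n-grams is to replace each word with its Brown cluster identifier and then build n-grams over the resulting sequence. The sketch below assumes that words without a cluster are skipped and that identifiers are joined with a hyphen; both are illustrative choices, and the class `ClusterNgramSketch` and the sample cluster identifiers are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the filter's actual implementation.
final class ClusterNgramSketch {

    /** Maps tokens to cluster ids (skipping unknown words) and returns n-grams for n = 1..maxDim. */
    static List<String> clusterNgrams(List<String> tokens, Map<String, String> wordToCluster, int maxDim) {
        // Translate the token sequence into a cluster-id sequence.
        List<String> clusters = new ArrayList<>();
        for (String tok : tokens) {
            String cluster = wordToCluster.get(tok);
            if (cluster != null) {
                clusters.add(cluster);
            }
        }
        // Build n-grams over the cluster ids.
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= maxDim; n++) {
            for (int i = 0; i + n <= clusters.size(); i++) {
                out.add(String.join("-", clusters.subList(i, i + n)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> wordToCluster = new HashMap<>();
        wordToCluster.put("good", "0110"); // made-up cluster ids
        wordToCluster.put("day", "1011");
        System.out.println(clusterNgrams(Arrays.asList("good", "xyz", "day"), wordToCluster, 2));
    }
}
```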

setClustNgramMaxDim

public void setClustNgramMaxDim(int clustNgramMaxDim)

getTaggerFile

@OptionMetadata(displayName="taggerFile",
                description="The file with TweetNLP POS tagger model.",
                commandLineParamName="taggerFile",
                commandLineParamSynopsis="-taggerFile <string>",
                displayOrder=15)
public java.io.File getTaggerFile()

setTaggerFile

public void setTaggerFile(java.io.File taggerFile)

getWordClustFile

@OptionMetadata(displayName="wordClustFile",
                description="The file with the word clusters in gzip format.",
                commandLineParamName="wordClustFile",
                commandLineParamSynopsis="-wordClustFile <string>",
                displayOrder=16)
public java.io.File getWordClustFile()

setWordClustFile

public void setWordClustFile(java.io.File wordClustFile)

main
```
public static void main(java.lang.String[] args)
```
Main method for testing this class.

Parameters:

args - should contain arguments to the filter: use -h for help
