Grouping tweets into categories and filtering before applying Sentiment Analysis functions
Note: this is a question/idea b/c I may be incorrect in thinking that Pulse does not address the issues I'm outlining below. Perhaps Pulse can address this to some extent with a little ingenuity on my part.
Vertica's new innovation Pulse has been very cool in that we can analyze the sentiment of tweets. However, I've been running into two separate (but related) issues/needs that I don't believe Pulse addresses (please correct me if I'm wrong).
1. Grouping tweets: Say I've filtered my incoming tweets down to the tweets I want. But now I want to group these tweets based on different "types". For example, these "types" could be the following: a) expression of feeling or sentiment; b) funny; c) initiation of a long/short discussion; d) user sharing an experience; e) hoping for something; etc.
A use case for grouping tweets: say I own a chain of restaurants in a given city. I want to analyze tweets about my company as well as many companies like mine throughout the world, so that I can see in real time what they are doing, how customers are reacting, etc. This would let me know what mistakes my competitors have made (so I don't make them), understand my competitors better, and get much more useful information. Currently I'd have to depend on keywords such as "small restaurant chain", etc. But most of the time this kind of keyword would not be included in a tweet. So I'd have to hardcode the whitelist to include my company's name, competitor company names, and any other similar companies. But I'd have to know about all of these companies to do this. I'd rather be able to invoke some kind of function or AI machine to determine what "kind" of tweets I'm interested in and then further categorize them so that I'm not mixing up McDonalds with some small cupcake shop (for example), neither of which I'd be interested in b/c I'm more similar to a steakhouse (again, for example). And then, even further, group these in ways similar to what I proposed above so I better understand the reason for the results. Anyhow, my point is about the ability to group tweets based on different attributes for different cases.
2. Sophisticated filtering: Assume for this that I've utilized all (or as many as I want) of Pulse's tuning techniques (e.g. updating Pulse dictionaries to fit my needs, utilizing mappings, whitelistonly, etc.). Now, after tuning to the best of my ability, say I still have random tweets that are completely useless to my analysis and are biasing the outcome in some direction. Maybe there aren't many of these "useless" tweets, but I'd rather not have to go look for them and delete them from my analysis. I realize that there will always be "noise" associated with an analysis of this type (the type being social media data). However, perhaps Pulse has a function to account for possible noise (which most likely will always be there). If not, I guess I'm just throwing out a thought.
An example of what I'm talking about: say I want to determine how people feel about a certain city's airport on a particular day. After viewing some of these tweets, I see that some people actually tag the airport in a tweet; others don't. So sometimes you have to use different attributes, like place attributes. Or it may involve broadening the whitelist (in which case you will probably get tweets about every city's airport). Or you do both, but then you might still miss some critical tweets about the airport of interest because the tweeter is currently in a different country. All of this comes down to the fact that we will never capture all the tweets that are important to answering our question, and/or we'll get too many tweets, which adds bias to the answer.
So 1. and 2. are basically linked: if 1. can be solved then 2. can be solved (note that I'm more interested in solving 1. b/c Pulse does have many tuning techniques that can be used to almost solve 2.).
Anyhow, I didn't provide concrete examples here due to time (most of the examples I recall are ones I ran into a couple of weeks ago, so I'd have to go looking for them). But I just thought I'd post my thoughts. Feedback and suggestions are welcome. I am already onto the idea of machine learning to tackle this problem of grouping; however, it is somewhat time-consuming.
Comments
Thanks for giving Pulse a try and we're glad that it works well for you. You are right about the fact that Pulse doesn't address the mentioned concerns. Pulse has been designed to analyze the sentiment on individual attributes in a tweet and not the tweet as a whole.
1. Grouping tweets: Pulse doesn't have the capability to group tweets. The only way to achieve this effect at present is to use the whitelist to identify the interesting attributes.
For example, put "steak" in the whitelist. Then do
insert into tweet_sentiment select id, sentimentanalysis(text) over() from tweets;
Pick out the tweets with steak:
select id from tweet_sentiment where attribute='steak';
and then look at the other attributes in those tweets:
select attribute, sentiment_score from tweet_sentiment where id in (select id from tweet_sentiment where attribute='steak');
But you really need to know company names in order to differentiate between large chain-steakhouses and small local ones.
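If you do know at least some of the names, one way to keep that knowledge out of hardcoded queries is a small lookup table that maps whitelist attributes to a category, joined against the Pulse output. This is only a sketch: company_category and the rows in it are hypothetical (they are not part of Pulse), and it assumes the attribute values stored in tweet_sentiment match the lookup entries exactly, so you may need to normalize case first.

```sql
-- Hypothetical lookup table (not part of Pulse): maps a whitelisted
-- attribute to a business category of your choosing.
create table company_category (
    attribute varchar(100),
    category  varchar(50)
);

-- Example rows; these names are placeholders for your own whitelist entries.
insert into company_category values ('acme steakhouse', 'local steakhouse');
insert into company_category values ('bigchain grill', 'chain steakhouse');

-- Sentiment per category, using the attribute and sentiment_score columns
-- that sentimentanalysis() wrote into tweet_sentiment.
select c.category,
       count(*)               as mentions,
       avg(s.sentiment_score) as avg_sentiment
from tweet_sentiment s
join company_category c on s.attribute = c.attribute
group by c.category;
```

Every name in the lookup table still has to be in the whitelist for Pulse to report it, so this groups results for companies you already know about; it does not discover new ones.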
2. Sophisticated Filtering: Again since Pulse doesn't analyze the tweet as a whole, there's no way of identifying which tweets are noise and which are relevant. You'd get a sentiment on all the attributes in a tweet. That's why we suggest that you look at the results in aggregate. Using the whitelist to identify the tweets that are relevant is the way to go.
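As a rough illustration of looking at the results in aggregate, something like the following could be run against the tweet_sentiment table populated above. It only assumes the attribute and sentiment_score columns already used in the earlier queries; the column aliases are made up for readability.

```sql
-- Summarize sentiment per attribute instead of reading individual tweets.
-- A handful of noisy tweets has less influence on an average over many rows.
select attribute,
       count(*)             as mentions,
       avg(sentiment_score) as avg_sentiment
from tweet_sentiment
group by attribute
order by mentions desc;
```

Attributes with very few mentions are the ones most likely to be skewed by noise, so it can help to read avg_sentiment alongside the mentions count rather than on its own.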