Inquiries on the LINEAR_REG function
I am interested in using the machine learning functions for linear regression, mainly:
- LINEAR_REG
- PREDICT_LINEAR_REG
I have a table that has these columns:
- time_epoch (numeric) – denoting a "x" the independent variable
- score (numeric) - denoting a "y" the dependent variables
- location (VARCHAR) – for each value of this variable, the values of (x,y) are the linear relationship I want to model and predict on
Example:
dbadmin=> select * from test.my_table; time_epoch (x) | location | score (y) -------------------+------------------------------------+ 1591754400.000000 | AAAA | 0.0427034187223203 1591758000.000000 | AAAA | 0.0447975381705089 1591761600.000000 | AAAA | 0.0444945871539881 1591765200.000000 | AAAA | 0.0368878016429624 1591768800.000000 | AAAA | 0.0478755857529369 1591754400.000000 | BBBB | 0.0297034187223203 1591758000.000000 | BBBB | 0.0283424324324324 1591761600.000000 | BBBB | 0.0274945871539881 1591765200.000000 | BBBB | 0.0268878016429624 1591768800.000000 | BBBB | 0.0268755857529369 1591754400.000000 | CCCC | 0.0392133432432432 1591758000.000000 | CCCC | 0.0372343243243234 1591761600.000000 | CCCC | 0.0363143243243243 1591765200.000000 | CCCC | 0.0354382943203243 1591768800.000000 | CCCC | 0.0342343243243243 1591754400.000000 | DDDD | 0.0213213213213213 1591758000.000000 | DDDD | 0.0202132132132132 1591761600.000000 | DDDD | 0.0181321321321321 1591765200.000000 | DDDD | 0.0162132132132132 1591768800.000000 | DDDD | 0.0123213213232321
I want to use LINEAR_REG function to predict the "score" column (i.e., "y" values) at several time_epochs (i.e., "x" values) for each location.
Note: The values of "x" for which we have data for AND for what to predict "y" for will be the same for all locations.
I have 2 questions:
Question 1
From documentation, I cannot use non-numeric values, but is there a way around it? I would like to do something like:
- **score **column as the "response column"
- time_epoch, and location as the "predictor columns"
Invocation example:
SELECT LINEAR_REG('my_prediction', 'test.my_table', 'score', 'time_epoch,location' USING PARAMETERS optimizer='BFGS');
Question 2
Is there a function that can predict linear regression without having to perform two calls (LINEAR_REG and PREDICT_LINEAR_REG). Ideally, I am looking for a single function call for both.
Thanks for your help!
Answers
@jquan Based on the information you have shared location is varchar column. Since it is categorical column, please perform encoding and then use the encoded data. Please find link below which explains regarding one hot encoding.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/MachineLearning/DataPreparation/EncodingCategoricalColumns.htm
Yes you can have more than one predictor variable please find the sample below
SELECT LINEAR_REG('my_prediction', 'test.my_table', 'score', 'time_epoch','location_id' USING PARAMETERS optimizer='BFGS');
Regarding question 2, you have to perform 2 calls, There is no function which can perform both the tasks single function call.