Poor man's data_collector forever

Don't let Data Collector info roll over.

Here is how you can save it into a Vertica database before it's rolled over and deleted.


# create a dc schema for all dc tables
vsql -c "create schema dc;"

# in catalog directory
cd DataCollector

# temp must sit next to the log files: the loop below hard-links them into temp/
mkdir temp

# load a copy of the DC Schema
for file in CREATE_*.sql ; do vsql -a -f "$file"; done

## to load data, watch for deleted files and load them into Vertica

while read -r event_filename_ts; do
    #echo "--${event_filename_ts}"
    # each line looks like "<EVENT> <name>_<ts>.log"; split it into its three parts
    event=$(echo "$event_filename_ts" | sed 's/\(.*\) \(.*\)_\(.*\)\.log/\1/g')
    filename=$(echo "$event_filename_ts" | sed 's/\(.*\) \(.*\)_\(.*\)\.log/\2/g')
    ts=$(echo "$event_filename_ts" | sed 's/\(.*\) \(.*\)_\(.*\)\.log/\3/g')
    #echo "--${event}"
    #echo "--${filename}"
    #echo "--${ts}"
    if [ "${event}" = "CREATE" ]; then
        # create a temporary hard link so the data survives when the original is deleted
        ln "${filename}_${ts}.log" "temp/${filename}_${ts}.ln"
    elif [ "${event}" = "DELETE" ] && [ -e "temp/${filename}_${ts}.ln" ]; then
        # load the file and dispose of it
        sed "s/${filename}_\*\.log/temp\/${filename}_${ts}.ln/g" "COPY_${filename}_TABLE.sql" | vsql
        rm -f "temp/${filename}_${ts}.ln"
    fi
done < <(inotifywait -m -e "CREATE,DELETE" --format "%e %f" . )

## clean up when done
rm -rf temp
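
The loop above splits every inotifywait line into an event, a base name, and a timestamp with three sed calls. As a rough sketch, the same split can be done with plain POSIX parameter expansion. The function name parse_dc_event and the sample file names are made up for illustration, and the code assumes log files are always named <name>_<ts>.log.

```shell
#!/bin/sh
# Split an inotifywait "%e %f" line such as "CREATE Errors_123.log"
# into "<event> <name> <ts>". Returns non-zero for anything else
# (for example the "CREATE,ISDIR temp" line produced by mkdir temp).
# Note: plain sh has no "local", so the variables below are global.
parse_dc_event() {
    line=$1
    case $line in
        *" "*) ;;            # must contain an event and a file name
        *) return 1 ;;
    esac
    event=${line%% *}        # text before the first space
    file=${line#* }          # text after the first space
    case $file in
        *_*.log) ;;          # assumed naming scheme: <name>_<ts>.log
        *) return 1 ;;
    esac
    stem=${file%.log}
    # name = everything before the last underscore, ts = everything after it
    printf '%s %s %s\n' "$event" "${stem%_*}" "${stem##*_}"
}

parse_dc_event "CREATE Errors_1234567890.log"   # prints: CREATE Errors 1234567890
parse_dc_event "DELETE Foo_Bar_42.log"          # prints: DELETE Foo_Bar 42
```

Inside the loop this could be used as `read -r event filename ts < <(parse_dc_event "$event_filename_ts") || continue`, which also skips lines that are not log files.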


1. Do it on all nodes.

2. Daemonize this script and run it as a service.
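
For the second point, a very rough sketch of running the watcher in the background with a PID file is shown here. The helper names start_daemon and stop_daemon and the script name dc_watch.sh are hypothetical; a real init script or service unit would likely be sturdier (restart on failure, logging).

```shell
#!/bin/sh
# Start a command detached from the terminal and record its PID.
# usage: start_daemon <pidfile> <command> [args...]
start_daemon() {
    pidfile=$1
    shift
    nohup "$@" >/dev/null 2>&1 &
    echo $! > "$pidfile"
}

# Stop the process recorded in the PID file and remove the file.
# usage: stop_daemon <pidfile>
stop_daemon() {
    pidfile=$1
    [ -f "$pidfile" ] || return 1
    kill "$(cat "$pidfile")" 2>/dev/null
    rm -f "$pidfile"
}

# Example (dc_watch.sh is a hypothetical file holding the loop above):
#   start_daemon /tmp/dc_watch.pid bash ./dc_watch.sh
#   stop_daemon  /tmp/dc_watch.pid
```

The same pair of commands would have to be run on every node, since each node has its own DataCollector directory.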



    Hi Sumeet,


    Interesting script.  A couple of caveats:


    - At least in the version of 7.2 that I looked at, the CREATE SQL scripts don't include PARTITION BY clauses, so with this exact method there's no way to implement a data retention policy, unless that has changed in more recent versions of 7.2.


    - No one will truly want "data_collector forever" because the amount of data would balloon out of control. Take the dc_execution_engine_profiles data, for example: I typically set a 5 or 10 second threshold and even then collect billions of rows in a 2-4 week period. This script should be accompanied by a drop_partition() script that keeps the less interesting lower-level data for only a short period of time. I tend to keep such lower-level data for 2-4 weeks and keep query summarization data such as query_requests or query_profiles forever, though I collect that data by querying the actual dc tables rather than loading directly from the DataCollector directory.
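
A retention job along those lines might look roughly like the sketch below. It only builds the SQL text. It assumes the target table has already been given a date-valued PARTITION BY key (which the stock CREATE scripts reportedly lack), and the table name dc.dc_execution_engine_profiles is a placeholder, since the real names depend on what the CREATE_*.sql scripts define. DROP_PARTITION is the Vertica 7.x function name; check the name and arguments against your version's documentation.

```shell
#!/bin/sh
# Print a DROP_PARTITION call for one table and one partition key.
# usage: drop_partition_sql <schema.table> <partition_key>
drop_partition_sql() {
    printf "SELECT DROP_PARTITION('%s', '%s');\n" "$1" "$2"
}

# Placeholder table name and a fixed key, just to show the generated statement.
drop_partition_sql dc.dc_execution_engine_profiles 2016-01-01
# prints: SELECT DROP_PARTITION('dc.dc_execution_engine_profiles', '2016-01-01');
```

Run daily from cron, something like `drop_partition_sql dc.dc_execution_engine_profiles "$(date -d '28 days ago' +%F)" | vsql` would drop the partition that has just aged past four weeks (`date -d` is GNU date syntax).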





