Poor man's data_collector forever

skeswani - Employee

 

Don't let Data Collector info roll over.

Here is how you can save it into a Vertica database before it's rolled over and deleted.

 

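# create a dc schema for all dc tables
vsql -c "create schema dc;"

# in catalog directory
cd DataCollector

# scratch directory for hard links; it has to live inside DataCollector
# because the loop below uses relative temp/ paths
mkdir -p temp

# load a copy of the DC Schema
for file in CREATE_*.sql ; do vsql -a -f "$file"; done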

## to load data, watch for deleted files and load them into vertica
## (needs bash for the process substitution and inotifywait from inotify-tools)

while read -r event_filename_ts; do
    # split "EVENT Name_timestamp.log" into event, file name prefix and timestamp
    #echo "--${event_filename_ts}"
    event=$(echo "$event_filename_ts" | sed 's/\(.*\) \(.*\)_\(.*\)\.log/\1/')
    filename=$(echo "$event_filename_ts" | sed 's/\(.*\) \(.*\)_\(.*\)\.log/\2/')
    ts=$(echo "$event_filename_ts" | sed 's/\(.*\) \(.*\)_\(.*\)\.log/\3/')
    #echo "--${event}"
    #echo "--${filename}"
    #echo "--${ts}"
    if [ "${event}" = "CREATE" ]; then
        # hard-link the new file so its data survives when the original is deleted
        ln "${filename}_${ts}.log" "temp/${filename}_${ts}.ln"
    elif [ "${event}" = "DELETE" ] && [ -e "temp/${filename}_${ts}.ln" ]; then
        # point the COPY script at the link, load it, then dispose of the link
        sed "s/${filename}_\*\.log/temp\/${filename}_${ts}.ln/g" "COPY_${filename}_TABLE.sql" | vsql
        rm -f "temp/${filename}_${ts}.ln"
    fi
done < <(inotifywait -m -e "CREATE,DELETE" --format "%e %f" . )


## clean up when done
rm -rf temp
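
The parsing step in the loop above can be tried on its own before pointing it at a live DataCollector directory. This is a minimal sketch: the function name parse_dc_event and the sample file names are made up for illustration, and it assumes inotifywait emits lines of the form "EVENT Name_timestamp.log", as the --format "%e %f" option in the script suggests.

```shell
#!/usr/bin/env bash
# Split a line of the form "EVENT Name_timestamp.log" into its three parts,
# using the same sed pattern as the watch loop. parse_dc_event is a
# hypothetical helper name, not part of Vertica or inotify-tools.
parse_dc_event() {
    local line=$1
    echo "$line" | sed 's/\(.*\) \(.*\)_\(.*\)\.log/\1|\2|\3/'
}

# Sample lines are illustrative; real names come from the DataCollector directory.
parse_dc_event "CREATE RequestsIssued_123456.log"   # prints CREATE|RequestsIssued|123456
parse_dc_event "DELETE RequestsIssued_123456.log"   # prints DELETE|RequestsIssued|123456

# A line that does not end in .log passes through sed unchanged, so the
# event test in the loop never matches it and the file is ignored.
parse_dc_event "CREATE CREATE_RequestsIssued_TABLE.sql"
```

Because non-matching lines come back unchanged, the CREATE_*.sql and COPY_*.sql files in the same directory do not trigger a link or a load.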

Remember:

1. Do it on all nodes.

2. Daemonize this script and run it as a service.
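
One simple way to keep the loop running after you log out is to start it with nohup and record its PID. This is only a sketch under assumptions: the helper names start_detached and stop_detached, the script name dc_archive.sh, and the /tmp paths are all hypothetical, and a proper init or systemd service would be the more robust choice.

```shell
#!/usr/bin/env bash
# Start a command detached from the terminal and record its PID.
# Arguments: pid file, log file, then the command to run.
# (start_detached / stop_detached are hypothetical helper names.)
start_detached() {
    local pidfile=$1 logfile=$2
    shift 2
    nohup "$@" >> "$logfile" 2>&1 < /dev/null &
    echo $! > "$pidfile"
}

# Stop the process recorded in the pid file, if there is one.
stop_detached() {
    local pidfile=$1
    [ -f "$pidfile" ] || return 0
    kill "$(cat "$pidfile")" 2>/dev/null || true
    rm -f "$pidfile"
}

# Usage (dc_archive.sh is a hypothetical script holding the watch loop):
#   start_detached /tmp/dc_archive.pid /tmp/dc_archive.log ./dc_archive.sh
#   stop_detached  /tmp/dc_archive.pid
```

Note that stopping the script this way may leave its inotifywait child around until the next file event, so check for stray processes after a stop.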

 

Comments

  • Hi Sumeet,

    Interesting script. A couple of caveats:

    - At least in the version of 7.2 that I looked at, the CREATE SQL scripts don't include PARTITION BY clauses, so with this exact method there's no way to implement a data retention policy, unless that has changed in more recent versions of 7.2.

    - No one will truly want "data_collector forever", because the amount of data would balloon out of control. Take the dc_execution_engine_profiles data, for example: I typically set a 5 or 10 second threshold and even then collect billions of rows in a 2-4 week period. This script should be accompanied by a drop_partition() script that keeps the less interesting lower-level data for only a short period of time. I tend to keep such lower-level data for 2-4 weeks and keep query summarization data such as query_requests or query_profiles forever, though I collect that data by querying the actual dc tables rather than loading directly from the DataCollector directory.

      --Sharon

     

     
