
Poor man's data_collector forever

skeswani - Employee

Don't let Data Collector info roll over.

Here is how you can save it into a Vertica database before it is rolled over and deleted.

 

# create a dc schema for all dc tables
vsql -c "create schema dc;"

# the DataCollector directory lives inside the catalog directory
cd DataCollector
mkdir temp

# load a copy of the DC schema (one CREATE_*.sql script per dc table)
for file in CREATE_*.sql ; do vsql -a -f "$file"; done

## to load data, watch for deleted files and load them into Vertica

while read -r event_filename_ts; do
    # each inotifywait line looks like: "CREATE TableName_timestamp.log"
    event=$(echo "${event_filename_ts}"    | sed "s/\(.*\) \(.*\)_\(.*\)\.log/\1/")
    filename=$(echo "${event_filename_ts}" | sed "s/\(.*\) \(.*\)_\(.*\)\.log/\2/")
    ts=$(echo "${event_filename_ts}"       | sed "s/\(.*\) \(.*\)_\(.*\)\.log/\3/")
    if [ "${event}" = "CREATE" ]; then
        # hard-link the new log file; the link keeps the data reachable
        # even after the data collector deletes the original
        ln "${filename}_${ts}.log" "temp/${filename}_${ts}.ln"
    elif [ "${event}" = "DELETE" ] && [ -e "temp/${filename}_${ts}.ln" ]; then
        # point the COPY script at our link, load it, then dispose of it
        sed "s/${filename}_\*\.log/temp\/${filename}_${ts}.ln/g" "COPY_${filename}_TABLE.sql" | vsql
        rm -f "temp/${filename}_${ts}.ln"
    fi
done < <(inotifywait -m -e create,delete --format "%e %f" .)


## clean up when done
rm -rf temp
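
While the watcher is running, a quick sanity check that rows are landing; the table name here is an assumption about what the CREATE_*.sql scripts actually create:

# hypothetical table name; check what your CREATE_*.sql scripts produced
vsql -c "select count(*) from dc.dc_requests_issued;"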

Remember:

1. Do this on all nodes.

2. Daemonize this script and run it as a service (a sketch follows below).
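
Here is a minimal sketch of the service part using systemd; the unit name, script path, and user are assumptions, not from the original post. Save the watcher loop above as a script with a bash shebang (it uses process substitution), then:

# hypothetical unit file; name, path, and user are assumptions
cat > /etc/systemd/system/dc-forever.service <<'EOF'
[Unit]
Description=Save Vertica DataCollector logs into the dc schema
After=network.target

[Service]
Type=simple
ExecStart=/home/dbadmin/dc_forever.sh
Restart=on-failure
User=dbadmin

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now dc-forever.service

Installed on every node, this keeps the watcher running across restarts.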

 

Comments

  • Hi Sumeet,

    Interesting script. A couple of caveats:

    - At least in the version of 7.2 that I looked at, the CREATE SQL scripts don't include PARTITION BY clauses, so with this exact method there's no way to implement a data retention policy. Unless that's changed in more recent versions of 7.2.

    - No one will truly want "data_collector forever", because the amount of data would balloon out of control. Take the dc_execution_engine_profiles data: I typically set a 5 or 10 second threshold and even then collect billions of rows in a 2-4 week period. This script should be accompanied by a drop_partition() script that keeps the less interesting low-level data for a short period of time (see the sketch after this comment). I tend to keep such low-level data for 2-4 weeks, and keep query summarization data such as query_requests or query_profiles forever. Though I collect that data by querying the actual dc tables rather than loading directly from the DataCollector directory.

    --Sharon
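
A minimal sketch of the retention policy Sharon describes, assuming the saved tables keep the dc tables' time column and a Vertica release that supports ALTER TABLE ... PARTITION BY and DROP_PARTITION(); the table name and the two-week window are assumptions:

# repartition the saved copy by day (assumed table and column names)
vsql -c "ALTER TABLE dc.dc_execution_engine_profiles PARTITION BY time::date REORGANIZE;"

# run daily (e.g. from cron) to roll a two-week retention window forward
vsql -c "SELECT DROP_PARTITION('dc.dc_execution_engine_profiles', (CURRENT_DATE - 14)::varchar);"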

     

     
