Friday, 15 November 2019

Ezproxy harvesting - a walk-through of the steps so far

I am genuinely happy at the prospect of being able to analyse usage statistics for our electronic resources.  For so many years I have heard myself telling colleagues it was impossible that I feel ashamed I never seriously tried to pull it off before.

My Python -> MySQL model is shaping up well.

Here is an outline of how the process works so far.  (This will all be automated at a later stage; at the moment I am taking the place of the scheduled jobs.)

Step One: get the Ezproxy logs

We host our own Ezproxy server, so I just FTP the most recent batch to a network drive that allows me to run Python.

The log files I need are named along these lines:
  • ezproxy.log.04Nov2019
  • ezproxy.log.05Nov2019
  • ezproxy.log.06Nov2019
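
Because the date is part of the file name, a batch can be put into true chronological order before processing. This matters once a batch spans more than one month, since a plain alphabetical sort would place 01Dec2019 ahead of 04Nov2019. The following is only a minimal sketch: the function names `log_date` and `sort_logs` are my own, and it assumes every file follows the `ezproxy.log.DDMonYYYY` pattern shown above with English month abbreviations.

```python
from datetime import datetime
from pathlib import Path


def log_date(filename):
    """Return the date encoded in a name such as 'ezproxy.log.04Nov2019'."""
    # Take whatever follows the last dot, e.g. '04Nov2019'.
    suffix = Path(filename).name.rsplit(".", 1)[-1]
    # %d%b%Y = day, abbreviated month name, four-digit year.
    # Assumption: month abbreviations are in English (the default C locale).
    return datetime.strptime(suffix, "%d%b%Y").date()


def sort_logs(filenames):
    """Order log files by the date in their names, oldest first."""
    return sorted(filenames, key=log_date)


names = ["ezproxy.log.06Nov2019", "ezproxy.log.04Nov2019", "ezproxy.log.05Nov2019"]
print(sort_logs(names))
```

A file name that does not end in a date of this form raises `ValueError`, which is probably what you want for a stray file in the batch.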

Step Two: extract the details I need

From these huge log files, I only need a tiny subset of information:
  • IP address of the requester
  • User name of the requester
  • Timestamp
  • Which of our electronic resources they viewed

I do this at the command line, by going through the logs and cutting out what I need:

cat ezproxy*.log* | cut -d' ' -f1,3,4,7 |  grep 'connect?session' > ezproxy.out

(This retrieves space-delimited fields 1, 3, 4 and 7 of each log line, that is, the IP address, user name, timestamp and requested URL, and keeps only the lines that show a user authenticating their session.)
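
For when the manual step is replaced by a scheduled job, the same extraction could be done in Python. This is a rough sketch rather than the script I actually use: `extract_fields` and `write_extract` are hypothetical names, and unlike `cut` it simply skips any line with fewer than seven space-separated fields.

```python
import glob


def extract_fields(lines, marker="connect?session"):
    """Yield (ip, user, timestamp, url) for lines that start a session."""
    for line in lines:
        # Split on single spaces, mirroring cut -d' '.
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 7:
            continue  # assumption: ignore short or malformed lines
        # Fields 1, 3, 4 and 7 in cut's 1-based numbering.
        fields = (parts[0], parts[2], parts[3], parts[6])
        # Like the grep step, look for the marker in the extracted fields.
        if marker in " ".join(fields):
            yield fields


def write_extract(pattern="ezproxy*.log*", out_path="ezproxy.out"):
    """Run the extraction over every matching log and write one file."""
    with open(out_path, "w") as out:
        for path in sorted(glob.glob(pattern)):
            # errors="replace" keeps odd bytes in URLs from stopping the run.
            with open(path, errors="replace") as log:
                for fields in extract_fields(log):
                    out.write(" ".join(fields) + "\n")
```

Calling `write_extract()` in the directory that holds the logs should produce an `ezproxy.out` equivalent to the shell version for well-formed lines.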

With the user names redacted, the output looks like:

[Image: sample lines from ezproxy.out with the user names redacted]

Step Three: run it through my Python script

Details in the next post.
