Using Python to Emulate Unix Pipes

Thu 09 January 2020

Filed under Python

Tags: python, Unix utility emulation


Introduction

This post is an example of emulating Unix utilities in Python. I was prompted by another blog post about the difficulties of Unix shell scripting. The blogger was trying to accumulate the time taken to execute a given script multiple times (difficulties arose because Unix shell scripting doesn't do floating-point arithmetic easily). Ignoring those details, I wondered how I would do the core script in Python.

See: https://blog.plover.com/Unix/tools.html

The core shell pipeline is:

awk '{print $11}' FILE_NAME_PATTERN | sort | uniq -c | sort -n | grep -v EXCLUDE_PATTERN

Basically, this says:

  • For all files with names that match a specified pattern

  • Read the file, extracting the 11th field of each line

  • Sort the fields

  • Use uniq -c to output the unique field values, each with a count of how many times it occurs

  • Sort numerically, into ascending order of count

  • Exclude fields that match a given pattern.

    The script is being used to process website logs.
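
As a preview (just a rough sketch, using the placeholder names from the shell command above, and skipping all error handling), the whole pipeline could be collapsed into a few lines of Python; the rest of this post builds it up step by step:

import collections
import glob
import re

counts = collections.Counter(
    line.split()[10]  # awk's $11 is the 11th whitespace-separated field
    for fname in glob.glob('FILE_NAME_PATTERN')
    for line in open(fname)
    if len(line.split()) > 10
)
result = [
    (value, count)
    for value, count in sorted(counts.items(), key=lambda vc: vc[1])
    if not re.search('EXCLUDE_PATTERN', value)
]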

Implementation

Imports

glob does file name search, with pattern matching

re does regular expressions

collections contains Counter, which (no surprise) counts instances of objects

In [2]:
import glob
import re
import collections

Load up lab_black to format our Python nicely.

In [3]:
%reload_ext lab_black

Find file names

Find all file names that match a pattern in the target directory

In [4]:
SOURCE_DIR = '../data/'

fnames = glob.glob(SOURCE_DIR + 'test01 - Copy (*).txt')
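
As an aside, pathlib would be an alternative way of spelling the same search (just a sketch; the rest of this post sticks with glob):

from pathlib import Path

# equivalent to the glob.glob() call above, returning plain string paths
fnames = [str(p) for p in Path(SOURCE_DIR).glob('test01 - Copy (*).txt')]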

Read Files

We create a Counter instance, read the contents of each file, split each line into fields, and update the count of the second field. Note that we use the with statement to avoid having to do the file-closing cleanup ourselves.

We also cater for the case where a line has NO second field.

Note a gotcha: If you give a string to Counter, it thinks it is a list of chars, and counts each character. You have to wrap strings in a list.
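
For example:

import collections

collections.Counter('abc')    # treats the string as a sequence: Counter({'a': 1, 'b': 1, 'c': 1})
collections.Counter(['abc'])  # counts the whole string once: Counter({'abc': 1})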

Finally, most_common() gives us the (value, count) pairs in descending order of count; we reverse that into ascending order and turn it back into a list.

In [5]:
# create a Counter of values taken from the second field of each line
f1_count = collections.Counter()

# open each file name
for fname in fnames:
    with open(fname) as f:
        # read all lines in this file
        lines = f.readlines()
        # strip off leading and trailing whitespace, split on whitespace
        # update count of second field
        for line in lines:
            try:
                # get field (if present)
                field = line.strip().split()[1]
                # update count, note passing in a string gets it chopped into chars
                # have to pass list with string as only item
                f1_count.update([field])
            except IndexError:
                # line has no second field (maybe a blank line?), ignore it
                pass
            # end try
        # end for
    # end with
# end for
# sort list by count, then reverse, then turn into list again
counts = list(reversed(f1_count.most_common()))

Excluding Don't Cares

Finally, we go through the list of (field, count) tuples, excluding those that match the specified pattern. I made this a little fancy, in that I catered for the case with no exclusion pattern.

In the spirit of Unix, the output is just the raw tuples.

In [6]:
# do the exclusion on RE pattern

exclude = '^c$'
final_counts = [
    (v, n)
    for v, n in counts
    if (exclude is None) or (re.search(exclude, v) is None)
]

# raw display on counts
__ = [print(v, n) for v, n in final_counts]
vvvvvv 1
a 1
d 1
b 3
f 6

Fancy Report

I decided to add a reporting function that has the exclusion logic built in. Not quite in the spirit of Unix, but nicer to look at (a manager of Arsenal FC once famously said "If you want entertainment, go to the circus"; Unix bros would probably say "If you want to look at something nice, go to an art gallery").

In [7]:
def report_counts(
    counts: list, exclude: str = None
) -> None:
    '''
    report_counts: prints a formatted report showing values and counts,
                excluding values that match a RE pattern

    Parameters
    counts: list of the form [(v1, n1), (v2, n2), ...], v_i strings, n_i counts

    exclude: string holding a RE pattern; a line is suppressed if the pattern
             matches in the v_i string (default None)

    '''
    title1 = 'Value'
    title2 = 'Count'
    underbar = '-'
    col1 = 15
    col2 = 5
    print(f'{title1:^{col1}}|{title2:^{col2}}')

    print(
        f'{underbar:{underbar}^{col1}}|{underbar:{underbar}^{col2}}'
    )

    # print line of report if no exclude pattern given,
    # or if exclude pattern (non-None) not seen
    __ = [
        print(f'{v:>{col1}}|{n:>{col2}}')
        for v, n in counts
        if (exclude is None)
        or (re.search(exclude, v) is None)
    ]


# end
In [8]:
report_counts(counts, exclude='^c|b$')
     Value     |Count
---------------|-----
         vvvvvv|    1
              a|    1
              d|    1
              f|    6
In [9]:
report_counts(final_counts)
     Value     |Count
---------------|-----
         vvvvvv|    1
              a|    1
              d|    1
              b|    3
              f|    6

More Pythonic?

The nested for loops above are not very Pythonic-looking.

The code below collapses them into a set of nested comprehensions.

Sadly, as far as I can see there is no way to get the effect of a with context manager inside a list comprehension. Also sadly, I can't see any way to catch and ignore exceptions in a list comprehension, which makes this version rather brittle.
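
One workaround I could imagine (just a sketch, using the fnames list from above and a hypothetical helper called second_field; it still doesn't close the files) is to push the field extraction into a small function, so lines with a missing field become None rather than raising an exception:

def second_field(line):
    # return the second whitespace-separated field,
    # or None if the line has fewer than two fields
    parts = line.split()
    return parts[1] if len(parts) > 1 else None


zz_safe = collections.Counter(
    field
    for fname in fnames
    for field in (second_field(line) for line in open(fname))
    if field is not None
)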

In [10]:
zz = collections.Counter(
    [
        x
        for file_list in [
            [
                line.split()[1]
                for line in open(fname).readlines()
            ]
            for fname in fnames
        ]
        for x in file_list
    ]
)
In [11]:
zz
Out[11]:
Counter({'b': 3, 'c': 5, 'f': 6, 'd': 1, 'a': 1, 'vvvvvv': 1})

This shows the input to the Counter object (a list of field tokens)

In [12]:
[
    x
    for file_list in [
        [
            line.split()[1]
            for line in open(fname).readlines()
        ]
        for fname in fnames
    ]
    for x in file_list
]
Out[12]:
['b',
 'b',
 'c',
 'f',
 'f',
 'f',
 'f',
 'f',
 'f',
 'b',
 'd',
 'c',
 'c',
 'c',
 'c',
 'a',
 'vvvvvv']

This comprehension returns a list, each item of which is the list of field tokens from the corresponding file. The code above flattens this into a single list.

In [13]:
[
    [line.split()[1] for line in open(fname).readlines()]
    for fname in fnames
]
Out[13]:
[['b'],
 ['b'],
 ['c', 'f', 'f', 'f', 'f', 'f', 'f'],
 ['b'],
 ['d', 'c', 'c', 'c', 'c', 'a', 'vvvvvv']]
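
As another aside, itertools.chain.from_iterable would do the flattening for us. A sketch (equivalent to the nested comprehension above, with the same brittleness about missing fields):

import itertools

zz_chain = collections.Counter(
    itertools.chain.from_iterable(
        [line.split()[1] for line in open(fname).readlines()]
        for fname in fnames
    )
)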

Scaling Up

The toy data sets used above are all very well, but then I thought: what if my files are megabytes or gigabytes in size? So I recast the code to use generators (i.e. lazy evaluation, rather than greedy evaluation).

This approach might fall over if there are huge numbers of log files (more than the allowed number of open files), because, as noted above, the files are never explicitly closed.

In [14]:
zz_gen = (
    [line.split()[1] for line in open(fname).readlines()]
    for fname in fnames
)
In [15]:
collections.Counter(
    (x for file_list in zz_gen for x in file_list)
)
Out[15]:
Counter({'b': 3, 'c': 5, 'f': 6, 'd': 1, 'a': 1, 'vvvvvv': 1})
In [16]:
zz_gen
Out[16]:
<generator object <genexpr> at 0x0000012F35ABE1B0>
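
Finally, a sketch of how both remaining wrinkles (each file read greedily with readlines(), and never explicitly closed) could be avoided with a small generator function (field_stream is a hypothetical name, not used elsewhere in this post):

def field_stream(fnames):
    # lazily yield the second field of every line in every file,
    # iterating line by line and closing each file before opening the next
    for fname in fnames:
        with open(fname) as f:
            for line in f:
                parts = line.split()
                if len(parts) > 1:
                    yield parts[1]


collections.Counter(field_stream(fnames))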
