Sifting through a lot of similar photos

To keep the amount of photos in my photo library sane, I had to sift through many pictures and get rid of redundant ones. I defined redundancy as many pictures taken at the same time. Thus I had to pick one of the redundant pictures and delete the other ones.

My strategy so far was to use Nautilus and Eye of GNOME to spot pictures of the same group and delete all but the best one.

I realised that photos usually show the same picture if they were shot at the same time, i.e. many quick shots after another. I also realised that usually the best photograph was the biggest one in terms on bytes in JPEG format.

To automate the whole selection and deletion process, I hacked together a tiny script that stupidly groups files in a directory according to their mtime and deletes all but the biggest one.

Before deletion, it will show the pictures with eog and ask whether or not to delete the other pictures.

It worked quite well and helped to quickly weed out 15% of my pictures 🙂

I played around with another method: Getting the difference of the histograms of the images, to compare the similarity. But as the pictures were shot with a different exposure, the histograms were quite different, too. Hence that didn’t work out very well. But I’ll leave it in, just for reference.

So if you happen to have a similar problem, feel free to grab the following script 🙂

#!/usr/bin/env python
import collections
import math
import os
from os.path import join, getsize, getmtime
import operator
import subprocess
import sys
subprocess.Popen.__enter__ = lambda self: self
subprocess.Popen.__exit__ = lambda self, type, value, traceback: self.kill()
directory = '.'
GET_RMS = False
mtimes = collections.defaultdict(list)
def get_picgroups_by_time(directory='.'):
	for root, dirs, files in os.walk(directory):
		for name in files:
			fname = join(root, name)
			mtime = getmtime(fname)
	# It's gotten a bit messy, but a OrderedDict is available in Python 3.1 hence this is the manually created ordered list.
	picgroups = [v for (k, v) in sorted([(k, v) for k, v in mtimes.iteritems() if len(v) >= THRESHOLD])]
	return picgroups
def get_picgroups(directory='.'):
	return get_picgroups_by_time()
picgroups = get_picgroups(directory)
print 'Got %d groups' % len(picgroups)
def get_max_and_picgroups(picgroups):
	for picgroup in picgroups:
		max_of_group = max(picgroup, key=lambda x: getsize(x))
		print picgroup
		print 'max: %s: %d' % (max_of_group, getsize(max_of_group))
		if GET_RMS:
			import PIL.Image
			last_pic = picgroup[0]
			for pic in picgroup[1:]:
				image1 =
				image2 =
				rms = math.sqrt(reduce(operator.add, map(lambda a,b: (a-b)**2, image1, image2))/len(image1))
				print 'RMS %s %s: %s' % (last_pic, pic, rms)
			last_pic = pic
		yield (max_of_group, picgroup)
max_and_picgroups = get_max_and_picgroups(picgroups)
def decide(prompt, decisions):
	import termios, fcntl, sys, os, select
	fd = sys.stdin.fileno()
	oldterm = termios.tcgetattr(fd)
	newattr = oldterm[:]
	newattr[3] = newattr[3] & ~termios.ICANON & ~termios.ECHO
	termios.tcsetattr(fd, termios.TCSANOW, newattr)
	oldflags = fcntl.fcntl(fd, fcntl.F_GETFL)
	fcntl.fcntl(fd, fcntl.F_SETFL, oldflags | os.O_NONBLOCK)
	print prompt
	decided = None
		while not decided:
			r, w, e =[fd], [], [])
			if r:
				c =
				print "Got character", repr(c)
				decision_made = decisions.get(c, None)
				if decision_made:
					decided = True
	    termios.tcsetattr(fd, termios.TCSAFLUSH, oldterm)
	    fcntl.fcntl(fd, fcntl.F_SETFL, oldflags)
for max_of_group, picgroup in max_and_picgroups:
	cmd = ['eog', '-n'] + picgroup
	print 'Showing %s' % ', '.join(picgroup)
	def delete_others():
		to_delete = picgroup[:]
		print 'deleting %s' % ', '.join (to_delete)
		[os.unlink(f) for f in to_delete]
	with subprocess.Popen(cmd) as p:
		decide('%s is max, delete others?' % max_of_group, {'y': delete_others, 'n': lambda: ''})

Wikify Pages

In one of our modules, “System Software”, we were asked to write a bash script which wikifies a page. That means to identify all nouns and replace them with a link to the Wikipedia.

I managed to write that up in two hours or so and I think I have a not so ugly solution (*cough* it’s still bash… *cough*). It has (major) drawbacks though. Valid X(HT)ML, i.e.

<![CDATA[ <body>

before the actual body will be recognized as the beginning. But parsing XML with plain bash is not that easy.

Also, my script somehow does not parse the payload correctly, that is it tails all the way down to the end of the file instead of stopping when </body> is reached.

Anyway, here’s the script:

### A script which tries to Wikipediafy a given HTML page
### That means that every proper noun is linked against the Wikipedia
### but only if it's not already linked against something.
### Assumptions are, that the HTML file has a "<body>" Tag on a seperate
### line and that "<a>" Tags don't span multiple lines.
### Also, Capitalised words in XML Tags are treated as payload words, just
### because parsing XML properly is a matter of 100k Shellscript (consider
### parsing <[DATA[!). Also, this assumption is not too far off, because
### Captd words in XML happen seldomly anyway.
### As this is written in Bash, it is horribly slow. You'd rather want to do
### this in a language that actually understand {X,HT}ML like Python
# You might want to change this
set -e
### Better not change anything below, it might kill kittens
# To break only on newlines, set the IFS to that
HTML=$(cat $1) # Read the file for performance reasons
# Find the beginning and end of Document and try to be permissive with HTML
# Errors by only stopping after hitting one <body> Tag
START_BODY=$(grep --max-count=1 -ni '<body>'<<<"$HTML" | cut -d: -f1)
END_BODY=$(grep --max-count=1 -ni '</body>'<<<"$HTML" | cut -d: -f1)
HEAD=$(head -n $START_BODY<<<"$HTML") # Extract the Head for later use
# $(()) is most probably a non-portable bashism, so one wants to get rid of that
# And the extract the body
PAYLOAD=$(tail -n +${START_BODY} <<<"$HTML" | tail -n ${RANGE_BODY})
### This is the main part
### Basically search for all words beginning with a capital letter
### and match that. We can use that later with \1.
### Try to find already linked words, replace them by their MD5 hash,
### Run generic Word finding mechanism and replace back later
# We simply assume that a link doesn't span multiple lines
LINKMATCHES=$(grep -i -E --only-matching '<a .*>.*</a>' $1 || true)
if [[ -n $LINKMATCHES ]]; then
    # We have found already linked words, put them into an array
    index=0 # iterate over array
    for MATCH in $LINKMATCHES; do
        # Uniquely hash the found link and replace it, saving it's origin
        MATCHMD5=$(md5sum <<<$MATCH | awk '{print $1}')
        # We simply assume that there's no "," in the match
        # Use Bash internals string replacement facilities
        let "index = $index + 1"
# Find the matches
WORDMATCHES=$(grep --only-matching '[A-Z][a-z][a-z]*'<<<$CLEANEDPAYLOAD | sort | uniq)
while [[ "$index" -lt ${#WORDMATCHES_ARRAY[@]} ]]; do
    # Yeah, iterating over an array with 300+ entries is fun *sigh*
    # You could now ask Wikipedia and only continue if the page exist
    # if wget -q "${BASEURL}${SEARCH}"; then ...; else ...; fi
    REPLACE="<a href=\"${BASEURL}${SEARCH}\">\2</a>"
    # Note, that we replace the first occurence only
    #WIKIFIED=${WIKIFIED/$SEARCH/$REPLACE} ## That's horribly slow, so use sed
    # Watch out for a problem: "<p>King" shall match as well as "King</p>"
    # or "King." but not eBook.
    # We thus match the needle plus the previous/following char,
    # iff it's not [A-Za-z]
    WIKIFIED=$(sed -e "s,\([^A-Za-z]\)\($SEARCH\)\([^A-Za-z]\),\1$REPLACE\3,"<<<$WIKIFIED) # so use sed
    let "index += 1"
# Replace hashed links with their original, same as above, only reverse.
# One could apply this technique to other tags besides <a>, but you really
# want to write that in a proper language :P
while [[ "$index" -lt ${#MD5_ARRAY[@]} ]]; do
    let "index += 1"
### Since we have the head and the payload separate, echo both
echo $HEAD
# Reset the IFS, e.g. for following scripts
Creative Commons Attribution-ShareAlike 3.0 Unported
This work by Muelli is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported.