There are various possible solutions that could accomplish this: for example, a Celery task queue, an event scheduler, or a synchronised / threaded queue. However, those are all fairly heavyweight solutions to this problem, because we only need a queue that runs inside one thread, and that lives for the duration of a single DB transaction (and therefore also only for a single request).
To solve this problem, I implemented a very lightweight function queue, where each queue is a deque instance that lives inside flask.g, and that is therefore available for the duration of a given request context (or app context).
The whole implementation really just consists of this one function:
from collections import deque

from flask import g


def queue_and_delayed_execute(
        queue_key, session_hash, func_to_enqueue,
        func_to_enqueue_ctx=None, is_time_to_execute_funcs=False):
    """Add a function to a queue, then execute the funcs now or later.

    Creates a unique deque() queue for each queue_key / session_hash
    combination, and stores the queue in flask.g. The idea is that
    queue_key is some meaningful identifier for the functions in the
    queue (e.g. 'banana_masher_queue'), and that session_hash is some
    identifier that's guaranteed to be unique, in the case of there
    being multiple queues for the same queue_key at the same time (e.g.
    if there's a one-to-one mapping between a queue and a SQLAlchemy
    transaction, then hash(db.session) is a suitable value to pass in
    for session_hash).

    Since flask.g only stores data for the lifetime of the current
    request (or for the lifetime of the current app context, if not
    running in a request context), this function should only be used for
    a queue of functions that's guaranteed to only be built up and
    executed within a single request (e.g. within a single DB
    transaction).

    Adds func_to_enqueue to the queue (and passes func_to_enqueue_ctx as
    kwargs if it has been provided). If is_time_to_execute_funcs is
    True (e.g. if a DB transaction has just been committed), then takes
    each function out of the queue in FIFO order, and executes the
    function.
    """
    # Initialise the set of queues for queue_key
    if queue_key not in g:
        setattr(g, queue_key, {})

    # Initialise the unique queue for the specified session_hash
    func_queues = getattr(g, queue_key)
    if session_hash not in func_queues:
        func_queues[session_hash] = deque()
    func_queue = func_queues[session_hash]

    # Add the passed-in function and its context values to the queue
    func_queue.append((func_to_enqueue, func_to_enqueue_ctx))

    if is_time_to_execute_funcs:
        # Take each function out of the queue and execute it
        while func_queue:
            func_to_execute, func_to_execute_ctx = (
                func_queue.popleft())
            func_ctx = (
                func_to_execute_ctx
                if func_to_execute_ctx is not None
                else {})
            func_to_execute(**func_ctx)

        # The queue is now empty, so clean up by deleting the queue
        # object from flask.g
        del func_queues[session_hash]
To use the function queue, calling code should look something like this:
from flask import current_app as app
from flask_mail import Message
from sqlalchemy.exc import SQLAlchemyError

from myapp.extensions import db, mail


def do_api_log_msg(log_msg):
    """Log the specified message to the app logger."""
    app.logger.info(log_msg)


def do_api_notify_email(mail_subject, mail_body):
    """Send the specified notification email to site admins."""
    msg = Message(
        mail_subject,
        sender=app.config['MAIL_DEFAULT_SENDER'],
        recipients=app.config['CONTACT_EMAIL_RECIPIENTS'])
    msg.body = mail_body
    mail.send(msg)

    # Added for demonstration purposes, not really needed in production
    app.logger.info('Sent email: {0}'.format(mail_subject))


def finalise_api_op(
        log_msg=None, mail_subject=None, mail_body=None,
        is_db_session_commit=False, is_app_logger=False,
        is_send_notify_email=False):
    """Finalise an API operation by committing and logging."""
    # Get a unique identifier for this DB transaction
    session_hash = hash(db.session)

    if is_db_session_commit:
        try:
            db.session.commit()

            # Added for demonstration purposes, not really needed in
            # production
            app.logger.info('Committed DB transaction')
        except SQLAlchemyError:
            db.session.rollback()
            return {'error': 'error finalising api op'}

    if is_app_logger:
        queue_key = 'api_log_msg_queue'
        func_to_enqueue_ctx = dict(log_msg=log_msg)

        queue_and_delayed_execute(
            queue_key=queue_key, session_hash=session_hash,
            func_to_enqueue=do_api_log_msg,
            func_to_enqueue_ctx=func_to_enqueue_ctx,
            is_time_to_execute_funcs=is_db_session_commit)

    if is_send_notify_email:
        queue_key = 'api_notify_email_queue'
        func_to_enqueue_ctx = dict(
            mail_subject=mail_subject, mail_body=mail_body)

        queue_and_delayed_execute(
            queue_key=queue_key, session_hash=session_hash,
            func_to_enqueue=do_api_notify_email,
            func_to_enqueue_ctx=func_to_enqueue_ctx,
            is_time_to_execute_funcs=is_db_session_commit)

    return {'message': 'api op finalised ok'}
And that code can be called from a bunch of API methods like so:
def update_froggy_colour(
        froggy, colour, is_db_session_commit=False, is_app_logger=False,
        is_send_notify_email=False):
    """Update a froggy's colour."""
    froggy.colour = colour

    db.session.add(froggy)

    log_msg = ((
        'Froggy colour updated: {froggy.id}; new value: '
        '{froggy.colour}').format(froggy=froggy))
    mail_body = (
        'Froggy: {froggy.id}; new colour: {froggy.colour}'.format(
            froggy=froggy))

    result = finalise_api_op(
        log_msg=log_msg, mail_subject='Froggy colour updated',
        mail_body=mail_body, is_db_session_commit=is_db_session_commit,
        is_app_logger=is_app_logger,
        is_send_notify_email=is_send_notify_email)

    return result


def make_froggy_jump(
        froggy, jump_height, is_db_session_commit=False,
        is_app_logger=False, is_send_notify_email=False):
    """Make a froggy jump."""
    froggy.is_jumping = True
    froggy.jump_height = jump_height

    db.session.add(froggy)

    log_msg = ((
        'Made froggy jump: {froggy.id}; jump height: '
        '{froggy.jump_height}').format(froggy=froggy))
    mail_body = (
        'Froggy: {froggy.id}; jump height: {froggy.jump_height}'.format(
            froggy=froggy))

    result = finalise_api_op(
        log_msg=log_msg, mail_subject='Made froggy jump',
        mail_body=mail_body, is_db_session_commit=is_db_session_commit,
        is_app_logger=is_app_logger,
        is_send_notify_email=is_send_notify_email)

    return result
And the API methods can be called like so:
def make_froggy_brightpink_and_highjump(froggy):
    """Make a froggy bright pink and jumping high."""
    results = []

    result1 = update_froggy_colour(
        froggy, "bright_pink", is_app_logger=True)
    results.append(result1)

    result2 = make_froggy_jump(
        froggy, "50 metres", is_db_session_commit=True,
        is_app_logger=True, is_send_notify_email=True)
    results.append(result2)

    return results
If make_froggy_brightpink_and_highjump() is called from within a Flask app context, the app's log should include output that looks something like this:
INFO [2017-12-01 09:00:00] Committed DB transaction
INFO [2017-12-01 09:00:00] Froggy colour updated: 123; new value: bright_pink
INFO [2017-12-01 09:00:00] Made froggy jump: 123; jump height: 50 metres
INFO [2017-12-01 09:00:00] Sent email: Made froggy jump
The log output demonstrates that the desired behaviour has been achieved: first, the DB transaction finishes (i.e. the froggy actually gets set to bright pink, and made to jump high, in one atomic write operation); then, the API actions are logged in the order that they were called (first the colour was updated, then the froggy was made to jump); then, email notifications are sent in order (in this case, we only want an email notification sent for when the froggy jumps high – but if we had also asked for an email notification for when the froggy's colour was changed, that would have been the first email sent).
That's about all there is to this "task queue" implementation – as I said, it's very lightweight, because it only needs to be simple and short-lived. I'm sharing this solution mainly to serve as a reminder that you shouldn't just reach for your standard hammer, because sometimes the hammer is disproportionately big compared to the nail. In this case, the solution doesn't need an asynchronous queue, it doesn't need a scheduled queue, and it doesn't need a threaded queue. (Although moving the email sending off to a Celery task is a good idea in production; and moving the logging to Celery would be warranted too, if it were logging to a third-party service rather than just to a local file.) It just needs a queue that builds up and then gets processed, for a single DB transaction.
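Stripped of the Flask specifics, the pattern boils down to a dict of deques keyed by (queue_key, session_hash), drained in FIFO order when it's time to run. Here's a Flask-free sketch of that core idea (the names are mine, for illustration; a plain dict stands in for flask.g):

```python
from collections import deque

# A plain dict stands in for flask.g in this Flask-free sketch.
queues = {}

def enqueue_and_maybe_run(queue_key, session_hash, func, ctx=None,
                          run_now=False):
    # One deque per (queue_key, session_hash) combination
    func_queue = queues.setdefault(queue_key, {}).setdefault(
        session_hash, deque())
    func_queue.append((func, ctx if ctx is not None else {}))
    if run_now:
        # Drain in FIFO order, then clean up the empty queue
        while func_queue:
            f, kwargs = func_queue.popleft()
            f(**kwargs)
        del queues[queue_key][session_hash]

calls = []
enqueue_and_maybe_run('demo_queue', 42, lambda msg: calls.append(msg),
                      ctx={'msg': 'first'})
enqueue_and_maybe_run('demo_queue', 42, lambda msg: calls.append(msg),
                      ctx={'msg': 'second'}, run_now=True)
# calls is now ['first', 'second'], and the (demo_queue, 42) queue is gone
```

The real implementation above is the same shape; flask.g just scopes the dict to the current request automatically.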
The aim of this app is to demonstrate that, with the help of modern JS libraries, and with some well-thought-out server-side snippets, it's now perfectly possible to "bake in" live in-place editing for virtually every content element in a typical brochureware site.
This app is not a CMS. On the contrary, think of it as a proof-of-concept alternative to a CMS. An alternative where there's no "admin area", there's no "editing mode", and there's no "preview button". There's only direct manipulation.
"Template" means that this is a sample app. It comes with a bunch of models that work out-of-the-box (e.g. text content block, image content block, gallery item, event). However, these are just a starting point: you can and should define your own models when building a real site. Same with the front-end templates: the home page layout and the CSS styles are just examples.
I can't stress enough that this is not a CMS. There are of course plenty of CMSes out there already, in Python and in every other language under the sun. Several of those CMSes I have used extensively. I've even been paid to build web sites with them, for most of my professional life so far. I desire neither to add to that list, nor to take on the heavy maintenance burden that doing so would entail.
What I have discovered as a web developer, and what I'm sure all web developers discover sooner or later, is that there's no such thing as the perfect CMS. Possibly, there isn't even such a thing as a good CMS! If you want to build a web site with a content management experience that's highly tailored to the project in question, then really, you have to build a unique custom CMS just for that site. Deride me as a perfectionist if you want, but that's my opinion.
There is such a thing as a good framework. Flask Editable Site, as its name suggests, uses the Flask framework, which has the glorious honour of being my favourite framework these days. And there is definitely such a thing as a good library. Flask Editable Site uses a number of both front-end and back-end libraries. The best libraries can be easily mashed up together in different configurations, on top of different frameworks, to help power a variety of different apps.
Flask Editable Site is not a CMS. It's a sample app, which is a template for building a unique CMS-like app tailor-made for a given project. If you're doing it right, then no two projects based on Flask Editable Site will be the same app. Every project has at least slightly different data models, users / permissions, custom forms, front-end widgets, and so on.
So, there's the practical aim of demonstrating direct manipulation / live editing. However, Flask Editable Site has a philosophical aim, too. The traditional "building a super one-size-fits-all app to power 90% of sites" approach isn't necessarily a good one. You inevitably end up fighting the super-app, and hacking around things to make it work for you. Instead, how about "building and sharing a template for making each site its own tailored app"? How about accepting that "every site is a hack", and embracing that instead of fighting it?
Thanks to all the libraries that Flask Editable Site uses; in each case, I tried to choose the best library available at the present time, for achieving a given purpose. Chief among them is Dante, a contenteditable WYSIWYG editor and a Medium editor clone. I had previously used MediumEditor, and I recommend it too, but I feel that Dante gives a more polished out-of-the-box experience for now. I think the folks at Medium have done a great job in setting the bar high for beautiful rich-text editing, which is an important part of the admin experience for many web sites / apps.

Flask Editable Site began as the codebase for The Daydream Believers Performers web site, which I built pro-bono as a side project recently. So, acknowledgements to that group for helping to make Flask Editable Site happen.
For the live editing UX, I acknowledge that I drew inspiration from several examples. First and foremost, from Mezzanine, a CMS (based on Django) which I've used on occasion. Mezzanine puts "edit" buttons in-place next to most text fields on a site, and pops up a traditional (i.e. non-contenteditable) WYSIWYG editor when these are clicked.
I also had a peek at Create.js, which takes care of the front-end side of live content editing quite similarly to the way I've cobbled it together. In Flask Editable Site, the combo of Dante editor and my custom "autosave" JS could easily be replaced with Create.js (particularly when using Hallo editor, which is quite minimalist like Dante); I guess it's just a question of personal taste.
Sir Trevor JS is an interesting new kid on the block. I'm quite impressed with Sir Trevor, but its philosophy of "adding blocks of anything down the page" isn't such a great fit for Flask Editable Site, where the idea is that site admins can only add / edit content within specific constraints for each block on the page. However, for sites with no structured content models, where it's OK for each page to be a free canvas (or for a "free canvas" within, say, each blog post on a site), I can see Sir Trevor being a real game-changer.
There's also X-editable, which is the only JS solution that I've come across for nice live editing of list-type content (i.e. checkboxes, radio buttons, tag fields, autocomplete boxes, etc). I haven't used X-editable in Flask Editable Site, because I'm mainly dealing with text and image fields (and for date / time fields, I prefer a proper calendar widget). But if I needed live editing of list fields, X-editable would be my first choice.
I must stress that, as I said above, Flask Editable Site is a proof-of-concept. It doesn't have all the features you're going to need for your project foo. In particular, it doesn't support very many field types: only text ("short text" and "rich text"), date, time, and image. It should also support inline images and (YouTube / Vimeo) videos out-of-the-box, as this is included with Dante, but I haven't tested it. For other field types, forks / pull requests / sister projects are welcome.
If you look at the code (particularly the settings.py file and the home view), you should be able to add live editing of new content models quite easily, with just a bit of copy-pasting and tweaking. The idea is that the editable.views code is generic enough that you won't need to change it at all when adding new models / fields in your back-end. At least, that's the idea.
Quite a lot of the code in Flask Editable Site is more complex than it strictly needs to be, in order to support "session store mode", where all content is saved to the current user's session instead of to the database (preferably using something like Memcached or temp files, rather than cookies, although that depends on what settings you use). I developed "session store mode" in order to make the demo site work without requiring any hackery such as a scheduled DB refresh (which is the usual solution in such cases). However, I can see it also being useful for sandbox environments, for UAT, and for reviewing design / functionality changes without "real" content getting in the way.
The app also includes a fair bit of code for random generation and selection of sample text and image content. This was also done primarily for the purposes of the demo site. But, upon reflection, I think that a robust solution for randomly populating a site's content is really something that all CMS-like apps should consider more seriously. The exact algorithms and sample content pools for this, of course, are a matter of taste. But the point is that it's not just about pretty pictures and amusing Dickensian text. It's about the mindset of treating content dynamically, and of recognising the bounds and the parameters of each placeholder area on the page. And what better way to enforce that mindset, than by seeing a different random set of content every time you restart the app?
I decided to make this project a good opportunity for getting my hands dirty with thorough unit / functional testing. As such, Flask Editable Site is my first open-source effort that features automated testing via Travis CI, as well as test coverage reporting via Coveralls. As you can see on the GitHub page, tests are passing and coverage is pretty good. The tests are written in pytest, with significant help from webtest, too. I hope that the tests also serve as a template for other projects; all too often, with small brochureware sites, formal testing is done sparingly if at all.
Regarding the "no admin area" principle, Flask Editable Site has taken quite a purist approach to this. Personally, I think that radically reducing the role of "admin areas" in web site administration will lead to better UX. Anything that's publicly visible on the site, should be editable first and foremost via direct manipulation. However, in reality there will always be things that aren't publicly visible, and that admins still need to edit. For example, sites will always need user / role CRUD pages (unless you're happy to only manage users via shell commands). So, if you do add admin pages to a project based on Flask Editable Site, please don't feel as though you're breaking some golden rule.
Hope you enjoy playing around with the app. Who knows, maybe you'll even build something useful based on it. Feedback, bug reports, pull requests, all welcome.
I'd never before stopped to think about whether or not there was a limit to how much you can put in a cookie. Usually, cookies only store very small string values, such as a session ID, a tracking code, or a browsing preference (e.g. "tile" or "list" for search results). So, usually, there's no need to consider their size limits.
However, while working on a new side project of mine that heavily uses session storage, I discovered this limit the hard (to debug) way. Anyway, now I've got one more adage to add to my developer's phrasebook: if you're trying to store more than 4KiB in a cookie, you're doing it wrong.
Actually, according to the web site Browser Cookie Limits, the safe "lowest common denominator" maximum size to stay below is 4093 bytes. Also check out the Stack Overflow discussion, What is the maximum size of a web browser's cookie's key?, for more commentary regarding the limit.
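As a rough guard, you can check a name / value pair against that 4093-byte budget before setting it. A minimal sketch (the helper name is mine, not from any library):

```python
MAX_COOKIE_BYTES = 4093  # the safe cross-browser limit mentioned above

def fits_in_cookie(name, value):
    """Return True if 'name=value' stays within the safe cookie size.

    Counts the encoded name, the '=' separator, and the encoded value;
    note that attribute overhead (Path, Expires, etc.) would eat further
    into the budget in practice.
    """
    encoded = '%s=%s' % (name, value)
    return len(encoded.encode('utf-8')) <= MAX_COOKIE_BYTES
```

A typical session ID passes easily, while a few kilobytes of serialised session data does not – which is exactly the failure mode described below.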
In my case – working with Flask, which depends on Werkzeug – trying to store an oversized cookie doesn't throw any errors; it simply fails silently. I've submitted a patch to Werkzeug, to make oversized cookies raise an exception, so hopefully it will be more obvious in future when this problem occurs.
It appears that this is not an isolated issue; many web frameworks and libraries fail silently with storage of too-big cookies. It's the case with Django, where the decision was made to not fix it, for technical reasons. Same story with CodeIgniter. Seems that Ruby on Rails is well-behaved and raises exceptions. Basically, your mileage may vary: don't count on your framework of choice alerting you, if you're being a cookie monster.
Also, as several others have pointed out, trying to store too much data in cookies is a bad idea anyway, because that data travels with every HTTP request and response, so it should be as small as possible. As I learned, if you find that you're dealing with non-trivial amounts of session data, then ditch client-side storage for the app in question, and switch to server-side session data storage (preferably using something like Memcached or Redis).
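In Flask specifically, one way to make that switch is the Flask-Session extension, which keeps the session payload server-side and leaves only a session ID in the cookie. A configuration sketch, assuming Flask-Session is installed and a local Redis instance is available:

```python
import redis
from flask import Flask, session
from flask_session import Session

app = Flask(__name__)

# Store session payloads in Redis; the cookie then carries only a session ID.
app.config['SESSION_TYPE'] = 'redis'
app.config['SESSION_REDIS'] = redis.from_url('redis://localhost:6379')
Session(app)

@app.route('/remember')
def remember():
    # Safe to store non-trivial data now: it never travels in the cookie,
    # so the 4 KiB limit no longer applies to it.
    session['big_blob'] = 'x' * 100000
    return 'ok'
```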
For static files (i.e. an app's seldom-changing CSS, JS, and images), Flask-Assets and Flask-S3 work together like a charm. For more dynamic files, there exist numerous snippets of solutions, but I couldn't find anything to fill in all the gaps and tie it together nicely.
Due to a pressing itch in one of my projects, I decided to rectify this situation somewhat. Over the past few weeks, I've whipped up a bunch of Python / Flask tidbits, to handle the features that I needed:

- s3-saver, for finding, saving, and deleting files that may live locally or on S3
- url-for-s3, for generating URLs to S3-based files
- flask-thumbnails-s3, for generating thumbnails of local or S3-based images
- flask-admin-s3-upload, for saving file / image uploads to S3 from within Flask-Admin

I've also published an example app that demonstrates how all these tools can be used together. Feel free to dive straight into the example code on GitHub; or read on for a step-by-step guide of how this Flask S3 tool suite works.
The key feature across most of this tool suite is being able to use the same code for working with local and with S3-based files. Just change a single config option, or a single function argument, to switch from one to the other. This is critical to the way I need to work with files in my Flask projects: on my development environment, everything should be on the local filesystem; but on other environments (especially production), everything should be on S3. Others may have the same business requirements (in which case you're in luck). This is most evident with s3-saver.
Here's a sample of the typical code you might use, when working with s3-saver:
from io import BytesIO
from os import path

from flask import current_app as app
from flask import Blueprint
from flask import flash
from flask import redirect
from flask import render_template
from flask import url_for
from s3_saver import S3Saver

from project import db
from library.prefix_file_utcnow import prefix_file_utcnow
from foo.forms import ThingySaveForm
from foo.models import Thingy

mod = Blueprint('foo', __name__)


@mod.route('/', methods=['GET', 'POST'])
def home():
    """Displays the Flask S3 Save Example home page."""
    model = Thingy.query.first() or Thingy()

    form = ThingySaveForm(obj=model)

    if form.validate_on_submit():
        image_orig = model.image
        image_storage_type_orig = model.image_storage_type
        image_bucket_name_orig = model.image_storage_bucket_name

        # Initialise s3-saver.
        image_saver = S3Saver(
            storage_type=app.config['USE_S3'] and 's3' or None,
            bucket_name=app.config['S3_BUCKET_NAME'],
            access_key_id=app.config['AWS_ACCESS_KEY_ID'],
            access_key_secret=app.config['AWS_SECRET_ACCESS_KEY'],
            field_name='image',
            storage_type_field='image_storage_type',
            bucket_name_field='image_storage_bucket_name',
            base_path=app.config['UPLOADS_FOLDER'],
            static_root_parent=path.abspath(
                path.join(app.config['PROJECT_ROOT'], '..')))

        form.populate_obj(model)

        if form.image.data:
            filename = prefix_file_utcnow(model, form.image.data)

            filepath = path.abspath(
                path.join(
                    path.join(
                        app.config['UPLOADS_FOLDER'],
                        app.config['THINGY_IMAGE_RELATIVE_PATH']),
                    filename))

            # Best to pass in a BytesIO to S3Saver, containing the
            # contents of the file to save. A file from any source
            # (e.g. in a Flask form submission, a
            # werkzeug.datastructures.FileStorage object; or if
            # reading in a local file in a shell script, perhaps a
            # Python file object) can be easily converted to BytesIO.
            # This way, S3Saver isn't coupled to a Werkzeug POST
            # request or to anything else. It just wants the file.
            temp_file = BytesIO()
            form.image.data.save(temp_file)

            # Save the file. Depending on how S3Saver was initialised,
            # could get saved to local filesystem or to S3.
            image_saver.save(
                temp_file,
                app.config['THINGY_IMAGE_RELATIVE_PATH'] + filename,
                model)

            # If updating an existing image,
            # delete old original and thumbnails.
            if image_orig:
                if image_orig != model.image:
                    filepath = path.join(
                        app.config['UPLOADS_FOLDER'],
                        image_orig)
                    image_saver.delete(
                        filepath,
                        storage_type=image_storage_type_orig,
                        bucket_name=image_bucket_name_orig)

                    glob_filepath_split = path.splitext(path.join(
                        app.config['MEDIA_THUMBNAIL_FOLDER'],
                        image_orig))
                    glob_filepath = glob_filepath_split[0]

                    glob_matches = image_saver.find_by_path(
                        glob_filepath,
                        storage_type=image_storage_type_orig,
                        bucket_name=image_bucket_name_orig)

                    for filepath in glob_matches:
                        image_saver.delete(
                            filepath,
                            storage_type=image_storage_type_orig,
                            bucket_name=image_bucket_name_orig)
        else:
            model.image = image_orig

        # Handle image deletion
        if form.image_delete.data and image_orig:
            filepath = path.join(
                app.config['UPLOADS_FOLDER'], image_orig)

            # Delete the file. In this case, we have to pass in
            # arguments specifying whether to delete locally or on
            # S3, as this should depend on where the file was
            # originally saved, rather than on how S3Saver was
            # initialised.
            image_saver.delete(
                filepath,
                storage_type=image_storage_type_orig,
                bucket_name=image_bucket_name_orig)

            # Also delete thumbnails
            glob_filepath_split = path.splitext(path.join(
                app.config['MEDIA_THUMBNAIL_FOLDER'],
                image_orig))
            glob_filepath = glob_filepath_split[0]

            # S3Saver can search for files too. When searching locally,
            # it uses glob(); when searching on S3, it uses key
            # prefixes.
            glob_matches = image_saver.find_by_path(
                glob_filepath,
                storage_type=image_storage_type_orig,
                bucket_name=image_bucket_name_orig)

            for filepath in glob_matches:
                image_saver.delete(
                    filepath,
                    storage_type=image_storage_type_orig,
                    bucket_name=image_bucket_name_orig)

            model.image = ''
            model.image_storage_type = ''
            model.image_storage_bucket_name = ''

        if form.image.data or form.image_delete.data:
            db.session.add(model)
            db.session.commit()

            flash('Thingy %s' % (
                form.image_delete.data and 'deleted' or 'saved'),
                'success')
        else:
            flash(
                'Please upload a new thingy or delete the ' +
                'existing thingy',
                'warning')

        return redirect(url_for('foo.home'))

    return render_template('home.html',
        form=form,
        model=model)
(From: https://github.com/Jaza/flask-s3-save-example/blob/master/project/foo/views.py.)
As is hopefully evident in the sample code above, the idea with s3-saver is that as little S3-specific code as possible is needed, when performing operations on a file. Just find, save, and delete files as usual, per the user's input, without worrying about the details of that file's storage back-end.
s3-saver uses the excellent Python boto library, as well as Python's built-in file handling functions, so that you don't have to. As you can see in the sample code, you don't need to directly import either boto, or file-handling functions such as glob or os.remove. All you need to import is io.BytesIO and os.path, in order to be able to pass s3-saver the parameters that it needs.
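On the BytesIO hand-off that the sample's comments describe: copying any readable binary source into a rewound BytesIO takes just a few lines. A generic helper of my own, not part of s3-saver:

```python
from io import BytesIO

def to_bytesio(readable):
    """Copy any readable binary source into a rewound BytesIO."""
    buf = BytesIO()
    buf.write(readable.read())
    buf.seek(0)  # rewind so the consumer (e.g. S3Saver) reads from the start
    return buf

# Works the same for a local file opened in 'rb' mode, or for a werkzeug
# FileStorage object, since both expose .read().
```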
url_for_s3 is a simple utility function that generates a URL to a given S3-based file. It's designed to match flask.url_for as closely as possible, so that one can be swapped out for the other with minimal fuss.
from __future__ import print_function

from flask import url_for
from url_for_s3 import url_for_s3

from project import db


class Thingy(db.Model):
    """Sample model for flask-s3-save-example."""
    id = db.Column(db.Integer(), primary_key=True)
    image = db.Column(db.String(255), default='')
    image_storage_type = db.Column(db.String(255), default='')
    image_storage_bucket_name = db.Column(db.String(255), default='')

    def __repr__(self):
        return 'A thingy'

    @property
    def image_url(self):
        from flask import current_app as app
        return (self.image
            and '%s%s' % (
                app.config['UPLOADS_RELATIVE_PATH'],
                self.image)
            or None)

    @property
    def image_url_storageaware(self):
        if not self.image:
            return None

        if not (
                self.image_storage_type
                and self.image_storage_bucket_name):
            return url_for(
                'static',
                filename=self.image_url,
                _external=True)

        if self.image_storage_type != 's3':
            raise ValueError((
                'Storage type "%s" is invalid, the only supported ' +
                'storage type (apart from default local storage) ' +
                'is s3.') % self.image_storage_type)

        return url_for_s3(
            'static',
            bucket_name=self.image_storage_bucket_name,
            filename=self.image_url)
(From: https://github.com/Jaza/flask-s3-save-example/blob/master/project/foo/models.py.)
The above sample code illustrates how I typically use url_for_s3. For a given instance of a model, if that model's file is stored locally, then generate its URL using flask.url_for; otherwise, switch to url_for_s3. Only one extra parameter is needed: the S3 bucket name.
{% if model.image %}
  <p><a href="{{ model.image_url_storageaware }}">View original</a></p>
{% endif %}
(From: https://github.com/Jaza/flask-s3-save-example/blob/master/templates/home.html.)
I can then easily show the "storage-aware URL" for this model in my front-end templates.
In my use case, the majority of the files being uploaded are images, and most of those images need to be resized when displayed in the front-end. Also, ideally, the dimensions for resizing shouldn't have to be pre-specified (i.e. thumbnails shouldn't only be able to get generated when the original image is first uploaded); new thumbnails of any size should get generated on-demand per the templates' needs. The front-end may change according to the design / branding whims of clients and other stakeholders, further on down the road.
flask-thumbnails handles just this workflow for local files; so, I decided to fork it and to create flask-thumbnails-s3, which works the same as flask-thumbnails when set to use local files, but which can also store and retrieve thumbnails on a S3 bucket.
{% if image %}
  <div>
    <img src="{{ image|thumbnail(size,
                                 crop=crop,
                                 quality=quality,
                                 storage_type=storage_type,
                                 bucket_name=bucket_name) }}"
         alt="{{ alt }}" title="{{ title }}" />
  </div>
{% endif %}
(From: https://github.com/Jaza/flask-s3-save-example/blob/master/templates/macros/imagethumb.html.)
Like its parent project, flask-thumbnails-s3 is most commonly invoked by way of a template filter. If a thumbnail of the given original file exists, with the specified size and attributes, then it's returned straightaway; if not, then the original file is retrieved, a thumbnail is generated, and the thumbnail is saved to the specified storage back-end.
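That check-then-generate flow can be summarised in a few lines. In this illustrative sketch (not flask-thumbnails-s3's actual code), a plain dict stands in for the local / S3 storage back-end:

```python
cache = {}  # stand-in for the thumbnail storage back-end (local or S3)

def thumbnail(original, size, generate):
    """Return a cached thumbnail, generating and storing it on a miss."""
    key = (original, size)
    if key not in cache:
        # Cache miss: generate once, store for all later requests
        cache[key] = generate(original, size)
    return cache[key]

generated = []
def fake_generate(original, size):
    # Stub generator that records each (expensive) generation
    generated.append((original, size))
    return '%s-thumb-%sx%s' % (original, size[0], size[1])

first = thumbnail('frog.jpg', (100, 100), fake_generate)
second = thumbnail('frog.jpg', (100, 100), fake_generate)  # served from cache
```

The real filter's extra work is in where the cache lives (filesystem paths or S3 keys) and in the image resizing itself; the control flow is the same.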
At the moment, flask-thumbnails-s3 blocks the running thread while it generates a thumbnail and saves it to S3. Ideally, this task would get sent to a queue, and a "dummy" thumbnail would be returned in the immediate request, until the "real" thumbnail is ready in a later request. The Sorlery plugin for Django uses the queued approach. It would be cool if flask-thumbnails-s3 (optionally) did the same. Anyway, it works without this fanciness for now; extra contributions welcome!
(By the way, in my testing, this is much less of a problem if your Flask app is deployed on an Amazon EC2 box, particularly if it's in the same region as your S3 bucket; unsurprisingly, there appears to be much less latency between an EC2 server and S3, than there is between a non-Amazon server and S3).
The purpose of flask-admin-s3-upload is basically to provide the same 'save' functionality as s3-saver, but automatically within Flask-Admin. It does this by providing alternatives to the flask_admin.form.upload.FileUploadField and flask_admin.form.upload.ImageUploadField classes, namely flask_admin_s3_upload.S3FileUploadField and flask_admin_s3_upload.S3ImageUploadField.
(Anecdote: I actually wrote flask-admin-s3-upload before any of the other tools in this suite, because I began by working with a part of my project that has no custom front-end, only a Flask-Admin based management console).
Using the utilities provided by flask-admin-s3-upload is fairly simple:
from os import path

from flask_admin_s3_upload import S3ImageUploadField

from project import admin, app, db
from foo.models import Thingy
from library.admin_utils import ProtectedModelView
from library.prefix_file_utcnow import prefix_file_utcnow


class ThingyView(ProtectedModelView):
    column_list = ('image',)
    form_excluded_columns = ('image_storage_type',
                             'image_storage_bucket_name')

    form_overrides = dict(
        image=S3ImageUploadField)

    form_args = dict(
        image=dict(
            base_path=app.config['UPLOADS_FOLDER'],
            relative_path=app.config['THINGY_IMAGE_RELATIVE_PATH'],
            url_relative_path=app.config['UPLOADS_RELATIVE_PATH'],
            namegen=prefix_file_utcnow,
            storage_type_field='image_storage_type',
            bucket_name_field='image_storage_bucket_name',
        ))

    def scaffold_form(self):
        form_class = super(ThingyView, self).scaffold_form()

        static_root_parent = path.abspath(
            path.join(app.config['PROJECT_ROOT'], '..'))

        if app.config['USE_S3']:
            form_class.image.kwargs['storage_type'] = 's3'
            form_class.image.kwargs['bucket_name'] = \
                app.config['S3_BUCKET_NAME']
            form_class.image.kwargs['access_key_id'] = \
                app.config['AWS_ACCESS_KEY_ID']
            form_class.image.kwargs['access_key_secret'] = \
                app.config['AWS_SECRET_ACCESS_KEY']
            form_class.image.kwargs['static_root_parent'] = \
                static_root_parent

        return form_class


admin.add_view(ThingyView(Thingy, db.session, name='Thingies'))
(From: https://github.com/Jaza/flask-s3-save-example/blob/master/project/foo/admin.py).
Note that flask-admin-s3-upload only handles saving, not deleting (just as the regular Flask-Admin file / image upload fields only handle saving). If you wanted to handle deleting files in the admin as well, you could (for example) use s3-saver, and hook it in to one of the Flask-Admin event callbacks.
I'd also like to mention: one thing that others have implemented in Flask, is direct JavaScript-based upload to S3. Implementing this sort of functionality in my tool suite would be a great next step; however, it would have to play nice with everything else I've built (particularly with flask-thumbnails-s3), and it would have to work for local- and for S3-based files, the same as all the other tools do. I don't have time to address those hurdles right now – another area where contributions are welcome.
I hope that this article serves as a comprehensive guide, of how to use the Flask S3 tools that I've recently built and contributed to the community. Any questions or concerns, please drop me a line.
Access-Control-Allow-Origin
HTTP response header. For example, this is the error message that's shown in Google Chrome for such a request:
Font from origin 'http://foo.local' has been blocked from loading by Cross-Origin Resource Sharing policy: No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://bar.foo.local' is therefore not allowed access.
As a result of this, I had to quickly learn how to conditionally add custom HTTP response headers based on the URL being requested, both for Flask (when running locally with Flask's built-in development server), and for Apache (when running in staging and production). In a typical production Flask setup, it's impossible to do anything at the Python level when serving static files, because these are served directly by the web server (e.g. Apache, Nginx), without ever hitting WSGI. Conversely, in a typical development setup, there is no web server running separately to the WSGI app, and so playing around with static files must be done at the Python level.
For a regular Flask request that's handled by one of the app's custom routes, adding another header to the HTTP response would be a simple matter of modifying the flask.Response object before returning it. However, static files (in a development setup) are served by Flask's built-in app.send_static_file() function, not by any route that you have control over. So, instead, it's necessary to intercept the response object via Flask's API.
Fortunately, this interception is easily accomplished, courtesy of Flask's app.after_request() function, which can either be passed a callback function, or used as a decorator. Here's what did the trick for me:
import re
from flask import Flask
from flask import request
app = Flask(__name__)
def add_headers_to_fontawesome_static_files(response):
    """
    Fix for font-awesome files: after Flask static send_file() does its
    thing, but before the response is sent, add an
    Access-Control-Allow-Origin: *
    HTTP header to the response (otherwise browsers complain).
    """
    if (request.path and
            re.search(r'\.(ttf|woff|svg|eot)$', request.path)):
        response.headers.add('Access-Control-Allow-Origin', '*')
    return response

if app.debug:
    app.after_request(add_headers_to_fontawesome_static_files)
For a production setup, the above Python code achieves nothing, and it's therefore necessary to add something like this to the config file for the app's VirtualHost:
<VirtualHost *:80>
    # ...
    Alias /static /path/to/myapp/static
    <Location /static>
        Order deny,allow
        Allow from all
        Satisfy Any
        SetEnvIf Request_URI "\.(ttf|woff|svg|eot)$" is_font_file
        Header set Access-Control-Allow-Origin "*" env=is_font_file
    </Location>
</VirtualHost>
And there you go: an easy way to add custom HTTP headers to any response, in two different web server environments, based on a conditional request path. So far, cleanly serving cross-domain font files is all that I've needed this for. But it's a very handy little snippet, and no doubt there are plenty of other scenarios in which it could save the day.
However, if your program hasn't got much else to do in the meantime (as was the case for me), threads are still very useful, because they allow you to report on the progress of a long-running task at the UI level, which is better than your task simply blocking execution, leaving the UI hanging, and providing no feedback.
As part of coding up FotoJazz, I developed a re-usable architecture for running batch processing tasks in a thread, and for reporting on the thread's progress in both a web-based (AJAX-based) UI, and in a shell UI. This article is a tour of what I've developed, in the hope that it helps others with their thread progress monitoring needs in Python or in other languages.
The foundation of the system is a Python class called FotoJazzProcess, which is in the project/fotojazz/fotojazzprocess.py file in the source code. This is a base class, designed to be sub-classed for actual implementations of batch tasks; although the base class itself also contains a "dummy" batch task, which can be run and monitored for testing / example purposes. All the dummy task does, is sleep for 100ms, for each file in the directory path provided:
#!/usr/bin/env python

# ...

from threading import Thread
from time import sleep


class FotoJazzProcess(Thread):
    """Parent / example class for running threaded FotoJazz processes.
    You should use this as a base class if you want to process a
    directory full of files, in batch, within a thread, and you want to
    report on the progress of the thread."""
    # ...

    filenames = []
    total_file_count = 0

    # This number is updated continuously as the thread runs.
    # Check the value of this number to determine the current progress
    # of FotoJazzProcess (if it equals 0, progress is 0%; if it equals
    # total_file_count, progress is 100%).
    files_processed_count = 0

    def __init__(self, *args, **kwargs):
        """When initialising this class, you can pass in either a list
        of filenames (first param), or a string of space-delimited
        filenames (second param). No need to pass in both."""
        Thread.__init__(self)
        # ...

    def run(self):
        """Iterates through the files in the specified directory. This
        example implementation just sleeps on each file - in subclass
        implementations, you should do some real processing on each
        file (e.g. re-orient the image, change date modified). You
        should also generally call self.prepare_filenames() at the
        start, and increment self.files_processed_count, in subclass
        implementations."""
        self.prepare_filenames()

        for filename in self.filenames:
            sleep(0.1)
            self.files_processed_count += 1
You could monitor the thread's progress, simply by checking obj.files_processed_count from your calling code. However, the base class also provides some convenience methods, for getting the progress value in a more refined form — i.e. as a percentage value, or as a formatted string:
    # ...

    def percent_done(self):
        """Gets the current percent done for the thread."""
        return float(self.files_processed_count) / \
            float(self.total_file_count) \
            * 100.0

    def get_progress(self):
        """Can be called at any time before, during or after thread
        execution, to get current progress."""
        return '%d files (%.2f%%)' % (self.files_processed_count,
                                      self.percent_done())
FotoJazzProcessShellRun contains all the code needed to report on a thread's progress via the command-line. All you have to do is instantiate it, and pass it a class (as an object) that inherits from FotoJazzProcess (or, if no class is provided, it uses the FotoJazzProcess base class). Then, execute the instantiated object — it takes care of the rest for you:
class FotoJazzProcessShellRun(object):
    """Runs an instance of the thread with shell output / feedback."""

    def __init__(self, init_class=FotoJazzProcess):
        self.init_class = init_class

    def __call__(self, *args, **kwargs):
        # ...

        fjp = self.init_class(*args, **kwargs)

        print '%s threaded process beginning.' % fjp.__class__.__name__
        print '%d files will be processed. ' % fjp.total_file_count + \
            'Now beginning progress output.'
        print fjp.get_progress()

        fjp.start()

        while fjp.is_alive() and \
              fjp.files_processed_count < fjp.total_file_count:
            sleep(1)
            if fjp.files_processed_count < fjp.total_file_count:
                print fjp.get_progress()

        print fjp.get_progress()
        print '%s threaded process complete. Now exiting.' \
            % fjp.__class__.__name__


if __name__ == '__main__':
    FotoJazzProcessShellRun()()
At this point, we're able to see the progress feedback in action already, through the command-line interface. This is just running the dummy batch task, but the feedback looks the same regardless of what process is running:
The way this command-line progress system is implemented, it provides feedback once per second (timing handled with a simple sleep()
call), and outputs feedback in terms of both number of files and percentage done. These details, of course, merely form an example for the purposes of this article — when implementing your own command-line progress feedback, you would change these details per your own tastes and needs.
Cool, we've now got a framework for running batch tasks within a thread, and for monitoring the progress of the thread; and we've built a simple interface for printing the thread's progress via command-line execution.
That was the easy part! Now, let's build an AJAX-powered web front-end on top of all that.
To start off, let's look at the basic HTML we'd need, for allowing the user to initiate a batch task (e.g. by pushing a submit button), and to see the latest progress of that task (e.g. with a JavaScript progress bar widget):
<div class="operation">
  <h2>Run dummy task</h2>
  <div class="operation-progress" id="operation-dummy-progress"></div>
  <input type="submit" value="Run dummy task" id="operation-dummy" />
</div><!-- /#operation -->
Close your eyes for a second, and pretend we've also just coded up some gorgeous, orgasmic CSS styling for this markup (and don't worry about the class / id names for now, either — they're needed for the JavaScript, which we'll get to shortly). Now, open your eyes, and behold! A glorious little web-based dialog for our dummy task:
That's a lovely little interface we've just built. Now, let's begin to actually make it do something. Let's write some JavaScript that hooks into our new submit button and progress indicator (with the help of jQuery, and the jQuery UI progress bar — this code can be found in the static/js/fotojazz.js file in the source code):
fotojazz.operations = function() {
  function process_start(process_css_name,
                         process_class_name,
                         extra_args) {
    // ...

    $('#operation-' + process_css_name).click(function() {
      // ...

      $.getJSON(SCRIPT_ROOT + '/process/start/' +
                process_class_name + '/',
      args,
      function(data) {
        $('#operation-' + process_css_name).attr('disabled',
                                                 'disabled');
        $('#operation-' + process_css_name + '-progress')
          .progressbar('option', 'disabled', false);
        $('#operation-' + process_css_name + '-progress')
          .progressbar('option', 'value', data.percent);

        setTimeout(function() {
          process_progress(process_css_name,
                           process_class_name,
                           data.key);
        }, 100);
      });

      return false;
    });
  }

  function process_progress(process_css_name,
                            process_class_name,
                            key) {
    $.getJSON(SCRIPT_ROOT + '/process/progress/' +
              process_class_name + '/',
    {
      'key': key
    }, function(data) {
      $('#operation-' + process_css_name + '-progress')
        .progressbar('option', 'value', data.percent);

      if (!data.done) {
        setTimeout(function() {
          process_progress(process_css_name,
                           process_class_name,
                           data.key);
        }, 100);
      }
      else {
        $('#operation-' + process_css_name)
          .removeAttr('disabled');
        $('#operation-' + process_css_name + '-progress')
          .progressbar('option', 'value', 0);
        $('#operation-' + process_css_name + '-progress')
          .progressbar('option', 'disabled', true);

        // ...
      }
    });
  }

  // ...

  return {
    init: function() {
      $('.operation-progress').progressbar({'disabled': true});

      // ...

      process_start('dummy', 'FotoJazzProcess');

      // ...
    }
  }
}();

$(function() {
  fotojazz.operations.init();
});
This code is best read by starting at the bottom. First off, we call fotojazz.operations.init(). If you look up just a few lines, you'll see that function defined (it's the init: function() one). In the init() function, the first thing we do is initialise a (disabled) jQuery progress bar widget, on our div with class operation-progress. Then, we call process_start(), passing in a process_css_name of 'dummy', and a process_class_name of 'FotoJazzProcess'.
The process_start() function binds all of its code to the click() event of our submit button. So, when we click the button, an AJAX request is sent to the path /process/start/process_class_name/ on the server side. We haven't yet implemented this server-side callback, but for now let's assume that (as its pathname suggests), this callback starts a new process thread, and returns some info about the new thread (e.g. a reference ID, a progress indication, etc). The AJAX 'success' callback for this request then waits 100ms (with the help of setTimeout()), before calling process_progress(), passing it the CSS name and the class name that process_start() originally received, plus data.key, which is the unique ID of the new thread on the server.
The main job of process_progress(), is to make AJAX calls to the server that request the latest progress of the thread (again, let's imagine that the callback for this is done on the server side). When it receives the latest progress data, it then updates the jQuery progress bar widget's value, waits 100ms, and calls itself recursively. Via this recursion loop, it continues to update the progress bar widget, until the process is 100% complete, at which point the JavaScript terminates, and our job is done.
This code is extremely generic and re-usable. There's only one line in all the code, that's actually specific to the batch task that we're running: the process_start('dummy', 'FotoJazzProcess'); call. To implement another task on the front-end, all we'd have to do is copy and paste this one-line function call, changing the two parameter values that get passed to it (along with also copy-pasting the HTML markup to match). Or, if things started to get unwieldy, we could even put the function call inside a loop, and iterate through an array of parameter values.
Now, let's take a look at the Python code to implement our server-side callback paths (which, in this case, are built as views in the Flask framework, and can be found in the project/fotojazz/views.py file in the source code):
from uuid import uuid4

from flask import jsonify
from flask import Module
from flask import request

from project import fotojazz_processes

# ...

mod = Module(__name__, 'fotojazz')

# ...

@mod.route('/process/start/<process_class_name>/')
def process_start(process_class_name):
    """Starts the specified threaded process. This is a sort-of
    'generic' view, all the different FotoJazz tasks share it."""

    # ...

    process_module_name = process_class_name
    if process_class_name != 'FotoJazzProcess':
        process_module_name = process_module_name.replace('Process', '')
    process_module_name = process_module_name.lower()

    # Dynamically import the class / module for the particular process
    # being started. This saves needing to import all possible
    # modules / classes.
    process_module_obj = __import__('%s.%s.%s' % ('project',
                                                  'fotojazz',
                                                  process_module_name),
                                    fromlist=[process_class_name])
    process_class_obj = getattr(process_module_obj, process_class_name)

    # ...

    # Initialise the process thread object.
    fjp = process_class_obj(*args, **kwargs)
    fjp.start()

    if not process_class_name in fotojazz_processes:
        fotojazz_processes[process_class_name] = {}
    key = str(uuid4())

    # Store the process thread object in a global dict variable, so it
    # continues to run and can have its progress queried, independent
    # of the current session or the current request.
    fotojazz_processes[process_class_name][key] = fjp

    percent_done = round(fjp.percent_done(), 1)
    done = False

    return jsonify(key=key, percent=percent_done, done=done)


@mod.route('/process/progress/<process_class_name>/')
def process_progress(process_class_name):
    """Reports on the progress of the specified threaded process.
    This is a sort-of 'generic' view, all the different FotoJazz tasks
    share it."""
    key = request.args.get('key', '', type=str)

    if not process_class_name in fotojazz_processes:
        fotojazz_processes[process_class_name] = {}
    if not key in fotojazz_processes[process_class_name]:
        return jsonify(error='Invalid process key.')

    # Retrieve progress of requested process thread, from global
    # dict variable where the thread reference is stored.
    percent_done = fotojazz_processes[process_class_name][key] \
        .percent_done()

    done = False
    if not fotojazz_processes[process_class_name][key].is_alive() or \
            percent_done == 100.0:
        del fotojazz_processes[process_class_name][key]
        done = True
    percent_done = round(percent_done, 1)

    return jsonify(key=key, percent=percent_done, done=done)
As with the JavaScript, these Python functions are completely generic and re-usable. The process_start() function dynamically imports and instantiates the process class object needed for this particular task, based on the parameter sent to it in the URL path. It then kicks off the thread, and stores the thread in fotojazz_processes, which is a global dictionary variable. A unique ID is generated as the key for this dictionary, and that ID is then sent back to the JavaScript, via the JSON response object.
The process_progress() function retrieves the running thread by its unique key, and finds the progress of the thread as a percentage value. It also checks if the thread is now finished, as this is valuable information back on the JavaScript end (we don't want that recursive AJAX polling to continue forever!). It also returns its data to the front-end, via a JSON response object.
With code now in place at all necessary levels, our AJAX interface to the dummy batch task should now be working smoothly:
Absolutely no extra Python view code is needed, in order to implement new batch tasks. As long as the correct new thread class (inheriting from FotoJazzProcess) exists and can be found, everything Just Works™. Not bad, eh?
Progress feedback on threads is a fairly common development pattern in more traditional desktop GUI apps. There's a lot of info out there on threads and progress bars in Python's version of the Qt GUI library, for example. However, I had trouble finding much info about implementing threads and progress bars in a web-based app. Hopefully, this article will help those of you looking for info on the topic.
The example code I've used here is taken directly from my FotoJazz app, and is still loosely coupled to it. As such, it's example code, not a ready-to-go framework or library for Python threads with web-based progress indication. However, it wouldn't take that much more work to get the code to that level. Consider it your homework!
Also, an important note: the code demonstrated in this article — and the FotoJazz app in general — is not suitable for a real-life online web app (in its current state), as it has not been developed with security, performance, or scalability in mind at all. In particular, I'm pretty sure that the AJAX in its current state is vulnerable to all sorts of CSRF attacks; not to mention the fact that all sorts of errors and exceptions are liable to occur, most of them currently uncaught. I'm also a total newbie to threads, and I understand that threads in web apps are particularly prone to cause strange explosions. You must remember: FotoJazz is a web-based desktop app, not an actual web app; and web-based desktop app code is not necessarily web-app-ready code.
Finally, what I've demonstrated here is not particularly specific to the technologies I've chosen to use. Instead of jQuery, any number of other JavaScript libraries could be used (e.g. YUI, Prototype). And instead of Python, the whole back-end could be implemented in any other server-side language (e.g. PHP, Java), or in another Python framework (e.g. Django, web.py). I'd be interested to hear if anyone else has done (or plans to do) similar work, but with a different technology stack.
Sadly, my system has had some disadvantages. Most importantly, there are too many separate scripts / apps involved, and with too many different interfaces (mix of manual point-and-click, drag-and-drop, and command-line). Ideally, I'd like all the functionality unified in one app, with one streamlined graphical interface (and also everything with equivalent shell access). Also, my various tools are platform-dependent, with most of them being Windows-based, and one being *nix-based. I'd like everything to be platform-independent, and in particular, I'd like everything to run best on Linux — as I'm trying to do as much as possible on Ubuntu these days.
Plus, I felt in the mood for getting my hands dirty coding up the photo-management app of my dreams. Hence, it is with pleasure that I present FotoJazz, a browser-based (plus shell-accessible) tool built with Python and Flask.
FotoJazz is a simple app, that performs a few common tasks involved in cleaning up photos copied off a digital camera. It does the following:
FotoJazz rotates an image to its correct orientation, per its Exif metadata. This is done via the exiftran utility. Some people don't bother to rotate their photos, as many modern apps pay attention to the Exif orientation metadata anyway, when displaying a photo. However, not all apps do (in particular, the Windows XP / Vista / 7 default photo viewer does not). I like to be on the safe side, and to rotate the actual image myself.
I was previously doing this manually, using the 'rotate left / right' buttons in the Windows photo viewer. Hardly ideal. Discovering exiftran was a very pleasant surprise for me — I thought I'd at least have to code an auto-orientation script myself, but turns out all I had to do was build on the shoulders of giants. After doing this task manually for so long, I can't say I 100% trust the Exif orientation tags in my digital photos. But that's OK — while I wait for my trust to develop, FotoJazz lets me review Exiftran's handiwork as part of the process.
FotoJazz shifts the Exif 'date taken' value of an image backwards or forwards by a specified time interval. This is handy in two situations that I find myself facing quite often. First, the clock on my camera has been set wrongly, usually if I recently travelled to a new time zone and forgot to adjust it (or if daylight savings has recently begun or ended). And secondly, if I copy photos from a friend's camera (to add to my own photo collection), and the clock on my friend's camera has been set wrongly (this is particularly bad, because I'll usually then be wanting to merge my friend's photos with my own, and to sort the combined set of photos by date / time). In both cases, the result is a batch of photos whose 'date taken' values are off by a particular time interval.
FotoJazz lets you specify a time interval in the format:
[-][Xhr][Xm][Xs]
For example, to shift dates forward by 3 hours and 30 seconds, enter:
3hr30s
Or to shift dates back by 23 minutes, enter:
-23m
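A format like this is straightforward to parse with the standard library. Here's a minimal sketch of such a parser — a hypothetical helper for illustration, not FotoJazz's actual implementation:

```python
import re
from datetime import timedelta

# Hypothetical parser for the [-][Xhr][Xm][Xs] format described above.
SHIFT_RE = re.compile(r'^(-)?(?:(\d+)hr)?(?:(\d+)m)?(?:(\d+)s)?$')

def parse_shift(spec):
    """Turn a string like '3hr30s' or '-23m' into a timedelta."""
    match = SHIFT_RE.match(spec)
    # Reject strings that don't match, or that contain no components.
    if not match or not any(match.groups()[1:]):
        raise ValueError('Invalid shift spec: %r' % spec)
    sign, hours, minutes, seconds = match.groups()
    delta = timedelta(hours=int(hours or 0),
                      minutes=int(minutes or 0),
                      seconds=int(seconds or 0))
    return -delta if sign else delta
```

The resulting timedelta can then simply be added to each photo's 'date taken' value.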
I was previously doing this using Exif Date Changer, a small freeware Windows app. Exif Date Changer works quite well, and it has a nice enough interface; but it is Windows-only. It also has a fairly robust batch rename feature, which unfortunately doesn't support my preferred renaming scheme (which I'll be discussing next).
FotoJazz renames a batch of images per a specified prefix, and with a unique integer ID. For example, say you specify this prefix:
new_york_trip_may2008
And say you have 11 photos in your set. The photos would then be renamed to:
new_york_trip_may2008_01.jpg
new_york_trip_may2008_02.jpg
new_york_trip_may2008_03.jpg
new_york_trip_may2008_04.jpg
new_york_trip_may2008_05.jpg
new_york_trip_may2008_06.jpg
new_york_trip_may2008_07.jpg
new_york_trip_may2008_08.jpg
new_york_trip_may2008_09.jpg
new_york_trip_may2008_10.jpg
new_york_trip_may2008_11.jpg
As you can see, the unique ID added to the filenames is padded with leading zeros, as needed per the batch. This is important for sorting the photos by filename in most systems / apps.
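The padding logic can be sketched in a few lines of stdlib-only Python — a hypothetical illustration of the scheme, not FotoJazz's actual renaming code:

```python
from os import path

def batch_rename_plan(filenames, prefix):
    """Map each filename to '<prefix>_<NN><ext>', zero-padding the
    counter to the width needed for the whole batch."""
    # e.g. 11 files -> width 2 ('_01' .. '_11'); 100 files -> width 3.
    width = len(str(len(filenames)))
    return ['%s_%0*d%s' % (prefix, width, i, path.splitext(name)[1])
            for i, name in enumerate(sorted(filenames), 1)]
```

For a batch of 11 '.jpg' files and the prefix 'new_york_trip_may2008', this yields exactly the filenames listed above.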
I was previously using mvb for this. Mvb ("batch mv") is a bash script that renames files according to the same scheme — i.e. you specify a prefix, and it renames the files with the prefix, plus a unique incremented ID padded with zeros. Unfortunately, mvb always worked extremely slowly for me (probably because I ran it through cygwin, hardly ideal).
FotoJazz updates the 'date modified' metadata of an image to match its 'date taken' value. It will also fix the date accessed, and the Exif 'PhotoDate' value (which might be different to the Exif 'PhotoDateOriginal' value, which is the authoritative 'date taken' field). This is very important for the many systems / apps that sort photos by their 'date modified' file metadata, rather than by their 'date taken' Exif metadata.
I was previously using JpgDateChanger for this task. I had no problems with JpgDateChanger — it has a great drag-n-drop interface, and it's very fast. However, it is Windows-based, and it is one more app that I have to open as part of my workflow.
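The file-level half of this task — setting a file's 'date modified' and 'date accessed' from a datetime value — needs nothing beyond the standard library. A minimal sketch (in FotoJazz the datetime would come from the photo's Exif metadata via pyexiv2; here it's simply passed in):

```python
import os
import time

def set_file_dates(filepath, date_taken):
    """Set a file's access and modification times to date_taken
    (a datetime.datetime instance, assumed to be in local time)."""
    stamp = time.mktime(date_taken.timetuple())
    # os.utime takes an (atime, mtime) pair of Unix timestamps.
    os.utime(filepath, (stamp, stamp))
```

The Exif-level half of the task (rewriting the 'PhotoDate' tag) would still go through an Exif library such as pyexiv2.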
All of the functionality of FotoJazz can also be accessed via the command-line. This is great if you want to use one or more FotoJazz features as part of another script, or if you just don't like using GUIs. For example, to do some date shifting on the command line, just enter a command like this:
./project/fotojazz/shiftdate.py /path/to/photos/ 3hr30s
More information on shell usage is available in the README file.
I've been getting into Python a lot lately, and FotoJazz was a good excuse to do some solid Python hacking, I don't deny it. I've also been working with Django a lot, but I haven't before used a Python microframework. FotoJazz was a good excuse to dive into one for the first time, and the microframework that I chose was Flask (and Flask ships with the Jinja template engine, something I was also overdue on playing with).
From my point of view, FotoJazz's coolest code feature is its handling of the batch photo tasks as threads. This is mainly encapsulated in the FotoJazzProcess Python class in the code. The architecture allows the tasks to run asynchronously, and for either the command-line or the browser-based (slash AJAX-based) interface to easily provide feedback on the progress of the thread. I'll be discussing this in more detail, in a separate article — stay tuned.
Update (30 Jun 2011): said separate article on thread progress monitoring in Python is now published.
FotoJazz makes heavy use of pyexiv2 for its reading / writing of Jpg Exif metadata within a Python environment. Also, as mentioned earlier, it uses exiftran for the photo auto-orientation task; exiftran is called directly on the command-line, and its stream output is captured, monitored, and transformed into progress feedback on the Python end.
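The capture-and-monitor pattern mentioned above can be sketched generically with subprocess. Note that the exiftran invocation shown in the comment is an assumption about flags and output format, not taken from FotoJazz's code:

```python
import subprocess

def stream_lines(cmd):
    """Run cmd, yielding each line of its combined stdout / stderr
    output as it arrives, so the caller can turn lines into progress."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    # readline() returns b'' only at EOF, so this loop ends when the
    # process closes its output.
    for line in iter(proc.stdout.readline, b''):
        yield line.decode('utf-8', 'replace').rstrip()
    proc.wait()

# Intended use, assuming exiftran emits a line per processed file:
#     for line in stream_lines(['exiftran', '-a', '-i'] + filenames):
#         files_processed_count += 1
```

Each yielded line can then drive the same files_processed_count counter that the threaded progress system monitors.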
All the code is available on GitHub. Use it as you will: hack, fork, play.