PDF processing API using Flask + Heroku + Gunicorn

Hello! While working on a personal project, I needed to process PDF files with Python in the cloud. To do that, I learned to use Flask, Gunicorn and Heroku. In this post I will show how I did it.

Heroku is a platform-as-a-service (PaaS) that enables developers to build, run, and operate applications entirely in the cloud.

Gunicorn is a Python WSGI HTTP server for UNIX. It is compatible with many web frameworks, easy to deploy, light on server resources, and quite fast.

Flask is a minimalist framework written in Python that allows you to create web applications quickly and with a minimum number of lines of code.
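
To make the term WSGI a little more concrete, here is a minimal sketch of the kind of callable that a WSGI server such as Gunicorn invokes for each request. The name hello_app and the response text are made up for this illustration. Flask's app object exposes the same interface for you, so you never have to write it by hand in this post.

```python
# A bare WSGI application: a callable taking (environ, start_response).
# hello_app is a hypothetical name used only for this illustration.
def hello_app(environ, start_response):
    body = b"Hello from WSGI"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    # The server supplies start_response; we hand it the status and headers.
    start_response("200 OK", headers)
    # The return value is an iterable of byte strings.
    return [body]
```

If you want to try it, make_server from the standard library module wsgiref.simple_server can serve a callable like this one without installing anything.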

Creating the server locally

First, we will create and test the server locally.

To do this, create your working folder, enter it, and then create a Python virtual environment:

python -m venv env

Then activate it. On Windows:

.\env\Scripts\activate

On Linux or macOS:

source env/bin/activate

.gitignore

This file keeps the virtual environment, with all its external dependencies, out of the repository you push to Heroku. Heroku installs those dependencies automatically from requirements.txt. Create a file called .gitignore that contains only the line env/. The command echo env/ > .gitignore does this on both UNIX and Windows.

server.py

Now it’s time to create our web application. In my case, I’ll call the file server.py. I’ll give the basic code to create an application that processes PDF files, but your application can be anything you want.

To start working with Flask and Gunicorn, install them with pip install flask flask_cors gunicorn. If you are going to follow this example and edit PDFs, also install PyPDF2 with pip install PyPDF2. The full code is below:

# Import Flask and other functions
from flask import Flask, request, send_file, jsonify
from werkzeug.utils import secure_filename
import tempfile
import os
import gc
from io import BytesIO

# Import CORS to receive requests from any origin
from flask_cors import CORS

# import PyPDF2
from PyPDF2 import PdfReader, PdfWriter

# Create the Flask instance
app = Flask(__name__)

# enable CORS in the Flask app
CORS(app)

# The function that processes the PDF
# This example only removes links from the pdf
def process_pdf_file(pdf_path):
    try:
        output = pdf_path[:-4] + "_out.pdf"
        reader = PdfReader(pdf_path)
        writer = PdfWriter()

        for i in range(len(reader.pages)):
            page = reader.pages[i]
            writer.add_page(page)
            
        writer.remove_links()

        with open(output, "wb") as fp:
            writer.write(fp)

        gc.collect()
        return {
            "Success": True,
            "return_path": output,
            "Error": "",
        }
    except Exception as e:
        return {
            "Success": False,
            "return_path": "",
            "Error": str(e),
        }

# Define the endpoint of the api, the allowed methods and the function that is called
@app.route('/upload', methods=['POST'])
def upload_file():
    # We verify that a file has been uploaded
    if 'file' not in request.files:
        return {"Success": False, "Error": "No file part in the request."}, 400

    # We verify that the file has a name (it is not an empty field)
    file = request.files['file']
    if file.filename == '':
        return {"Success": False, "Error": "No file selected."}, 400

    # Check if the file is a PDF
    if not file.filename.lower().endswith('.pdf'):
        return {"Success": False, "Error": "Invalid file type. Only PDF files are allowed."}, 400
    
    # temporarily save the file with a random name
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_pdf:
        file.save(temp_pdf.name)

        # The processing function is called
        result = process_pdf_file(temp_pdf.name)

        temp_pdf.close()

        # delete the initial file
        os.remove(temp_pdf.name)

    if result["Success"]:
        output_path = result["return_path"]

        # We save the data of the modified file
        with open(output_path, 'rb') as f:
            file_data = BytesIO(f.read())

        # And delete it from disk
        os.remove(output_path)

        # Call the garbage collector
        # (I've had memory leak failures)
        gc.collect()

        # return the modified file
        return send_file(file_data, as_attachment=True, download_name=secure_filename(os.path.basename(output_path)))
    else:
        gc.collect()
        return jsonify({"Success": False, "Error": result["Error"]}), 500

# To run locally
if __name__ == '__main__':
    app.run()

To test the server locally, start it with python server.py. It will be available at http://localhost:5000
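
To check the /upload endpoint once the server is running, you need a client that sends a multipart/form-data request with a field named file. The following is a minimal sketch using only the standard library. The helper names build_multipart and upload_pdf, and the file names sample.pdf and result.pdf, are assumptions made for this example and are not part of the server code.

```python
import os
import urllib.request
import uuid


def build_multipart(field_name, filename, data):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def upload_pdf(url, pdf_path, out_path):
    """POST pdf_path to the upload endpoint and save the reply to out_path."""
    with open(pdf_path, "rb") as f:
        body, content_type = build_multipart(
            "file", os.path.basename(pdf_path), f.read()
        )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # urlopen raises urllib.error.HTTPError for the 400/500 error replies.
    with urllib.request.urlopen(request) as response:
        with open(out_path, "wb") as out:
            out.write(response.read())


# Example call, with the server from this post running locally
# (sample.pdf and result.pdf are hypothetical file names):
# upload_pdf("http://localhost:5000/upload", "sample.pdf", "result.pdf")
```

The field name must be file, because that is the key the server looks up in request.files, and the file name must end in .pdf to pass the extension check.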

Procfile

To deploy the server on Heroku using Gunicorn we will need a file called Procfile, which in our case will contain the following:

web: gunicorn server:app

where server is the name of the Python file and app is the Flask instance.
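
The server:app notation can be read as "import the module server and take its attribute app". The sketch below is a simplified illustration of that idea, not Gunicorn's actual loading code. The function name resolve_app is an assumption made for this example.

```python
import importlib


def resolve_app(target):
    """Split a 'module:attribute' string and return the named object."""
    module_name, _, attribute = target.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# With server.py on the import path, this would return the Flask instance:
# app = resolve_app("server:app")
```

This is why the Procfile has to change if you rename the file or the variable: for a file called api.py with an instance called application, the target would become api:application.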

requirements.txt

This is the last file we need. It lists all the dependencies of the project, and Heroku needs it to install them. To generate it, run pip freeze > requirements.txt with the virtual environment activated.
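
If you are curious about what ends up in that file, the following sketch lists the packages installed in the current environment in the same name==version form. It is only an approximation written for this post (the function name freeze_lines is made up). pip freeze itself handles more cases, such as editable installs, so keep using it to generate the real file.

```python
from importlib.metadata import distributions


def freeze_lines():
    """Return sorted 'name==version' strings for installed distributions."""
    lines = set()
    for dist in distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries whose metadata has no readable name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


if __name__ == "__main__":
    print("\n".join(freeze_lines()))
```

Running it inside the activated virtual environment should show Flask, Gunicorn, PyPDF2 and the packages they depend on, which is why the file is longer than the handful of packages you installed by hand.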

Heroku

To deploy to Heroku easily, install the Heroku CLI. Once it is installed and you have logged in (with heroku login), go to the working folder that contains requirements.txt, server.py and Procfile, and run heroku create to create the application.

Now it’s time to create a git repository with our code and finally deploy it.

  • git init
  • heroku git:remote -a <your app name, for example shrouded-shelf-55195 >
  • git add .
  • git commit -am "make it better"
  • git push heroku master (or git push heroku main, if your branch is called main)

If everything went well, the last command will deploy your application to Heroku! 😀