
Facial Recognition on a TX1

Getting Started

Here's a way to hack a facial recognition system together in a relatively short time on NVIDIA's Jetson TX1.

Installation and Setup

Assumptions: you have a TX1 with a fresh install of JetPack 2.3 L4T.

First things first. We need to remove all the fat from the install. There are tons of optimized libraries in JetPack, but many of them take up the valuable storage space we need to get the facial recognition app up and running.

                
    # get rid of libreoffice, games, libvisionworks, perfkit, multimedia api, opencv4tegra, etc.
    sudo apt-get purge 'libreoffice*'
    sudo apt-get purge aisleriot gnome-sudoku mahjongg ace-of-penguins gnomine gbrainy
    sudo apt-get clean
    sudo apt-get autoremove

    rm -rf libvision*
    rm -rf PerfKit*

    # something along these lines; might be different for you
    # delete all libvision-works and opencv4tegra stuff
    cd var && rm -rf libopencv4tegra* && rm -rf libvision*

    # I deleted practically everything. Almost as if I shouldn't have even installed JetPack in the first place
    # delete all deb files, Firefox, chrome, all the stuff I really didn't need that was taking up memory. 
    # find big files and remove them assuming they're not important. Google is your friend.
    find / -size +10M -ls
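
If you'd rather see the biggest offenders first, here is a minimal Python sketch that does roughly what the `find` command above does and sorts the results by size. The function name `find_large_files` and the 10 MB default are my own choices for illustration, not part of JetPack or any tool.

```python
import os

def find_large_files(root, min_bytes=10 * 1024 * 1024):
    """Return (size, path) pairs for files larger than min_bytes, biggest first."""
    results = []
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                # Skip symlinks so a link to a big file is not listed as well.
                if os.path.islink(path):
                    continue
                size = os.path.getsize(path)
            except OSError:
                # Files can vanish or be unreadable while we walk; ignore them.
                continue
            if size > min_bytes:
                results.append((size, path))
    return sorted(results, reverse=True)
```

Calling `find_large_files('/')` (as root, so every directory is readable) gives you a list to review before deleting anything.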
                

Installing protobuf, bazel, and tensorflow

Thankfully, others have paved the way and made these steps pretty much a walk in the park. Thank you to StackOverflow user, Dwight Crowe for his stellar post on how to get Tensorflow R0.9 working on a TX1. I'm literally just going to post his exact methodology.

                
    # install deps
    cd ~
    sudo add-apt-repository ppa:webupd8team/java
    sudo apt-get update
    sudo apt-get install oracle-java8-installer
    sudo apt-get install git zip unzip autoconf automake libtool curl zlib1g-dev maven swig bzip2

    # build protobuf 3.0.0-beta-2 jar
    git clone https://github.com/google/protobuf.git
    cd protobuf
    # autogen.sh downloads broken gmock.zip in d5fb408d
    git checkout master
    ./autogen.sh
    git checkout d5fb408d
    ./configure --prefix=/usr
    make -j 4
    sudo make install
    cd java
    mvn package

    # Get bazel version 0.2.1, it doesn't require gRPC
    # go back home first so bazel sits next to protobuf (the cp below uses ../protobuf)
    cd ~
    git clone https://github.com/bazelbuild/bazel.git
    cd bazel
    git checkout 0.2.1
    cp /usr/bin/protoc third_party/protobuf/protoc-linux-arm32.exe
    cp ../protobuf/java/target/protobuf-java-3.0.0-beta-2.jar third_party/protobuf/protobuf-java-3.0.0-beta-1.jar
    

Here we need to make an edit so that the bazel build will recognize aarch64 as ARM


    --- a/src/main/java/com/google/devtools/build/lib/util/CPU.java
    +++ b/src/main/java/com/google/devtools/build/lib/util/CPU.java
    @@ -25,7 +25,7 @@ import java.util.Set;
     public enum CPU {
       X86_32("x86_32", ImmutableSet.of("i386", "i486", "i586", "i686", "i786", "x86")),
       X86_64("x86_64", ImmutableSet.of("amd64", "x86_64", "x64")),
    -  ARM("arm", ImmutableSet.of("arm", "armv7l")),
    +  ARM("arm", ImmutableSet.of("arm", "armv7l", "aarch64")),
       UNKNOWN("unknown", ImmutableSet.of());
    

Now it's time to compile bazel.


    ./compile.sh
    

Now we install tensorflow R0.9. Any higher than R0.9 and it requires bazel 0.3.0, which we didn't install.

You will build tensorflow once and it will fail. But that failed build creates the bazel .cache dir where you need to place the updated config.guess and config.sub files necessary for the full installation.


    git clone -b r0.9 https://github.com/tensorflow/tensorflow.git
    cd tensorflow
    ./configure
    # this will fail, but that's ok
    bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
    

Download the proper config files and update the .cache dir


    cd ~
    wget -O config.guess 'http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD'
    wget -O config.sub 'http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub;hb=HEAD'

    # below are commands Dwight Crowe ran, yours will vary depending on .cache details.
    # look for '_bazel_socialh', 'farmhash_archive', and 'farmhash'
    cp config.guess ./.cache/bazel/_bazel_socialh/742c01ff0765b098544431b60b1eed9f/external/farmhash_archive/farmhash-34c13ddfab0e35422f4c3979f360635a8c050260/config.guess
    cp config.sub ./.cache/bazel/_bazel_socialh/742c01ff0765b098544431b60b1eed9f/external/farmhash_archive/farmhash-34c13ddfab0e35422f4c3979f360635a8c050260/config.sub
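
Since the user name and hash in that path differ on every machine, here is a small sketch that finds the farmhash directory with a glob and copies both files in. It assumes the cache layout shown above (`_bazel_<user>/<hash>/external/farmhash_archive/farmhash-<commit>`); the helper names are hypothetical.

```python
import glob
import os
import shutil

def find_farmhash_dirs(cache_root):
    """Return farmhash source dirs under a bazel cache root, e.g. ~/.cache/bazel."""
    pattern = os.path.join(cache_root, '_bazel_*', '*', 'external',
                           'farmhash_archive', 'farmhash-*')
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))

def install_config_files(cache_root, source_dir):
    """Copy config.guess and config.sub from source_dir into each farmhash dir."""
    targets = find_farmhash_dirs(cache_root)
    for target in targets:
        for name in ('config.guess', 'config.sub'):
            shutil.copy(os.path.join(source_dir, name),
                        os.path.join(target, name))
    return targets
```

With the two files downloaded to your home directory, `install_config_files(os.path.expanduser('~/.cache/bazel'), os.path.expanduser('~'))` would do the same job as the two `cp` commands. An empty return value means the failed build has not created the cache yet.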
    

Here is where things get a bit tricky. As Dwight suggests, you'll have to change a few files so that tensorflow compiles correctly.


    --- a/tensorflow/core/kernels/BUILD
    +++ b/tensorflow/core/kernels/BUILD
    @@ -985,7 +985,7 @@ tf_kernel_libraries(
             "reduction_ops",
             "segment_reduction_ops",
             "sequence_ops",
    -        "sparse_matmul_op",
    +        #DC "sparse_matmul_op",
         ],
         deps = [
             ":bounds_check",
    

    --- a/tensorflow/python/BUILD
    +++ b/tensorflow/python/BUILD
    @@ -1110,7 +1110,7 @@ medium_kernel_test_list = glob([
         "kernel_tests/seq2seq_test.py",
         "kernel_tests/slice_op_test.py",
         "kernel_tests/sparse_ops_test.py",
    -    "kernel_tests/sparse_matmul_op_test.py",
    +    #DC "kernel_tests/sparse_matmul_op_test.py",
         "kernel_tests/sparse_tensor_dense_matmul_op_test.py",
     ])
    

TX1 can't do fancy constructors in cwise_op_gpu_select.cu.cc or sparse_tensor_dense_matmul_op_gpu.cu.cc



    --- a/tensorflow/core/kernels/cwise_op_gpu_select.cu.cc
    +++ b/tensorflow/core/kernels/cwise_op_gpu_select.cu.cc
    @@ -43,8 +43,14 @@ struct BatchSelectFunctor {
         const int all_but_batch = then_flat_outer_dims.dimension(1);

     #if !defined(EIGEN_HAS_INDEX_LIST)
    -    Eigen::array<int, 2> broadcast_dims{{ 1, all_but_batch }};
    -    Eigen::Tensor<int, 2>::Dimensions reshape_dims{{ batch, 1 }};
    +    //DC Eigen::array<int, 2> broadcast_dims{{ 1, all_but_batch }};
    +    Eigen::array<int, 2> broadcast_dims;
    +    broadcast_dims[0] = 1;
    +    broadcast_dims[1] = all_but_batch;
    +    //DC Eigen::Tensor<int, 2>::Dimensions reshape_dims{{ batch, 1 }};
    +    Eigen::Tensor<int, 2>::Dimensions reshape_dims;
    +    reshape_dims[0] = batch;
    +    reshape_dims[1] = 1;
     #else
         Eigen::IndexList<Eigen::type2index<1>, int> broadcast_dims;
         broadcast_dims.set(1, all_but_batch);
    

    --- a/tensorflow/core/kernels/sparse_tensor_dense_matmul_op_gpu.cu.cc
    +++ b/tensorflow/core/kernels/sparse_tensor_dense_matmul_op_gpu.cu.cc
    @@ -104,9 +104,17 @@ struct SparseTensorDenseMatMulFunctor {
         int n = (ADJ_B) ? b.dimension(0) : b.dimension(1);

     #if !defined(EIGEN_HAS_INDEX_LIST)
    -    Eigen::Tensor<int, 2>::Dimensions matrix_1_by_nnz{{ 1, nnz }};
    -    Eigen::array<int, 2> n_by_1{{ n, 1 }};
    -    Eigen::array<int, 1> reduce_on_rows{{ 0 }};
    +    //DC Eigen::Tensor<int, 2>::Dimensions matrix_1_by_nnz{{ 1, nnz }};
    +    Eigen::Tensor<int, 2>::Dimensions matrix_1_by_nnz;
    +    matrix_1_by_nnz[0] = 1;
    +    matrix_1_by_nnz[1] = nnz;
    +    //DC Eigen::array<int, 2> n_by_1{{ n, 1 }};
    +    Eigen::array<int, 2> n_by_1;
    +    n_by_1[0] = n;
    +    n_by_1[1] = 1;
    +    //DC Eigen::array<int, 1> reduce_on_rows{{ 0 }};
    +    Eigen::array<int, 1> reduce_on_rows;
    +    reduce_on_rows[0] = 0;
     #else
         Eigen::IndexList<Eigen::type2index<1>, int> matrix_1_by_nnz;
         matrix_1_by_nnz.set(1, nnz);
    

Running with CUDA 8.0 requires new macros for FP16. Dwight throws some thanks to Kashif/Mrry for pointing out the fix, so I'm throwing some thanks to whoever those people are as well.



    --- a/tensorflow/stream_executor/cuda/cuda_blas.cc
    +++ b/tensorflow/stream_executor/cuda/cuda_blas.cc
    @@ -25,6 +25,12 @@ limitations under the License.
     #define EIGEN_HAS_CUDA_FP16
     #endif

    +#if CUDA_VERSION >= 8000
    +#define SE_CUDA_DATA_HALF CUDA_R_16F
    +#else
    +#define SE_CUDA_DATA_HALF CUBLAS_DATA_HALF
    +#endif
    +
     #include "tensorflow/stream_executor/cuda/cuda_blas.h"

     #include 
    @@ -1680,10 +1686,10 @@ bool CUDABlas::DoBlasGemm(
       return DoBlasInternal(
           dynload::cublasSgemmEx, stream, true /* = pointer_mode_host */,
           CUDABlasTranspose(transa), CUDABlasTranspose(transb), m, n, k, &alpha,
    -      CUDAMemory(a), CUBLAS_DATA_HALF, lda,
    -      CUDAMemory(b), CUBLAS_DATA_HALF, ldb,
    +      CUDAMemory(a), SE_CUDA_DATA_HALF, lda,
    +      CUDAMemory(b), SE_CUDA_DATA_HALF, ldb,
           &beta,
    -      CUDAMemoryMutable(c), CUBLAS_DATA_HALF, ldc);
    +      CUDAMemoryMutable(c), SE_CUDA_DATA_HALF, ldc);
     #else
       LOG(ERROR) << "fp16 sgemm is not implemented in this cuBLAS version "
                  << "(need at least CUDA 7.5)";
    

And lastly, ARM has no NUMA nodes so this needs to be added or you will get an immediate crash on starting tf.Session()


    
    --- a/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
    +++ b/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
    @@ -888,6 +888,9 @@ CudaContext* CUDAExecutor::cuda_context() { return context_; }
     // For anything more complicated/prod-focused than this, you'll likely want to
     // turn to gsys' topology modeling.
     static int TryToReadNumaNode(const string &pci_bus_id, int device_ordinal) {
    +  // DC - make this clever later. ARM has no NUMA node, just return 0
    +  LOG(INFO) << "ARM has no NUMA node, hardcoding to return zero";
    +  return 0;
     #if defined(__APPLE__)
       LOG(INFO) << "OS X does not support NUMA - returning NUMA node zero";
       return 0;
    

So I ran into strange errors that were solved by accident. After making the above changes, bazel fails in weird places. Sometimes at a random op. Sometimes with a 'cross_tool' error. Truth be told, I accidentally reran the command with a different job number, and the op that it had failed on previously ended up compiling just fine. And that was it. Just changing the job number. I switched between 3 and 4 a few times and it compiled just fine. Very weird. But whatever. It works. Just to verify it, I repeated this process on a few devices and it always worked.


    bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package --jobs 4 #3, 4, 3, 4, etc.
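
If you don't feel like babysitting the build, the "flip the job number and try again" routine can be sketched as a small retry loop. This is only an illustration of the workaround described above, not something from Dwight's post; `build_with_retries`, the attempt limit, and the injectable `run` argument are my own assumptions.

```python
import subprocess

BUILD_CMD = ['bazel', 'build', '-c', 'opt', '--config=cuda',
             '//tensorflow/tools/pip_package:build_pip_package']

def build_with_retries(run=subprocess.call, job_counts=(4, 3), max_attempts=8):
    """Re-run the bazel build, alternating --jobs, until it exits with status 0.

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        # Alternate 4, 3, 4, 3, ... exactly like the manual reruns.
        jobs = job_counts[attempt % len(job_counts)]
        status = run(BUILD_CMD + ['--jobs', str(jobs)])
        if status == 0:
            return attempt + 1
    return None
```

Run it from the tensorflow checkout. Bazel keeps whatever compiled on earlier attempts, so each retry picks up where the last one died.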
    

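Once the build succeeds, package the wheel with bazel-bin/tensorflow/tools/pip_package/build_pip_package and pip install it, as in Dwight's post. Now that Tensorflow is installed, remove bazel and all of bazel's caches that eat up space.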


    find / -size +10M -ls
    # delete the big bazel files
    

Installing OpenCV

All we need here are the image reading and displaying ops. Nothing else. So the build is small and takes up minimal space.


    # install deps
    sudo apt-get install build-essential
    sudo apt-get install cmake git libgtk2.0-dev pkg-config libavcodec-dev libavformat-dev libswscale-dev
    sudo apt-get install python-dev python-numpy libtbb2 libtbb-dev libjpeg-dev libpng-dev libtiff-dev libjasper-dev libdc1394-22-dev gcc-4.9

    # download opencv
    git clone https://github.com/opencv/opencv && cd opencv && mkdir release && cd release

    # build 
    cmake -D CMAKE_C_COMPILER=/usr/bin/gcc-4.9 \
    -D CMAKE_CXX_COMPILER=/usr/bin/g++-4.9 \
    -D CMAKE_BUILD_TYPE=RELEASE \
    -D CMAKE_INSTALL_PREFIX=/usr/local \
    -D BUILD_opencv_python=ON  \
    -D BUILD_DOCS=OFF \
    -D BUILD_EXAMPLES=OFF  \
    -D BUILD_TESTS=OFF -D BUILD_opencv_ts=OFF  \
    -D BUILD_PERF_TESTS=OFF -D WITH_OPENCL=OFF \
    -D BUILD_SHARED_LIBS=OFF \
    -D WITH_OPENCLAMDFFT=OFF \
    -D WITH_OPENCLAMDBLAS=OFF \
    -D WITH_VA_INTEL=OFF \
    -D BUILD_opencv_python=ON \
    -D BUILD_opencv_flann=OFF \
    -D BUILD_opencv_ml=OFF \
    -D BUILD_opencv_video=OFF \
    -D BUILD_opencv_cudabgsegm=OFF \
    -D BUILD_opencv_cudafilters=OFF \
    -D BUILD_opencv_cudaimgproc=OFF \
    -D BUILD_opencv_cudawarping=OFF \
    -D BUILD_opencv_cudacodec=OFF \
    -D BUILD_opencv_objdetect=OFF \
    -D BUILD_opencv_features2d=OFF \
    -D BUILD_opencv_calib3d=OFF \
    -D BUILD_opencv_cudafeatures2d=OFF \
    -D BUILD_opencv_cudaobjdetect=OFF \
    -D BUILD_opencv_cudastereo=OFF \
    -D BUILD_opencv_cdev=ON \
    -D BUILD_opencv_java=OFF \
    -D BUILD_opencv_hal=OFF \
    -D ENABLE_NEON=OFF \
    -D BUILD_opencv_cudev=ON \
    -D BUILD_opencv_cudaarithm=OFF \
    -D BUILD_opencv_highgui=ON \
    -D BUILD_opencv_photo=ON ..

    make -j 4 && sudo make install

    

Installing Python libs - scipy, dlib, and sklearn


    sudo apt-get install libboost-all-dev python-scipy python-sklearn python-pip
    pip install dlib
    

Almost finished!

Head over to David Sandberg's tensorflow implementation of OpenFace (facenet) and download the resnet model weights in the Pre-trained model section. Then download dlib's 68-point face landmark predictor from dlib.net.


    wget -nv http://dlib.net/files/shape_predictor_68_face_landmarks.dat.bz2
    bzip2 -d shape_predictor_68_face_landmarks.dat.bz2
    

BOOM! Finished with downloads and installation. Now it's time to build our embedded face detector.

Building the Face Detection pipeline

First thing we need to do is copy align_dlib.py (which originally comes from the openface project) and make some quick changes. In the 'stock' version, it looks for the 'biggest' bounding box and only processes that one. But we're going to augment it so that it will process all the bounding boxes that it finds; i.e., every face will be classified rather than just the largest. Also, we're going to make another quick change to the alignment step to address an issue whereby the stock alignment shears the faces and warps them slightly.

It should be noted that while David Sandberg uses a version of MTCNN to detect faces, we have to use the augmented dlib version. This is done so that when the final detection system is running, the memory profile doesn't get out of whack and spontaneously kill our processes. Changing the face detector will have an effect on the overall detection accuracy of our system, but the difference will be minimal.


# align_dlib.py  
# Copyright 2015-2016 Carnegie Mellon University
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Module for dlib-based alignment."""

# NOTE: This file has been copied from the openface project.
#  https://github.com/cmusatyalab/openface/blob/master/openface/align_dlib.py

import cv2
import dlib
import numpy as np

TEMPLATE = np.float32([
    (0.0792396913815, 0.339223741112), (0.0829219487236, 0.456955367943),
    (0.0967927109165, 0.575648016728), (0.122141515615, 0.691921601066),
    (0.168687863544, 0.800341263616), (0.239789390707, 0.895732504778),
    (0.325662452515, 0.977068762493), (0.422318282013, 1.04329000149),
    (0.531777802068, 1.06080371126), (0.641296298053, 1.03981924107),
    (0.738105872266, 0.972268833998), (0.824444363295, 0.889624082279),
    (0.894792677532, 0.792494155836), (0.939395486253, 0.681546643421),
    (0.96111933829, 0.562238253072), (0.970579841181, 0.441758925744),
    (0.971193274221, 0.322118743967), (0.163846223133, 0.249151738053),
    (0.21780354657, 0.204255863861), (0.291299351124, 0.192367318323),
    (0.367460241458, 0.203582210627), (0.4392945113, 0.233135599851),
    (0.586445962425, 0.228141644834), (0.660152671635, 0.195923841854),
    (0.737466449096, 0.182360984545), (0.813236546239, 0.192828009114),
    (0.8707571886, 0.235293377042), (0.51534533827, 0.31863546193),
    (0.516221448289, 0.396200446263), (0.517118861835, 0.473797687758),
    (0.51816430343, 0.553157797772), (0.433701156035, 0.604054457668),
    (0.475501237769, 0.62076344024), (0.520712933176, 0.634268222208),
    (0.565874114041, 0.618796581487), (0.607054002672, 0.60157671656),
    (0.252418718401, 0.331052263829), (0.298663015648, 0.302646354002),
    (0.355749724218, 0.303020650651), (0.403718978315, 0.33867711083),
    (0.352507175597, 0.349987615384), (0.296791759886, 0.350478978225),
    (0.631326076346, 0.334136672344), (0.679073381078, 0.29645404267),
    (0.73597236153, 0.294721285802), (0.782865376271, 0.321305281656),
    (0.740312274764, 0.341849376713), (0.68499850091, 0.343734332172),
    (0.353167761422, 0.746189164237), (0.414587777921, 0.719053835073),
    (0.477677654595, 0.706835892494), (0.522732900812, 0.717092275768),
    (0.569832064287, 0.705414478982), (0.635195811927, 0.71565572516),
    (0.69951672331, 0.739419187253), (0.639447159575, 0.805236879972),
    (0.576410514055, 0.835436670169), (0.525398405766, 0.841706377792),
    (0.47641545769, 0.837505914975), (0.41379548902, 0.810045601727),
    (0.380084785646, 0.749979603086), (0.477955996282, 0.74513234612),
    (0.523389793327, 0.748924302636), (0.571057789237, 0.74332894691),
    (0.672409137852, 0.744177032192), (0.572539621444, 0.776609286626),
    (0.5240106503, 0.783370783245), (0.477561227414, 0.778476346951)])

INV_TEMPLATE = np.float32([
                            (-0.04099179660567834, -0.008425234314031194, 2.575498465013183),
                            (0.04062510634554352, -0.009678089746831375, -1.2534351452524177),
                            (0.0003666902601348179, 0.01810332406086298, -0.32206331976076663)])

TPL_MIN, TPL_MAX = np.min(TEMPLATE, axis=0), np.max(TEMPLATE, axis=0)
MINMAX_TEMPLATE = (TEMPLATE - TPL_MIN) / (TPL_MAX - TPL_MIN)


class AlignDlib:
    """
    Use dlib's landmark estimation to align faces.

    The alignment preprocesses faces for input into a neural network.
    Faces are resized to the same size (such as 96x96) and transformed
    to make landmarks (such as the eyes and nose) appear at the same
    location on every image.

    Normalized landmarks:

    .. image:: ../images/dlib-landmark-mean.png
    """

    #: Landmark indices corresponding to the inner eyes and bottom lip.
    INNER_EYES_AND_BOTTOM_LIP = [39, 42, 57]

    #: Landmark indices corresponding to the outer eyes and nose.
    OUTER_EYES_AND_NOSE = [36, 45, 33]

    def __init__(self, facePredictor):
        """
        Instantiate an 'AlignDlib' object.

        :param facePredictor: The path to dlib's shape predictor model file.
        :type facePredictor: str
        """
        assert facePredictor is not None

        #pylint: disable=no-member
        self.detector = dlib.get_frontal_face_detector()
        self.predictor = dlib.shape_predictor(facePredictor)

    def getAllFaceBoundingBoxes(self, rgbImg):
        """
        Find all face bounding boxes in an image.

        :param rgbImg: RGB image to process. Shape: (height, width, 3)
        :type rgbImg: numpy.ndarray
        :return: All face bounding boxes in an image.
        :rtype: dlib.rectangles
        """
        assert rgbImg is not None

        try:
            return self.detector(rgbImg, 1)
        except Exception as e: #pylint: disable=broad-except
            print("Warning: {}".format(e))
            # In rare cases, exceptions are thrown.
            return []

    def getLargestFaceBoundingBox(self, rgbImg, skipMulti=False):
        """
        Find the largest face bounding box in an image.

        :param rgbImg: RGB image to process. Shape: (height, width, 3)
        :type rgbImg: numpy.ndarray
        :param skipMulti: Skip image if more than one face detected.
        :type skipMulti: bool
        :return: The largest face bounding box in an image, or None.
        :rtype: dlib.rectangle
        """
        assert rgbImg is not None

        faces = self.getAllFaceBoundingBoxes(rgbImg)
        if (not skipMulti and len(faces) > 0) or len(faces) == 1:
            return max(faces, key=lambda rect: rect.width() * rect.height())
        else:
            return None

    def findLandmarks(self, rgbImg, bb):
        """
        Find the landmarks of a face.

        :param rgbImg: RGB image to process. Shape: (height, width, 3)
        :type rgbImg: numpy.ndarray
        :param bb: Bounding box around the face to find landmarks for.
        :type bb: dlib.rectangle
        :return: Detected landmark locations.
        :rtype: list of (x,y) tuples
        """
        assert rgbImg is not None
        assert bb is not None

        points = self.predictor(rgbImg, bb)
        #return list(map(lambda p: (p.x, p.y), points.parts()))
        return [(p.x, p.y) for p in points.parts()]

    #pylint: disable=dangerous-default-value
    def align_old(self, imgDim, rgbImg, bb=None,
              landmarks=None, landmarkIndices=INNER_EYES_AND_BOTTOM_LIP,
              skipMulti=False, scale=1.0):
        r"""align(imgDim, rgbImg, bb=None, landmarks=None, landmarkIndices=INNER_EYES_AND_BOTTOM_LIP)

        Transform and align a face in an image.

        :param imgDim: The edge length in pixels of the square the image is resized to.
        :type imgDim: int
        :param rgbImg: RGB image to process. Shape: (height, width, 3)
        :type rgbImg: numpy.ndarray
        :param bb: Bounding box around the face to align. \
                   Defaults to the largest face.
        :type bb: dlib.rectangle
        :param landmarks: Detected landmark locations. \
                          Landmarks found on `bb` if not provided.
        :type landmarks: list of (x,y) tuples
        :param landmarkIndices: The indices to transform to.
        :type landmarkIndices: list of ints
        :param skipMulti: Skip image if more than one face detected.
        :type skipMulti: bool
        :param scale: Scale image before cropping to the size given by imgDim.
        :type scale: float
        :return: The aligned RGB image. Shape: (imgDim, imgDim, 3)
        :rtype: numpy.ndarray
        """
        assert imgDim is not None
        assert rgbImg is not None
        assert landmarkIndices is not None

        if bb is None:
            bb = self.getLargestFaceBoundingBox(rgbImg, skipMulti)
            if bb is None:
                return

        if landmarks is None:
            landmarks = self.findLandmarks(rgbImg, bb)

        npLandmarks = np.float32(landmarks)
        npLandmarkIndices = np.array(landmarkIndices)

        #pylint: disable=maybe-no-member
        H = cv2.getAffineTransform(npLandmarks[npLandmarkIndices],
                                   imgDim * MINMAX_TEMPLATE[npLandmarkIndices]*scale + imgDim*(1-scale)/2)
        thumbnail = cv2.warpAffine(rgbImg, H, (imgDim, imgDim))
        
        return thumbnail

    #Here's the new method
    def align_one(self, imgDim, rgbImg, bb=None,
              landmarks=None, landmarkIndices=INNER_EYES_AND_BOTTOM_LIP,
              skipMulti=False, scale=1.0):
        assert imgDim is not None
        assert rgbImg is not None
        assert landmarkIndices is not None

        if bb is None:
            bb = self.getLargestFaceBoundingBox(rgbImg, skipMulti)
            if bb is None:
                return

        if landmarks is None:
            landmarks = self.findLandmarks(rgbImg, bb)

        npLandmarks = np.float32(landmarks)
        tplLandmarks = imgDim * MINMAX_TEMPLATE*scale + imgDim*(1-scale)/2
        tplLandmarks = np.transpose(tplLandmarks)
        npLandmarks = np.vstack( (np.transpose(npLandmarks), np.ones(tplLandmarks.shape[1])) )

        H = np.matmul(np.matmul(tplLandmarks, np.transpose(npLandmarks)), 
            np.linalg.inv(np.matmul(npLandmarks,np.transpose(npLandmarks))))
        thumbnail = cv2.warpAffine(rgbImg, H, (imgDim, imgDim))

        return thumbnail, bb

    #here's that same method applied to all bounding boxes it finds
    def align_many(self, imgDim, rgbImg, bb=None,
              landmarks=None, landmarkIndices=INNER_EYES_AND_BOTTOM_LIP,
              skipMulti=False, scale=1.0):
        assert imgDim is not None
        assert rgbImg is not None
        assert landmarkIndices is not None

        thumbnails = []
        bboxes = []
        bbs = self.getAllFaceBoundingBoxes(rgbImg)
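        if bbs is None:
            return thumbnails, bboxes

        for bb in bbs:
            # Find landmarks for each face separately; reusing `landmarks`
            # would align every face with the first face's landmarks.
            faceLandmarks = self.findLandmarks(rgbImg, bb)

            npLandmarks = np.float32(faceLandmarks)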
            tplLandmarks = imgDim * MINMAX_TEMPLATE*scale + imgDim*(1-scale)/2
            tplLandmarks = np.transpose(tplLandmarks)
            npLandmarks = np.vstack( (np.transpose(npLandmarks), np.ones(tplLandmarks.shape[1])) )

            H = np.matmul(np.matmul(tplLandmarks, np.transpose(npLandmarks)), 
                np.linalg.inv(np.matmul(npLandmarks,np.transpose(npLandmarks))))
            thumbnail = cv2.warpAffine(rgbImg, H, (imgDim, imgDim))

            thumbnails.append(thumbnail)
            bboxes.append(bb)
        return thumbnails, bboxes
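
The matrix algebra in align_one and align_many is a least-squares fit of a 2x3 affine transform over all 68 landmarks, rather than an exact fit through three of them as in align_old. Here is a minimal NumPy-only sketch of that fit, using a made-up transform and made-up points so you can check it recovers what went in. `fit_affine` is a hypothetical helper name, not part of openface or dlib.

```python
import numpy as np

def fit_affine(src_points, dst_points):
    """Least-squares 2x3 affine matrix H mapping src_points onto dst_points.

    Both inputs have shape (n, 2). This mirrors the normal-equation form
    used in align_one: H = T N^T (N N^T)^-1.
    """
    src = np.asarray(src_points, dtype=np.float64)
    dst = np.asarray(dst_points, dtype=np.float64)
    # Homogeneous source coordinates, shape (3, n).
    N = np.vstack((src.T, np.ones(src.shape[0])))
    # Target coordinates, shape (2, n).
    T = dst.T
    return T.dot(N.T).dot(np.linalg.inv(N.dot(N.T)))

# Made-up "landmarks", and the same points after a known affine map.
true_H = np.array([[0.5, 0.1, 10.0],
                   [-0.2, 0.8, 5.0]])
src = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0], [60.0, 40.0]])
dst = true_H.dot(np.vstack((src.T, np.ones(len(src))))).T

H = fit_affine(src, dst)
print(np.allclose(H, true_H))
```

With real landmarks the points never line up perfectly with the template, so the fit spreads the error across all of them instead of trusting just the eyes and lip.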
    

The second thing we need to do is build a scanner to capture the faces you actually want to classify. One thing to note is that with the Jetson, using the camera with OpenCV can be tricky. We need to make sure to open the video with this GStreamer pipeline string in our call to cv2.VideoCapture: "nvcamerasrc ! video/x-raw(memory:NVMM), width=(int)640, height=(int)480,format=(string)I420, framerate=(fraction)24/1 ! nvvidconv flip-method=2 ! video/x-raw, format=(string)BGRx ! videoconvert ! video/x-raw, format=(string)BGR ! appsink" Here's our script, called scan.py.


# scan.py
# -*- coding: UTF-8 -*-
#Usage: python scan.py --name YOUR_NAME
import cv2
import align_dlib as align
import os
import argparse

def fill(x, name):
    """
    build an image label such as 'Bruce_Lee_00042' by zero-padding
    the frame counter to five digits
    """
    return '{}_{:05d}'.format(name, x)

# create the output dir on the first run only
if not os.path.isdir('train_images'):
    os.mkdir('train_images')
path = '{}/train_images/'.format(os.getcwd())

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--name', type=str, help="Name of person being scanned", default='Bruce_Lee')
    args = parser.parse_args()
    face_detector = align.AlignDlib('{}/shape_predictor_68_face_landmarks.dat'.format(os.getcwd()))
    video_capture = cv2.VideoCapture("nvcamerasrc ! video/x-raw(memory:NVMM), width=(int)640, height=(int)480,format=(string)I420, framerate=(fraction)24/1 ! nvvidconv flip-method=2 ! video/x-raw, format=(string)BGRx ! videoconvert ! video/x-raw, format=(string)BGR ! appsink")
    # set the capture HxW to the same as openface by https://github.com/cmusatyalab/openface
    video_capture.set(3, 320)
    video_capture.set(4, 240)
    count = 0
    while True:
        ret, frame = video_capture.read()
        try:
            face_thumbnail, area = face_detector.align_one(96, frame)
            string = fill(count, args.name)
            print("face detection successful")
            cv2.imwrite(path+string+'.png', face_thumbnail)
            bl = (area.left(), area.bottom())
            tr = (area.right(), area.top())
            cv2.rectangle(frame, bl, tr, color=(153, 255, 204),
                          thickness=3)
            count+=1
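        except Exception:
            # align_one returns None when no face is found, so the unpacking
            # above raises; just show the frame without a box
            pass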
        cv2.imshow('', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
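
That pipeline string is easy to get wrong when you want a different resolution or frame rate, so here is a small sketch that builds it from parameters. The function name and its defaults are my own; the defaults reproduce the exact string used in scan.py.

```python
def jetson_camera_pipeline(width=640, height=480, fps=24, flip_method=2):
    """Build the nvcamerasrc GStreamer pipeline string for cv2.VideoCapture."""
    return (
        "nvcamerasrc ! "
        "video/x-raw(memory:NVMM), width=(int){w}, height=(int){h},"
        "format=(string)I420, framerate=(fraction){fps}/1 ! "
        "nvvidconv flip-method={flip} ! "
        "video/x-raw, format=(string)BGRx ! "
        "videoconvert ! "
        "video/x-raw, format=(string)BGR ! appsink"
    ).format(w=width, h=height, fps=fps, flip=flip_method)
```

You would then open the camera with `cv2.VideoCapture(jetson_camera_pipeline())`. Whether other sizes and rates work depends on what your camera module supports, which I haven't checked here.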
    

FaceNet Model Overview

Most open source facial recognition libraries like OpenFace, home_surveillance, facenet, etc. use a model similar to the one outlined in the FaceNet paper written by Florian Schroff, Dmitry Kalenichenko, and James Philbin. Here we're no different and will be using the model that David Sandberg's facenet uses.

The model takes an image of an individual's face and passes it through a network (the model David uses is a variant of Inception-ResNet). The goal is to make the network embed the image in a feature space so that the squared distance between images of the same identity is small and the squared distance between a pair of images from different identities is large. This is done using something called a triplet loss. It's probably the single most important feature of the model's structure.

Rather than break down the entire model, I just want to mention what made this model stick out for me:


# triplet loss
import numpy as np

def triplet_loss(image_embedding, true_image_embedding, false_image_embedding, alpha):
    # squared L2 distance from each anchor to its positive and negative example
    positive_distance = np.sum(np.square(image_embedding - true_image_embedding), axis=1)
    negative_distance = np.sum(np.square(image_embedding - false_image_embedding), axis=1)
    basic_loss = (positive_distance - negative_distance) + alpha
    # np.maximum is the element-wise hinge; np.max would treat 0.0 as an axis
    loss = np.mean(np.maximum(basic_loss, 0.0), axis=0)
    return loss
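
To get a feel for the numbers, here is a tiny worked example with made-up 2-D embeddings (real FaceNet embeddings are much longer). It restates the loss in a self-contained form so it can be run on its own; the values are purely illustrative.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha):
    """Mean hinge loss over a batch of (anchor, positive, negative) embeddings."""
    positive_distance = np.sum(np.square(anchor - positive), axis=1)
    negative_distance = np.sum(np.square(anchor - negative), axis=1)
    # Element-wise hinge: only triplets that violate the margin contribute.
    return np.mean(np.maximum(positive_distance - negative_distance + alpha, 0.0))

anchor = np.array([[0.0, 0.0], [0.0, 0.0]])
positive = np.array([[0.6, 0.0], [0.6, 0.0]])
# First negative is far away (no loss); second sits just beyond the positive.
negative = np.array([[2.0, 0.0], [0.7, 0.0]])

loss = triplet_loss(anchor, positive, negative, alpha=0.2)
```

The first triplet gives 0.36 - 4.0 + 0.2, which is negative and gets clipped to zero. The second gives 0.36 - 0.49 + 0.2 = 0.07, so only that triplet pushes on the network, and the batch mean is 0.035.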
    

The goal here is to promote an embedding scheme that enforces a margin between each pair of faces of one identity and the faces of all other identities.

In order to ensure that the network learns properly, triplets are selected in an online fashion: during the forward pass, negative samples are picked from the current minibatch. The authors note that selecting the hardest negatives (those closest to the anchor) can lead to bad local minima early in training and even a collapsed model, so they instead select "semi-hard" negatives: ones that are further from the anchor's embedding than the positive example, but are still meaningful because their squared distance is close to the anchor-positive distance. These negatives lie inside the margin alpha.
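
As a rough sketch of that selection rule for a single anchor: keep the candidates whose squared distance to the anchor is larger than the anchor-positive distance but still within the margin alpha of it. The function name and the toy vectors are assumptions for illustration; the paper does this per minibatch inside training.

```python
import numpy as np

def semi_hard_negatives(anchor, positive, candidates, alpha):
    """Indices of candidates that lie inside the margin for one anchor.

    A candidate n qualifies when d(a, p) < d(a, n) < d(a, p) + alpha,
    using squared Euclidean distance.
    """
    d_ap = np.sum(np.square(anchor - positive))
    d_an = np.sum(np.square(candidates - anchor), axis=1)
    mask = (d_an > d_ap) & (d_an < d_ap + alpha)
    return np.nonzero(mask)[0]

anchor = np.array([0.0, 0.0])
positive = np.array([0.6, 0.0])         # squared distance 0.36
candidates = np.array([[0.5, 0.0],      # 0.25: closer than the positive
                       [0.7, 0.0],      # 0.49: inside the margin
                       [2.0, 0.0]])     # 4.00: too easy, contributes no loss
picked = semi_hard_negatives(anchor, positive, candidates, alpha=0.2)
```

Only the middle candidate is picked here: it still produces a non-zero loss, but it is not so hard that it drags the embedding toward collapse.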

TL;DR: Read the paper. It's worth it.


Building the live detector

After you've scanned the faces you want to via the TX1's camera, we're going to want to put something together to actually classify faces. This script borrows concepts from openface's web-demo as well as facenet's validate_on_lfw.py.

So what's going on here? Welp, we take all the images in our training set from scan.py and use the facenet model to build a representation (embedding) of each one. After each image has been processed by the network, we train an SVM on those representations so that it learns to classify each person's processed images correctly. That trained SVM is then used to classify all the faces that the camera sees.


# -*- coding: UTF-8 -*-
# model.py
import cv2
from collections import defaultdict
import tensorflow as tf
import align_dlib as align
import os
import numpy as np
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC

def load_model(model_file):
    """Load facenet model"""
    saver = tf.train.import_meta_graph(os.path.expanduser(model_file+'.meta'))
    saver.restore(tf.get_default_session(), os.path.expanduser(model_file))

def get_data(path, sess, embeddings):
    """gather data from our training images and vectorize them"""
    image_list = sorted(os.listdir(path))
    X = []
    y = []
    person_dict = {}
    count = 0
    current = image_list[0].rsplit('_', 1)[0]
    for string in image_list:
        # split on the last underscore so names like 'Bruce_Lee' stay whole
        name = string.rsplit('_', 1)[0]
        if current!=name:
            count+=1
            current = name
        person_dict[count] = name
        thm = cv2.imread(path+'/'+string)
        feed_dict = { images_placeholder:np.expand_dims(thm,0), phase_train_placeholder:False }
        emb_array = sess.run(embeddings, feed_dict=feed_dict)
        X.append(emb_array)
        y.append(count)
    X = np.vstack(X)
    y = np.array(y)
    return X, y, person_dict

def train_SVM(path, sess, embeddings):
    """train an svm on the image data we've collected"""
    X, y, person_dict = get_data(path, sess, embeddings)
    print("+ Training SVM on {} labeled images.".format(X.shape[0]))
    param_grid = [
        {'C': [1, 10, 100, 1000],
         'kernel': ['linear']},
        {'C': [1, 10, 100, 1000],
         'gamma': [0.001, 0.0001],
         'kernel': ['rbf']}
    ]
    svm = GridSearchCV(SVC(C=1), param_grid, cv=5).fit(X, y)
    return svm, person_dict

def get_rep(img, sess, embeddings):
    """process an image via facenet and return the embedding array"""
    # feed the thumbnail as-is (BGR), matching how get_data reads the training images
    feed_dict = { images_placeholder:np.expand_dims(img,0), phase_train_placeholder:False }
    emb_array = sess.run(embeddings, feed_dict=feed_dict)
    return emb_array


def process_frame(frame, sess, embeddings, svm, person_dict):
    """process a frame and if faces are found, draw a rectangle on the image
       with the corresponding face
    """
    faces, bboxes = face_detector.align_many(96, frame)
    if faces:
        print('detection!')
        for face, bb in zip(faces, bboxes):
            rep = get_rep(face, sess, embeddings)
            print(rep)
            identity = svm.predict(rep)[0]
            name = person_dict[identity]

            bl = (bb.left(), bb.bottom())
            tr = (bb.right(), bb.top())
            cv2.rectangle(frame, bl, tr, color=(153, 255, 204),
                          thickness=3)
            cv2.putText(frame, name, (bb.left(), bb.top() - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, fontScale=0.75,
                        color=(152, 255, 204), thickness=2)

# without this, errors were being thrown and the machine was being killed
config = tf.ConfigProto()
config.gpu_options.allow_growth=True

# we use device = cpu because it's just as fast as the gpu here.
# we also want to avoid io transfer time to the gpu, considering we'll be
# processing images incredibly quickly
with tf.device('/cpu:0'):
    with tf.Graph().as_default():
        with tf.Session(config=config) as sess:
            # Load the model
            #TODO: change the model dir to whatever model dir/ model you download
            model_file = '20160514-234418/model.ckpt-500000'
            print('Loading model "%s"' % model_file)
            load_model(model_file)
            graph_def = tf.get_default_graph()
            # Get input and output tensors
            images_placeholder = tf.get_default_graph().get_tensor_by_name("input:0")
            phase_train_placeholder = tf.get_default_graph().get_tensor_by_name("phase_train:0")
            embeddings = tf.get_default_graph().get_tensor_by_name("embeddings:0")
            
            print("setting up training")
            training_path = 'train_images'
            svm, person_dict = train_SVM(training_path, sess, embeddings)
            
            print("setting up camera")
            video_capture = cv2.VideoCapture("nvcamerasrc ! video/x-raw(memory:NVMM), width=(int)640, height=(int)480,format=(string)I420, framerate=(fraction)24/1 ! nvvidconv flip-method=2 ! video/x-raw, format=(string)BGRx ! videoconvert ! video/x-raw, format=(string)BGR ! appsink")
            video_capture.set(3, 320)
            video_capture.set(4, 240)
            
            print("setting up detector")
            face_detector = align.AlignDlib('shape_predictor_68_face_landmarks.dat')
            
            while True:
                ret, frame = video_capture.read()
                # process_frame runs the detector itself, so don't call align_many here too
                process_frame(frame, sess, embeddings, svm, person_dict)
                cv2.imshow('', frame)
                # quit the program on the press of key 'q'
                if cv2.waitKey(1) & 0xFF == ord('q'):
                    break
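
One detail worth spelling out is how labels come from filenames. Here is a minimal, stand-alone sketch of that mapping, assuming the `Name_00042.png` pattern that scan.py writes. `labels_from_filenames` is a hypothetical helper, not part of model.py; it splits on the last underscore so that names containing underscores stay whole.

```python
import os

def labels_from_filenames(filenames):
    """Map 'Name_00042.png' style filenames to integer labels.

    Returns (labels, person_dict) where person_dict maps label -> name.
    Labels are assigned in sorted filename order.
    """
    labels = []
    person_dict = {}
    name_to_label = {}
    for filename in sorted(filenames):
        stem = os.path.splitext(filename)[0]
        # Everything before the final underscore is the person's name.
        name = stem.rsplit('_', 1)[0]
        if name not in name_to_label:
            name_to_label[name] = len(name_to_label)
            person_dict[name_to_label[name]] = name
        labels.append(name_to_label[name])
    return labels, person_dict
```

The integer labels are what the SVM trains on, and `person_dict` is what turns a prediction back into a name to draw on the frame.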
    

Boomshakalaka. And there you have it. A few simple scripts, and you have an embedded detector up and running.

Because everyone likes a demo in camera vertical (...a little lag due to tunneling X over ssh)