How it works
To perform video classification, we will be implementing a Convolutional Neural Network (CNN). A CNN is a type of neural network that is most often applied to image processing problems. Let’s first describe what a regular neural network is!
A regular neural network has an input layer, hidden layers and an output layer. The input layer accepts inputs in different forms, the hidden layers perform calculations on those inputs, and the output layer delivers the outcome of those calculations. Each of these layers contains neurons connected to the neurons in the previous layer, and each connection has its own weight. Because every neuron is connected to every neuron in the previous layer, the network makes no assumptions about the structure of the data being fed into it. That is usually fine, but not if you’re working with images or language.
CNNs work differently: they treat data as spatial. Instead of each neuron being connected to every neuron in the previous layer, neurons are connected only to neurons close to them, and the filter weights are shared across positions. The word convolutional refers to the filtering process that happens in this type of network. Think of it this way: an image is complex, and a CNN simplifies it so it can be better processed and understood. In terms of structure, a CNN is made up of multiple layers, but a couple of layers distinguish it from typical neural networks: the convolutional layer and the pooling layer. Below is an image of the model, which we will explain later; it already gives a good idea of what the different layers of a CNN look like.
The first step is to install the necessary libraries, and then set a random seed to ensure that our results are reproducible when we end up using our model.
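The seeding step might look something like the sketch below. The seed value and the exact set of RNGs seeded are assumptions, not the article's exact code:

```python
# Reproducibility setup -- a minimal sketch; the SEED value is
# hypothetical, any fixed integer works.
import random

import numpy as np

SEED = 27

random.seed(SEED)     # Python's built-in RNG
np.random.seed(SEED)  # NumPy RNG used when shuffling/splitting data
# When TensorFlow/Keras is installed, its RNG should be seeded too:
# import tensorflow as tf; tf.random.set_seed(SEED)
```

With all RNGs pinned, shuffles, weight initializations and splits come out the same on every run.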
It is important to mention that we will be using a dataset from the Center for Research in Computer Vision (CRCV) at the University of Central Florida (link in the references below) to train our CNN model. It includes videos of 50 types of human movement, which we downloaded directly from one of UCF’s webpages as seen below.
Movement Selection and Data Preprocessing
Before starting to preprocess our data, we picked 5 movements that our model will have to choose from when classifying the videos we test it with. Those movements are jumping jacks, lunges, tennis swings, push-ups and jumping rope. We also set several variables that will be used to standardize the videos we import.
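The standardization variables could be set up as below. The constant names, sizes and class strings are assumptions (the class strings follow UCF50's folder-name convention), not the article's exact values:

```python
# Hypothetical preprocessing constants illustrating the kind of
# standardization described above.
IMAGE_HEIGHT, IMAGE_WIDTH = 64, 64  # every frame is resized to this shape
SEQUENCE_LENGTH = 50                # frames sampled per video

# The five selected movements, written as UCF50-style class names.
CLASSES_LIST = ["JumpingJack", "Lunges", "TennisSwing", "PushUps", "JumpRope"]
```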
We then took hundreds of videos from the CRCV representing those specific movements to feed into our model. Using the cv2 package and the preset variables, we created a function called frames_extraction whose role is to import, resize and normalize the videos extracted from the UCF50 dataset. The output of that function is a list of normalized frames that we will later use to create our training dataset.
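The sampling-and-scaling core of such a helper can be sketched without cv2. Here the input is assumed to be an array of already-decoded frames; in the full frames_extraction function, the frames would come from cv2.VideoCapture and each one would be resized with cv2.resize first:

```python
import numpy as np

def normalize_frames(frames, sequence_length=50):
    """Core of a frames_extraction-style helper (illustrative sketch).

    `frames` is an array of decoded frames with shape
    (num_frames, height, width, 3) and uint8 pixels.
    """
    num_frames = len(frames)
    # Sample `sequence_length` evenly spaced frames across the video.
    skip = max(num_frames // sequence_length, 1)
    picked = [frames[i * skip] for i in range(min(sequence_length, num_frames))]
    # Scale pixel values from [0, 255] to [0, 1] for the network.
    return [frame.astype("float32") / 255.0 for frame in picked]
```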
The next step is to extract the data from UCF50 into an organized dataset that contains only the normalized frames of the five selected movement types, labeled with their class names. Furthermore, the number of frames for each class was capped at 3000 so that the data size is uniform across classes. Class names were also encoded for model development.
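This assembly step could look like the following sketch; the function and variable names are hypothetical. One-hot encoding is done here with a NumPy identity matrix, though Keras users would typically call keras.utils.to_categorical:

```python
import numpy as np

def build_dataset(frames_by_class, frames_per_class=3000):
    """Assemble balanced, labeled features (illustrative sketch).

    `frames_by_class` maps a class name to its list of normalized frames.
    Each class is capped at `frames_per_class` so class sizes are unified,
    and labels are integer-encoded then one-hot encoded.
    """
    class_names = sorted(frames_by_class)
    features, labels = [], []
    for index, name in enumerate(class_names):
        capped = frames_by_class[name][:frames_per_class]  # unify class size
        features.extend(capped)
        labels.extend([index] * len(capped))
    features = np.asarray(features)
    one_hot = np.eye(len(class_names))[labels]  # one row per sample
    return features, one_hot, class_names
```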
Once the dataset is ready for model development, it is split into training and testing sets so the model can be validated on never-before-seen videos.
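A minimal split could be done as below. The original most likely uses scikit-learn's train_test_split; this NumPy version just makes the shuffle-then-cut idea explicit, and the fraction and seed are assumptions:

```python
import numpy as np

def split_dataset(features, labels, test_fraction=0.25, seed=27):
    """Hold out a test set of never-before-seen examples (sketch)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))       # shuffle before splitting
    cut = int(len(features) * (1 - test_fraction))
    train_idx, test_idx = order[:cut], order[cut:]
    return (features[train_idx], labels[train_idx],
            features[test_idx], labels[test_idx])
```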
Training and Running the Model
As mentioned in the introduction, the model is built on a CNN. It includes 3 Conv2D layers with ReLU activation, 2 MaxPooling2D layers with a pool size of (3, 3), 2 batch normalization layers, 1 GlobalAveragePooling2D layer and 3 dense layers, including a final classification layer whose size matches the number of classes. The model is compiled with the Adam optimizer.
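A Keras sketch matching that layer inventory is shown below: three Conv2D layers with ReLU, two MaxPooling2D layers with (3, 3) pools, two BatchNormalization layers, one GlobalAveragePooling2D layer and three Dense layers. The filter counts, dense units and input size are assumptions, not the article's exact values:

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 5  # the five selected movements

model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),          # assumed frame size
    layers.Conv2D(16, (3, 3), activation="relu", padding="same"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(pool_size=(3, 3)),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(pool_size=(3, 3)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),  # one score per class
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Softmax in the final layer turns the scores into a probability per class, which is what the frame-averaging prediction step relies on.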
Using the preset training dataset and configuring the model with early stopping (a 2-minute limit), it took only 13 epochs for the model to reach a training accuracy of 94.87% and a validation accuracy of 97.87%. Evaluating the trained model on the testing dataset, it achieved an accuracy of 97.97%.
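The training call with an early-stopping callback might look like this sketch. The stand-in model, the random dummy data, and the callback settings (monitoring validation loss with a patience) are all illustrative assumptions so the snippet runs end to end:

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import EarlyStopping

# Tiny stand-in model and random data, purely so the example executes.
model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),
    layers.GlobalAveragePooling2D(),
    layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

x = np.random.rand(20, 64, 64, 3).astype("float32")
y = np.eye(5)[np.random.randint(0, 5, 20)]

# Stop training when validation loss stops improving.
early_stop = EarlyStopping(monitor="val_loss", patience=5,
                           restore_best_weights=True)
history = model.fit(x, y, epochs=2, validation_split=0.2,
                    callbacks=[early_stop], verbose=0)
```

The history object returned by fit() is what the loss/accuracy curves are plotted from.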
The loss and the accuracy were also plotted in the graph below.
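Such a plot can be produced with matplotlib along the lines of the sketch below; the history numbers here are placeholders standing in for the dictionary returned by model.fit():

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Placeholder values standing in for model.fit()'s history.history.
history = {
    "loss": [1.4, 0.9, 0.5, 0.3], "val_loss": [1.5, 1.0, 0.6, 0.4],
    "accuracy": [0.4, 0.6, 0.8, 0.9], "val_accuracy": [0.35, 0.55, 0.78, 0.88],
}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(history["loss"], label="train")
ax1.plot(history["val_loss"], label="validation")
ax1.set_title("Loss"); ax1.set_xlabel("epoch"); ax1.legend()
ax2.plot(history["accuracy"], label="train")
ax2.plot(history["val_accuracy"], label="validation")
ax2.set_title("Accuracy"); ax2.set_xlabel("epoch"); ax2.legend()
fig.savefig("training_curves.png")
```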
Now that we have a model fully trained on the desired classes, we wanted to try it out with our own videos for each of the chosen human activities.
The above function gives us a single prediction for an entire video. We take 50 frames from the video, which are then fed to our model. The function averages the predictions over the 50 frames and gives us the final activity class for the video. Preprocessing of the videos is done similarly to the process used when training the model. We first use the cv2 package to capture the video. The video is then split into 50 frames, and each one is resized and normalized before being fed into the model. The results are stored in a list, and at the end of the loop we take the average of the results to find out which class came out on top. This is then displayed as the predicted class for the video.
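The averaging step can be sketched in isolation. In the full pipeline each row of the input would come from model.predict() on one of the 50 resized, normalized frames grabbed with cv2; the function and class names here are hypothetical:

```python
import numpy as np

# Assumed UCF50-style names for the five selected movements.
CLASSES_LIST = ["JumpingJack", "Lunges", "TennisSwing", "PushUps", "JumpRope"]

def predict_video_class(frame_probabilities, class_names=CLASSES_LIST):
    """Average per-frame softmax outputs into one video-level prediction.

    `frame_probabilities` has shape (num_frames, num_classes), one row
    of class probabilities per sampled frame.
    """
    mean_probs = np.mean(frame_probabilities, axis=0)  # average over frames
    return class_names[int(np.argmax(mean_probs))]     # class that came out on top
```

Averaging before taking the argmax makes the prediction robust to a few ambiguous frames, since no single frame can override the majority.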
Here is the video we are trying to predict:
The function download_youtube_videos is used to import the testing video as seen below.
The above function will also set the video file path which would be used in the make_prediction function to get the results.
The prediction that we get is exactly what it is supposed to be, which confirms the strength of our model.
Recognizing human activities from videos can be challenging because of its complexity but can be very beneficial in many contexts. There are many applications of Human Activity Recognition (HAR) including:
- Video surveillance: HAR in this context allows users of a video surveillance system to evaluate and get a high-level interpretation of movement patterns, human actions and interactions with objects, so they can be notified of suspicious actions.
- Video Retrieval: with the large amount of videos on the web, being able to classify human actions can help search engines such as Google or Bing recognize videos based on keywords without the need for any captions.
- Robotics: robots trying to imitate human behaviors could also use HAR as a way to either learn basic everyday actions or more easily interact with live beings by recognizing patterns in movements.
To conclude, although video and image recognition are still in their early stages, these AI-powered technologies can help us analyze and detect patterns in behavior far more efficiently than doing it manually. With its applications being virtually endless, it is safe to say that this niche sector of computer science has a bright future ahead of it.