Deep Learning in the kitchen: development and validation of an action recognition system based on RGB-D sensors

This thesis work is part of a research project between the Laboratory and the Cast Alimenti cooking school. Cast Alimenti aims to obtain a product that improves teaching in its classrooms. The idea is to develop a system (hardware and software) that simplifies the process of writing recipes while the teacher gives a lecture in the kitchen. The project aims to transcribe into written language the recipe performed during a demonstration lesson, which also makes it possible to write up in Italian the lessons given by foreign teachers. This goal has been achieved through the use of an action recognition system.

Action recognition aims to interpret human actions through mathematical algorithms and is based on identifying and tracking the position of the human body over time. The topic is actively studied in various fields. For example, devices such as smart bands and cell phones use this technique to recognize whether a person is stationary, walking or running. Another example concerns security cameras and systems, in which recognizing actions that may be considered violent or dangerous allows the authorities to intervene quickly when necessary.

The project is divided into several phases. First, the cook’s activity must be recognized during the practical demonstration; then, based on the recognized actions, the writing of the recipe must be automated. This thesis work focuses on the first part of the project, in particular on the choice of the mathematical algorithm used to recognize actions.


To fully understand the difficulties of applying the system under development, it is necessary to describe the environment in which it will work. The scenario is an open kitchen in which the cook works behind a counter, stays in the same position for most of the time, and only his upper body is visible (Fig. 1). The cook interacts with the working tools and food only through the upper limbs, and there is no interaction with other subjects.

The kitchen changes as the recipe progresses, since work tools such as stoves, mixers, pots and pans and other utensils are added and removed as needed. The predominant colors of the scene are white (the cook’s uniform, the wall behind him, the cutting board) and steel (the worktop and equipment). This results in uneven lighting, reflections and shadows. The presence of machinery and heat sources generates both auditory and visual noise, the latter especially in the infrared spectrum. The working environment also presents other critical issues, such as water, steam, temperature changes, substances of various kinds such as oil and acids, and cleaning chemicals.

To obtain reliable data in this context it is necessary to use appropriate instrumentation, with specific technical characteristics.

Fig. 1 - Example of the "smart" kitchen used in this work at Cast Alimenti. The predominant colors are white and gray and the overall luminance is low. Due to the characteristics of the scenario, the instrumentation must be carefully selected.

Technology of choice for action recognition

Tracking is the first step in action recognition. The technology used for this project was carefully selected by evaluating the systems used commercially to track full-body movements, such as accelerometers and gyroscopes, RGB cameras and depth sensors (3D cameras). 

Wearable solutions involve the use of accelerometers coupled with gyroscopes, a technique adopted in almost all commercially available smart bands for sports performance assessment and sleep monitoring. However, using such wearable devices in this project would require the cook to wear sensors on the hands or wrists, and the data obtained from them would have to be synchronized with each other. Considering the scenario in which the cook operates, the sensors would also need to be waterproof and resistant to aggressive substances.

Optical instrumentation, such as RGB or 3D cameras positioned externally to the scene, makes it possible to keep the image sensors away from the chef’s workspace, thus avoiding exposure to the critical environmental conditions of the kitchen. A disadvantage of this solution is the huge number of images obtained and the related processing. However, the computational power of modern computer systems allows such cameras to be used in real-time systems.

Given the environmental conditions and the need to make the acquisition system easily accessible to non-expert users, we opted for the second solution and placed the cameras in front of the cook who performs the recipe, thus replicating the view of the students during the lectures.

Fig. 2 - Example taken from the experimental set-up. The two Kinect v2 cameras have been positioned in front of the kitchen counter, replicating the students' view.


An action is any movement made by the operator in which tools are used to obtain a certain result. Characteristics of an action are (i) the speed of execution and (ii) the space in which it takes place. Based on these variables, it is important to select an appropriate frame rate and sensor resolution.

Recognizing the performance of actions through a mathematical algorithm that analyzes images is not a computationally simple task, because the computational load increases proportionally to the number of frames per second and to the image resolution. It is therefore crucial to find the optimal configuration in order to maintain a good image quality, which is necessary to recognize the action, while still being able to use a consumer processing system.
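To give a sense of scale, the sketch below estimates the raw pixel throughput of a stream as width × height × frames per second. The Kinect v2 figures (512×424 depth and 1920×1080 color, both at 30 fps) are the sensor's nominal stream formats; the function name and the reduced-rate and reduced-resolution comparisons are illustrative assumptions, not part of the project's code.

```python
def pixels_per_second(width: int, height: int, fps: float) -> float:
    """Raw pixel throughput of a video stream (pixels to process per second)."""
    return width * height * fps

# Nominal Kinect v2 stream formats.
depth = pixels_per_second(512, 424, 30)
color = pixels_per_second(1920, 1080, 30)

# Hypothetical reduced configurations: halving the frame rate halves the
# load, while halving both image sides divides it by four.
depth_half_fps = pixels_per_second(512, 424, 15)
depth_half_res = pixels_per_second(256, 212, 30)

print(f"depth: {depth / 1e6:.1f} Mpx/s, color: {color / 1e6:.1f} Mpx/s")
print(f"color/depth load ratio: {color / depth:.1f}")
```

Raw pixel count is only a rough proxy for the real cost, which also depends on the model that processes each frame.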

Modern consumer computing systems (PCs) currently provide sufficient computational power to perform the necessary calculations by harnessing the power of parallel computing in GPUs (graphics processing units).


Several algorithms are available to analyze actions in real time given a temporally consistent set of image frames. They may be subdivided into two main categories:

  • Algorithms that analyze 3D images, such as images generated using depth cameras. This type of data removes all issues related to the color composition of the scene and subjects that may be blurred or that vary during the execution (Fig. 3 a); 
  • Algorithms that process Skeleton data, in which an artificial skeleton composed of keypoints corresponding to the fundamental joints of the body is computed by the network. The keypoints represent (x, y, z) positions of the body’s joints in the camera reference system  (Fig. 3 b). 

Moreover, by combining the two categories it is possible to obtain hybrid algorithms that analyze both types of data.
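As an illustration of the skeleton data format, the sketch below stores an action clip as a list of frames, each frame being a list of (x, y, z) joint positions in the camera reference system. It then re-expresses every joint relative to a chosen root joint, so that the clip no longer depends on where the cook stands. The 25-joint count matches what the Kinect v2 body tracker reports; the root-joint index, the helper name and the sample values are assumptions made for this example, and this centering is a common preprocessing step rather than one confirmed by this work.

```python
from typing import List, Tuple

Joint = Tuple[float, float, float]   # (x, y, z) in the camera reference system
Frame = List[Joint]                  # one skeleton: a list of joints
Clip = List[Frame]                   # an action clip: frames ordered in time

NUM_JOINTS = 25   # joints per body reported by the Kinect v2 tracker
ROOT = 0          # hypothetical index of the joint used as origin

def center_on_root(clip: Clip, root: int = ROOT) -> Clip:
    """Express every joint relative to the root joint of the same frame."""
    centered = []
    for frame in clip:
        rx, ry, rz = frame[root]
        centered.append([(x - rx, y - ry, z - rz) for (x, y, z) in frame])
    return centered

# Toy clip (made-up values): the whole skeleton shifts by 0.5 along x
# between the two frames, as if the cook had stepped sideways.
frame0 = [(0.1 * j, 1.0, 2.0) for j in range(NUM_JOINTS)]
frame1 = [(x + 0.5, y, z) for (x, y, z) in frame0]
clip = center_on_root([frame0, frame1])
```

After centering, the two toy frames coincide up to floating-point rounding, because the sideways shift is removed together with the root position.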

Among the broad set of available algorithms, two were selected for this work:

  1. HPM+TM: a supervised classification algorithm developed in MATLAB by the University of Western Australia. It was created specifically for action recognition and achieved the best performance on the 3D Action Pairs dataset, reaching an accuracy of 98%.
  2. indRNN: this model was developed as part of a collaboration between Australia’s University of Wollongong and the University of Electronic Science and Technology of China. Although the algorithm was not designed specifically for action recognition, it is applicable wherever features must be recognized over time. It is a supervised classification algorithm and obtained an accuracy of 88% on the NTU RGB+D dataset.

Fig. 3 - Example of data that may be processed by deep learning models. (a) Image frame in RGB on which skeletal data is drawn; (b) depth frame taken from Kinect v2 cameras.


The experimental campaign took place over two days at the Cast Alimenti cooking school. Two Kinect v2 cameras recorded Nicola Michieletto, the chef who has worked with our Laboratory since the beginning of the project, while he cooked lasagne. The entire preparation was repeated and filmed twice, with the aim of obtaining a larger and more representative dataset.

A first analysis made it evident that some actions were repeated much more often than others. This strong difference in the number of available samples led us to carry out some preliminary analyses to understand how categories with a small number of samples influence the accuracy of the adopted algorithm.
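The imbalance described above can be inspected with a simple per-class count. The following is a minimal sketch: the label list is made-up toy data (the real per-action counts are not reported here), and the `min_samples` threshold is a hypothetical parameter used only to show how poorly represented classes could be flagged.

```python
from collections import Counter

def split_by_support(labels, min_samples):
    """Return (kept, rare): classes with at least / fewer than min_samples."""
    counts = Counter(labels)
    kept = {c: n for c, n in counts.items() if n >= min_samples}
    rare = {c: n for c, n in counts.items() if n < min_samples}
    return kept, rare

# Toy labels, NOT the real dataset: one entry per annotated action sample.
labels = ["stirring"] * 40 + ["pouring"] * 25 + ["cutting"] * 3
kept, rare = split_by_support(labels, min_samples=10)
print("kept:", kept)
print("rare:", rare)
```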

Therefore, we selected only a sub-sample of the actions present in the dataset, namely:

  1. stirring: using a ladle the cook mixes the ingredients inside a pot or a bowl with circular movements;
  2. pouring: the cook takes an ingredient from one container and pours it inside another container;
  3. rolling out the pasta: a process in which pasta is made flat and thin using an inclined plane dough sheeter. The cook loads a thick sheet from the top of the machine and pulls out a thinner sheet from the bottom;
  4. cutting: the cook cuts a dish by means of a kitchen knife; the dish is held steady with the left hand and the knife is used with the right hand;
  5. placing the pasta: the cook takes the pasta sheet from the cloths on which it was left to rest and lays it inside the pan where he is assembling the lasagna;
  6. spreading: process by which the béchamel and Bolognese sauce are distributed in an even layer during the composition of the lasagna;
  7. sprinkling: the cook takes the Parmesan grated cheese and distributes it forming an even layer;
  8. blanching: a process in which the cook takes freshly rolled pasta from the counter and plunges it into a pot of salted water for a brief cooking time;
  9. straining the pasta: with the use of a perforated ladle the cook removes the pasta from the pot in which it was cooking and deposits it in a pan with water and ice;
  10. draining the pasta on cloth: the cook removes with his hands the pasta from the water and ice pan and lays it on a cloth in order to allow it to dry;
  11. folding the pasta: during the rolling-out process it is sometimes necessary to fold the pasta onto itself before passing it through the sheeter again, in order to obtain a more uniform pasta layer;
  12. turn on/off induction plate: the cook turns on or off a portable induction plate located on the work counter;
  13. catching: simple process where the cook grabs an object and moves it closer to the work point;
  14. moving pot: the cook moves the pot within the work space; in most cases this involves moving it to or from the induction plate.

Fig. 4 - Examples of the 14 actions selected for the work. (a) stirring; (b) pouring; (c) rolling out the pasta; (d) cutting; (e) placing the pasta; (f) spreading; (g) sprinkling; (h) blanching; (i) straining the pasta; (l) draining the pasta on cloth; (m) folding the pasta; (n) turn on/off induction plate; (o) catching; (p) moving pot.


We performed a detailed analysis to determine how the performance of the two algorithms changes with (i) a reduction in the number of classes and (ii) an increase in the number of samples per class. Although it is known in theory that deep learning algorithms improve as more data are used, and that a high number of mutually similar classes reduces the overall inference accuracy, with this test we wanted to quantify these effects.
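One way such a study could be set up is to build, for each configuration, a subset of the dataset restricted to a chosen set of classes and capped at a chosen number of samples per class. The sketch below illustrates this idea only; the helper name, the toy dataset and the grid values are assumptions, and the actual protocol used in this work may differ.

```python
import random
from collections import defaultdict

def subsample(samples, classes, per_class, seed=0):
    """Keep at most `per_class` randomly chosen samples for each listed class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in samples:
        if label in classes:
            by_class[label].append(item)
    subset = []
    for label in classes:
        items = by_class[label]
        chosen = rng.sample(items, min(per_class, len(items)))
        subset.extend((item, label) for item in chosen)
    return subset

# Toy dataset, NOT the real recordings: (clip id, action label) pairs.
toy_labels = ["stirring"] * 30 + ["pouring"] * 20 + ["cutting"] * 5
samples = [(f"clip{i}", lab) for i, lab in enumerate(toy_labels)]

# Hypothetical grid: vary the class set and the samples per class.
for classes in (["stirring", "pouring", "cutting"], ["stirring", "pouring"]):
    for per_class in (5, 20):
        subset = subsample(samples, classes, per_class)
        print(len(classes), "classes,", per_class, "per class ->", len(subset), "samples")
```

Each subset would then be used to train and evaluate a model, so that accuracy can be compared across configurations.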

In summary, the results for each algorithm are:

  • HPM+TM: this algorithm performs better when fewer classes are adopted and a high number of samples per class is used. Highest accuracy achieved: 54%
  • indRNN: this model performs better than HPM+TM and is more robust even when fewer samples per class are used. Moreover, no significant improvement is observed when the number of classes is reduced by more than half. Highest accuracy achieved: 85%

In addition, the resulting confusion matrices show that the “stirring” and “pouring” classes are the most critical: the highest number of false positives is obtained for the “stirring” class, while the highest number of false negatives is observed for the “pouring” class. Both are often due to the two classes being mistaken for each other. This reflects the fact that during the cooking procedure the chef often poured an ingredient while stirring the pot with the other hand, so the two actions frequently overlap. It would therefore be best to merge the two classes into one to account for this.
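The false positives and false negatives discussed above can be read directly off a confusion matrix, and merging two classes amounts to relabeling both before the matrix is built. The following is a minimal sketch on made-up predictions (not the results of this work); the function names and the merged label `stir_or_pour` are hypothetical.

```python
from collections import Counter

def confusion(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))

def false_pos_neg(cm, cls):
    """False positives and false negatives of one class."""
    fp = sum(n for (t, p), n in cm.items() if p == cls and t != cls)
    fn = sum(n for (t, p), n in cm.items() if t == cls and p != cls)
    return fp, fn

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def merge(labels, group, new_label):
    """Relabel every class in `group` as `new_label`."""
    return [new_label if lab in group else lab for lab in labels]

# Toy data, NOT the real results: pouring is often predicted as stirring.
y_true = ["stirring", "pouring", "pouring", "pouring", "cutting", "stirring"]
y_pred = ["stirring", "stirring", "stirring", "pouring", "cutting", "stirring"]

cm = confusion(y_true, y_pred)
stir_fp, stir_fn = false_pos_neg(cm, "stirring")
pour_fp, pour_fn = false_pos_neg(cm, "pouring")

group = {"stirring", "pouring"}
acc_before = accuracy(y_true, y_pred)
acc_after = accuracy(merge(y_true, group, "stir_or_pour"),
                     merge(y_pred, group, "stir_or_pour"))
```

In this toy case the only errors are pouring samples predicted as stirring, so merging the two classes removes them all; whether a comparable gain holds on the real data would need to be checked.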