{{DISPLAYTITLE:Computer Vision for hand gesture control of synthesizers (modulation of synthesizers)}}
 
Group members: Nikola Milanovski, Senn Loverix, Rares Ilie, Matus Sevcik, Gabriel Karpinsky
 
= Work done last week =
{| class="wikitable"
|+Sheet with work hours and tasks: https://docs.google.com/spreadsheets/d/1LjFgobNuYJQzRVwhrHQrzcVzEVyR9A_9NcWLjT7u2a0/edit?usp=sharing
|'''Name'''
|'''Total'''
|'''Breakdown'''
|-
|Nikola
|6h
|Meetings (1h), trying to set up a virtual machine for Ableton (3h), familiarizing with MIDI through Python (2h)
|-
|Senn
|11h
|Set up codebase for GUI optimization (4h), researching optimization techniques (2h), applying and testing optimization techniques (3h), meeting (1h), looking at Python to MIDI (1h)
|-
|Rares
|10.5h
|Researching ASL classification models (1h), data collection and preprocessing (2.5h), testing ASL classification (2h), refinements on previous hand-tracking volume control program (2.5h), research into additional possibilities of OpenCV and implementation of basic algorithms (finger counting) (2.5h)
|-
|Matus
|12.5h
|Meetings (0.5h), preparing and conducting interviews (4h), researching Ableton inputs and, in general, what output our software should produce to work with audio software (6h), understanding codebase and coding experiments (2h)
|-
|Gabriel
|16h
|Preparing interviews (2h), interview with Jakub K. (3h), GUI research (2h), modifying codebase to be reproducible (git, venv, .env setup, requirements.txt automation, etc.) (6h), GUI initial coding experiments (3h)
|}
 
= Problem statement and objectives =
Synthesizers have become essential instruments in the creation of modern music, allowing musicians to modulate and create sounds electronically. Traditionally, an analog synthesizer uses a keyboard to generate notes, and various knobs, buttons and sliders to manipulate the sound. However, through MIDI (Musical Instrument Digital Interface) a synthesizer can be controlled by an external device, usually also shaped like a keyboard, with the other controls made digital. This allows for a wide range of approaches to what kind of input device is used to manipulate the digital synthesizer. Although traditional keyboard MIDI controllers have been very successful, their form may restrict the expressiveness of musicians who seek to create more dynamic and unique sounds, and limit accessibility for people who struggle with the controls, for example due to a lack of keyboard-playing experience or a physical impairment.
 
During this project the aim is to design a new way of controlling a synthesizer using the motion of the user's hand. By moving their hand to a certain position in front of a suitable sensor system consisting of one or more cameras, the user can control various aspects of the produced sound, such as pitch or reverb. Computer vision techniques will be implemented in software to track the position of the user's hand and fingers. Different hand positions and orientations will be mapped to operations the synthesizer performs on the sound. Through MIDI, this information will be passed to synthesizer software to produce the electronic sound. We aim to allow users across the music industry to integrate this technology seamlessly, creating brand new sounds in an innovative, easy-to-control way that is more accessible than a traditional synthesizer.
 
= Users =
With this innovative way of producing music, the main targets for this technology are users in the music industry. Such users include performance artists and DJs, who can use this technology to enhance their live performances or sets. Visual artists and motion-based performers could integrate the technology into their choreography. Other users include music producers looking to create unique sounds or rhythms in their tracks. Content creators that use audio equipment, such as soundboards, to enhance their content could use the technology as a new way to seamlessly control the audio of their content.
 
This new way of controlling a synthesizer could also be a great way to introduce people to creating and producing electronic music. It would be especially useful for people with some form of physical impairment that may have restricted them from creating the music they wanted before.
 
= User requirements =
For the users mentioned above, we have set up a list of requirements we would expect the users to have for this new synthesizer technology. First of all, it should be easy to set up for performance artists and producers so they don’t spend too much time preparing right before their performance or set. Next, the technology should be easily accessible and easy to understand for all users, both people who have a lot of experience with electronic music and people who are relatively new to it.
 
Furthermore, the hand tracking should work in different environments. For example, a DJ who works in dimly lit clubs that use a lot of different lighting and visual effects during sets should still be able to rely on accurate hand tracking. It should also be easy to integrate the technology into the artist’s workflow: an artist should not have to change their entire routine of performing or producing music to use a motion-based synthesizer.
 
Lastly, the technology should allow for elaborate customization to fit each user’s needs. The user should be able to decide what attributes of the recognized hand gestures are important for their work, and which ones should be omitted. For example, if the vertical position of the hand regulates the pitch of the sound and the rotation of the hand the volume, the user should be able to ‘turn off’ the volume regulation so that nothing changes when they rotate their hand.
 
To get a better understanding of the user requirements, we are planning on interviewing some people in the music industry such as music producers, a DJ and an audiovisual producer. The interview questions are as follows:
 
'''Background and Experience:''' What tools or instruments do you currently use in your creative process? Have you previously incorporated technology into your performances or creations? If so, how?
 
'''Creative Process and Workflow:''' Can you describe your typical workflow? How do you integrate new tools or technologies into your practice? What challenges do you face when adopting new technologies in your work?
 
'''Interaction with Technology:''' Have you used motion-based controllers or gesture recognition systems in your performances or art? If yes, what was your experience?
 
How do you feel about using hand gestures to control audio or visual elements during a performance?
 
What features would you find most beneficial in a hand motion recognition controller for your work?
 
'''Feedback on Prototype:''' What specific functionalities or capabilities would you expect from such a device?
 
How important is the intuitiveness and learning curve of a new tool in your adoption decision?
 
'''Performance and Practical Considerations:''' In live performances, how crucial is the reliability of your equipment? What are your expectations regarding the responsiveness and accuracy of motion-based controllers?
 
How do you manage technical issues during a live performance?
 
How important are the design and aesthetics of the tools you use?
 
Do you have any ergonomic preferences or concerns when using new devices during performances?
 
What emerging technologies are you most excited about in your field?
 
=== Interview (Gabriel K.) with (Jakub K., Music group and club manager, techno producer and DJ) ===
 
==== Important takeaways: ====
 
# Artists, mainly DJs, have prepared tracks which are chopped up and ready to perform, and they mostly play with effects and parameter values. For this application, that means using it instead of a knob or a digital slider to give the artist more granular control over an effect or a sound. For example, if our controller were an app with a GUI that could be turned on and off at will during the performance, he thinks it could add "spice" to a performance.
# For controlling visuals during a live performance, he thinks it would be too difficult to use live, especially compared to the current methods. But he can imagine using it to control, for example, colour or some specific element.
# He says that many venues already have multiple cameras set up, which could be used to capture the gestures from multiple angles and at high resolutions.
# He can definitely imagine using it live if he wanted to.
# If it is used to modulate sound and not to be played like an instrument, delay isn't that much of a problem.
 
==== What tools or instruments do you currently use in your creative process? ====
For a live performance he uses a Pioneer 3000 player, which can send music to a Xone-92 or Pioneer V10 mixer (industry standard). From there it goes to the speakers.
 
Alternatively, instead of the Pioneer 3000, a person can have a laptop going into a less advanced mixing deck.
 
Our hand tracking solution would run on a laptop, acting either as a full input to a mixing deck or as an additional input to the Xone-92 on a separate channel.
 
On the mixer, he as a DJ mostly only uses EQ, faders and effects on a knob, leaving many channels open. Live (non-DJ) performances use a lot more of the knobs than DJs do.
 
==== Have you previously incorporated technology into your performances or creations? If so, how? ====
 
Yes: he and a colleague tried to add a drum machine to a live performance. They had an Ableton project with samples and virtual MIDI controllers to add a live element to their performance, but it was too cumbersome to add to his standard DJ workflow.
 
==== What challenges do you face when adopting new technologies in your work? ====
 
Practicality of hauling equipment and setting it up. It adds time to the setup before and the pack-up after, especially when he goes to a club, and he is worried about club-goers destroying the equipment. In his case even having a laptop is extra work: his current workflow has no laptop, he just brings USB sticks with prepared music and plays it live on the equipment that is already set up.
 
==== What features would you find most beneficial in a hand motion recognition controller for your work? ====
 
If it could learn specific gestures; this could be solved with a sign language library.
He likes the idea of assigning specific gestures in the GUI to be able to switch between different sound modulations. For example, show a thumbs up and then the movement of his fingers modulates pitch; then show a different gesture and his movements modulate something else.
 
==== What are your expectations regarding the responsiveness and accuracy of motion-based controllers? ====
 
Delay is OK if he learns it and expects it. If it's played like an instrument, delay is a problem, but if it's just modifying a sound without changing how rhythmic it is, delay is completely fine.
 
==== How important are the design and aesthetics of the tools you use? ====
 
Aesthetics don't matter unless it's a commercial product. If he is supposed to pay for it he expects nice visuals; otherwise, if it works, he doesn't care.
 
==== What emerging technologies are you most excited about in your field? ====
 
He says there is not that much new technology. There are some improvements in software, like AI tools that can separate music tracks, but otherwise it is a pretty figured-out, standardized industry.
 
=== Interview (Matus S.) with (Sami L., music producer) ===
 
==== Important takeaways: ====
 
# Might be too imprecise to work for music production.
# The difficulty of using the software (learning it, implementing it) should not outweigh the benefit you could gain from it; in this specific case he does not think he would incorporate it into his workflow, but he would be interested in trying it.
 
==== What tools or instruments do you currently use in your creative process? ====
 
For music production he mainly uses Ableton as his main editing software, uses libraries to find sounds, and sometimes third-party equalizers or MIDI controllers (couldn't give a name on the spot).
 
==== Have you previously incorporated technology into your performances or creations? If so, how? ====
 
No, doesn't do live performances.
 
==== What challenges do you face when adopting new technologies in your work? ====
 
Learning curves
Price
 
==== What features would you find most beneficial in a hand motion recognition controller for your work? ====
 
Being able to control EQs in his opened tracks, or control some third-party tools, changing modulation or the volume levels of sounds/loops.
 
==== What are your expectations regarding the responsiveness and accuracy of motion-based controllers? ====
 
He would not want delay if it were used like a virtual instrument, but if the tool is only used for EQ or for changing modulation or volume levels then it's fine. He would be a little skeptical of the accuracy of the gesture sensing.
 
==== How important are the design and aesthetics of the tools you use? ====
 
If it's not a commercial product he doesn't really care, but he would ideally avoid software or an interface that looks 20 years old.
 
==== What emerging technologies are you most excited about in your field? ====
 
Does not really track them.
 
=== Interview (Nikola M.) with (a) Louis P., DJ and producer, and (b) Samir S., producer and hyper-pop artist ===
 
==== Important takeaways: ====
 
# Could be an interesting tool for live performances, but impractical for music production due to the imprecise nature of the software compared to physical controllers
# Must have very low to no latency, to make sure there is no disconnect during live performances
# Must be easily compatible with a wide range of software
 
==== What tools or instruments do you currently use in your creative process? ====
a) MIDI duh, Ableton
 
b) DAW (Ableton Live 11) and various synthesizers (mostly Serum and Omnisphere)
 
==== Have you previously incorporated technology into your performances or creations? If so, how? ====
 
a) MIDI knobs
 
b) Not a whole lot, outside of what I already use to create. When I perform I usually just have a backing track and microphone.
 
==== What challenges do you face when adopting new technologies in your work? ====
 
a) Software compatibility
 
b) Lack of ease of use and lack of available tutorials
 
==== What features would you find most beneficial in a hand motion recognition controller for your work? ====
 
a) I'd like just an empty slot where you can choose between dry and wet 0-100%, and you let the producer choose which effect/sound to put on it, but probably some basic ones like delay or reverb would be a start.
 
b) Being able to modulate chosen MIDI parameters (pitch bend, mod wheel, or even mapping a parameter to a plugin’s parameter), by using a certain hand stroke (maybe horizontal movements can control one parameter and vertical movements another parameter?)
 
==== What are your expectations regarding the responsiveness and accuracy of motion-based controllers? ====
 
a) Instant recognition no lag
 
b) Should be responsive and fluid/continuous rather than discrete. There should be a sensitivity parameter to adjust how sensitive the controller is to hand actions, so that (for example) the pitch doesn’t bend if your hand moves slightly up/down while modulating a horizontal movement parameter
 
==== How important are the design and aesthetics of the tools you use? ====
 
a) Not that important, it's about the music, and I think any hand controller thing would look cool because it's technology.
 
b) I would say it’s fairly important.
 
==== What emerging technologies are you most excited about in your field? ====
 
a) 808s
 
b) I’m not sure unfortunately. I don’t keep up too much with emerging technologies in music production.
 
= State of the art =
 
===== Preface =====
There is no commercial product for this exact concept, and the state of the art for such a product comes almost entirely from the open-source community, so we instead looked into the state of the art of the individual components we would need in order to complete this product. Our inspiration for this project came from an Instagram video (https://www.instagram.com/p/DCwbZwczaER/) by a Taiwanese visual media artist. After some research we found more artists doing similar work, for example a British artist creating gloves for hand gesture control (https://mimugloves.com/#product-list). We also discovered a commercially sold product that uses a similar technology to recognize the hand movements and gestures of piano players (https://roli.com/eu/product/airwave-create). This product uses cameras and ultrasound to map the hand movements of a piano player while they perform, adding an additional layer of musical expression. However, this product is merely an accessory to an already existing instrument and not a fully fledged instrument in its own right.
 
===== Hardware =====
There are various kinds of cameras and sensors that can be used for hand tracking and gesture recognition. The main types used are RGB cameras, IR sensors and RGB-D cameras. Each type of camera/sensor has its own advantages and disadvantages regarding cost and resolution. RGB cameras provide high-resolution colour information and are cost efficient, but do not provide depth information. Infrared (IR) sensors can capture very detailed hand movements, but are sensitive to other sources of IR light. Depth cameras such as RGB-D cameras can be used to construct an accurate depth image of the hand, but cost more than an RGB camera and sometimes do not reach the same resolution. A further improvement would be to use multiple cameras (RGB/IR/depth) to create more robust tracking and gesture recognition in various environments. The disadvantage is that this increases both system complexity and cost, and it requires synchronization between sensors [17]. Another way of increasing tracking accuracy is to use mirrors, which has the advantage of reducing the cost of needing multiple cameras [18]. The camera type most commonly used in computer vision tasks for hand tracking and gesture recognition is the RGB-D camera [4],[5].
 
===== Software =====
The literature on software-based visual capture for hand-gesture recognition emphasizes OpenCV — an open-source library offering optimized algorithms for tasks such as object detection, motion tracking, and image processing [12] — as a cornerstone for real-time image analysis. In most cases studied, this is augmented by MediaPipe, a framework developed by Google that provides ready-to-use pipelines for multimedia processing, including advanced hand tracking and pose estimation [14]. Collectively, these works demonstrate that a typical pipeline involves detecting and segmenting the hand, extracting key features or keypoints, and classifying the gesture, often in real time [12],[15],[16]. By leveraging various tools and the aforementioned libraries, researchers achieve robust performance in varying environments, addressing issues such as changes in lighting or background noise [13]. Most of the papers suggest using Python due to its accessibility and ease of integration with other tools [12]–[16].
 
= Approach, milestones and deliverables =
* Market research interviews with musicians, music producers etc.
** Requirements for hardware
** Ease of use requirements
** Understanding of how to seamlessly integrate our product into a musician's workflow.
 
* Find software stack solutions
** Library for hand tracking
** Encoder to MIDI or another viable format.
** Synthesizer that can accept live inputs in the chosen encoding format.
** Audio output solution
* Find hardware solutions
** Camera/ visual input
*** Multiple cameras
*** IR depth tracking
*** Viability of a standard webcam (laptop or otherwise)
* MVP (Minimal viable product)
** Create a demonstration product proving the viability of the concept by modulating a single synthesizer using basic hand gestures and a laptop webcam or other easily accessible camera.
* Test with potential users and get feedback
* Refined final product
** Additional features
** Ease of use and integration improvements
** Testing on different hardware and software platforms
** Visual improvements to the software
** Potential support for more encoding formats or additional input methods other than hand tracking
 
=== Who is doing what? ===
Nikola - Interface with audio software and music solution
 
Senn - Hardware interface and hardware solutions
 
Gabriel, Rares, Matus - Software processing of input and producing output
 
* Gabriel: GUI and GPU acceleration with OpenGL
* Rares: Main hand tracking solutions
* Matus: Code-to-MIDI and Ableton integration
 
 
= The Design =
Our design consists of a software tool which acts as a digital MIDI controller. Using the position of the user’s hand as an input, the design can send out MIDI commands to music software or digital instruments. The design consists of three main parts: the GUI, the hand tracking and gesture recognition software, and the software responsible for sending out the MIDI commands. Through the GUI, the user can see how their hand is being tracked and is able to customize what sound parameters are controlled by what hand motions or gestures. The hand tracking software tracks the position of the hand given a camera input, and the gesture recognition software should be able to notice when the user performs a certain predetermined gesture. Lastly, there is the code responsible for sending out the correct MIDI commands in an adequate way. The design of each part is explained and motivated below.
 
== GUI ==
The GUI chosen for the design is a dashboard UI through which the user can determine their own style of sound control. The GUI was designed to be as user-friendly as possible and to contain extensive customization options for how the user wishes to control the sound parameters. The buttons and sliders integrated within the GUI were made to be as intuitive and easy to understand as possible. That way, if a music artist or producer were to use the design, they would not have to spend a long time familiarizing themselves with how to use it. The GUI used for our design is shown in figure ??.
[[File:GUI screen.png|center|thumb|540x540px|Figure ??: The GUI used in our design]]
The GUI shows the user the camera input the software is receiving, and how their hand is being tracked. The GUI also allows the user to choose what camera to use if their device were to have multiple camera inputs. Furthermore, there is also a “Detection Confidence” slider to change the detection confidence of the hand tracking code. If the user observes that their hand is not being tracked accurately enough, or if the hand tracking is too choppy, they can change the detection confidence. This can be convenient for using the design in different lighting conditions. Below the camera screen the user can select what hand gestures control what parameter of the sound. They can also select through what MIDI channel the corresponding commands are sent, and they can choose to activate or deactivate the control by ticking the active box. The hand gestures that can be used for control are shown in table 1. The sound parameters that can be controlled are shown in table 2.
{| class="wikitable"
|+Table 1
!Hand gestures
|-
|Bounding box
|-
|Pinch
|-
|Fist
|-
|Victory
|-
|Thumbs up
|}
{| class="wikitable"
|+Table 2
!Sound parameters
|-
|Volume
|-
|Octave
|-
|Modulation
|}
 
 
Further explanation of how the control works and why these hand gestures were chosen is given in the section on hand tracking and gesture recognition in the design. The GUI also allows the user to add and remove gestures. Newly added gestures appear below the already existing gesture settings. An example of a possible combination of gesture settings can be seen in figure ??. When clicking “Apply Gesture Settings”, the gestures the user has set as active will be used to control their corresponding sound parameters. If the user wishes to add or remove a gesture, change what gesture controls what parameter, or activate/deactivate a gesture, they have to click the “Apply Gesture Settings” button again for the new settings to be applied. To stop all MIDI commands from being sent, the user can click the “MIDI STOP” button.
[[File:Example gesture settings.png|center|thumb|732x732px|Figure ??: An example of possible gesture settings]]
 
== Hand tracking and gesture recognition ==
 
== Sending MIDI commands ==
For sending out the information on what sound parameters are to be controlled, MIDI was chosen. This is because MIDI is a universal language that can be used to control synthesizers, music software such as Ableton and VCV Rack, and other digital instruments. MIDI allows for sending simple control change commands. Such commands can be used to quickly control various sound parameters, which makes them ideal for a hand tracking controller.
 
= Other explored features =
 
=== Multiple camera inputs ===
To further improve hand tracking accuracy, the possibility of using multiple camera inputs at the same time was explored and tested. To allow data inputs from two cameras to be handled effectively, various software libraries were studied to find the most suitable one. The library that seemed to handle multiple camera inputs best was OpenCV. Using the OpenCV library, a program was created that could take the input of two cameras and output the captured images. The two cameras used to test the software were a laptop webcam and a phone camera that was wirelessly connected to the laptop as a webcam. The program seemed to run correctly, but there was a lot of delay between the two cameras. In an attempt to reduce this latency, the program was adjusted to have a lower buffer size for both camera inputs and a manually decreased resolution, but this did not seem to reduce the latency. An advantage of using OpenCV is that it allows for threading, so that the frames of both cameras can be retrieved in separate threads. Along with that, a technique called fps smoothing can be implemented to try and synchronize frames. However, even after implementing both threading and frame smoothing, the latency did not seem to reduce. The latency could be due to the wireless connection between the phone and the laptop, so a USB-connected webcam could possibly lead to better results.
 
Before continuing with multiple cameras, a test was carried out to see whether the hand tracking software works on multiple camera inputs. To do this, the code for handling multiple camera inputs was extended with the hand tracking software, with the camera inputs handled via threading. Upon first inspection, the hand tracking software did seem to work for multiple cameras: from each camera input, a completely separate hand model is created depending on the position of the camera. However, a big drawback of running the hand tracking algorithm on two camera inputs is that the fps of both camera inputs decreases. Even with fps smoothing the camera inputs suffered a very low fps, probably because running the hand tracking software on two camera inputs becomes computationally expensive. This very low fps results in the system not being very user-friendly when using multiple cameras. Therefore, we have decided to focus on creating a well-performing single-camera system before returning to multiple cameras.
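As an illustration of the dual-camera experiment described above, the sketch below reads two cameras in separate threads with a reduced buffer size and a lowered resolution. The camera indices (0 and 1) and the 640x480 resolution are assumptions, not the exact settings used in the project.
```python
# Hedged sketch: read two cameras in separate threads, display in the main thread.
import threading
import cv2

frames = {}                                      # latest frame per camera index
lock = threading.Lock()

def capture(index):
    cap = cv2.VideoCapture(index)
    cap.set(cv2.CAP_PROP_BUFFERSIZE, 1)          # keep only the newest frame
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)       # lower the resolution to
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)      # reduce per-frame latency
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        with lock:
            frames[index] = frame
    cap.release()

for i in (0, 1):                                 # assumed camera indices
    threading.Thread(target=capture, args=(i,), daemon=True).start()

while True:                                      # display in the main thread
    with lock:
        for i, frame in frames.items():
            cv2.imshow(f"camera {i}", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cv2.destroyAllWindows()
```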
 
= Code documentation =
 
=== Setup ===
 
1. Install Python 3.9-3.12
2. Clone the repository
```sh
git clone https://github.com/Gabriel-Karpinsky/Project-Robots-Everywhere-Group-15
```
3. Create and activate a Python virtual environment, then install the dependencies
```sh
python -m venv venv
pip install -r requirements.txt
```
4. Install loopMIDI: https://www.tobias-erichsen.de/software/loopmidi.html
5. Create a port named "Python to VCV"
6. Use any software of your choice that accepts MIDI input
 
=== GUI ===
After researching potential GUI solutions we narrowed it down to a few options: a native Python GUI implementation with PyQt6, or a JavaScript-based interface written in React/Angular, run either in a browser via a WebSocket or as a standalone app. The main considerations were looks, performance and ease of implementation. We ended up settling on PyQt6, as it is native to Python, allows for OpenGL hardware acceleration and is relatively easy to implement for our use case. An MVP prototype is not ready yet but is being worked on. We identified the most important features to be implemented first as: changing the input sensitivity of the hand tracking, changing the camera input channel in case multiple cameras are connected, and a start/stop button.
 
===== PyQt6 library =====
The code for the GUI was written in the file HandTrackingGUI.py. In order to create the GUI in Python, the PyQt6 library was used. This library contains Python bindings for Qt, a framework for creating graphical user interfaces. More information on the library and the documentation can be found here: https://pypi.org/project/PyQt6/.
 
The main GUI window is created by defining a class derived from '''QWidget''', called “HandTrackingGUI”. This class is responsible for setting up everything in the GUI window, including buttons and sliders. It is also responsible for creating the gesture setting rows where the user can select what gesture they want to use to control what sound parameter. This is done using the “GestureSettingRow” class. Each time the user adds a new gesture, a new row has to be created; this is handled by the “GestureSettingsWidget” class, which generates a new instance of the GestureSettingRow class every time a gesture is added. This allows for multiple different mappings from gesture to form of MIDI control.
 
While the GUI is running, there is also code that handles the hand tracking. To make sure the GUI does not freeze, this code runs simultaneously with the code that runs the GUI. This is achieved via multithreading, a technique that allows multiple parts of a program to run at the same time. The PyQt6 library supports multithreading through a class derived from '''QThread'''. This class is called “HandTrackingThread” and is responsible for running the hand tracking code.
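As a rough illustration of how these classes fit together, a structural sketch is given below. The widget layouts, signal names and the placeholder tracking loop are simplified assumptions, not the project's exact code.
```python
# Hedged structural sketch of the classes described above; layouts, signal
# names and the placeholder loop are assumptions.
import sys
from PyQt6.QtCore import QThread, pyqtSignal
from PyQt6.QtWidgets import (QApplication, QCheckBox, QComboBox, QHBoxLayout,
                             QPushButton, QVBoxLayout, QWidget)


class HandTrackingThread(QThread):
    """Runs the camera/hand-tracking loop so the GUI never freezes."""
    frame_ready = pyqtSignal(object)  # would carry processed frames to the GUI

    def run(self):
        # Placeholder loop; the real thread wraps the hand tracking code.
        while not self.isInterruptionRequested():
            self.msleep(16)  # roughly 60 iterations per second


class GestureSettingRow(QWidget):
    """One row mapping a hand gesture to a sound parameter plus an active flag."""
    def __init__(self):
        super().__init__()
        row = QHBoxLayout(self)
        self.gesture = QComboBox()
        self.gesture.addItems(["Bounding box", "Pinch", "Fist", "Victory", "Thumbs up"])
        self.parameter = QComboBox()
        self.parameter.addItems(["Volume", "Octave", "Modulation"])
        self.active = QCheckBox("Active")
        for widget in (self.gesture, self.parameter, self.active):
            row.addWidget(widget)


class GestureSettingsWidget(QWidget):
    """Holds the rows; a new GestureSettingRow is created for each added gesture."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self.rows_layout = QVBoxLayout(self)
        self.add_row()

    def add_row(self):
        row = GestureSettingRow()
        self.rows.append(row)
        self.rows_layout.addWidget(row)


class HandTrackingGUI(QWidget):
    """Main window: gesture settings, control buttons and the tracking thread."""
    def __init__(self):
        super().__init__()
        layout = QVBoxLayout(self)
        self.settings = GestureSettingsWidget()
        self.apply_button = QPushButton("Apply Gesture Settings")
        self.stop_button = QPushButton("MIDI STOP")
        for widget in (self.settings, self.apply_button, self.stop_button):
            layout.addWidget(widget)
        self.tracker = HandTrackingThread()
        self.tracker.start()  # hand tracking runs beside the GUI


if __name__ == "__main__":
    app = QApplication(sys.argv)
    gui = HandTrackingGUI()
    app.aboutToQuit.connect(gui.tracker.requestInterruption)
    gui.show()
    sys.exit(app.exec())
```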
 
The current GUI is made using a default layout and styling. This was done because the focus was mainly on creating a functional GUI, rather than making it visually appealing using stylistic features. However, there are plans to make further improvements to the GUI. The idea is to use a tool called Qt Widgets Designer (https://doc.qt.io/qt-6/qtdesigner-manual.html) to create a more visually appealing GUI. This tool is a drag-and-drop GUI designer for Qt. It can help simplify making a more customized and complicated GUI, as extensive coding would not be necessary.
 
=== Output to audio software ===
[[File:VCV setup.png|thumb|456x456px|Figure ??: The VCV Rack setup used to test MIDI commands generated using Python]]
We still have to decide what the product should be: whether we want it to act, in very simple terms, as a knob (or any number of knobs) on a mixing deck, whether we want it to run as an executable inside Ableton, or something else entirely. Based on this decision there are several options for what output the software should produce. From the interview with Jakub K. we learned that we could simply pass Ableton MIDI and, in some cases, an integer or some other metadata. Having it run as an executable inside Ableton would be much more difficult: we would have to do more research on how to make it interact correctly with Ableton, as we would still need to figure out how to change parameters from inside Ableton.
 
===== VCV Rack =====
Since we could not obtain a version of Ableton, we have decided to test how we can send MIDI commands via Python using VCV Rack. An overview of the VCV Rack setup used can be seen in figure ??. In order to play notes sent via MIDI commands, the setup needs a MIDI-CV module to convert MIDI signals to voltage signals. It also needs a MIDI-CC module to handle control changes. Along with that it needs an amplifier to regulate the volume and an oscillator to regulate the pitch of the sound. Finally it requires a module that can output the audio. The setup also contains a monitor to keep track of the MIDI messages being sent. The Python code was connected to VCV Rack using a custom MIDI output port created using loopMIDI. A Python script was created which could send different notes at a customizable volume using MIDI commands. All notes sent were successfully picked up and played by the VCV Rack setup.
 
===== Mido library =====
In order to send MIDI commands to VCV Rack, a library called mido was used (https://mido.readthedocs.io/en/stable/index.html). Its messages allow for sending specified notes at a certain volume (velocity). Before sending these messages, however, a MIDI output port must be selected. This can be done using the open_output() function, passing the name of the created loopMIDI port as the argument. After this is done, notes can be sent using the following command structure:
{| class="wikitable"
|+
!output_port.send(Message('note_on', note=..., velocity=..., time=...))
|}
where ‘note_on’ indicates that a note is to be played, note is a number representing the pitch, and velocity is a number representing how loud the note is played. The time attribute only carries timing metadata (it is used for delta times in MIDI files) and is ignored when the message is sent on a port, so it can simply be omitted; the note keeps playing until it is explicitly stopped. To stop the note, the same command can be given with ‘note_on’ changed to ‘note_off’.
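For illustration, a minimal sketch of sending a single note through the loopMIDI port is given below. The port name "Python to VCV" follows the setup above, but the exact name reported by mido may differ (check mido.get_output_names()).
```python
# Minimal sketch: play middle C for one second through the loopMIDI port.
import time
import mido

out = mido.open_output('Python to VCV')                    # assumed port name
out.send(mido.Message('note_on', note=60, velocity=100))   # note 60 = middle C
time.sleep(1.0)                                            # let the note sound
out.send(mido.Message('note_off', note=60))                # release the note
out.close()
```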
 
To change sound parameters such as volume, octave or modulation, a 'control_change' message can be sent. In order to send these messages, the following command structure can be used:
{| class="wikitable"
|+
!output_port.send(Message('control_change', control=... , channel=... ,value=...))
|}
where ‘control’ takes the control change number corresponding to the sound parameter that is to be modified. An overview of these control change numbers can be found here: https://midi.org/midi-1-0-control-change-messages. The ‘channel’ argument determines what MIDI channel the message is transmitted over. ‘value’ determines how much the chosen sound parameter is modified, ranging from 0 to 127. For example, to set the volume to 70%, ‘control’ should be 7 and ‘value’ should be around 89 (0.70 × 127 ≈ 89). To stop all notes, a control value of 123 can be used.
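A small sketch of such a control change message is given below, again assuming the "Python to VCV" loopMIDI port from the setup.
```python
# Sketch: set channel volume (CC 7) to roughly 70%, then send "all notes off" (CC 123).
import mido

out = mido.open_output('Python to VCV')            # assumed loopMIDI port name
value = round(0.70 * 127)                          # 70% of the 0-127 range = 89
out.send(mido.Message('control_change', control=7, channel=0, value=value))
out.send(mido.Message('control_change', control=123, channel=0, value=0))
out.close()
```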
 
===== MIDITransmitter class =====
In the hand tracking module (HandTrackingModule.py), a class called MIDITransmitter was created to send control change messages to the music software. When initialized, the class tries to establish a connection to a specified MIDI output port using the '''connect()''' function. In our case, this port is labelled “Python to VCV”. When all notes need to be turned off, the '''send_stop()''' function can be used. This function makes sure that notes on all channels are turned off.
 
The class also contains functions that handle a control change in volume, octave and modulation. These functions are called '''send_volume()''', '''send_octave()''' and '''send_modulation()''' respectively. There is also a function called '''send_cc()''', which can be used to make a control change to all aforementioned parameters. This function is used by the GUI to make control changes, passing the appropriate control value as an argument.
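A hedged sketch of what such a class could look like is shown below. The method names follow the text above, but the internals (and the CC number used for the octave control) are assumptions rather than the project's actual implementation.
```python
# Hedged sketch of the MIDITransmitter class described above; the internals and
# the CC number used for octave (CC 16) are assumptions.
import mido


class MIDITransmitter:
    def __init__(self, port_name="Python to VCV"):
        self.port_name = port_name
        self.port = None
        self.connect()

    def connect(self):
        """Try to open the named loopMIDI output port."""
        try:
            self.port = mido.open_output(self.port_name)
        except (OSError, IOError):
            self.port = None  # the GUI can then report that no port is available

    def send_cc(self, control, value, channel=0):
        """Send a generic control change, clamping the value to the 0-127 range."""
        if self.port is not None:
            value = max(0, min(127, int(value)))
            self.port.send(mido.Message('control_change', control=control,
                                        value=value, channel=channel))

    def send_volume(self, value, channel=0):
        self.send_cc(7, value, channel)    # CC 7 = channel volume

    def send_modulation(self, value, channel=0):
        self.send_cc(1, value, channel)    # CC 1 = modulation wheel

    def send_octave(self, value, channel=0):
        self.send_cc(16, value, channel)   # assumption: a general-purpose CC

    def send_stop(self):
        """Turn off all notes on every MIDI channel (CC 123)."""
        if self.port is not None:
            for channel in range(16):
                self.send_cc(123, 0, channel)
```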
 
=== Hand tracking and gesture recognition ===
We have implemented a real-time hand detection system that tracks the positions of 21 landmarks on a user's hand using the MediaPipe library. It captures live video input, processes each frame to detect the hand, and maps the 21 key points to create a skeletal representation of the hand. The program calculates the Euclidean distance between the tip of the thumb and the tip of the index finger. This distance is then mapped to a volume control range using the pycaw library. As the user moves their thumb and index finger closer together or farther apart, the system's volume is adjusted in real time.
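The sketch below illustrates the core of this thumb-index distance measurement with OpenCV and MediaPipe (landmark 4 is the thumb tip, landmark 8 the index fingertip). The distance-to-volume mapping range and the final volume hand-off (e.g. pycaw's SetMasterVolumeLevelScalar on Windows) are indicative assumptions, not the project's exact values.
```python
# Sketch of the thumb-index distance tracking; the 30-250 px mapping range and
# the volume hand-off are assumptions.
import math
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.7)
cap = cv2.VideoCapture(0)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        lm = results.multi_hand_landmarks[0].landmark
        h, w, _ = frame.shape
        thumb = (lm[4].x * w, lm[4].y * h)   # landmark 4: thumb tip
        index = (lm[8].x * w, lm[8].y * h)   # landmark 8: index fingertip
        dist = math.dist(thumb, index)       # Euclidean distance in pixels
        level = min(max((dist - 30) / (250 - 30), 0.0), 1.0)  # map to 0.0-1.0
        # Here the level would be passed to pycaw (Windows) or sent as a MIDI CC.
        cv2.putText(frame, f"volume: {level:.2f}", (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    cv2.imshow("hand tracking", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
```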
 
Additionally, we are in the process of implementing a program that is capable of recognizing and interpreting American Sign Language (ASL) hand signs in real time. It consists of two parts: data collection and testing. The data collection phase captures images of hand signs using a webcam, processes them to create a standardized dataset, and stores them for training a machine learning model. The testing phase uses a pre-trained model to classify hand signs in real time. By recognizing and distinguishing different hand signs, the program will be able to map each gesture to specific ways of manipulating sound, such as adjusting pitch, modulating filters, changing volume, etc. For example, a gesture for the letter "A" could activate a low-pass filter, while a gesture for "B" could increase the volume.
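To illustrate the last step, a small sketch of mapping classifier labels to MIDI control changes is given below, following the "A"/"B" example from the text. The classifier call is left abstract, and the port name and CC numbers (74 is commonly used for filter cutoff, 7 for channel volume) are placeholders/assumptions.
```python
# Sketch: dispatch classified ASL letters to MIDI control changes
# ("A" -> lower filter cutoff, "B" -> raise volume). CC numbers are assumptions.
import mido

out = mido.open_output('Python to VCV')  # assumed loopMIDI port name

GESTURE_ACTIONS = {
    "A": mido.Message('control_change', control=74, value=40),   # filter cutoff down
    "B": mido.Message('control_change', control=7, value=110),   # channel volume up
}

def on_gesture(label: str):
    """Call with the label predicted by the (pre-trained) ASL classifier."""
    message = GESTURE_ACTIONS.get(label)
    if message is not None:
        out.send(message)
```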
 
 
== Possible improvements and future work ==
 
= References =
[1] “A MIDI Controller based on Human Motion Capture (Institute of Visual Computing, Department of Computer Science, Bonn-Rhein-Sieg University of Applied Sciences),” ResearchGate. Accessed: Feb. 12, 2025. [Online]. Available: <nowiki>https://www.researchgate.net/publication/264562371_A_MIDI_Controller_based_on_Human_Motion_Capture_Institute_of_Visual_Computing_Department_of_Computer_Science_Bonn-Rhein-Sieg_University_of_Applied_Sciences</nowiki>
 
[2] M. Lim and N. Kotsani, “An Accessible, Browser-Based Gestural Controller for Web Audio, MIDI, and Open Sound Control,” ''Computer Music Journal'', vol. 47, no. 3, pp. 6–18, Sep. 2023, doi: 10.1162/COMJ_a_00693.
 
[3] M. Oudah, A. Al-Naji, and J. Chahl, “Hand Gesture Recognition Based on Computer Vision: A Review of Techniques,” ''J Imaging'', vol. 6, no. 8, p. 73, Jul. 2020, doi: 10.3390/jimaging608007
 
[4] A. Tagliasacchi, M. Schröder, A. Tkach, S. Bouaziz, M. Botsch, and M. Pauly, “Robust Articulated‐ICP for Real‐Time Hand Tracking,” Computer Graphics Forum, vol. 34, no. 5, pp. 101–114, Aug. 2015, doi: 10.1111/cgf.12700.
 
[5] A. Tkach, A. Tagliasacchi, E. Remelli, M. Pauly, and A. Fitzgibbon, “Online generative model personalization for hand tracking,” ACM Transactions on Graphics, vol. 36, no. 6, pp. 1–11, Nov. 2017, doi: 10.1145/3130800.3130830.
 
[6] T. Winkler, Composing Interactive Music: Techniques and Ideas Using Max. Cambridge, MA, USA: MIT Press, 2001.
 
[7] E. R. Miranda and M. M. Wanderley, New Digital Musical Instruments: Control and Interaction Beyond the Keyboard. Middleton, WI, USA: AR Editions, Inc., 2006.
 
[8] D. Hosken, An Introduction to Music Technology, 2nd ed. New York, NY, USA: Routledge, 2014. doi: 10.4324/9780203539149.
 
[9] P. D. Lehrman and T. Tully, "What is MIDI?," Medford, MA, USA: MMA, 2017.
 
[10] C. Dobrian and F. Bevilacqua, Gestural Control of Music Using the Vicon 8 Motion Capture System. UC Irvine: Integrated Composition, Improvisation, and Technology (ICIT), 2003.
 
[11] J. L. Hernandez-Rebollar, “Method and apparatus for translating hand gestures,” US7565295B1, Jul. 21, 2009 Accessed: Feb. 12, 2025. [Online]. Available: https://patents.google.com/patent/US7565295B1/en
 
[12] I. Culjak, D. Abram, T. Pribanic, H. Dzapo, and M. Cifrek, “A brief introduction to OpenCV,” in 2012 Proceedings of the 35th International Convention MIPRO, May 2012, pp. 1725–1730. Accessed: Feb. 12, 2025. [Online]. Available: https://ieeexplore.ieee.org/document/6240859/?arnumber=6240859
 
[13] K. V. Sainadh, K. Satwik, V. Ashrith, and D. K. Niranjan, “A Real-Time Human Computer Interaction Using Hand Gestures in OpenCV,” in IOT with Smart Systems, J. Choudrie, P. N. Mahalle, T. Perumal, and A. Joshi, Eds., Singapore: Springer Nature Singapore, 2023, pp. 271–282.
 
[14] V. Patil, S. Sutar, S. Ghadage, and S. Palkar, “Gesture Recognition for Media Interaction: A Streamlit Implementation with OpenCV and MediaPipe,” International Journal for Research in Applied Science & Engineering Technology (IJRASET), 2023.
 
[15] A. P. Ismail, F. A. A. Aziz, N. M. Kasim, and K. Daud, “Hand gesture recognition on python and opencv,” IOP Conf. Ser.: Mater. Sci. Eng., vol. 1045, no. 1, p. 012043, Feb. 2021, doi: 10.1088/1757-899X/1045/1/012043.
 
[16] R. Tharun and I. Lakshmi, “Robust Hand Gesture Recognition Based On Computer Vision,” in 2024 International Conference on Intelligent Systems for Cybersecurity (ISCS), May 2024, pp. 1–7. doi: 10.1109/ISCS61804.2024.10581250.
 
[17] E. Theodoridou ''et al.'', “Hand tracking and gesture recognition by multiple contactless sensors: a survey,” ''IEEE Transactions on Human-Machine Systems'', vol. 53, no. 1, pp. 35–43, Jul. 2022, doi: 10.1109/thms.2022.3188840.
 
[18] G. M. Lim, P. Jatesiktat, C. W. K. Kuah, and W. T. Ang, “Camera-based Hand Tracking using a Mirror-based Multi-view Setup,” ''IEEE Engineering in Medicine and Biology Society. Annual International Conference'', pp. 5789–5793, Jul. 2020, doi: 10.1109/embc44109.2020.9176728.
 
[19] P. Rahimian and J. K. Kearney, “Optimal camera placement for motion capture systems,” ''IEEE Transactions on Visualization and Computer Graphics'', vol. 23, no. 3, pp. 1209–1221, Dec. 2016, doi: 10.1109/tvcg.2016.2637334.
 
[20] R. Tchantchane, H. Zhou, S. Zhang, and G. Alici, “A Review of Hand Gesture Recognition Systems Based on Noninvasive Wearable Sensors,” ''Advanced Intelligent Systems'', vol. 5, no. 10, p. 2300207, 2023, doi: 10.1002/aisy.202300207.
 
[21] J. P. Sahoo, A. J. Prakash, P. Pławiak, and S. Samantray, “Real-Time Hand Gesture Recognition Using Fine-Tuned Convolutional Neural Network,” ''Sensors'', vol. 22, no. 3, p. 706, 2022, doi: 10.3390/s22030706.
 
[22] M. Cheng, Y. Zhang, and W. Zhang, “Application and Research of Machine Learning Algorithms in Personalized Piano Teaching System,” ''International Journal of High Speed Electronics and Systems'', 2024, doi: 10.1142/S0129156424400949.
 
[23] C. Rhodes, R. Allmendinger, and R. Climent, “New Interfaces and Approaches to Machine Learning When Classifying Gestures within Music,” ''Entropy'', vol. 22, 2020, doi: 10.3390/e22121384.
 
[24] S. Supriya and C. Manoharan, “Hand gesture recognition using multi-objective optimization-based segmentation technique,” ''Journal of Electrical Engineering'', vol. 21, pp. 133–145, 2024, doi: 10.59168/AEAY3121.
 
[25] G. Benitez-Garcia and H. Takahashi, “Multimodal Hand Gesture Recognition Using Automatic Depth and Optical Flow Estimation from RGB Videos,” 2024, doi: 10.3233/FAIA240397.
 
[26] E. Togootogtokh, T. Shih, W. G. C. W. Kumara, S. J. Wu, S. W. Sun, and H. H. Chang, “3D finger tracking and recognition image processing for real-time music playing with depth sensors,” ''Multimedia Tools and Applications'', vol. 77, 2018, doi: 10.1007/s11042-017-4784-9.
 
[27] B. Manaris, D. Johnson, and Y. Vassilandonakis, “Harmonic Navigator: A Gesture-Driven, Corpus-Based Approach to Music Analysis, Composition, and Performance,” ''AAAI Workshop - Technical Report'', vol. 9, pp. 67–74, 2013, doi: 10.1609/aiide.v9i5.12658.
 
[28] M. Velte, “A MIDI Controller based on Human Motion Capture (Institute of Visual Computing, Department of Computer Science, Bonn-Rhein-Sieg University of Applied Sciences),” 2012, doi: 10.13140/2.1.4438.3366.
 
[29] S. M. Dikshith, “AirCanvas using OpenCV and MediaPipe,” ''International Journal for Research in Applied Science and Engineering Technology'', vol. 13, pp. 14671-1473, 2025, doi: 10.22214/ijraset.2025.66601.
 
[30] S. Patel and R. Deepa, “Hand Gesture Recognition Used for Functioning System Using OpenCV,” pp. 3–10, 2023, doi: 10.4028/p-4589o3.
 

Latest revision as of 19:25, 7 April 2025


Group members: Nikola Milanovski, Senn Loverix, Rares Ilie, Matus Sevcik, Gabriel Karpinsky

Work done last week

Sheet with work ours and tasks: https://docs.google.com/spreadsheets/d/1LjFgobNuYJQzRVwhrHQrzcVzEVyR9A_9NcWLjT7u2a0/edit?usp=sharing
Name Total Breakdown
Nikola 6h Meetings (1h), Trying to set up virtual machine for Ableton (3h), Familiarizing with MIDI through python (2h)
Senn 11h Set up codebase for gui optimization (4h), researching optimization techniques (2h), Applying and testing optimization techniques (3h), meeting (1h), looking at python to MIDI (1h)
Rares 10.5h Researching ASL classification models (1h), Data collection and preprocessing (2.5h), Testing ASL classification (2h), Refinements on previous hand-tracking volume control program (2.5h), Research into additional possibilities of OpenCV and implementation of basic algorithms (finger counting) (2.5)
Matus 12.5h Meetings (0.5h), Preparing and conducting interviews (4h), Researching Ableton inputs, and in general what output should our software produce to work with audio software (6h), Understanding codebase and coding experiments (2h)
Gabriel 16h Preparing interviews (2h), Interview with Jakub K. (3h), GUI research (2h), modifiying codebase to be reproducable (git, venv, .env setup, package requirements.txt automation etc.) (6h), gui initial coding experiments (3h)

Problem statement and objectives

The synthesizer has become an essential instrument in the creation of modern day music. They allow musicians to modulate and create sounds electronically. Traditionally, an analog synthesizer utilizes a keyboard to generate notes, and different knobs, buttons and sliders to manipulate sound. However, through using MIDI (Music Instrument Digital Interface) the synthesizer can be controlled via an external device, usually also shaped like a keyboard, and the other controls are made digital. This allows for a wide range of approaches to what kind of input device is used to manipulate the digital synthesizer. Although traditional keyboard MIDI controllers have reached great success, its form may restrict expressiveness of musicians that seek to create more dynamic and unique sounds, as well as availability to people that struggle with the controls due to a lack of keyboard playing knowledge or a physical impairment for example.

During this project the aim is to design a new way of controlling a synthesizer using the motion of the users’ hand. By moving their hand to a certain position in front of a suitable sensor system which consists of one or more cameras, various aspects of the produced sound can be controlled, such as pitch or reverb. Computer vision techniques will be implemented in software to track the position of the users’ hand and fingers. Different orientations will be mapped to operations on the sound which the synthesizer will do. Through the use of MIDI, this information will be passed to a synthesizer software to produce the electronic sound. We aim to allow various users in the music industry to seamlessly implement this technology to create brand new sounds in an innovative, easy to control way to create these sounds in a more accessible way than through using a more traditional synthesizer.

Users

With this innovative way of producing music, the main targets for this technology are users in the music industry. Such users include performance artists and DJ’s, which can implement this technology to enhance their live performances or sets. Visual artists and motions based performers could integrate the technology within their choreography to  Other users include music producers looking to create unique sounds or rhythms in their tracks. Content creators that use audio equipment to enhance their content, such as soundboards, could use the technology as a new way to seamlessly control the audio of their content.

This new way of controlling a synthesizer could also be a great way to introduce people to creating and producing electronic music. It would be especially useful for people with some form of a physical impairment which could have restricted them from creating the music that they wanted before.

User requirements

For the users mentioned above, we have set up a list of requirements we would expect the users to have for this new synthesizer technology. First of all, it should be easy to set up for performance artists and producers so they don’t spend too much time preparing right before their performance or set. Next, the technology should be easily accessible and easy to understand for all users, both people that have a lot of experience with electronic music, and people that are relatively new to it.

Furthermore, the hand tracking should work in different environments. For example, a DJ that works in dimly lighted clubs who integrate a lot of different lighting and visual effects during their sets should still be able to rely on accurate hand tracking. There should also be the ability to easily integrate the technology into the artist’s workflow. An artist should not change their entire routines of performing or producing music if they want to use a motion based synthesizer.

Lastly, the technology should allow for elaborate customization to fit to each user’s needs. The user should be able to decide what attributes of the recognized hand gestures are important for their work, and which ones should be omitted. For example, if the vertical position of the hand regulates the pitch of the sound, and rotation of the hand the volume, the user should be able to ‘turn off’ the volume regulation so that if they rotate their hand nothing will change.

To get a better understanding of the user requirements, we are planning on interviewing some people in the music industry such as music producers, a DJ and an audiovisual producer. The interview questions are as follows:

Background and Experience: What tools or instruments do you currently use in your creative process? Have you previously incorporated technology into your performances or creations? If so, how?

Creative Process and Workflow: Can you describe your typical workflow? How do you integrate new tools or technologies into your practice? What challenges do you face when adopting new technologies in your work?

Interaction with Technology: Have you used motion-based controllers or gesture recognition systems in your performances or art? If yes, what was your experience?

How do you feel about using hand gestures to control audio or visual elements during a performance?

What features would you find most beneficial in a hand motion recognition controller for your work?

Feedback on Prototype: What specific functionalities or capabilities would you expect from such a device?

How important is the intuitiveness and learning curve of a new tool in your adoption decision?

Performance and Practical Considerations: In live performances, how crucial is the reliability of your equipment? What are your expectations regarding the responsiveness and accuracy of motion-based controllers?

How do you manage technical issues during a live performance?

How important are the design and aesthetics of the tools you use?

Do you have any ergonomic preferences or concerns when using new devices during performances?

What emerging technologies are you most excited about in your field?

Interview (Gabriel K.) with (Jakub K., Music group and club manager, techno producer and DJ)

Important takeaways:

  1. Artists, mainly DJs, have prepared tracks which are chopped up and ready to perform, and during a set they mostly play with effects and parameter values. For this application, that means using it instead of a knob or a digital slider to give the artist more granular control over an effect or a sound. For example, if our controller were an app with a GUI that could be turned on and off at will during the performance, he thinks it could add "spice" to a performance.
  2. For visual control during a live performance, he thinks it is too difficult to use live, especially compared to the current methods, but he can imagine using it to control, for example, colour or some specific element.
  3. He says that many venues already have multiple cameras set up, which could be used to capture the gestures from multiple angles and at high resolutions.
  4. He can definitely imagine using it live if he wanted to.
  5. If it is used to modulate sound and not to be played like an instrument, delay is not that much of a problem.

What tools or instruments do you currently use in your creative process?

For a live performance he uses a Pioneer 3000 player, which can send music to a Xone-92 or Pioneer V10 mixer (industry standard). From there it goes to the speakers.

Alternatively, instead of the Pioneer 3000, a person can have a laptop going into a less advanced mixing deck.

Our hand tracking solution could be a laptop acting as a full input to a mixing deck, or as an additional input to the Xone-92 on a separate channel.

On the mixer he, as a DJ, mostly only uses EQ, faders and effects on a knob, leaving many channels open. Live performances use a lot more of the knobs than DJs do.

Have you previously incorporated technology into your performances or creations? If so, how?

Yes, he has: he and a colleague tried to add a drum machine to a live performance. They had an Ableton project with samples and virtual MIDI controllers to add a live element to their performance, but it was too cumbersome to add to his standard DJ workflow.

What challenges do you face when adopting new technologies in your work?

Practicality of hauling equipment and setting it up. It adds time to set up beforehand and pack up afterwards, especially when he goes to a club, and he is worried about club-goers destroying the equipment. In his case even having a laptop is extra work: his current workflow has no laptop, he just has USB sticks with prepared music and plays it live on the already set-up equipment.

What features would you find most beneficial in a hand motion recognition controller for your work?

If it could learn specific gestures; this could be solved with a sign language library. He likes the idea of assigning specific gestures in the GUI to be able to switch between different sound modulations. For example, show a thumbs up and then the movement of his fingers modulates pitch; then a different gesture and his movements modulate something else.

What are your expectations regarding the responsiveness and accuracy of motion-based controllers?

Delay is OK if he learns it and expects it. If it is played like an instrument, delay is a problem, but if it is just modifying a sound without changing how rhythmic it is, delay is completely fine.

How important are the design and aesthetics of the tools you use?

Aesthetics don't matter unless it's a commercial product. If he is supposed to pay for it he expects nice visuals; otherwise, if it works, he doesn't care.

What emerging technologies are you most excited about in your field?

He says there is not that much new technology. There are some improvements in software, like AI tools that can separate music tracks, but otherwise it is a pretty figured-out, standardized industry.

Interview (Matus S.) with (Sami L., music producer)

Important takeaways:

  1. Might be too imprecise to work in music production.
  2. The difficulty of using the software (learning it, implementing it) should not outweigh the benefit you could gain from it; in this specific case he does not think he would incorporate it in his workflow, but he would be interested in trying it.

What tools or instruments do you currently use in your creative process?

For music production he mainly uses Ableton as his main editing software, uses libraries to find sounds, and sometimes third-party equalizers or MIDI controllers (couldn't give me a name on the spot).

Have you previously incorporated technology into your performances or creations? If so, how?

No, doesn't do live performances.

What challenges do you face when adopting new technologies in your work?

Learning curves and price.

What features would you find most beneficial in a hand motion recognition controller for your work?

Being able to control EQs in his opened tracks, or control some third-party tools, changing modulation or volume levels of sounds/loops.

What are your expectations regarding the responsiveness and accuracy of motion-based controllers?

He would not want delay if it were used like a virtual instrument, but if the tool is only used for EQ or for changing modulation or volume levels, then delay is fine. He would, however, be a little skeptical of the accuracy of the gesture sensing.

How important are the design and aesthetics of the tools you use?

If it's not commercial he doesn't really care, but ideally he would avoid software or an interface that looks twenty years old.

What emerging technologies are you most excited about in your field?

Does not really track them.

Interview (Nikola M.) with a) Louis P., DJ and Producer, and b) Samir S., Producer and Hyper-pop artist

Important takeaways:

  1. Could be an interesting tool for live performances, but impractical for music production due to the imprecise nature of the software in comparison to physical controllers.
  2. Must have very low to no latency, in order to avoid any disconnect during live performances.
  3. Must be easily compatible with a wide range of software.

What tools or instruments do you currently use in your creative process?

a) MIDI duh, Ableton

b) DAW (Ableton Live 11) and various synthesizers (mostly Serum and Omnisphere)

Have you previously incorporated technology into your performances or creations? If so, how?

a) MIDI knobs

b) Not a whole lot, outside of what I already use to create. When I perform I usually just have a backing track and microphone.

What challenges do you face when adopting new technologies in your work?

a) Software compatibility

b) Lack of ease of use and lack of available tutorials

What features would you find most beneficial in a hand motion recognition controller for your work?

a) I'd like just an empty slot where you can choose between dry and wet (0-100%) and let the producer choose which effect/sound to put on it, but probably some basic ones like delay or reverb would be a start.

b) Being able to modulate chosen MIDI parameters (pitch bend, mod wheel, or even mapping a parameter to a plugin’s parameter), by using a certain hand stroke (maybe horizontal movements can control one parameter and vertical movements another parameter?)

What are your expectations regarding the responsiveness and accuracy of motion-based controllers?

a) Instant recognition no lag

b) Should be responsive and fluid/continuous rather than discrete. There should be a sensitivity parameter to adjust how sensitive the controller is to hand actions, so that (for example) the pitch doesn’t bend if your hand moves slightly up/down while modulating a horizontal movement parameter

How important are the design and aesthetics of the tools you use?

a) Not that important, it's about the music, and I think any hand controller thing would look cool because it's technology.

b) I would say it’s fairly important.

What emerging technologies are you most excited about in your field?

a) 808s

b) I’m not sure unfortunately. I don’t keep up too much with emerging technologies in music production.

State of the art

Preface

As there exists no commercial product on this topic, and the state of the art for this exact product consists predominantly of the open-source community, we looked more into the state of the art of the individual components needed to complete this product. Our inspiration to create this project came from an Instagram video (https://www.instagram.com/p/DCwbZwczaER/) from a Taiwanese visual media artist. After a bit of research we found more artists doing similar work, for example a British artist creating gloves for hand gesture control (https://mimugloves.com/#product-list). We also discovered a commercially sold product that utilizes technology similar to our idea to recognize the hand movements and gestures of piano players (https://roli.com/eu/product/airwave-create). This product uses cameras and ultrasound to map the hand movements of a piano player while they perform, adding an additional layer of musical expression. However, this product is merely an accessory to an already existing instrument and not a fully fleshed out instrument in its own right.

Hardware

There are various kinds of cameras and sensors that can be used for hand tracking and gesture recognition. The main types used are RGB cameras, infrared (IR) sensors and RGB-D cameras. Each type of camera/sensor has its own advantages and disadvantages regarding cost and resolution. RGB cameras provide high-resolution colour information and are cost efficient, but do not provide depth information. IR sensors can capture very detailed hand movements, but are sensitive to other sources of IR light. Depth cameras such as RGB-D cameras can be used to construct an accurate depth image of the hand, but cost more than an RGB camera and sometimes do not reach the same resolution. A more accurate setup would use multiple cameras (RGB/IR/depth) to create more robust tracking and gesture recognition in various environments. The disadvantage of this is that it increases both system complexity and cost, and it requires synchronization between sensors [17]. Another way of increasing tracking accuracy is to use mirrors, which has the advantage of reducing the cost of needing multiple cameras [18]. The camera type most commonly used in computer vision tasks regarding hand tracking and gesture recognition is the RGB-D camera [4],[5].

Software

The literature on software-based visual capture for hand-gesture recognition emphasizes OpenCV — an open-source library offering optimized algorithms for tasks such as object detection, motion tracking, and image processing [12] — as a cornerstone for real-time image analysis. In most cases studied, this is augmented by MediaPipe, a framework developed by Google that provides ready-to-use pipelines for multimedia processing, including advanced hand tracking and pose estimation [14]. Collectively, these works demonstrate that a typical pipeline involves detecting and segmenting the hand, extracting key features or keypoints, and classifying the gesture, often in real time [12],[15],[16]. By leveraging various tools and the aforementioned libraries, researchers achieve robust performance in varying environments, addressing issues such as changes in lighting or background noise [13]. Most of the papers suggest using Python due to its accessibility and ease of integration with other tools [12]–[16].

Approach, milestones and deliverables

  • Market research interviews with musicians, music producers, etc.
    • Requirements for hardware
    • Ease-of-use requirements
    • Understanding of how to seamlessly integrate our product into a musician's workflow
  • Find software stack solutions
    • Library for hand tracking
    • Encoder to MIDI or another viable format
    • Synthesizer that can accept live inputs in the chosen encoding format
    • Audio output solution
  • Find hardware solutions
    • Camera/visual input
      • Multiple cameras
      • IR depth tracking
      • Viability of a standard webcam (laptop or otherwise)
  • MVP (minimum viable product)
    • Create a demonstration product proving the viability of the concept by modulating a single synthesizer using basic hand gestures and a laptop webcam or other easily accessible camera
  • Test with potential users and get feedback
  • Refined final product
    • Additional features
    • Ease-of-use and integration improvements
    • Testing on different hardware and software platforms
    • Visual improvements to the software
    • Potential support for more encoding formats or additional input methods other than hand tracking

Who is doing what?

Nikola - Interface with audio software and music solution

Senn - Hardware interface and hardware solutions

Gabriel, Rares, Matus - Software processing of input and producing output

  • Gabriel: GUI and GPU acceleration with OpenGL
  • Rares: Main hand tracking solutions
  • Matus: Code to MIDI and Ableton integration


The Design

Our design consists of a software tool which acts as a digital MIDI controller. Using the position of the user's hand as input, the design can send out MIDI commands to music software or digital instruments. The design consists of three main parts: the GUI, the hand tracking and gesture recognition software, and the software responsible for sending out the MIDI commands. Through the GUI, the user can see how their hand is being tracked and can customize which sound parameters are controlled by which hand motions or gestures. The hand tracking software tracks the position of the hand given a camera input, and the gesture recognition software should be able to notice when the user performs a certain predetermined gesture. Lastly, there is the code responsible for sending out the correct MIDI commands in an adequate way. The design of each part is explained and motivated below.

GUI

The GUI chosen for the design is a dashboard UI through which the user can determine their own style of sound control. The GUI was designed to be as user-friendly as possible and to contain extensive customization options for how the user wishes to control the sound parameters. The buttons and sliders integrated within the GUI were made to be as intuitive and easy to understand as possible. That way, if a music artist or producer were to use the design, they would not have to spend a long time familiarizing themselves with how to use it. The GUI used for our design is shown in figure ??.

Figure ??: The GUI used in our design

The GUI shows the user the camera input the software is receiving, and how their hand is being tracked. The GUI also allows the user to choose what camera to use if their device were to have multiple camera inputs. Furthermore, there is also a “Detection Confidence” slider to change the detection confidence of the hand tracking code. If the user observes that their hand is not being tracked accurately enough, or if the hand tracking is too choppy, they can change the detection confidence. This can be convenient for using the design in different lighting conditions. Below the camera screen the user can select what hand gestures control what parameter of the sound. They can also select through what MIDI channel the corresponding commands are sent, and they can choose to activate or deactivate the control by ticking the active box. The hand gestures that can be used for control are shown in table 1. The sound parameters that can be controlled are shown in table 2.

Table 1: Hand gestures
  • Bounding box
  • Pinch
  • Fist
  • Victory
  • Thumbs up

Table 2: Sound parameters
  • Volume
  • Octave
  • Modulation


Further explanation of how the control works and why these hand gestures were chosen is given in the section on hand tracking and gesture recognition in the design. The GUI also allows the user to add and remove gestures. Newly added gestures appear below the already existing gesture settings. An example of a possible combination of gesture settings can be seen in figure ??. When clicking "Apply Gesture Settings", the gestures the user has set as active will be used to control their corresponding sound parameters. If the user wishes to add or remove a gesture, change which gesture controls which parameter, or activate/deactivate a gesture, they have to click the "Apply Gesture Settings" button again for the new settings to be applied. To stop all MIDI commands from being sent, the user can click the "MIDI STOP" button.

Figure ??: An example of possible gesture settings

Hand tracking and gesture recognition

Sending MIDI commands

For sending out the information on what sound parameters are to be controlled, MIDI was chosen. This is because MIDI is a universal language that can be used to control synthesizers, music software including the likes of Ableton and VCVRack and other digital instruments. MIDI allows for sending simple control change commands. Such commands can be used to quickly control various sound parameters, which makes them ideal for being used in a hand tracking controller.

Other explored features

Multiple camera inputs

To further improve hand tracking accuracy, the possibility of using multiple camera inputs at the same time was explored and tested. To allow data inputs from two cameras to be handled effectively, various software libraries were studied to find the most suitable one. The library that seemed to handle multiple camera inputs best was OpenCV. Using the OpenCV library, a program was created that could take the input of two cameras and output the captured images. The two cameras used to test the software were a laptop webcam and a phone camera that was wirelessly connected to the laptop as a webcam. The program seemed to run correctly; however, there was a lot of delay between the two cameras. In an attempt to reduce this latency, the program was adjusted to use a lower buffer size for both camera inputs and to manually decrease the resolution, but this did not seem to reduce the latency. An advantage of OpenCV is that it allows for threading, so that the frames of both cameras can be retrieved in separate threads. Along with that, a technique called fps smoothing can be implemented to try to synchronize frames. However, even after implementing both threading and frame smoothing, the latency did not seem to reduce. The limitation could be caused by the wireless connection between the phone and the laptop, so a USB-connected webcam could possibly lead to better results.

Before continuing with multiple cameras, a test had to be carried out to see whether the hand tracking software works on multiple camera inputs. To do this, the code for handling multiple camera inputs was extended with the hand tracking software, with the camera inputs handled via threading. Upon first inspection, the hand tracking did seem to work for multiple cameras: from each camera input, a completely separate hand model is created depending on the position of the camera. However, a big drawback of using two camera inputs that each run the hand tracking algorithm is that the fps of both camera inputs decreases. Even when using fps smoothing, the camera inputs suffered from a very low fps. This is probably because running the hand tracking software for two camera inputs becomes computationally expensive. This very low fps makes the system not very user-friendly when using multiple cameras. Therefore, we have decided to focus on creating a well-performing single-camera system before returning to multiple cameras. A sketch of the threaded capture approach is shown below.
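The sketch below is a minimal illustration of the threaded two-camera capture described above, not the project's actual code. The camera indices, resolution and buffer-size setting are assumptions, and whether CAP_PROP_BUFFERSIZE is honoured depends on the camera backend.

```python
# Minimal sketch: one capture thread per camera, each keeping only its newest frame.
import threading
import cv2

class CameraThread(threading.Thread):
    def __init__(self, index):
        super().__init__(daemon=True)
        self.cap = cv2.VideoCapture(index)
        self.cap.set(cv2.CAP_PROP_BUFFERSIZE, 1)       # keep only the newest frame
        self.cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)    # manually lower the resolution
        self.cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
        self.frame = None
        self.running = True

    def run(self):
        while self.running:
            ok, frame = self.cap.read()
            if ok:
                self.frame = frame                     # latest frame for this camera

    def stop(self):
        self.running = False
        self.cap.release()

# Indices 0 and 1 stand in for the laptop webcam and the phone camera.
cams = [CameraThread(0), CameraThread(1)]
for cam in cams:
    cam.start()
try:
    while True:
        for i, cam in enumerate(cams):
            if cam.frame is not None:
                cv2.imshow(f"camera {i}", cam.frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
finally:
    for cam in cams:
        cam.stop()
    cv2.destroyAllWindows()
```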

Code documentation

Setup

1. Download Python 3.9-3.12
2. Clone the repository:
```sh
git clone https://github.com/Gabriel-Karpinsky/Project-Robots-Everywhere-Group-15
```
3. Create a Python virtual environment and install the requirements:
```sh
pip install -r requirements.txt
```
4. Install loopMIDI: https://www.tobias-erichsen.de/software/loopmidi.html
5. Create a port named "Python to VCV"
6. Use a software of your choice that accepts MIDI
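Note that step 3 only lists the dependency install; the virtual environment itself still has to be created and activated first. A typical sequence might look as follows (the environment name "venv" is an assumption, not prescribed by the project):

```sh
python -m venv venv
# Windows:
venv\Scripts\activate
# macOS / Linux:
source venv/bin/activate
pip install -r requirements.txt
```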

GUI

After researching potential GUI solutions, we narrowed it down to a few options: a native Python GUI implementation with PyQt6, or a JavaScript-based interface written in React/Angular, run either in a browser via a web socket or as a stand-alone app. The main considerations were looks, performance and ease of implementation. We ended up settling on PyQt6, as it is native to Python, allows for OpenGL hardware acceleration and is relatively easy to implement for our use case. An MVP prototype is not ready yet but is being worked on. We identified the most important features to be implemented first as: changing the input sensitivity of the hand tracking, changing the camera input channel in case multiple cameras are connected, and a start/stop button.

PyQt6 library

The code for the GUI was written in the file HandTrackingGUI.py. In order to create the GUI in Python, the PyQt6 library was used. This library contains Python bindings for Qt, a framework for creating graphical user interfaces. More information on the library and the documentation can be found here: https://pypi.org/project/PyQt6/.

The main GUI window is created by subclassing QWidget in a class called "HandTrackingGUI". This class is responsible for setting up everything in the GUI window, including buttons and sliders. The gesture setting rows, where the user selects which gesture controls which sound parameter, are created with the "GestureSettingRow" class. Each time the user adds a new gesture, a new row has to be created; this is handled by the GestureSettingsWidget class, which generates a new instance of the GestureSettingRow class for every added gesture. This allows for multiple different mappings from gesture to form of MIDI control.
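To make this structure concrete, the sketch below shows one way these classes could fit together in PyQt6. Only the class names (HandTrackingGUI, GestureSettingsWidget, GestureSettingRow) and the gesture/parameter lists come from this page; the widgets, layouts and method names are illustrative assumptions, not the actual HandTrackingGUI.py code.

```python
# Minimal sketch of the GUI class structure described above.
from PyQt6.QtWidgets import (QApplication, QWidget, QVBoxLayout, QHBoxLayout,
                             QComboBox, QCheckBox, QPushButton)

class GestureSettingRow(QWidget):
    """One row mapping a hand gesture to a sound parameter, with an 'Active' box."""
    def __init__(self, gestures, parameters):
        super().__init__()
        layout = QHBoxLayout(self)
        self.gesture_box = QComboBox()
        self.gesture_box.addItems(gestures)
        self.parameter_box = QComboBox()
        self.parameter_box.addItems(parameters)
        self.active_box = QCheckBox("Active")
        for widget in (self.gesture_box, self.parameter_box, self.active_box):
            layout.addWidget(widget)

class GestureSettingsWidget(QWidget):
    """Holds all rows; a new GestureSettingRow is created for every added gesture."""
    def __init__(self, gestures, parameters):
        super().__init__()
        self._gestures, self._parameters = gestures, parameters
        self._layout = QVBoxLayout(self)
        self.rows = []

    def add_row(self):
        row = GestureSettingRow(self._gestures, self._parameters)
        self._layout.addWidget(row)
        self.rows.append(row)

class HandTrackingGUI(QWidget):
    """Main window: gesture settings plus an 'Add Gesture' button
    (camera view, sliders and MIDI controls are omitted in this sketch)."""
    def __init__(self):
        super().__init__()
        layout = QVBoxLayout(self)
        self.settings = GestureSettingsWidget(
            ["Bounding box", "Pinch", "Fist", "Victory", "Thumbs up"],
            ["Volume", "Octave", "Modulation"])
        add_button = QPushButton("Add Gesture")
        add_button.clicked.connect(self.settings.add_row)
        layout.addWidget(self.settings)
        layout.addWidget(add_button)

if __name__ == "__main__":
    app = QApplication([])
    gui = HandTrackingGUI()
    gui.show()
    app.exec()
```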

While the GUI is running, the hand tracking code must run as well. To make sure the GUI does not freeze, this code is run simultaneously with the GUI code. This is achieved via multithreading, a technique that lets multiple parts of a program run at the same time. The PyQt6 library supports this through QThread: a class called "HandTrackingThread" subclasses QThread and is responsible for running the hand tracking code.
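A minimal sketch of how such a worker thread could be structured with PyQt6's QThread is shown below. The class name HandTrackingThread comes from this page; the signal, loop body and timing are assumptions used only to illustrate the pattern of handing results back to the GUI thread.

```python
# Minimal sketch: a QThread subclass that would run the hand tracking loop.
from PyQt6.QtCore import QThread, pyqtSignal

class HandTrackingThread(QThread):
    landmarks_ready = pyqtSignal(object)   # emitted once per processed frame

    def __init__(self, camera_index=0):
        super().__init__()
        self._camera_index = camera_index
        self._running = True

    def run(self):
        # In the real module this loop would grab frames with OpenCV and run
        # MediaPipe hand tracking; here the detection step is a placeholder.
        while self._running:
            landmarks = None                     # placeholder for detection results
            self.landmarks_ready.emit(landmarks) # hand results back to the GUI thread
            self.msleep(33)                      # roughly 30 fps

    def stop(self):
        self._running = False
        self.wait()
```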

The current GUI is made using a default layout and styling. This was done because the focus was mainly on creating a functional GUI, rather than making it visually appealing using stylistic features. However, there are plans to make further improvements to the GUI. The idea is to use a tool called Qt Widgets Designer (https://doc.qt.io/qt-6/qtdesigner-manual.html) to create a more visually appealing GUI. This tool is a drag and drop GUI designer for PyQt. It can help simplify making a more customized and complicated GUI, as extensive coding would not be necessary.

Output to audio software

Figure ??: The VCV Rack setup used to test MIDI commands generated using Python

We still have to decide what form the product should take: whether we want it to be, in very simple terms, a knob (or any number of knobs) on a mixing deck, or whether we want it to run as an executable inside Ableton, or something else entirely. Based on this decision there are several options for what output the software should produce. From the interview with Jakub K. we learned that we could simply pass Ableton MIDI and, in some cases, an int or other metadata. Having it as an executable would be much more difficult: we would have to do more research on how to make it correctly interact with Ableton, as we would still need to figure out how to change parameters from inside Ableton.

VCV Rack

Since we could not obtain a version of Ableton, we have decided to test how we can send MIDI commands via Python using VCV Rack. An overview of the VCV Rack setup used can be seen in figure ??. In order to play notes sent via MIDI commands, the setup needs a MIDI-CV module to convert MIDI signals to voltage signals. It also needs a MIDI-CC module to handle control changes. Along with that it needs an amplifier to regulate the volume and an oscillator to regulate the pitch of the sound. Finally it requires a module that can output the audio. The setup also contains a monitor to keep track of the MIDI messages being sent. The Python code was connected to VCV Rack using a custom MIDI output port created using loopMIDI. A Python script was created which could send different notes at a customizable volume using MIDI commands. All notes sent were successfully picked up and played by the VCV Rack setup.

Mido library

In order to send MIDI commands to VCV rack, a library called mido was used (https://mido.readthedocs.io/en/stable/index.html). These messages allow for sending specified notes at a certain volume. Before sending these messages however, a MIDI output port must be selected. This can be done using the open_output() command, and passing the name of the created loopMIDI port as the argument. After this is done, notes can be sent using the following command structure:

output_port.send(Message('note_on', note=..., velocity=..., time=...))

where ‘note_on’ indicates that a note is to be played, note is a number that represents the pitch and velocity is a number that represents how hard the note is struck (effectively its volume). The time attribute is only used as a delta time when working with MIDI files and is ignored when sending messages over a port, so it can simply be omitted. To stop the note from being played, the same command can be given with ‘note_on’ changed to ‘note_off’.
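As a minimal, self-contained illustration of this command structure (assuming the "Python to VCV" loopMIDI port from the setup section already exists):

```python
# Minimal sketch: play middle C for about one second over the loopMIDI port.
import time
import mido
from mido import Message

output_port = mido.open_output("Python to VCV")               # loopMIDI port from setup
output_port.send(Message('note_on', note=60, velocity=100))   # note 60 = middle C
time.sleep(1.0)                                               # let the note ring
output_port.send(Message('note_off', note=60))                # release the note
output_port.close()
```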

To change sound parameters such as volume, octave or modulation, a 'control_change' message can be sent. In order to send these messages, the following command structure can be used:

output_port.send(Message('control_change', control=... , channel=... ,value=...))

where ‘control’ takes the control change number corresponding to the sound parameter that is to be modified. An overview of these control change numbers can be found here: https://midi.org/midi-1-0-control-change-messages. The ‘channel’ argument determines which MIDI channel the message is transmitted over, and ‘value’ determines how much the chosen sound parameter is modified, ranging from 0 to 127. For example, if the volume should be set to 70%, ‘control’ should be 7 and ‘value’ should be around 89. In order to stop all notes being sent, a control number of 123 ("all notes off") can be used.
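A small sketch of the two control changes mentioned above, wrapped in helper functions for readability (the function names are illustrative and not part of the project code):

```python
# Minimal sketch: CC 7 (channel volume) and CC 123 ("all notes off").
from mido import Message

def set_volume(output_port, percent, channel=0):
    # 70 % maps to round(0.7 * 127) = 89, matching the example above.
    value = round(percent / 100 * 127)
    output_port.send(Message('control_change', control=7, channel=channel, value=value))

def all_notes_off(output_port, channel=0):
    output_port.send(Message('control_change', control=123, channel=channel, value=0))
```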

MIDITransmitter class

In the hand tracking module (HandTrackingModule.py), a class called MIDITransmitter was created to send control change messages to the music software. When initialized, the class tries to establish a connection to a specified MIDI output port using the connect() function; in our case this port is labelled "Python to VCV". When all notes need to be turned off, the send_stop() function can be called. This function makes sure that notes on all channels are turned off.

The class also contains functions that handle a control change in volume, octave and modulation, called send_volume(), send_octave() and send_modulation() respectively. There is also a function called send_cc(), which can be used to send a control change for any of the aforementioned parameters. This function is used by the GUI to make control changes, passing the appropriate control value as an argument.
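The actual implementation lives in HandTrackingModule.py; the sketch below only illustrates what such a class could look like. The class name, method names and the "Python to VCV" port come from this page, while the concrete CC numbers are assumptions: CC 7 and CC 1 are the standard channel-volume and modulation-wheel controllers, and there is no standard CC for octave.

```python
# Minimal sketch (assumed internals) of a MIDITransmitter as described above.
import mido
from mido import Message

class MIDITransmitter:
    """Sends MIDI control change messages to the configured output port."""

    def __init__(self, port_name="Python to VCV"):
        self.port_name = port_name
        self.port = None
        self.connect()

    def connect(self):
        # Opens the loopMIDI port created during setup.
        self.port = mido.open_output(self.port_name)

    def send_cc(self, control, value, channel=0):
        self.port.send(Message('control_change', control=control,
                               channel=channel, value=value))

    def send_volume(self, value, channel=0):
        self.send_cc(7, value, channel)    # CC 7: channel volume

    def send_modulation(self, value, channel=0):
        self.send_cc(1, value, channel)    # CC 1: modulation wheel

    def send_octave(self, value, channel=0):
        self.send_cc(16, value, channel)   # assumption: a general-purpose CC

    def send_stop(self):
        # CC 123 ("all notes off") on every MIDI channel
        for channel in range(16):
            self.send_cc(123, 0, channel)
```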

Hand tracking and gesture recognition

We have implemented a real-time hand detection system that tracks the position of 21 landmarks on a user's hand using the MediaPipe library. It captures live video input, processes each frame to detect the hand, and maps the 21 key points to create a skeletal representation of the hand. The program calculates the Euclidean distance between the tip of the thumb and the tip of the index finger. This distance is then mapped to a volume control range using the pycaw library. As the user moves their thumb and index finger closer or farther apart, the system's volume is adjusted in real time.
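A minimal sketch of the thumb-to-index distance measurement described above (not the project's actual module): it uses MediaPipe's hand landmarks 4 (thumb tip) and 8 (index tip) and prints the normalized distance, which in the real program is mapped to a volume range via pycaw.

```python
# Minimal sketch: thumb-to-index distance with MediaPipe Hands and OpenCV.
import math
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def thumb_index_distance(frame_bgr, hands):
    """Return the normalized thumb-tip to index-tip distance, or None if no hand."""
    results = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_hand_landmarks:
        return None
    lm = results.multi_hand_landmarks[0].landmark
    thumb, index = lm[4], lm[8]            # landmark 4: thumb tip, 8: index tip
    return math.hypot(thumb.x - index.x, thumb.y - index.y)

if __name__ == "__main__":
    cap = cv2.VideoCapture(0)
    with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.7) as hands:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            d = thumb_index_distance(frame, hands)
            if d is not None:
                print(f"thumb-index distance: {d:.3f}")   # map this value to volume/CC
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break
    cap.release()
```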

Additionally, we are in the process of implementing a program that is capable of recognising and interpreting American Sign Language (ASL) hand signs in real time. It consists of two parts: data collection and testing. The data collection phase captures images of hand signs using a webcam, processes them to create a standardised dataset, and stores them for training a machine learning model. The testing phase uses a pre-trained model to classify hand signs in real time. By recognizing and distinguishing different hand signs, the program will be able to map each gesture to specific ways of manipulating sound, such as adjusting pitch, modulating filters, changing volume, etc. For example, a gesture for the letter "A" could activate a low-pass filter, while a gesture for "B" could increase the volume.
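A minimal sketch of what the data-collection step could look like; the folder layout, image size and key bindings are assumptions and the actual program may differ.

```python
# Minimal sketch: save webcam images of one hand sign into a per-label folder.
import os
import cv2

LABEL = "A"                                 # the ASL sign currently being collected
OUT_DIR = os.path.join("dataset", LABEL)
IMG_SIZE = 224                              # assumed standardised image size
os.makedirs(OUT_DIR, exist_ok=True)

cap = cv2.VideoCapture(0)
count = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imshow("data collection", frame)
    key = cv2.waitKey(1) & 0xFF
    if key == ord('s'):                     # press 's' to save a training image
        sample = cv2.resize(frame, (IMG_SIZE, IMG_SIZE))
        cv2.imwrite(os.path.join(OUT_DIR, f"{LABEL}_{count:04d}.jpg"), sample)
        count += 1
    elif key == ord('q'):                   # press 'q' to quit
        break
cap.release()
cv2.destroyAllWindows()
```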

Possible improvements and future work

References

[1] “A MIDI Controller based on Human Motion Capture (Institute of Visual Computing, Department of Computer Science, Bonn-Rhein-Sieg University of Applied Sciences),” ResearchGate. Accessed: Feb. 12, 2025. [Online]. Available: https://www.researchgate.net/publication/264562371_A_MIDI_Controller_based_on_Human_Motion_Capture_Institute_of_Visual_Computing_Department_of_Computer_Science_Bonn-Rhein-Sieg_University_of_Applied_Sciences

[2] M. Lim and N. Kotsani, “An Accessible, Browser-Based Gestural Controller for Web Audio, MIDI, and Open Sound Control,” Computer Music Journal, vol. 47, no. 3, pp. 6–18, Sep. 2023, doi: 10.1162/COMJ_a_00693.

[3] M. Oudah, A. Al-Naji, and J. Chahl, “Hand Gesture Recognition Based on Computer Vision: A Review of Techniques,” J Imaging, vol. 6, no. 8, p. 73, Jul. 2020, doi: 10.3390/jimaging608007

[4] A. Tagliasacchi, M. Schröder, A. Tkach, S. Bouaziz, M. Botsch, and M. Pauly, “Robust Articulated‐ICP for Real‐Time Hand Tracking,” Computer Graphics Forum, vol. 34, no. 5, pp. 101–114, Aug. 2015, doi: 10.1111/cgf.12700.

[5] A. Tkach, A. Tagliasacchi, E. Remelli, M. Pauly, and A. Fitzgibbon, “Online generative model personalization for hand tracking,” ACM Transactions on Graphics, vol. 36, no. 6, pp. 1–11, Nov. 2017, doi: 10.1145/3130800.3130830.

[6] T. Winkler, Composing Interactive Music: Techniques and Ideas Using Max. Cambridge, MA, USA: MIT Press, 2001.

[7] E. R. Miranda and M. M. Wanderley, New Digital Musical Instruments: Control and Interaction Beyond the Keyboard. Middleton, WI, USA: AR Editions, Inc., 2006.

[8] D. Hosken, An Introduction to Music Technology, 2nd ed. New York, NY, USA: Routledge, 2014. doi: 10.4324/9780203539149.

[9] P. D. Lehrman and T. Tully, "What is MIDI?," Medford, MA, USA: MMA, 2017.

[10] C. Dobrian and F. Bevilacqua, Gestural Control of Music Using the Vicon 8 Motion Capture System. UC Irvine: Integrated Composition, Improvisation, and Technology (ICIT), 2003.

[11] J. L. Hernandez-Rebollar, “Method and apparatus for translating hand gestures,” US7565295B1, Jul. 21, 2009 Accessed: Feb. 12, 2025. [Online]. Available: https://patents.google.com/patent/US7565295B1/en

[12] I. Culjak, D. Abram, T. Pribanic, H. Dzapo, and M. Cifrek, “A brief introduction to OpenCV,” in 2012 Proceedings of the 35th International Convention MIPRO, May 2012, pp. 1725–1730. Accessed: Feb. 12, 2025. [Online]. Available: https://ieeexplore.ieee.org/document/6240859/?arnumber=6240859

[13] K. V. Sainadh, K. Satwik, V. Ashrith, and D. K. Niranjan, “A Real-Time Human Computer Interaction Using Hand Gestures in OpenCV,” in IOT with Smart Systems, J. Choudrie, P. N. Mahalle, T. Perumal, and A. Joshi, Eds., Singapore: Springer Nature Singapore, 2023, pp. 271–282.

[14] V. Patil, S. Sutar, S. Ghadage, and S. Palkar, “Gesture Recognition for Media Interaction: A Streamlit Implementation with OpenCV and MediaPipe,” International Journal for Research in Applied Science & Engineering Technology (IJRASET), 2023.

[15] A. P. Ismail, F. A. A. Aziz, N. M. Kasim, and K. Daud, “Hand gesture recognition on python and opencv,” IOP Conf. Ser.: Mater. Sci. Eng., vol. 1045, no. 1, p. 012043, Feb. 2021, doi: 10.1088/1757-899X/1045/1/012043.

[16] R. Tharun and I. Lakshmi, “Robust Hand Gesture Recognition Based On Computer Vision,” in 2024 International Conference on Intelligent Systems for Cybersecurity (ISCS), May 2024, pp. 1–7. doi: 10.1109/ISCS61804.2024.10581250.

[17] E. Theodoridou et al., “Hand tracking and gesture recognition by multiple contactless sensors: a survey,” IEEE Transactions on Human-Machine Systems, vol. 53, no. 1, pp. 35–43, Jul. 2022, doi: 10.1109/thms.2022.3188840.

[18] G. M. Lim, P. Jatesiktat, C. W. K. Kuah, and W. T. Ang, “Camera-based Hand Tracking using a Mirror-based Multi-view Setup,” IEEE Engineering in Medicine and Biology Society. Annual International Conference, pp. 5789–5793, Jul. 2020, doi: 10.1109/embc44109.2020.9176728.

[19] P. Rahimian and J. K. Kearney, “Optimal camera placement for motion capture systems,” IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 3, pp. 1209–1221, Dec. 2016, doi: 10.1109/tvcg.2016.2637334.

[20] R. Tchantchane, H. Zhou, S. Zhang, and G. Alici, “A Review of Hand Gesture Recognition Systems Based on Noninvasive Wearable Sensors,” Advanced Intelligent Systems, vol. 5, no. 10, p. 2300207, 2023, doi: 10.1002/aisy.202300207.

[21] J. P. Sahoo, A. J. Prakash, P. Pławiak, and S. Samantray, "Real-Time Hand Gesture Recognition Using Fine-Tuned Convolutional Neural Network," Sensors, vol. 22, no. 3, p. 706, 2022, doi: 10.3390/s22030706.

[22] M. Cheng, Y. Zhang, and W. Zhang, "Application and Research of Machine Learning Algorithms in Personalized Piano Teaching System," International Journal of High Speed Electronics and Systems, 2024, doi: 10.1142/S0129156424400949.

[23] C. Rhodes, R. Allmendinger, and R. Climent, "New Interfaces and Approaches to Machine Learning When Classifying Gestures within Music," Entropy, vol. 22, no. 12, p. 1384, 2020, doi: 10.3390/e22121384.

[24] S. Supriya and C. Manoharan, "Hand gesture recognition using multi-objective optimization-based segmentation technique," Journal of Electrical Engineering, vol. 21, pp. 133–145, 2024, doi: 10.59168/AEAY3121.

[25] G. Benitez-Garcia and H. Takahashi, "Multimodal Hand Gesture Recognition Using Automatic Depth and Optical Flow Estimation from RGB Videos," 2024, doi: 10.3233/FAIA240397.

[26] E. Togootogtokh, T. Shih, W. G. C. W. Kumara, S. J. Wu, S. W. Sun, and H. H. Chang, "3D finger tracking and recognition image processing for real-time music playing with depth sensors," Multimedia Tools and Applications, vol. 77, 2018, doi: 10.1007/s11042-017-4784-9.

[27] B. Manaris, D. Johnson, and Y. Vassilandonakis, "Harmonic Navigator: A Gesture-Driven, Corpus-Based Approach to Music Analysis, Composition, and Performance," AAAI Workshop - Technical Report, vol. 9, pp. 67–74, 2013, doi: 10.1609/aiide.v9i5.12658.

[28] M. Velte, "A MIDI Controller based on Human Motion Capture (Institute of Visual Computing, Department of Computer Science, Bonn-Rhein-Sieg University of Applied Sciences)," 2012, doi: 10.13140/2.1.4438.3366.

[29] S. M. Dikshith, "AirCanvas using OpenCV and MediaPipe," International Journal for Research in Applied Science and Engineering Technology, vol. 13, 2025, doi: 10.22214/ijraset.2025.66601.

[30] S. Patel and R. Deepa, "Hand Gesture Recognition Used for Functioning System Using OpenCV," pp. 3–10, 2023, doi: 10.4028/p-4589o3.

Summary


Current state of the art review