PRE2024 3 Group15
Group members: Nikola Milanovski, Senn Loverix, Rares Ilie, Matus Sevcik, Gabriel Karpinsky
Work done last week
Name | Total | Breakdown |
Nikola | 10h | Research synth software (2h), Finalize interview questions and interviewing (4h), Meeting (1h), Researching state-of-the-art (3h) |
Senn | 12h | Research different libraries (2h), familiarize with OpenCV (3h), research handling multiple camera inputs in OpenCV (2h), coding and testing (4h), Meeting (1h) |
Rares | 10h | Meeting (1h), Researching ASL classification models (2.5h), Data collection and preprocessing (5h), Testing ASL classification (1.5h) |
Matus | 12.5h | Meetings (0.5h), Preparing and conducting interviews (4h), Researching Ableton inputs, and in general what output should our software produce to work with audio software (6h), Understanding codebase and coding experiments (2h) |
Gabriel | 16h | Preparing interviews (2h), Interview with Jakub K. (3h), GUI research (2h), modifying codebase to be reproducible (git, venv, .env setup, requirements.txt automation, etc.) (6h), initial GUI coding experiments (3h) |
Problem statement and objectives
The synthesizer has become an essential instrument in the creation of modern music, allowing musicians to create and modulate sounds electronically. Traditionally, an analog synthesizer uses a keyboard to generate notes, and different knobs, buttons and sliders to manipulate the sound. However, through MIDI (Musical Instrument Digital Interface) the synthesizer can be controlled by an external device, usually also shaped like a keyboard, while the other controls are made digital. This opens up a wide range of possible input devices for manipulating a digital synthesizer. Although traditional keyboard-style MIDI controllers have been very successful, their form may restrict the expressiveness of musicians who seek to create more dynamic and unique sounds, and limit accessibility for people who struggle with the controls, for example due to a lack of keyboard-playing experience or a physical impairment.
During this project the aim is to design a new way of controlling a synthesizer using the motion of the user's hand. By moving the hand to a certain position in front of a suitable sensor system, consisting of one or more cameras, various aspects of the produced sound can be controlled, such as pitch or reverb. Computer vision techniques will be implemented in software to track the position of the user's hand and fingers, and different positions and orientations will be mapped to operations the synthesizer performs on the sound. Through MIDI, this information will be passed to synthesizer software that produces the electronic sound. Our aim is to let users across the music industry adopt this technology seamlessly, creating new sounds in an innovative, easy-to-control way that is more accessible than a traditional synthesizer.
Users
With this innovative way of producing music, the main targets for this technology are users in the music industry. Such users include performance artists and DJs, who could use the technology to enhance their live performances or sets. Visual artists and motion-based performers could integrate it within their choreography. Other users include music producers looking to create unique sounds or rhythms in their tracks. Content creators that use audio equipment to enhance their content, such as soundboards, could use the technology as a new way to seamlessly control the audio of their content.
This new way of controlling a synthesizer could also be a great way to introduce people to creating and producing electronic music. It would be especially useful for people with a physical impairment that may have restricted them from creating the music they wanted before.
User requirements
For the users mentioned above, we have set up a list of requirements we expect them to have for this new synthesizer technology. First of all, it should be easy to set up, so performance artists and producers don't spend too much time preparing right before a performance or set. Next, the technology should be easily accessible and easy to understand for all users, both people who have a lot of experience with electronic music and people who are relatively new to it.
Furthermore, the hand tracking should work in different environments. For example, a DJ working in a dimly lit club with many different lighting and visual effects during their set should still be able to rely on accurate hand tracking. It should also be easy to integrate the technology into the artist's workflow: an artist should not have to change their entire performing or producing routine in order to use a motion-based synthesizer.
Lastly, the technology should allow for elaborate customization to fit each user's needs. The user should be able to decide which attributes of the recognized hand gestures are important for their work, and which should be omitted. For example, if the vertical position of the hand regulates the pitch of the sound and the rotation of the hand the volume, the user should be able to 'turn off' the volume regulation so that rotating the hand changes nothing.
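As an illustration of this requirement, below is a minimal sketch (in Python, which we use elsewhere in the project) of what such a per-user mapping configuration could look like. The attribute and parameter names are hypothetical and not part of the current codebase.

```python
# Hypothetical sketch of a per-user mapping configuration: each tracked hand
# attribute is mapped to a sound parameter and can be toggled on or off.
# Attribute and parameter names are illustrative only.
mapping_config = {
    "hand_height":    {"controls": "pitch",         "enabled": True},
    "hand_rotation":  {"controls": "volume",        "enabled": False},  # 'turned off' by the user
    "pinch_distance": {"controls": "filter_cutoff", "enabled": True},
}

def apply_mappings(tracked: dict, config: dict) -> dict:
    """Return only the sound-parameter updates the user has enabled."""
    updates = {}
    for attribute, value in tracked.items():
        rule = config.get(attribute)
        if rule and rule["enabled"]:
            updates[rule["controls"]] = value
    return updates

# Rotating the hand changes nothing, because the volume mapping is disabled.
print(apply_mappings({"hand_height": 0.8, "hand_rotation": 0.3}, mapping_config))
# -> {'pitch': 0.8}
```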
To get a better understanding of the user requirements, we are planning on interviewing some people in the music industry such as music producers, a DJ and an audiovisual producer. The interview questions are as follows:
Background and Experience: What tools or instruments do you currently use in your creative process? Have you previously incorporated technology into your performances or creations? If so, how?
Creative Process and Workflow: Can you describe your typical workflow? How do you integrate new tools or technologies into your practice? What challenges do you face when adopting new technologies in your work?
Interaction with Technology: Have you used motion-based controllers or gesture recognition systems in your performances or art? If yes, what was your experience?
How do you feel about using hand gestures to control audio or visual elements during a performance?
What features would you find most beneficial in a hand motion recognition controller for your work?
Feedback on Prototype: What specific functionalities or capabilities would you expect from such a device?
How important is the intuitiveness and learning curve of a new tool in your adoption decision?
Performance and Practical Considerations: In live performances, how crucial is the reliability of your equipment? What are your expectations regarding the responsiveness and accuracy of motion-based controllers?
How do you manage technical issues during a live performance?
How important are the design and aesthetics of the tools you use?
Do you have any ergonomic preferences or concerns when using new devices during performances?
What emerging technologies are you most excited about in your field?
Interview (Gabriel K.) with (Jakub K., Music group and club manager, techno producer and DJ)
Important takeaways:
- Artists, mainly DJs, have prepared tracks which are chopped up and ready to perform, and they mostly play with effects and parameter values. For this application, that means using it instead of a knob or a digital slider to give the artist more granular control over an effect or a sound. For example, if our controller were an app with a GUI that could be turned on and off at will during the performance, he thinks it could add "spice" to a performance.
- For visual control during a live performance, he thinks it would be too difficult to use live, especially compared to the current methods. But he can imagine using it to control, for example, colour or some specific element.
- He says that many venues already have multiple cameras set up, which could be used to capture the gestures from multiple angles and at high resolutions.
- He can definitely imagine using it live if he wanted to.
- If it is used to modulate sound and not played like an instrument, delay isn't that much of a problem.
What tools or instruments do you currently use in your creative process?
For a live performance he uses a Pioneer 3000 player, which can send music to a Xone-92 or a Pioneer V10 mixer (industry standard). From there it goes to the speakers.
Alternatively, instead of the Pioneer 3000, a person can have a laptop going into a less advanced mixing deck.
Our hand-tracking solution could be a laptop acting as the full input to a mixing deck, or an additional input to the Xone-92 on a separate channel.
On the mixer he, as a DJ, mostly only uses the EQ, faders and an effects knob, leaving many channels open. Live acts use a lot more of the knobs than DJs do.
Have you previously incorporated technology into your performances or creations? If so, how?
Yes, he and a colleague tried to add a drum machine to a live performance. They had an Ableton project with samples and virtual MIDI controllers to add a live element to their performance, but it was too cumbersome to add to his standard DJ workflow.
What challenges do you face when adopting new technologies in your work?
Practicality: hauling equipment and setting it up adds time before the set and pack-up time after, especially when he goes to a club, and he worries about club-goers destroying the equipment. In his case even bringing a laptop is extra work; his current workflow has no laptop, he just brings USB sticks with prepared music and plays live on the equipment that is already set up.
What features would you find most beneficial in a hand motion recognition controller for your work?
If it could learn specific gestures; this could be solved with a sign-language library. He likes the idea of assigning specific gestures in the GUI to switch between different sound modulations. For example, show a thumbs-up and then the movement of his fingers modulates pitch; then a different gesture and his movements modulate something else.
What are your expectations regarding the responsiveness and accuracy of motion-based controllers?
Delay is OK if he learns it and can anticipate it. If the controller is played like an instrument, delay is a problem, but if it just modifies a sound without changing its rhythm, delay is completely fine.
How important are the design and aesthetics of the tools you use?
Aesthetics don't matter unless it's a commercial product. If he is supposed to pay for it he expects nice visuals; otherwise, if it works he doesn't care.
What emerging technologies are you most excited about in your field?
He says there is not that much new technology. There are some improvements in software, like AI tools that can separate music tracks, but otherwise it is a fairly standardized, figured-out industry.
Interview (Matus S.) with (Sami L., music producer)
Important takeaways:
- Might be too imprecise to work in music production
- The difficulty of using the software (learning it, implementing it) should not outweigh the benefit you could gain from it; in this specific case he does not think he would incorporate it in his workflow, but he would be interested in trying it.
What tools or instruments do you currently use in your creative process?
For music production he mainly uses Ableton as his main editing software, uses libraries to find sounds, and sometimes third-party equalizers or MIDI controllers (he couldn't give a name on the spot).
Have you previously incorporated technology into your performances or creations? If so, how?
No, doesn't do live performances.
What challenges do you face when adopting new technologies in your work?
Learning curves and price.
What features would you find most beneficial in a hand motion recognition controller for your work?
Being able to control the EQs in his open tracks, control some third-party tools, or change modulation or the volume levels of sounds/loops.
What are your expectations regarding the responsiveness and accuracy of motion-based controllers?
He would not want delay if it were used like a virtual instrument, but if the tool is only used for EQ, modulation or volume levels then it is fine. He would, however, be a little skeptical of the accuracy of the gesture sensing.
How important are the design and aesthetics of the tools you use?
If it's not a commercial product he doesn't really care, but he would ideally avoid a 20-year-old-looking interface.
What emerging technologies are you most excited about in your field?
Does not really track them.
Approach, milestones and deliverables
- Market research interviews with musicians, music producers etc.
- Requirements for hardware
- Ease of use requirements
- Understanding of how to seamlessly integrate our product into a musician's workflow.
- Find software stack solutions
- Library for hand tracking
- Encoder to MIDI or another viable format.
- Synthesizer that can accept live inputs in chosen encoding format.
- Audio output solution
- Find hardware solutions
- Camera/ visual input
- Multiple cameras
- IR depth tracking
- Viability of a standard webcam (laptop or otherwise)
- MVP (Minimal viable product)
- Create a demonstration product proving the viability of the concept by modulating a single synthesizer using basic hand gestures and a laptop webcam or another easily accessible camera.
- Test with potential users and get feedback
- Refined final product
- Additional features
- Ease of use and integration improvements
- Testing on different hardware and software platforms
- Visual improvements to the software
- Potential support for more encoding formats or additional input methods other than hand tracking
Who is doing what?
Nikola - Interface with audio software and music solution
Senn - Hardware interface and hardware solutions
Gabriel, Rares, Matus - Software processing of input and producing output
- Gabriel: GUI and GPU acceleration with OpenGL
- Rares: Main hand tracking solutions
- Matus: Code to MIDI and Ableton integration
Code documentation
GUI
After researching potential GUI solutions we narrowed it down to a few candidates: a native Python GUI built with PyQt6, or a JavaScript-based interface written in React/Angular, run either in a browser via a WebSocket or as a standalone app. The main considerations were looks, performance and ease of implementation. We settled on PyQt6 as it is native to Python, allows for OpenGL hardware acceleration and is relatively easy to implement for our use case. An MVP prototype is not ready yet but is being worked on. We identified the most important features to implement first as: changing the input sensitivity of the hand tracking, changing the camera input channel in case multiple cameras are connected, and a start/stop button.
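Below is a minimal sketch of what this PyQt6 control window could look like; the widget names and controller hooks are placeholders, since the actual prototype is still in progress.

```python
# Minimal PyQt6 sketch of the planned control window: a sensitivity slider,
# a camera-index selector and a start/stop button. The controller hooks are
# placeholders; the real prototype is still being developed.
import sys
from PyQt6.QtCore import Qt
from PyQt6.QtWidgets import (QApplication, QWidget, QVBoxLayout, QLabel,
                             QSlider, QComboBox, QPushButton)

class ControlWindow(QWidget):
    def __init__(self):
        super().__init__()
        self.setWindowTitle("Hand Tracking Controller")
        layout = QVBoxLayout(self)

        # Input sensitivity of the hand tracking (arbitrary 1-100 scale for now).
        layout.addWidget(QLabel("Tracking sensitivity"))
        self.sensitivity = QSlider(Qt.Orientation.Horizontal)
        self.sensitivity.setRange(1, 100)
        self.sensitivity.setValue(50)
        layout.addWidget(self.sensitivity)

        # Camera input channel, in case multiple cameras are connected.
        layout.addWidget(QLabel("Camera input"))
        self.camera = QComboBox()
        self.camera.addItems(["Camera 0", "Camera 1"])
        layout.addWidget(self.camera)

        # Start/stop button toggling the (not yet connected) tracking loop.
        self.toggle = QPushButton("Start")
        self.toggle.setCheckable(True)
        self.toggle.toggled.connect(self.on_toggle)
        layout.addWidget(self.toggle)

    def on_toggle(self, running: bool):
        self.toggle.setText("Stop" if running else "Start")
        # Here the tracking thread would be started/stopped using the chosen
        # camera index (self.camera.currentIndex()) and sensitivity value.

if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = ControlWindow()
    window.show()
    sys.exit(app.exec())
```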
Output to audio software
We still have to decide what the product should look like: whether we want it to be, in very simple terms, a knob (or any number of knobs) from a mixing deck, or whether we want it to run as an executable inside Ableton, or something else entirely. Based on this decision there are several options for what output the software should produce. From the interview with Jakub K. we learned that we could simply pass MIDI to Ableton, and in some cases an integer value or other metadata. The executable-inside-Ableton case would be much more difficult: we would have to research how to make it interact correctly with Ableton, since we would still need to figure out how to change parameters from inside Ableton.
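As a rough illustration of the MIDI option, the sketch below sends a Control Change message that Ableton can map to any parameter. The use of the mido library (with the python-rtmidi backend) is an assumption for illustration; we have not yet chosen a MIDI library.

```python
# Hedged sketch of the MIDI-output option: a normalized hand-tracking value in
# [0, 1] is scaled to a 7-bit MIDI Control Change message that Ableton can map
# to any parameter. Library choice (mido + python-rtmidi) is assumed, not final.
import mido

# On macOS/Linux a virtual port can be created directly; on Windows a loopback
# driver such as loopMIDI would provide the port instead.
port = mido.open_output("HandTracking", virtual=True)

def send_control(value: float, cc_number: int = 1, channel: int = 0):
    """Map a normalized hand-tracking value (0.0-1.0) to a CC message."""
    cc_value = max(0, min(127, int(value * 127)))
    port.send(mido.Message("control_change", channel=channel,
                           control=cc_number, value=cc_value))

# Example: a pinch distance of 0.6 becomes CC#1 = 76 on channel 1.
send_control(0.6)
```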
Hand tracking and gesture recognition
We have implemented a real-time hand detection system that tracks the position of 21 landmarks on a user's hand using the MediaPipe library. It captures live video input, processes each frame to detect the hand, and maps the 21 key points to create a skeletal representation of the hand. The program calculates the Euclidean distance between the tip of the thumb and the tip of the index finger. This distance is then mapped to a volume control range using the pycaw library. As the user moves their thumb and index finger closer or farther apart, the system's volume is adjusted in real time.
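A condensed sketch of this pipeline is shown below; the distance range and detection threshold are illustrative and may differ from the actual implementation (pycaw limits this sketch to Windows).

```python
# Condensed sketch of the described pipeline: MediaPipe hand landmarks from a
# webcam feed, thumb-index distance mapped to system volume via pycaw (Windows).
import cv2
import numpy as np
import mediapipe as mp
from ctypes import cast, POINTER
from comtypes import CLSCTX_ALL
from pycaw.pycaw import AudioUtilities, IAudioEndpointVolume

# Windows master-volume endpoint exposed by pycaw.
device = AudioUtilities.GetSpeakers()
interface = device.Activate(IAudioEndpointVolume._iid_, CLSCTX_ALL, None)
volume = cast(interface, POINTER(IAudioEndpointVolume))

hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.7)
cap = cv2.VideoCapture(0)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        lm = results.multi_hand_landmarks[0].landmark
        thumb, index = lm[4], lm[8]          # landmark 4: thumb tip, 8: index tip
        dist = np.hypot(thumb.x - index.x, thumb.y - index.y)
        # Map the normalized distance (rough 0.02-0.30 range) to a 0.0-1.0 volume scalar.
        level = float(np.interp(dist, [0.02, 0.30], [0.0, 1.0]))
        volume.SetMasterVolumeLevelScalar(level, None)
    cv2.imshow("Hand tracking", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```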
Additionally, we are in the process of implementing a program that is capable of recognising and interpreting American Sign Language (ASL) hand signs in real time. It consists of two parts: data collection and testing. The data collection phase captures images of hand signs using a webcam, processes them to create a standardised dataset, and stores them for training a machine learning model. The testing phase uses a pre-trained model to classify hand signs in real time. By recognizing and distinguishing different hand signs, the program will be able to map each gesture to specific ways of manipulating sound, such as adjusting pitch, modulating filters, changing volume, etc. For example, a gesture for the letter "A" could activate a low-pass filter, while a gesture for "B" could increase the volume.
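A simplified sketch of the data-collection step is shown below; it saves whole resized frames into one folder per sign label, whereas the actual program applies additional preprocessing to standardise the hand region before training.

```python
# Simplified sketch of the ASL data-collection step: frames captured from a
# webcam are resized and saved into one folder per sign label, to be used later
# for training the classifier. Crop/normalization details are omitted here.
import os
import cv2

LABEL = "A"                      # the sign currently being recorded
OUT_DIR = os.path.join("dataset", LABEL)
os.makedirs(OUT_DIR, exist_ok=True)
IMG_SIZE = 224                   # standardized square size for training

cap = cv2.VideoCapture(0)
count = 0
while cap.isOpened() and count < 200:        # collect 200 samples per label
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imshow(f"Collecting sign '{LABEL}' - press s to save, q to quit", frame)
    key = cv2.waitKey(1) & 0xFF
    if key == ord("s"):
        sample = cv2.resize(frame, (IMG_SIZE, IMG_SIZE))
        cv2.imwrite(os.path.join(OUT_DIR, f"{LABEL}_{count:03d}.jpg"), sample)
        count += 1
    elif key == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```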
Multiple camera inputs
To handle data input from two cameras effectively, various software libraries were studied to find the one best suited. The library that seemed to handle multiple camera inputs best was OpenCV. Using the OpenCV library, a program was created that takes the input of two cameras and outputs the captured images. The two cameras used to test the software were a laptop webcam and a phone camera wirelessly connected to the laptop as a webcam. The program ran correctly, but there was a lot of delay between the two cameras. In an attempt to reduce this latency, the program was adjusted to use a lower buffer size for both camera inputs and to manually decrease the resolution, but this did not seem to help. An advantage of OpenCV is that it allows for threading, so that the frames of both cameras can be retrieved in separate threads. Along with that, a technique called FPS smoothing can be implemented to try to synchronize frames. However, even after implementing both threading and frame smoothing, the latency did not noticeably decrease. The limitation could be due to the wireless connection between the phone and the laptop, so a USB-connected webcam could possibly give better results.
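The sketch below outlines the threaded capture approach that was tested; camera indices, resolution and buffer settings are machine-specific, and CAP_PROP_BUFFERSIZE is only honoured by some capture backends.

```python
# Sketch of the threaded two-camera capture that was tested: each camera is read
# in its own thread, the buffer size is reduced and the resolution lowered in an
# attempt to cut latency. Camera indices and sizes depend on the machine used.
import threading
import cv2

class CameraReader(threading.Thread):
    def __init__(self, index: int):
        super().__init__(daemon=True)
        self.cap = cv2.VideoCapture(index)
        self.cap.set(cv2.CAP_PROP_BUFFERSIZE, 1)      # keep only the newest frame (backend-dependent)
        self.cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)   # manually lower the resolution
        self.cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
        self.frame = None
        self.running = True

    def run(self):
        while self.running:
            ok, frame = self.cap.read()
            if ok:
                self.frame = frame

    def stop(self):
        self.running = False
        self.cap.release()

readers = [CameraReader(0), CameraReader(1)]   # e.g. laptop webcam + second camera
for r in readers:
    r.start()

while True:
    for i, r in enumerate(readers):
        if r.frame is not None:
            cv2.imshow(f"Camera {i}", r.frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

for r in readers:
    r.stop()
cv2.destroyAllWindows()
```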
State of the art (sources)
[1] “A MIDI Controller based on Human Motion Capture (Institute of Visual Computing, Department of Computer Science, Bonn-Rhein-Sieg University of Applied Sciences),” ResearchGate. Accessed: Feb. 12, 2025. [Online]. Available: https://www.researchgate.net/publication/264562371_A_MIDI_Controller_based_on_Human_Motion_Capture_Institute_of_Visual_Computing_Department_of_Computer_Science_Bonn-Rhein-Sieg_University_of_Applied_Sciences
[2] M. Lim and N. Kotsani, “An Accessible, Browser-Based Gestural Controller for Web Audio, MIDI, and Open Sound Control,” Computer Music Journal, vol. 47, no. 3, pp. 6–18, Sep. 2023, doi: 10.1162/COMJ_a_00693.
[3] M. Oudah, A. Al-Naji, and J. Chahl, “Hand Gesture Recognition Based on Computer Vision: A Review of Techniques,” J Imaging, vol. 6, no. 8, p. 73, Jul. 2020, doi: 10.3390/jimaging6080073.
[4] A. Tagliasacchi, M. Schröder, A. Tkach, S. Bouaziz, M. Botsch, and M. Pauly, “Robust Articulated‐ICP for Real‐Time Hand Tracking,” Computer Graphics Forum, vol. 34, no. 5, pp. 101–114, Aug. 2015, doi: 10.1111/cgf.12700.
[5] A. Tkach, A. Tagliasacchi, E. Remelli, M. Pauly, and A. Fitzgibbon, “Online generative model personalization for hand tracking,” ACM Transactions on Graphics, vol. 36, no. 6, pp. 1–11, Nov. 2017, doi: 10.1145/3130800.3130830.
[6] T. Winkler, Composing Interactive Music: Techniques and Ideas Using Max. Cambridge, MA, USA: MIT Press, 2001.
[7] E. R. Miranda and M. M. Wanderley, New Digital Musical Instruments: Control and Interaction Beyond the Keyboard. Middleton, WI, USA: AR Editions, Inc., 2006.
[8] D. Hosken, An Introduction to Music Technology, 2nd ed. New York, NY, USA: Routledge, 2014. doi: 10.4324/9780203539149.
[9] P. D. Lehrman and T. Tully, "What is MIDI?," Medford, MA, USA: MMA, 2017.
[10] C. Dobrian and F. Bevilacqua, Gestural Control of Music Using the Vicon 8 Motion Capture System. UC Irvine: Integrated Composition, Improvisation, and Technology (ICIT), 2003.
[11] J. L. Hernandez-Rebollar, “Method and apparatus for translating hand gestures,” US7565295B1, Jul. 21, 2009 Accessed: Feb. 12, 2025. [Online]. Available: https://patents.google.com/patent/US7565295B1/en
[12] I. Culjak, D. Abram, T. Pribanic, H. Dzapo, and M. Cifrek, “A brief introduction to OpenCV,” in 2012 Proceedings of the 35th International Convention MIPRO, May 2012, pp. 1725–1730. Accessed: Feb. 12, 2025. [Online]. Available: https://ieeexplore.ieee.org/document/6240859/?arnumber=6240859
[13] K. V. Sainadh, K. Satwik, V. Ashrith, and D. K. Niranjan, “A Real-Time Human Computer Interaction Using Hand Gestures in OpenCV,” in IOT with Smart Systems, J. Choudrie, P. N. Mahalle, T. Perumal, and A. Joshi, Eds., Singapore: Springer Nature Singapore, 2023, pp. 271–282.
[14] V. Patil, S. Sutar, S. Ghadage, and S. Palkar, “Gesture Recognition for Media Interaction: A Streamlit Implementation with OpenCV and MediaPipe,” International Journal for Research in Applied Science & Engineering Technology (IJRASET), 2023.
[15] A. P. Ismail, F. A. A. Aziz, N. M. Kasim, and K. Daud, “Hand gesture recognition on python and opencv,” IOP Conf. Ser.: Mater. Sci. Eng., vol. 1045, no. 1, p. 012043, Feb. 2021, doi: 10.1088/1757-899X/1045/1/012043.
[16] R. Tharun and I. Lakshmi, “Robust Hand Gesture Recognition Based On Computer Vision,” in 2024 International Conference on Intelligent Systems for Cybersecurity (ISCS), May 2024, pp. 1–7. doi: 10.1109/ISCS61804.2024.10581250.
[17] E. Theodoridou et al., “Hand tracking and gesture recognition by multiple contactless sensors: a survey,” IEEE Transactions on Human-Machine Systems, vol. 53, no. 1, pp. 35–43, Jul. 2022, doi: 10.1109/thms.2022.3188840.
[18] G. M. Lim, P. Jatesiktat, C. W. K. Kuah, and W. T. Ang, “Camera-based Hand Tracking using a Mirror-based Multi-view Setup,” IEEE Engineering in Medicine and Biology Society. Annual International Conference, pp. 5789–5793, Jul. 2020, doi: 10.1109/embc44109.2020.9176728.
[19] P. Rahimian and J. K. Kearney, “Optimal camera placement for motion capture systems,” IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 3, pp. 1209–1221, Dec. 2016, doi: 10.1109/tvcg.2016.2637334.
[20] R. Tchantchane, H. Zhou, S. Zhang, and G. Alici, “A Review of Hand Gesture Recognition Systems Based on Noninvasive Wearable Sensors,” Advanced Intelligent Systems, vol. 5, no. 10, p. 2300207, 2023, doi: 10.1002/aisy.202300207.
[21] J. P. Sahoo, A. J. Prakash, P. Pławiak, and S. Samantray, “Real-Time Hand Gesture Recognition Using Fine-Tuned Convolutional Neural Network,” Sensors, vol. 22, no. 3, p. 706, 2022, doi: 10.3390/s22030706.
[22] M. Cheng, Y. Zhang, and W. Zhang, “Application and Research of Machine Learning Algorithms in Personalized Piano Teaching System,” International Journal of High Speed Electronics and Systems, 2024, doi: 10.1142/S0129156424400949.
[23] C. Rhodes, R. Allmendinger, and R. Climent, “New Interfaces and Approaches to Machine Learning When Classifying Gestures within Music,” Entropy, vol. 22, no. 12, p. 1384, 2020, doi: 10.3390/e22121384.
[24] S. Supriya and C. Manoharan, “Hand gesture recognition using multi-objective optimization-based segmentation technique,” Journal of Electrical Engineering, vol. 21, pp. 133–145, 2024, doi: 10.59168/AEAY3121.
[25] G. Benitez-Garcia and H. Takahashi, “Multimodal Hand Gesture Recognition Using Automatic Depth and Optical Flow Estimation from RGB Videos,” 2024, doi: 10.3233/FAIA240397.
[26] E. Togootogtokh, T. Shih, W. G. C. W. Kumara, S. J. Wu, S. W. Sun, and H. H. Chang, “3D finger tracking and recognition image processing for real-time music playing with depth sensors,” Multimedia Tools and Applications, vol. 77, 2018, doi: 10.1007/s11042-017-4784-9.
[27] B. Manaris, D. Johnson, and Y. Vassilandonakis, “Harmonic Navigator: A Gesture-Driven, Corpus-Based Approach to Music Analysis, Composition, and Performance,” AAAI Workshop - Technical Report, vol. 9, pp. 67–74, 2013, doi: 10.1609/aiide.v9i5.12658.
[28] M. Velte, “A MIDI Controller based on Human Motion Capture,” Institute of Visual Computing, Department of Computer Science, Bonn-Rhein-Sieg University of Applied Sciences, 2012, doi: 10.13140/2.1.4438.3366.
[29] S. M. Dikshith, “AirCanvas using OpenCV and MediaPipe,” International Journal for Research in Applied Science and Engineering Technology, vol. 13, pp. 1467–1473, 2025, doi: 10.22214/ijraset.2025.66601.
[30] S. Patel and R. Deepa, “Hand Gesture Recognition Used for Functioning System Using OpenCV,” pp. 3–10, 2023, doi: 10.4028/p-4589o3.
Summary
Preface
As there is no commercial product on this exact topic, and the state of the art for this exact product comes predominantly from the open-source community, we looked more into the state of the art of the individual components we would need in order to complete this product. Our inspiration for this project came from an Instagram video (https://www.instagram.com/p/DCwbZwczaER/) by a Taiwanese visual media artist. After a bit of research we found more artists doing similar work, for example a British artist creating gloves for hand-gesture control (https://mimugloves.com/#product-list). We also discovered a commercially sold product that uses a similar technology to our idea to recognize hand movements and gestures for piano players (https://roli.com/eu/product/airwave-create). This product uses cameras and ultrasound to map the hand movements of a piano player while they perform, adding an additional layer of musical expression. However, this product is merely an accessory to an already existing instrument and not a fully fledged instrument in its own right.
Hardware
There are various kinds of cameras and sensors that can be used for hand tracking and gesture recognition. The main types used are RGB cameras, IR sensors and RGB-D cameras, each with its own advantages and disadvantages regarding cost and resolution. RGB cameras provide high-resolution colour information and are cost-efficient, but do not provide depth information. Infrared (IR) sensors can capture very detailed hand movements, but are sensitive to other sources of IR light. Depth sensors/cameras such as RGB-D cameras can be used to construct an accurate depth image of the hand, but cost more than an RGB camera and sometimes do not reach the same resolution. A more accurate setup would use multiple cameras (RGB/IR/depth) to create more robust tracking and gesture recognition in various environments; the disadvantage is that this increases both system complexity and cost, and it requires synchronization between sensors [17]. Another way of increasing tracking accuracy is to use mirrors, which has the advantage of reducing the cost of needing multiple cameras [18]. The camera type most commonly used in computer-vision tasks involving hand tracking and gesture recognition is the RGB-D camera [4],[5].
Software
The literature on software-based visual capture for hand-gesture recognition emphasizes OpenCV, an open-source library offering optimized algorithms for tasks such as object detection, motion tracking, and image processing [12], as a cornerstone for real-time image analysis. In most cases studied, this is augmented by MediaPipe, a framework developed by Google that provides ready-to-use pipelines for multimedia processing, including advanced hand tracking and pose estimation [14]. Collectively, these works demonstrate that a typical pipeline involves detecting and segmenting the hand, extracting key features or keypoints, and classifying the gesture, often in real time [12],[15],[16]. By leveraging various tools and the aforementioned libraries, researchers achieve robust performance in varying environments, addressing issues such as changes in lighting or background noise [13]. Most of the papers suggest using Python due to its accessibility and ease of integration with other tools [12]–[16].