PRE2024 3 Group15
Group members: Nikola Milanovski, Senn Loverix, Rares Ilie, Matus Sevcik, Gabriel Karpinsky
Work done last week
Name | Total | Breakdown |
Nikola | 11h | Studied papers (5h), Meeting (1.5h), Researched possible music software (3h), Interviews with potential users (1.5h) |
Senn | 12.5h | Studied papers (5h), wrote summaries (2h), Meeting (1.5h), Wrote Problem statement and objectives (2h), Wrote Users (1h), Wrote User requirements (1h) |
Rares | 13.5h | Attended meeting (2h), Researched sources and state of the art (5h), Researched contemporary software solutions (2.5h), Began working on real-time hand-tracking software (3h), Arranged interview with potential users (1h) |
Matus | 12h | Meeting (2.5h), Wrote preface for the state of the art and part of the software state of the art (3h), Researched papers and sources (6h), Arranged interview with two potential users (0.5h) |
Gabriel | 15h | Meeting (3h), Researched contemporary software and hardware solutions to the problem (6h), Watched all available YouTube demos of different implementations (2h), Arranged interviews with potential users (1h), Wrote and brainstormed the approach and other sections (3h) |
Introduction and plan
Problem statement and objectives
The synthesizer has become an essential instrument in the creation of modern music, allowing musicians to create and modulate sounds electronically. Traditionally, an analog synthesizer uses a keyboard to generate notes, and various knobs, buttons and sliders to manipulate the sound. With MIDI (Musical Instrument Digital Interface), however, a synthesizer can be controlled by an external device, usually also shaped like a keyboard, with the other controls made digital. This allows for a wide range of input devices to manipulate a digital synthesizer. Although traditional keyboard MIDI controllers have been very successful, their form may restrict the expressiveness of musicians who seek to create more dynamic and unique sounds, and may limit accessibility for people who struggle with the controls, for example due to a lack of keyboard-playing experience or a physical impairment.
During this project, the aim is to design a new way of controlling a synthesizer using the motion of the user's hand. By moving their hand to a certain position in front of a suitable sensor system consisting of one or more cameras, the user can control various aspects of the produced sound, such as pitch or reverb. Computer vision techniques will be implemented in software to track the position of the user's hand and fingers, and different positions and orientations will be mapped to operations that the synthesizer performs on the sound. Through MIDI, this information will be passed to synthesizer software to produce the electronic sound. We aim to let various users in the music industry seamlessly adopt this technology to create brand-new sounds in an innovative way that is easier to control, and more accessible, than a traditional synthesizer.
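To illustrate the intended pipeline, below is a minimal sketch of its last step: turning a tracked hand position into MIDI messages. It assumes the mido library with the python-rtmidi backend; the port name 'HandSynth' and the use of CC 1 are arbitrary choices for illustration, not fixed design decisions.

```python
import mido

def hand_y_to_cc(hand_y: float) -> mido.Message:
    """Map a normalized vertical hand position (0.0 = top of frame,
    1.0 = bottom) to a MIDI control-change message. Synthesizer
    software can bind CC 1 to pitch, reverb, filter cutoff, etc."""
    value = int(round((1.0 - hand_y) * 127))  # invert so raising the hand raises the value
    value = max(0, min(127, value))           # clamp to the valid MIDI data range
    return mido.Message('control_change', control=1, value=value)

# A virtual output port makes this program appear to synthesizer
# software (e.g. a DAW) as an ordinary MIDI input device.
with mido.open_output('HandSynth', virtual=True) as port:
    port.send(hand_y_to_cc(0.25))             # hand in the upper quarter of the frame
```

Any synthesizer that accepts MIDI input could then map CC 1 to a sound parameter of the user's choice.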
Users
With this innovative way of producing music, the main targets for this technology are users in the music industry. Such users include performance artists and DJs, who can use the technology to enhance their live performances or sets. Visual artists and motion-based performers could integrate the technology into their choreography. Other users include music producers looking to create unique sounds or rhythms in their tracks. Content creators who use audio equipment such as soundboards to enhance their content could adopt the technology as a new way to seamlessly control the audio of their content.
This new way of controlling a synthesizer could also be a great way to introduce people to creating and producing electronic music. It would be especially useful for people with a physical impairment that may have restricted them from creating the music they wanted before.
User requirements
For the users mentioned above, we have drawn up a list of requirements we expect them to have for this new synthesizer technology. First of all, it should be easy to set up, so that performance artists and producers do not spend too much time preparing right before their performance or set. Next, the technology should be easily accessible and easy to understand for all users, both people who have a lot of experience with electronic music and people who are relatively new to it.
Furthermore, the hand tracking should work in different environments. For example, a DJ who works in dimly lit clubs with many different lighting and visual effects during their sets should still be able to rely on accurate hand tracking. It should also be easy to integrate the technology into the artist's workflow: an artist should not have to change their entire routine of performing or producing music in order to use a motion-based synthesizer.
Lastly, the technology should allow for elaborate customization to fit each user's needs. The user should be able to decide which attributes of the recognized hand gestures are important for their work and which should be omitted. For example, if the vertical position of the hand regulates the pitch of the sound and the rotation of the hand the volume, the user should be able to 'turn off' the volume regulation so that rotating their hand changes nothing.
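One way such customization could be realized (a sketch only; the attribute and parameter names are made up for illustration) is a mapping table in which every entry can be toggled independently:

```python
# Each recognized hand attribute maps to one sound parameter and
# carries an 'enabled' flag, so a mapping can be turned off without
# being removed.
mappings = {
    'vertical_position': {'parameter': 'pitch',  'enabled': True},
    'rotation':          {'parameter': 'volume', 'enabled': True},
}

def apply_gesture(attributes):
    """Return only the parameter changes whose mappings are enabled."""
    changes = {}
    for attribute, value in attributes.items():
        mapping = mappings.get(attribute)
        if mapping and mapping['enabled']:
            changes[mapping['parameter']] = value
    return changes

mappings['rotation']['enabled'] = False  # the user 'turns off' volume control
print(apply_gesture({'vertical_position': 0.8, 'rotation': 0.3}))
# -> {'pitch': 0.8}: rotating the hand now changes nothing
```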
To get a better understanding of the user requirements, we are planning on interviewing some people in the music industry such as music producers, a DJ and an audiovisual producer. The interview questions are as follows:
Background and Experience:
- What tools or instruments do you currently use in your creative process?
- Have you previously incorporated technology into your performances or creations? If so, how?
Creative Process and Workflow:
- Can you describe your typical workflow? How do you integrate new tools or technologies into your practice?
- What challenges do you face when adopting new technologies in your work?
Interaction with Technology:
- Have you used motion-based controllers or gesture recognition systems in your performances or art? If yes, what was your experience?
- How do you feel about using hand gestures to control audio or visual elements during a performance?
- What features would you find most beneficial in a hand motion recognition controller for your work?
Feedback on Prototype:
- What specific functionalities or capabilities would you expect from such a device?
- How important is the intuitiveness and learning curve of a new tool in your adoption decision?
Performance and Practical Considerations:
- In live performances, how crucial is the reliability of your equipment?
- What are your expectations regarding the responsiveness and accuracy of motion-based controllers?
- How do you manage technical issues during a live performance?
- How important are the design and aesthetics of the tools you use?
- Do you have any ergonomic preferences or concerns when using new devices during performances?
- What emerging technologies are you most excited about in your field?
Approach, milestones and deliverables
- Market research interviews with musicians, music producers etc.
- Requirements for hardware
- Ease of use requirements
- Understanding of how to seamlessly integrate our product into a musician's workflow.
- Find software stack solutions
- Library for hand tracking
- Encoder to MIDI or another viable format.
- Synthesizer that can accept live inputs in the chosen encoding format.
- Audio output solution
- Find hardware solutions
- Camera/visual input
- Multiple cameras
- IR depth tracking
- Viability of a standard webcam (laptop or otherwise)
- MVP (minimum viable product)
- Create a demonstration product proving the viability of the concept by modifying a single synthesizer using basic hand gestures and a laptop webcam or other easily accessible camera (a sketch of such a demo follows after this list).
- Test with potential users and get feedback
- Refined final product
- Additional features
- Ease of use and integration improvements
- Testing on different hardware and software platforms
- Visual improvements to the software
- Potential support for more encoding formats or additional input methods other than hand tracking
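For the MVP mentioned above, a minimal sketch of what such a webcam demonstration could look like, assuming the opencv-python and mediapipe packages (the tracked wrist height would then be fed into a MIDI encoder such as the one sketched earlier):

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
cap = cv2.VideoCapture(0)  # default laptop webcam

with mp_hands.Hands(max_num_hands=1,
                    min_detection_confidence=0.5,
                    min_tracking_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV captures BGR frames.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            wrist = results.multi_hand_landmarks[0].landmark[mp_hands.HandLandmark.WRIST]
            # wrist.y is normalized to [0, 1]; this is the kind of value
            # a MIDI encoder would consume.
            print(f'hand height: {1.0 - wrist.y:.2f}')
        cv2.imshow('hand tracking', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

cap.release()
cv2.destroyAllWindows()
```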
Who is doing what?
Nikola - Interface with audio software
Senn - Hardware interface
Gabriel, Rares, Matus - Software processing of input and producing output
State of the art (sources)
[1] “A MIDI Controller based on Human Motion Capture (Institute of Visual Computing, Department of Computer Science, Bonn-Rhein-Sieg University of Applied Sciences),” ResearchGate. Accessed: Feb. 12, 2025. [Online]. Available: https://www.researchgate.net/publication/264562371_A_MIDI_Controller_based_on_Human_Motion_Capture_Institute_of_Visual_Computing_Department_of_Computer_Science_Bonn-Rhein-Sieg_University_of_Applied_Sciences
[2] M. Lim and N. Kotsani, “An Accessible, Browser-Based Gestural Controller for Web Audio, MIDI, and Open Sound Control,” Computer Music Journal, vol. 47, no. 3, pp. 6–18, Sep. 2023, doi: 10.1162/COMJ_a_00693.
[3] M. Oudah, A. Al-Naji, and J. Chahl, “Hand Gesture Recognition Based on Computer Vision: A Review of Techniques,” J. Imaging, vol. 6, no. 8, p. 73, Jul. 2020, doi: 10.3390/jimaging6080073.
[4] A. Tagliasacchi, M. Schröder, A. Tkach, S. Bouaziz, M. Botsch, and M. Pauly, “Robust Articulated‐ICP for Real‐Time Hand Tracking,” Computer Graphics Forum, vol. 34, no. 5, pp. 101–114, Aug. 2015, doi: 10.1111/cgf.12700.
[5] A. Tkach, A. Tagliasacchi, E. Remelli, M. Pauly, and A. Fitzgibbon, “Online generative model personalization for hand tracking,” ACM Transactions on Graphics, vol. 36, no. 6, pp. 1–11, Nov. 2017, doi: 10.1145/3130800.3130830.
[6] T. Winkler, Composing Interactive Music: Techniques and Ideas Using Max. Cambridge, MA, USA: MIT Press, 2001.
[7] E. R. Miranda and M. M. Wanderley, New Digital Musical Instruments: Control and Interaction Beyond the Keyboard. Middleton, WI, USA: AR Editions, Inc., 2006.
[8] D. Hosken, An Introduction to Music Technology, 2nd ed. New York, NY, USA: Routledge, 2014. doi: 10.4324/9780203539149.
[9] P. D. Lehrman and T. Tully, "What is MIDI?," Medford, MA, USA: MMA, 2017.
[10] C. Dobrian and F. Bevilacqua, Gestural Control of Music Using the Vicon 8 Motion Capture System. UC Irvine: Integrated Composition, Improvisation, and Technology (ICIT), 2003.
[11] J. L. Hernandez-Rebollar, “Method and apparatus for translating hand gestures,” US7565295B1, Jul. 21, 2009 Accessed: Feb. 12, 2025. [Online]. Available: https://patents.google.com/patent/US7565295B1/en
[12] I. Culjak, D. Abram, T. Pribanic, H. Dzapo, and M. Cifrek, “A brief introduction to OpenCV,” in 2012 Proceedings of the 35th International Convention MIPRO, May 2012, pp. 1725–1730. Accessed: Feb. 12, 2025. [Online]. Available: https://ieeexplore.ieee.org/document/6240859/?arnumber=6240859
[13] K. V. Sainadh, K. Satwik, V. Ashrith, and D. K. Niranjan, “A Real-Time Human Computer Interaction Using Hand Gestures in OpenCV,” in IOT with Smart Systems, J. Choudrie, P. N. Mahalle, T. Perumal, and A. Joshi, Eds., Singapore: Springer Nature Singapore, 2023, pp. 271–282.
[14] V. Patil, S. Sutar, S. Ghadage, and S. Palkar, “Gesture Recognition for Media Interaction: A Streamlit Implementation with OpenCV and MediaPipe,” International Journal for Research in Applied Science & Engineering Technology (IJRASET), 2023.
[15] A. P. Ismail, F. A. A. Aziz, N. M. Kasim, and K. Daud, “Hand gesture recognition on python and opencv,” IOP Conf. Ser.: Mater. Sci. Eng., vol. 1045, no. 1, p. 012043, Feb. 2021, doi: 10.1088/1757-899X/1045/1/012043.
[16] R. Tharun and I. Lakshmi, “Robust Hand Gesture Recognition Based On Computer Vision,” in 2024 International Conference on Intelligent Systems for Cybersecurity (ISCS), May 2024, pp. 1–7. doi: 10.1109/ISCS61804.2024.10581250.
[17] E. Theodoridou et al., “Hand tracking and gesture recognition by multiple contactless sensors: a survey,” IEEE Transactions on Human-Machine Systems, vol. 53, no. 1, pp. 35–43, Jul. 2022, doi: 10.1109/thms.2022.3188840.
[18] G. M. Lim, P. Jatesiktat, C. W. K. Kuah, and W. T. Ang, “Camera-based Hand Tracking using a Mirror-based Multi-view Setup,” IEEE Engineering in Medicine and Biology Society. Annual International Conference, pp. 5789–5793, Jul. 2020, doi: 10.1109/embc44109.2020.9176728.
[19] P. Rahimian and J. K. Kearney, “Optimal camera placement for motion capture systems,” IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 3, pp. 1209–1221, Dec. 2016, doi: 10.1109/tvcg.2016.2637334.
[20] R. Tchantchane, H. Zhou, S. Zhang, and G. Alici, “A Review of Hand Gesture Recognition Systems Based on Noninvasive Wearable Sensors,” Advanced Intelligent Systems, vol. 5, no. 10, p. 2300207, 2023, doi: 10.1002/aisy.202300207.
[21] J. P. Sahoo, A. J. Prakash, P. Pławiak, and S. Samantray, “Real-Time Hand Gesture Recognition Using Fine-Tuned Convolutional Neural Network,” Sensors, vol. 22, no. 3, p. 706, 2022, doi: 10.3390/s22030706.
[22] M. Cheng, Y. Zhang, and W. Zhang, “Application and Research of Machine Learning Algorithms in Personalized Piano Teaching System,” International Journal of High Speed Electronics and Systems, 2024, doi: 10.1142/S0129156424400949.
[23] C. Rhodes, R. Allmendinger, and R. Climent, “New Interfaces and Approaches to Machine Learning When Classifying Gestures within Music,” Entropy, vol. 22, no. 12, p. 1384, 2020, doi: 10.3390/e22121384.
[24] S. Supriya and C. Manoharan, “Hand gesture recognition using multi-objective optimization-based segmentation technique,” Journal of Electrical Engineering, vol. 21, pp. 133–145, 2024, doi: 10.59168/AEAY3121.
[25] G. Benitez-Garcia and H. Takahashi, “Multimodal Hand Gesture Recognition Using Automatic Depth and Optical Flow Estimation from RGB Videos,” 2024, doi: 10.3233/FAIA240397.
[26] E. Togootogtokh, T. Shih, W. G. C. W. Kumara, S. J. Wu, S. W. Sun, and H. H. Chang, “3D finger tracking and recognition image processing for real-time music playing with depth sensors,” Multimedia Tools and Applications, vol. 77, 2018, doi: 10.1007/s11042-017-4784-9.
[27] B. Manaris, D. Johnson, and Y. Vassilandonakis, “Harmonic Navigator: A Gesture-Driven, Corpus-Based Approach to Music Analysis, Composition, and Performance,” AAAI Workshop - Technical Report, vol. 9, pp. 67–74, 2013, doi: 10.1609/aiide.v9i5.12658.
[28] M. Velte, “A MIDI Controller based on Human Motion Capture (Institute of Visual Computing, Department of Computer Science, Bonn-Rhein-Sieg University of Applied Sciences),” 2012, doi: 10.13140/2.1.4438.3366.
[29] S. M. Dikshith, “AirCanvas using OpenCV and MediaPipe,” International Journal for Research in Applied Science and Engineering Technology, vol. 13, pp. 1467–1473, 2025, doi: 10.22214/ijraset.2025.66601.
[30] S. Patel and R. Deepa, “Hand Gesture Recognition Used for Functioning System Using OpenCV,” pp. 3–10, 2023, doi: 10.4028/p-4589o3.
Summary
Preface
As there exists no commercial product on this exact topic, and the state of the art for this exact product consists almost exclusively of the open-source community, we instead looked into the state of the art of the individual components we would need in order to complete this product. Our inspiration to create this project came from an Instagram video (https://www.instagram.com/p/DCwbZwczaER/) by a Taiwanese visual media artist. After some research we found more artists doing similar work, for example a British artist creating gloves for hand-gesture control (https://mimugloves.com/#product-list). We also discovered a commercially sold product that uses a technology similar to our idea to recognize the hand movements and gestures of piano players (https://roli.com/eu/product/airwave-create). This product uses cameras and ultrasound to map the hand movements of a piano player while they perform, adding an additional layer of musical expression. However, this product is merely an accessory to an already existing instrument and not a fully fledged instrument in its own right.
Hardware
There are various kinds of cameras and sensors that can be used for hand tracking and gesture recognition. The main types used are RGB cameras, IR sensors and RGB-D cameras, each with its own advantages and disadvantages regarding cost and resolution. RGB cameras provide high-resolution colour information and are cost efficient, but do not provide depth information. Infrared (IR) sensors can capture very detailed hand movements, but are sensitive to other sources of IR light. Depth cameras such as RGB-D cameras can be used to construct an accurate depth image of the hand, but cost more than an RGB camera and sometimes do not reach the same resolution. A further improvement would be to use multiple cameras (RGB/IR/depth) for more robust tracking and gesture recognition in various environments; the disadvantage is that this increases both system complexity and cost, and it requires synchronization between sensors [17]. Another way of increasing tracking accuracy is to use mirrors, which reduces the cost of needing multiple cameras [18]. The camera type most commonly used in computer vision tasks involving hand tracking and gesture recognition is the RGB-D camera [4], [5].
Software
The literature on software-based visual capture for hand-gesture recognition emphasizes OpenCV, an open-source library offering optimized algorithms for tasks such as object detection, motion tracking, and image processing [12], as a cornerstone for real-time image analysis. In most cases studied, this is augmented by MediaPipe, a framework developed by Google that provides ready-to-use pipelines for multimedia processing, including advanced hand tracking and pose estimation [14]. Collectively, these works demonstrate that a typical pipeline involves detecting and segmenting the hand, extracting key features or keypoints, and classifying the gesture, often in real time [12], [15], [16]. By leveraging various tools and the aforementioned libraries, researchers achieve robust performance in varying environments, addressing issues such as changes in lighting or background noise [13]. Most of the papers suggest using Python due to its accessibility and ease of integration with other tools [12]–[16].
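As a concrete instance of that detect-extract-classify pipeline, below is a sketch of a deliberately naive gesture classifier operating on MediaPipe hand keypoints. The landmark indices are MediaPipe's; the upright-hand assumption and the tip-above-joint test are simplifications for illustration, not a method taken from the cited papers.

```python
# Fingertip / middle-joint landmark index pairs for the four fingers
# in MediaPipe's 21-point hand model (the thumb is skipped because it
# extends along a different axis).
FINGER_JOINTS = [(8, 6), (12, 10), (16, 14), (20, 18)]

def count_extended_fingers(hand_landmarks) -> int:
    """Classify a simple gesture feature from a MediaPipe
    NormalizedLandmarkList: a finger counts as extended when its tip
    lies above its middle joint in image coordinates (y grows
    downwards), assuming the hand is held upright."""
    lm = hand_landmarks.landmark
    return sum(1 for tip, pip in FINGER_JOINTS if lm[tip].y < lm[pip].y)
```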
Current state of the art review
Hardware
Software implementation
To allow data inputs from two cameras to be handled effectively, various software libraries were studied to find the most suitable one. The library that seemed best able to handle multiple camera inputs was OpenCV. To start experimenting with it, frames from both cameras can be captured and displayed simultaneously, as in the sketch below.
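A minimal sketch of such an experiment, reading two camera streams with OpenCV (the device indices 0 and 1 are assumptions; they depend on how the operating system enumerates the cameras):

```python
import cv2

cams = [cv2.VideoCapture(0), cv2.VideoCapture(1)]

while all(cam.isOpened() for cam in cams):
    frames = []
    for cam in cams:
        ok, frame = cam.read()
        if not ok:
            break
        frames.append(frame)
    if len(frames) < len(cams):  # one of the cameras failed to deliver a frame
        break
    # Display both views; a real pipeline would hand the frame pair
    # to the hand-tracking stage instead.
    for i, frame in enumerate(frames):
        cv2.imshow(f'camera {i}', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

for cam in cams:
    cam.release()
cv2.destroyAllWindows()
```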