PRE2017 4 Groep3
Group members
- Stijn Beukers
- Marijn v.d. Horst
- Rowin Versteeg
- Pieter Voors
- Tom v.d. Velden
Brainstorm
We have discussed several ideas that we may want to implement.
- An AI and GUI for a board game, in which you can play against different AIs, possibly with an integrated multiplayer environment. The GUI could also give tips to the users.
- A filter for notifications on your smartphone to not get distracted by non-urgent notifications while still being available for urgent notifications.
- A simple way to connect multiple interfaces like doorbells, music, notifications or your alarm to the lights in your house.
- An artificial intelligence that automatically switches between camera angles in live broadcasts.
- A program that stitches together recorded videos like vlogs automatically.
- A program that makes music compilations in which songs flow together naturally, the way DJs mix music as if it were one big song, rather than fading out one song and starting the next.
- A system of cameras in homes for blind people that keeps track of where they have left certain items such that they can ask the system where they left it when they lose an object.
- A model of a robot which learns to walk/pick up objects using machine learning.
- A system that sorts music based on its genre.
Chosen Subject
For centuries our species has known that it is not perfect and will never attain perfection. To get ever closer, we have created many tools to bridge the gap between our weaknesses and the perfection and satisfaction we so desire. Though many problems have been tackled and the quality of human life has greatly improved, we are still capable of losing the items that provide such comfort, such as phones, tablets or laptops. Even at home, a TV remote is often lost. We propose a solution to the problem of losing items within the confines of a building: applying Artificial Intelligence (AI) to find items using live video footage. We chose this approach because image classification has proven to be efficient and effective at classifying and detecting objects in images. For convenience, the system will accept voice commands and, upon finding the requested item, will tell the user where it is through a speaker.
Users
Though many people may benefit from the proposed system, some would benefit more than others. A prime example is people who are visually impaired or blind. They can have a hard time finding an item, as they may not be able to recognize it or see it at all. The system would give them a sense of ease, as they would no longer have to keep track of where their items are all the time. Secondly, people with a form of dementia would greatly benefit from this system, as they would not have to worry about forgetting where they left their belongings. The elderly in general are also good candidate users, as they tend to be more forgetful and are the group that suffers most from the aforementioned conditions. Additionally, smart home enthusiasts could be interested in this system as a new type of smart device. Moreover, people with large mansions could be interested, as an item is easily lost within a mansion. Lastly, companies could be interested in investing in this software. By implementing the system, companies would be able to keep track of their staff's belongings and help find important documents that may be lost on someone's desk.
User Requirements
For this system to work, it needs to fulfill the following user requirements.
- The system should be able to inform the user where specific items are on command.
- The system should be available at all times.
- The system should understand voice commands and state the location of an object in an understandable manner.
- The system should only respond to the main user for security purposes.
- The system should respect the privacy concerns of the user.
- The system should be secure.
Goals
The goals of group 3 are as follows:
- Do research into the state of the art of AI picture recognition
- Find and interview relevant users
- Build an AI that can effectively classify certain objects based on pictures
- Determine which kind of cameras to use and where to place the cameras (in which rooms and placement)
- Expand AI capabilities by having it classify objects correctly within video footage
- Have the AI classify objects within live camera footage
- Have the AI determine the location of an object on command and tell the user
- Have the AI remember objects which are out of sight
Task division
- Voice Recognition
  - Stijn Beukers
  - Tom v.d. Velden
- User Research
  - Rowin Versteeg
- Image Detection
  - Pieter Voors
  - Marijn v.d. Horst
Planning
Milestones
Object detection
- Passive object detection (Detecting all objects or specific objects in video/image)
- Live video feed detection (Using a camera)
- Input: find specific item (Input: e.g. item name. Output: e.g. camera & location)
- Location classification (What is camera2 pixel [100,305] called?)
- Keeping track of where an item was last seen.
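The location classification and last-seen milestones above can be sketched as follows. This is only a minimal illustration, assuming that each camera image is divided by hand into named rectangles; the `Region` class, the region names and all coordinates are hypothetical and not part of the project.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Region:
    """A named rectangle in one camera's image (hypothetical layout)."""
    camera: str
    name: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def contains(self, x: int, y: int) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


# Hand-labelled regions; names and coordinates are made up for illustration.
REGIONS: List[Region] = [
    Region("camera2", "living room table", 50, 250, 400, 480),
    Region("camera2", "couch", 401, 200, 640, 480),
]


def classify_location(camera: str, x: int, y: int) -> Optional[str]:
    """Return the name of the first region containing the pixel, if any."""
    for region in REGIONS:
        if region.camera == camera and region.contains(x, y):
            return region.name
    return None


# Last known location per item; only names are stored, no footage.
last_seen: Dict[str, str] = {}


def report_detection(item: str, camera: str, box: Tuple[int, int, int, int]) -> None:
    """Record where an item was detected, using the centre of its bounding box."""
    x_min, y_min, x_max, y_max = box
    centre_x, centre_y = (x_min + x_max) // 2, (y_min + y_max) // 2
    location = classify_location(camera, centre_x, centre_y)
    if location is not None:
        last_seen[item] = location


report_detection("bottle", "camera2", (80, 290, 120, 320))
print(classify_location("camera2", 100, 305))  # living room table
print(last_seen.get("bottle"))                 # living room table
```

Because an item keeps its entry in `last_seen` until a newer detection overwrites it, the system can still answer when the item is no longer in view.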
Voice interface
- Define interface (which data is needed as input and output in communication between voice interface and object detection system)
- Pure data input coupling with system that then gives output (e.g. send “find bottle” to make sure it receives “living room table” as data, without interface for now)
- Voice parameter input (User Interface to have text input)
- Text to speech output (Output the result over TTS)
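The pure data coupling described above could look roughly like the sketch below: text goes in, a sentence for the text-to-speech step comes out. The command grammar, the `KNOWN_LOCATIONS` dictionary and the reply wording are assumptions made for illustration; in the real system the location would come from the object detection side.

```python
import re
from typing import Dict, Optional

# Stand-in for the object detection system's answers (hypothetical data).
KNOWN_LOCATIONS: Dict[str, str] = {"bottle": "living room table"}

# Accepts e.g. "find bottle" or "Where is my bottle?" (assumed grammar).
COMMAND_PATTERN = re.compile(
    r"^\s*(?:find|where is)\s+(?:my\s+|the\s+)?(?P<item>.+?)\s*\??\s*$",
    re.IGNORECASE,
)


def parse_command(text: str) -> Optional[str]:
    """Extract the requested item name from a recognised command."""
    match = COMMAND_PATTERN.match(text)
    return match.group("item").lower() if match else None


def handle_request(text: str) -> str:
    """Turn recognised text into the sentence that would be spoken back."""
    item = parse_command(text)
    if item is None:
        return "Sorry, I did not understand that."
    location = KNOWN_LOCATIONS.get(item)
    if location is None:
        return f"I have not seen the {item}."
    return f"The {item} was last seen at the {location}."


print(handle_request("find bottle"))
print(handle_request("Where is my wallet?"))
```

Keeping the interface as plain text on both sides means the speech recognition and text-to-speech parts can be swapped without touching the detection system.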
Research
- Check whether users actually like the system in question.
- Check whether losing items is an actual problem for visually impaired people.
- Check which locations in the building are most useful for users.
- Research privacy concerns regarding cameras in a home.
- Analyse the expected cost.
Deliverables
Prototype
- Create an object recognition setup with a camera.
- Create a voice recognition system that understands certain commands.
- Create a system that can locate objects that are asked for on a live camera feed.
- Create a system that can explain where a found object is located.
- Create a prototype that works according to the requirements.
Planning
Week | Milestones | Task division |
---|---|---|
Week 1 (23-04) | | |
Week 2 (30-04) | | |
Week 3 (07-05) | | |
Week 4 (14-05) | | |
Week 5 (21-05) | | |
Week 6 (28-05) | | |
Week 7 (04-06) | | |
Week 8 (11-06) | | |
Week 9 (18-06) | | |
State of the art
A number of studies have already been done into finding objects using artificial intelligence in different ways. Among studies specifically aimed at finding objects for visually impaired people are systems that make use of FM sonar, which mostly detects the smoothness, repetitiveness and texture of surfaces[1], and Speeded-Up Robust Features (SURF), which are more robust with regard to scaling and rotating objects.[2] Other, more general, studies into object recovery make use of radio-frequency tags attached to objects[3] and Spotlight, which "employs active RFID and ultrasonic position detection to detect the position of a lost object [and] illuminates the position".[4]
A relevant study has been conducted into the nature of losing objects and finding them. It addresses general questions such as how often people lose objects, what strategies are used to find them, the types of objects that are most frequently lost and why people lose objects.[5]
Also applicable to the project are studies into the needs and opinions of visually impaired people. A book has been written about assistive technologies for visually impaired people.[6] Multiple surveys were also conducted about the opinions of visually impaired people on research regarding visual impairment, both in general[7] and in the Netherlands specifically.[8]
Convolutional networks are at the core of most state-of-the-art computer vision solutions[9]. TensorFlow is a machine learning framework by Google; its image recognition tutorial uses a convolutional network model built and trained especially for image recognition[10].
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark for object category classification and detection[11]. Inception-v3, the latest and highest-quality model available in TensorFlow, reaches 21.2% top-1 and 5.6% top-5 error for single-crop evaluation on the ILSVRC 2012 classification task, which set a new state of the art[9].
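To make the top-1 and top-5 error figures concrete, the sketch below computes them for a handful of made-up score vectors. A prediction counts as correct under top-k when the true label is among the k highest-scoring classes. The scores and labels are invented for illustration and have nothing to do with ImageNet.

```python
from typing import Sequence


def top_k_error(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose true label is not among the k highest scores."""
    misses = 0
    for sample_scores, label in zip(scores, labels):
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Toy scores for 4 images over 6 classes (made-up numbers).
scores = [
    [0.70, 0.10, 0.05, 0.05, 0.05, 0.05],  # true class 0: ranked first
    [0.30, 0.40, 0.10, 0.10, 0.05, 0.05],  # true class 0: ranked second
    [0.25, 0.25, 0.20, 0.15, 0.10, 0.05],  # true class 5: ranked last
    [0.05, 0.05, 0.60, 0.10, 0.10, 0.10],  # true class 2: ranked first
]
labels = [0, 0, 5, 2]

print(top_k_error(scores, labels, k=1))  # 0.5
print(top_k_error(scores, labels, k=5))  # 0.25
```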
Much progress has been made in recent years with regard to object detection. Modern object detectors based on these networks, such as Faster R-CNN, R-FCN, Multibox, SSD and YOLO, are now good enough to be deployed in consumer products, and some have been shown to be fast enough to run on mobile devices[12]. Research comparing these architectures on running time, memory use and accuracy can be used to determine which implementation to use in a concrete application[12].
In order to keep track of where a certain object resides, an object tracking (also known as video tracking) system would need to be implemented. Research has compared different such systems using a large-scale benchmark, providing a fair comparison[13]. A master's thesis implementing object tracking in video using TensorFlow has also been published[14].
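As a rough illustration of what tracking adds on top of per-frame detection, the sketch below links bounding boxes between two frames by their overlap (intersection over union, IoU). This greedy matching is a simple baseline chosen for illustration, not the method of the cited benchmark or thesis; the threshold of 0.3 and all boxes are assumptions.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(tracks: Dict[int, Box], detections: List[Box],
              threshold: float = 0.3) -> Dict[int, Box]:
    """Greedily match each existing track to its best unused detection.

    Tracks without a good match are dropped here; a real system would
    keep their last position instead.
    """
    updated: Dict[int, Box] = {}
    unused = list(range(len(detections)))
    for track_id, box in tracks.items():
        best = max(unused, key=lambda i: iou(box, detections[i]), default=None)
        if best is not None and iou(box, detections[best]) >= threshold:
            updated[track_id] = detections[best]
            unused.remove(best)
    next_id = max(tracks, default=-1) + 1
    for i in unused:  # unmatched detections start new tracks
        updated[next_id] = detections[i]
        next_id += 1
    return updated


tracks = {0: (10, 10, 50, 50)}
detections = [(200, 200, 240, 240), (12, 11, 52, 51)]
result = associate(tracks, detections)
print(result)  # track 0 follows the nearby box, the far box becomes track 1
```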
Essentially we are designing a kind of "smart home" by expanding the house with this technology. However, several issues arise when talking about smart homes, security and privacy in particular. A study was done involving a smart home and the elderly[15], which used cameras as well. The results were rather positive. 1783 quotes were selected from interviews to determine the thoughts of the users. In total, 63% of the quotes were positive towards the system and 37% were negative, while only 12% of the quotes were negative about privacy and security. Thus most of the complaints were not aimed at privacy and security but at other concerns, such as the use of particular products in the study, which are irrelevant for this project.
Another study was done involving privacy concerns and the elderly[16], this time focused on the use of cameras. In this study, however, the elderly were not the users but commented on potential uses of the system in question. They were shown several families and the systems used by them, and were asked several questions regarding the privacy and benefits of these scenarios. Concerning privacy, the results were as follows: the higher functioning a user was, the higher the privacy concerns were rated. So the potential benefits outweigh the privacy concerns for a low-functioning user.
A similar study was conducted[17], which also asked people questions about a system that used cameras in their homes, this time not focusing on the elderly. The results state that people are very reluctant to place a monitoring system in their homes, except when it offers an actual health benefit, in which case almost everyone would prefer to have the system. People would also prefer that the installed cameras be invisible or unobtrusive. There were also many security concerns: people do not trust that their footage is safe and fear that it can be obtained by third parties.
To tackle the privacy concerns introduced by cameras, IBM has researched a camera that can filter out any identity-revealing features before the video is received[18]. This can be used in our camera-based system to preserve the privacy of all primary and secondary users who are seen by the camera.
For voice recognition many frameworks have been implemented. One such framework is discussed in the paper by Dawid Polap et al., which describes a way to represent audio as an image-type file so that convolutional neural networks can be used. This could be worthwhile to look into, as such a network already needs to be delivered to find objects using the cameras.
When it comes to speech recognition, the main framework used nowadays is the Hidden Markov Model (HMM) approach. A detailed description of how to implement it and what it is used for is presented by Lawrence R. Rabiner, which provides a good starting point for the implementation.[19]
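One of the basic HMM computations covered in such tutorials is the forward algorithm, which gives the probability of an observation sequence under a given model. The sketch below shows it for a toy model with two hidden states and two observation symbols; all probabilities are made-up numbers and do not come from any speech data.

```python
from typing import List


def forward(initial: List[float], transition: List[List[float]],
            emission: List[List[float]], observations: List[int]) -> float:
    """Return P(observations | model) using the forward algorithm."""
    n_states = len(initial)
    # alpha[s]: probability of the observations so far, ending in state s.
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[prev] * transition[prev][s] for prev in range(n_states))
            * emission[s][obs]
            for s in range(n_states)
        ]
    return sum(alpha)


# Toy two-state model with two observation symbols (made-up numbers).
initial = [0.6, 0.4]
transition = [[0.7, 0.3], [0.4, 0.6]]
emission = [[0.9, 0.1], [0.2, 0.8]]

print(forward(initial, transition, emission, [0, 1]))
```

In a speech recogniser, one such model would typically be trained per word or phoneme, and the model giving the highest probability for the observed audio features would be chosen.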
As an alternative to voice control, it is possible to let the system react to motion controls too. This would allow people who have no voice or have a hard time speaking because of disabilities to still use the system. This process is described by M. Sarkar et al.[20]
As mentioned above, some people have a disability when it comes to speaking. If they still want to use the voice-controlled system, some adjustments need to be made. The paper by Xiaojun Zhang et al. describes how Deep Neural Networks can be used to let these people use a voice recognition system.[21]
When tackling the environment, being the house or building the system will be implemented in, there are decisions to be made about the extent to which the space is mapped. A study similar to our project has discussed which rooms could be used and where the cameras could be placed.[22] The cameras could, for example, be static or attached to the user in some way.
Continuing on camera placement, studies concerning surveillance cameras have resulted in optimal placement according to the specification of the camera, making its use as efficient as possible so that costs are reduced.[23] They have also resulted in an algorithm that tries to maximize the performance of the cameras by considering the task they are meant for.[24]
With depth cameras there is also the possibility of generating a 3D model of the indoor environment, as described in a paper about RGB-D mapping. This can help with explaining to the user where the requested object is located.[25]
User Research
To investigate the interest in and concerns about our system, we decided to distribute a survey amongst various potential users.
We are mainly developing this system for people who either have dementia or are visually impaired, since they are among the people we estimate to have the highest probability of losing their personal belongings. For this purpose we ask the people in the survey, who are considered to be potential users, whether they suffer from such a disability.
We are also interested in the correlation between losing objects, age and the interest for this system.
Since we are expecting a lot of privacy concerns, we specifically ask the users in what way they would like to see these concerns be addressed.
We also ask in what rooms the users would want to have the cameras installed to see how many cameras would need to be installed on average.
We are also interested in the price that people would be willing to pay for the system. This is essential, as the system has to stand a chance in the current market.
Lastly we ask how important it is that the cameras are hidden, because people may not want a visible camera in their rooms.
The survey below has been filled in by 61 people and we will now discuss the results.
Google form: https://docs.google.com/forms/d/e/1FAIpQLScCbIxM10migwrO-rNiF07-iIabRcVXuj8jqcqDpZFWPJ392Q/viewform?usp=sf_link
Results
The survey was filled in by people from all over the world in the age ranges of 11-20 (31%) and 21-30 (64%), so those are the only ages we can make a proper statement about. There is no notable difference between the two age groups.
46% of the people were not interested in the system while 42% were interested (the other responses were neutral), which is close to an even split.
The correlation between interest in the system and how often the person loses something is highly noticeable, as presented in the pie charts below.
Note that some people were not interested in the system even though they lose objects occasionally; almost all of these people were not interested because of privacy and security concerns.
Although the sample is too small to be significant, the 2 visually impaired people who filled in the survey were both very interested in the system.
Some comments that were mentioned for being interested in the system:
- The system would help me retrieve the objects I occasionally lose.
- The system is an interesting gimmick to fool around with.
- The system would help me to keep track of valuable items.
- The system is a great idea for people with disabilities.
Some comments that were mentioned for being not interested in the system:
- I do not want cameras in my house.
- I (almost) never lose objects.
- The system might get annoying.
- The system is probably too expensive to install.
- The use cases are too limited.
- The house will use more electricity.
- Privacy/Security concerns (Discussed below)
As expected, we received several privacy and security concerns. This is understandable, as in recent years people have become more concerned about their privacy, since many companies use personal data for their own benefit. We will discuss several of the comments that were mentioned and explain how we intend to prevent misuse of the system.
- The information is stored longer than necessary.
We plan to save no footage at all and only rely on live video capturing which will be deleted instantly. The system will only save the information needed to tell the users where certain objects are, a database of objects.
- The data is shared with 3rd parties.
A contract can be signed to ensure we are legally not allowed to share any of the data.
- The system can be hacked as the internet is insecure.
The system will operate locally with a central computer which connects all the cameras via hardware, thus there will be no internet involved. However, for the prototype we use Google's existing framework for voice control, the "Google Assistant SDK".
- The system can be used by burglars.
The system can only be used by the registered users; a voice recognition system will take care of this. Furthermore, the physical system's data will be encrypted so it cannot be easily accessed.
- The system cannot be turned off.
In principle you can turn off the system, but take care as the system may "lose" some objects as they can be moved while the system is offline.
- I do not want cameras watching me or I would prefer another method of locating objects.
This is the only concern which cannot be addressed, as it concerns the user's own will. There are several other methods to locate objects, such as Bluetooth tracking devices, but these can actually be hacked, so in a sense this system could be safer.
For the people who were interested in the system, we asked in which rooms they would want a camera installed. This is represented in the pie chart below.
38 people indicated the rooms in which they would want a camera installed, for a total of 167 rooms. This means that a user would have roughly 4 cameras in their home on average (167 / 38 ≈ 4.4).
The prices that people were willing to pay for the system were on average (after filtering out all the different currencies) €230. This would mean €230 for an average of 4 cameras and a computer which can do live calculations, excluding installation costs.
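The arithmetic behind these survey figures can be checked with a few lines. The per-camera figure at the end is only an upper bound under the assumption that the central computer costs nothing, which is not realistic; it is included to show how tight the budget is.

```python
respondents = 38         # people who indicated rooms (from the survey)
rooms_selected = 167     # total number of rooms they selected
average_price_eur = 230  # average price respondents were willing to pay

# Assumption: one camera per selected room.
cameras_per_home = rooms_selected / respondents
print(round(cameras_per_home, 1))  # 4.4

# Upper bound on the budget per camera, ignoring the cost of the computer.
budget_per_camera_eur = average_price_eur / round(cameras_per_home)
print(budget_per_camera_eur)  # 57.5
```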
Lastly, we asked whether the cameras have to be unobtrusive. For 31% of the people this is not important, while 58% found it important (the rest were in between). This means that we have to make sure that the cameras are unobtrusive, as it is important for most of the people.
We received some comments and tips for the system:
- You should also implement a security camera feature to the system which can detect burglars and report it.
- You should call the cameras "sensors" to scare off fewer users.
Results from other sources
A survey about lost items was conducted by the company which created "Pixie", a tool to track items in your home using bluetooth[26]. The survey concluded that Americans spend 2.5 days a year looking for lost items, and that it takes 5 minutes on average to find a lost item, which resulted in being late for certain events. The most common lost items are TV remotes, keys, shoes, wallets, glasses and phones. The lost items which take on average longer than 15 minutes to find are keys, wallets, umbrellas, passports, driver's licences and credit cards.
Week 1
Meeting on 23-4-2018 and 24-4-2018
- We brainstormed about several ideas for our project and chose the best one, which is described at the top.
- We discussed how to realize this project and what the wiki should look like.
AP For 26-4-2018
Pieter
- Do research into the Users of the project
Marijn
- Do research into TensorFlow image recognition
Tom
- Do research into Environment mapping
Stijn
- Do research into voice recognition
- Fix wiki
Rowin
- Do research into privacy matter
Meeting on 26-4-2018
- Discussed sources.
- Specified the chosen subject, users, their requirements, goals, milestones and deliverables.
- Prepared the feedback meeting.
- Discussed interview questions
AP For 30-4-2018
Everyone
- Make a summary for state of the art.
Week 2
We implemented object detection in a static image, based on the TensorFlow Object Detection repository. The output can be seen in the image below.
We also implemented it successfully on a live video feed, using a tutorial on live object detection. The frame rate is still low and the delay quite large, both of which need to be fixed.
We also implemented DELF feature extraction, which detects features in images that can be used to match two images containing the exact same object. An example matching can be seen in the image below, where we used a pre-trained model. In this image, lines are drawn between the matched feature points of the two images.
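The matching step can be illustrated without the neural network part. The sketch below is not DELF itself: it assumes that each image has already been reduced to a list of descriptor vectors and pairs each descriptor with its nearest neighbour in the other image when the two are close enough. The two-dimensional descriptors and the distance limit are made up for illustration; real descriptors have many more dimensions, and a geometric verification step would normally follow.

```python
import math
from typing import List, Sequence, Tuple

Descriptor = Sequence[float]


def match_descriptors(a: List[Descriptor], b: List[Descriptor],
                      max_distance: float) -> List[Tuple[int, int]]:
    """Pair each descriptor in a with its nearest neighbour in b if close enough."""
    matches = []
    for i, desc_a in enumerate(a):
        distances = [math.dist(desc_a, desc_b) for desc_b in b]
        j = min(range(len(b)), key=distances.__getitem__)
        if distances[j] <= max_distance:
            matches.append((i, j))
    return matches


# Toy descriptors for two images (made-up numbers).
image_a = [(0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
image_b = [(1.1, 0.0), (0.0, 0.9)]

print(match_descriptors(image_a, image_b, max_distance=0.5))  # [(0, 1), (1, 0)]
```

Each returned pair corresponds to one of the lines drawn between matched feature points in such a visualisation.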
Meeting on 30-4-2018
- Weekly feedback meeting
- We subdivided tasks and worked on them as follows
- Get familiar with voice control (Tom and Stijn)
- Get familiar with object detection (Pieter and Marijn)
- Make and distribute a survey for user research (Rowin)
AP For 3-5-2018
Pieter and Marijn
- Investigate object detection
Tom and Stijn
- Investigate voice control
Rowin
- Analyze survey
Meeting on 3-5-2018
- Find more potential users and discuss results of the user research on the wiki (Rowin)
- Implementing DELF (Marijn and Pieter)
- Research state of the art of voice control to implement a framework (Stijn and Tom)
AP For 7-5-2018
Tom and Stijn
- Fix audio on Ubuntu
Rowin
- Find more survey participants
Marijn
- Implement object detection on live video feed
Pieter
- Explore DELF Point feature extraction and matching
Week 3
Meeting on 7-5-2018
- Weekly feedback meeting
- Completed user research (first version) (Rowin)
- Implementing new voice commands on framework (Tom and Stijn)
- Point feature detection implementation (Pieter and Marijn)
AP For 14-5-2018
Pieter and Marijn
- Able to register new objects
Tom and Stijn
- Implement COCO database
- Look into security
- Create new voice commands
Rowin
- Find study for common lost household items
- Research object location description
- Research for visual impairment uses
Week 4
Week 5
Week 6
Week 7
Week 8
Week 9
Results
Sources
To cite a new source: <ref name="reference name">reference link and description</ref>
To cite a previously cited source: <ref name="reference name" />
- [1] http://www.aaai.org/Papers/Symposia/Fall/1996/FS-96-05/FS96-05-007.pdf
- [2] Chincha, R., & Tian, Y. (2011). Finding objects for blind people based on SURF features. 2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW). doi:10.1109/bibmw.2011.6112423 ( https://ieeexplore.ieee.org/abstract/document/6112423/ )
- [4] Nakada, T., Kanai, H., & Kunifuji, S. (2005). A support system for finding lost objects using spotlight. Proceedings of the 7th International Conference on Human Computer Interaction with Mobile Devices & Services - MobileHCI 05. doi:10.1145/1085777.1085846 ( https://dl.acm.org/citation.cfm?id=1085846 )
- [5] Pak, R., Peters, R. E., Rogers, W. A., Abowd, G. D., & Fisk, A. D. (2004). Finding lost objects: Informing the design of ubiquitous computing services for the home. PsycEXTRA Dataset. doi:10.1037/e577282012-008
- [6] Hersh, M. A., & Johnson, M. A. (2008). Assistive technology for visually impaired and blind people. London: Springer-Verlag London Limited.
- [7] Duckett, P. S., & Pratt, R. (2001). The Researched Opinions on Research: Visually impaired people and visual impairment research. Disability & Society, 16(6), 815-835. doi:10.1080/09687590120083976 ( https://www.tandfonline.com/doi/abs/10.1080/09687590120083976 )
- [8] Schölvinck, A. M., Pittens, C. A., & Broerse, J. E. (2017). The research priorities of people with visual impairments in the Netherlands. Journal of Visual Impairment & Blindness, 237-261. Retrieved from https://files.eric.ed.gov/fulltext/EJ1142797.pdf.
- [9] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the Inception Architecture for Computer Vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/cvpr.2016.308
- [10] Image Recognition | TensorFlow. (n.d.). Retrieved April 26, 2018, from https://www.tensorflow.org/tutorials/image_recognition
- [11] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., . . . Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3), 211-252. doi:10.1007/s11263-015-0816-y
- [12] Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., . . . Murphy, K. (2017). Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/cvpr.2017.351
- [13] Leal-Taixé, L., Milan, A., Schindler, K., Cremers, D., Reid, I., & Roth, S. (2017). Tracking the Trackers: An Analysis of the State of the Art in Multiple Object Tracking. CoRR. Retrieved April 26, 2018, from http://arxiv.org/abs/1704.02781
- [14] Ferri, A. (2016). Object Tracking in Video with TensorFlow (Master's thesis, Universitat Politècnica de Catalunya, Spain, 2016). Barcelona: UPCommons. Retrieved April 26, 2018, from http://hdl.handle.net/2117/106410
- [15] Melenhorst, A.-S., Fisk, A. D., Mynatt, E. D., & Rogers, W. A. (2004). Potential Intrusiveness of Aware Home Technology: Perceptions of Older Adults. Proceedings of the Human Factors and Ergonomics Society 48th Annual Meeting.
- [16] Caine, K. E., Fisk, A. D., & Rogers, W. A. (2006). Benefits and Privacy Concerns of a Home Equipped with a Visual Sensing System: a Perspective from Older Adults. Proceedings of the Human Factors and Ergonomics Society 50th Annual Meeting.
- [17] Ziefle, M., Röcker, C., & Holzinger, A. (2011). Medical Technology in Smart Homes: Exploring the User's Perspective on Privacy, Intimacy and Trust. 2011 IEEE 35th Annual Computer Software and Applications Conference Workshops (COMPSACW).
- [18] Senior, A., Pankanti, S., Hampapur, A., Brown, L., Tian, Y.-L., & Ekin, A. (2003). Blinkering Surveillance: Enabling Video Privacy through Computer Vision. IBM Research Report RC22886 (W0308-109), August 28, 2003, Computer Science.
- [19] Rabiner, L. R. (1990). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Readings in Speech Recognition, 267-296. doi:10.1016/b978-0-08-051584-7.50027-9 ( https://ieeexplore.ieee.org/document/18626/ )
- [20] Sarkar, M., Haider, M. Z., Chowdhury, D., & Rabbi, G. (2016). An Android based human computer interactive system with motion recognition and voice command activation. 2016 5th International Conference on Informatics, Electronics and Vision (ICIEV). doi:10.1109/iciev.2016.7759990 ( https://ieeexplore.ieee.org/document/7759990/ )
- [21] Zhang, X., Tao, Z., Zhao, H., & Xu, T. (2017). Pathological voice recognition by deep neural network. 2017 4th International Conference on Systems and Informatics (ICSAI). doi:10.1109/icsai.2017.8248337 ( https://ieeexplore.ieee.org/document/8248337/ )
- [22] Yi, C., Flores, R. W., Chincha, R., & Tian, Y. (2013). Finding objects for assisting blind people. Network Modeling Analysis in Health Informatics and Bioinformatics, 2(2), 71-79. ( https://link.springer.com/article/10.1007/s13721-013-0026-x )
- [23] Yabuta, K., & Kitazawa, H. (2008, May). Optimum camera placement considering camera specification for security monitoring. In Circuits and Systems, 2008. ISCAS 2008. IEEE International Symposium on (pp. 2114-2117). IEEE. ( https://ieeexplore.ieee.org/abstract/document/4541867/ )
- [24] Bodor, R., Drenner, A., Schrater, P., & Papanikolopoulos, N. (2007). Optimal camera placement for automated surveillance tasks. Journal of Intelligent and Robotic Systems, 50(3), 257-295. ( https://link.springer.com/article/10.1007%2Fs10846-007-9164-7 )
- [25] Henry, P., Krainin, M., Herbst, E., Ren, X., & Fox, D. (2010). RGB-D mapping: Using depth cameras for dense 3D modeling of indoor environments. In the 12th International Symposium on Experimental Robotics (ISER). ( http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.91 )
- [26] https://getpixie.com/blogs/news/lostfoundsurvey