PRE2017 4 Groep3

From Control Systems Technology Group

Group members

  • Stijn Beukers
  • Marijn v.d. Horst
  • Rowin Versteeg
  • Pieter Voors
  • Tom v.d. Velden

Brainstorm

We have discussed several ideas that we may want to implement.

  • An AI and GUI for a board game, in which you can play against different AIs, possibly with an integrated multiplayer environment. The GUI could also give tips to the users.
  • A filter for notifications on your smartphone to not get distracted by non-urgent notifications while still being available for urgent notifications.
  • A simple way to connect multiple interfaces like doorbells, music, notifications or your alarm to the lights in your house.
  • An artificial intelligence that automatically switches between camera angles in live broadcasts.
  • A program that stitches together recorded videos like vlogs automatically.
  • A program that makes music compilations in which songs flow together naturally, the way DJs mix tracks as if they were one long song, rather than fading out one song and starting the next.
  • A system of cameras in homes for blind people that keeps track of where they have left certain items, so that they can ask the system where an object is when they lose it.
  • A model of a robot which learns to walk/pick up objects using machine learning.
  • A system that sorts music based on its genre.

Chosen Subject

For centuries, people have known that they are not perfect and never will be. To get ever closer to perfection, we have created many tools to bridge the gap between our weaknesses and the perfection and satisfaction we desire. Although many problems have been tackled and the quality of human life has greatly improved, we still lose the items that provide such comfort, for example phones, tablets or laptops. Even at home, a TV remote is often lost. We propose a solution to the problem of losing items within the confines of a building: applying Artificial Intelligence (AI) to find items using live video footage. We chose this approach because image classification has proven to be effective at classifying and detecting objects in images. For convenience, the system will accept voice commands and, upon finding the requested item, will tell the user where it is via a speaker.

Users

Although many people may benefit from the proposed system, some would benefit more than others. A prime example is people who are visually impaired or blind. They can have a hard time finding an item because they may not be able to recognize it or see it at all. The system would give them a sense of ease, as they would no longer have to keep track of where their items are all the time. Secondly, people with a form of dementia would greatly benefit from this system, as they would not have to worry about forgetting where they left their belongings. The elderly in general are also good candidate users, since they tend to become more forgetful with age and are also the group most affected by the aforementioned conditions. Additionally, smart home enthusiasts could be interested in this system as a new type of smart device. Moreover, people with large mansions could be interested, as an item is easily lost in a mansion. Lastly, companies could be interested in investing in this software. By implementing the system, companies could keep track of their staff's belongings and help find important documents that may be lost on someone's desk.

User Requirements

For this system to work, we need to fulfill several requirements of the users.

  • The system should be able to inform the user where specific items are on command.
  • The system should be available at all times.
  • The system should understand voice commands and state the location of an object in an understandable manner.
  • The system should only respond to the main user for security purposes.
  • The system should respect the privacy concerns of the user.
  • The system should be secure.

Goals

The goals of group 3 are as follows:

  • Do research into the state of the art of AI picture recognition
  • Find and interview relevant users
  • Build an AI that can effectively classify certain objects based on pictures
  • Determine which kind of cameras to use and where to place the cameras (in which rooms and placement)
  • Expand AI capabilities by having it classify objects correctly within video footage
  • Have the AI classify objects within live camera footage
  • Have the AI determine the location of an object on command and tell the user
  • Have the AI remember objects which are out of sight

Planning

Milestones

Object detection

  • Passive object detection (Detecting all objects or specific objects in video/image)
  • Live video feed detection (Using a camera)
  • Input: find specific item (Input: e.g. item name. Output: e.g. camera & location)
  • Location classification (What is camera2 pixel [100,305] called?)
  • Keeping track of where an item was last seen.
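
The location classification and last-seen milestones above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's implementation: the camera identifier, the region names and the rectangle coordinates are hypothetical, and each camera view is assumed to have been divided by hand into named, axis-aligned rectangles.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A named rectangle in one camera's image (hypothetical layout)."""
    name: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def contains(self, x, y):
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


# Hypothetical mapping from camera id to the labelled regions of its view.
REGIONS = {
    "camera2": [
        Region("living room table", 0, 200, 400, 480),
        Region("couch", 401, 150, 640, 480),
    ],
}


def classify_location(camera, x, y):
    """Return the region name for a pixel, or a generic fallback."""
    for region in REGIONS.get(camera, []):
        if region.contains(x, y):
            return region.name
    return "somewhere in view of " + camera


class LastSeen:
    """Remembers where each item was most recently detected."""

    def __init__(self):
        self._seen = {}

    def update(self, item, camera, x, y, timestamp=None):
        when = time.time() if timestamp is None else timestamp
        self._seen[item] = (classify_location(camera, x, y), when)

    def find(self, item):
        """Return (location, timestamp), or None if the item was never seen."""
        return self._seen.get(item)


tracker = LastSeen()
tracker.update("bottle", "camera2", 100, 305, timestamp=0.0)
print(tracker.find("bottle"))
```

Because the tracker stores the last detection rather than the current one, it can still answer for objects that have since moved out of sight.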

Voice interface

  • Define interface (which data is needed as input and output in communication between voice interface and object detection system)
  • Pure data input coupled to the system, which then gives output (e.g. send “find bottle” and check that “living room table” is returned as data, without a user interface for now)
  • Voice parameter input (User Interface to have text input)
  • Text to speech output (Output the result over TTS)
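
The pure data coupling described above (send “find bottle”, receive “living room table”) might look like the sketch below. The command grammar, the function names and the reply wording are assumptions made for illustration; the `lookup` callable is a hypothetical stand-in for the object detection system.

```python
def parse_command(text):
    """Extract the item name from a 'find <item>' command, else None."""
    words = text.strip().lower().split()
    if len(words) >= 2 and words[0] == "find":
        return " ".join(words[1:])
    return None


def handle_command(text, lookup):
    """Turn a text command into a reply string.

    `lookup` is any callable mapping an item name to a location name
    or None; it stands in for the object detection system.
    """
    item = parse_command(text)
    if item is None:
        return "Sorry, I did not understand that."
    location = lookup(item)
    if location is None:
        return "I have not seen the {}.".format(item)
    return "The {} was last seen at the {}.".format(item, location)


# Hypothetical stand-in for the detection back end.
known_locations = {"bottle": "living room table"}
print(handle_command("find bottle", known_locations.get))
```

Keeping the interface text-in, text-out means speech recognition and text to speech can later be attached on either side without changing this layer.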

Research

  • Check whether users actually like the system in question.
  • Check whether losing items is an actual problem for visually impaired people.
  • Check which locations in the building are most useful for users.
  • Research privacy concerns regarding cameras in a home.
  • Analyse the expected cost.

Deliverables

Prototype

  • Create an object recognition setup with a camera.
  • Create a voice recognition system that understands certain commands.
  • Create a system that can locate objects that are asked for on a live camera feed.
  • Create a system that can explain where a found object is located.
  • Create a prototype that works according to the requirements.

Planning

Week Milestones Task division
Week 1 (23-04)
  • Define problem statement
  • Define goals, milestones, users and planning
  • Research state of the art
Week 2 (30-04)
  • Passive object detection
  • Live video feed object detection
  • Data interface
Week 3 (07-05)
  • Voice parameter input
Week 4 (14-05)
  • Object tracking
  • Hold interviews
Week 5 (21-05)
  • Output object location on demand based on input
  • Data communication
Week 6 (28-05)
  • Location classification
  • Text to speech output
Week 7 (04-06)
  • Check privacy and security measures
Week 8 (11-06)
  • Tests
  • Prototype
  • Wiki
Week 9 (18-06)
  • Final presentation

State of the art

Convolutional networks are at the core of most state-of-the-art computer vision solutions[1]. TensorFlow is an open-source machine learning framework by Google; its image recognition tutorial uses a convolutional network model built and trained especially for image recognition[2].

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark for object category classification and detection[3]. TensorFlow’s latest and highest quality model at the time of writing, Inception-v3, reaches 21.2% top-1 and 5.6% top-5 error for single crop evaluation on the ILSVRC 2012 classification task, which set a new state of the art when it was published[1].
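
To make the top-1 and top-5 error figures concrete: a prediction counts as a top-k hit when the true label is among the k classes with the highest scores. The sketch below computes this on made-up toy scores (three classes, so top-2 is used instead of top-5); the data and the function name are illustrative assumptions, not taken from the cited papers.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the k best-scored classes."""
    misses = 0
    for class_scores, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(class_scores)),
                        key=lambda c: class_scores[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Toy data: four examples, three classes (made-up scores).
scores = [
    [0.7, 0.2, 0.1],  # true class 0: top-1 hit
    [0.1, 0.3, 0.6],  # true class 1: top-1 miss, top-2 hit
    [0.5, 0.4, 0.1],  # true class 2: miss for both k=1 and k=2
    [0.2, 0.5, 0.3],  # true class 1: top-1 hit
]
labels = [0, 1, 2, 1]

print(top_k_error(scores, labels, 1))  # 0.5
print(top_k_error(scores, labels, 2))  # 0.25
```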

A lot of progress has been made in recent years with regard to object detection. Modern object detectors based on these networks (such as Faster R-CNN, R-FCN, Multibox, SSD and YOLO) are now good enough to be deployed in consumer products, and some have been shown to be fast enough to run on mobile devices[4]. The same research compares these architectures on running time, memory use and accuracy, which can help determine which implementation to use in a concrete application[4].

In order to keep track of where a certain object resides, an object tracking (also known as video tracking) system would need to be implemented. Different such systems have been compared using a large-scale benchmark, providing a fair comparison[5]. A master's thesis implementing object tracking in video using TensorFlow has also been published[6].

Essentially, we are designing a kind of "smart home" by extending the house with this technology. However, several issues arise when talking about smart homes, security and privacy in particular. A study was done involving a smart home and the elderly[7], which used cameras as well. The results were rather positive. From the interviews, 1783 quotes were selected to determine the thoughts of the users. 12% of the quotes were negative about the privacy and security of the users. In total, 63% of the quotes were positive about the system and 37% were negative. Most of the complaints were therefore not aimed at privacy and security but at other concerns, such as the use of certain products in the study, which are irrelevant for this project.

Another study involving privacy concerns and the elderly[8] focused on the use of cameras. In this study, however, the elderly were not users of the system; instead they commented on potential uses of it. They were shown several families and the systems those families used, and were asked several questions about the privacy and benefits of these scenarios. Concerning privacy, the results were as follows: the higher functioning a user was, the higher the privacy concerns were rated. For a low-functioning user, the potential benefits therefore outweigh the privacy concerns.

A similar study[9] also asked people about a system that used cameras in their homes, this time without focusing on the elderly. The results state that people are very reluctant to place a monitoring system in their homes, except when it provides an actual health benefit; in that case almost everyone would prefer to have the system. People would also prefer the installed cameras to be invisible or unobtrusive. There were also many security concerns: people do not trust that their footage is safe and fear that it can be obtained by third parties.

To tackle the privacy concerns introduced by cameras, IBM has researched a camera that filters out any identity-revealing features before the video is received[10]. This could be used in our camera-based system to preserve the privacy of all primary and secondary users who are seen by the camera.

For voice recognition, many frameworks have been implemented. One such framework is discussed in the paper by Dawid Polap et al., which describes a way to represent audio as an image so that convolutional neural networks can be used. This could be worthwhile to look into, as such a framework already needs to be delivered to find objects using the cameras.

When it comes to speech recognition, a widely used framework is the Hidden Markov Model (HMM) approach. A detailed description of how to implement it and what it is used for is presented by Lawrence R. Rabiner, which provides a good starting point for the implementation[11].
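
As a small illustration of the HMM approach, the sketch below implements the forward procedure, which computes the probability of an observation sequence given a model. It is a toy example: the two states, the two observation symbols and all probabilities are made up, and a real speech recognizer would use acoustic features and far larger models.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Return P(observations | model) using the forward procedure."""
    # Initialisation: probability of starting in each state and
    # emitting the first observation.
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    # Induction: extend the partial sequence one observation at a time.
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    # Termination: sum over all possible final states.
    return sum(alpha.values())


# Made-up toy model with two hidden states and two observation symbols.
states = ["A", "B"]
start_p = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {"A": {"x": 0.5, "y": 0.5}, "B": {"x": 0.1, "y": 0.9}}

probability = forward(["x", "y"], states, start_p, trans_p, emit_p)
print(round(probability, 4))  # 0.2156
```

In a recognizer, one such model could be kept per word or command, and the model that assigns the highest probability to the incoming observation sequence would be chosen.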

As an alternative to voice control, the system could also react to motion controls. This would allow people who have no voice or have difficulty speaking because of a disability to still use the system. This process is described by M. Sarkar et al.[12]

As mentioned above, some people have a disability that affects their speech. For them to use the voice-controlled system, some adjustments need to be made. The paper by Xiaojun Zhang et al. describes how Deep Neural Networks can be used to let these people use a voice recognition system[13].

For the environment, that is, the house or building in which the system will be implemented, decisions have to be made about how extensively to map the space. A study similar to our project discusses which rooms could be used and where the cameras could be placed[14]. The cameras could, for example, be static or attached in some way to the person using the system.

Interviews

Possible interview questions:

  • Do you often lose items in your house?
  • Would you be willing to install this system in your home if it were free of charge?
  • If the answer is yes: How much would you be willing to pay for it? (if it works correctly)
  • If the answer is no because of privacy: ask again after explaining that people are blurred on camera, that the system is not connected to the internet, or that it works like a security camera.
  • How many cameras would you be willing to install?
  • In which rooms would you like these cameras to be?
  • How important should it be that the cameras are unobtrusive?

Week 1


Meeting on 23-4-2018 and 24-4-2018

  • We brainstormed about several ideas for our project and chose the best one, which is described at the top.
  • We discussed how to realize this project and what the wiki should look like.

AP For 26-4-2018

Pieter

  • Do research into the Users of the project

Marijn

  • Do research into TensorFlow image recognition

Tom

  • Do research into Environment mapping

Stijn

  • Do research into voice recognition
  • Fix wiki

Rowin

  • Do research into privacy matters

Meeting on 26-4-2018

  • Discussed sources.
  • Specified the chosen subject, users, their requirements, goals, milestones and deliverables.
  • Prepared the feedback meeting.
  • Discussed interview questions

AP For 30-4-2018

Everyone

  • Make a summary for state of the art.

Week 2

Week 3

Week 4

Week 5

Week 6

Week 7

Week 8

Week 9

Results

Sources

To cite a new source: <ref name="reference name">reference link and description</ref>
To cite a previously cited source: <ref name="reference name" />


  1. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the Inception Architecture for Computer Vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/cvpr.2016.308
  2. Image Recognition | TensorFlow. (n.d.). Retrieved April 26, 2018, from https://www.tensorflow.org/tutorials/image_recognition
  3. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., . . . Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3), 211-252. doi:10.1007/s11263-015-0816-y
  4. Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., . . . Murphy, K. (2017). Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/cvpr.2017.351
  5. Leal-Taixé, L., Milan, A., Schindler, K., Cremers, D., Reid, I., & Roth, S. (2017). Tracking the Trackers: An Analysis of the State of the Art in Multiple Object Tracking. CoRR. Retrieved April 26, 2018, from http://arxiv.org/abs/1704.02781
  6. Ferri, A. (2016). Object Tracking in Video with TensorFlow (Master's thesis, Universitat Politècnica de Catalunya, Spain, 2016). Barcelona: UPCommons. Retrieved April 26, 2018, from http://hdl.handle.net/2117/106410
  7. Melenhorst, A.-S., Fisk, A. D., Mynatt, E. D., & Rogers, W. A. (2004). Potential Intrusiveness of Aware Home Technology: Perceptions of Older Adults. Proceedings of the Human Factors and Ergonomics Society 48th Annual Meeting.
  8. Caine, K. E., Fisk, A. D., & Rogers, W. A. (2006). Benefits and Privacy Concerns of a Home Equipped with a Visual Sensing System: a Perspective from Older Adults. Proceedings of the Human Factors and Ergonomics Society 50th Annual Meeting.
  9. Ziefle, M., Röcker, C., & Holzinger, A. (2011). Medical Technology in Smart Homes: Exploring the User's Perspective on Privacy, Intimacy and Trust. 2011 IEEE 35th Annual Computer Software and Applications Conference Workshops (COMPSACW).
  10. Senior, A., Pankanti, S., Hampapur, A., Brown, L., Tian, Y.-L., & Ekin, A. (2003). Blinkering Surveillance: Enabling Video Privacy through Computer Vision. IBM Research Report RC22886 (W0308-109), August 28, 2003, Computer Science.
  11. Rabiner, L. R. (1990). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Readings in Speech Recognition, 267-296. doi:10.1016/b978-0-08-051584-7.50027-9 ( https://ieeexplore.ieee.org/document/18626/ )
  12. Sarkar, M., Haider, M. Z., Chowdhury, D., & Rabbi, G. (2016). An Android based human computer interactive system with motion recognition and voice command activation. 2016 5th International Conference on Informatics, Electronics and Vision (ICIEV). doi:10.1109/iciev.2016.7759990 ( https://ieeexplore.ieee.org/document/7759990/ )
  13. Zhang, X., Tao, Z., Zhao, H., & Xu, T. (2017). Pathological voice recognition by deep neural network. 2017 4th International Conference on Systems and Informatics (ICSAI). doi:10.1109/icsai.2017.8248337 ( https://ieeexplore.ieee.org/document/8248337/ )
  14. Yi, C., Flores, R. W., Chincha, R., & Tian, Y. (2013). Finding objects for assisting blind people. Network Modeling Analysis in Health Informatics and Bioinformatics, 2(2), 71-79. ( https://link.springer.com/article/10.1007/s13721-013-0026-x )