PRE2017 4 Groep3
Group members
- Stijn Beukers
- Marijn v.d. Horst
- Rowin Versteeg
- Pieter Voors
- Tom v.d. Velden
Brainstorm
We have discussed several ideas that we may want to implement.
- An AI and GUI for a board game, in which you can play with different AIs and maybe integrate a multiplayer environment the GUI could also give tips to the users.
- A filter for notifications on your smartphone to not get distracted by non-urgent notifications while still being available for urgent notifications.
- A simple way to connect multiple interfaces like doorbells, music, notifications or your alarm to the lights in your house.
- An artificial intelligence that automatically switches between camera angles in live broadcasts.
- A program that stitches together recorded videos like vlogs automatically.
- A program that makes music compilations where music flows together naturally the way DJs mix together music as if it is one big song rather than fading in one song and starting the next.
- A system of cameras in homes for blind people that keeps track of where they have left certain items such that they can ask the system where they left it when they lose an object.
- A model of a robot which learns to walk/pick up objects using machine learning.
- A system that sorts music based on its genre.
Chosen Subject
For centuries our species has known that they are not perfect and shall never attain perfection. To get ever closer to perfection we have created many tools to bridge the gap between our weaknesses and the perfection and satisfaction we so very much desire. Though many problems have been tackled and human life has greatly improved in quality, we are still capable of losing the items that could provide such comfort. Such items could, for example, be phones, tablets or laptops. Even at home a TV remote is often lost. We propose a solution to the problem of losing items within the confinements of a certain building. The solution we propose is to apply Artificial Intelligence (AI) as to find items using live video footage. This is chosen as image classification has been proven to be very efficient and effective at classifying and detecting objects in images. For convenience sake this system will be provided with voice command abilities and upon finding the requested items, the system will return where the item is. This will be done via a speaker telling the user where the requested item is.
Users
Though many people may benefit from the proposed systems, there are some people that would more so benefit from the system than others. A prime example would be people that are visually impaired or people who are blind. These people could have a hard time finding some item as they may not be able to recognize it themselves or they may not be able to see it at all. The system would provide them with a sense of ease as they would no longer have to manage where their items are all the time. Secondly, people that have a kind of dementia would greatly benefit from this system as they don't have to worry about forgetting where they left their belongings due to their deficiency. The elderly in general is also a good user for the proposed system. This is due to the fact that the elderly tend to be forgetful as their body is no longer in the prime of their life. In addition, they are also the people that also suffer the most from the aforementioned deficiencies. Additionally, smart home enthusiasts could be interested in this system is a new type of smart device. Moreover, people with large mansions could be interested in this system, as within a mansion an item is easily lost. Lastly, companies could be interested in investing in this software. Companies would by implementing the system be able to keep track of their staff's belongings and help find important documents that may be lost on someone's desk.
User Requirements
For this system to work, we need to fulfill separate requirements of the users.
- The system should be able to inform the user where specific items are on command.
- The system should be available at all times.
- The system should only respond to the main user for security purposes.
- The system should take the privacy concerns of the user into respect.
- The system should be secure.
Goals
The goals of group 3 are as follows:
- Do research into the state of the art of AI picture recognition
- Find and interview relevant users
- Build an AI that can effectively classify certain objects based on pictures
- Determine which kind of cameras to use and where to place the cameras (in which rooms and placement)
- Expand AI capabilities by having it classify objects correctly within video footage
- Have the AI classify objects within live camera footage
- Have the AI determine the location of an object on command and communicate it to the user
- Have the AI remember objects which are out of sight
- Find out the best way to communicate the location to the user
Task division
- Voice Recognition
- Stijn Beukers
- Tom v.d. Velden
- User Research
- Rowin Versteeg
- Image Detection
- Pieter Voors
- Marijn v.d. Horst
Planning
Milestones
Object detection
- Passive object detection (Detecting all objects or specific objects in video/image)
- Live video feed detection (Useing a camera)
- Input: find specific item (Input: e.g. item name. Output: e.g. camera & location)
- Location classification (What is camera2 pixel [100,305] called?)
- Keeping track of where item is last seen.
Interface
- Define interface (which data is needed as input and output in communication between interface and object detection system)
- Pure data input coupling with system that then gives output (e.g. send “find bottle” to make sure it receives “living room table” as data, without user interface for now)
Research
- Check whether users actually like the system in question.
- Check whether which locations in building are most useful for users.
- Research privacy concerns regarding cameras in a home.
- Analyse the expected cost.
- Research the best way to communicate the location to the user
Deliverables
Prototype
- Create an object recognition setup with a camera.
- Create an interface that can process certain commands.
- Create a system that can locate objects that are asked for on a live camera feed.
- Create an interface that can explain where a found object is located.
- Create a prototype that works according to the requirements.
Planning
Week | Milestones | |
---|---|---|
Week 1 (23-04) |
|
|
Week 2 (30-04) |
|
|
Week 3 (07-05) |
|
|
Week 4 (14-05) |
|
|
Week 5 (21-05) |
|
|
Week 6 (28-05) |
|
|
Week 7 (04-06) |
|
|
Week 8 (11-06) |
|
|
Week 9 (18-06) |
|
State of the art
A number of researches have already been done into the field of finding objects using artificial intelligence in different ways. Among researches specifically aimed at finding objects for visually impaired people are systems that make use of FM Sonar systems that mostly detect the smoothness, repetitiveness and texture of surfaces[1] and Speed-Up Robust Features that are more robust with regards to scaling and rotating objects.[2] Other, more general, researches into object recovery also make use of radio-frequency tags attached to objects[3] and Spotlight, which "employs active RFID and ultrasonic position detection to detect the position of a lost object [and] illuminates the position".[4]
A relevant study has been conducted in the nature of losing objects and finding them. It addresses general questions such as how often people lose objects, what strategies are used to find them, the types of objects that are most frequently lost and why people lose objects.[5]
Applicable to the project is also researches that have been done into the needs and opinions of visually impaired people. A book has been written about assistive technologies for visually impaired people.[6] Multiple surveys were also conducted about the opinions of visually impaired people on the research regarding visually impairment, both in general[7] and in the Netherlands specifically.[8]
Convolutional networks are at the core of most state-of-the-art computer vision solutions[9]. TensorFlow is a project by Google which uses a convolutional network model built and trained especially for image recognition[10].
ImageNet Large Scale Visual Recognition Competition (ILSVRC) is a benchmark for object category classification and detection[11]. TensorFlow’s latest and highest quality model, Inception-v3, reaches 21.2%, top-1 and 5.6% top-5 error for single crop evaluation on the ILSVR 2012 classification, which has set a new state-of-the-art[9].
A lot of progress has been made in recent years with regards to object detection. Modern object detectors based on these networks — such as Faster R-CNN, R-FCN, Multibox, SSD and YOLO — are now good enough to be deployed in consumer products and some have been shown to be fast enough to be run on mobile devices[12]. Research has been done comparing these different architectures on running time, memory use and accuracy, which can be used to determine which implementation to use in a concrete application: [12].
In order to keep track of where a certain object resides, an object tracking (also known as video tracking) system would need to be implemented. Research has been done comparing different such systems using a large scale benchmark, providing a fair comparison: [13]. A master thesis implementing object tracking in video using TensorFlow has been published: [14].
Essentially we are designing some kind of "smart home" by expanding the house with this kind of technology. However several issues arise when talking about smart homes; security and privacy in particular. A study was done involving a smart home and the elderly[15], which used cameras as well. The results were rather positive, 1783 quotes were selected from interviews to determine the thoughts of the users. 12% of the quotes were negatively directed to the privacy and security of the users. In total 63% of the quotes were positively directed towards the system, with 37% being negatively directed, thus most of the complaints were not aimed at the privacy and security issues but at other concerns, like the use of the certain products of the study, which are irrelevant for this project.
Another study was done involving privacy concerns and the elderly[16], this time focused on the use of cameras. However in this study, the elderly were not the users but they commented on potential uses of the system in question. They were shown several families and the systems which were used by them, and the participants were asked several questions regarding the privacy and benefits of these scenarios. Concerning privacy the results were as follows, the more functioning a user was, the higher privacy concerns were rated. So potential benefits outweigh the privacy concerns when talking about a low functioning user.
A study similar to the last was conducted[17], which also asked people questions about a system which used cameras in their homes, not focusing on the elderly. The results state that people are very reluctant to place a monitoring system in their homes, except when it is an actual benefit for your health, in that case almost everyone would prefer to have this system. People would also prefer that the cameras installed would be invisible or unobtrusive. There were also a lot of security concerns, people do not trust that their footage is safe and that it can be obtained by third parties.
To tackle privacy concerns which are introduced by cameras, IBM has researched a camera that can be used to filter out any identity revealing features before receiving the video file[18]. This can be used in our camera based system to preserve the privacy of all primairy and secondary users who are seen by the camera.
For voice recognition many frameworks have been implemented. One such a framework is discussed by the paper of Dawid Polap et al.Cite error: Invalid <ref>
tag; invalid names, e.g. too many where they discuss a way to use audio as an image type file which then allows convolutional neural networks to be used. This could be worth while to look into as such a framework already needs to be delivered to find objects using the cameras.
When it comes to speech recognition the main framework used nowadays is the Hidden Markov Models (HMM) approach. A detailed description of how to implement it and where it is used for is presented by Lawrence R. Rabiner which provides a good starting point for the implementation.* [19]
As an alternative to voice control it is possible to allow the system to react to motion controls to. This would allow people who have no voice or have a hard time speaking because of disabilities to still use the system. This process is described by M. Sarkar et al. [20]
As mentioned above, there are those people who have a disability when it comes to speaking. If those people still want to use the voice controlled system some adjustments need to be made as to allow them to still use it. This is described in the paper by Xiaojun Zhang et al. where they describe how to use Deep Neural Networks to still allow these people to use a voice recognition system. [21]
When tackling the environment, being the house or building the system will be implemented in, there are decisions to be made about the extent of mapping the space. A similar study compared to our project has been done where there has been discussion about what rooms could be used and where the camera placement could be.[22] The camera's could for example be static or attached to the user in some way.
Continuing about the camera placement, studies concerning surveillance camera's have resulted in optimal placement according to specification of the camera to make the use as efficient as possible so that costs are reduced.[23] They have also resulted in an algorithm that tries to maximize the performance of the camera's by considering the task it is meant for.[24]
By the use of depth camera's there is also the possibility of generating a 3D model of the indoor environment, described in a paper about RGB-D mapping. This can help with explaining to the user where the object asked for is located.[25]
User Research
To investigate the interest and concerns of our system we decided to distribute a survey amongst various potential users.
We are mainly developing this system for people which have either dementia or are visually impaired, since they are among the people we estimate to have the highest probability of losing their personal belonings. For this purpose we ask people in the survey, whom are considered to be potential users, whether they suffer from such a disability.
We are also interested in the correlation between losing objects, age and the interest for this system.
Since we are expecting a lot of privacy concerns, we specifically ask the users in what way they would like to see these concerns be addressed.
We also ask in what rooms the users would want to have the cameras installed to see how many cameras would need to be installed on average.
We are also interested in the price that people would want to pay for the system. This is essential as the system has to be able to have a chance at the current market.
Lastly we ask how important it is that the cameras are hidden, because people may not want a visible camera in their rooms.
The survey below has been filled in by 61 people and we will now discuss the results.
Google form: https://docs.google.com/forms/d/e/1FAIpQLScCbIxM10migwrO-rNiF07-iIabRcVXuj8jqcqDpZFWPJ392Q/viewform?usp=sf_link
Results
The survey was filled in by people from all over the world in the age ranges of 11-20 (31%) and 21-30 (64%), so those are the only ages we can make a proper statement about. There is no notable difference between the two age groups.
46% of the people were not interested in the system while 42% was interested (the other responses were neither), this seems like an even split.
The correlation between interest in the system and how often the person loses something is highly noticable, as presented in the pie charts below.
Do notice that some people were not interested in the system even though they were losing objects rather occasionally, almost all of these people were not interested because of privacy and security concerns.
However not significant, there were 2 visually impaired people which filled in the survey and they were both very interested in the system.
Some comments that were mentioned for being interested in the system:
- The system would help me retrieve the objects I occasionally lose.
- The system is an interesting gimmick to fool around with.
- THe system would help me to keep track of valuable items.
- The system is a great idea for people with disabilities.
Some comments that were mentioned for being not interested in the system:
- I do not want cameras in my house.
- I (almost) never lose objects.
- The system might get annoying.
- The system is probably too expensive to install.
- The use cases are too limited.
- The house will use more electricity.
- Privacy/Security concerns (Discussed below)
We received several privacy and security concerns from people as expected. This is understandable as in recent years people are more concerned about their privacy, as many companies use their privacy for their own benefit. We will discuss several of the comments which were mentioned, and add reasoning to prevent any misuse of the system.
- The information is stored longer than necessary.
We plan to save no footage at all and only rely on live video capturing which will be deleted instantly. The system will only save the information needed to tell the users where certain objects are, a database of objects.
- The data is shared with 3rd parties.
A contract can be signed to ensure we are legally not allowed to share any of the data.
- The system can be hacked as the internet is insecure.
The system will operate locally with a central computer which connects all the cameras via hardware, thus there will be no internet involved. However for the prototype we use an existing framework for voice control of Google called "Google Assistant SDK".
- The system can be used by burglars.
The system can only be used by the registered users, a voice recognition system will take care of this. Furthermore the physical system's data will be encrypted so it can not be easily accessed.
- The system cannot be turned off.
In principle you can turn off the system, but take care as the system may "lose" some objects as they can be moved while the system is offline.
- I do not want cameras watching me or I would prefer another method of locating objects.
This is the only concern which cannot be addressed, as this concerns the user's own will. There are several other methods to locate objects, which involve bluetooth tracking devices, which actually can be hacked, so in a sense this system could be safer.
For the people who were interested in the system, we asked what rooms they would want a camera installed, this is represented in the pie chart below.
38 people indicated the rooms they would want a camera installed in, in a total of 167 rooms. This means that every user would have 4 cameras in their home on average.
The prices that people were willing to pay for the system were on average (after filtering out all the different currencies) €230. This would mean €230 for an average of 4 cameras and a computer which can do live calculations, excluding installation costs.
Lastly we asked whether the cameras have to be unobtrusive, for 31% of the people it is not important while 58% found it important (the rest in between). This means that we have to make sure that the cameras are unobtrusive, as it is important for most of the people.
We received some comments and tips for the system:
- You should also implement a security camera feature to the system which can detect burglars and report it.
- You should call the cameras "sensors" to scare off less users.
Results from other sources
A survey about lost items was conducted by the company which created "Pixie", a tool to track items in your home using bluetooth[26]. The survey concluded that Americans spend 2.5 days a year looking for lost items, and that it takes 5 minutes on average to find a lost item, which resulted in being late for certain events. The most common lost items are TV remotes, keys, shoes, wallets, glasses and phones. The lost items which take on average longer than 15 minutes to find are keys, wallets, umbrellas, passports, driver's licences and credit cards.
Location description research
One of the major difficulties of this system is describing where an object is located, such that a potential user can understand and locate the object that is to be found. To know the best way to do this however we want to consider several options and compare them with each other. For this we assume that we have an interface in which you can indicate what object you are looking for. We have considered multiple scenarios. In order to visualize these scenarios mentioned below we will use picture shown on the right as an example of what a room where the Object Locating System is utilized could look like. In this picture the red square is an indication of where a requested item was lost. For purpose of illustration we assume the lost/requested item to be a TV remote.
Scenario 1
One of the most convenient ways to show the user the location of an item is by showing them some footage of the moment when the object was last seen. Considering the example scenario this would be the video feed that is shown to the user. This allows the user to quickly identify where the requested object is located rather easily as the user can see exactly where the item was lost, making it very accurate and having little chance of confusion for the user because the interaction with the system doesn't require difficult communication. However, showing such a video feed for an item becomes quite taxing on the hardware making this option rather expensive. There is also no guarantee that the user won't obstruct the object in the live feed, making it hard for the system to show where the item is. Besides this, there isn't necessarily an advantage of having a live feed instead of having a picture. Even if the user would like to look at himself to guide him to the location there will be a delay what makes the system less user friendly.
Scenario 2
Instead of using a live feed, there is also the possibility of using a still picture of the room that is made when the user asks for an object. This means that there can be an object box on a location where the object was last seen while it is not there anymore like the example scenario picture at the right. Compared to the live feed this requires a lot less computation while delivering almost the same result. The only main difference is that the user can't guide himself to the object on the live feed but since the room will probably be well-known to the user this won't be necessary. The other advantage of scenario 1 also applies here, being that the communication between the user and the system will be intuitive and clear, with little chance of miscommunication such as the user not understanding what results the system is giving. In terms of disadvantages there is only one, being that also in this scenario the user can obstruct the object location while the picture is made what makes the result less clear.
Scenario 3
In the example scenario picture there is an empty box of where a TV remote has last been seen but is not there anymore. This may not help the user at all so having a scenario where a picture saved of when the object has last been seen could improve user friendliness. This keeps the same advantages of scenario 1 and 2 so user interaction is easy and clear. However, this might introduce some storage concerns because there have to be a lot of pictures saved for the amount of the objects the system finds. These also have to be updated in a certain time interval. However, there only has to be one picture of an object; the most recent one. This means storage doesn't have to be a problem. It can also be argued that privacy is at risk but the cameras should be as privacy sensitive as the pictures so saving images doesn't increase the risk.
Scenario 4
We now consider a situation where the system will state n objects around the object that you wish to find. Using our example this would mean that the system will state "The TV Remote is near the couch, pillow and table" if the system states 3 objects. This requires the object detection of more objects and the system will have to know how close they are to the object you are looking for. The latter is easy to do, the former however is more difficult as the system will have to be trained on more objects. The system also has to know how many objects to state and what objects are handy for the user. This can be done by looking a certain distance around the looked for object, which can be easily done. The system may state some useless objects in that case however. The system also cannot state distinct objects, for example in this case it would just state "couch" in stead of a specific couch.
Scenario 5
We now consider a situation where the system will State the object in which the searched for object is encapsulated in. This is very similar to the last scenario but instead of stating multiple objects, we only state one object. So for objects on the table it can say "On the table" as the objects are completed encapsulated in the table. The TV remote however is not encapsulated in anything hence the system cannot state any object.
Scenario 6
We now consider a situation where the system will state the region of the room an item is in. So using our example the system would state that the TV remote is in the top right region of the room. As you can see for this example it is already hard to define a specific region, while it may be easier for other examples, a certain corner of a room for instance). The system would need to build a world model of the room and divide the room into regions, which are understandable for the user. It cannot use objects for this, so this already increases the difficulty of this system. The system then needs to detect objects and detect in what region(s) they are, this is also more difficult when the object is in multiple regions at once.
Scenario 7
We now consider the situation where the system states the direction the user should be walking towards. This can be done using voice commands or an arrow on a screen, it does not really matter for now. So imagine you are standing in the top left corner of the image and you want to find the TV remote. The system will indicate that the user has to move forward until he reaches the coach. This may sound nice when you state it like this, as the user does not have to think at all when finding on object, he/she just has to follow the steps. However This gives rise to lots of issues in technology and user friendliness. It is very difficult to track what direction a user is facing, and to indicate in what (exact) direction a user has to walk. The system would need to have a perfect world model and understand depth and distance. What will happen when multiple users are on screen? THe system would have trouble to redirect a certain user, unless user recognition is implemented. The user also obviously knows his own house, so finding an object this way is time consuming, this system would work for an unknown location. The system may not know where an object is exactly located (if it is using its last seen functionality) so it may direct a user to the wrong location as well. There will also be a delay present which decreases the user friendliness a lot, as the delay may direct the user to the wrong location.
Summary
In the table below we summarize the different scenarios with their advantages and disadvantages.
We also state whether we need a screen or voice control as user interaction, whether the interaction is live (the user communication is constantly updated) and whether it is required to build a world model of the room. These variants have advantages and disadvantages:
- Screen
- Advantages:
- We can easily communicate information to the user, a user can easily understand an image.
- User knows the room and hence it is relatively easy to find an object.
- Disadvantages:
- A screen needs to be installed or the system has to be able to work via wifi.
- User may obstruct view of camera.
- Advantages:
- Voice
- Advantages:
- Blind people can only use the system when voice is implemented.
- Disadvantages:
- Voice user interaction is difficult to implement.
- Speakers are required.
- Advantages:
- Live
- Advantages:
- It updates in real time so the user can be guided / guide him/herself
- Disadvantages:
- There will be a delay in output.
- It may be unnecessary.
- Advantages:
- World model
- Advantages:
- It can use the model of the room to describe item locations more understandable.
- Disadvantages:
- It requires very difficult implementations.
- Multiple rooms need to be combined with each other.
- Advantages:
Option | Screen/Voice | Live | World model | Additional Advantages | Additional Disadvantages |
---|---|---|---|---|---|
1. Show live footage and draw a box around the corresponding location of the object found. | Screen | Yes | No | ||
2. Show a picture of the room, which was made when the user asks where an object is located, and draw a box around the corresponding location. | Screen | No | No | User may obstruct view of camera, in which case a picture may have to be retaken. | |
3. Show a picture of the room, which was made when the object was last seen, and draw a box around the corresponding location. | Screen | No | No | The user can easily see how the object was lost. | Requires a lot of pictures to be made and saved. |
4. State n objects around the object that you wish to find. | Voice | No | No | Hard to determine what objects to state for the user. | |
May not be accurate enough. | |||||
5. State the object in which the searched for object is encapsulated in. | Voice | No | no | The object may not always be encapsulated by another object. | |
6. State the region of the room in which the object is located in. | Voice | No | Yes | Can be very clear if done correctly. | Room needs to be divided into subsections. |
7. State the direction the user should be walking towards. | Voice | Yes | Yes | Requires no thinking of the user. | System may misguide you if the object is off-screen. |
Very hard to implement as you need to see what the user is doing as well. | |||||
Slow technique of guiding when in a known location of the user. | |||||
The camera needs to know the distance of the object relative to the user. | |||||
Unclarity when multiple users are on screen. |
Discussion
Now that all the scenarios have been argumented, there has to be looked at which is most viable to use in our project. When looking at the scenarios there are some features that appear to be too difficult to implement. One of these is the world model. After discussion it seems like having the system monitor multiple rooms makes it too hard to build a world model around. This already eliminates scenarios 6 and 7. Regarding scenarios with a screen, 1, 2, and 3 are similar. However, scenario 1 doesn't have enough extra to contribute compared to the other two while there is more computation. Scenario 2 and 3 seem to be viable choices as they are user friendly and can clearly indicate where an object is. The last two scenarios, 4 and 5, are somewhat similar too. 4 seems to be the best choice based on the fact that scenario 5 doesn't guarantee a good location description if the object is not encapsulated. However due to project time constraints we do not have the time to implement a system that can communicate with the user. From scenario 2 and 3 we chose scenario 3 as storage is not a problem and a user can easily see what he was doing when the object was lost, reducing the time to search for the object.
Software
For our object recognizing framework we use TensorFlow object detection[29] which is an open source software library for deep learning. It was developed by researchers and engineers from the Google Brain team within Google’s AI organization. We use TensorFlow image recognition which uses deep convolutional neural networks to classify objects in camera footage. The basic implementation provided can run a camera and show boxes around the classified objects and say what they are with a computed certainty, as presented in the image. The API is written in Python, and we can alter the code for our own implementation.
We use the COCO[31] (Common Objects in Context) dataset for our impementations which can classify about 80 different objects, which includes certain animals. All objects of the COCO dataset are in the image on the right. This also results in certain limitations for our implementation, we can only use the objects which are already in the dataset as adding new objects would take too much time and resources. We would have to re-train the neural network with thousands of (good quality, distinct, general and different angle) images of every distinct object, as otherwise we could run into the risk of overfitting, which would make the neural network useless. We have thus only limited objects to detect, luckily these objects include some objects which are lost often as well, such as TV remotes, books, cell phones and even cats and dogs. However we cannot add the objects that we wanted to add such as keys, shoes and other often lost household items. It is also hard to combine multiple frameworks, as this is only possible if they are also written in Python because again we do not have the time and resources to understand and rewrite multiple frameworks into one framework. This means that we have to create our own tracking software with basic functionality which is less optimal than most other frameworks. Tracking is an implementation to keep track of an object's location even when it is moving, to prevent the classification of 'duplicate' objects.
We implemented the software such that when a camera detects a certain object, it will send a picture of its current location to the database every 10 seconds. The object recognition system uses footage of 10 frames per second to update the object locations. Via a website we can access the location of a certain object using these pictures (more on this later). The picture will be updated every time the same object is seen again. This of course raises several questions, as we need to be able to detect whether an object is the same object as before, even if it was moving. For this we use tracking, we do this via positions, colors, and time intervals of a seen object, as it is unlikely that an object changes color (with a light switch for example) and position (moving) in a short time interval. Hence we can guess whether an object is the same. This however becomes extra difficult when we have to take footage of multiple cameras into account, or the same cell phone has moved because someone put it in their pocket, as then the position and color could drastically change.
We have certain levels of testing to help with the implementation of the tracking. (checks ✔ are placed if we have this implemented)
- ✔ Place a stationary object in front of the camera and check whether it will not be recognized as 2 seperate objects over time.
- To make sure that 1 object is not recognized as 2 separate objects, which sometimes is the case in the same frame
- The system should know that an object that is standing still (in every frame) remains the same object over time.
- Place a stationary object in front of the camera and flick the lights on/off, check whether it will not be recognized as 2 seperate objects over time.
- The tracking should not depend too much on color, as a slight color change (lighting) can otherwise imply that there is a new object.
- Place two distinct and identical objects next to each other (like two identical chairs), and check whether they will be recognized as 2 seperate objects over time, not more or less.
- The system should know that two objects which are on a different location, are different objects.
- Place two not similar objects of the same class next to each other (like 2 distinct books) and switch their positions, check whether the correct books' locations are updated.
- The system should be able to track two moving objects, and not create new objects while they are moving.
- The system should update the locations of the objects with the correct corresponding IDs.
- Place an object in front of the camera, then stand in front of it for a while, if you leave the system should not think it is a new object.
- The system should know that it is the same object, based on color and location.
- Move an object (same angle to the camera) across the whole view of the camera at different speeds, check whether the system only recognized one object.
- The system should detect moving objects, and classify it as the same object.
- Place an object in front of the camera, then rotate the objects 180 degrees, the system should not think it is a new object.
- The system should know it is the same object, even at a different angle.
- Move a rotating object across the whole view of the camera at different speeds, check whether the system only recognized one object.
- The system should recognize objects at different angles and while moving.
- Show an object to the camera, then put it behind your back and move to the other side of the view with the object hidden, then reveal the object, check whether the system only recognized one object.
- The system has to know that when an object disappeared, it is possibly the same object that appeared somewhere else.
- Show an object to the camera, then move the object off-screen and return it, check whether the system only recognized one object.
- The system has to know that when an object is off-screen, it is possible that the same object enters the screen again.
- Show a room full of objects to the camera, then shut down the system and boot it again, check whether the system created new instances of every object.
- The system should work after a power outage.
- Place two cameras next to each other (each having a complete different view), move an object from one camera's view to the other, check whether the system only recognized one object.
- The system should be able to work with different cameras.
- The system should know that when an object is off-screen, it is possible that the same object enters the view of the other camera.
UI
We also have an additional challenge we have to face, the user should be able to easily determine where an object is located. We now must convert our database of pictures in a meaningful way, as the default presentation is not so useful for the average user. For this we are hosting a website (https://ol.pietervoors.com/) to easily navigate the object you are looking for. When going to the item list you are presented with a list of items detected in your home. An example is presented in the image below, where 2 different classes of items have been detected, namely a chair and a person.
We can now click on a certain object, like chair, this will give us a list of all different chairs detected in the system (in this case there is only one). As presented in the image below.
We can now click on show location to get a full view of the image in which the object was last seen. As presented in the image below.
Even though we are now using a web interface for this prototype, in the end system this can be prevented with a local based system.
Prototype
The prototype that we want to deliver as the result of this project is more of a proof of concept than showing the full purpose of the system. The setup consists out of two cameras connected to a laptop, pointed at a table. Not all objects in the dataset will be detectable, only the useful ones such as cups and scissors. The rest has been turned off in the search algorithm. In this camera view the prototype should be able to deal with multiple tests that examine the abilities and robustness of the system.
Week 1
Meeting on 23-4-2018 and 24-4-2018
- We brainstormed about several ideas for our project and chose the best one, which is described at the top.
- We discussed how to realize this project and how the wiki should look like.
AP For 26-4-2018
Pieter
- Do research into the Users of the project
Marijn
- Do research into Tensorlow image recognition
Tom
- Do research into Environment mapping
Stijn
- Do research into voice recognition
- Fix wiki
Rowin
- Do research into privacy matter
Meeting on 26-4-2018
- Discussed sources.
- Specified the chosen subject, users, their requirements, goals, milestones and deliverables.
- Prepared the feedback meeting.
- Discussed interview questions
AP For 30-4-2018
Everyone
- Make a summary for state of the art.
Week 2
We implemented object detection in a static image, which is based on TensorFlow: Object Detection Repository The output can be seen in the image below.
We also implemented it successfully on a live video feed, using this tutorial: live object detection The frame-rate and delay is still quite large, which needs to be fixed.
We also implemented DELF feature extraction, which uses feature detection in images which can be used to match two images containing the exact same object. An example matching can be seen in the image below, where we used a pre-trained model for architecture. In this image, lines are drawn between the matched feature points of the two images.
Meeting on 30-4-2018
- Weekly feedback meeting
- We subdivided tasks and worked on them as follows
- Get familiar with voice control (Tom and Stijn)
- Get familiar with object detection (Pieter and Marijn)
- Make and distribute a survey for user research (Rowin)
AP For 3-5-2018
Pieter and Marijn
- Investigate object detection
Tom and Stijn
- Investigate voice control
Rowin
- Analyze survey
Meeting on 3-5-2018
- Find more potential users and discuss results of the user research on the wiki (Rowin)
- Implementing DELF (Marijn and Pieter)
- Research state of the art of voice control to implement a framework (Stijn and Tom)
AP For 7-5-2018
Tom and Stijn
- fix audio ubuntu
Rowin
- Find more survey participants
Marijn
- Implement object detection on live video feed
Pieter
- Explore DELF Point feature extraction and matching
Week 3
Meeting on 7-5-2018
- Weekly feedback meeting
- Completed user research (first version) (Rowin)
- Implementing new voice commands on framework (Tom and Stijn)
- Point feature detection implementation (Pieter and Marijn)
AP For 14-5-2018
Pieter and Marijn
- Able to register new objects
Tom and Stijn
- Implement Coco database
- Look into security
- Create new voice commands
Rowin
- Find study for common lost household items
- Research object location description
- Research for visual impairment uses
Week 4
Meeting on 14-5-2018 and 17-5-2018
- We have specified what needs to be done and subdivided the tasks as seen below
- We have decided to stop with the voice control (for now) as we need to focus on one thing
Goals for week 5:
- Create a database of items that keeps track of where it has been seen last
- A web application will be implemented as interface that can read the database and returns a list of all the times that an item has been seen and its location
- Research the best way to communicate a location to the user (either through photo, a relative location or something else)
Role division
- Marrijn : Web Team
- Pieter : Software Team
- Stijn : Software Team
- Tom : User Research Team
- Rowin: Wiki Team
Week 5
ToDo
- Deploy website to VPS: http://ol.pietervoors.com
- Training own dataset
- Tracking of the same instance of an object
- Way of returning to the user where objects are
Deliverable and attention points for the remainder of the project:
A prototype system with multiple cameras in a home that are connected to a computer. The computer runs a program that keeps track of where items are. These items, because of the current limitation of the basic object detection setup, will be limited to a (to be-)specified set of items. There will be an interface through which the user can request info on where their item is. The system should have support for multiple cameras, as well as multiple instances of the same object (for example, multiple pairs of glasses) and items that have gone out of sight of the cameras. To start, the user interface is a graphical user interface.
The main attention points for the project are the following:
- Tracking of instances of items. This concerns the fact that when you have multiple pairs of glasses, the system needs to figure out which pair of glasses they are. Walking between different rooms, for example, also plays a role here.
- The user interface, which should be made to provide better user interaction, making the system more user-friendly. The way at which the location info is presented to the user plays a big role in this.
- Filtering out or handling wrong detections where the object detection detects items that are not there. The first approach to this will be to train our own dataset.
Week 6
- We have created a user interface at the website http://ol.pietervoors.com.
- The user interface is able to return pictures of last seen objects
- We have implemented basic tracking, which can recognize somewhat whether objects are the same.
- We tried training the network for additional objects, this however is not possible for us to efficiently do.
Week 7
- We improved the website interface, with bounding boxes, tables, etc.
- We improved the tracking of objects.
- We worked on multithreading to implement multiple cameras.
Week 8
Week 9
Results
Sources
To cite a new source: <ref name="reference name">reference link and description</ref>
To cite a previously cited source: <ref name="reference name" \>
- ↑ http://www.aaai.org/Papers/Symposia/Fall/1996/FS-96-05/FS96-05-007.pdf
- ↑ Chincha, R., & Tian, Y. (2011). Finding objects for blind people based on SURF features. 2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW). doi:10.1109/bibmw.2011.6112423 ( https://ieeexplore.ieee.org/abstract/document/6112423/ )
- ↑ Nakada, T., Kanai, H., & Kunifuji, S. (2005). A support system for finding lost objects using spotlight. Proceedings of the 7th International Conference on Human Computer Interaction with Mobile Devices & Services - MobileHCI 05. doi:10.1145/1085777.1085846 ( https://dl.acm.org/citation.cfm?id=1085846 )
- ↑ Pak, R., Peters, R. E., Rogers, W. A., Abowd, G. D., & Fisk, A. D. (2004). Finding lost objects: Informing the design of ubiquitous computing services for the home. PsycEXTRA Dataset. doi:10.1037/e577282012-008
- ↑ Hersh, M. A., & Johnson, M. A. (2008). Assistive technology for visually impaired and blind people. Londres (Inglaterra): Springer - Verlag London Limited.
- ↑ Duckett, P. S., & Pratt, R. (2001). The Researched Opinions on Research: Visually impaired people and visual impairment research. Disability & Society, 16(6), 815-835. doi:10.1080/09687590120083976 ( https://www.tandfonline.com/doi/abs/10.1080/09687590120083976 )
- ↑ Schölvinck, A. M., Pittens, C. A., & Broerse, J. E. (2017). The research priorities of people with visual impairments in the Netherlands. Journal of Visual Impairment & Blindness, 237-261. Retrieved from https://files.eric.ed.gov/fulltext/EJ1142797.pdf.
- ↑ 9.0 9.1 Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the Inception Architecture for Computer Vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/cvpr.2016.308
- ↑ Image Recognition | TensorFlow. (n.d.). Retrieved April 26, 2018, from https://www.tensorflow.org/tutorials/image_recognition
- ↑ Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., . . . Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3), 211-252. doi:10.1007/s11263-015-0816-y
- ↑ 12.0 12.1 Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., . . . Murphy, K. (2017). Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/cvpr.2017.351
- ↑ Leal-Taixé, L., Milan, A., Schindler, K., Cremers, D., Reid, I., & Roth, S. (2017). Tracking the Trackers: An Analysis of the State of the Art in Multiple Object Tracking. CoRR. Retrieved April 26, 2018, from http://arxiv.org/abs/1704.02781
- ↑ Ferri, A. (2016). Object Tracking in Video with TensorFlow (Master's thesis, Universidad Politecnica de Catalunia Catalunya, Spain, 2016). Barcelona: UPCommons. Retrieved April 26, 2018, from http://hdl.handle.net/2117/106410
- ↑ Anne-Sophie Melenhorst, Arthur D. Fisk, Elizabeth D. Mynatt, & Wendy A. Rogers. (2004). Potential Intrusiveness of Aware Home Technology: Perceptions of Older Adults. Proceedings of the Human Factors and Ergonomics Society 48th Annual Meeting (2004).
- ↑ Kelly E. Caine, Arthur D. Fisk, and Wendy A. Rogers. (2006). Benefits and Privacy Concerns of a Home Equipped with a Visual Sensing System: a Perspective from Older Adults. Proceedings of the Human Factors and Ergonomics Society 50th Annual Meeting (2006).
- ↑ Martina Ziefle, Carsten Röcker, Andreas Holzinger (2011). Medical Technology in Smart Homes: Exploring the User's Perspective on Privacy, Intimacy and Trust. Computer Software and Applications Conference Workshops (COMPSACW), 2011 IEEE 35th Annual.
- ↑ Andrew Senior, Sharath Pankanti, Arun Hampapur, Lisa Brown, Ying-Li Tian, Ahmet Ekin. (2003). Blinkering Surveillance: Enabling Video Privacy through Computer Vision. IBM Research Report: RC22886 (W0308-109) August 28, 2003 Computer Science.
- ↑ Rabiner, L. R. (1990). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Readings in Speech Recognition, 267-296. doi:10.1016/b978-0-08-051584-7.50027-9 ( https://ieeexplore.ieee.org/document/18626/ )
- ↑ Sarkar, M., Haider, M. Z., Chowdhury, D., & Rabbi, G. (2016). An Android based human computer interactive system with motion recognition and voice command activation. 2016 5th International Conference on Informatics, Electronics and Vision (ICIEV). doi:10.1109/iciev.2016.7759990 ( https://ieeexplore.ieee.org/document/7759990/ )
- ↑ Zhang, X., Tao, Z., Zhao, H., & Xu, T. (2017). Pathological voice recognition by deep neural network. 2017 4th International Conference on Systems and Informatics (ICSAI). doi:10.1109/icsai.2017.8248337 ( https://ieeexplore.ieee.org/document/8248337/ )
- ↑ Yi, C., Flores, R. W., Chincha, R., & Tian, Y. (2013). Finding objects for assisting blind people. Network Modeling Analysis in Health Informatics and Bioinformatics, 2(2), 71-79.(https://link.springer.com/article/10.1007/s13721-013-0026-x)
- ↑ Yabuta, K., & Kitazawa, H. (2008, May). Optimum camera placement considering camera specification for security monitoring. In Circuits and Systems, 2008. ISCAS 2008. IEEE International Symposium on (pp. 2114-2117). IEEE.(https://ieeexplore.ieee.org/abstract/document/4541867/)
- ↑ Bodor, R., Drenner, A., Schrater, P., & Papanikolopoulos, N. (2007). Optimal camera placement for automated surveillance tasks. Journal of Intelligent and Robotic Systems, 50(3), 257-295. (https://link.springer.com/article/10.1007%2Fs10846-007-9164-7)
- ↑ Henry, P., Krainin, M., Herbst, E., Ren, X., & Fox, D. (2010). RGB-D mapping: Using depth cameras for dense 3D modeling of indoor environments. In In the 12th International Symposium on Experimental Robotics (ISER. (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.91)
- ↑ https://getpixie.com/blogs/news/lostfoundsurvey
- ↑ https://www.123rf.com/photo_54669575_top-view-of-modern-living-room-interior-with-sofa-and-armchairs-3d-render.html, with unwatermarked found on http://www.revosense.com/2018/01/28/realize-your-desires-living-room-layout-ideas-with-these-5-tips/top-view-of-living-room-interior-3d-render/
- ↑ https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/img/kites_detections_output.jpg
- ↑ https://github.com/tensorflow/models/tree/master/research/object_detection
- ↑ http://cocodataset.org/#explore
- ↑ http://cocodataset.org/#home