Camera technology has been evolving for the past two centuries. Cameras can now capture more information than our eyes can see, and photographers and filmmakers are using these technologies to create amazing work. The vision of this project is to build a camera system that can capture images of the players in a sports game from all possible angles. Viewers could, if they wanted to, watch the game from a player's point of view without attaching wearable cameras to the players, or choose any angle they find exciting.
I started the project by realising how cheap and accessible cameras are nowadays. I wondered what would happen if I had access to a lot of cameras: what would be the result of that change in quantity? This naturally led me to think about the bullet time effect used in The Matrix. Since The Matrix, the bullet time effect has been used in a variety of works.
The image on the left is a camera rig built by Canon for one of its advertisements. The setup includes 50 Canon EOS 1DX cameras and costs roughly £300,000.
After realising how expensive this can get, it is natural to think there might be room for a cheaper alternative. Cameras can be made very cheaply while still delivering satisfactory results, so a rig of purpose-built cameras is quite feasible, rather than a DIY solution assembled from consumer cameras. As you can see, the current solution uses consumer cameras, which is neither cost effective, since most of their components are useless in this scenario, nor user friendly, as photographers have to set up each camera individually.
I started experimenting with the Kinect from Microsoft. It essentially gives each pixel depth information, so that the image can be shown in a 3D environment. The Kinect has an IR emitter and an IR camera: the emitter projects a pattern of infrared dots onto the scene, and the depth of each pixel is inferred from how that pattern is distorted when seen by the IR camera (rather than by timing reflected beams, as a time-of-flight camera would). The depth information is then combined with each pixel of the colour image.
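To make that concrete, turning a depth image into 3D points is a back-projection through the pinhole camera model. Below is a minimal sketch in Python with NumPy; the default `fx`/`fy`/`cx`/`cy` values are rough, commonly quoted Kinect v1 intrinsics, not calibrated ones:

```python
import numpy as np

def depth_to_points(depth_m, fx=594.2, fy=591.0, cx=320.0, cy=240.0):
    """Back-project a depth image (in metres) into 3D points using the
    pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)   # shape (h, w, 3)
    return points[z > 0]                    # drop pixels with no depth reading

# sanity check: a flat wall 2 m away, seen by a tiny 4x4 sensor
depth = np.full((4, 4), 2.0)
pts = depth_to_points(depth, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
print(pts.shape)  # (16, 3): one 3D point per valid depth pixel
```

Every valid depth pixel becomes one point of the point cloud; this is exactly the data shown in the point cloud screenshots later on.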
This is quite a powerful tool for capturing images. Makers hack the Kinect to do a lot of artistic work and interesting experiments. However, I believe it should be able to do even more, something more meaningful or useful. I looked into image editing and how Photoshop has helped photographers manipulate photos. Photoshop can be painful if you know little about it, or if you have little understanding of light, shadow and perspective. Even if you are skilled, editing photos is still a lengthy process. If you can take a picture with its depth information, the computer can potentially understand the 3D geometry of the object and the spatial relationships of the environment. This means that when you are editing the image, you don't have to draw highlights or shadows on a 2D canvas; you simply tell the computer to put a virtual light in the virtual space. The demo below shows what I mean: you can click the buttons to change the direction of the lighting on the plastic cup.
Light from 0°
Light from 30°
Light from 60°
Light from 90°
Light from 120°
Light from 150°
Light from 180°
Light from the back
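For the curious, the effect in the demo above can be approximated with simple Lambertian shading: estimate a surface normal at each pixel from the depth gradients, then scale the brightness by the dot product between the normal and the light direction. This is only a sketch of the idea, not the code behind the demo:

```python
import numpy as np

def relight(gray, depth, light_dir):
    """Re-shade an image using normals estimated from the depth map
    (Lambertian model: brightness proportional to max(0, n . l))."""
    # surface normals from depth gradients: n ~ (-dz/dx, -dz/dy, 1)
    dzdy, dzdx = np.gradient(depth)
    n = np.dstack([-dzdx, -dzdy, np.ones_like(depth)])
    n /= np.linalg.norm(n, axis=2, keepdims=True)
    l = np.asarray(light_dir, float)
    l /= np.linalg.norm(l)
    shading = np.clip(n @ l, 0.0, 1.0)   # per-pixel n . l, clamped
    return gray * shading

gray = np.full((8, 8), 200.0)          # a flat grey card
depth = np.ones((8, 8))                # facing the camera (normal = +z)
lit = relight(gray, depth, [0, 0, 1])  # light from straight ahead
print(lit.max())  # 200.0: fully lit when the light is along the normal
```

Moving the light vector around reproduces the "light from 0° / 30° / ..." buttons: the geometry stays fixed and only the shading term changes.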
The demo is only meant to show what happens when depth data is introduced to image editing. If you had to do the same in Photoshop by hand, you can imagine how much effort it could take. Changing the lighting angle is just one possible example; what the depth data essentially does is segment the pixels in space. You can therefore imagine a series of other applications, including changing the aperture, grabbing an object out of the background, changing the focal length, and so on. To be honest, this is nothing more than a smarter tool for doing exactly what we have done for decades. Below is an image from before digital cameras were invented; you can see the picture segmented into sections for further development. This is how people edited pictures before Photoshop.
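Since the depth data segments the pixels in space, "grabbing an object out of the background" reduces to a depth threshold. A small sketch, where the near/far band is an arbitrary choice for illustration:

```python
import numpy as np

def extract_foreground(rgb, depth_m, near=0.5, far=1.5):
    """Keep only pixels whose depth falls inside a [near, far] band,
    i.e. cut the object out of the background using depth alone."""
    mask = (depth_m > near) & (depth_m < far)
    cut = np.zeros_like(rgb)
    cut[mask] = rgb[mask]          # background pixels stay black
    return cut, mask

rgb = np.full((4, 4, 3), 120)                  # a flat test image
depth = np.array([[1.0] * 4] * 2 + [[3.0] * 4] * 2)  # top half near, bottom far
cut, mask = extract_foreground(rgb, depth)
print(mask.sum())  # 8: only the 8 near pixels survive
```

The same mask could drive a synthetic aperture blur or a focal-length change: once pixels are located in space, each effect is just a different function of depth.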
The photo on the left is the original exposure without editing. You can see marks on it indicating the different exposure to be applied to each area, along with other notes. The result is what you see on the right: there is less contrast, and the overall image is brighter, so you can see more detail in the reflection on the car door.
This method of editing photos is great because it is fast; it is almost instant if the computer is fast enough to process the information. I thought about what real benefit instant editing would bring, and live TV broadcasting came to mind. There is simply no time for anyone to edit the image before broadcasting it to the audience, so editing with depth data works well in this scenario. I looked into live TV broadcasting, especially live sports broadcasting. It is very interesting to see what is happening in this big industry. Live broadcasting has changed sports since its inception. During major sports events, most people watch through TVs or over the Internet. The cameras that capture the events record more pixels than our eyes can see; slow-motion cameras capture action at ultra high speed; skycams 'fly' around the field to follow the players. However, I can still see an opportunity to improve it with a more dynamic viewing experience.
Currently, most of the cameras used in sports broadcasting sit at a distance from the field. You get close-up shots from these cameras thanks to their telephoto lenses; however, a close-up through a telephoto lens is different from being next to the players. Being inside a sports game is a quite powerful experience, and modern sports broadcasting technology fails to deliver it. I went through a lot of sketches of ideas, all aimed at delivering a much more engaging viewing experience for sports viewers.
After categorising and filtering all the ideas I generated, I found three distinctive directions within the sketches. The first is about delivering a more dynamic viewing angle; the second is crowdsourcing images and videos from the audience and rearranging them in a more meaningful way; the third is about revealing hidden stories and information in a sports game. I decided to focus on the first direction and take it forward. What I mean by a more dynamic viewing angle can be explained by the following video.
This video is the new Xbox EA Sports trailer, and it shows some great animation renderings from EA. If you watch carefully, you will see many images that could never be captured in a real-life sports game. These animations use virtual viewing angles, with virtual cameras placed very close to the players regardless of physical restrictions. As a result, the pictures are more exciting, more engaging and almost more 'physical'.
With this vision, and based on some of the previous work in this project, I came up with a rough idea of how to deliver it. If you remember, I showed a screenshot of a point cloud picture captured with a Kinect. Since a Kinect only captures images and depth information from one angle, it will only show one side of the objects being captured. Now imagine a set-up like the following picture: three or more cameras arranged around an object, covering the entire surface of the object in the middle.
The reason for having three or more cameras pointing at one object is to capture enough of its surface. Combined with the depth information captured at the same time, the system can reconstruct the scene in a 3D environment. Taking movement into account, it essentially creates a digital 3D animation of the action. What is the point of this set-up? By creating a 3D animation, it lets the viewer break the physical restrictions on positioning a camera. You can put a virtual camera wherever you want without interfering with the object (the player). This virtual camera can get as close to the object as you like; it can be located at the eyes of a player to show what they are seeing, and it can move around the space freely.
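Once the action exists as a 3D point cloud, a "virtual camera" is just another pinhole projection: transform the points into the virtual camera's coordinate frame and project them onto its image plane. A sketch with an arbitrarily chosen pose and intrinsics:

```python
import numpy as np

def project(points, R, t, fx, fy, cx, cy):
    """Project world-space 3D points through a virtual pinhole camera
    with rotation R, translation t and intrinsics (fx, fy, cx, cy)."""
    cam = points @ R.T + t        # world frame -> camera frame
    cam = cam[cam[:, 2] > 0]      # keep only points in front of the camera
    u = fx * cam[:, 0] / cam[:, 2] + cx
    v = fy * cam[:, 1] / cam[:, 2] + cy
    return np.stack([u, v], axis=1)

# a point 2 m in front of an identity-pose camera lands at the image centre
pts = np.array([[0.0, 0.0, 2.0]])
uv = project(pts, np.eye(3), np.zeros(3), 500, 500, 320, 240)
print(uv)  # the point projects to pixel (320, 240)
```

Moving the virtual camera is just choosing a different `R` and `t` each frame, which is why it can fly anywhere, including to a player's eye position, without touching the physical rig.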
For the whole system to work, there will be two parts: the capture end and the control end. I started designing and prototyping both ends of the system, beginning with the capture end using one Kinect. As mentioned above, only a small part of the body is captured when the Kinect faces the person standing in front of it. Another issue is the low quality of the image the Kinect generates, especially when using a point cloud to visualise the object. To get a better result, I put a Sony HD camera on top of a Kinect (shown below) to get a higher-quality RGB image, and used the depth information from the Kinect to get the geometry. The first few attempts were quite difficult, as the physical setup was very rough and wasn't as stable as it needed to be.
By combining the information generated by these two devices, I made a short video of myself dancing in front of the rig. With masking tape holding the two cameras together, the calibration took a long time. Below is a screenshot of the 3D video generated by this setup.
This screenshot of the short video shows the scene from a different perspective from where the camera is. I was facing the camera with a very awkward pose (as you have already noticed). The physical camera is located to the right of the image in the screenshot. The RGB image generated by the Sony camera was mapped onto the geometry from the Kinect. The calibration wasn't that successful, and you can see that the edge of my leg is rendered with the colour of the background. I gave up on using this method to generate a higher-quality image, both because it is not processed in real time and because it had already proved the point that the quality of the 3D image can be improved with higher-quality imaging sensors. You can check out the whole video here.
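The colour bleed along the edge of my leg is exactly what a small extrinsic calibration error produces: each Kinect 3D point is projected into the RGB camera to sample its colour, and if the assumed pose between the two sensors is slightly off, points near a silhouette land on background pixels. A simplified sketch of that texture-mapping step (the pose and intrinsics here are illustrative, not my actual calibration):

```python
import numpy as np

def colour_points(points, rgb_img, R, t, fx, fy, cx, cy):
    """Sample a colour for each Kinect 3D point by projecting it into
    the RGB camera, whose pose relative to the Kinect is (R, t)."""
    cam = points @ R.T + t
    u = np.round(fx * cam[:, 0] / cam[:, 2] + cx).astype(int)
    v = np.round(fy * cam[:, 1] / cam[:, 2] + cy).astype(int)
    h, w = rgb_img.shape[:2]
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (cam[:, 2] > 0)
    colours = np.zeros((len(points), 3), dtype=rgb_img.dtype)
    colours[ok] = rgb_img[v[ok], u[ok]]   # out-of-frame points stay black
    return colours

rgb = np.full((10, 10, 3), [255, 0, 0])   # a solid red test image
pt = np.array([[0.0, 0.0, 1.0]])          # one point 1 m straight ahead
c = colour_points(pt, rgb, np.eye(3), np.zeros(3), 5, 5, 5, 5)
print(c[0])  # the point samples the red pixel it projects onto
```

Nudging `t` by even a centimetre shifts every sampled pixel, which is why the masking-tape rig made calibration such a long job.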
In this video, you will only see part of my body, as it is only captured from the one side where the camera is. To complete the image, you need more than one camera surrounding the object. I started by connecting two Kinects together; the challenge is to stitch the two images together in the same virtual environment. First, I put two Kinects 4 metres apart, facing each other. By standing in the middle, I was captured from both front and back. The result is OK, but there is a big gap between the two images. Another issue is that when you place two Kinects facing each other, the IR emitter on one interferes with the IR sensor on the other. After a few different attempts at the two-Kinect setup, I was quite certain that more than two are needed to fully reconstruct someone's body in 3D. The following image was shot with two Kinects about 120 degrees apart; the result is better, as it covers more area with more people in the scene. However, there are still a lot of surfaces missing.
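Stitching the two Kinects comes down to knowing each camera's rigid transform (a rotation plus a translation) into a shared world frame, then moving every point cloud through its transform and concatenating them. A toy sketch of the two-cameras-120°-apart case; the poses are idealised hand-placed values, not a real calibration:

```python
import numpy as np

def rot_y(deg):
    """Rotation matrix about the vertical (y) axis."""
    a = np.radians(deg)
    return np.array([[ np.cos(a), 0.0, np.sin(a)],
                     [ 0.0,       1.0, 0.0      ],
                     [-np.sin(a), 0.0, np.cos(a)]])

def to_world(points, R, t):
    """Move a camera-frame point cloud into the shared world frame."""
    return points @ R.T + t

# two cameras on a 2 m circle, 120 degrees apart, both looking at the origin
R_a, t_a = rot_y(0),   rot_y(0)   @ np.array([0.0, 0.0, -2.0])
R_b, t_b = rot_y(120), rot_y(120) @ np.array([0.0, 0.0, -2.0])

point = np.array([[0.0, 0.0, 2.0]])   # each camera sees it 2 m straight ahead
merged = np.vstack([to_world(point, R_a, t_a),
                    to_world(point, R_b, t_b)])
# both rows land (up to float noise) at the origin: the same physical point
```

If the transforms are right, surfaces seen by both cameras overlap in the merged cloud; if they are off, you get exactly the gaps and misalignments described above.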
When I started trying to connect a third Kinect to the system, I came across the biggest technical roadblock of the project (it is not even relevant to the project itself). You can skip this paragraph if you are not interested in computer hardware. Each Kinect pushes a lot of data through its USB port, which means each of them needs to be connected to a separate USB bus, and most computers only have two separate USB buses on the motherboard. I tried many approaches, including USB 3.0, connection through the Thunderbolt port, and PCI/PCIe-to-USB cards. Only the PCI-to-USB solution worked, and only on a Mac Pro workstation. It worked perfectly there, and it means I can add more PCIe cards to the motherboard, which will allow me to get a fourth Kinect running in the future.
The video on the left shows the result of the camera setup I have built. I used three cameras in this setup, each of them capturing video and mapping it onto the geometry generated by its IR sensor. By combining the images from all three cameras, you get a real-time 3D model of myself. It was really cool to see it and interact with it.
I will add more to this story very soon. To be continued...