CRA-W DREU 2018


University of Washington, Seattle

About


I'm Ananditha, a rising senior at NYU's Courant Institute of Mathematical Sciences. I study Math & Computer Science and will graduate in December 2019. I hope to get a PhD in Computer Science with a focus on improving healthcare through vision and learning technologies. Previously, I've worked at NYU Langone on deep learning methods to estimate the dental insurance levels of Twitter users, and on opioid addiction assessments using natural language processing. Papers coming soon!

At the University of Washington, I'm working under Dr. Linda Shapiro, a pioneering computer vision researcher at the UW Allen School of Computer Science & Engineering. I'm also mentored by Shima Nofallah and Sachin Mehta, both graduate students here in Electrical Engineering. My project involves writing a CNN to semantically segment images of skin cancer, and I plan to submit the segmentation algorithm to the ISIC Challenge.

Week-by-Week


Here I'll give a week-by-week listing of what I get up to.


Week 1: I arrived in Seattle for the first time on May 29th, and what a beautiful city it is! To get familiar with the field of computer vision, I started this tutorial, which I highly recommend for anyone trying to learn OpenCV. I concurrently took the Stanford CNNs for Visual Recognition course. It progresses very quickly and is nearly 30 hours of video, but it's a great resource for anyone looking to understand CNNs and how they play into classification and segmentation. I'm also aiming to read as many papers as I can, drawn from this GitHub repo and this GitHub repo. I also read about image segmentation in general and its various uses in Prof. Shapiro's textbook.

Week 2: I spent the first day of this week trying to write a cat-versus-dog image classifier using TensorFlow and TFLearn. I followed a tutorial, but quickly realized that TFLearn has bugs (and many open issues on GitHub), so that wasn't going to work in the immediate future. After the day's labor I ended up writing a vanilla ML classifier with an accuracy of around 75%. This was to be my baseline to test any deep learning classification models against. I also wrote simple segmentation programs using the OpenCV tutorials. Since I program in Python using Jupyter notebooks (easy visualization!), functions like cv2.waitKey() weren't going to work, so when it came to implementation, skimage was the better library for my purposes. Having segmented multiple images of puppies using Otsu thresholding and the watershed algorithm, I moved on to segmenting breast biopsy images that the group had used in a previous project. Shallow segmentation techniques didn't yield great results, so I shifted to attempting a CNN.
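For anyone curious, here's a minimal sketch of the kind of shallow segmentation I was playing with in skimage: Otsu thresholding on a grayscale image. The file path is just a placeholder, and this isn't my exact notebook code.

```python
# Minimal Otsu-threshold segmentation with skimage; the image path is a placeholder.
import matplotlib.pyplot as plt
from skimage import io, color, filters

image = io.imread("puppy.jpg")          # any RGB image
gray = color.rgb2gray(image)            # Otsu operates on a single channel
thresh = filters.threshold_otsu(gray)   # threshold that best splits the intensity histogram
mask = gray > thresh                    # boolean foreground/background mask

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].imshow(gray, cmap="gray"); axes[0].set_title("grayscale")
axes[1].imshow(mask, cmap="gray"); axes[1].set_title("Otsu mask")
plt.show()
```

Everything displays inline in a Jupyter notebook, which is exactly why skimage plus matplotlib beat cv2.imshow()/cv2.waitKey() for my workflow.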

Week 3: I was committed to learning as much about CNNs as I could, so I completed the amazing Fast.ai course on Deep Learning. This was super helpful and I can't recommend it enough! It's taught in a wonderfully unassuming way and has very clear visuals to support the lectures. One example is here. This video makes CNNs crystal clear for anyone! My first task of the REU was to implement a semantic segmentation algorithm for the 2018 ISIC challenge, which was to segment images of melanoma lesions on the skin. To this end, I got familiar with a CNN previously developed by the group called ESPNet and trained it on the challenge's skin cancer training images. This took several days, even on a GPU. With the decoder, augmentation, and resizing code removed, the baseline accuracy was around 83%. I then retrained the model, adding back the augmentation code and increasing the image size to 1028*512 (on the previous run I'd downsampled to 100*100). The accuracy dropped to 78%, which was odd, and I began thinking about why. I also made the website you're reading on the Friday of this week!
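To give a flavour of what resizing and augmentation code has to do for segmentation (as opposed to plain classification): the image and its mask must receive exactly the same geometric transform. This is just a rough sketch with placeholder paths and a placeholder target size, not the actual ESPNet data loader.

```python
# Paired resize + horizontal flip for a segmentation (image, mask) pair.
# Paths and the target size are placeholders, not the real training config.
import random
from PIL import Image

def load_pair(img_path, mask_path, size=(1024, 512)):
    img = Image.open(img_path).convert("RGB")
    mask = Image.open(mask_path).convert("L")
    img = img.resize(size, Image.BILINEAR)   # smooth interpolation for pixel values
    mask = mask.resize(size, Image.NEAREST)  # nearest-neighbour so labels stay crisp
    if random.random() < 0.5:                # flip both, or neither
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
        mask = mask.transpose(Image.FLIP_LEFT_RIGHT)
    return img, mask
```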

Week 4: I still wasn't sure why the accuracy had dropped after I augmented the images. If anything, it should have increased, since there were now more training images and more pixels for the model to learn from. On top of that, the decoder wasn't attaching correctly and multiple errors were being thrown. I spent a whole day attempting to debug, but it wasn't going too well. I set myself a goal: to obtain 2 more baselines before going on to debug the ESPNet code and alter it to suit the melanoma images. I first looked to the CSAIL segmentation codebase. This required various modifications to the data set: organizing it differently, and creating ".odgt" files. I wrote scripts to make the .odgt files, but there were still compatibility issues and the code wasn't running. Then I looked toward high-ranking submissions to last year's ISIC challenge to help establish a baseline. I found this algorithm from the RECOD Titans. There were weird errors when I ran this code as well, and I had to switch to older versions of Python and various libraries. I even recreated this error that is a current issue in the Continuum project on GitHub. Overall this week I've found that working with other people's code can be challenging: repositories on GitHub may not be maintained, so they fall out of compatibility with updates and changes to the libraries the code depends on. And since most of those libraries are open source, changes to the interlinked dependencies are to be expected. I had to switch back and forth between versions of PyTorch and other small internal libraries. The process was good though, because now I'm quick to spot dependency issues and know what needs to be reinstalled or uninstalled before I run the code on the GPU. It's also great to see the discussions and issues that are raised within the libraries' repos on GitHub. Almost always there are more than five people with the same error, and there seems to be support online for those that seek it.
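For reference, this is roughly what my .odgt-generating script did, assuming (as I understood the format) that each line of the file is a JSON record pointing at an image, its mask, and the image dimensions. The directory layout and the mask-naming convention here are placeholders.

```python
# Build an .odgt index file: one JSON record per line with the image path,
# mask path, and image dimensions. Paths and naming are placeholders.
import json
from pathlib import Path
from PIL import Image

def make_odgt(img_dir, mask_dir, out_path):
    with open(out_path, "w") as f:
        for img_path in sorted(Path(img_dir).glob("*.jpg")):
            # assumes ISIC-style mask names like ISIC_0000000_segmentation.png
            mask_path = Path(mask_dir) / (img_path.stem + "_segmentation.png")
            width, height = Image.open(img_path).size
            record = {
                "fpath_img": str(img_path),
                "fpath_segm": str(mask_path),
                "width": width,
                "height": height,
            }
            f.write(json.dumps(record) + "\n")

make_odgt("ISIC2018/images", "ISIC2018/masks", "train.odgt")
```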

I was also assigned a second project this week. It's really novel and interesting work on tracking pathologists' mouse clicks as they detect breast cancer in digital images. We want to know what the typical "distractor" regions have in common visually: when clinicians mark a benign ("distractor") region as malignant, what makes them think it's cancerous instead of correctly marking it as normal? This is what I'll work on once the segmentation work comes to a close.

Week 5 & 6: I've spent the past 2 weeks on the breast imaging project. The whole slide breast images I received were in TIFF format, and were the largest images I've ever tried to open, well over 20 MB each! When you think about images as arrays (which they all are under the hood), that's far too much data to hold and manipulate in memory comfortably, and it makes them incredibly hard to work with. They crashed my computer every time I tried to just open one to have a look, so I used versions that had been scaled down to 1/16 of their original size. The diagnosis protocol through which these marked images were collected required study participants to look at the whole slide image, mark a rectangular region which they thought was cancerous, and then select from 3 categories: atypical ductal hyperplasia ("atypia"), ductal carcinoma in situ (DCIS), and invasive cancer. They also had the option to say that the slide contained no cancer, i.e. was benign without atypia. The markings and subsequent diagnoses made by the study participants were then cross-verified against the markings of 3 expert pathologists.

I did various data analyses on the images, comparing the expert pathologists' markings with the study participants' markings. The misdiagnosis percentage by participants on some of the images was as high as 80%, as many of the categories were hard to differentiate. We want to do more work to figure out what distinguishes these "tricky" regions from regions with a very low misdiagnosis percentage. It's really interesting to note that many participant pathologists mark the same region as the experts but differ in their diagnosis. Further work on these images might be able to classify which sorts of cancers are often misdiagnosed and which subfeatures in the images trip up the pathologists. In terms of general feelings, the beginning of these past two weeks was hard: the images wouldn't open, the programs wouldn't load TIFF files, and there was a lot of work I had to do to set up an adequate environment before I could make progress. Once I did that, though, it was smooth sailing. The background work and setup are important and worth taking the time to do properly; they're part and parcel of the research process and of obtaining good results.
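As a toy illustration of the agreement analysis (with made-up column names, not the study's actual data format): each row pairs one participant's diagnosis of a slide with the expert consensus, and we compute a per-image misdiagnosis percentage.

```python
# Per-image misdiagnosis percentage, sketched with pandas on a hypothetical CSV
# where each row is (image_id, participant_dx, expert_dx).
import pandas as pd

df = pd.read_csv("participant_diagnoses.csv")                 # hypothetical file
df["misdiagnosed"] = df["participant_dx"] != df["expert_dx"]  # category mismatch

per_image = (df.groupby("image_id")["misdiagnosed"]
               .mean()            # fraction of participants who got it wrong
               .mul(100)          # as a percentage
               .sort_values(ascending=False))
print(per_image.head(10))         # the "trickiest" slides float to the top
```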

Week 7, 8, and 9: These weeks flew by! The ISIC challenge that I had been working towards was due at the end of the ninth week. I used a mix of code from various Fast.ai homeworks, added augmentation and resizing code, mixed in some elements of ESPNet, and finally had a working algorithm that segmented with a raw Jaccard score of 0.86. I wrote a paper and submitted it to the challenge. It took lots of trial and error, but I finally got there!
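For anyone unfamiliar with the metric, the Jaccard score is just intersection-over-union between the predicted and ground-truth lesion masks. Here's a quick NumPy version (not the official challenge scorer):

```python
# Jaccard index (intersection over union) for binary masks.
import numpy as np

def jaccard(pred, truth):
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return intersection / union if union > 0 else 1.0  # two empty masks agree perfectly

pred  = np.array([[1, 1, 0], [0, 1, 0]])
truth = np.array([[1, 0, 0], [0, 1, 1]])
print(jaccard(pred, truth))  # 2 overlapping pixels / 4 in the union = 0.5
```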

Week 10: My last week in Seattle! I'm wrapping up the breast histopathology work that I started so that someone else can come in and take it over. I took the time to meet with various graduate students and talk about their areas, met with my mentor for the last time, and reflected on how much I had gained from this 10-week experience. It went by so fast, and though at times it was difficult to make sense of the new worlds of vision and convolutional neural networks, I always had support and resources I could use to figure it all out. Overall this has been amazing! If you are a prospective student looking for an REU, this is a great one to apply to! The experience of doing work with esteemed faculty at a university other than your own is really enlightening. It's also an excuse to explore a new city :D There really is no better way to figure out whether you want a PhD and whether you want research to be a part of your life going forward!