Friday, July 27, 2018

Week 5: Interview with astronomer Ashish Mahabal

[Coursera course introduction] Data-Driven Astronomy
-----------------------------------------------------------------
Week 5: Learning from data: regression
- Using machine learning tools to investigate your data
- Calculating the red-shifts of distant galaxies
-----------------------------------------------------------------
Lesson 1: Learning from Data / Korean subtitles
-----------------------------------------------------------------
Lesson 2: The Cosmological Distance Scale / Korean subtitles
-----------------------------------------------------------------
Lesson 3: What is Machine Learning / Korean subtitles
-----------------------------------------------------------------
Lesson 4: Decision Tree Classifier / Korean subtitles
-----------------------------------------------------------------
Lesson 5: Estimating Redshifts using Regression / Korean subtitles
-----------------------------------------------------------------
Week 5: Module Summary / Korean subtitles / English subtitles
-----------------------------------------------------------------
Bonus Interview with Ashish Mahabal / Korean subtitles / English subtitles


My name is Ashish Mahabal. I am a senior research scientist at Caltech, specifically at the Center for Data-Driven Discovery, and I have been working on large-scale sky surveys, which means that I've been using a lot of mathematical and statistical techniques. That has gotten me interested in methodology transfer, in how other fields use these techniques, so I have also been applying some of them to Earth science, to health care data, and to cancer research data as well.

Since I came to Caltech in 1999, I have been working on Big Data, because of the surveys I was involved in: the Palomar-Quest Survey, before that the Digitized Palomar Observatory Survey, more recently the Catalina Real-Time Transient Survey, and a little bit the Palomar Transient Factory. What we do is observe large parts of the sky again and again, so essentially this is like taking digital movies. That has become possible only in the last several years. Until then, people would mainly go out, look at small parts of the sky, at specific samples, and come back and study those. But these digital movies give you lots of data, and moreover, you can find what is changing at different levels: in the universe, in our galaxy and our solar system, and also outside of our galaxy.

First of all, one needs to make sure that the data are of good quality and that there are not too many missing data points. When it comes to large data from these surveys, we mainly work on the time series in the data sets. The time series can be very gappy. They are heteroscedastic, in the sense that the error bars can vary on the same object depending on when you are observing it. And of course, the objects that we have vary in brightness quite a lot.
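The gappy, heteroscedastic time series just described can be illustrated with a small synthetic sketch (every number and variable name below is invented for illustration, not taken from the surveys mentioned). It also computes a few simple per-object statistical features of the kind the interview turns to next:

```python
import numpy as np

# Illustrative synthetic light curve: irregularly sampled ("gappy") times
# and per-point error bars that vary (heteroscedastic).
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0.0, 365.0, 40))       # observation times in days
sigma = rng.uniform(0.02, 0.3, t.size)         # per-point magnitude errors
mag = 18.0 + 0.5 * np.sin(2 * np.pi * t / 30.0) + rng.normal(0.0, sigma)

# Simple per-object statistical features one might then search for
# across a whole survey:
features = {
    "amplitude": float(mag.max() - mag.min()),
    "median": float(np.median(mag)),
    # inverse-variance weighted mean down-weights the noisier points
    "weighted_mean": float(np.average(mag, weights=1.0 / sigma**2)),
    "median_gap_days": float(np.median(np.diff(t))),
}
```

Real survey features are more elaborate (periodograms, variability indices, and so on), but the pattern, reducing each irregular light curve to a fixed feature vector, is the same.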
What that means is that the time series we deal with are quite different from what people in financial services deal with, where the data are taken at very specific times. That provides new challenges: trying to figure out what objects are doing when you are not observing them, which is most of the time, because the amount of time we observe is really small compared to the total time over which the variations in the objects take place.

Rather than studying a single object that may be doing something weird, you try to do it in a statistical manner. For instance (and I don't work on those), take the example of how stars evolve. A star spends a long time in, say, the main sequence, and then it will evolve into a giant, and so on. Those time scales are so long that you don't get to see them in your own lifetime. So what you do instead is observe millions of stars: some of them are in one phase and some in the other. Similarly, when we are looking at objects that vary in brightness, consider a supernova, for instance. The supernova stage lasts only a few weeks, but if you had observations before that and you're lucky, you can find the star that was its progenitor.

Then, by looking at the entire time series, you can try to design specific statistical features, which you can then look for in your entire data set. Once you start understanding a little bit more about the kinds of objects you are interested in, you design these features, or filters, that you can use across the data set to try to find more of them. And once you have a large enough sample, you are in business, because then you can start applying many of the standard techniques to the data set.

Okay, so I can answer that on two different levels. One level is getting a good data set in the first place.
Most surveys are designed with specific goals in mind, which essentially means you are going after either some low-hanging fruit or some specific classes. Data on other classes also exist in that data set, but those objects may not have been observed optimally for going after those classes, so it would be useful if you could combine different data sets. I'm also working on what is called Domain Adaptation, or Model Adaptation, where you try to combine such data sets, and that becomes very interesting. For instance, if you want to do classification, you may find that objects that don't vary in brightness hugely outnumber all other classes. And within the classes that do vary, there may be some classes, like the flaring M stars, that are far more numerous than some other class. What that means is that the data sets are not balanced, and if they're not balanced, most techniques don't work directly on them easily. So you then need to find artificial ways to balance them, and to make sure that the technique you are applying makes sense, because you don't want to find correlations that don't really mean anything. Correlation is not causation, and you are always going to find some correlation.

So: getting a good data set, ordering it so that you have good balance, and having proper metadata that tells you enough about the data set; I think those are the biggest challenges during pre-processing. And during the process itself, making sure that you can follow up each step with a proven answer and make it reproducible. That's the other angle of the challenges.

Many times, what you do is start with simple correlations and simple visualizations, and languages like Python and R are great for that. Python is becoming the workhorse for many, many things, and libraries like scikit-learn are lovely to just start playing around with.
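The imbalance problem described above (non-variable objects vastly outnumbering, say, flaring M stars) can be compensated for in scikit-learn without resampling by reweighting classes. This is a minimal synthetic sketch; the class sizes, feature values, and model settings are all invented:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical, synthetic "survey": 1000 non-variable objects (class 0)
# vs 50 flaring stars (class 1), described by two made-up features.
n_maj, n_min = 1000, 50
X = np.vstack([rng.normal(0.0, 1.0, size=(n_maj, 2)),
               rng.normal(3.0, 1.0, size=(n_min, 2))])
y = np.array([0] * n_maj + [1] * n_min)

# class_weight='balanced' reweights samples inversely to class frequency,
# one standard way to keep the majority class from dominating the fit.
clf = DecisionTreeClassifier(max_depth=3, class_weight="balanced",
                             random_state=0)
clf.fit(X, y)
```

An alternative is to over-sample the minority class (or under-sample the majority) before fitting; either way, one should check that the correlations found survive on held-out data.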
R has a large number of statistical libraries written by statisticians, so that's the good part. Playing around with a bunch of these different libraries should, I would say, be one of the first things, and I would advise people to learn both of them, Python and R, because both have some good things, and people should have both in their repertoire, I think.

Combining diverse data sets that were not taken with the same goals in mind: there are huge data sets out there which have not been combined in that fashion, and doing something like that remains a big challenge. I hope to see more progress happening in that area. There are many, many new tools coming up that are likely to help there, for instance in the image domain. Deep learning is getting popular everywhere, and there are very good tools out there for it. But again, the basics are physics and mathematics, so while it's easy to use online tools, simply connect them to each other, and do a lot of things, students should keep going back to the basic physics and statistics.

One thing that has been good in astronomy is that we have been good at maintaining metadata, data about data, for our data sets. When we take images, for instance, we have been using what is called the FITS format. FITS images have a very good header carrying all kinds of information: where the data were taken, what telescope it was, what the size of the mirror was, which filter was used, what time it was taken, whether the shutter was open for this long or less than that, and so on. What we find is that, because of that, we have been able to build structures, such as the names of the columns that we use, and then transfer information from one data set to another easily.
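The FITS header just mentioned stores its metadata as fixed-width 80-character "cards": an 8-character keyword, the value indicator "= ", then a value and an optional comment after a "/". In practice one would read headers with a library such as astropy.io.fits; this is only a toy parser over hypothetical cards, to show the idea:

```python
def parse_card(card: str):
    """Split one simple FITS header card into (keyword, value-as-string)."""
    keyword = card[:8].strip()       # columns 1-8: keyword, space-padded
    rest = card[10:]                 # skip the "= " value indicator
    value = rest.split("/", 1)[0]    # drop the optional "/ comment"
    return keyword, value.strip().strip("'").strip()

# Hypothetical example cards (real ones come from the telescope pipeline):
cards = [
    "TELESCOP= 'SAMPLE  '           / hypothetical telescope name",
    "EXPTIME =                 60.0 / exposure time in seconds",
    "FILTER  = 'R       '           / hypothetical filter",
]
header = dict(parse_card(c.ljust(80)) for c in cards)
```

This sketch ignores string values containing "/" or "'", and COMMENT/HISTORY cards; real parsers handle all of those, which is exactly why well-kept metadata conventions pay off.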
The same is not true in some other fields. Geoscience is still good, but when it comes to health care, for instance, the metadata keeping there has been at least a few years behind what we astronomers have been doing.

Astronomy is fantastic because you are trying to solve the origin of the universe, trying to figure out where we came from, why we are here, and all that. In health care, especially when you work on something like cancer (the Early Detection Research Network is one area I'm working in), you are trying to see how we can continue to be here longer, and so that is, in fact, rewarding. And when one sees that the same kinds of techniques can be applied, that's fantastic, because once you take a data set and abstract it enough, the tools you are using don't care where the data came from, so long as you are careful about maintaining the domain knowledge, about not going down to noise levels that are too much, and, as I said before, about not finding trivial correlations. So it's highly rewarding to be able to work on these two completely different scales, from the universe level to the cell level.


