
Exoplanet Detection from Stellar Time Series Data

CS 4641 Machine Learning Project

Introduction

Kepler and the Hunt for Exoplanets

Discovering exoplanets, planets outside the Solar System, is the first step toward finding habitable worlds and life beyond our own. However, given the vast number of stars around us and limited resources for detailed observation and analysis, the pool of candidate stars must be narrowed down before further study. The Kepler space telescope was a NASA mission, operated with the University of Colorado, designed to detect exoplanets. The telescope recorded the brightness of several thousand stars over a period of four years, providing researchers with a large corpus of stellar data. Astronomers have found that periodic changes in a star’s brightness may indicate the presence of an exoplanet; in particular, periodic drops in flux may result from a planet passing in front of the star.

Given the large amount of data gathered by the Kepler telescope, analyzing it by hand is extremely difficult, and an automated method for detecting these “threshold crossing events” (TCEs) is needed. Past applications of machine learning to exoplanet data have yielded significant results. Shallue and Vanderburg found that convolutional neural networks can be used to detect exoplanets, and Ansdell et al. extended this work by incorporating domain knowledge. We aim for a more robust approach that does not depend on any single modeling paradigm. Using time series feature extraction and standard classification methods, we built an algorithm that identifies which flux time series correspond to stars with exoplanets.

Data

Our dataset was obtained from NASA’s Mikulski Archive for Space Telescopes. The archive contains several hundred thousand stars that were observed by the Kepler telescope from 2014 to 2018. Each star’s brightness measurements (flux) are represented as a time series of approximately four thousand timestamps. Roughly three hundred of the stars are confirmed to have exoplanets, and an additional eight hundred are suspected, but not confirmed, to have exoplanets.

Extracting the Data

Unfortunately, there is no way to filter directly for stars with confirmed exoplanets, so our team created a CSV list of stars with confirmed exoplanets and used the LightCurve API to download them individually. Because downloading and preprocessing the time series data is slow, we restricted our dataset to 283 stars with confirmed planets and 402 stars without confirmed or suspected planets. After downloading, each light curve was cleaned by removing NaN values and outliers.
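The cleaning step can be sketched with NumPy. This is a minimal stand-in for the API’s built-in cleaning methods; the `clean_light_curve` helper and the 5σ threshold are illustrative choices, not the project’s exact settings.

```python
import numpy as np

def clean_light_curve(time, flux, sigma=5.0):
    """Drop NaN flux values, then sigma-clip outliers about the median.

    Hypothetical helper mirroring the cleaning step described above.
    """
    time, flux = np.asarray(time, float), np.asarray(flux, float)
    keep = ~np.isnan(flux)                 # remove NaN samples
    time, flux = time[keep], flux[keep]
    dev = np.abs(flux - np.median(flux))
    mad = np.median(dev)                   # robust estimate of spread
    keep = dev <= sigma * 1.4826 * mad     # 1.4826*MAD ~ one Gaussian sigma
    return time[keep], flux[keep]

# Toy light curve: one missing sample and one extreme flux spike.
t = np.arange(10.0)
f = np.array([1.0, 1.1, np.nan, 0.9, 1.0, 50.0, 1.05, 0.95, 1.0, 1.1])
t_clean, f_clean = clean_light_curve(t, f)  # drops the NaN and the 50.0 outlier
```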

Figure: Example raw light curve time series.

Figure: Example cleaned light curve time series.

Preprocessing and Initial Analysis

In order to analyze the data, we calculated 59 summary statistics for each time series using the Feature Analysis for Time Series (FATS) package for Python. Parts of this package were optimized for speed, but it remains very slow; as a result, our final dataset consisted of only 685 stars, 283 of which are confirmed to have at least one exoplanet. The table below lists the summary statistics that were calculated.

Statistic Description
Amplitude The amplitude is defined as the half of the difference between the median of the maximum 5% and the median of the minimum 5% magnitudes.
AndersonDarling The Anderson-Darling test is a statistical test of whether a given sample of data is drawn from a given probability distribution.
Autocor_length The autocorrelation, also known as serial correlation, is the cross-correlation of a signal with itself.
Beyond1Std Percentage of points beyond one standard deviation from the weighted mean.
CAR_mean In order to model the irregularly sampled time series we use CAR(1) (Brockwell and Davis, 2002), a continuous-time autoregressive model. The mean is one of the model's three parameters.
CAR_sigma In order to model the irregularly sampled time series we use CAR(1) (Brockwell and Davis, 2002), a continuous-time autoregressive model. Sigma, which is related to the variance, is one of the model's three parameters.
CAR_tau In order to model the irregularly sampled time series we use CAR(1) (Brockwell and Davis, 2002), a continuous-time autoregressive model. The relaxation time, tau, is one of the model's three parameters.
Con Index introduced for the selection of variable stars from the OGLE database (Wozniak 2000). To calculate Con, we count the number of runs of three consecutive data points that are brighter or fainter than 2σ from the mean and normalize the count by N−2.
Eta_e An estimated variability index, which is the ratio of the mean of the square of successive differences to the variance of data points.
FluxPercentileRatioMid20 If F_5,95 is the difference between 95% and 5% magnitude values, the flux percentile ratio mid 20 is F_40,60/F_5,95.
FluxPercentileRatioMid35 If F_5,95 is the difference between 95% and 5% magnitude values, the flux percentile ratio mid 35 is F_32.5,67.5/F_5,95.
FluxPercentileRatioMid50 If F_5,95 is the difference between 95% and 5% magnitude values, the flux percentile ratio mid 50 is F_25,75/F_5,95.
FluxPercentileRatioMid65 If F_5,95 is the difference between 95% and 5% magnitude values, the flux percentile ratio mid 65 is F_17.5,82.5/F_5,95.
FluxPercentileRatioMid80 If F_5,95 is the difference between 95% and 5% magnitude values, the flux percentile ratio mid 80 is F_10,90/F_5,95.
Freq1_harmonics_amplitude_0 The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the first frequency component.
Freq1_harmonics_amplitude_1 The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the first harmonic of the first frequency component.
Freq1_harmonics_amplitude_2 The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the second harmonic of the first frequency component.
Freq1_harmonics_amplitude_3 The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the third harmonic of the first frequency component.
Freq1_harmonics_rel_phase_0* The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the first frequency component with respect to the phase of the first component.
Freq1_harmonics_rel_phase_1 The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the first harmonic of the first frequency component with respect to the phase of the first component.
Freq1_harmonics_rel_phase_2 The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the second harmonic of the first frequency component with respect to the phase of the first component.
Freq1_harmonics_rel_phase_3 The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the third harmonic of the first frequency component with respect to the phase of the first component.
Freq2_harmonics_amplitude_0 The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the second frequency component.
Freq2_harmonics_amplitude_1* The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the first harmonic of the second frequency component.
Freq2_harmonics_amplitude_2* The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the second harmonic of the second frequency component.
Freq2_harmonics_amplitude_3* The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the third harmonic of the second frequency component.
Freq2_harmonics_rel_phase_0* The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the second frequency component with respect to the phase of the first component.
Freq2_harmonics_rel_phase_1 The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the first harmonic of the second frequency component with respect to the phase of the first component.
Freq2_harmonics_rel_phase_2 The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the second harmonic of the second frequency component with respect to the phase of the first component.
Freq2_harmonics_rel_phase_3 The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the third harmonic of the second frequency component with respect to the phase of the first component.
Freq3_harmonics_amplitude_0 The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the third frequency component.
Freq3_harmonics_amplitude_1* The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the first harmonic of the third frequency component.
Freq3_harmonics_amplitude_2* The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the second harmonic of the third frequency component.
Freq3_harmonics_amplitude_3* The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the third harmonic of the third frequency component.
Freq3_harmonics_rel_phase_0* The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the third frequency component with respect to the phase of the first component.
Freq3_harmonics_rel_phase_1 The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the first harmonic of the third frequency component with respect to the phase of the first component.
Freq3_harmonics_rel_phase_2 The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the second harmonic of the third frequency component with respect to the phase of the first component.
Freq3_harmonics_rel_phase_3 The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the third harmonic of the third frequency component with respect to the phase of the first component.
LinearTrend Slope of a linear fit to the light-curve.
MaxSlope Maximum absolute magnitude slope between two consecutive observations.
Mean Mean magnitude.
Meanvariance This is a simple variability index and is defined as the ratio of the standard deviation, σ, to the mean magnitude, m.
MedianAbsDev The median absolute deviation is defined as the median discrepancy of the data from the median data.
MedianBRP Fraction of photometric points within amplitude/10 of the median magnitude.
PairSlopeTrend Considering the last 30 (time-sorted) measurements of source magnitude, the fraction of increasing first differences minus the fraction of decreasing first differences.
PercentAmplitude Largest percentage difference between either the max or min magnitude and the median.
PercentDifferenceFluxPercentile If F_5,95 is the difference between 95% and 5% magnitude values, percent difference flux percentile is the ratio of F_5,95 over the median magnitude.
PeriodLS The Lomb-Scargle (L-S) algorithm (Scargle, 1982) is a variation of the Discrete Fourier Transform (DFT), in which a time series is decomposed into a linear combination of sinusoidal functions. This transforms the data from the time domain to the frequency domain, allowing us to identify the period of light-curve.
Period_fit Returns the false alarm probability of the largest periodogram value.
Psi_CS Rcs applied to the phase-folded light curve (generated using the period estimated from the Lomb-Scargle method).
Psi_eta Eta_e index calculated from the folded light curve.
Q31 Q3−1 is the difference between the third quartile, Q3, and the first quartile, Q1, of a raw light curve.
Rcs Rcs is the range of a cumulative sum (Ellaway 1978) of each light-curve and is a measure of asymmetry.
Skew The skewness of a sample.
SlottedA_length Slotted autocorrelation: In slotted autocorrelation, time lags are defined as intervals or slots instead of single values. The slotted autocorrelation function at a certain time lag slot is computed by averaging the cross product between samples whose time differences fall in the given slot.
SmallKurtosis Small sample kurtosis of the magnitudes.
Std The standard deviation.
StetsonK Stetson K is a robust kurtosis (sharpness of the peak of a frequency-distribution curve) measure.
StetsonK_AC Stetson K applied to the slotted autocorrelation function of the light-curve.

* These features were later removed because they were either always 0 or dependent on other features.

Note: Many of the descriptions are short summaries of the descriptions given in the FATS documentation, where full descriptions can be found.
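For concreteness, a few of the table’s statistics can be computed directly from an array of magnitudes. This is a sketch with NumPy following the definitions above; the function names are illustrative, not the FATS API, and Beyond1Std uses an unweighted mean here where FATS uses an error-weighted one.

```python
import numpy as np

def amplitude(mag):
    # Half the difference between the median of the maximum 5% and
    # the median of the minimum 5% of magnitudes.
    s = np.sort(mag)
    n = max(1, int(round(0.05 * len(s))))
    return (np.median(s[-n:]) - np.median(s[:n])) / 2.0

def beyond1_std(mag):
    # Fraction of points more than one standard deviation from the mean
    # (unweighted here; FATS weights the mean by photometric errors).
    return np.mean(np.abs(mag - np.mean(mag)) > np.std(mag))

def q31(mag):
    # Difference between the third quartile, Q3, and the first quartile, Q1.
    return np.percentile(mag, 75) - np.percentile(mag, 25)

grid = np.linspace(0.0, 1.0, 100)  # toy magnitude series
amp, b1, q = amplitude(grid), beyond1_std(grid), q31(grid)
```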

After all the summary statistics were calculated for each star, the features were analyzed. Several features contained no meaningful information (i.e., the same value for all stars) and were removed. Other features were redundant, having a Pearson correlation of one with at least one other feature; these features were also removed. The cleaned dataset contains 50 features. Below is the correlation matrix.
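This pruning step can be sketched with pandas. The column names below are illustrative stand-ins for the real features, and the 1e-9 tolerance on "correlation of one" is an assumption to absorb floating-point error.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "amplitude": rng.normal(size=50),
    "std": rng.normal(size=50),
    "constant": np.ones(50),            # same value for every star
})
df["mean_x2"] = 2 * df["amplitude"]     # Pearson correlation of 1 with amplitude

# Drop features that take a single value for all stars.
df = df.loc[:, df.nunique() > 1]

# Drop one member of each pair of perfectly correlated features,
# scanning only the upper triangle so each pair is seen once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, bool), k=1))
redundant = [c for c in upper.columns if (upper[c] >= 1.0 - 1e-9).any()]
df = df.drop(columns=redundant)
```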

Figure: Feature correlation matrix.

Modeling and Prediction

Models

We implemented five different models to predict whether a star hosts a planet. A random forest is an ensemble model that aggregates many decision trees to make classifications. A support vector machine (SVM) is a non-probabilistic binary classifier that finds a decision boundary separating the two classes with maximum margin. A linear support vector classifier (Linear SVC) is similar to an SVM but is designed to scale better to high-dimensional data sets. Naive Bayes is a probabilistic classifier that assumes all features are independent. Finally, we performed several forms of regression; here we report only the best-performing one, linear regression with lasso regularization.
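The five model families can be instantiated with scikit-learn as sketched below. The hyperparameter values shown are library defaults, not the tuned values from our grid search, and the toy data stands in for the 50-feature dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC

# One instance per model family described above.
models = {
    "random_forest": RandomForestClassifier(random_state=0),
    "svm": SVC(),
    "linear_svc": LinearSVC(),
    "naive_bayes": GaussianNB(),
    "lasso": Lasso(alpha=0.01),  # regression; outputs thresholded below
}

# Toy stand-in for the feature matrix and planet/no-planet labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)

predictions = {}
for name, model in models.items():
    model.fit(X, y)
    pred = model.predict(X)
    if name == "lasso":
        pred = (pred > 0.5).astype(int)  # map continuous scores to labels
    predictions[name] = pred
```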

We tuned the hyperparameters for our models by performing grid search with 10-fold cross validation on our training set. Grid search is a hyperparameter tuning method which tests every possible combination of hyperparameters in a specified grid. Using this method, we selected the best hyperparameters for each model.
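The tuning procedure can be sketched with scikit-learn’s `GridSearchCV`, shown here for the random forest. The parameter grid and the synthetic data are illustrative; the project’s actual grids are not listed here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training set.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 5))
y = (X[:, 0] + 0.1 * X[:, 1] > 0).astype(int)

# Illustrative grid: every combination below is evaluated.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=10,          # 10-fold cross validation, as in the project
    scoring="f1",
)
search.fit(X, y)
best_params = search.best_params_  # best combination found on the grid
```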

Results

Below is the model summary that features precision, recall, F1 score, and accuracy for the five models we mentioned in the previous section.

Figure: Model summary table.

These are the confusion matrices for each model. A confusion matrix displays the true-positive, true-negative, false-positive, and false-negative counts for a binary classifier.

Figure: Confusion matrix table.

Figure: Confusion matrix plots.

Analysis

When evaluating our models, we relied primarily on the F1 score because of the imbalance between our positive and negative labels. The F1 score is the harmonic mean of precision and recall, and it works especially well on imbalanced datasets. By this measure, we selected the random forest as the most effective model; in addition to having the highest F1 score, it also had the highest accuracy.
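The F1 score used above can be computed directly from confusion-matrix counts; the counts in the example are made up for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Equivalent closed form: 2*TP / (2*TP + FP + FN).
# e.g. 40 true positives, 10 false positives, 5 false negatives:
score = f1_from_counts(40, 10, 5)   # 80/95 ≈ 0.842
```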

The model’s confusion matrix reveals that most of the misidentified stars were false positives. This is preferable to a model with a high false-negative rate, since exoplanet hosts are very rare: in the original dataset, only about 300 of the 50,000 stars had exoplanets (approximately 0.6%). It is easier to weed out non-exoplanet stars from the positive predictions than to recover exoplanet stars from the negative predictions.

Conclusion

Over the past few decades, the field of astronomy has changed with the rise of big data and advanced algorithms. When Clyde Tombaugh discovered Pluto in 1930, he did so by manually comparing photographic plates to detect the movement of the planet across the night sky. Today, computers can quickly sift through terabytes of images for undiscovered celestial bodies.

The light curves analyzed in this paper have become especially ubiquitous as astronomers seek to better understand stars. While more advanced exoplanet-detection models achieve above 90% accuracy, they are computationally intensive, make use of knowledge beyond the available data, and cannot be applied to more general problems.

Our work does not rely on domain knowledge or a fixed machine learning paradigm. As a result, it is flexible enough to be applied to a wide range of problems using stellar light curve data. Since our work is applicable to any model, it can work with vast amounts of data and models can be chosen depending on computational constraints and the task being performed. Future applications include supernovae detection and the study of solar cycles. Future methodological research can examine how more complex models perform on the data and how using different summary statistics impacts the models’ robustness.

Authors and Contributors

Tan Gemicioglu, Zachary James and Seonghyun Lee are Computer Science majors at Georgia Tech. Andrew Yarovoi is a Mechanical Engineering major at Georgia Tech. Narae Lee is an Industrial Engineering major at Georgia Tech.

References