Introduction
Kepler and the Hunt for Exoplanets
Discovering exoplanets, planets outside the Solar System, is the first step toward finding habitable planets and life beyond our own system. However, given the vast number of stars around us and limited resources for detailed observation and analysis, the set of candidate stars must be narrowed down before further observation. The Kepler space telescope, a NASA mission operated with the University of Colorado, was built to detect exoplanets. The telescope recorded the brightness of over a hundred thousand stars over a period of four years, providing researchers with a large corpus of stellar data. Astronomers have found that periodic changes in a star’s brightness may indicate the presence of an exoplanet; in particular, periodic drops in flux may be the result of a planet passing in front of the star.
Given the large amount of data gathered by the Kepler telescope, analyzing it by hand is extremely difficult, and an automated method to detect these “threshold crossing events” (TCEs) is needed. Past applications of machine learning models to exoplanet data have yielded significant results: Shallue and Vanderburg showed that convolutional neural networks can be used to detect exoplanets, and Ansdell et al. extended this work by incorporating domain knowledge. We hope to develop a more robust model that does not depend on a single modeling regime. Through the use of time series modeling and classification methods, we created an algorithm that can detect which flux time series correspond to stars with exoplanets.
Data
Our dataset was obtained from NASA’s Mikulski Archive for Space Telescopes (MAST). The archive contains several hundred thousand stars that were observed by the Kepler telescope from 2014 to 2018. Each star’s brightness measurements (flux) are represented as a time series of approximately four thousand timestamps. Roughly three hundred of the stars are confirmed to have exoplanets, and an additional eight hundred are suspected, but not confirmed, to have exoplanets.
Extracting the Data
Unfortunately, there is no way to directly filter the archive for stars with confirmed exoplanets, so our team created a CSV list of stars with confirmed exoplanets and used the Lightkurve API to download each light curve individually. Because downloading and preprocessing the time series data is slow, we restricted our dataset to 283 stars with confirmed planets and 402 stars without confirmed or suspected planets. After downloading, the light curve data was cleaned by removing NaN values and outliers.
Example Raw Light Curve
Example Cleaned Light Curve
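The cleaning step can be sketched in plain NumPy. Lightkurve itself exposes `remove_nans()` and `remove_outliers()` for exactly this; the snippet below is an illustrative re-implementation of the same idea (sigma clipping against the median absolute deviation), and `clean_flux` is a hypothetical helper, not part of the Lightkurve API.

```python
import numpy as np

def clean_flux(time, flux, sigma=5.0):
    """Drop NaN samples, then clip points whose deviation from the
    median flux exceeds `sigma` times the (scaled) median absolute
    deviation."""
    time = np.asarray(time, dtype=float)
    flux = np.asarray(flux, dtype=float)
    keep = ~np.isnan(flux)
    time, flux = time[keep], flux[keep]
    dev = np.abs(flux - np.median(flux))
    mad = np.median(dev)
    if mad > 0:
        keep = dev <= sigma * 1.4826 * mad  # 1.4826 scales MAD to sigma
        time, flux = time[keep], flux[keep]
    return time, flux

# Toy series: the NaN and the deep outlier are both removed.
t = np.arange(6.0)
f = np.array([1.0, 1.1, np.nan, 0.9, 50.0, 1.0])
t_clean, f_clean = clean_flux(t, f)
print(len(f_clean))  # 4
```

A median-based threshold is used instead of the standard deviation because a single deep transit-like outlier inflates the standard deviation enough to hide itself.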
Preprocessing and Initial Analysis
In order to analyze the data, we calculated 59 summary statistics for each time series using the Feature Analysis for Time Series (FATS) package for Python. Although parts of this package were optimized for speed, it is still very slow; as a result, our final dataset consisted of only 685 stars, 283 of which are confirmed to have at least one exoplanet. The table below lists the summary statistics that were calculated.
Statistic  Description 

Amplitude  The amplitude is defined as half of the difference between the median of the maximum 5% and the median of the minimum 5% of magnitudes. 
AndersonDarling  The Anderson–Darling test is a statistical test of whether a given sample of data is drawn from a given probability distribution. 
Autocor_length  The autocorrelation, also known as serial correlation, is the cross-correlation of a signal with itself. 
Beyond1Std  Percentage of points beyond one standard deviation from the weighted mean. 
CAR_mean  In order to model the irregularly sampled time series we use CAR(1) (Brockwell and Davis, 2002), a continuous-time autoregressive model. The mean is one of the three variables in the model. 
CAR_sigma  In order to model the irregularly sampled time series we use CAR(1) (Brockwell and Davis, 2002), a continuous-time autoregressive model. Sigma, which is related to the variance, is one of the three variables in the model. 
CAR_tau  In order to model the irregularly sampled time series we use CAR(1) (Brockwell and Davis, 2002), a continuous-time autoregressive model. The relaxation time, tau, is one of the three variables in the model. 
Con  Index introduced for the selection of variable stars from the OGLE database (Wozniak 2000). To calculate Con, we count the number of sets of three consecutive data points that are all brighter or fainter than 2σ from the mean and normalize the count by N−2. 
Eta_e  An estimated variability index, which is the ratio of the mean of the square of successive differences to the variance of data points. 
FluxPercentileRatioMid20  If F_5,95 is the difference between 95% and 5% magnitude values, the flux percentile ratio mid 20 is F_40,60/F_5,95. 
FluxPercentileRatioMid35  If F_5,95 is the difference between 95% and 5% magnitude values, the flux percentile ratio mid 35 is F_32.5,67.5/F_5,95. 
FluxPercentileRatioMid50  If F_5,95 is the difference between 95% and 5% magnitude values, the flux percentile ratio mid 50 is F_25,75/F_5,95. 
FluxPercentileRatioMid65  If F_5,95 is the difference between 95% and 5% magnitude values, the flux percentile ratio mid 65 is F_17.5,82.5/F_5,95. 
FluxPercentileRatioMid80  If F_5,95 is the difference between 95% and 5% magnitude values, the flux percentile ratio mid 80 is F_10,90/F_5,95. 
Freq1_harmonics_amplitude_0  The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the first frequency component. 
Freq1_harmonics_amplitude_1  The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the first harmonic of the first frequency component. 
Freq1_harmonics_amplitude_2  The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the second harmonic of the first frequency component. 
Freq1_harmonics_amplitude_3  The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the third harmonic of the first frequency component. 
Freq1_harmonics_rel_phase_0*  The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the first frequency component with respect to the phase of the first component. 
Freq1_harmonics_rel_phase_1  The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the first harmonic of the first frequency component with respect to the phase of the first component. 
Freq1_harmonics_rel_phase_2  The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the second harmonic of the first frequency component with respect to the phase of the first component. 
Freq1_harmonics_rel_phase_3  The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the third harmonic of the first frequency component with respect to the phase of the first component. 
Freq2_harmonics_amplitude_0  The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the second frequency component. 
Freq2_harmonics_amplitude_1*  The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the first harmonic of the second frequency component. 
Freq2_harmonics_amplitude_2*  The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the second harmonic of the second frequency component. 
Freq2_harmonics_amplitude_3*  The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the third harmonic of the second frequency component. 
Freq2_harmonics_rel_phase_0*  The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the second frequency component with respect to the phase of the first component. 
Freq2_harmonics_rel_phase_1  The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the first harmonic of the second frequency component with respect to the phase of the first component. 
Freq2_harmonics_rel_phase_2  The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the second harmonic of the second frequency component with respect to the phase of the first component. 
Freq2_harmonics_rel_phase_3  The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the third harmonic of the second frequency component with respect to the phase of the first component. 
Freq3_harmonics_amplitude_0  The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the third frequency component. 
Freq3_harmonics_amplitude_1*  The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the first harmonic of the third frequency component. 
Freq3_harmonics_amplitude_2*  The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the second harmonic of the third frequency component. 
Freq3_harmonics_amplitude_3*  The light curve is modeled by a superposition of sines and cosines. This is the amplitude of the third harmonic of the third frequency component. 
Freq3_harmonics_rel_phase_0*  The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the third frequency component with respect to the phase of the first component. 
Freq3_harmonics_rel_phase_1  The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the first harmonic of the third frequency component with respect to the phase of the first component. 
Freq3_harmonics_rel_phase_2  The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the second harmonic of the third frequency component with respect to the phase of the first component. 
Freq3_harmonics_rel_phase_3  The light curve is modeled by a superposition of sines and cosines. This is the relative phase of the third harmonic of the third frequency component with respect to the phase of the first component. 
LinearTrend  Slope of a linear fit to the light curve. 
MaxSlope  Maximum absolute magnitude slope between two consecutive observations. 
Mean  Mean magnitude. 
Meanvariance  This is a simple variability index and is defined as the ratio of the standard deviation, σ, to the mean magnitude, m. 
MedianAbsDev  The median absolute deviation is defined as the median discrepancy of the data from the median data. 
MedianBRP  Fraction of photometric points within amplitude/10 of the median magnitude. 
PairSlopeTrend  Considering the last 30 (time-sorted) measurements of source magnitude, the fraction of increasing first differences minus the fraction of decreasing first differences. 
PercentAmplitude  Largest percentage difference between either the max or min magnitude and the median. 
PercentDifferenceFluxPercentile  If F_5,95 is the difference between 95% and 5% magnitude values, percent difference flux percentile is the ratio of F_5,95 over the median magnitude. 
PeriodLS  The Lomb–Scargle (LS) algorithm (Scargle, 1982) is a variation of the Discrete Fourier Transform (DFT), in which a time series is decomposed into a linear combination of sinusoidal functions. This transforms the data from the time domain to the frequency domain, allowing us to identify the period of the light curve. 
Period_fit  Returns the false alarm probability of the largest periodogram value. 
Psi_CS  Rcs applied to the phase-folded light curve (generated using the period estimated from the Lomb–Scargle method). 
Psi_eta  Eta_e index calculated from the folded light curve. 
Q31  Q31 is the difference between the third quartile, Q3, and the first quartile, Q1, of a raw light curve. 
Rcs  Rcs is the range of a cumulative sum (Ellaway 1978) of each light curve and is a measure of asymmetry. 
Skew  The skewness of a sample. 
SlottedA_length  Slotted autocorrelation: In slotted autocorrelation, time lags are defined as intervals or slots instead of single values. The slotted autocorrelation function at a certain time lag slot is computed by averaging the cross product between samples whose time differences fall in the given slot. 
SmallKurtosis  Small sample kurtosis of the magnitudes. 
Std  The standard deviation. 
StetsonK  Stetson K is a robust kurtosis (sharpness of the peak of a frequency-distribution curve) measure. 
StetsonK_AC  Stetson K applied to the slotted autocorrelation function of the light curve. 
* These features were later removed because they were either always 0 or dependent on other features.
Note: Many of the descriptions are short summaries of the descriptions given in the FATS documentation, where the full descriptions can be found.
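To make the table's definitions concrete, a few of the simpler statistics can be computed directly in NumPy. These are illustrative re-implementations following the descriptions above, not the FATS code itself (FATS, for instance, weights the mean by photometric errors; the plain mean is used here).

```python
import numpy as np

def amplitude(mag):
    # Half the difference between the medians of the top 5% and
    # bottom 5% of magnitudes (the table's "Amplitude").
    n = max(1, int(round(0.05 * len(mag))))
    s = np.sort(mag)
    return (np.median(s[-n:]) - np.median(s[:n])) / 2.0

def beyond1std(mag):
    # Fraction of points more than one standard deviation from the
    # mean (the table's "Beyond1Std", with an unweighted mean).
    return float(np.mean(np.abs(mag - mag.mean()) > mag.std()))

def q31(mag):
    # Interquartile range Q3 - Q1 (the table's "Q31").
    return np.percentile(mag, 75) - np.percentile(mag, 25)

mag = np.arange(1.0, 101.0)  # toy "light curve" of 100 samples
print(amplitude(mag), q31(mag), beyond1std(mag))  # 47.5 49.5 0.42
```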
After all the summary statistics were calculated for each star, the features were analyzed. Several features contained no meaningful information (i.e., the same value for all stars) and were removed. Other features were redundant, with a Pearson correlation of one with at least one other feature; these features were also removed. The cleaned dataset contains 50 features. Below is the correlation matrix.
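The pruning step just described can be sketched with pandas. The helper `prune_features` and the toy frame are illustrative, not our actual pipeline; the logic is the same: drop constant columns, then drop one column from every perfectly correlated pair.

```python
import pandas as pd

def prune_features(df):
    """Drop constant columns, then drop one of every pair of columns
    whose Pearson correlation is (numerically) +/-1."""
    df = df.loc[:, df.nunique() > 1]  # remove constant features
    corr = df.corr().abs()
    cols = corr.columns
    drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if cols[j] not in drop and corr.iloc[i, j] >= 1.0 - 1e-12:
                drop.add(cols[j])  # keep the first of the pair
    return df.drop(columns=sorted(drop))

# Toy frame: 'c' is constant, 'b' is a rescaled copy of 'a'.
toy = pd.DataFrame({"a": [1, 2, 3, 4],
                    "b": [2, 4, 6, 8],
                    "c": [7, 7, 7, 7],
                    "d": [4, 1, 3, 2]})
print(list(prune_features(toy).columns))  # ['a', 'd']
```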
Modeling and Prediction
Models
We implemented five different models to predict whether a star has a planet. A random forest is an ensemble model that combines many decision trees to make classifications. A support vector machine (SVM) is a non-probabilistic binary classifier that tries to separate the classes in feature space. A linear support vector classifier (Linear SVC) is similar to an SVM, except that it is designed to work better with high-dimensional datasets. Naive Bayes is a classifier that assumes all features are independent. Finally, we performed several forms of regression; here we include only the regression model that performed best, which was linear regression with lasso regularization.
We tuned the hyperparameters of our models by performing a grid search with 10-fold cross-validation on our training set. Grid search is a hyperparameter tuning method that tests every possible combination of hyperparameters in a specified grid. Using this method, we selected the best hyperparameters for each model.
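The tuning procedure can be sketched with scikit-learn's `GridSearchCV`. The synthetic data and the parameter grid below are illustrative placeholders, not the 685-star feature table or the grid we actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the star feature table.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Illustrative grid: every combination is fit and scored
# with 10-fold cross-validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=10, scoring="f1")
search.fit(X, y)
print(search.best_params_)
```

Scoring on F1 rather than accuracy during the search matches the evaluation criterion used in the Analysis section below.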
Results
Below is the model summary, featuring precision, recall, F1 score, and accuracy for the five models mentioned in the previous section.
These are the confusion matrices for each model. A confusion matrix simply displays the true-positive, true-negative, false-positive, and false-negative counts for a binary classifier.
Confusion Matrix
Confusion Matrix Plot
Analysis
When evaluating our models, we relied primarily on the F1 score due to the imbalance between our positive and negative labels. The F1 score is the harmonic mean of precision and recall, which makes it especially informative on imbalanced datasets. On this basis, we selected the random forest as the most effective model; in addition to having the highest F1 score, it also had the highest accuracy.
The model’s confusion matrix reveals that most of the misidentified stars were false positives. This is preferable to a high false-negative rate, since exoplanets are very rare: in the original dataset, only about 300 of the 50,000 stars had exoplanets (approximately 0.6%). It is easier to weed out non-exoplanet stars from the positive predictions than to recover exoplanet stars from the negative predictions.
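The relationship between a confusion matrix and the F1 score is a short computation. The counts below are hypothetical, chosen only to mimic a classifier whose errors are mostly false positives, like our random forest.

```python
def scores(tp, fp, fn, tn):
    """Precision, recall, and their harmonic mean (the F1 score)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: many false positives, few false negatives,
# so recall stays high while precision drops.
p, r, f1 = scores(tp=50, fp=12, fn=4, tn=71)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.806 0.926 0.862
```

Because the harmonic mean is dominated by the smaller of precision and recall, a model cannot buy a high F1 score by predicting the majority class, which is exactly why it suits our imbalanced labels.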
Conclusion
Over the past few decades, the field of astronomy has changed due to the rise of big data and advanced algorithms. When Clyde Tombaugh discovered Pluto in 1930, he did so by manually comparing photographic plates to detect the movement of the planet across the night sky. Today, computers can quickly sift through terabytes of photos for undiscovered celestial bodies.
The light curves extensively analyzed in this paper have become especially ubiquitous as astronomers seek to better understand stars. While more advanced models for exoplanet detection achieve above 90% accuracy, they are computationally intensive, rely on knowledge beyond the available data, and cannot be applied to more general problems.
Our work does not rely on domain knowledge or a fixed machine learning paradigm. As a result, it is flexible enough to be applied to a wide range of problems involving stellar light curve data. Because our approach is model-agnostic, it can scale to vast amounts of data, and a model can be chosen to match the computational constraints and the task at hand. Future applications include supernova detection and the study of solar cycles. Future methodological research could examine how more complex models perform on the data and how different choices of summary statistics affect the models’ robustness.
Authors and Contributors
Tan Gemicioglu, Zachary James and Seonghyun Lee are Computer Science majors at Georgia Tech. Andrew Yarovoi is a Mechanical Engineering major at Georgia Tech. Narae Lee is an Industrial Engineering major at Georgia Tech.
References

Ansdell, Megan, et al. “Scientific Domain Knowledge Improves Exoplanet Transit Classification with Deep Learning.” The Astrophysical Journal Letters, vol. 869, no. 1, Dec. 2018, p. L7. iopscience.iop.org, doi:10.3847/2041-8213/aaf23b.

Hinners, Trisha A., et al. “Machine Learning Techniques for Stellar Light Curve Classification.” The Astronomical Journal, vol. 156, no. 1, June 2018, p. 7. iopscience.iop.org, doi:10.3847/1538-3881/aac16d.

Lightkurve Collaboration, et al. Lightkurve: Kepler and TESS Time Series Analysis in Python. 2018. SAO/NASA Astrophysics Data System, http://adsabs.harvard.edu/abs/2018ascl.soft12013L.

Nun, Isadora, et al. “FATS: Feature Analysis for Time Series.” ArXiv:1506.00010 [Astro-Ph], Aug. 2015. arXiv.org, http://arxiv.org/abs/1506.00010.

Shallue, Christopher J., and Andrew Vanderburg. “Identifying Exoplanets with Deep Learning: A Five-Planet Resonant Chain around Kepler-80 and an Eighth Planet around Kepler-90.” The Astronomical Journal, vol. 155, 2018, p. 94.

Susto, Gian Antonio, et al. “Chapter 9 – Time-Series Classification Methods: Review and Applications to Power Systems Data.” Big Data Application in Power Systems, edited by Reza Arghandeh and Yuxun Zhou, Elsevier, 2018, pp. 179–220. ScienceDirect, doi:10.1016/B978-0-12-811968-6.00009-7.