Lei Lei, Bing-Qiu Chen, Jin-Da Li, Jin-Tai Wu, Si-Yi Jiang, and Xiao-Wei Liu
1 Department of Astronomy, Yunnan University, Kunming, Yunnan 650500, China
2 South-Western Institute for Astronomy Research, Yunnan University, Kunming, Yunnan 650500, China; bchen@ynu.edu.cn
Received 2021 September 11; revised 2021 November 13; accepted 2021 November 16; published 2022 February 2
Abstract We have investigated the feasibility and accuracy of identifying RR Lyrae stars and quasars from simulated data of the Multi-channel Photometric Survey Telescope (Mephisto) W Survey. Based on the variable-source light curve libraries from the Sloan Digital Sky Survey (SDSS) Stripe 82 data and the observation history simulation from the Mephisto-W Survey Scheduler, we have simulated the uvgriz multi-band light curves of RR Lyrae stars, quasars and other variable sources for the first year of observations of the Mephisto-W Survey. We have applied the ensemble machine learning algorithm Random Forest Classifier (RFC) to identify RR Lyrae stars and quasars, respectively. We build training and test samples, extract ~150 features from the simulated light curves and train two RFCs, one for the RR Lyrae star classification and one for the quasar classification. We find that our RFCs are able to select the RR Lyrae stars and quasars with high purity and completeness, with purity = 95.4% and completeness = 96.9% for the RR Lyrae RFC and purity = 91.4% and completeness = 90.2% for the quasar RFC. We have also derived the relative importance of the extracted features used to classify RR Lyrae stars and quasars.
Key words: methods: data analysis – surveys – catalogs – stars: variables: RR Lyrae – (galaxies:) quasars: general
The Multi-channel Photometric Survey Telescope (Mephisto; Yuan et al. 2020) is a wide-field survey telescope with a 1.6 m primary mirror. Mephisto has a field of view of ~2.36 deg^2. It is equipped with three CCD cameras and is capable of imaging the same patch of sky in three bands simultaneously. The telescope will be installed at Lijiang Observatory in the southwest of China before the end of 2021. Between 2022 and 2031, Mephisto will carry out a ten-year survey program which has two components: the Mephisto-W survey and the Mephisto-D, H and M surveys (Er et al. in preparation). All the observing time of the first year of the survey (2022) will be allocated to the Mephisto-W survey. The full survey area (the northern sky of ~27,000 deg^2, at declinations between −21° and 75°) will be imaged several times in both the ugi and vrz filter combinations over the year, using pairs of 20-second exposures (Lei et al. 2021; Chen et al. submitted). Two key science goals of the Mephisto-W Survey are Galactic archeology and the study of distant galaxies and cosmology. RR Lyrae variable stars are important tracers for the study of the Milky Way (e.g., Sesar et al. 2010; Ablimit & Zhao 2017; Ablimit & Zhao 2018; Griv et al. 2020; Hattori et al. 2020; Liu et al. 2020; del Pino et al. 2021; Ablimit et al. 2021). Large samples of quasars will allow us to probe their nature (e.g., Kuo & Hirashita 2012; Pasquet-Itam & Pasquet 2018) and to constrain the cosmological parameters (e.g., Khadka et al. 2021; Mediavilla & Jiménez-Vicente 2021). Identifying the RR Lyrae stars and quasars in the Mephisto-W survey data, and obtaining complete and uncontaminated samples of them, is therefore fundamental to achieving these key science goals.
Chen et al. (submitted) have presented the Mephisto-W Survey Scheduler (MWSS) and provided simulations of the first-year observations of the Mephisto-W Survey. In the current work, we have simulated the Mephisto-W Survey observations of variable objects, including RR Lyrae stars, quasars and other variable sources, based on the Chen et al. simulation and the light curve libraries of variable objects from the literature. We have trained Random Forest Classifiers (RFCs) to identify RR Lyrae stars and quasars from the simulated data of the Mephisto-W Survey and obtained the accuracy and completeness of the classifiers.
In Section 2, we describe how we simulate the Mephisto-W survey observations of different variable objects. In Section 3 we describe the RFCs we adopted to identify RR Lyrae stars and quasars. In Section 4 we show our results, which are discussed and summarized in Section 5.
Simulating the RR Lyrae stars, quasars and other variable sources as observed by the Mephisto-W Survey involves two steps: simulating the observing cadence of the survey and simulating the light curves of the individual variable sources.
For the cadence simulation, we adopt Simulation 1 from Chen et al. (submitted). Chen et al. (submitted) have presented an adaptive scheduling algorithm for the Mephisto-W Survey. Given models of the telescope, weather conditions and other environmental variables, the scheduler can simulate the observational results of the Mephisto-W survey. Chen et al. have provided two sets of simulation results for the first year of observations of the Mephisto-W Survey, of which we adopt the first. In Simulation 1, 48.1% and 30.7% of the survey fields would be targeted by Mephisto more than five times in the ugi and vrz filter combinations, respectively. In the current work, we focus on the Sloan Digital Sky Survey (SDSS; York et al. 2000) Stripe 82 region, where most of the fields will be targeted by Mephisto five times in a year in both the ugi and vrz filter combinations.
We have adopted a method similar to that of Oluseyi et al. (2012) to simulate the Mephisto-W Survey observations of variable sources. To assess the capability of the Legacy Survey of Space and Time (LSST) to characterize RR Lyrae stars, Oluseyi et al. (2012) undertook extensive simulations of RR Lyrae star light curves based on the LSST operation simulations and the SDSS Stripe 82 photometric measurements. In the current work, the simulations are also based on the SDSS Stripe 82 observations. Ivezić et al. (2007) have provided SDSS ugriz light curves of 67,507 variable sources in the SDSS Stripe 82 region, including 483 RR Lyrae stars (Sesar et al. 2010; Süveges et al. 2012), 9,258 quasars (Palanque-Delabrouille et al. 2011), and 57,766 other variable sources. On average, each object has ten observations in each of the ugriz passbands.
The filter set of Mephisto includes six passbands, uvgriz, which are very similar to those of SkyMapper (Bessell et al. 2011; Wolf et al. 2018). As the Mephisto filters are still under development, in the current work we simply adopt the SkyMapper uvgriz bands as the Mephisto filters. We first transform the SDSS ugriz photometric magnitudes to SkyMapper uvgriz magnitudes. We cross-match the SDSS Stripe 82 Standard Star Catalog (Ivezić et al. 2007) to the SkyMapper Southern Survey Data Release 2 (SMSS DR2; Onken et al. 2019). In Figure 1, we show the correlations between the SkyMapper and SDSS magnitudes. The SkyMapper u, v, g, r, i and z magnitudes are simply converted from the SDSS u, u, g, r, i and z magnitudes, respectively, by linear transformation relations fitted to the cross-matched stars.
Based on these linear relations, we are then able to obtain the idealized Mephisto uvgriz light curves of the RR Lyrae stars, quasars and other variable sources from their SDSS ugriz light curves and finally produce the Mephisto “observed” light curves of the individual objects.
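As a rough illustration of how a linear magnitude transformation of this kind could be fitted on cross-matched stars and then applied to a light curve, the sketch below uses an ordinary least-squares fit. The function names, the synthetic magnitudes and the slope and zero-point used to generate them are assumptions made for this example only; they are not the relations derived in this work.

```python
import numpy as np

def fit_linear_relation(sdss_mag, skymapper_mag):
    """Least-squares fit of skymapper_mag = a * sdss_mag + b (hypothetical helper)."""
    a, b = np.polyfit(sdss_mag, skymapper_mag, deg=1)
    return a, b

def apply_linear_relation(sdss_mag, a, b):
    """Convert SDSS magnitudes to the target system with the fitted relation."""
    return a * np.asarray(sdss_mag, dtype=float) + b

# Synthetic stand-in for cross-matched standard stars (assumed values, not real data).
rng = np.random.default_rng(0)
sdss_g = rng.uniform(15.0, 19.0, size=500)
skymapper_g = 0.98 * sdss_g + 0.25 + rng.normal(0.0, 0.02, size=500)

a_fit, b_fit = fit_linear_relation(sdss_g, skymapper_g)

# Apply the fitted relation to a (made-up) SDSS g-band light curve.
sdss_light_curve = np.array([17.10, 17.35, 17.62, 17.28])
converted_light_curve = apply_linear_relation(sdss_light_curve, a_fit, b_fit)
```

In practice one such relation would be fitted per target band, with both the u and v bands using SDSS u as input, as described in the text.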
The cadence simulation from Chen et al. (submitted) provides the observing times and observing conditions of the fields in SDSS Stripe 82 for the first year of observations of the Mephisto-W Survey. For periodic objects such as RR Lyrae stars, Cepheids and eclipsing binaries, we calculated their phases φ at the individual epochs based on their periods P and the start time of each period, φ0. We then derived the idealized magnitudes of the individual objects at each epoch by linear interpolation of their phase-folded light curves. To produce realistic observations, random Gaussian noise is added to the idealized magnitudes based on the photometric errors calculated from the observing conditions (Lei et al. 2021).
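The procedure for periodic objects can be sketched as follows. This is a minimal illustration, assuming the phase is computed as ((t − φ0)/P) mod 1 and that the phase-folded light curve is available as arrays of phase and magnitude; the function name, the toy template and the noise level are hypothetical.

```python
import numpy as np

def simulate_periodic_magnitudes(epochs, period, t0, template_phase,
                                 template_mag, mag_err, rng=None):
    """Sample a phase-folded light curve at the given epochs and add Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    epochs = np.asarray(epochs, dtype=float)
    # Phase of each epoch, in [0, 1).
    phase = ((epochs - t0) / period) % 1.0
    # Linear interpolation of the folded light curve, wrapping around phase 1 -> 0.
    ideal = np.interp(phase, template_phase, template_mag, period=1.0)
    # Gaussian photometric noise; mag_err may be a scalar or one value per epoch.
    noisy = ideal + rng.normal(0.0, mag_err, size=ideal.shape)
    return phase, ideal, noisy

# Toy template and epochs (assumed values for illustration only).
template_phase = np.array([0.0, 0.25, 0.5, 0.75])
template_mag = np.array([15.0, 15.5, 16.0, 15.5])
epochs = np.array([0.0, 0.0625, 0.25, 0.5])

phase, ideal, noisy = simulate_periodic_magnitudes(
    epochs, period=0.5, t0=0.0,
    template_phase=template_phase, template_mag=template_mag,
    mag_err=0.02, rng=np.random.default_rng(1),
)
```

In a real simulation the per-epoch `mag_err` would come from the observing conditions of each simulated visit rather than a fixed value.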
For non-periodic objects, such as quasars, we are not able to predict their magnitudes at given epochs. We thus randomly selected five SDSS observations taken within one calendar year and shifted their observing times to the same time of the same day in the year 2022. As for the periodic objects, Gaussian random noise was added. In Figure 2 we show two examples of the simulated light curves in the Mephisto uvgriz bands, one for a periodic and one for a non-periodic object.
Figure 1. Relationships between the SkyMapper uvgriz magnitudes and the SDSS ugriz magnitudes for the individual stars in the SDSS Stripe 82 Standard Star Catalog. The black lines show the best-fit linear relations.
Figure 2. Examples of simulated light curves for a periodic object (RR Lyrae star; upper panels) and a non-periodic object (quasar; bottom panels). For the periodic object, its observed (left) and simulated (right) light curves are plotted as functions of phase. For the non-periodic object, its observed (left) and simulated (right) light curves are plotted as functions of modified Julian date. For the quasar, we also show the best-fit DRW models.
We use a machine learning algorithm, the Random Forest Classifier (RFC; Breiman 2001), to identify the RR Lyrae stars and quasars in the current work. RFC is an ensemble learning method for classification which fits a number of decision tree classifiers and combines these weak classifiers to improve the predictive accuracy and control over-fitting. The SCIKIT-LEARN package for PYTHON (Pedregosa et al. 2011) is adopted to build the RFCs in the current work. Based on the simulated Mephisto light curves of different variable sources, we have built separate RFC models for identifying the RR Lyrae stars and the quasars. For the identification of RR Lyrae stars, the sample containing all 483 RR Lyrae stars in the SDSS Stripe 82 region (Sesar et al. 2010) is adopted as the positive sample and a sample containing 483 randomly selected non-RR Lyrae stars from Ivezić et al. (2007) is adopted as the negative sample. For the identification of quasars, a sample containing 9107 quasars is adopted as the positive sample (the Palanque-Delabrouille et al. (2011) catalog contains 9258 quasars, among which 9107 have more than five visits during a calendar year) and a sample containing 9107 randomly selected non-quasars is adopted as the negative sample.
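A balanced binary RFC of this kind can be set up in scikit-learn roughly as sketched below. The feature matrix here is synthetic random data standing in for the light-curve features, and the hyperparameters (`n_estimators=100`, `random_state=0`) are assumptions, since the values used in this work are not stated.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the feature vectors (assumed data, for illustration only):
# 483 positive and 483 negative objects with 10 features each.
rng = np.random.default_rng(42)
n_per_class, n_features = 483, 10
X_pos = rng.normal(1.0, 1.0, size=(n_per_class, n_features))
X_neg = rng.normal(-1.0, 1.0, size=(n_per_class, n_features))

X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(n_per_class, dtype=int),    # 1 = target class
                    np.zeros(n_per_class, dtype=int)])  # 0 = everything else

# Hyperparameters are assumptions; n_estimators=100 is the scikit-learn default.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
pred = clf.predict(X)
```

Using equal numbers of positive and negative objects keeps the two classes balanced, so the classifier is not biased toward the much larger population of other variable sources.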
The simulated light curves of the objects in the positive and negative samples have been transformed into sets of features, which are adopted as the input parameters of the RFC models. We adopt different sets of training features for the RR Lyrae star and quasar RFC models.
Vicedomini et al. (2021) transformed LSST simulated light curves into a set of features that represent the distinctive characteristics of the variables. With the extracted features as input parameters, Vicedomini et al. (2021) applied several machine learning algorithms to identify different types of supernovae. In the current work, we adopted all the statistical parameters from Vicedomini et al. (2021) for the RR Lyrae star RFC. They are listed as follows.
1. Amplitude (ampl): half of the difference between the maximum and the minimum magnitudes.
2. Beyond1std (b1std): the fraction of observations that have magnitudes outside the 1σ range from the mean value.
3. Flux percentile ratio (fpr): the ratio between two flux percentile ranges F_{n,m}, where F_{n,m} is the difference between the flux values at the nth and mth percentiles. In the current work, we adopt five flux percentile ratios: fpr20 = F_{40,60}/F_{5,95}, fpr35 = F_{32.5,67.5}/F_{5,95}, fpr50 = F_{25,75}/F_{5,95}, fpr65 = F_{17.5,82.5}/F_{5,95}, and fpr80 = F_{10,90}/F_{5,95}.
4. Lomb–Scargle periodogram (ls): the period from the Lomb–Scargle periodogram. For the identification of RR Lyrae stars, we adopted period limits from 0.2 to 1.2 days. We note that for both the RR Lyrae stars and quasars, we are not likely to obtain the true periods of the objects, because we have simulated measurements at only four to five epochs.
5. Linear trend (lt): the slope of a linear fit to the light curve.
6. Median absolute deviation (mad): the median of the absolute deviations of the fluxes from their median value.
7. Median buffer range percentage (mbrp): the fraction of observations that have magnitudes within 10% of the median value.
8. Magnitude ratio (mr): the fraction of observations that have magnitudes above the median value.
9. Maximum slope (ms): the maximum value of the slopes calculated from the observations at successive epochs.
10. Percent difference flux percentile (pdfp): the ratio between the difference of the fifth and the 95th percentile fluxes (converted to magnitudes) and the median magnitude.
11. Pair slope trend (pst): the fraction of flux measurements that are larger than the preceding ones in the last 30 pairs of consecutive observations.
12. R Cor Bor (rcb): the fraction of observations that have magnitudes more than 1.5 mag below the median value.
13. Small kurtosis (kurt): the fourth-order moment divided by the square of the variance.
14. Skewness (skew): the third-order moment divided by the cube of the standard deviation.
15. Standard deviation (std): the standard deviation of the observed fluxes.
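Several of the statistics in this list can be computed with a few lines of NumPy. The sketch below follows the definitions as worded above; it is not the Vicedomini et al. (2021) code, the function names are hypothetical, and for simplicity all statistics are evaluated on whichever array (magnitudes or fluxes) is passed in. Taking the absolute value of the slope for `ms` is an assumption.

```python
import numpy as np

def flux_percentile_ratio(flux, lo, hi):
    """F_{lo,hi} / F_{5,95}, with F_{n,m} the flux difference between two percentiles."""
    num = np.percentile(flux, hi) - np.percentile(flux, lo)
    den = np.percentile(flux, 95) - np.percentile(flux, 5)
    return num / den

def basic_features(time, values):
    """A subset of the listed statistics for one band (hypothetical helper)."""
    time = np.asarray(time, dtype=float)
    values = np.asarray(values, dtype=float)
    mean, std, med = values.mean(), values.std(), np.median(values)
    dev = values - mean
    return {
        "ampl": 0.5 * (values.max() - values.min()),
        "b1std": np.mean(np.abs(dev) > std),           # fraction outside 1 sigma
        "lt": np.polyfit(time, values, 1)[0],          # slope of a linear fit
        "mad": np.median(np.abs(values - med)),
        "mr": np.mean(values > med),                   # fraction above the median
        "ms": np.max(np.abs(np.diff(values) / np.diff(time))),  # assumption: |slope|
        "skew": np.mean(dev**3) / std**3,
        "kurt": np.mean(dev**4) / std**4,              # 4th moment / variance^2
        "std": std,
    }

# Toy five-epoch light curve (assumed values).
time = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
features = basic_features(time, values)
fpr50 = flux_percentile_ratio(values, 25, 75)
```

With only four to five epochs per band, percentile-based statistics such as the flux percentile ratios rely heavily on interpolation between the few available points.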
We have light curves of the objects in the six uvgriz passbands, which results in 114 input features from the Vicedomini et al. statistical parameters (19 per passband, counting the five flux percentile ratios separately) for each RR Lyrae star or non-RR Lyrae star.
In addition to the Vicedomini et al. (2021) statistical parameters, we have also adopted the statistical parameters listed as follows.
In total, we have adopted 141 input parameters for the RR Lyrae star RFC.
For the quasar RFC model, we also used all 141 parameters adopted by the RR Lyrae star RFC. In addition, as in the works of MacLeod et al. (2010) and Yang et al. (2021), we have adopted the Damped Random Walk (DRW) parameters, namely the DRW timescale τ and the long-term deviation of variability σ, as input features of the quasar RFC model. The JAVELIN program (Zu et al. 2013) is adopted to fit the light curves in each passband and calculate the DRW parameters τ and σ, which results in 12 additional input features (two per passband).
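To make the roles of τ and σ concrete, the sketch below generates a DRW light curve at arbitrary epochs. It illustrates the DRW model only and is not the JAVELIN fitting procedure. It assumes that σ is the long-term standard deviation of the process, so that the covariance between two epochs separated by Δt is σ² exp(−|Δt|/τ); other works parameterize the amplitude differently. The function name and the numerical values are hypothetical.

```python
import numpy as np

def simulate_drw(times, tau, sigma, mean_mag, rng=None):
    """Draw a damped random walk at the given (increasing) epochs.

    Assumes sigma is the long-term standard deviation of the process.
    """
    rng = np.random.default_rng() if rng is None else rng
    times = np.asarray(times, dtype=float)
    mags = np.empty_like(times)
    # Start from the stationary distribution.
    mags[0] = mean_mag + rng.normal(0.0, sigma)
    for i in range(1, len(times)):
        # Memory of the previous value decays exponentially on timescale tau.
        decay = np.exp(-(times[i] - times[i - 1]) / tau)
        scatter = sigma * np.sqrt(1.0 - decay**2)
        mags[i] = mean_mag + (mags[i - 1] - mean_mag) * decay + rng.normal(0.0, scatter)
    return mags

# Five epochs spread over one year (assumed values, in days and magnitudes).
epochs = np.array([10.0, 85.0, 160.0, 250.0, 330.0])
quasar_mags = simulate_drw(epochs, tau=200.0, sigma=0.2, mean_mag=19.0,
                           rng=np.random.default_rng(7))
```

When the gaps between epochs are comparable to τ, successive points are only weakly correlated, which is one reason why DRW parameters fitted to four or five epochs are expected to be loosely constrained.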
The performance of the RR Lyrae star and quasar RFCs is evaluated with several statistical estimators. For a given class (i.e., RR Lyrae star or quasar), we define TruePositive as the number of objects which are correctly classified as the class; FalsePositive as the number of objects which are wrongly classified as the class although they do not belong to it; TrueNegative as the number of objects which are correctly classified as not the class; and FalseNegative as the number of objects which are wrongly classified as not the class although they do belong to it. We then have:
$$\mathrm{Purity} = \frac{\mathrm{TruePositive}}{\mathrm{TruePositive} + \mathrm{FalsePositive}}, \qquad \mathrm{Completeness} = \frac{\mathrm{TruePositive}}{\mathrm{TruePositive} + \mathrm{FalseNegative}}.$$
The purity of the RFC model is also called precision. It is the fraction of objects assigned to a class that truly belong to that class. The completeness of the RFC model is also called recall. It is the fraction of the objects of a given class that are correctly classified.
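These two estimators follow the standard definitions of precision and recall, purity = TP/(TP + FP) and completeness = TP/(TP + FN). A minimal sketch of how they can be computed from true and predicted labels is given below; the function name and the toy labels are illustrative.

```python
import numpy as np

def purity_completeness(y_true, y_pred, positive=1):
    """Return (purity, completeness) for the class labelled `positive`."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))  # TruePositive
    fp = np.sum((y_pred == positive) & (y_true != positive))  # FalsePositive
    fn = np.sum((y_pred != positive) & (y_true == positive))  # FalseNegative
    purity = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    completeness = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    return purity, completeness

# Toy example: 3 true members, of which 2 are recovered, plus 1 contaminant.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
purity, completeness = purity_completeness(y_true, y_pred)
```

The same numbers can be obtained with `sklearn.metrics.precision_score` and `sklearn.metrics.recall_score`.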
We divided both the positive and negative samples into the same number of subsets. Each time, we select some of the subsets for RFC model training and the remaining subsets for testing the trained classifier. The values of purity and completeness of each classifier are recorded and finally we present the averaged performance.
The RR Lyrae star positive and negative samples each contain 483 objects. They are divided into 48 subsets, denoted S1, S2, S3, ..., S47 and S48. The last subset (S48) contains 13 RR Lyrae stars and 13 non-RR Lyrae objects, and each of the other 47 subsets contains 10 RR Lyrae stars and 10 non-RR Lyrae objects. We train the RR Lyrae star RFC model 48 times. Each time, 36 subsets are selected as the training sample and the other 12 subsets as the test sample. For example, the first time, the subsets S1, S2, S3, ..., S35 and S36 are adopted as the training sample and the remaining subsets (S37, S38, S39, ..., S47 and S48) as the test sample. The second time, the subsets S2, S3, S4, ..., S36 and S37 are adopted as the training sample and the remaining subsets (S38, S39, S40, ..., S48 and S1) as the test sample.
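The rotating train/test scheme described above can be sketched as follows. Subsets are indexed from 0 here, so index 0 corresponds to S1. How objects are assigned to subsets is not specified in the text; assigning them in contiguous index order is an assumption of this example, as are the function names.

```python
def make_subsets(n_objects, n_subsets, subset_size):
    """Split indices 0..n_objects-1 into subsets; the last one takes the remainder."""
    idx = list(range(n_objects))
    subsets = [idx[i * subset_size:(i + 1) * subset_size]
               for i in range(n_subsets - 1)]
    subsets.append(idx[(n_subsets - 1) * subset_size:])
    return subsets

def rotating_splits(n_subsets, n_train):
    """Yield (train, test) lists of subset indices, shifting the window by one each time."""
    for start in range(n_subsets):
        train = [(start + k) % n_subsets for k in range(n_train)]
        test = [(start + k) % n_subsets for k in range(n_train, n_subsets)]
        yield train, test

# RR Lyrae case: 483 objects per class, 48 subsets of 10 (last one 13), 36 for training.
subsets = make_subsets(483, 48, 10)
splits = list(rotating_splits(48, 36))
```

Each subset ends up in the test sample in 12 of the 48 rounds, so every object contributes to the averaged purity and completeness the same number of times.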
We present the averaged performance of our RR Lyrae classifiers in Table 1. Our RR Lyrae RFC performs well: the precision of the RR Lyrae star classification reaches 95.4% and the recall 96.9%, which demonstrates the high efficiency of selecting RR Lyrae stars from the data of the Mephisto-W Survey.
Table 1. The Averaged Values of Purity and Completeness of the RR Lyrae Star RFC
For the RR Lyrae star RFC, we have adopted 141 input features for classifier training. We have examined the relative importance of these input features. Because we have trained the RR Lyrae star RFC 48 times, we record the importance score of every input feature in each trial. We show the averaged scores of the 20 most important features in the upper panel of Figure 3. The most important features are the standard deviations (std), percent difference flux percentiles (pdfp), amplitudes (ampl), maximum slopes (ms), colors (color) and mean values of the real-time colors (mrcolor). In particular, the std and pdfp in the g band are the two most important features.
Figure 3. Importance scores of the 20 most important input parameters for the RR Lyrae star (upper) and the quasar (bottom) RFCs.
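Averaging importance scores over repeated trainings can be sketched with scikit-learn's impurity-based `feature_importances_` attribute, as below. Whether this work used this particular importance measure is not stated, so that choice is an assumption, as are the helper name, the hyperparameters and the synthetic data, in which only the first feature carries class information.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def averaged_importances(X, y, train_index_sets, n_estimators=100, seed=0):
    """Train one forest per training set and average the per-feature importance scores."""
    scores = []
    for train_idx in train_index_sets:
        clf = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        scores.append(clf.feature_importances_)  # impurity-based, sums to 1
    return np.mean(scores, axis=0)

# Synthetic data (assumed): 5 features, only feature 0 separates the classes.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 5))
X[:, 0] += 3.0 * y

# Four hypothetical training sets, each holding 75% of the objects.
train_index_sets = [rng.choice(300, size=225, replace=False) for _ in range(4)]
mean_importance = averaged_importances(X, y, train_index_sets)
ranking = np.argsort(mean_importance)[::-1]  # most important feature first
```

Impurity-based scores tend to be shared among correlated features, so closely related inputs (for example the same statistic in neighboring bands) may each receive a reduced score.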
The quasar positive and negative samples each contain 9107 objects. They are divided into 91 subsets. The last subset contains 107 quasars and 107 non-quasars, while each of the other subsets contains 100 quasars and 100 non-quasars. As for the RR Lyrae star RFC, we have trained the quasar RFC 91 times. Each time, 68 subsets are adopted as the training sample and the remaining 23 subsets as the test sample. We present the averaged performance of the quasar classifiers in Table 2. The precision of the quasar classification is 91.4% and the recall is 90.2%. The performance of the quasar classifiers is not as good as that of the RR Lyrae star classifiers. However, the precision and recall are still high enough for us to select quasar candidates from the Mephisto-W Survey.
Table 2. The Averaged Values of Purity and Completeness of the Quasar RFC
We have also examined the relative importance of the input features for the quasar RFC, which is presented in the bottom panel of Figure 3. The most important features are the colors (color), the mean values of the real-time colors (mrcolor) and the DRW parameters (τ and σ). In particular, the color and mrcolor in u − g are the two most important features.
The Mephisto-W survey will target the whole northern sky of ~27,000 deg^2. All the available time in the first year of the survey will be dedicated to Mephisto-W. The full survey area will be imaged four to five times over the year, in both the ugi and vrz filter combinations. The present work is related to the key science goals of the Mephisto-W survey, with special emphasis on the identification of RR Lyrae stars and quasars.
In order to explore the feasibility and accuracy of selecting RR Lyrae stars and quasars from the first year of observations of the Mephisto-W Survey, we have simulated the uvgriz multi-band light curves of RR Lyrae stars, quasars and other variable objects based on the Mephisto-W Survey Scheduler simulation and the light curve catalogs of variable sources from the SDSS Stripe 82 observations. We then trained RFCs for the RR Lyrae stars and quasars and investigated the precision and recall of the classifiers.
For the RR Lyrae star identification, we have built positive and negative samples containing 483 RR Lyrae stars and 483 non-RR Lyrae stars, respectively. We extracted 141 features from their simulated light curves and used them to train the RR Lyrae star RFC. We have obtained average values of 95.4% and 96.9% for the precision and completeness of the RR Lyrae star RFC, respectively, which indicates that we are able to select RR Lyrae stars from the Mephisto-W survey data with very high efficiency. For the quasar identification, we have built positive and negative samples containing 9107 quasars and 9107 non-quasars, respectively, and adopted 153 training features. The trained RFC can select the quasars with a precision of 91.4% and a completeness of 90.2%.
RFC adopts bagging and random feature sampling, which give it good resistance to noise. Using the same method as Breiman (2001), we have tested the effect of noise on our classifiers. We artificially set the input labels of 5% of the objects in the training sample to the wrong labels. This noise injection leads to errors of 0.04% and 0.4% for the RR Lyrae star and quasar RFCs, respectively. This indicates that the RFC method is insensitive to label noise and that the classifiers are stable.
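The label-noise injection can be sketched as below for binary 0/1 labels. The helper name and the training-sample size are illustrative, and choosing the objects to corrupt uniformly at random without replacement is an assumption.

```python
import numpy as np

def flip_labels(y, fraction, rng=None):
    """Return a copy of binary labels with `fraction` of them flipped, plus the flipped indices."""
    rng = np.random.default_rng() if rng is None else rng
    y_noisy = np.asarray(y).copy()
    n_flip = int(round(fraction * len(y_noisy)))
    # Pick distinct objects so that exactly n_flip labels are wrong.
    flipped = rng.choice(len(y_noisy), size=n_flip, replace=False)
    y_noisy[flipped] = 1 - y_noisy[flipped]
    return y_noisy, flipped

# Toy training labels: 360 positives and 360 negatives (assumed size).
y_train = np.concatenate([np.ones(360, dtype=int), np.zeros(360, dtype=int)])
y_noisy, flipped = flip_labels(y_train, fraction=0.05, rng=np.random.default_rng(5))
```

Training one classifier on `y_train` and another on `y_noisy`, then comparing their scores on an uncorrupted test sample, gives the kind of noise-sensitivity estimate described in the text.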
The Mephisto telescope is planned to obtain its first light at the end of 2021 and the Mephisto-W Survey will target the whole northern sky of ~27,000 deg^2. Although the Mephisto-W survey fields will be targeted by the telescope only four to five times in a year, we are still able to identify the RR Lyrae stars and quasars with high accuracy. This benefits from the high-accuracy real-time colors obtained by the Mephisto-W survey. Compared to the traditional method, which selects RR Lyrae stars and quasars from (period) analysis of the light curves of individual objects, our machine learning algorithm takes much less time and computing resources. It will be powerful for modern large-scale time-domain surveys, which will deliver observations of billions of sources. In addition, our method does not require observations at many epochs, which saves telescope time and enables us to cover much larger areas.
Our method can be applied directly to the Mephisto data once they are available. The algorithm can also be applied to the data of other time-domain surveys, such as the Zwicky Transient Facility (ZTF; Mahabal et al. 2019; Graham et al. 2019; Bellm et al. 2019), the Wide Field Survey Telescope (WFST; Chen et al. 2019; Lou et al. 2020), LSST and the China Space Station Telescope (CSST; Zhao et al. 2016; Yuan et al. 2021; Sun et al. 2021; Cao et al. 2021a; Cao et al. 2021b).
Acknowledgments
This work is funded by the National Natural Science Foundation of China (NSFC) Nos. 11803029, 11833006 and 12173034, the National Training Program of Innovation and Entrepreneurship for Undergraduates of China No. 201910673001, Yunnan University grant C176220100007 and the National Key R&D Program of China No. 2019YFA0405500. We acknowledge the science research grants from the China Manned Space Project with Nos. CMS-CSST-2021-A09, CMS-CSST-2021-A08 and CMS-CSST-2021-B03.
Funding for SDSS-III has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Science Foundation, and the U.S. Department of Energy Office of Science. The SDSS-III website is http://www.sdss3.org/. SDSS-III is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS-III Collaboration including the University of Arizona, the Brazilian Participation Group, Brookhaven National Laboratory, Carnegie Mellon University, University of Florida, the French Participation Group, the German Participation Group, Harvard University, the Instituto de Astrofisica de Canarias, the Michigan State/Notre Dame/JINA Participation Group, Johns Hopkins University, Lawrence Berkeley National Laboratory, Max Planck Institute for Astrophysics, Max Planck Institute for Extraterrestrial Physics, New Mexico State University, New York University, Ohio State University, Pennsylvania State University, University of Portsmouth, Princeton University, the Spanish Participation Group, University of Tokyo, University of Utah, Vanderbilt University, University of Virginia, University of Washington, and Yale University.
The national facility capability for SkyMapper has been funded through ARC LIEF grant LE130100104 from the Australian Research Council, awarded to the University of Sydney, the Australian National University, Swinburne University of Technology, the University of Queensland, the University of Western Australia, the University of Melbourne, Curtin University of Technology, Monash University and the Australian Astronomical Observatory. SkyMapper is owned and operated by the Australian National University’s Research School of Astronomy and Astrophysics. The survey data were processed and provided by the SkyMapper Team at ANU. The SkyMapper node of the All-Sky Virtual Observatory (ASVO) is hosted at the National Computational Infrastructure (NCI). Development and support of the SkyMapper node of the ASVO has been funded in part by Astronomy Australia Limited (AAL) and the Australian Government through the Commonwealth’s Education Investment Fund (EIF) and National Collaborative Research Infrastructure Strategy (NCRIS), particularly the National eResearch Collaboration Tools and Resources (NeCTAR) and the Australian National Data Service Projects (ANDS).
ORCID iDs
Lei Lei https://orcid.org/0000-0003-4631-1915
Bing-Qiu Chen https://orcid.org/0000-0003-2472-4903
Jin-Da Li https://orcid.org/0000-0003-1725-0519
Research in Astronomy and Astrophysics, 2022, Issue 2