Celestial Body

Predicting Celestial Body Class using randomForest

Loading libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(randomForest)
randomForest 4.7-1.1
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'

The following object is masked from 'package:dplyr':

    combine

The following object is masked from 'package:ggplot2':

    margin

Importing training set as a dataframe

celestial_training <- read_csv("celestial_train.csv")
Rows: 50000 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): class
dbl (17): id, alpha, delta, u, g, r, i, z, run_ID, rerun_ID, cam_col, field_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
celestial_training
   id        alpha         delta        u        g        r        i
<dbl>        <dbl>         <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
    0 135.68910660  3.249463e+01 23.87882 22.27530 20.39501 19.16573
    1 144.82610055  3.127418e+01 24.77759 22.83188 22.58444 21.16812
    2 338.74103775 -4.028276e-01 22.13682 23.77656 21.61162 20.50454
    3 340.99512051  2.058948e+01 23.48827 23.33776 21.32195 20.25615
    4 200.29047539  4.719940e+01 24.40286 22.35669 20.61032 19.46490
    5  39.14969060  2.810284e+01 21.74669 20.03493 19.17553 18.81823
    6 328.09207617  1.822031e+01 25.77163 22.52042 20.63884 19.78071
    7 331.50202998  1.003580e+01 20.82940 18.75091 17.51118 17.01631
    8 344.98477027 -3.526158e-01 23.20911 22.79291 22.08589 21.86282
    9 171.97542457  6.774745e+01 22.13367 20.84772 18.96537 18.31696


Making sure that the response variable is a factor

celestial_training <- celestial_training %>% 
  mutate(class = as.factor(class))

celestial_training %>% 
  select(class) %>% 
  distinct()
class
<fct>
GALAXY
QSO
STAR
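
The conversion matters because randomForest fits a classification forest when the response is a factor and a regression forest when it is numeric. A minimal sketch in base R, using a small hypothetical vector `labels_chr` in place of the real `class` column:

```r
# Hypothetical stand-in for the character `class` column read by read_csv()
labels_chr <- c("GALAXY", "QSO", "STAR", "GALAXY", "STAR")

# as.factor() stores the distinct values as the levels of the factor
labels_fct <- as.factor(labels_chr)

is.factor(labels_chr)   # FALSE: still a plain character vector
is.factor(labels_fct)   # TRUE
levels(labels_fct)      # "GALAXY" "QSO" "STAR"

# Count how many observations fall in each class
class_counts <- table(labels_fct)
class_counts
```

Running the same `table()` call on the real `class` column would show how balanced the three classes are before any model is fitted.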

Creating model using randomForest

# Fit a classification forest on every column except the identifiers id and spec_obj_ID
model.2 <- randomForest(class ~ . - (id + spec_obj_ID), data = celestial_training, ntree = 750, mtry = 4)
model.2

Call:
 randomForest(formula = class ~ . - (id + spec_obj_ID), data = celestial_training,      ntree = 750, mtry = 4) 
               Type of random forest: classification
                     Number of trees: 750
No. of variables tried at each split: 4

        OOB estimate of  error rate: 2.2%
Confusion matrix:
       GALAXY  QSO  STAR class.error
GALAXY  29314  341    74 0.013959434
QSO       682 8885     1 0.071383779
STAR        4    0 10699 0.000373727
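
The error figures in this printout follow directly from the confusion matrix, where rows are the true classes and columns are the out-of-bag predictions. The sketch below retypes the counts by hand into a matrix (the name `conf` is only for illustration) and recomputes them with base R:

```r
# Counts copied from the confusion matrix above (rows = actual, columns = predicted)
conf <- matrix(
  c(29314,  341,    74,
      682, 8885,     1,
        4,    0, 10699),
  nrow = 3, byrow = TRUE,
  dimnames = list(actual    = c("GALAXY", "QSO", "STAR"),
                  predicted = c("GALAXY", "QSO", "STAR"))
)

# Per-class error: share of each row that lies off the diagonal
class_error <- 1 - diag(conf) / rowSums(conf)
round(class_error, 4)

# Overall OOB error: all off-diagonal counts over all observations
oob_error <- 1 - sum(diag(conf)) / sum(conf)
round(100 * oob_error, 1)   # 2.2, matching the printed OOB estimate
```

Most of the error comes from QSO objects predicted as GALAXY (682 of 9568), which is why QSO has the highest class error of the three.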

Importing testing set as a dataframe

celestial_testing <- read_csv("celestial_test.csv")
Rows: 50000 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (17): id, alpha, delta, u, g, r, i, z, run_ID, rerun_ID, cam_col, field_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Making predictions

# Predict a class for every test row; naming the column here avoids renaming
# the auto-generated column name afterwards
testing_predictions.2 <- data.frame(output = predict(model.2, newdata = celestial_testing))

# Pair each test id with its predicted class
submission.2 <- data.frame(id = celestial_testing$id, output = testing_predictions.2$output)
submission.2
   id output
<dbl> <fct>
50000 QSO
50001 GALAXY
50002 QSO
50003 GALAXY
50004 STAR
50005 GALAXY
50006 GALAXY
50007 QSO
50008 GALAXY
50009 GALAXY

Exporting predictions as csv

write.csv(submission.2, "submission_2.csv", row.names = FALSE)
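
Before using the exported file, it can be worth reading it back to confirm its shape. The sketch below is self-contained and only illustrative: it rebuilds the ten predictions displayed above as a hypothetical `submission_demo` and writes them to a temporary file rather than to `submission_2.csv`.

```r
# The ten predictions shown above, retyped by hand for illustration
submission_demo <- data.frame(
  id = 50000:50009,
  output = factor(c("QSO", "GALAXY", "QSO", "GALAXY", "STAR",
                    "GALAXY", "GALAXY", "QSO", "GALAXY", "GALAXY"))
)

# Write to a temporary file so the real submission is not overwritten
demo_path <- tempfile(fileext = ".csv")
write.csv(submission_demo, demo_path, row.names = FALSE)

# Read the file back and run a few basic checks
check <- read.csv(demo_path, stringsAsFactors = FALSE)
names(check)                                        # "id" "output"
anyNA(check$output)                                 # FALSE: every row has a prediction
anyDuplicated(check$id) == 0                        # TRUE: each id appears once
all(check$output %in% c("GALAXY", "QSO", "STAR"))   # TRUE: only known classes
```

The same checks could be pointed at `submission_2.csv` once it has been written.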