[1]:
%matplotlib inline

OpenML Datasets: Plant Shapes Example

How to list and download datasets.

[2]:
import openml
import pandas as pd
from rerf.rerfClassifier import rerfClassifier

# Import scikit-learn dataset library
from sklearn import datasets

# Import train_test_split function
from sklearn.model_selection import train_test_split

RS = 21208

Download datasets


[3]:
# This is done based on the dataset ID ('did').
dataset = openml.datasets.get_dataset(1492)

# Print a summary
print("This is dataset '%s', the target feature is '%s'" %
      (dataset.name, dataset.default_target_attribute))
print("URL: %s" % dataset.url)
print(dataset.description[:500])
This is dataset 'one-hundred-plants-shape', the target feature is 'Class'
URL: https://www.openml.org/data/v1/download/1592284/one-hundred-plants-shape.arff
**Author**: James Cope, Thibaut Beghin, Paolo Remagnino, Sarah Barman.
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/One-hundred+plant+species+leaves+data+set) - 2010
**Please cite**: Charles Mallah, James Cope, James Orwell. Plant Leaf Classification Using Probabilistic Integration of Shape, Texture and Margin Features. Signal Processing, Pattern Recognition and Applications, in press. 2013.

### Description

One-hundred plant species leaves dataset (Class = Shape).

###
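The dataset ID (1492) used below can also be found by listing datasets, as mentioned in the intro. A minimal sketch, assuming an openml-python version whose list_datasets supports output_format="dataframe":

import openml

# List all OpenML datasets as a DataFrame and look up the 'did' by name
datalist = openml.datasets.list_datasets(output_format="dataframe")
match = datalist[datalist["name"] == "one-hundred-plants-shape"]
print(match[["did", "name", "NumberOfInstances", "NumberOfFeatures"]])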

Get the actual data.

The data are returned together with meta-information (e.g. the categorical indicator, feature names, …)

[4]:
X, y, categorical_indicator, attribute_names = dataset.get_data(
    target=dataset.default_target_attribute
)

dat = pd.DataFrame(X)
# Convert the class labels ('1'..'100') to zero-based integers
Y = [int(yi) - 1 for yi in y]


# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.25, random_state = RS
)  # 75% training and 25% test
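An optional sanity check (a minimal sketch): this dataset has 100 species with 16 leaves each, so a plain 75/25 split leaves roughly 12 training leaves per class; passing stratify=Y to train_test_split would keep that balance exact.

import numpy as np

# Expected: (1600, 64) features and 100 distinct species
print(dat.shape, len(set(Y)))
# Smallest and largest per-class counts in the training split
print(np.bincount(y_train).min(), np.bincount(y_train).max())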

Explore the data visually.
[5]:
# Randomly sample rows, then plot the first eight shape features against
# each other, coloured by the value of feature V64
dat = dat.sample(n=1000, random_state=RS)
_ = pd.plotting.scatter_matrix(
    dat.iloc[:100, :8],
    c=dat['V64'].iloc[:100],
    figsize=(10, 10),
    marker='o',
    hist_kwds={'bins': 20},
    alpha=.8,
    cmap='magma'
)
[Figure: scatter matrix of the first eight shape features (demos_openML_plant_shapes_8_0.png)]
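For a class-oriented view, the same scatter matrix can be coloured by species label instead of by V64. A minimal sketch, restricted to the first five species so the colours stay readable:

# Colour the scatter matrix by species label rather than by feature V64
vis = pd.DataFrame(X).copy()
vis['label'] = Y
subset = vis[vis['label'] < 5]  # first five species only
_ = pd.plotting.scatter_matrix(
    subset.iloc[:, :8],
    c=subset['label'],
    figsize=(10, 10),
    marker='o',
    hist_kwds={'bins': 20},
    alpha=.8,
    cmap='magma'
)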
[6]:
# Create a RerF classifier with 500 trees
clf = rerfClassifier(
    n_estimators=500,
    max_features=16,
    feature_combinations=4,
    n_jobs=2,
    random_state=RS,
)
[7]:
clf.fit(X_train, y_train)
[7]:
rerfClassifier(feature_combinations=4, image_height=None, image_width=None,
               max_depth=None, max_features=16, min_samples_split=1,
               n_estimators=500, n_jobs=2, oob_score=False,
               patch_height_max=None, patch_height_min=1, patch_width_max=None,
               patch_width_min=1, projection_matrix='RerF', random_state=21208)
[8]:
y_pred = clf.predict(X_test)
[9]:
# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
[10]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.6525
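Accuracy alone is a coarse summary for a 100-class problem; a short follow-up sketch of a per-class breakdown:

# Per-class precision/recall and the confusion matrix give a finer view
# than overall accuracy when there are 100 classes
print(metrics.classification_report(y_test, y_pred))
print(metrics.confusion_matrix(y_test, y_pred))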