Forecasting and Clustering in Google Colab

· 5 min read
Vincenzo Manto
Founder @ Datastripes
Alessia Bogoni
Chief Data Analyst @ Datastripes

Data analysis often involves multiple steps: cleaning, exploring, visualizing, and modeling. Two common and powerful techniques are forecasting (predicting future trends) and clustering (grouping similar data points).

In this post, we’ll show how to do both using Google Colab, walk through the code, and highlight the complexity involved. Then we’ll show how Datastripes can reduce this to just a couple of visual nodes, no code required.


Time Series Forecasting with Prophet in Colab

Suppose you have daily sales data, and you want to forecast the next 30 days. Prophet, an open-source forecasting library developed at Facebook (now Meta), is a good fit for this.

The Data

Imagine a CSV like this:

ds          y
2024-01-01  200
2024-01-02  220
2024-01-03  215
...         ...

Here ds is the date and y is the sales figure for that day. Prophet expects exactly these two column names.

Step-by-step Code Walkthrough

# Install Prophet (needed once per Colab runtime; rerun after the runtime resets)
!pip install prophet

This command installs Prophet in the Colab environment. It might take a minute.

import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt

Here we import the necessary libraries:

  • pandas for data handling
  • Prophet for forecasting
  • matplotlib for plotting

# Load your sales data CSV into a DataFrame, parsing the ds column as dates
df = pd.read_csv('sales.csv', parse_dates=['ds'])

You’ll need to upload sales.csv to the Colab session first, or pass a URL to pd.read_csv instead of a local path.

# Take a peek at your data to ensure it loaded correctly
print(df.head())

Always check your data early! Look for correct date formats, missing values, or typos.
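The checks mentioned above can be sketched in a few lines of pandas. This is a minimal illustration rather than part of the original walkthrough: the helper name check_sales_frame and the tiny sample frame are hypothetical, and the code only assumes the ds and y columns described earlier.

```python
import pandas as pd

def check_sales_frame(df: pd.DataFrame) -> dict:
    """Return a small report on a Prophet-style frame with 'ds' and 'y' columns."""
    # errors='coerce' turns unparseable dates into NaT instead of raising
    parsed = pd.to_datetime(df['ds'], errors='coerce')
    return {
        'bad_dates': int(parsed.isna().sum()),
        'missing_y': int(df['y'].isna().sum()),
        'duplicate_dates': int(parsed.dropna().duplicated().sum()),
    }

# Hypothetical sample: one invalid date, one missing value, one repeated date
sample = pd.DataFrame({
    'ds': ['2024-01-01', '2024-01-02', '2024-01-32', '2024-01-02'],
    'y': [200, 220, None, 215],
})
report = check_sales_frame(sample)
print(report)
```

With the walkthrough's data, the call would be check_sales_frame(df); any non-zero count is worth investigating before fitting.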

# Initialize the Prophet model
model = Prophet()

This creates the Prophet model with default parameters. You can customize it later.

# Fit the model on your data
model.fit(df)

This is where the model is trained: Prophet fits trend and seasonality components to your historical data.

# Create a DataFrame with future dates to forecast
future = model.make_future_dataframe(periods=30)
print(future.tail())

make_future_dataframe returns your historical dates plus 30 extra days (the default frequency is daily), so the model can produce fitted values for the past and predictions for the future.

# Use the model to predict future sales
forecast = model.predict(future)

forecast now contains predicted values (yhat) and the bounds of the uncertainty interval (yhat_lower and yhat_upper), which is 80% wide by default.
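To look only at the forecast horizon, the rows dated after the last historical observation can be selected by comparing ds values. The sketch below uses a hand-built stand-in for forecast (same column names, made-up numbers) so it runs without fitting a model; future_rows is a hypothetical helper name.

```python
import pandas as pd

def future_rows(forecast: pd.DataFrame, history: pd.DataFrame) -> pd.DataFrame:
    """Keep only forecast rows dated after the last historical observation."""
    last_seen = pd.to_datetime(history['ds']).max()
    cols = ['ds', 'yhat', 'yhat_lower', 'yhat_upper']
    return forecast.loc[forecast['ds'] > last_seen, cols].reset_index(drop=True)

# Stand-in data: 3 historical days, forecast covers those plus 2 future days
history = pd.DataFrame({
    'ds': pd.date_range('2024-01-01', periods=3, freq='D'),
    'y': [200, 220, 215],
})
forecast_like = pd.DataFrame({
    'ds': pd.date_range('2024-01-01', periods=5, freq='D'),
    'yhat': [201.0, 218.0, 216.0, 221.0, 224.0],
    'yhat_lower': [190.0, 207.0, 205.0, 208.0, 210.0],
    'yhat_upper': [212.0, 229.0, 227.0, 234.0, 238.0],
})
upcoming = future_rows(forecast_like, history)
print(upcoming)
```

With the walkthrough's variables, this would be future_rows(forecast, df).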

# Visualize the forecast
model.plot(forecast)
plt.title('Sales Forecast')
plt.show()

The plot shows the historical observations as dots, the forecast as a line, and the uncertainty interval as a shaded band.

Tips for Better Forecasts

  • Ensure your dates (ds) are in datetime format.
  • Check for missing or outlier data points before fitting.
  • Tune Prophet’s parameters like seasonality or holidays for your context.
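One common way to act on the outlier tip is to flag values far outside the interquartile range and blank them out before fitting; Prophet's documentation notes that it copes with missing values in y. The 1.5 × IQR rule and the helper name mask_outliers below are illustrative choices, not something this post prescribes.

```python
import numpy as np
import pandas as pd

def mask_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Return a copy of df where y values outside [Q1 - k*IQR, Q3 + k*IQR] are NaN."""
    out = df.copy()
    out['y'] = out['y'].astype(float)
    q1, q3 = out['y'].quantile([0.25, 0.75])
    iqr = q3 - q1
    is_outlier = (out['y'] < q1 - k * iqr) | (out['y'] > q3 + k * iqr)
    out.loc[is_outlier, 'y'] = np.nan
    return out

# Made-up week of sales with one obviously wrong entry
sales = pd.DataFrame({
    'ds': pd.date_range('2024-01-01', periods=7, freq='D'),
    'y': [200, 220, 215, 210, 5000, 205, 225],
})
cleaned = mask_outliers(sales)
print(cleaned)
```

The dates stay in the frame, so only the suspicious value is dropped from fitting. Whether a spike is an error or a real event (a promotion, say) is a judgment call that the rule cannot make for you.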

Clustering Customers Using KMeans in Colab

Now, let’s say you want to segment customers based on income and spending behavior.

The Data

A CSV with columns:

CustomerID  Annual Income (k$)  Spending Score (1-100)
1           15                  39
2           16                  81
3           17                  6
...         ...                 ...

Step-by-step Code Walkthrough

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd

We import KMeans for clustering, matplotlib for plotting, and pandas to load data.

# Load the customer data CSV
df = pd.read_csv('customers.csv')
print(df.head())

Always check the data to understand its shape and content.

# Select the two features to cluster on
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]

These columns will form a 2D space for clustering.

# Initialize KMeans with 3 clusters; n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

Choosing the number of clusters is a key step. Here we pick 3 for illustration.

# Fit the model and predict cluster assignments
kmeans.fit(X)
df['Cluster'] = kmeans.labels_

Each customer gets assigned a cluster label (0, 1, or 2). The label numbers are arbitrary identifiers, not a ranking.
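A quick way to see what each label means is to summarize the features per cluster. The sketch below uses a small hand-labeled frame in place of the real df, so the numbers and the Cluster values are made up for illustration.

```python
import pandas as pd

# Stand-in for df after the Cluster column has been added (values are made up)
labeled = pd.DataFrame({
    'Annual Income (k$)': [15, 16, 17, 70, 72, 120],
    'Spending Score (1-100)': [39, 81, 6, 50, 54, 90],
    'Cluster': [0, 0, 0, 1, 1, 2],
})

# Number of customers and average feature values in each cluster
profile = labeled.groupby('Cluster').agg(
    customers=('Annual Income (k$)', 'count'),
    mean_income=('Annual Income (k$)', 'mean'),
    mean_score=('Spending Score (1-100)', 'mean'),
)
print(profile)
```

Running the same groupby on the walkthrough's df gives one row per segment, which is usually easier to describe to stakeholders than raw labels.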

# Plot clusters with colors
plt.scatter(df['Annual Income (k$)'], df['Spending Score (1-100)'], c=df['Cluster'], cmap='viridis')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customer Segmentation')
plt.show()

The scatter plot shows customers grouped by clusters in different colors.

Tips for Better Clustering

  • Normalize or scale features if they have different units. KMeans relies on Euclidean distance, so a feature with a larger range would otherwise dominate.
  • Experiment with cluster counts and validate with metrics like silhouette score.
  • Visualize results to make business sense of clusters.
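The first two tips can be combined in one short experiment: scale the features, then compare silhouette scores for several cluster counts. The sketch below runs on synthetic data (three made-up groups generated with NumPy), so X_demo, the range of k, and the random seeds are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the income/spending features: three separated groups
rng = np.random.default_rng(0)
centers = np.array([[20, 20], [60, 80], [100, 20]])
X_demo = np.vstack([c + rng.normal(scale=3.0, size=(50, 2)) for c in centers])

# Put both features on the same scale before computing distances
X_scaled = StandardScaler().fit_transform(X_demo)

# Score several candidate cluster counts; a higher silhouette score is better
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On the real data, X from the walkthrough would take the place of X_demo. Real customer data is rarely this cleanly separated, so treat the best-scoring k as a starting point and check it against the plot.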

Why Is This Hard for Most People?

If you’re not a coder, these steps look intimidating: installing packages, writing code, understanding APIs, and debugging errors.

Even for tech-savvy folks, repeating these steps every time the data updates is tedious.

It takes time away from what really matters: interpreting results and making decisions.


How Datastripes Makes This Effortless

With Datastripes, you don’t need to write or understand code:

  • Upload your data.
  • Drag a "Forecast" node and configure date and value columns.
  • Drag a "Cluster" node, pick features, and watch clusters appear.
  • Everything updates live and visually, directly in your browser.
  • No installs, no scripts, no errors.

Datastripes is built to turn these complex workflows into intuitive flows, freeing you to focus on insight, not syntax.

Try the live demo at datastripes.com and see how forecasting and clustering go from tens of lines of code to just two nodes.


When data analysis becomes simple, you can explore more, decide faster, and actually enjoy the process.