Energy Company Customer Clustering Analysis

General steps, using PECO data.  I did this several years ago, and if I had it to do over again I would do it entirely in R with better graphics.  The general methodology is below, though.

The goal of this process is a basic clustering analysis that assigns each customer a profile/cluster identifier, making it easy to predict their utilization practices and forecast what their utilization will be.

I had to do this without Tableau at the time, so please forgive the Excel...

General process flow for creation/setup/training/feedback loop

  1. The data needs to consist of each customer's utilization broken down by hour (or some other interval; the exact measure of time is irrelevant).
  2. Normalized and cleaned data example (an R sketch of the normalization is included at the end of this step)
    1. This is the comma delimited normalized file needed to move forward with analysis:
      1. Final_cleaned_Normalized_WED_2015_PECO.csv (removed) proprietary
    2. Excel file where I did the cleaning, etc. Each tab was a step from beginning to end.
      1. Peco_interval_2015_wed.xlsx  (Removed) proprietary
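    3. If I were redoing this step in R instead of Excel, a minimal sketch of the normalization would look something like the code below. The raw file name and the H1-H24 column names are assumptions (the real raw file is proprietary); each row is scaled to its share of the day's total utilization.

      > raw <- read.csv("Peco_interval_2015_wed.csv", header=TRUE)   # hypothetical raw interval file, one row per account
      > hours <- paste0("H", 1:24)                                   # 24 hourly reading columns
      > m <- as.matrix(raw[, hours])
      > m <- m[complete.cases(m) & rowSums(m) > 0, ]                 # drop incomplete or zero-usage rows
      > normalized <- m / rowSums(m)                                 # each row becomes its % of the daily total
      > write.csv(as.data.frame(normalized), "Final_cleaned_Normalized_WED_2015_PECO.csv", row.names=FALSE)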
  3. Import the Final_cleaned_Normalized_WED_2015_PECO.csv file into R to manipulate and run the algorithms against.
    1. Launch R and change to the directory you are working out of
      1. > utilization_normalized <- read.csv(file="Final_cleaned_Normalized_WED_2015_PECO.csv", header=TRUE, sep=",")
      2. The attributes() command on any R object will tell you a lot about it.  Review your newly imported data.frame object to confirm it looks the way you expect; this could save you a lot of time down the road.

      3. > attributes(utilization_normalized)
        $names
        [1] "H1" "H2" "H3" "H4" "H5" "H6" "H7" "H8" "H9" "H10" "H11" "H12" "H13" "H14" "H15" "H16" "H17" "H18" "H19" "H20" "H21" "H22" "H23" "H24"

        $class
        [1] "data.frame"

        $row.names
        [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
        [45] 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88

  4. Load and run the DTW (Dynamic Time Warping) library and algorithm against the population
    1. This algorithm was picked because it does a similarity analysis of time series data while accounting for relative changes in volume, and it produces a square cross-reference table measuring how similar/dissimilar any two rows are.
    2. K-means should not be used here; it doesn't do the same kind of relative leveling.
    3. You can't just do a straight KNN analysis on the original data either; if you skip this step, it treats each data point individually, when you really want to compare whole timelines, not single points on the timeline.
    4. Load the DTW Package:
      1. > library(dtw)
        Loading required package: proxy

        Attaching package: ‘proxy’

        The following objects are masked from ‘package:stats’:

        as.dist, dist

        The following object is masked from ‘package:base’:

        as.matrix

        Loaded dtw v1.18-1. See ?dtw for help, citation("dtw") for use in publication.

         

    5. Execute the DTW algorithm on the dataset (a sketch of the command is at the end of this step).

    6. Gives you a distance measure from each account to every other account.
      1. It looks like a giant square cross-reference table.
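    7. The exact command isn't preserved here, so this is a minimal sketch of the distance step. It assumes the data frame imported in step 3 and uses the "DTW" method that the dtw package registers with proxy's dist(); the result is the un_dist object used in the next step.

      > un_dist <- dist(utilization_normalized, method="DTW")   # pairwise DTW distances between all accounts (slow on large populations)
      > un_dist_matrix <- as.matrix(un_dist)                    # the giant square cross-reference table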
  5. Plug the square distance matrix produced by the DTW step into the PAM algorithm (Partitioning Around Medoids) to identify the most likely clusters
    1. This algorithm uses the distance matrix to identify the most likely center (medoid) for each cluster and then groups the individual accounts based on their distance from that center.
    2. > library(cluster)   # Loads the clustering library
      > un_dist_cluster <- pam(un_dist, 5)   # Runs PAM with 5 clusters on the distance matrix we created via the DTW algorithm

      > plot(un_dist_cluster)   # Plots the created clustering

       

      > summary(un_dist_cluster)   # Provides information about the cluster makeup/stats

      Numerical information per cluster:
           size  max_diss   av_diss  diameter separation
      [1,]  939  1.879640 0.8537481  2.490347  0.1932138
      [2,]  450 51.806884 2.6514870 53.339309  0.2610087
      [3,] 1039  1.508784 0.8075434  2.237245  0.1546269
      [4,]  615  2.968381 1.1822063  3.966381  0.2140588
      [5,]  760  2.259450 1.1314709  3.114355  0.1546269

      > attributes(un_dist_cluster)   # Provides a list of attributes available for the cluster object

      $names
      [1] "medoids" "id.med" "clustering" "objective" "isolation" "clusinfo" "silinfo" "diss" "call" "data"

      $class
      [1] "pam" "partition"

       

    3. > un_dist_cluster$clustering   # Prints out the cluster membership for each account
    4. > turned_clustermembership <- t(un_dist_cluster$clustering)   # Transposes the cluster membership and stores it in an object to be written to a file
    5. > write(turned_clustermembership, file = "Clustermembership.txt", ncolumns = 1)   # Writes the cluster membership, in order, to a file
    6. Import the cluster membership back into the original data set (in order, since we had to drop the customer identifier); a sketch of the merge follows.
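    7. A minimal sketch of that merge, assuming the rows of utilization_normalized are still in the same order as the cluster assignments:

      > utilization_clustered <- cbind(utilization_normalized, cluster = un_dist_cluster$clustering)   # attach cluster id to each row
      > head(utilization_clustered)   # sanity-check the merge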
  6. Plot out the averages for each cluster by % of total utilization and by total volume of utilization to get some pretty graphs (a sketch of the calculation follows).

    [Charts: average cluster profiles, by % of total utilization (left) and by total volume (right)]
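    1. A minimal sketch of how the per-cluster averages could be computed and plotted in R, using the merged data frame from the previous step (the plotting details are my own, not the original Excel charts; the total-volume version would use the un-normalized readings instead):

      > hours <- paste0("H", 1:24)
      > cluster_avg <- aggregate(utilization_clustered[, hours],
      +                          by = list(cluster = utilization_clustered$cluster),
      +                          FUN = mean)                                  # average normalized profile per cluster
      > matplot(t(cluster_avg[, hours]), type = "l", lty = 1,
      +         xlab = "Hour of day", ylab = "Share of daily utilization",
      +         main = "Average profile by cluster")
      > legend("topleft", legend = cluster_avg$cluster, col = 1:nrow(cluster_avg), lty = 1)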
  7. Plug the distance measure into the KNN algorithm to identify each account's k nearest neighbors.
    1. This requires multiple runs with different k values to get a grasp on what the clusters are.
    2. This will give you a model that you can use.
  8. The newly created KNN model is trained and available for use!
  9. Plug "unclassified" accounts into the KNN-generated model to classify new accounts (a minimal sketch follows).
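    1. A minimal sketch of that classification, using the class package's knn() on the normalized hourly profiles with the PAM cluster labels as training classes. Note that knn() computes Euclidean distance on the features rather than reusing the DTW distances, so this is an approximation of the approach described above; the new-accounts file name is hypothetical.

      > library(class)
      > hours <- paste0("H", 1:24)
      > train_profiles <- utilization_normalized[, hours]
      > train_labels <- factor(un_dist_cluster$clustering)                                  # PAM cluster ids as classes
      > new_accounts <- read.csv("Unclassified_Normalized_WED_PECO.csv", header=TRUE)       # hypothetical file of unclassified accounts
      > predicted_cluster <- knn(train_profiles, new_accounts[, hours], cl = train_labels, k = 5)   # rerun with several k values
      > table(predicted_cluster)                                                            # how the new accounts distribute across the clusters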
  10. With the cluster identifier, an account can be grouped with similar accounts and targeted for marketing/sales, or used to simplify load forecasting.