✨ Readme and refactor

1 year ago · 006770bf43
parent cad4dc579f
commit 006770bf43
10 changed files with 21618 additions and 161 deletions
--- a/.DS_Store
+++ b/.DS_Store
--- a/README.md
+++ b/README.md
@ -0,0 +1,59 @@
 # Introduction
 This is MMIX, an AI application that predicts the winner of a fight between two MMA fighters with about 63% accuracy.
 The statistics come from a dataset available on [Kaggle](https://www.kaggle.com/datasets/rajeevw/ufcdata).
 The model used is a random forest.
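
As a rough illustration of how a random forest can produce a win probability, here is a minimal sketch on synthetic data. The two features (a reach difference and an age difference) and the generated labels are assumptions made up for this example; they are not columns or results from the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic, hypothetical features: reach difference and age difference (Red minus Blue)
X = rng.normal(size=(200, 2))
# Synthetic label: 1 = Red wins, 0 = Blue wins (loosely tied to the first feature)
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
forest.fit(X, y)

# predict_proba returns one column per class, ordered as in forest.classes_
new_fight = np.array([[1.2, -0.3]])
proba = forest.predict_proba(new_fight)[0]
print(dict(zip(forest.classes_, proba)))
```

Each probability is the average of the class probabilities predicted by the individual trees, which is why the values for the two corners always sum to 1.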
 # The data
 The data comes from a dataset covering the statistics of the world's largest MMA organization, the UFC (Ultimate Fighting Championship).
 ## What is MMA?
 MMA, or mixed martial arts, is a combat sport that combines striking and grappling (standing and on the ground).
 The sport owes much of its popularity to media coverage, as professional wrestling did some years ago. It only recently became mainstream in France, where it has been legal only since 2020.
 Before that date, French fighters practiced the sport outside France.
 The sport is very complex, and only fighters who master several fighting styles can climb the ranks.
 MMA brings together many fighting styles, such as boxing, Brazilian jiu-jitsu, wrestling, and sambo.
 As in boxing, MMA fighters only face opponents in their own weight class, with some exceptions (becoming a double champion, moving up a weight class...).
 ## What about French fighters?
 France currently has 4 fighters in the UFC's world top 15.
 [Ciryl Gane](https://www.ufc.com/athlete/ciryl-gane) - Top 2 in the men's Heavyweight division
 [Manon Fiorot](https://www.ufc.com/athlete/manon-fiorot) - Top 3 in the women's Flyweight division
 [Nassourdine Imavov](https://www.ufc.com/athlete/nassourdine-imavov) - Top 8 in the men's Middleweight division
 [Benoit Saint Denis](https://www.ufc.com/athlete/mariya-agapova-0) - Top 12 in the men's Lightweight division. He is on a very impressive 3-fight win streak and will fight in early April against the Top 3 [Dustin Poirier](https://www.ufc.com/athlete/dustin-poirier), an MMA legend
 # What to take away from the dataset?
 # LIST OF PLANNED VISUALIZATIONS
 **Win rate by finish method**: Analyze how often fights end by submission, KO, unanimous decision, split decision, etc.
 **Average fight duration**: Compute the average fight duration for different weight classes or for the UFC as a whole
 **Takedown success rate**: Examine the percentage of takedown attempts that fighters complete
 **Strike success rate**: Analyze the percentage of strikes landed out of the total number of strikes attempted
 **Distribution of finishes by round**: Determine in which round fights most often end (for example, submission in the first round, KO in the second round, etc.)
 **Performance variation with age**: Check whether there is a correlation between fighters' age and their success in the UFC
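
Several of the metrics listed above can be computed with a few lines of pandas. The sketch below uses a tiny hand-written table; the column names (`win_by`, `last_round`, `td_attempted`, and so on) and every value in it are assumptions for illustration and may not match the columns of the Kaggle files.

```python
import pandas as pd

# Hypothetical mini-dataset: column names and values are illustrative only
fights = pd.DataFrame({
    "win_by": ["KO/TKO", "Submission", "Decision - Unanimous", "KO/TKO"],
    "last_round": [1, 2, 3, 1],
    "td_attempted": [4, 6, 2, 0],
    "td_landed": [2, 3, 1, 0],
    "strikes_attempted": [50, 80, 120, 20],
    "strikes_landed": [25, 30, 60, 12],
})

# Share of fights ending by each finish method
finish_rate = fights["win_by"].value_counts(normalize=True)

# Overall takedown and strike success rates (total landed / total attempted)
td_rate = fights["td_landed"].sum() / fights["td_attempted"].sum()
strike_rate = fights["strikes_landed"].sum() / fights["strikes_attempted"].sum()

# Number of finishes per round, excluding fights that went to a decision
finishes = fights[~fights["win_by"].str.startswith("Decision")]
finishes_by_round = finishes.groupby("last_round").size()

print(finish_rate, td_rate, strike_rate, finishes_by_round, sep="\n")
```

Dividing the summed totals, rather than averaging per-fight ratios, avoids a division by zero for fights with no takedown attempt.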
 ![ZAK'S DAD](https://upload.wikimedia.org/wikipedia/commons/e/ec/Dana_White_-_London_2015_%28cropped%29.jpg)
 # DANA CACA WHITE
--- a/pycache/test.cpython-311.pyc
+++ b/pycache/test.cpython-311.pyc
--- a/archive/data.csv
+++ b/archive/data.csv
--- a/archive/fighter.csv
+++ b/archive/fighter.csv
--- a/archive/preprocessed.csv
+++ b/archive/preprocessed.csv
--- a/archive/totalfight.csv
+++ b/archive/totalfight.csv
--- a/image.png
+++ b/image.png
--- a/server.py
+++ b/server.py
@ -7,11 +7,20 @@ app = Flask(__name__)
 # Load the DataFrame only once to save resources
 df = pd.read_csv('archive/data.csv')  # Make sure to specify the correct path to your data file
 # Before April 2001, there were almost no rules in the UFC (no judges, no time limits, no rounds, etc.).
 # From that date on, the UFC started to implement a set of rules known as the
 # "Unified Rules of Mixed Martial Arts".
 # Therefore, we drop all fights that took place before this major change in the UFC's rules.
 # Using this old data would not be representative of current fights, especially since this
 # sport has become one of the most regulated because of its mix of disciplines and its complexity.
 limit_date = '2001-04-01'
 df = df[(df['date'] > limit_date)]
 # Display NaN values
 displayNumberOfNaNValues(df)
 # Define the list of important features to impute
 imp_features = ['R_Weight_lbs', 'R_Height_cms', 'B_Height_cms', 'R_age', 'B_age', 'R_Reach_cms', 'B_Reach_cms']
 imp_median = SimpleImputer(missing_values=np.nan, strategy='median')
@ -48,22 +57,34 @@ dfWithoutString = df.select_dtypes(include=['float64', 'int64'])
 plt.figure(figsize=(50, 40))
 corr_matrix = dfWithoutString.corr(method='pearson').abs()
 sns.heatmap(corr_matrix, annot=True)
 ## Show the correlation matrix of the dataframe
 ## Very laggy feature
 # plt.show()
 # Fight data up to 2015 is incomplete or unreliable, so only fighters active after this date are kept
 fighters = list_fighters(df,'2015-01-01')
 # Get all fights of every fighter except the last one (training set)
 df_train = build_df_all_but_last(df, fighters)
 # Get the last fight of every fighter to test the model
 df_test = build_df(df, fighters,0)
 #Creates a column transformer that encodes specified categorical columns ordinally 
 #while leaving other columns unchanged
 preprocessor = make_column_transformer((OrdinalEncoder(), ['weight_class', 'B_Stance', 'R_Stance']), remainder='passthrough')
-
+#These lines of code utilize LabelEncoder to encode the 'Winner' column into numerical labels for 
 #both training and testing datasets, followed by the separation of features and target variable for 
 #further processing.
 label_encoder = LabelEncoder()
 y_train = label_encoder.fit_transform(df_train['Winner'])
 y_test = label_encoder.transform(df_test['Winner'])
 X_train, X_test = df_train.drop(['Winner'], axis=1), df_test.drop(['Winner'], axis=1)
-# Random Forest composed of 100 decision trees. We optimized parameters using cross-validation and GridSearch tool paired together
+# Random Forest composed of 100 decision trees. We optimized parameters using cross-validation 
 #and GridSearch tool paired together
 random_forest = RandomForestClassifier(n_estimators=100, 
                                       criterion='entropy', 
                                       max_depth=10, 
@ -71,6 +92,7 @@ random_forest = RandomForestClassifier(n_estimators=100,
                                       min_samples_leaf=1, 
                                       random_state=0)
 # Train data
 model = Pipeline([('encoding', preprocessor), ('random_forest', random_forest)])
 model.fit(X_train, y_train)
@ -79,30 +101,25 @@ accuracies = cross_val_score(estimator=model, X=X_train, y=y_train, cv=5)
 print('Accuracy mean : ', accuracies.mean())
 print('Accuracy standard deviation : ', accuracies.std())
 # Test
 y_pred = model.predict(X_test)
 print('Testing accuracy : ', accuracy_score(y_test, y_pred), '\n')
 # Class definition
 target_names = ["Blue","Red"]
 print(classification_report(y_test, y_pred, labels=[0,1], target_names=target_names))
 # Declare feature
 feature_names = [col for col in X_train]
 # Set importances for every feature
 feature_importances = model['random_forest'].feature_importances_
 # Sort importances
 indices = np.argsort(feature_importances)[::-1]
 n = 30 # maximum feature importances displayed
 idx = indices[0:n] 
 # Standard deviation
 std = np.std([tree.feature_importances_ for tree in model['random_forest'].estimators_], axis=0)
-
+# Print and plot the top-n feature importances (disabled)
 #for f in range(n):
 #    print("%d. feature %s (%f)" % (f + 1, feature_names[idx[f]], feature_importances[idx[f]])) 
 # plt.figure(figsize=(30, 8))
 # plt.title("Feature importances")
 # plt.bar(range(n), feature_importances[idx], color="r", yerr=std[idx], align="center")
 # plt.xticks(range(n), [feature_names[id] for id in idx], rotation = 45) 
 # plt.xlim([-1, n]) 
 # plt.show()
 # Select a tree from your model
 tree_estimator = model['random_forest'].estimators_[10]
--- a/test.py
+++ b/test.py
@ -37,72 +37,6 @@ def displayNumberOfNaNValues(df):
    print('Number of features with NaN values:', len([x[1] for x in na if x[1] > 0]))
    print("Total NaN in dataframe :" , df.isna().sum().sum())
 # Before April 2001, there were almost no rules in the UFC (no judges, no time limits, no rounds, etc.).
 # From that date on, the UFC started to implement a set of rules known as the
 # "Unified Rules of Mixed Martial Arts".
 # Therefore, we drop all fights that took place before this major change in the UFC's rules.
 # Using this old data would not be representative of current fights, especially since this
 # sport has become one of the most regulated because of its mix of disciplines and its complexity.
    #limit_date = '2001-04-01'
    #df = df[(df['date'] > limit_date)]
 # Display NaN values
    #displayNumberOfNaNValues(df)
 # Define the list of important features to impute
 #imp_features = ['R_Weight_lbs', 'R_Height_cms', 'B_Height_cms', 'R_age', 'B_age', 'R_Reach_cms', 'B_Reach_cms']
 # Initialize a SimpleImputer to impute missing values with median
 #imp_median = SimpleImputer(missing_values=np.nan, strategy='median')
 # Iterate over each feature to impute missing values
 #for feature in imp_features:
    # Fit and transform the feature using median imputation
    #imp_feature = imp_median.fit_transform(df[feature].values.reshape(-1,1))
    # Assign the imputed values back to the DataFrame
    #df[feature] = imp_feature
 # Impute missing values for 'R_Stance' using most frequent strategy
 #imp_stance_R = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
 #imp_R_stance = imp_stance_R.fit_transform(df['R_Stance'].values.reshape(-1,1))
 #
 ## Impute missing values for 'B_Stance' using most frequent strategy
 #imp_stance_B = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
 #imp_B_stance = imp_stance_B.fit_transform(df['B_Stance'].values.reshape(-1,1))
 #
 ## Create DataFrames for imputed stances
 #df['R_Stance'] = pd.DataFrame(imp_R_stance, columns=['R_Stance'])
 #df['B_Stance'] = pd.DataFrame(imp_B_stance, columns=['B_Stance'])
 #
 ## drop B_avg_BODY_att values in the dataframe
 #    # List of features with NaN values to drop
 #    #na_features = ['B_avg_BODY_att', 'R_avg_BODY_att']
 #
 #    # Drop rows with NaN values in specified features
 #    #df.dropna(subset=na_features, inplace=True)
 #
 ## Drop columns 'Referee' and 'location' from the DataFrame
 ## The referee and the location have little impact on the outcome of a fight, so they are not worth keeping
 #df.drop(['Referee', 'location'], axis=1, inplace=True)
 #
 ## Drop the 'B_draw' and 'R_draw' columns, then remove drawn fights and 'Catch Weight' fights
 #df.drop(['B_draw', 'R_draw'], axis=1, inplace=True)
 #df = df[df['Winner'] != 'Draw']
 #df = df[df['weight_class'] != 'Catch Weight']
 #
 ## Keep only the columns whose data type is float or int
 #dfWithoutString = df.select_dtypes(include=['float64', 'int64'])
 #
 #plt.figure(figsize=(50, 40))
 #corr_matrix = dfWithoutString.corr(method='pearson').abs()
 #sns.heatmap(corr_matrix, annot=True)
 #
 ## Show the correlation matrix of the dataframe
 ## Very laggy feature
 #
 # plt.show()
 #  i = index of the fighter's fight, 0 means the last fight, -1 means first fight
 def select_fight_row(df, name, i): 
@ -114,10 +48,6 @@ def select_fight_row(df, name, i):
    arr = df_temp.iloc[i,:].values
    return arr
 #  we get the last fight of Khabib :'(
 #print(select_fight_row(df, 'Khabib Nurmagomedov', 0))
 # get all active UFC fighters (according to the limit_date parameter)
 def list_fighters(df, limit_date):
    # Filter the DataFrame to include only fights occurring after the specified limit date
@ -133,9 +63,6 @@ def list_fighters(df, limit_date):
    # Return the list of unique fighters
    return fighters
 # Fight data up to 2015 is incomplete or unreliable, so only fighters active after this date are kept
 #fighters = list_fighters(df,'2015-01-01')
 def build_df(df, fighters, i):      
    arr = [select_fight_row(df, fighters[f], i) for f in range(len(fighters)) if select_fight_row(df, fighters[f], i) is not None]
    cols = [col for col in df] 
@ -168,77 +95,6 @@ def build_df_all_but_last(df, fighters):
    return df_fights
 #
 #df_train = build_df_all_but_last(df, fighters)
 #df_test = build_df(df, fighters,0)
 #
 #preprocessor = make_column_transformer((OrdinalEncoder(), ['weight_class', 'B_Stance', 'R_Stance']), remainder='passthrough')
 #
 ## If the winner is from the Red corner, Winner label will be encoded as 1, otherwise it will be 0 (Blue corner)
 #label_encoder = LabelEncoder()
 #y_train = label_encoder.fit_transform(df_train['Winner'])
 #y_test = label_encoder.transform(df_test['Winner'])
 #
 #X_train, X_test = df_train.drop(['Winner'], axis=1), df_test.drop(['Winner'], axis=1)
 #
 ## Random Forest composed of 100 decision trees. We optimized parameters using cross-validation and GridSearch tool paired together
 #random_forest = RandomForestClassifier(n_estimators=100, 
 #                                       criterion='entropy', 
 #                                       max_depth=10, 
 #                                       min_samples_split=2,
 #                                       min_samples_leaf=1, 
 #                                       random_state=0)
 #
 #model = Pipeline([('encoding', preprocessor), ('random_forest', random_forest)])
 #model.fit(X_train, y_train)
 #
 ## We use 5-fold cross-validation to get a more reliable accuracy estimate (lower variance)
 #accuracies = cross_val_score(estimator=model, X=X_train, y=y_train, cv=5)
 #print('Accuracy mean : ', accuracies.mean())
 #print('Accuracy standard deviation : ', accuracies.std())
 #
 #y_pred = model.predict(X_test)
 #print('Testing accuracy : ', accuracy_score(y_test, y_pred), '\n')
 #
 #target_names = ["Blue","Red"]
 #print(classification_report(y_test, y_pred, labels=[0,1], target_names=target_names))
 #
 ## cm = confusion_matrix(y_test, y_pred) 
 ## ax = plt.subplot()
 # sns.heatmap(cm, annot = True, ax = ax, fmt = "d")
 # ax.set_xlabel('Predicted')
 # ax.set_ylabel('Actual')
 # ax.set_title("Confusion Matrix")
 # ax.xaxis.set_ticklabels(['Blue', 'Red'])
 # ax.yaxis.set_ticklabels(['Blue', 'Red'])
 # plt.show()
 #feature_names = [col for col in X_train]
 #feature_importances = model['random_forest'].feature_importances_
 #indices = np.argsort(feature_importances)[::-1]
 #n = 30 # maximum feature importances displayed
 #idx = indices[0:n] 
 #std = np.std([tree.feature_importances_ for tree in model['random_forest'].estimators_], axis=0)
 #for f in range(n):
 #    print("%d. feature %s (%f)" % (f + 1, feature_names[idx[f]], feature_importances[idx[f]])) 
 # plt.figure(figsize=(30, 8))
 # plt.title("Feature importances")
 # plt.bar(range(n), feature_importances[idx], color="r", yerr=std[idx], align="center")
 # plt.xticks(range(n), [feature_names[id] for id in idx], rotation = 45) 
 # plt.xlim([-1, n]) 
 # plt.show()
 # Select a tree from your model
 #tree_estimator = model['random_forest'].estimators_[10]
 # Plot the tree
 # plt.figure(figsize=(1, 1))
 # plot_tree(tree_estimator, feature_names=df_train.columns, filled=True, rounded=True, fontsize=10)
 # plt.savefig('tree.png', dpi=600)  # Save the image in PNG format
 # plt.show()
 def predict(df, pipeline, blue_fighter, red_fighter, weightclass, rounds, title_bout=False): 
    try:
        # We build two dataframes, one for each fighter
@ -285,7 +141,6 @@ def predict(df, pipeline, blue_fighter, red_fighter, weightclass, rounds, title_
 #predict(df, model, 'Leon Edwards', 'Belal Muhammad', 'Welterweight', 3, True)
 #predict(df, model, 'Conor McGregor', 'Khabib Nurmagomedov', 'Lightweight', 5, True)
 #predict(df, model, 'Conor McGregor', 'Tai Tuivasa', 'Heavyweight', 5, True)
 #
 #predict(df,model,'Charles Oliveira','Conor McGregor','Lightweight',5,True)
 #predict(df,model,'Charles Oliveira','Khabib Nurmagomedov','Lightweight',5,True)
 #predict(df, model, 'Leon Edwards', 'Kamaru Usman', 'Welterweight', 5, True)