This study was conducted to identify the predictors of Body Mass Index (BMI). There were two research questions. The first was “How well do the type of chocolate and the frequency of chocolate consumption predict body mass index, after controlling for gender and physical activity?” The second was “How well do fat percentage and cacao percentage in chocolate explain body mass index, after controlling for the predictors in the first research question?” Hierarchical regression analysis was used to identify the predictors. BMI was the outcome variable; gender, type of chocolate, fat rate in chocolate, cacao rate in chocolate, frequency of chocolate consumption and frequency of physical activity in a week were the predictor variables. The study was conducted with 600 university students.

Method

Participants and the Variables

The sample consisted of 600 Middle East Technical University students; 46.3% (n = 278) were male and 53.7% (n = 322) were female. Participants were selected by convenience sampling. Data were collected in the most crowded places of the university, such as the library, the market area and the dormitory area.

The minimum sample size for multiple regression can be estimated with the formula 50 + 8m, where m is the number of predictors. With seven predictors, the required sample size is 106 (50 + 8 × 7). With 600 students, the sample is large enough to conduct multiple regression.
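The rule of thumb above is simple arithmetic, and a minimal sketch of it is given here. The function name `required_sample_size` is illustrative and not part of the original analysis.

```python
def required_sample_size(n_predictors: int) -> int:
    """Rule-of-thumb minimum N for multiple regression: 50 + 8 * m."""
    if n_predictors < 1:
        raise ValueError("at least one predictor is required")
    return 50 + 8 * n_predictors


n_predictors = 7    # number of predictors used in the formula in the text
n_students = 600    # sample size of the study

minimum_n = required_sample_size(n_predictors)
print(minimum_n)                 # 106
print(n_students >= minimum_n)   # True
```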

The questionnaire used in this study consisted of seven items, which are presented in Table 1. In addition, each participant has an id number. In total, the data file contains six continuous and two categorical variables.

Table 1

List of variables and brief descriptions in the data file

| Variable Name | Description of the variable |
|---|---|
| Id | Identity number of each participant |
| BMI | Body Mass Index |
| Gender | Gender (1: Male; 2: Female) |
| Type | Type of chocolate (1: Milk; 2: Berry; 3: Peanut) |
| Fat | Fat rate (%) in chocolate |
| Cacao | Cacao rate (%) in chocolate |
| Frequency | Frequency of chocolate consumption (number of chocolates eaten in the last week) |
| Activity | Frequency of physical activity in a week |

Data Analysis Plan

In this study, hierarchical regression will be used to find out how much of the variance in the dependent variable, BMI, the predictors explain. In hierarchical regression, different models are tested sequentially. In contrast to stepwise regression, the researcher decides the order in which the predictors enter the model.

Three models will be used to determine how well the independent variables predict the dependent variable. In the first model, gender and frequency of physical activity in a week will be entered. In the second model, type of chocolate and frequency of chocolate consumption will be added, with gender and frequency of physical activity in a week controlled. In the third model, fat percentage and cacao percentage in chocolate will be added, with gender, frequency of physical activity in a week, type of chocolate and frequency of chocolate consumption controlled.
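The nesting of the three models can be sketched as follows. This is only an illustration of the entry order, not the analysis itself; the short column names (including the two dummy variables for type of chocolate) are assumed labels.

```python
from itertools import accumulate

# Entry blocks in the order fixed by the researcher (names are illustrative).
blocks = [
    ["gender", "activity"],
    ["milkvsberry", "milkvspeanut", "frequency"],
    ["fat", "cacao"],
]

# Model k contains every predictor from blocks 1..k.
models = list(accumulate(blocks, lambda entered, block: entered + block))

for number, predictors in enumerate(models, start=1):
    print(f"Model {number}: {', '.join(predictors)}")
```

Each model keeps all earlier predictors, which is what "controlling for" the earlier blocks means in this design.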

To conduct the regression analysis, categorical data should be recoded. There are three ways to do this: dummy coding, effects coding and contrast coding. In this study, dummy coding will be used. In dummy coding, a categorical variable is recoded into a set of new variables, one fewer than the number of categories. A categorical variable needs to be recoded only if it has at least three levels; a variable with two levels, such as gender, need not be recoded. This study has two categorical variables: gender and type of chocolate. As mentioned above, gender need not be recoded. The other categorical variable, type of chocolate, must be recoded. Milk chocolate will be the reference category, and the two new variables will be named milkvsberry and milkvspeanut.
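A minimal sketch of this dummy coding is shown below, using the category codes from Table 1. The helper name `dummy_code` is hypothetical; in practice the recoding would be done in the statistics package.

```python
# Codes follow Table 1: 1 = milk (reference), 2 = berry, 3 = peanut.
MILK, BERRY, PEANUT = 1, 2, 3


def dummy_code(chocolate_type: int) -> dict:
    """Return the two dummy variables for one participant's chocolate type."""
    if chocolate_type not in (MILK, BERRY, PEANUT):
        raise ValueError(f"unknown chocolate type: {chocolate_type}")
    return {
        "milkvsberry": int(chocolate_type == BERRY),
        "milkvspeanut": int(chocolate_type == PEANUT),
    }


for code in (MILK, BERRY, PEANUT):
    print(code, dummy_code(code))
```

The reference category, milk chocolate, is the one coded 0 on both new variables, so each coefficient compares one type with milk chocolate.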

Like all other multivariate statistical methods, multiple regression has several assumptions, and all of them should be checked before conducting the analysis. The first assumption is normality. Unlike other multivariate analyses, regression analysis checks whether the errors are normally distributed. Secondly, multicollinearity, which is a high level of intercorrelation among the predictor variables, should be checked. Thirdly, homoscedasticity should be checked. Homoscedasticity means that the variance of the error term is constant across all values of the predictors, so no pattern should be visible on the scatter plot of residuals. The fourth assumption is independence: the error term is independent of the predictors in the model and of the error terms of other cases. The fifth assumption is linearity. Lastly, outliers should be checked to see whether they affect the results. Partial plots, leverage statistics, Cook’s D, DFBeta and Mahalanobis distance can be used to identify outliers.

Results

Descriptive Statistics

Table 2 shows the descriptive statistics of the study. There is no missing data (N = 600 for every variable); the mean of the dependent variable, BMI, is 24.65 and its standard deviation is 4.48.

Table 2

Descriptive Statistics

| Variable | Mean | Std. Deviation | N |
|---|---|---|---|
| body mass index | 24.65 | 4.48 | 600 |
| Gender | 1.54 | .50 | 600 |
| physical activity in a week | 2.62 | .74 | 600 |
| milk chocolate vs berry chocolate | .25 | .44 | 600 |
| milk chocolate vs peanut chocolate | .27 | .45 | 600 |
| frequency of chocolate consumption | 4.66 | .73 | 600 |
| fat rate (%) in chocolate | 51.70 | 9.69 | 600 |
| cacao rate (%) in chocolate | 51.95 | 9.96 | 600 |

Table 3 shows the correlations between the variables. The variable most strongly correlated with BMI is fat rate in chocolate; the correlation is positive and high (r = .64). On the other hand, BMI is almost uncorrelated with gender (r = -.03), physical activity in a week (r = .04) and milk chocolate vs berry chocolate (r = -.03). Moreover, no correlation between the independent variables is higher than .90.

Table 3

Correlation Matrix

| Pearson Correlation | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| body mass index (1) | 1.00 | | | | | | | |
| Gender (2) | -.03 | 1.00 | | | | | | |
| physical activity in a week (3) | .04 | -.13 | 1.00 | | | | | |
| milk chocolate vs berry chocolate (4) | -.03 | .03 | -.11 | 1.00 | | | | |
| milk chocolate vs peanut chocolate (5) | .23 | -.02 | .12 | -.36 | 1.00 | | | |
| frequency of chocolate consumption (6) | .31 | .12 | .15 | -.05 | .19 | 1.00 | | |
| fat rate (%) in chocolate (7) | .64 | -.12 | .08 | .02 | .21 | .30 | 1.00 | |
| cacao rate (%) in chocolate (8) | .52 | .08 | .03 | -.04 | .22 | .28 | .51 | 1.00 |

Assumptions

The first assumption of multiple regression to be checked is normality. Unlike in other analyses, it is the residuals that are checked for normality. Normality of residuals can be checked in two ways: with a histogram and with a P-P plot. Figure 1 shows the histogram of the regression standardized residuals. The frequency distribution of the residuals is close to the normal distribution curve, so the residuals appear to be normally distributed. Figure 2 shows the P-P plot of the regression standardized residuals, which also indicates that the errors are normally distributed. It can be said that the first assumption of multiple regression, normality, is not violated.

Figure 1 Histogram of Regression Standardized Residual

Figure 2 P-P Plot of Regression Standardized Residual

The second assumption of multiple regression to be checked is multicollinearity. Multicollinearity can be checked with the correlation matrix, VIF values or tolerance values. No correlation between two independent variables should be higher than .90. The correlation matrix (Table 3) contains no correlation higher than .90 between two independent variables. Table 4 shows the collinearity statistics of all three models. VIF values higher than four or tolerance values lower than .20 are indicators of multicollinearity. Table 4 shows that there is no VIF value higher than four and no tolerance value lower than .20. So, the assumption of no multicollinearity is not violated.

Table 4

Collinearity Statistics

| Model | Predictor | Tolerance | VIF |
|---|---|---|---|
| 1 | (Constant) | | |
| | Gender | .98 | 1.02 |
| | physical activity in a week | .98 | 1.02 |
| 2 | (Constant) | | |
| | Gender | .96 | 1.04 |
| | physical activity in a week | .94 | 1.06 |
| | milk chocolate vs berry chocolate | .87 | 1.15 |
| | milk chocolate vs peanut chocolate | .84 | 1.19 |
| | frequency of chocolate consumption | .93 | 1.08 |
| 3 | (Constant) | | |
| | Gender | .92 | 1.08 |
| | physical activity in a week | .94 | 1.06 |
| | milk chocolate vs berry chocolate | .86 | 1.17 |
| | milk chocolate vs peanut chocolate | .80 | 1.24 |
| | frequency of chocolate consumption | .84 | 1.19 |
| | fat rate (%) in chocolate | .67 | 1.49 |
| | cacao rate (%) in chocolate | .70 | 1.43 |
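The two collinearity statistics carry the same information, because VIF is the reciprocal of tolerance. The sketch below applies the cutoffs used in this study (VIF above 4, tolerance below .20) to the Model 3 tolerance values from Table 4. The short variable names are illustrative, and because the inputs are rounded, the recomputed VIF values can differ from the table in the second decimal.

```python
# Tolerance values of Model 3, copied from Table 4.
tolerance = {
    "gender": 0.92,
    "activity": 0.94,
    "milkvsberry": 0.86,
    "milkvspeanut": 0.80,
    "frequency": 0.84,
    "fat": 0.67,
    "cacao": 0.70,
}

VIF_CUTOFF = 4.0          # VIF above this signals multicollinearity
TOLERANCE_CUTOFF = 0.20   # tolerance below this signals multicollinearity


def vif(tol: float) -> float:
    """VIF is the reciprocal of tolerance."""
    return 1.0 / tol


flagged = [
    name for name, tol in tolerance.items()
    if tol < TOLERANCE_CUTOFF or vif(tol) > VIF_CUTOFF
]

for name, tol in tolerance.items():
    print(f"{name:13s} tolerance={tol:.2f}  VIF={vif(tol):.2f}")
print("flagged:", flagged)
```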

The third assumption of multiple regression to be checked is homoscedasticity. A scatter plot of predicted values against residuals is used to check homoscedasticity. No pattern should be visible on the scatter plot. Figure 4 shows no pattern; so, the homoscedasticity assumption is not violated.

Figure 4 Scatter plot of predicted value and residual

The fourth assumption of multiple regression to be checked is independence. Independence is affected by the order of the independent variables and can be ignored if that order is not important. The order of the independent variables is important in this study; so, independence should be checked. Independence is checked with the Durbin-Watson statistic, which tests whether the residuals of adjacent cases are correlated and should be between 1.5 and 2.5. The Durbin-Watson value of the model is 1.88; so, the independence assumption is not violated.
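For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below shows that calculation together with the 1.5 to 2.5 band used here. The residuals in the example are made up for illustration and are not the study data.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def independence_ok(dw, low=1.5, high=2.5):
    """Apply the acceptance band used in this study."""
    return low <= dw <= high


# Made-up residuals, for illustration only (not the study data).
toy_residuals = [0.5, -0.3, 0.1, -0.4, 0.2, 0.3, -0.6, 0.2]
print(round(durbin_watson(toy_residuals), 2))

# The value reported for this study falls inside the band.
print(independence_ok(1.88))   # True
```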

The last assumption of multiple regression is linearity. We assume that linearity is not violated in this study.

Influential Observations

The data should be checked for outliers, because outliers can cause misleading results. There are different ways of checking for outliers in multiple regression, such as partial plots, leverage statistics, Cook’s D, DFBeta and Mahalanobis distance. Each method uses a different calculation; so, several methods should be used before deciding whether a case is an outlier.

First, the partial plots of the dependent variable against each independent variable were examined (see Figures 5, 6, 7, 8 and 9). Some cases that could be outliers are visible on each partial plot. However, judging outliers from partial plots is subjective, so the other methods should also be used. A decision can be made only after all methods have been applied.

Figure 5 Partial Plot of BMI and physical activity in a week

Figure 6 Partial Plot of BMI and milk chocolate vs peanut chocolate

Figure 7 Partial Plot of BMI and frequency of chocolate consumption

Figure 8 Partial Plot of BMI and fat rate in chocolate

Figure 9 Partial Plot of BMI and cacao rate in chocolate

After the partial plots, leverage values can be checked to identify outliers. Table 5 shows that no case has a leverage value higher than .50. According to the leverage values, there is no outlier.

Table 5

Extreme Values of Leverage Test

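| Centered Leverage Value | Rank | Case Number | Value |
|---|---|---|---|
| Highest | 1 | 448 | .04 |
| | 2 | 384 | .04 |
| | 3 | 141 | .03 |
| | 4 | 324 | .03 |
| | 5 | 592 | .03 |
| Lowest | 1 | 196 | .00 |
| | 2 | 103 | .00 |
| | 3 | 535 | .05 |
| | 4 | 160 | .05 |
| | 5 | 8 | .05 |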

After the leverage values, Cook’s distance can be checked. A case whose Cook’s distance is greater than the mean plus two standard deviations can be regarded as an outlier. In this study the critical value is .008 (.002 + 2 × .003). The maximum Cook’s distance is .03, which exceeds this critical value, so some cases are flagged as potential outliers. The boxplot of Cook’s distance (Figure 10) shows that cases 499, 438, 449, 236, 284, 484, 37, 354, 137, 97, 324 and 165 could be outliers. On the other hand, according to Cook and Weisberg (1982), only values greater than 1 should be regarded as outliers. By this criterion, it can be assumed that there is no outlier.

Figure 10 Boxplot of Cook’s distance
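The two Cook’s distance criteria discussed above lead to different conclusions, which the following sketch makes explicit. It uses only the summary values reported in the text; the constant and function names are illustrative.

```python
COOK_MEAN = 0.002   # mean Cook's distance reported in the text
COOK_SD = 0.003     # standard deviation reported in the text
COOK_MAX = 0.03     # largest Cook's distance reported in the text


def mean_plus_2sd_cutoff(mean, sd):
    """Data-driven cutoff: mean + 2 * standard deviation."""
    return mean + 2 * sd


data_driven = mean_plus_2sd_cutoff(COOK_MEAN, COOK_SD)
absolute = 1.0   # Cook and Weisberg (1982) criterion cited in the text

print(f"data-driven cutoff: {data_driven:.3f}")
print("exceeds data-driven cutoff:", COOK_MAX > data_driven)
print("exceeds absolute cutoff:", COOK_MAX > absolute)
```

The largest value is above the data-driven cutoff but far below 1, which is why the boxplot flags cases while the absolute criterion flags none.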

After Cook’s distance, the DFBeta values of each independent variable can be checked. DFBeta shows the change in a regression coefficient when a given case is deleted. According to Field (2009), a case can be an outlier if the absolute value of its DFBeta is higher than one. According to Stevens (2002), a case can be an outlier if the absolute value of its DFBeta is higher than two. In this study no case has an absolute DFBeta value higher than one (see Figure 11). According to the DFBeta values, there is no outlier in this study.

Figure 11 Boxplots of DF Beta values of Independent Variables

Lastly, Mahalanobis distance can be checked to identify outliers. Any case whose Mahalanobis distance is greater than the critical chi-square value at α = .001 can be regarded as an outlier. The critical value at α = .001 with seven predictors is 24.32. Table 6 shows the extreme values for this study, and no value is greater than 24.32. According to the Mahalanobis distance, there is no outlier.

Table 6

Extreme Values of Mahalanobis Distance

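| Mahalanobis Distance | Rank | Case Number | Value |
|---|---|---|---|
| Highest | 1 | 448 | 23.72 |
| | 2 | 384 | 20.90 |
| | 3 | 141 | 20.50 |
| | 4 | 324 | 19.15 |
| | 5 | 592 | 17.99 |
| Lowest | 1 | 196 | 2.62 |
| | 2 | 103 | 2.62 |
| | 3 | 535 | 2.78 |
| | 4 | 160 | 2.78 |
| | 5 | 8 | 2.78 |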
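As a rough check on the cutoff of 24.32, the sketch below computes the upper-tail probability of a chi-square variable with seven degrees of freedom using only the standard library. It relies on the closed form that exists for odd degrees of freedom; the function name `chi2_sf_odd_df` is hypothetical, and a statistics package would normally be used for this.

```python
import math


def chi2_sf_odd_df(x, df):
    """Upper-tail probability of a chi-square variable for odd df (closed form)."""
    if df < 1 or df % 2 == 0:
        raise ValueError("this closed form only covers odd degrees of freedom")
    if x < 0:
        raise ValueError("x must be non-negative")
    chi = math.sqrt(x)
    normal_pdf = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    # Sum of chi^(2r-1) / (1 * 3 * ... * (2r-1)) for r = 1 .. (df - 1) / 2
    total, term = 0.0, chi
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return math.erfc(chi / math.sqrt(2)) + 2 * normal_pdf * total


CRITICAL = 24.32   # cutoff quoted in the text for seven predictors
LARGEST = 23.72    # highest Mahalanobis distance in Table 6 (case 448)

print(f"tail probability at the cutoff: {chi2_sf_odd_df(CRITICAL, 7):.4f}")
print("largest distance exceeds cutoff:", LARGEST > CRITICAL)
```

The tail probability at 24.32 comes out very close to .001, in line with the alpha level stated in the text.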

The results of the tests can be summarized as follows:

Partial plots show that there could be outliers,

Leverage values show that there is no outlier,

Cook’s distance values show that there is no outlier,

DFBeta values show that there is no outlier,

Mahalanobis distance values show that there is no outlier.

According to the results of these tests, it can be assumed that there is no outlier.

Regression Results

A hierarchical regression analysis was conducted to identify the predictors of BMI. Three models were examined to understand how much variance each set of predictors explains. Table 7 shows the summary of the three models. The first model is not statistically significant; the second and third models are significant.

In the first model, gender and physical activity in a week were the predictors. This model explains .2% of the total variance and is not significant, F (2, 597) = .67; p > .05.

In the second model, milk chocolate vs berry chocolate, milk chocolate vs peanut chocolate and frequency of chocolate consumption were added as predictors after controlling for the effect of gender and physical activity in a week. This step explains a significant 13% of the total variance, F (3, 594) = 28.901; p < .01.

In the third model, cacao rate (%) in chocolate and fat rate (%) in chocolate were added as predictors of BMI after controlling for the effect of gender, physical activity in a week, milk chocolate vs berry chocolate, milk chocolate vs peanut chocolate and frequency of chocolate consumption. This step explains a further 34% of the total variance, which is significant, F (2, 592) = 189.154, p < .01.

Table 7

Regression Analysis Model Summary

| Model | R | R2 | ΔR2 | ΔF | df1 | df2 | Sig. ΔF | Durbin-Watson |
|---|---|---|---|---|---|---|---|---|
| 1 | .05a | .00 | .00 | .69 | 2 | 597 | .50 | |
| 2 | .36b | .13 | .13 | 28.90 | 3 | 594 | .00 | |
| 3 | .69c | .47 | .34 | 189.15 | 2 | 592 | .00 | 1.879 |

a. Predictors: (Constant), physical activity in a week, gender

b. Predictors: (Constant), physical activity in a week, gender, milk chocolate vs berry chocolate, frequency of chocolate consumption, milk chocolate vs peanut chocolate

c. Predictors: (Constant), physical activity in a week, gender, milk chocolate vs berry chocolate, frequency of chocolate consumption, milk chocolate vs peanut chocolate, cacao rate (%) in chocolate, fat rate (%) in chocolate

d. Dependent Variable: body mass index
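The F statistic for each step can be recovered from the change in R2, the cumulative R2 and the degrees of freedom. The sketch below applies the standard F-change formula to the rounded values in Table 7. Because R2 is rounded to two decimals there, the results only approximate the reported values of 28.90 and 189.15. The function name `f_change` is illustrative.

```python
def f_change(delta_r2, r2_full, df1, df2):
    """F statistic for the R^2 increase when df1 predictors are added to a model."""
    return (delta_r2 / df1) / ((1.0 - r2_full) / df2)


# (delta R^2, cumulative R^2, df1, df2) as rounded in Table 7.
steps = {
    2: (0.13, 0.13, 3, 594),
    3: (0.34, 0.47, 2, 592),
}

for model, (delta_r2, r2_full, df1, df2) in steps.items():
    value = f_change(delta_r2, r2_full, df1, df2)
    print(f"Model {model}: F change is approximately {value:.1f}")
```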

Table 8 shows the coefficients of the hierarchical regression analysis, including the significance of each predictor and the variance it uniquely explains. In the first model, none of the predictors significantly predicts the dependent variable, BMI. Neither the model nor its predictors are statistically significant, F (2, 597) = .67; p > .05.

In the second model, the overall model is significant, F (3, 594) = 28.901; p < .01. Only one predictor in this model, milk chocolate vs berry chocolate, is not statistically significant. Because milk chocolate is the reference category, this means that there is no significant difference in BMI between milk chocolate and berry chocolate. Milk chocolate vs peanut chocolate is significant, which means that BMI differs significantly between milk chocolate and peanut chocolate. Because the coefficient is positive, the mean BMI for peanut chocolate is higher than the mean for milk chocolate. There is also a positive relationship between frequency of chocolate consumption and BMI: as one increases, the other increases. Milk chocolate vs peanut chocolate uniquely explains 3% of the total variance, and frequency of chocolate consumption uniquely explains 7%.

In the third model, the overall model is significant, F (2, 592) = 189.154, p < .01. Both predictors entered in this step significantly predict the dependent variable. There are positive relationships between fat rate in chocolate and BMI, and between cacao rate in chocolate and BMI. Fat rate in chocolate uniquely explains 15% of the total variance, and cacao rate in chocolate uniquely explains 4%.

Table 8

Coefficients of Hierarchical Regression Analysis

| Model | Predictor | B | Std. Error | Beta | t | p | Part |
|---|---|---|---|---|---|---|---|
| 1 | (Constant) | 24.419 | .941 | | 25.938 | .000 | |
| | Gender | -.232 | .370 | -.026 | -.628 | .530 | -.026 |
| | physical activity in a week | .226 | .251 | .037 | .900 | .369 | .037 |
| 2 | (Constant) | 17.165 | 1.309 | | 13.110 | .000 | |
| | milk chocolate vs berry chocolate | .539 | .423 | .052 | 1.273 | .204 | .049 |
| | milk chocolate vs peanut chocolate | 1.943 | .420 | .193 | 4.629 | .000 | .177 |
| | frequency of chocolate consumption | 1.751 | .245 | .283 | 7.135 | .000 | .273 |
| 3 | (Constant) | 5.426 | 1.191 | | 4.557 | .000 | |
| | fat rate (%) in chocolate | .221 | .017 | .477 | 13.033 | .000 | .390 |
| | cacao rate (%) in chocolate | .109 | .016 | .242 | 6.766 | .000 | .203 |

a. Dependent Variable: body mass index
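The unique variance percentages quoted in the text correspond to the squared part (semipartial) correlations in Table 8. The sketch below shows that arithmetic; the function name is illustrative. Note that each part correlation belongs to the model in which the predictor was entered (Model 2 for the chocolate type and frequency variables, Model 3 for fat and cacao rates).

```python
# Part (semipartial) correlations copied from Table 8.
part_correlations = {
    "milk chocolate vs berry chocolate": 0.049,
    "milk chocolate vs peanut chocolate": 0.177,
    "frequency of chocolate consumption": 0.273,
    "fat rate (%) in chocolate": 0.390,
    "cacao rate (%) in chocolate": 0.203,
}


def unique_variance_percent(part_r):
    """Squared part correlation, expressed as a percentage of total variance."""
    return 100 * part_r ** 2


for name, part_r in part_correlations.items():
    print(f"{name}: {unique_variance_percent(part_r):.1f}%")
```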

Discussion

This study sought to answer two research questions. The first was “How well do the type of chocolate and the frequency of chocolate consumption predict body mass index, after controlling for gender and physical activity?” The second was “How well do fat percentage and cacao percentage in chocolate explain body mass index, after controlling for the predictors in the first research question?”

A hierarchical regression analysis was conducted to answer the research questions. Three models were examined to find the predictors and their contribution to each model. The first model examined how well gender and physical activity in a week predict the dependent variable. The results show that neither the model nor its predictors significantly predict BMI.

The second model was examined to answer the first research question. This model explains 13% of the total variance. Milk chocolate vs berry chocolate does not significantly explain BMI. Milk chocolate vs peanut chocolate uniquely explains 3% of the total variance, and frequency of chocolate consumption uniquely explains 7%.

The third model was examined to answer the second research question. This model explains 47% of the total variance, of which 34% is added by the predictors entered in this step. Fat rate in chocolate uniquely explains 15% of the total variance, and cacao rate in chocolate uniquely explains 4%.

Across all models, fat rate in chocolate is the best predictor of BMI, uniquely explaining 15% of the total variance. Frequency of chocolate consumption is second, uniquely explaining 7%. Cacao rate in chocolate is third, uniquely explaining 4%.