Medicinal Chemistry Applet

Quantitative Structure-Activity Relationships (QSAR)
multiple linear regression


Introduction

While QSAR analysis as described by Hansch can be used to analyze the effect of just one independent variable, multiple independent variables are typically investigated simultaneously.  The so-termed "Hansch equation" (eq. 1) demonstrates this point, as it invokes three independent variables.  In this equation, the three variables, σ, π, and π², relate to electronic effects (σ) and lipophilicity (π and π²).  k, k', ρ, and k" are all regression coefficients.  Whether involving just one or multiple independent variables, the theory behind QSAR analysis is the same (see QSAR applet).

log (1/C) = -kπ + k'π² + ρσ + k"     (1)

Linear regressions on simple x,y-data (one independent variable) are trivial and may be readily solved algebraically.  Regressions with several independent variables are traditionally solved with matrices and linear algebra.  At a minimum, the number of data points (observations) must be one greater than the number of independent variables.  The extra data point is required to accommodate the constant k" term.
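The algebraic solution for one independent variable can be sketched in a few lines of Python.  This is a minimal illustration, not the applet's code: the function name `simple_linear_fit` and the sample points are made up for the example.

```python
def simple_linear_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a0 + a1*x."""
    n = len(xs)
    if n < 2:
        # One independent variable plus the constant term needs two points.
        raise ValueError("need at least two data points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x-values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up points that lie exactly on y = 1 + 2x.
a0, a1 = simple_linear_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(a0, a1)
```

With more than one independent variable there is no comparably short closed form, which is why the matrix approach is used.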

Multiple linear regressions are often described with Equation 2.  The x-variables contribute to a greater or lesser degree to the y-value.  The degree of contribution is measured through the coefficients on the x-variables (a₁ through aₙ).  In QSAR, y is log (1/C) and the x-variables are the molecular parameters (for example, σ, π, and π²).

y = a₀ + a₁x₁ + a₂x₂ + ... + aₙxₙ     (2)

Determining the values of the a-coefficients in Equation 2 requires assembling matrices that are filled with the values of the x-variables and the corresponding y-values.  The required result is given by Equation 3.  X is a matrix that contains the x-variable data with an extra column (a column of ones) for the a₀ term.  Xᵀ is the transpose of X.  Y is a one-column matrix containing all the y-values.  β is a one-column matrix containing the values of a₀ through aₙ.

β = (XᵀX)⁻¹XᵀY     (3)
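Equation 3 can be sketched in plain Python as follows.  This is an illustration under stated assumptions rather than the applet's implementation: all function names are hypothetical, the data are made up, and the code solves the system (XᵀX)β = XᵀY by Gaussian elimination instead of forming the inverse explicitly, which yields the same β.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a*x = b (b is a one-column matrix) with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [b[i][0]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; add data or drop a variable")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x


def fit_coefficients(x_rows, y_values):
    """Return [a0, a1, ..., an] from the normal equations (Equation 3)."""
    X = [[1.0] + list(row) for row in x_rows]  # column of ones for a0
    Y = [[y] for y in y_values]
    Xt = transpose(X)
    return solve(matmul(Xt, X), matmul(Xt, Y))


# Made-up data generated from y = 1 + 2*x1 - 3*x2.
x_rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y_values = [1.0, 3.0, -2.0, 0.0, 2.0]
beta = fit_coefficients(x_rows, y_values)
print(beta)  # approximately [1, 2, -3]
```

Solving the linear system is generally better behaved numerically than inverting XᵀX, but for data sets as small as those used here either route is adequate.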

Matrix operations do not directly afford the best-fit line for the data.  Once the coefficients have been determined, they are used to determine calculated y-values.  Plotting the calculated y-values against the experimental y-values gives a scatter plot.  These data points can then be fit to a line.  A perfect fit would give an r² of 1.0.  The r² value is a crude measure of the goodness of the fit, and it roughly gives the fraction of the variance that can be accounted for by the included x-variables.  For example, if a line has an r² of 0.75, then the included x-variables account for 75% of the variance in the y-data.
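The r² of the line through the calculated-versus-experimental scatter plot is the squared correlation coefficient of the two sets of y-values.  A minimal sketch, assuming that definition and using a hypothetical function name and made-up numbers:

```python
def r_squared(y_experimental, y_calculated):
    """Squared Pearson correlation between experimental and calculated y."""
    n = len(y_experimental)
    if n != len(y_calculated) or n < 2:
        raise ValueError("need two equally long lists with at least two values")
    mean_e = sum(y_experimental) / n
    mean_c = sum(y_calculated) / n
    # Sums of squares and cross-products about the means.
    see = sum((e - mean_e) ** 2 for e in y_experimental)
    scc = sum((c - mean_c) ** 2 for c in y_calculated)
    sec = sum((e - mean_e) * (c - mean_c)
              for e, c in zip(y_experimental, y_calculated))
    if see == 0 or scc == 0:
        raise ValueError("r squared is undefined when one list is constant")
    return sec ** 2 / (see * scc)


# Made-up experimental and calculated values for illustration only.
r2_example = r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(r2_example)
```

When the calculated values come from a least-squares fit that includes the a₀ term, this quantity equals the fraction of the variance in the experimental y-values that the fit accounts for.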

Applet

This applet accepts x,y-coordinate data with up to four independent variables (x₁ through x₄) and up to ten data points.  The regression is performed through matrix operations, and the resulting coefficients are placed in the table to the right of the graph.  A best-fit line is then determined, and its r² value is also placed in the table.  The calculated/theoretical points are then plotted with the best-fit line.

[Applet output table: parameters a₀ through a₄ and r², each with its fitted value]
[Applet input table: points 1 through 10, with columns y, x₁, x₂, x₃, and x₄]

Note: the calculation may be slow.

Problem information

The antibacterial activity of a series of compounds (1) against Staphylococcus aureus has been reported by Hansch.  The values of the parameters σ, π, and π² are given for six compounds in the series.  Also included are the log A values (log A is simply a measure of activity).

Table 1.  Activity of 1 against Staphylococcus aureus
R        log A    σ       π        π²
NO2      2.00     0.71     0.06    0.0036
CN       1.40     0.68    -0.31    0.0961
SO2Me    1.04     0.65    -0.47    0.2209
CO2Me    1.00     0.32    -0.04    0.0016
Cl       1.00     0.37     0.70    0.49
OMe      0.74     0.12    -0.04    0.0016
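Before entering the data, it can help to transcribe Table 1 into a form that is easy to check.  The sketch below is only an illustration with hypothetical variable names: it stores the rows, confirms that each π² entry is the square of the π entry, and separates the y-values (log A) from the x-variables (σ, π, π²).

```python
# Rows transcribed from Table 1: (R, log A, sigma, pi, pi squared).
TABLE_1 = [
    ("NO2",   2.00, 0.71,  0.06, 0.0036),
    ("CN",    1.40, 0.68, -0.31, 0.0961),
    ("SO2Me", 1.04, 0.65, -0.47, 0.2209),
    ("CO2Me", 1.00, 0.32, -0.04, 0.0016),
    ("Cl",    1.00, 0.37,  0.70, 0.49),
    ("OMe",   0.74, 0.12, -0.04, 0.0016),
]

# Consistency check: the last column should be the square of pi.
for name, _log_a, _sigma, pi, pi_squared in TABLE_1:
    assert abs(pi ** 2 - pi_squared) < 1e-9, name

# y is log A; the three x-variables are sigma, pi, and pi squared.
y_values = [row[1] for row in TABLE_1]
x_rows = [[row[2], row[3], row[4]] for row in TABLE_1]

# Six observations exceed the minimum of (number of x-variables + 1).
n_observations = len(y_values)
n_variables = len(x_rows[0])
print(n_observations, n_variables)
```

With six observations and three independent variables, the data set satisfies the minimum of one more observation than independent variables.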

Problems

  1. Using the applet, enter the log A, σ, π, and π² values into the table and perform the regression.  Based on the r² value, what fraction of the y-variance is accounted for by electronics and lipophilicity?
  2. Repeat question 1 for σ alone and then for π with π².  Which variable, electronics or lipophilicity, contributes more to the activity of this series of compounds?
  3. Interestingly, although the R-groups occupy the para position, the Hansch reference uses σ-values for the meta position with essentially no justification.  This is one reason why QSAR has not received more widespread use.  The para values are given below.  Repeat question 1 with these σ values.  What is the new r² value?

     R-group    σp
     NO2        0.78
     CN         0.66
     SO2Me      0.73
     CO2Me      0.52
     Cl         0.23
     OMe       -0.27

References

  • Hansch, C.; Muir, R. M.; Fujita, T.; Maloney, P.; Geiger, E.; Streich, M. J. Am. Chem. Soc. 1963, 85, 2817.
