Computation of Mean Square Displacement and Diffusion Coefficient

MSD and DC analysis

Introduction

After running an MD (Molecular Dynamic) Simulation, we obtain a large amount of data about the simulated system. This data can be used to gain more knowledge about the system. One of the primary advantages of MD simulations is that the extensive data obtained can predict various transport phenomena like viscosity, conductivity, and diffusivity. This work represents my attempt at analyzing the DCD files obtained after simulating a water box for 12 nanoseconds at 310 K. We utilized VMD to create the PSF files and NAMD to run the MD simulation. More details about the configuration file are at Water Box System. We attempt to obtain the MSD (Mean Square Displacement) and subsequently the diffusion coefficient of the ensemble of molecules of the system. To learn how to run a simple simulation using a remote high performance computing cluster read this.

Why MSD?

Mean Square Displacement for an MD simulation is defined as the measure of the change in a molecule’s position over time with respect to a reference point/position. One of the major reasons to calculate MSD is the calculation of the diffusivity (or the diffusion coefficient) of a molecule in a system. Knowing the diffusivity gives us knowledge about how the mobility of that particular molecule in the system. Diffusion is the process by which material is transported by the random thermal motion of the molecules within the fluid, even in the absence of any mean flow. For example, if a molecule has low diffusivity, then one can infer that the system could be highly viscous for that particular molecule.

Equation for calculation of diffusion coefficient : $D = \frac{<R^2>}{6t}$

Preprocessing Data

Once the MD simulation is completed, VMD provides an option (‘save coordinates’) to export the values of the coordinates of all the molecules or specific molecules. The trajectory (DCD) files and the structure (PSF) files should be loaded. Then we use this option to get the oxygen coordinate data for MSD analysis. We export the coordinate files in ‘xyz’ format. The coordinates obtained are in angstrom. Now Python is used to read the coordinate files for further analysis.

The figure below shows how the exported coordinate file looks.

coordinate file

The start of the file shows the number of molecules in the system and how the file has been generated. In our case, we have 1981 molecules, and VMD has generated the coordinate files. This information is displayed in the file after every frame. Since this information is redundant for our analysis, it is removed from the file, and a new file is created by running the following code:

# Read in the file
with open('oxygen.xyz', 'r') as file :
  filedata = file.read()


# reading the first two lines to get knowledege about the atoms present
lines = filedata.split('\n')
num_mol = int(lines[0])
str_to_rep = lines[0] + '\n' + lines[1] + '\n'

# Replace the target string
filedata = filedata.replace(str_to_rep, '')

# Write the file out again
with open('oxygen_edited.xyz', 'w') as file:
  file.write(filedata)

If we were only to observe the coordinates of the first molecule, we would see that after every 1981 row, the coordinate of the next time step is given. This is validated by the ‘num_mol’ variable, which contains the 1981 value. The ‘oxygen.xyz’ is the original file, and after removal of the undesired information, a new file, ‘oxygen_edited.xyz’, is written and contains only the coordinates and the molecule name.

Analysis

To read and perform a preliminary analysis of the preprocessed coordinate files, we utilize the Pandas, Matplotlib, and Seaborn libraries. First, we obtain information on the coordinates of the first molecule. Since we know the total number of rows and the total number of molecules, we can acquire the total number of frames by dividing the number of rows by the total number of molecules. So now we have the total number of frames, we can validate the number of coordinates obtained for the first molecule. To get only the data for the first molecule, we store the value and skip 1981 rows as we iterate through the whole dataset.

def getFirstMoleculeData(db, num_mol):
  '''
  input
  db --> the original dataset with undesired information removed
  num_mol --> total number of molecules

  output
  Returns the coordinate data of the first molecule as pandas dataframe
  '''
  totalRows = len(db)

  coordX = []
  coordY = []
  coordZ = []

  for i in range(0, totalRows, num_mol):

    coordX.append(db.iloc[i][1])
    coordY.append(db.iloc[i][2])
    coordZ.append(db.iloc[i][3])


  df_final = pd.DataFrame(data = {'X': coordX, 'Y': coordY, 'Z': coordZ})

  return df_final

Now that we have successfully read and stored the data, we explore the trajectory of the oxygen atom of the first water molecule in the X, Y, and Z directions. Matplotlib and Seaborn libraries are utilized to plot the trajectory for the same.

X coordinates vs Time (in nanoseconds)

Y coordinates vs Time (in nanoseconds)

Z coordinates vs Time (in nanoseconds)

We extend the same function for one molecule to get data for all the molecules. We implement the following to obtain the data for all the molecules of the system.

def getAMoleculeData(db, num_mol, idx):
  '''
  input
  db --> the dataset
  num_mol --> total number of molecules
  idx --> the molecule who's data is to be extracted

  output
  Returns the coordinate data of a particular molecule as a pandas dataframe
  '''
  print('Processing the data of molecule: ', idx)
  totalRows = len(db)

  coordX = []
  coordY = []
  coordZ = []

  for i in range(idx, totalRows, num_mol):
    # print(i)
    coordX.append(db.iloc[i][1])
    coordY.append(db.iloc[i][2])
    coordZ.append(db.iloc[i][3])

  diff = np.diff(disp_R, append = disp_R[0]) #this calculates r(t + dt) - r(t)
  diff_sq = diff**2

  df_final = pd.DataFrame(data = {'X': coordX, 'Y': coordY, 'Z': coordZ})

  return df_final

def getAllMolData(db, num_mol):
  mols_data = []

  for i in range(0, num_mol):
    mols_data.append(getAMoleculeData(db, num_mol, i))

  return mols_data

MSD Calculation

Since we have the data for all the molecules over the whole simulation duration, we now compute the MSD. The Algorithm implemented for MSD is shown in the figure below.

MSD Calculation

Code for the same is shown below.

def getMSD(db):
  msd_data = []

  for iterone in range(len(db)):
    sum = 0.0
    for itertwo in range(iterone):
      sum = sum + db['SquareDiff'][itertwo]

    msd_data.append(sum)

  db_final = db
  db_final['MSD'] = msd_data

  return db_final

We extend this algorithm to get the MSD for all molecules and finally calculate the overall MSD.

MSD (for one molecule) (in Angstom) vs Time (in nanoseconds)

Equation for calculation of diffusion coefficient : $D = \frac{<R^2>}{6t}$

To calculate the diffusion coefficient, we use the formula mentioned above. We know that the diffusion coefficient is the slope of the equation relating MSD to time. Since we already have the data for MSD and time, we try to obtain the line of the best fit line without an intercept that passes through the origin (0, 0) and get the value of the slope.

y = df_msd['GlobalMSD']
x = df_msd['time']
# We only need a*x. so to figure out a we use lstsq from numpy
# Our x matrix is one dimensional, it needs to be two dimensional to use lstsq so:
x = x[:,np.newaxis]
a, _, _, _ = np.linalg.lstsq(x, y)

plt.plot(x, y, 'b-')
plt.plot(x, a*x, 'r-')
plt.show()
print(f"y = {a} x + 0")


Overall MSD (in Angstom) vs Time (in nanoseconds) (Line of best fit)

# the slope value we obtained after obtaining the line of best fit
y = [1078.36350041] x + 0

Now, this slope value is in the unit Angstrom square per nanoseconds. We convert this into centimeter square per second as the diffusion coefficient is generally represented using centimeter square per second. After conversion and division by 6 (as given in the equation), we obtain a value of 1.793-centimeter square per second, which is close to the value obtained by Jorgensen et al. for the TIPS2 model of water at 293 K, that is 3.2-centimeter square per seconds. The possible reason for this difference in value could be running the system for very little time (around eight nanoseconds) and the difference in temperature since we ran our system at 310 K instead of 293 K.

Code

The code is written in Python 3, and run on a jupyter notebook, which can be accessed here - Water Box System.