Nov 1, 2016

How to code a file splitter and joiner in Python

fsj is a file splitter and joiner written in Python. I ported it from afsj, its counterpart written in Java.

Java is a well-designed language, but implementing the same thing in Java takes much more code than in Python. I love C/C++, Java, Python, Perl and Delphi. In my experience, Delphi generates well-optimized code that is only a little slower than C. FreePascal's output is slower than Delphi's, and Java's is slightly slower than FreePascal's.

In general, Java code runs very fast, and Java seems to lead in performance for web development. If you write a lot of Python or Ruby, though, you will notice that Java is quite verbose.

Python code is much shorter, but it runs slower than Java.

You should read my related article to compare: Code a file splitter n joiner in Java.

1. Code a file extension generator


class ExtGenerator:
    def __init__(self, start):
        self.current = start

    def next(self):
        # Return the current number as a zero-padded extension, then advance.
        result = self.current
        self.current += 1
        return str(result).zfill(3)

The next() method returns the next file extension as a string. For example, if self.current is 1, next() returns "001" and advances self.current to 2, so the following call returns "002". We need ExtGenerator because we want the file joiner to find each following part automatically when given only the first part (normally the first part's extension is ".001").
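
To make the numbering concrete, here is a minimal sketch of the same idea written as a Python generator. The name ext_sequence is hypothetical and not part of fsj; it only illustrates how zero-padded extensions are produced from a starting number.

```python
from itertools import count

def ext_sequence(start):
    """Yield zero-padded extensions: start=1 gives '001', '002', ..."""
    for n in count(start):
        yield str(n).zfill(3)

gen = ext_sequence(1)
first_three = [next(gen) for _ in range(3)]
print(first_three)  # ['001', '002', '003']
```

Note that zfill(3) only pads short numbers. It never truncates, so part 1000 would get the extension "1000".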

2. Code SequenceFileExists

This is an important class: it gives both the splitter and the joiner one simple, iterator-like way to walk through the numbered part files.

from os.path import isfile

class SequenceFileExists:
    '''
    fileStart should end with .00x such as .001
    '''
    def __init__(self, fileStart):
        dot = fileStart.rindex('.')
        self.extGen = ExtGenerator(int(fileStart[dot + 1:]))
        self.pathWithoutExt = fileStart[:dot]
        self.current = None

    def hasNext(self):
        '''Advance to the next part and return True if that file exists.'''
        self.current = self.pathWithoutExt + '.' + self.extGen.next()
        return isfile(self.current)

    def next(self):
        '''
        After hasNext(), return the path that hasNext() just checked.
        Without a prior hasNext() call, generate and return the next path.
        '''
        if self.current is not None:
            return self.current
        return self.pathWithoutExt + '.' + self.extGen.next()

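ExtGenerator only returns the next file extension (as a string); it does not know whether that file exists. So I pass the first part (for example "file.001") to the constructor, and each call to hasNext() advances to the next part and checks whether it exists: the first call checks file.001 itself, the second checks file.002, and so on. Use hasNext() together with next(): after hasNext() returns True, next() returns the path of the part for the joiner to join. When next() is called on its own, without hasNext(), it simply generates the next path without checking that the file exists, which is what the splitter needs.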
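
As a rough illustration of the "walk until a part is missing" behaviour, the sketch below does the same existence check in one standalone function. The helper name existing_parts and the temporary files are assumptions made for this demo only; they are not part of fsj.

```python
import os
import shutil
import tempfile
from os.path import isfile

def existing_parts(first_part):
    """Return consecutive part paths starting at first_part, stopping at the first gap."""
    base, ext = first_part.rsplit('.', 1)
    number = int(ext)
    found = []
    while True:
        path = base + '.' + str(number).zfill(3)
        if not isfile(path):
            break
        found.append(path)
        number += 1
    return found

# Demo data: three part files with a gap, since .003 is deliberately missing.
tmp = tempfile.mkdtemp()
for ext in ('001', '002', '004'):
    with open(os.path.join(tmp, 'file.' + ext), 'wb') as f:
        f.write(b'x')

parts = existing_parts(os.path.join(tmp, 'file.001'))
names = [os.path.basename(p) for p in parts]
print(names)  # ['file.001', 'file.002']
shutil.rmtree(tmp)
```

The walk stops at the first missing number, so file.004 is never reached. A joiner built on this kind of check will silently ignore parts that come after a gap.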

3. Code a file joiner


from os.path import isfile

def join(fileStart, fileOutput, joinMode, chunkSize=1024*4, autoFind=True):
    # Append only when asked to and the output already exists; otherwise (over)write.
    mode = 'ab' if joinMode == 'append' and isfile(fileOutput) else 'wb'
    sf = SequenceFileExists(fileStart)
    with open(fileOutput, mode) as fw:
        while sf.hasNext():
            with open(sf.next(), 'rb') as fr:
                while True:
                    chunk = fr.read(chunkSize)
                    if not chunk:
                        break
                    fw.write(chunk)
            # With autoFind, keep looking for the next part to append;
            # without it, stop after the first part.
            if not autoFind:
                break

With these pieces in place, writing a file joiner is simple. The technique is to read a chunk of bytes into a buffer and write that chunk to the destination file (opened in binary write or append mode). This basic technique is used everywhere. The read and write steps go inside a loop over the parts, and that loop ends when hasNext() returns False, that is, when no next part is found.
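
The read-a-chunk, write-a-chunk loop can be tried in isolation. The sketch below uses in-memory streams instead of real files so it runs anywhere; the function name copy_chunks is hypothetical and only serves this illustration.

```python
import io

def copy_chunks(src, dst, chunk_size=4096):
    """Copy src to dst chunk by chunk; return the number of bytes copied."""
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:      # empty bytes means end of input
            break
        dst.write(chunk)
        copied += len(chunk)
    return copied

# Append two "parts" to one output, four bytes at a time.
out = io.BytesIO()
total = 0
for part in (b'hello ', b'world'):
    total += copy_chunks(io.BytesIO(part), out, chunk_size=4)
print(total, out.getvalue())  # 11 b'hello world'
```

The standard library offers shutil.copyfileobj for the same job, so a hand-written loop like this is mostly useful when you want to count bytes or report progress.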

4. Code a file splitter


from os.path import getsize

def splitBySize(fileSource, fileOutputStart, partSize, chunkSize):
    if partSize <= 0 or chunkSize <= 0:
        raise ValueError('partSize and chunkSize must be positive')

    sf = SequenceFileExists(fileOutputStart + '.001')
    fileSize = getsize(fileSource)
    splitted = 0  # total bytes copied from the source so far

    with open(fileSource, 'rb') as fr:
        while splitted < fileSize:
            # Each call to sf.next() yields the next part path: .001, .002, ...
            with open(sf.next(), 'wb') as fw:
                splitted_inner = 0  # bytes written to the current part
                while splitted_inner < partSize:
                    # Never read past the end of the current part.
                    chunk = fr.read(min(chunkSize, partSize - splitted_inner))
                    if not chunk:
                        break
                    fw.write(chunk)
                    splitted += len(chunk)
                    splitted_inner += len(chunk)

def splitByParts(fileSource, fileOutputStart, numParts, chunkSize):
    # Integer ceiling division; max(1, ...) keeps an empty source from
    # producing a part size of zero.
    partSize = max(1, (getsize(fileSource) + numParts - 1) // numParts)
    splitBySize(fileSource, fileOutputStart, partSize, chunkSize)

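We have two functions, splitBySize and splitByParts. splitBySize splits the original file into parts of partSize bytes each (the last part may be smaller). splitByParts wraps splitBySize: it divides the file size by the number of parts, rounds up, and uses the result as the part size.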
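
A small worked example shows what the rounded-up part size means in practice. The helper part_sizes below is hypothetical; it only does the arithmetic and touches no files.

```python
def part_sizes(file_size, num_parts):
    """Sizes of the parts produced when the part size is ceil(file_size / num_parts)."""
    part_size = (file_size + num_parts - 1) // num_parts  # integer ceiling division
    sizes = []
    remaining = file_size
    while remaining > 0:
        size = min(part_size, remaining)
        sizes.append(size)
        remaining -= size
    return sizes

print(part_sizes(10, 3))  # [4, 4, 2]
print(part_sizes(9, 4))   # [3, 3, 3]
```

The second call shows a side effect of rounding up: asking for 4 parts of a 9-byte file gives a part size of 3, so only 3 parts are written. If an exact part count matters, the sizes would have to be distributed differently.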

The variable splitted tracks the total number of bytes read from the original file and written to the output parts. When splitted reaches fileSize, the loop ends. Each output part is created by open(sf.next(), 'wb'), which opens a new file for writing. Each part's path comes from sf.next(), where sf is an instance of SequenceFileExists.
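
A quick way to gain confidence in any splitter and joiner pair is a round trip: split a file, join the parts, and compare checksums. The sketch below is a simplified, self-contained stand-in, not the fsj code itself. The helpers split_file, join_files and sha256_of are hypothetical, and each part is read in a single call, which is fine for a small demo but not for very large parts.

```python
import hashlib
import os
import shutil
import tempfile

def split_file(source, part_size):
    """Write source.001, source.002, ... and return the part paths."""
    parts = []
    with open(source, 'rb') as fr:
        while True:
            data = fr.read(part_size)   # small demo: one read per part
            if not data:
                break
            path = source + '.' + str(len(parts) + 1).zfill(3)
            with open(path, 'wb') as fw:
                fw.write(data)
            parts.append(path)
    return parts

def join_files(parts, output):
    """Concatenate the given parts, in order, into output."""
    with open(output, 'wb') as fw:
        for path in parts:
            with open(path, 'rb') as fr:
                shutil.copyfileobj(fr, fw)

def sha256_of(path):
    with open(path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()

tmp = tempfile.mkdtemp()
source = os.path.join(tmp, 'data.bin')
with open(source, 'wb') as f:
    f.write(bytes(range(256)) * 10)      # 2560 bytes of sample data

parts = split_file(source, part_size=1000)
restored = os.path.join(tmp, 'restored.bin')
join_files(parts, restored)

part_lengths = [os.path.getsize(p) for p in parts]
round_trip_ok = sha256_of(source) == sha256_of(restored)
print(part_lengths, round_trip_ok)  # [1000, 1000, 560] True
shutil.rmtree(tmp)
```

The same check can be pointed at splitBySize and join: if the checksum of the joined output differs from that of the original, a part was skipped, truncated or written in the wrong order.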
