# Quick Start: Distributed Training on the Oxford-IIIT Pets Dataset on Google Cloud

This page is a walkthrough for training an object detector using the Tensorflow
Object Detection API. In this tutorial, we'll be training on the Oxford-IIIT
Pets dataset to build a system to detect various breeds of cats and dogs. The
output of the detector will look like the following:

![](img/oxford_pet.png)

## Setting up a Project on Google Cloud

To accelerate the process, we'll run training and evaluation on [Google Cloud
ML Engine](https://cloud.google.com/ml-engine/) to leverage multiple GPUs. To
begin, you will have to set up Google Cloud via the following steps (if you have
already done this, feel free to skip to the next section):

1. [Create a GCP project](https://cloud.google.com/resource-manager/docs/creating-managing-projects).
2. [Install the Google Cloud SDK](https://cloud.google.com/sdk/downloads) on
   your workstation or laptop. This will provide the tools you need to upload
   files to Google Cloud Storage and start ML training jobs.
3. [Enable the ML Engine
   APIs](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component&_ga=1.73374291.1570145678.1496689256).
   By default, a new GCP project does not enable APIs to start ML Engine
   training jobs. Use the above link to explicitly enable them.
4. [Set up a Google Cloud Storage (GCS)
   bucket](https://cloud.google.com/storage/docs/creating-buckets). ML Engine
   training jobs can only access files on a Google Cloud Storage bucket. In
   this tutorial, we'll be required to upload our dataset and configuration to
   GCS. Please remember the name of your GCS bucket, as we will reference it
   multiple times in this document. Substitute `${YOUR_GCS_BUCKET}` with the
   name of your bucket in this document. For your convenience, you should
   define the environment variable below:

   ``` bash
   export YOUR_GCS_BUCKET=${YOUR_GCS_BUCKET}
   ```

It is also possible to run locally by following
[the running locally instructions](running_locally.md).

## Installing Tensorflow and the Tensorflow Object Detection API

Please run through the [installation instructions](installation.md) to install
Tensorflow and all of its dependencies. Ensure the Protobuf libraries are
compiled and the library directories are added to `PYTHONPATH`.

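As a quick reference, the two steps called out above look roughly like the
following (consult [installation.md](installation.md) for the complete,
up-to-date procedure):

``` bash
# From tensorflow/models/research/
# Compile the Protobuf message definitions used by the Object Detection API.
protoc object_detection/protos/*.proto --python_out=.
# Make the research and slim directories importable from Python.
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
```
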
## Getting the Oxford-IIIT Pets Dataset and Uploading it to Google Cloud Storage

In order to train a detector, we require a dataset of images, bounding boxes and
classifications. For this demo, we'll use the Oxford-IIIT Pets dataset. The raw
dataset for Oxford-IIIT Pets lives
[here](http://www.robots.ox.ac.uk/~vgg/data/pets/). You will need to download
both the image dataset [`images.tar.gz`](http://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz)
and the groundtruth data [`annotations.tar.gz`](http://www.robots.ox.ac.uk/~vgg/data/pets/data/annotations.tar.gz)
to the `tensorflow/models/research/` directory and unzip them. This may take
some time.

``` bash
# From tensorflow/models/research/
wget http://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz
wget http://www.robots.ox.ac.uk/~vgg/data/pets/data/annotations.tar.gz
tar -xvf images.tar.gz
tar -xvf annotations.tar.gz
```

After downloading the tarballs, your `tensorflow/models/research/` directory
should appear as follows:

```lang-none
- images.tar.gz
- annotations.tar.gz
+ images/
+ annotations/
+ object_detection/
... other files and directories
```

The Tensorflow Object Detection API expects data to be in the TFRecord format,
so we'll now run the `create_pet_tf_record` script to convert from the raw
Oxford-IIIT Pets dataset into TFRecords. Run the following commands from the
`tensorflow/models/research/` directory:

``` bash
# From tensorflow/models/research/
python object_detection/dataset_tools/create_pet_tf_record.py \
    --label_map_path=object_detection/data/pet_label_map.pbtxt \
    --data_dir=`pwd` \
    --output_dir=`pwd`
```

Note: It is normal to see some warnings when running this script. You may ignore
them.

Two sharded TFRecord datasets (10 shards each), named `pet_faces_train.record-*`
and `pet_faces_val.record-*`, should be generated in the
`tensorflow/models/research/` directory.

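If you'd like a quick sanity check that the conversion succeeded, you can count
the generated shards (20 files in total, assuming the 10-shard default):

``` bash
# From tensorflow/models/research/
# Expect 10 train shards plus 10 validation shards.
ls -1 pet_faces_train.record-* pet_faces_val.record-* | wc -l
```
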
Now that the data has been generated, we'll need to upload it to Google Cloud
Storage so the data can be accessed by ML Engine. Run the following command to
copy the files into your GCS bucket (substituting `${YOUR_GCS_BUCKET}`):

```bash
# From tensorflow/models/research/
gsutil cp pet_faces_train.record-* gs://${YOUR_GCS_BUCKET}/data/
gsutil cp pet_faces_val.record-* gs://${YOUR_GCS_BUCKET}/data/
gsutil cp object_detection/data/pet_label_map.pbtxt gs://${YOUR_GCS_BUCKET}/data/pet_label_map.pbtxt
```

Please remember the path where you upload the data to, as we will need this
information when configuring the pipeline in a following step.

## Downloading a COCO-pretrained Model for Transfer Learning

Training a state of the art object detector from scratch can take days, even
when using multiple GPUs! In order to speed up training, we'll take an object
detector trained on a different dataset (COCO), and reuse some of its
parameters to initialize our new model.

Download our [COCO-pretrained Faster R-CNN with Resnet-101
model](http://storage.googleapis.com/download.tensorflow.org/models/object_detection/faster_rcnn_resnet101_coco_11_06_2017.tar.gz).
Extract the contents of the archive and copy the `model.ckpt*` files into your
GCS bucket:

``` bash
wget http://storage.googleapis.com/download.tensorflow.org/models/object_detection/faster_rcnn_resnet101_coco_11_06_2017.tar.gz
tar -xvf faster_rcnn_resnet101_coco_11_06_2017.tar.gz
gsutil cp faster_rcnn_resnet101_coco_11_06_2017/model.ckpt.* gs://${YOUR_GCS_BUCKET}/data/
```

Remember the path where you uploaded the model checkpoint to, as we will need it
in the following step.

## Configuring the Object Detection Pipeline

In the Tensorflow Object Detection API, the model parameters, training
parameters and eval parameters are all defined by a config file. More details
can be found [here](configuring_jobs.md). For this tutorial, we will use some
predefined templates provided with the source code. In the
`object_detection/samples/configs` folder, there are skeleton object detection
configuration files. We will use `faster_rcnn_resnet101_pets.config` as a
starting point for configuring the pipeline. Open the file with your favourite
text editor.

We'll need to configure some paths in order for the template to work. Search the
file for instances of `PATH_TO_BE_CONFIGURED` and replace them with the
appropriate value (typically `gs://${YOUR_GCS_BUCKET}/data/`). Afterwards,
upload your edited file to GCS, making note of the path it was uploaded to
(we'll need it when starting the training/eval jobs).

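For reference, after substitution the path-related entries in the template
should look roughly like the following (an illustrative excerpt; the exact
field names and shard counts come from the template itself):

```lang-none
fine_tune_checkpoint: "gs://${YOUR_GCS_BUCKET}/data/model.ckpt"
train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://${YOUR_GCS_BUCKET}/data/pet_faces_train.record-?????-of-00010"
  }
  label_map_path: "gs://${YOUR_GCS_BUCKET}/data/pet_label_map.pbtxt"
}
```
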
``` bash
# From tensorflow/models/research/
# Edit the faster_rcnn_resnet101_pets.config template. Please note that there
# are multiple places where PATH_TO_BE_CONFIGURED needs to be set.
sed -i "s|PATH_TO_BE_CONFIGURED|gs://${YOUR_GCS_BUCKET}/data|g" \
    object_detection/samples/configs/faster_rcnn_resnet101_pets.config

# Copy edited template to cloud.
gsutil cp object_detection/samples/configs/faster_rcnn_resnet101_pets.config \
    gs://${YOUR_GCS_BUCKET}/data/faster_rcnn_resnet101_pets.config
```

## Checking Your Google Cloud Storage Bucket

At this point in the tutorial, you should have uploaded the training/validation
datasets (including the label map), the COCO-pretrained Faster R-CNN fine-tuning
checkpoint and your job configuration to your Google Cloud Storage bucket. Your
bucket should look like the following:

```lang-none
+ ${YOUR_GCS_BUCKET}/
  + data/
    - faster_rcnn_resnet101_pets.config
    - model.ckpt.index
    - model.ckpt.meta
    - model.ckpt.data-00000-of-00001
    - pet_label_map.pbtxt
    - pet_faces_train.record-*
    - pet_faces_val.record-*
```

You can inspect your bucket using the [Google Cloud Storage
browser](https://console.cloud.google.com/storage/browser).

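Alternatively, you can list the uploaded files from the command line:

```bash
gsutil ls gs://${YOUR_GCS_BUCKET}/data/
```
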
## Starting Training and Evaluation Jobs on Google Cloud ML Engine

Before we can start a job on Google Cloud ML Engine, we must:

1. Package the Tensorflow Object Detection code.
2. Write a cluster configuration for our Google Cloud ML job.

To package the Tensorflow Object Detection code, run the following commands from
the `tensorflow/models/research/` directory:

```bash
# From tensorflow/models/research/
bash object_detection/dataset_tools/create_pycocotools_package.sh /tmp/pycocotools
python setup.py sdist
(cd slim && python setup.py sdist)
```

This will create the Python packages `dist/object_detection-0.1.tar.gz`,
`slim/dist/slim-0.1.tar.gz` and `/tmp/pycocotools/pycocotools-2.0.tar.gz`.

For the training Cloud ML job, we'll configure the cluster to use five training
workers and three parameter servers. The configuration file can be found at
`object_detection/samples/cloud/cloud.yml`.

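As a rough sketch, a cluster configuration of this shape looks like the
following (the field values here are illustrative; the file shipped with the
source code is authoritative):

```lang-none
trainingInput:
  runtimeVersion: "1.12"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
```
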
Note: The code sample below is supported for use with the 1.12 runtime version.

To start training and evaluation, execute the following command from the
`tensorflow/models/research/` directory:

```bash
# From tensorflow/models/research/
gcloud ml-engine jobs submit training `whoami`_object_detection_pets_`date +%m_%d_%Y_%H_%M_%S` \
    --runtime-version 1.12 \
    --job-dir=gs://${YOUR_GCS_BUCKET}/model_dir \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
    --module-name object_detection.model_main \
    --region us-central1 \
    --config object_detection/samples/cloud/cloud.yml \
    -- \
    --model_dir=gs://${YOUR_GCS_BUCKET}/model_dir \
    --pipeline_config_path=gs://${YOUR_GCS_BUCKET}/data/faster_rcnn_resnet101_pets.config
```

Users can monitor and stop training and evaluation jobs on the [ML Engine
Dashboard](https://console.cloud.google.com/mlengine/jobs).

## Monitoring Progress with Tensorboard

You can monitor progress of the training and eval jobs by running Tensorboard on
your local machine:

```bash
# This command needs to be run once to allow your local machine to access your
# GCS bucket.
gcloud auth application-default login

tensorboard --logdir=gs://${YOUR_GCS_BUCKET}/model_dir
```

Once Tensorboard is running, navigate to `localhost:6006` from your favourite
web browser. Make sure your Tensorboard release matches the minor version of
your Tensorflow installation (1.x). You should see something similar to the
following:

![](img/tensorboard.png)

You will also want to click on the images tab to see example detections made by
the model while it trains. After about an hour and a half of training, you can
expect to see something like this:

![](img/tensorboard2.png)

Note: It takes roughly 10 minutes for a job to get started on ML Engine, and
roughly an hour for the system to evaluate the validation dataset. It may take
some time to populate the dashboards. If you do not see any entries after half
an hour, check the logs from the [ML Engine
Dashboard](https://console.cloud.google.com/mlengine/jobs). Note that by default
the training jobs are configured to run for much longer than is necessary for
convergence. To save money, we recommend killing your jobs once you've seen
that they've converged.

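You can kill a job from the ML Engine Dashboard, or from the command line.
`${JOB_ID}` below is a placeholder for the job name shown on the dashboard:

```bash
# Cancel a running ML Engine job by name.
gcloud ml-engine jobs cancel ${JOB_ID}
```
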
## Exporting the Tensorflow Graph

After your model has been trained, you should export it to a Tensorflow graph
proto. First, you need to identify a candidate checkpoint to export. You can
search your bucket using the [Google Cloud Storage
Browser](https://console.cloud.google.com/storage/browser). The file should be
stored under `${YOUR_GCS_BUCKET}/model_dir`. The checkpoint will typically
consist of three files:

* `model.ckpt-${CHECKPOINT_NUMBER}.data-00000-of-00001`
* `model.ckpt-${CHECKPOINT_NUMBER}.index`
* `model.ckpt-${CHECKPOINT_NUMBER}.meta`

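You can also list candidate checkpoints directly from the command line, and set
`CHECKPOINT_NUMBER` for the commands below:

```bash
# List the available checkpoints in the bucket.
gsutil ls gs://${YOUR_GCS_BUCKET}/model_dir/model.ckpt-*.index
# Pick one of the listed step numbers (the value below is just an example).
export CHECKPOINT_NUMBER=200000
```
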
After you've identified a candidate checkpoint to export, run the following
command from `tensorflow/models/research/`:

```bash
# From tensorflow/models/research/
gsutil cp gs://${YOUR_GCS_BUCKET}/model_dir/model.ckpt-${CHECKPOINT_NUMBER}.* .
python object_detection/export_inference_graph.py \
    --input_type image_tensor \
    --pipeline_config_path object_detection/samples/configs/faster_rcnn_resnet101_pets.config \
    --trained_checkpoint_prefix model.ckpt-${CHECKPOINT_NUMBER} \
    --output_directory exported_graphs
```

Afterwards, you should see a directory named `exported_graphs` containing the
SavedModel and frozen graph.

## Configuring the Instance Segmentation Pipeline

Mask prediction can be turned on for an object detection config by adding
`predict_instance_masks: true` within the `MaskRCNNBoxPredictor`. Other
parameters, such as the mask size, the number of convolutions in the mask
layer, and the convolution hyperparameters, can be defined as well. We will use
`mask_rcnn_resnet101_pets.config` as a starting point for configuring the
instance segmentation pipeline. Everything said above about object detection
holds true for instance segmentation: setting training details aside, an
instance segmentation model is simply an object detection model with an
additional head that predicts an object mask inside each predicted box, as the
config sketch below illustrates.

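As a rough illustration, the relevant part of such a config looks like the
following (the field names follow the Object Detection API's box predictor
configuration; the values shown are examples, and the exact settings come from
`mask_rcnn_resnet101_pets.config` itself):

```lang-none
second_stage_box_predictor {
  mask_rcnn_box_predictor {
    predict_instance_masks: true
    mask_height: 33
    mask_width: 33
    mask_prediction_num_conv_layers: 4
    conv_hyperparams {
      ...
    }
  }
}
```
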
Please refer to the section on [Running an Instance Segmentation
Model](instance_segmentation.md) for instructions on how to configure a model
that predicts masks in addition to object bounding boxes.

## What's Next

Congratulations, you have now trained an object detector for various cats and
dogs! There are several things you can do now:

1. [Test your exported model using the provided Jupyter notebook.](running_notebook.md)
2. [Experiment with different model configurations.](configuring_jobs.md)
3. Train an object detector using your own data.