Python Programming/Strings

Overview

Strings in Python at a glance:

str1 = "Hello"                # A new string using double quotes
str2 = 'Hello'                # Single quotes do the same
str3 = "Hello\tworld\n"       # One with a tab and a newline
str4 = str1 + " world"        # Concatenation
str5 = str1 + str(4)          # Concatenation with a number
str6 = str1[2]                # 3rd character
str6a = str1[-1]              # Last character
#str1[0] = "M"                # No way; strings are immutable
for char in str1: print(char) # For each character
str7 = str1[1:]               # Without the 1st character
str8 = str1[:-1]              # Without the last character
str9 = str1[1:4]              # Substring: 2nd to 4th character
str10 = str1 * 3              # Repetition
str11 = str1.lower()          # Lowercase
str12 = str1.upper()          # Uppercase
str13 = str1.rstrip()         # Strip right (trailing) whitespace
str14 = str1.replace('l','h') # Replacement
list15 = str1.split('l')      # Splitting
if str1 == str2: print("Equ") # Equality test
if "el" in str1: print("In")  # Substring test
length = len(str1)            # Length
pos1 = str1.find('llo')       # Index of substring or -1
pos2 = str1.rfind('l')        # Index of substring, from the right
count = str1.count('l')       # Number of occurrences of a substring

print(str1, str2, str3, str4, str5, str6, str7, str8, str9, str10)
print(str11, str12, str13, str14, list15)
print(length, pos1, pos2, count)

See also the chapter on Regular Expressions for more advanced pattern matching on strings in Python.

String operations

Equality

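Two strings are equal if they have exactly the same contents: the same length, with every character matching position by position. Some other languages compare strings by identity instead; that is, two strings are considered equal only if they occupy the same space in memory. Python uses the is operator to test the identity of strings, and of any two objects in general.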

Examples:

>>> a = 'hello'; b = 'hello' # Assign 'hello' to a and b.
>>> a == b                   # check for equality
True
>>> a == 'hello'             #
True
>>> a == "hello"             # (choice of delimiter is unimportant)
True
>>> a == 'hello '            # (extra space)
False
>>> a == 'Hello'             # (wrong case)
False
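
The difference between equality and identity can be sketched as follows. This is a minimal illustration, not part of the original examples; the second string is assembled at run time only so that it is likely to be a separate object. Whether two equal strings share one object is an implementation detail, so the last result is deliberately left unstated.

```python
a = 'hello'
b = ''.join(['hel', 'lo'])   # same content as a, assembled from pieces
c = a                        # c is another name for the very same object

print(a == b)   # True: the contents match character by character
print(c is a)   # True: both names refer to one object
print(a is b)   # implementation-dependent; usually False in CPython
```

For comparing the text of two strings, == is the operator to use; is only answers whether two names refer to the same object.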

Numeric

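Two quasi-numerical operations can be performed on strings: addition and multiplication. String addition is just another name for concatenation, which simply sticks the strings together. String multiplication is repeated addition, or concatenation. So: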

>>> c = 'a'
>>> c + 'b'
'ab'
>>> c * 5
'aaaaa'

Containment

The in operator returns True if the first operand is contained in the second. This also works on substrings:

>>> x = 'hello'
>>> y = 'ell'
>>> x in y
False
>>> y in x
True

Note that print(x in y) would display the same value.

Indexing and slicing

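Much like arrays in other languages, the individual characters in a string can be accessed by an integer representing their position in the string. The first character in string s is s[0], and the nth character is at s[n-1].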

>>> s = "Xanadu"
>>> s[1]
'a'

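Unlike arrays in many other languages, Python strings can also be indexed backwards using negative numbers. The last character has index -1, the second to last has index -2, and so on.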

>>> s[-4]
'n'

We can also use "slices" to access a substring of s. s[a:b] gives us a string starting with s[a] and ending with s[b-1].

>>> s[1:4]
'ana'

Neither of these is assignable.

>>> print(s)
>>> s[0] = 'J'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> s[1:3] = "up"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> print(s)

Outputs (assuming the errors were suppressed):

Xanadu
Xanadu

Another feature of slices is that if the start or the end is left empty, it defaults to the first or the last index respectively:

>>> s[2:]
'nadu'
>>> s[:3]
'Xan'
>>> s[:]
'Xanadu'

You can also use negative numbers in slices:

>>> s[-2:]
'du'

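The easiest way to understand slices is not to count the elements themselves. It is a bit like counting not on your fingers but in the spaces between them. The list is indexed like this: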

Element:     1     2     3     4
Index:    0     1     2     3     4
         -4    -3    -2    -1

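So, when we ask for the [1:3] slice, that means we start at index 1, end at index 3, and take everything in between them (elements 2 and 3 in the diagram). If you are used to indexes in C or Java, this can be a bit disconcerting until you get used to it.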
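
The "count the gaps, not the elements" model has two practical consequences, sketched here as an illustration using the same example string as above: for in-range boundaries the length of s[a:b] is b - a, and cutting a string at any boundary gives two pieces that rejoin to the original.

```python
s = "Xanadu"

# A slice s[a:b] covers the elements between boundary a and boundary b,
# so for in-range boundaries its length is simply b - a.
part = s[1:3]
print(part, len(part))        # an 2

# Cutting at any boundary i yields two pieces that rejoin to the original.
for i in range(len(s) + 1):
    assert s[:i] + s[i:] == s

# Negative boundaries count from the right end of the string.
print(s[-4:-1])               # nad
```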

String constants

String constants can be found in the standard string module. For example, string.digits equals '0123456789'.
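
As a small sketch of how these constants might be used, the following checks whether a string consists only of hexadecimal digits. The helper name is_hex is hypothetical and chosen only for this illustration.

```python
import string

print(string.digits)           # 0123456789
print(string.ascii_lowercase)  # abcdefghijklmnopqrstuvwxyz

def is_hex(text):
    """Hypothetical helper: True if text is non-empty and all hex digits."""
    return len(text) > 0 and all(ch in string.hexdigits for ch in text)

print(is_hex("1aF3"))   # True
print(is_hex("xyz"))    # False
```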

Links:

Python documentation of the "string" module -- python.org

String methods

There are a number of methods, or built-in string functions:

capitalize
center
count
decode (Python 2 only; in Python 3 this is a method of bytes)
encode
endswith
expandtabs
find
index
isalnum
isalpha
isdigit
islower
isspace
istitle
isupper
join
ljust
lower
lstrip
replace
rfind
rindex
rjust
rstrip
split
splitlines
startswith
strip
swapcase
title
translate
upper
zfill

Only some of these methods are covered below.

is*

isalnum(), isalpha(), isdigit(), islower(), isupper(), isspace(), and istitle() fit into this category.

The string being tested must have a length of at least 1, or else the is* methods return False. In other words, a string with len(string) == 0 is considered "empty", or False.

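isalnum returns True if the string is entirely composed of alphabetic and/or numeric characters (i.e. no punctuation).
isalpha and isdigit work similarly for alphabetic characters only or numeric characters only, respectively.
isspace returns True if the string is composed entirely of whitespace.
islower, isupper, and istitle return True if the string is in lowercase, uppercase, or titlecase respectively. Uncased characters, such as digits, are "allowed", but there must be at least one cased character in the string for True to be returned. Titlecase means that the first cased character of each word is uppercase and any immediately following cased characters are lowercase. Curiously, 'Y2K'.istitle() returns True. That is because uppercase characters may only follow uncased characters, and lowercase characters may only follow uppercase or lowercase characters. Hint: whitespace is uncased.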

Example:

>>> '2YK'.istitle()
False
>>> 'Y2K'.istitle()
True
>>> '2Y K'.istitle()
True
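
The rule that every is* method returns False for an empty string can be checked with a short sketch. The sample values and the helper all_digits_or_empty are hypothetical, added only to illustrate one way of treating empty input as acceptable.

```python
samples = ['abc123', 'abc', '123', '   ', 'abc!', '']

for text in samples:
    # Columns: isalnum, isalpha, isdigit, isspace.
    # The last sample is empty, so all four methods return False for it.
    print(repr(text), text.isalnum(), text.isalpha(),
          text.isdigit(), text.isspace())

def all_digits_or_empty(text):
    # Hypothetical helper: accept the empty string explicitly,
    # because ''.isdigit() is False.
    return text == '' or text.isdigit()

print(all_digits_or_empty(''))      # True
print(all_digits_or_empty('12a'))   # False
```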

Title, Upper, Lower, Swapcase, Capitalize

Return the string converted to title case, upper case, lower case, inverted case, or capitalized form, respectively.

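The title method capitalizes the first letter of each word in the string (and makes the rest lower case). Words are identified as substrings of alphabetic characters separated by non-alphabetic characters, such as digits or whitespace. This can lead to some unexpected behavior. For example, the string "x1x" is converted to "X1X" instead of "X1x".

The swapcase method makes all uppercase letters lowercase and vice versa.

The capitalize method is like title except that it treats the whole string as one word (i.e. it makes the first character upper case and the rest lower case).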

Example:

s = 'Hello, wOrLD'
print(s)             # 'Hello, wOrLD'
print(s.title())     # 'Hello, World'
print(s.swapcase())  # 'hELLO, WoRld'
print(s.upper())     # 'HELLO, WORLD'
print(s.lower())     # 'hello, world'
print(s.capitalize())# 'Hello, world'

Keywords: to lower case, to upper case, lowercase, uppercase, downcase, upcase.

count

Returns the number of occurrences of the specified substring in the string, i.e.

>>> s = 'Hello, world'
>>> s.count('o') # print the number of 'o's in 'Hello, World' (2)
2

Hint: .count() is case-sensitive, so this example counts only the lowercase letter 'o'. For example, if you ran:

>>> s = 'HELLO, WORLD'
>>> s.count('o') # print the number of lowercase 'o's in 'HELLO, WORLD' (0)
0
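
One possible way to count without regard to case is to normalize the string first. The sketch below uses lower() for this; the sample string is chosen for illustration. It also shows that count tallies non-overlapping occurrences.

```python
s = 'HELLO, World'

exact = s.count('o')             # only the lowercase 'o' in 'World'
folded = s.lower().count('o')    # both 'O' and 'o' after lowercasing

print(exact, folded)             # 1 2

# Occurrences are counted without overlap.
print('aaaa'.count('aa'))        # 2
```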

strip, rstrip, lstrip

Return a copy of the string with leading (lstrip) and trailing (rstrip) whitespace removed. strip removes both.

>>> s = '\t Hello, world\n\t '
>>> print(s)
         Hello, world

>>> print(s.strip())
Hello, world
>>> print(s.lstrip())
Hello, world
        # ends here
>>> print(s.rstrip())
         Hello, world

Note the leading and trailing tabs and newlines.

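The strip methods can also be used to remove other kinds of characters.

import string
s = 'www.wikibooks.org'
print(s)
print(s.strip('w'))                     # Removes all w's from outside
print(s.strip(string.ascii_lowercase))  # Removes all lowercase letters from outside
print(s.strip(string.printable))        # Removes all printable characters

Outputs:

www.wikibooks.org
.wikibooks.org
.wikibooks.

Note that the last call prints an empty line, because every character of s is printable. string.ascii_lowercase (named string.lowercase in Python 2) and string.printable require the import string statement.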
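
A point that is easy to miss is that the argument is treated as a set of characters rather than as a prefix or suffix. The following sketch, with an example URL chosen only for illustration, shows the effect.

```python
url = 'www.example.com'

# Every leading and trailing character found in 'cmowz.' is removed;
# stripping stops at the first character that is not in the set.
print(url.strip('cmowz.'))     # example

# lstrip and rstrip accept the same kind of argument.
print(url.lstrip('w.'))        # example.com
print(url.rstrip('.com'))      # www.example
```

Because of this set behaviour, rstrip('.com') is not a safe way to remove a ".com" suffix in general: it would also eat trailing characters of a name that happens to end in c, o, m, or a dot.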

ljust, rjust, center

Left-justifies, right-justifies, or centers a string in a field of the given width, padding the rest with spaces.

>>> s = 'foo'
>>> s
'foo'
>>> s.ljust(7)
'foo    '
>>> s.rjust(7)
'    foo'
>>> s.center(7)
'  foo  '
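
All three methods also accept an optional fill character as a second argument. As a small sketch, with made-up data, this can be used to line up columns of text:

```python
rows = [('apple', 3), ('kiwi', 12), ('fig', 150)]

lines = []
for name, qty in rows:
    # Pad the name on the right with dots and the number on the left with spaces.
    lines.append(name.ljust(8, '.') + str(qty).rjust(5))

print('\n'.join(lines))
print('foo'.center(7, '*'))   # **foo**

# A width smaller than the string leaves the string unchanged.
print('foo'.ljust(2))         # foo
```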

join

Joins the given sequence together, using the string as the separator:

>>> seq = ['1', '2', '3', '4', '5']
>>> ' '.join(seq)
'1 2 3 4 5'
>>> '+'.join(seq)
'1+2+3+4+5'

map may be helpful here, since it converts the numbers in seq into strings:

>>> seq = [1,2,3,4,5]
>>> ' '.join(map(str, seq))
'1 2 3 4 5'

Now seq can contain arbitrary objects, not just strings.
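
The reason a conversion is needed is that join raises a TypeError when any item is not a string. The sketch below, using a made-up mixed list, shows the failure and a generator expression as an alternative to map.

```python
values = [1, 2.5, None, 'four']

# join raises TypeError if any item is not a string...
try:
    ', '.join(values)
except TypeError as err:
    print('TypeError:', err)

# ...so convert each item first, here with a generator expression.
line = ', '.join(str(v) for v in values)
print(line)    # 1, 2.5, None, four
```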

find, index, rfind, rindex

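The find and index methods return the index of the first occurrence of the given subsequence in the string. If it is not found, find returns -1 but index raises a ValueError. rfind and rindex are the same as find and index except that they search through the string from right to left (i.e. they find the last occurrence).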

>>> s = 'Hello, world'
>>> s.find('l')
2
>>> s[s.index('l'):]
'llo, world'
>>> s.rfind('l')
10
>>> s[:s.rindex('l')]
'Hello, wor'
>>> s[s.index('l'):s.rindex('l')]
'llo, wor'

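Because Python strings accept negative indices, index is probably the better choice in situations like the one shown above: when the substring is missing, find returns -1, which used as an index silently refers to the last character and yields an unintended value.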
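
The pitfall can be reproduced with a substring that does not occur. This is a minimal sketch using the same example string; the guard at the end is just one possible way to use find safely.

```python
s = 'Hello, world'

# 'z' does not occur, so find returns -1, which is also a valid index.
pos = s.find('z')
print(pos)          # -1
print(s[pos:])      # d  (silently slices from the last character)

# index signals the same situation with an exception instead.
try:
    s.index('z')
except ValueError:
    print('substring not found')

# When find is preferred, check the result before slicing.
tail = s[pos:] if pos != -1 else ''
print(repr(tail))   # ''
```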

replace

replace works just like it sounds. It returns a copy of the string with all occurrences of the first argument replaced by the second.

>>> 'Hello, world'.replace('o', 'X')
'HellX, wXrld'

Or, using variable assignment:

string = 'Hello, world'
newString = string.replace('o', 'X')
print(string)
print(newString)

Outputs:

Hello, world
HellX, wXrld

Notice that the original variable (string) remains unchanged after the call to replace.
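
replace also takes an optional third argument that limits how many occurrences are replaced, and because each call returns a new string, calls can be chained. A short sketch with illustrative variable names:

```python
text = 'Hello, world'

print(text.replace('o', 'X', 1))    # HellX, world  (first occurrence only)
print(text.replace('z', 'X'))       # Hello, world  (nothing to replace)

# Calls can be chained because each one returns a new string.
cleaned = text.replace(',', '').replace(' ', '_')
print(cleaned)                      # Hello_world
```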

expandtabs

Replaces tabs with the appropriate number of spaces (the default tab size is 8; this can be changed by passing the tab size as an argument).

s = 'abcdefg\tabc\ta'
print(s)
print(len(s))
t = s.expandtabs()
print(t)
print(len(t))

Outputs:

abcdefg abc     a
13
abcdefg abc     a
17

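Notice that although these two strings look the same, the second string (t) has a different length because each tab is represented by spaces rather than by a tab character.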

To use a tab size of 4 instead of 8:

v = s.expandtabs(4)
print(v)
print(len(v))

Outputs:

abcdefg abc a
13

Note that a tab is not always expanded to eight spaces. Rather, a tab pushes the column count to the next multiple of eight. For example:

s = '\t\t'
print(s.expandtabs().replace(' ', '*'))
print(len(s.expandtabs()))

Outputs:

****************
16

s = 'abc\tabc\tabc'
print(s.expandtabs().replace(' ', '*'))
print(len(s.expandtabs()))

Outputs:

abc*****abc*****abc
19
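
The "next multiple of the tab size" rule can be made concrete with a hand-written version of the method. The function expand_tabs below is a hypothetical re-implementation for illustration only, assuming a positive tab size; it is compared against the built-in on the examples from this section.

```python
def expand_tabs(text, tabsize=8):
    """Hypothetical re-implementation of str.expandtabs (tabsize > 0)."""
    out = []
    column = 0
    for ch in text:
        if ch == '\t':
            # Number of spaces needed to reach the next multiple of tabsize.
            pad = tabsize - (column % tabsize)
            out.append(' ' * pad)
            column += pad
        elif ch in '\n\r':
            out.append(ch)
            column = 0          # a new line restarts the column count
        else:
            out.append(ch)
            column += 1
    return ''.join(out)

for sample in ['\t\t', 'abc\tabc\tabc', 'abcdefg\tabc\ta']:
    assert expand_tabs(sample) == sample.expandtabs()
    assert expand_tabs(sample, 4) == sample.expandtabs(4)

print(len(expand_tabs('abc\tabc\tabc')))   # 19
```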

split, splitlines

The split method returns a list of the words in the string. It can take a separator argument to use instead of whitespace.

>>> s = 'Hello, world'
>>> s.split()
['Hello,', 'world']
>>> s.split('l')
['He', '', 'o, wor', 'd']

Note that in neither case is the separator included in the split strings, but empty strings are allowed.

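The splitlines method breaks a multiline string into many single-line strings. It is analogous to split('\n') (but accepts '\r' and '\r\n' as delimiters as well), except that if the string ends in a newline character, splitlines ignores that final character (see example).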

>>> s = """
... One line
... Two lines
... Red lines
... Blue lines
... Green lines
... """
>>> s.split('\n')
['', 'One line', 'Two lines', 'Red lines', 'Blue lines', 'Green lines', '']
>>> s.splitlines()
['', 'One line', 'Two lines', 'Red lines', 'Blue lines', 'Green lines']

The split method also accepts multi-character string literals:

txt = 'May the force be with you'
spl = txt.split('the')
print(spl)
# ['May ', ' force be with you']
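
A few further behaviours of split and splitlines, sketched with made-up inputs: calling split() with no argument collapses runs of whitespace, while an explicit ' ' separator does not; an optional second argument limits the number of splits; and splitlines can keep the line endings.

```python
text = 'a  b   c'

print(text.split())        # ['a', 'b', 'c']  (runs of whitespace collapse)
print(text.split(' '))     # ['a', '', 'b', '', '', 'c']

# An optional second argument limits the number of splits.
key, value = 'name=Ada=Lovelace'.split('=', 1)
print(key, value)          # name Ada=Lovelace

# splitlines keeps the line endings when passed True.
print('one\r\ntwo\n'.splitlines(True))   # ['one\r\n', 'two\n']
```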

Unicode

In Python 3.x, all strings (the type str) contain Unicode by default.

In Python 2.x, there is a dedicated unicode type in addition to the str type: u = u"Hello"; type(u) is unicode.

The topic name for Unicode in the built-in help is UNICODE.

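Examples for Python 3.x:

v = "Hello Günther"
- Uses a Unicode character directly in the source code; the source file has to be UTF-8 encoded (the Python 3 default) unless it declares another encoding.
v = "Hello G\xfcnther"
- Uses \xfc to specify an 8-bit Unicode code point.
v = "Hello G\u00fcnther"
- Uses \u00fc to specify a 16-bit Unicode code point.
v = "Hello G\U000000fcnther"
- Uses \U000000fc to specify a 32-bit Unicode code point, with the U capitalized.
v = "Hello G\N{LATIN SMALL LETTER U WITH DIAERESIS}nther"
- Uses \N followed by the Unicode character name to specify a Unicode code point.
v = "Hello G\N{latin small letter u with diaeresis}nther"
- The character name can be in lowercase.
n = unicodedata.name(chr(252))
- Gets the Unicode name of a given Unicode character (here ü); requires import unicodedata.
v = "Hello G" + chr(252) + "nther"
- chr() takes a Unicode code point and returns a string containing that one Unicode character.
c = ord("ü")
- Yields the code point number.
b = "Hello Günther".encode("UTF-8")
- Creates a byte sequence (bytes) from a Unicode string.
b = "Hello Günther".encode("UTF-8"); u = b.decode("UTF-8")
- Decodes bytes into a Unicode string via the decode() method.
v = b"Hello " + "G\u00fcnther"
- Raises TypeError: bytes cannot be concatenated with str.
v = b"Hello".decode("ASCII") + "G\u00fcnther"
- Now it works.
f = open("File.txt", encoding="UTF-8"); lines = f.readlines(); f.close()
- Opens a file for reading with a specific encoding and reads from it. If the encoding is not specified, the encoding given by locale.getpreferredencoding() is used.
f = open("File.txt", "w", encoding="UTF-8"); f.write("Hello G\u00fcnther"); f.close()
- Writes to a file in the specified encoding.
f = open("File.txt", encoding="UTF-8-sig"); lines = f.readlines(); f.close()
- The -sig encoding means that any leading byte order mark (BOM) is automatically stripped.
f = tokenize.open("File.txt"); lines = f.readlines(); f.close()
- Automatically detects the encoding based on an encoding marker present in the file, such as a BOM, and strips the marker; requires import tokenize.
f = open("File.txt", "w", encoding="UTF-8-sig"); f.write("Hello G\u00fcnther"); f.close()
- Writes to a file in UTF-8, writing a BOM at the beginning.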
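
Several of the Python 3 items above can be combined into one runnable sketch. It only uses calls already listed (ord, chr, unicodedata.name, encode, decode); the Latin-1 comparison at the end is an added illustration of how the byte length depends on the encoding.

```python
import unicodedata

text = "Hello G\u00fcnther"

print(ord("\u00fc"))                   # 252
print(unicodedata.name(chr(252)))      # LATIN SMALL LETTER U WITH DIAERESIS

data = text.encode("UTF-8")            # bytes; the ü takes two bytes here
print(len(text), len(data))            # 13 14
print(data.decode("UTF-8") == text)    # True

# Latin-1 stores the same character in a single byte.
print(len(text.encode("latin-1")))     # 13
```

The length of a str counts code points, while the length of the encoded bytes depends on the encoding chosen.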

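Examples for Python 2.x:

v = u"Hello G\u00fcnther"
- Uses \u00fc to specify a 16-bit Unicode code point.
v = u"Hello G\U000000fcnther"
- Uses \U000000fc to specify a 32-bit Unicode code point, with the U capitalized.
v = u"Hello G\N{LATIN SMALL LETTER U WITH DIAERESIS}nther"
- Uses \N followed by the Unicode character name to specify a Unicode code point.
v = u"Hello G\N{latin small letter u with diaeresis}nther"
- The character name can be in lowercase.
unicodedata.name(unichr(252))
- Gets the Unicode name of a given Unicode character (here ü); requires import unicodedata.
v = "Hello G" + unichr(252) + "nther"
- unichr() takes a Unicode code point and returns a string containing that one Unicode character.
c = ord(u"ü")
- Yields the code point number.
b = u"Hello Günther".encode("UTF-8")
- Creates a byte sequence (str) from a Unicode string. type(b) is str.
b = u"Hello Günther".encode("UTF-8"); u = b.decode("UTF-8")
- Decodes bytes (type str) into a Unicode string via the decode() method.
v = "Hello" + u"Hello G\u00fcnther"
- Concatenates a str (bytes) and a Unicode string without an error.
f = codecs.open("File.txt", encoding="UTF-8"); lines = f.readlines(); f.close()
- Opens a file for reading with a specific encoding and reads from it; requires import codecs. If the encoding is not specified, codecs.open does no decoding and returns an ordinary file object, so reads yield plain byte strings (str).
f = codecs.open("File.txt", "w", encoding="UTF-8"); f.write(u"Hello G\u00fcnther"); f.close()
- Writes to a file in the specified encoding.
- Unlike the Python 3 variant, if told to write a newline via \n, this does not write the operating-system-specific newline but a literal \n; this makes a difference on Windows, for example.
- To ensure text-mode-like operation, os.linesep can be written instead.
f = codecs.open("File.txt", encoding="UTF-8-sig"); lines = f.readlines(); f.close()
- The -sig encoding means that any leading byte order mark (BOM) is automatically stripped.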

Links:

Unicode HOWTO for Python 3, docs.python.org
Unicode HOWTO for Python 2, docs.python.org
Processing Text Files in Python 3, curiousefficiency.org
PEP 263 – Defining Python Source Code Encodings, python.org
unicodedata: Unicode Database, in the Python Library Reference, docs.python.org
Get a list of all the encodings Python can encode to, stackoverflow.com

External links

"String Methods" chapter -- python.org
Python documentation of the "string" module -- python.org
